Attention Based Detection and Recognition of Hand Postures Against Complex Backgrounds
Pisharady, P.K., Vadakkepat, P. & Loh, A.P. Int J Comput Vis (2013) 101: 403. doi:10.1007/s11263-012-0560-5
A system for the detection, segmentation and recognition of multi-class hand postures against complex natural backgrounds is presented. Visual attention, the cognitive process of selectively concentrating on a region of interest in the visual field, helps humans recognize objects in cluttered natural scenes. The proposed system utilizes a Bayesian model of visual attention to generate a saliency map, and to detect and identify the hand region. Feature based visual attention is implemented using a combination of high level (shape, texture) and low level (color) image features. The shape and texture features are extracted from a skin similarity map, using a computational model of the ventral stream of the visual cortex. The skin similarity map, which represents the similarity of each pixel to human skin color in the HSI color space, enhances the edges and shapes within skin colored regions. The color features used are the discretized chrominance components in the HSI and YCbCr color spaces, and the similarity to skin map. The hand postures are classified using the shape and texture features, with a support vector machine classifier. A new 10 class complex background hand posture dataset, the NUS hand posture dataset-II, is developed for testing the proposed algorithm (40 subjects, different ethnicities, various hand sizes, 2750 hand postures and 2000 background images). The algorithm is tested for hand detection and hand posture recognition using 10-fold cross-validation. The experimental results show that the algorithm has a person independent performance, and is reliable against variations in hand size and complex backgrounds. The algorithm provided a recognition rate of 94.36 %. A comparison of the proposed algorithm with other existing methods demonstrates its better performance.
Keywords: Computer vision · Pattern recognition · Hand gesture recognition · Complex backgrounds · Visual attention · Biologically inspired features
1 Introduction
Visual interaction is a natural, easy, and effective way of interaction, which requires no physical contact and is unaffected by acoustic noise. Hand gesture recognition, an important area of research in visual pattern analysis, has wide applications in sign language recognition, human-computer interaction (HCI), human-robot interaction (HRI), and virtual reality (VR). The presence of complex and cluttered backgrounds makes the recognition of hand gestures difficult.
Mainstream computer vision research has always been challenged by human vision, and the mechanism of the human visual system is yet to be well understood. The human visual system rapidly and effortlessly recognizes a large number of diverse objects in cluttered, natural scenes and identifies specific patterns. This capability of the human visual system inspired the development of computational models of biological vision. Intermediate and higher visual processes in primates select a subset of the available sensory information before further processing (Tsotsos et al. 1995), in order to reduce the complexity of scene analysis. This selection is implemented in the form of a focus of attention (Niebur and Koch 1998). Recent developments in the use of neurobiological models in computer vision try to bridge the gap between neuroscience, computer vision and pattern recognition (Poggio and Bizzi 2004; Poggio and Riesenhuber 1999; Serre et al. 2007; Itti et al. 1998; Itti and Koch 2001; Siagian and Itti 2007; Rao 2005; Chikkerur et al. 2010).
1.1 Hand Gesture Recognition
There exist several reviews on hand modeling, pose estimation, and hand gesture recognition (Erol et al. 2007; Ong and Ranganath 2005; Pavlovic et al. 1997; Wu and Huang 1999). The tools used for vision based hand gesture recognition can be classified into three categories (Fig. 1(b)). They are (1) Hidden Markov Model (HMM) based methods (Lee and Kim 1999; Yoon et al. 2001; Ramamoorthy et al. 2003; Just and Marcel 2009; Chen et al. 2003; Yang et al. 2007), (2) Neural network (NN) and learning based methods (Pramod Kumar et al. 2010a, 2010c, 2011; Alon et al. 2009; Su 2000; Licsar and Sziranyi 2005; Ge et al. 2008; Yang et al. 2002; Yang and Ahuja 1998; Zhao et al. 1998; Teng et al. 2005; Hasanuzzamana et al. 2007; Eng-Jon and Bowden 2004), and (3) Other methods (Graph algorithm based methods (Pramod Kumar et al. 2010b; Triesch and Malsburg 1996a, 1998, 2001), 3D model based methods (Athitsos and Sclaroff 2003; Ueda et al. 2003; Yin and Xie 2003; Lee and Kunii 1995), Statistical and syntactic methods (Chen et al. 2008; Wang and Tung 2008), and Eigen space based methods (Patwardhan and Roy 2007; Daniel et al. 2010)).
Eng-Jon et al. proposed an unsupervised algorithm for hand detection and static hand shape recognition (Eng-Jon and Bowden 2004). A tree of boosted (hand) detectors is reported with two layers; the top layer for hand detection and the branches in the bottom layer for hand shape classification. A shape context based distance matrix is utilized for clustering similar looking hand shapes in order to construct the tree structure. The algorithm provided good detection and recognition accuracy.
A systematic approach to building a hand appearance detector is presented in Kolsch and Turk (2004). The paper proposes a view specific hand posture detection algorithm based on the object recognition method proposed by Viola and Jones. A frequency analysis based method is utilized for instantaneous estimation of class separability, without the need for training. The algorithm is applied for the detection of six hand postures. Wu et al. proposed an algorithm for view independent hand posture recognition (Wu and Huang 2000). The suitability of a number of classification methods is investigated to make the algorithm view independent. The work combined supervised and unsupervised learning paradigms to propose a learning approach called Discriminant-EM (D-EM). The D-EM uses an unlabeled dataset to help supervised learning to reduce the number of labeled training samples. The image datasets utilized to test the above algorithms have simple (uniform) and relatively similar backgrounds, and these works did not address the issues with complex backgrounds (which contain clutter and other distracting objects).
An algorithm for the recognition of hand postures in complex natural environments is useful for real-world applications of interactive systems. Triesch and Malsburg (1996a, 2001) addressed the complex background problem in hand posture recognition using elastic graph matching (EGM). A bunch graph method (Triesch and Malsburg 1996a) is utilized to improve the performance in complex environments. In graph algorithms, the entire image is scanned to detect the object, which increases the computational burden. In addition, in a bunch graph each node is represented using a bunch of node features, which further decreases the processing speed. Athitsos et al. proposed another algorithm to estimate hand pose from cluttered images (Athitsos and Sclaroff 2003). The algorithm segments the image using skin color, and it needs fairly accurate estimates of the center and the size of the hand. The above algorithms cannot deal with complex backgrounds which contain skin colored regions, or with large variations in hand size.
1.2 The Proposed Approach
This paper focuses on the detection and recognition of hand postures in cluttered natural environments. The proposed algorithm utilizes a biologically inspired approach, based on the computational model of the visual cortex (Serre et al. 2007) and the Bayesian model of visual attention (Chikkerur et al. 2010). The Bayesian theory of attention is utilized to detect and identify the hand region in complex background images. The 'where' information is extracted using feature based visual attention. The features utilized are based on shape, texture and color.
The shape and texture based features are extracted from a map which represents the similarity of pixels to human skin color, using the computational model of the ventral stream of the visual cortex (Serre et al. 2007). The color features are extracted by discretization of the chrominance components in the HSI and YCbCr color spaces, and the similarity to skin map. A saliency map is created by calculating the posterior probabilities of pixel locations being part of a hand region, using the Bayesian model of attention. The presence of a hand is detected by thresholding the saliency map, and the hand region is extracted by segmenting the input image using the thresholded saliency map. The hand postures are recognized using the shape and texture based features (of the hand region), with a Support Vector Machines (SVM) classifier.
The proposed algorithm is reliable against complex and skin colored backgrounds, as the segmentation of the hand region is done using the attention mechanism, which utilizes a combination of color, shape, and texture features. The experimental results show that the algorithm has a person independent performance, and is robust to variations in the size and position of the hand in the image.
The number of hand posture databases available to the research community is limited (Triesch and Malsburg 1996b). This paper contributes a new 10 class complex background hand posture dataset, namely the NUS hand posture dataset-II. The postures are obtained from 40 subjects of different ethnicities, with various hand sizes. The images have a variety of indoor as well as outdoor complex backgrounds. The hand postures have wide intra-class variations in hand size and appearance. The database also contains a set of background images, which is used to test the hand detection capability of the proposed algorithm. The recognition algorithm is tested with the new dataset using a 10-fold cross-validation strategy. It provided an accuracy of 94.36 %.
Section 2 of this paper explains the shape and texture based feature extraction system, and the Bayesian model of attention. The proposed system is presented in Sect. 3. Section 4 discusses the experimental results. The paper is concluded with the remarks provided in Sect. 5.
2 The Feature Extraction System and the Model of Attention
The biologically inspired, shape and texture based feature extraction system, and the Bayesian model of visual attention (Chikkerur et al. 2010) are briefed in this section. The shape and texture based features are extracted using a cortex like mechanism (Serre et al. 2007), and the visual attention is implemented using these features and a set of color features.
2.1 Biologically Inspired Features for Visual Pattern Recognition
The first stage of the feature extraction system uses 2D Gabor filters of the form

G(x, y) = exp(-(x0^2 + γ^2 y0^2) / (2σ^2)) cos(2π x0 / λ),

with x0 = x cos θ + y sin θ and y0 = -x sin θ + y cos θ, where
- γ is the spatial aspect ratio of the Gaussian function,
- σ is the standard deviation of the Gaussian function,
- λ is the wavelength of the sinusoidal term, and
- θ is the orientation of the Gaussian from the x-axis.
Gabor wavelet based features have good discriminative power among various image textures and shapes. Riesenhuber and Poggio extended the use of 2D Gabor wavelets to propose a hierarchical model of the ventral visual object-processing stream in the visual cortex (Poggio and Riesenhuber 1999). Serre et al. implemented a computational model of the system, and utilized it for robust object recognition (Serre et al. 2005, 2007). The features extracted by this model are known as the C1 and C2 standard model features (SMFs). These features are scale and position tolerant, and the number of extracted features is independent of the input image size. The C2 SMFs were later used for handwriting recognition (Van der Zant et al. 2008) and face recognition (Lai and Wang 2008; Pramod Kumar et al. 2010a). The proposed algorithm utilizes the C2 features for multi-class hand posture recognition.
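As an illustration of the S1 stage underlying these features, the sketch below builds a small bank of 2D Gabor filters at the four orientations used later in the paper. The filter size, wavelength, standard deviation, and aspect ratio here are illustrative values, not the exact parameters of Serre et al. (2007):

```python
import numpy as np

def gabor_filter(size, wavelength, theta, sigma, gamma=0.3):
    """One S1-stage Gabor filter (illustrative parameters, not the
    exact values from Serre et al. 2007)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x0 = x * np.cos(theta) + y * np.sin(theta)
    y0 = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(x0**2 + (gamma * y0)**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * x0 / wavelength)
    g -= g.mean()                        # zero mean
    g /= np.sqrt((g**2).sum()) + 1e-12   # unit energy
    return g

# Bank of filters at the four orientations used in the paper.
orientations = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
bank = [gabor_filter(11, wavelength=5.6, theta=t, sigma=4.5)
        for t in orientations]
```

Convolving an image with each filter in the bank yields the orientation-selective S1 responses that feed the C1/C2 stages.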
2.1.1 Extraction of Shape and Texture Features
Different layers in the shape and texture feature extraction system:
- S1: simple cells in the primary visual cortex (V1)
- C1: complex cells in the primary visual cortex (V1)
- S2: radial basis functions
- C2: visual area V4 & posterior inferotemporal cortex
The scale invariant C2 responses are computed by taking a global maximum over all the scales for each S2 type. Each C2 response matrix corresponds to a particular prototype patch with a specific patch size. Increasing the number of extracted features improves the classification accuracy. However, the computational burden (for feature extraction as well as classification) increases with the number of features. In the present work, 15 prototype patches with 4 patch sizes are extracted from each of the 10 classes. The total number of patches is thus 600 (15 patches × 4 sizes × 10 classes), leading to 600 shape and texture based features.
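The S2/C2 computation above can be sketched as follows: an S2 map measures the radial-basis-function match of a stored prototype patch at every position of a C1 map, and the C2 feature is the global maximum of that match over positions and scales. This is a simplified, single-orientation sketch (the full model matches patches across all four orientations), and `beta` is an assumed tuning constant:

```python
import numpy as np

def s2_response(c1_map, patch, beta=1.0):
    """RBF match of one prototype patch at every position of a C1 map
    (single orientation shown for brevity)."""
    ph, pw = patch.shape
    H, W = c1_map.shape
    out = np.empty((H - ph + 1, W - pw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            d2 = ((c1_map[i:i + ph, j:j + pw] - patch) ** 2).sum()
            out[i, j] = np.exp(-beta * d2)   # 1.0 for a perfect match
    return out

def c2_feature(c1_maps_per_scale, patch):
    """C2 = global maximum of the S2 responses over positions and scales."""
    return max(s2_response(m, patch).max() for m in c1_maps_per_scale)
```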
2.1.2 Modifications in the Shape and Texture Feature Extraction System
- The shape and texture features are extracted from a similarity to skin map (Sect. 3.1.2), not from the grey scale image.
- The prototype patches are extracted from different class images. The centers of the patches are placed at geometrically significant and textured positions of the hand postures.
- The output feature component is a C2 response matrix (instead of a real number), which retains the hand location information.
- The parameters of the shape-texture feature extraction system are those reported in Serre et al. (2007), except for the number (600) and position of the prototype patches.
2.2 Feature Based Visual Attention
Visual attention is the part of the inference process that addresses the visual recognition problem of what is where (Chikkerur et al. 2010). Visual attention helps to infer the identity and position of objects in a visual scene. Attention reduces the size of the search space and the computational complexity of recognition.
Visual attention is directed selectively to objects in a scene using both bottom-up, image-based saliency cues and top-down, task-dependent cues. The top-down, task based attention is more deliberate and powerful (Itti and Koch 2001), and depends on the features of the object. The proposed pattern recognition system is task based and utilizes a top-down approach. The attention is focussed on the region of interest using the object features.
Visual perception is interpreted as a Bayesian inference process whereby priors (top-down) help to disambiguate noisy sensory input signals (bottom-up) (Dayan et al. 1995). Visual recognition corresponds to estimating posterior probabilities of visual features for specific object categories, and their locations in an image. The posterior probabilities of location variables serve as a saliency map. A Bayesian model of spatial attention is proposed in Rao (2005). Chikkerur et al. (2010) modified the model to include feature based attention, in addition to the spatial attention. The model imitates the interactions between the parietal and ventral streams of visual cortex, using a Bayesian network (Bayes net).
3 Hand Posture Detection, Segmentation and Recognition
The proposed algorithm addresses the complex image background issue by utilizing a combination of different features. The hand region in image is identified by calculating the joint posterior probability of the feature combination. Bayesian inference is utilized to create a saliency map, which helps in the segmentation of the hand region.
In Bayesian inference, the likelihood of a particular state of the world being true is calculated based on the present input and prior knowledge about the world. The significance of an input is decided based on prior experience. In images with complex backgrounds, shape patterns emerging from the background affect the pattern recognition task negatively. To recognize a pattern in a complex background image, the features corresponding to the foreground object are given higher weight than those corresponding to the background.
In this work, the shape and texture features of the hand postures, and the color features of human skin, are utilized to focus attention on the hand region. The posterior probability of a pixel location being part of a hand region is calculated by assigning higher priors (which are learned from the training images) to the features corresponding to the hand area. The hand postures are detected and segmented by thresholding the posterior probabilities. Classification of hand postures is done using the shape and texture features of the hand region, with an SVM classifier.
3.1 Image Pre-processing
The image pre-processing includes color space conversions and the generation of similarity to skin map.
3.1.1 Color Space Conversions—RGB to HSI and YCbCr
The input image in RGB space is converted to the HSI and YCbCr color spaces. The conversion between RGB and HSI is nonlinear, whereas that between RGB and YCbCr is linear. The chrominance components in these color spaces (H, S, Cb, and Cr) are utilized to detect the hand region in images. The hue value H refers to the color type (such as red, blue, or yellow), and the saturation value S refers to the vibrancy or purity of the color. The values of Cb and Cr represent the blue component (B−Y) and the red component (R−Y) (Chaves-González et al. 2010) respectively (Y stands for the luminance value). The values of H and S are in the range [0, 1], and those of Cb and Cr are within [16, 240].
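The chrominance extraction described above can be sketched per pixel as follows. The HSI conversion uses the standard arccos formulation, and the YCbCr conversion uses the ITU-R BT.601 coefficients, which match the stated [16, 240] range; the paper does not list its exact conversion constants, so these are standard assumptions:

```python
import numpy as np

def rgb_to_hs(r, g, b):
    """Hue and saturation of the HSI model, scaled to [0, 1].
    r, g, b are floats in [0, 1]."""
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + 1e-12
    theta = np.arccos(np.clip(num / den, -1.0, 1.0))  # radians in [0, pi]
    h = theta if b <= g else 2 * np.pi - theta
    s = 1.0 - 3.0 * min(r, g, b) / (r + g + b + 1e-12)
    return h / (2 * np.pi), s

def rgb_to_cbcr(r, g, b):
    """Cb and Cr of the YCbCr model (ITU-R BT.601, range ~[16, 240])."""
    cb = 128 - 37.797 * r - 74.203 * g + 112.0 * b
    cr = 128 + 112.0 * r - 93.786 * g - 18.214 * b
    return cb, cr
```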
3.1.2 Similarity to Skin Map
The similarity of each pixel to human skin color is computed from its hue and saturation values, where
- s denotes the similarity of the pixel to skin color,
- H & S are the hue and saturation values of the pixel,
- Hs0 & Ss0 are the average hue and saturation values of the skin colors,
- Hsmax & Hsmin are the maximum and minimum of the hue values of the skin colors, and
- Ssmax & Ssmin are the maximum and minimum of the saturation values of the skin colors.

The skin color parameters are the hue span (Hsmax−Hsmin) and the saturation span (Ssmax−Ssmin).
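Since the paper's exact similarity formula is not reproduced here, the sketch below shows one plausible form consistent with the listed parameters: a Gaussian-style measure of how far a pixel's hue and saturation deviate from the average skin values, normalized by the hue and saturation spans. The functional form is an assumption for illustration:

```python
import numpy as np

def skin_similarity(h, s, h0, s0, h_span, s_span):
    """Similarity of a pixel (h, s) to skin color -- an illustrative
    Gaussian-style measure built from the paper's parameters (average
    skin hue/saturation and their spans); the paper's exact formula
    may differ."""
    dh = (h - h0) / (h_span + 1e-12)
    ds = (s - s0) / (s_span + 1e-12)
    return float(np.exp(-(dh ** 2 + ds ** 2)))
```

Applying this measure to every pixel yields the similarity to skin map from which the shape and texture features are extracted.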
3.2 Extraction of Color, Shape and Texture Features
The proposed algorithm utilizes a combination of low and high level image features to develop the visual attention model. A saliency map is generated and the attention is focussed on the hand region using color features (low level) and shape-texture features (high level). The postures are classified using the high level features.
3.2.1 Color based Features
Jones and Rehg proposed a statistical color model for the detection of skin regions in images (Jones and Rehg 1999). They suggested that skin color can be a powerful cue for detecting people in unconstrained imagery. However, they performed skin color detection using only the RGB color space. Analyses and comparisons of different color spaces for skin segmentation are provided in Phung et al. (2005) and Chaves-González et al. (2010). Chaves-González et al. (2010) rate the HSI space as the best choice for skin segmentation. The Cb and Cr components in the YCbCr space provided better results (compared to the H–S and R–G components) in the experiments conducted in Phung et al. (2005). A combination of the HSI and YCbCr color spaces can improve the segmentation of skin colored regions in images. The proposed algorithm utilizes a combination of the chrominance components (H, S, Cb, and Cr) of these color spaces as the color features for hand region segmentation.
Average H, S, Cb, and Cr values of the six skin samples in Fig. 6
Color based features. (a) Chrominance component features: Discretized chrominance color components (H, S, Cb, Cr) in HSI and YCbCr color spaces. Each component is discretized into 10 subcomponents as shown. (b) Skin similarity features: Discretized skin color similarity (3) values
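The discretization described in the caption above can be sketched as a binning of each chrominance component into 10 subcomponents, encoded one-hot. The bin boundaries below assume uniform spacing over each component's range, which the paper does not state explicitly:

```python
import numpy as np

def discretize(value, lo, hi, n_bins=10):
    """Map a chrominance value in [lo, hi] to one of n_bins
    subcomponents, returned as a one-hot feature vector."""
    idx = int((value - lo) / (hi - lo) * n_bins)
    idx = min(max(idx, 0), n_bins - 1)   # clamp the upper edge into the last bin
    feat = np.zeros(n_bins)
    feat[idx] = 1.0
    return feat

# H and S lie in [0, 1]; Cb and Cr lie in [16, 240].
h_feat = discretize(0.37, 0.0, 1.0)
cb_feat = discretize(120.0, 16.0, 240.0)
```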
3.2.2 Shape and Texture based Features
The shape and texture descriptors (Sect. 2.1.1) are extracted from the similarity to skin map. In the proposed algorithm, 15 prototype patches with 4 patch sizes are extracted from each of the 10 classes. The 15 patches are from 15 different images of the same class (experiments were also run using patches extracted from a single image per class, but better accuracy is achieved in the former case as it provides better invariance). Each extracted patch contains the four orientations (0∘, 45∘, 90∘, 135∘). The total number of patches is 600, leading to 600 shape and texture based features.
3.3 Feature Based Visual Attention and Saliency Map Generation
Description of the conditional probabilities (priors, evidences, and the posterior probability), and how each is obtained:
- Top-down shape and texture feature priors (the probability of the shape and texture features being present, given the presence of hand): obtained by counting the frequency of occurrence of features within the training images (maximum one count per image).
- Top-down color feature priors (the probability of the color features being present, given the presence of hand): obtained by counting the frequency of occurrence of features (Table 4) within the hand region in the training images (400 images, 1 image per class per subject, are considered).
- Bottom-up evidence for the shape and texture features (the likelihood that a particular location is active for the shape and texture features): obtained by the shape and texture feature extraction of test images (Sect. 2.1.1).
- Bottom-up evidence for the color features (the likelihood that a particular location is active for the color features): obtained by the color feature extraction of test images (Sect. 3.2.1).
- Posterior probabilities of location, which act as the saliency map: computed by the belief propagation algorithm (Pearl 1988).
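The inference summarized above can be approximated in a few lines: each feature channel contributes a bottom-up evidence map, weighted by its learned top-down prior, and the normalized combination serves as the saliency map. This is a naive-Bayes simplification of the paper's Bayes-net belief propagation, for illustration only:

```python
import numpy as np

def saliency_map(evidence_maps, priors):
    """Posterior probability of each location containing the hand,
    combining per-feature evidence maps with learned feature priors.
    A naive-Bayes simplification of the full Bayes-net inference
    (the paper uses belief propagation, Pearl 1988)."""
    h, w = evidence_maps[0].shape
    log_post = np.zeros((h, w))
    for ev, prior in zip(evidence_maps, priors):
        log_post += prior * np.log(ev + 1e-12)  # weight evidence by its prior
    post = np.exp(log_post - log_post.max())    # stabilize before exponentiating
    return post / post.sum()                    # normalize to a distribution
```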
3.4 Hand Segmentation and Classification
The hand region is segmented using the generated saliency map. For the segmentation, a bounding box is created around the most salient (top 30 %) locations in the image. The shape and texture based features of the hand region are extracted next. The same prototype patches selected earlier (Sect. 2.1.1) are utilized for the feature extraction. The C2 SMFs are extracted by taking the maximum over positions of the C2 response matrices, similar to that done in Serre et al. (2007). That is, the value of the best match between a stored prototype and the input image is kept and the rest are discarded. An SVM classifier with a linear kernel is utilized for the classification.
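The thresholding step can be sketched as follows: keep the top 30 % of saliency values and take the spatial extent of the surviving pixels as the bounding box. The paper does not specify how ties or disconnected salient blobs are handled, so this is a minimal interpretation:

```python
import numpy as np

def hand_bounding_box(saliency, top_fraction=0.30):
    """Bounding box around the most salient locations: keep the top
    30 % of saliency values, then take the row/column extent of the
    surviving pixels."""
    thresh = np.quantile(saliency, 1.0 - top_fraction)
    ys, xs = np.nonzero(saliency >= thresh)
    return ys.min(), ys.max(), xs.min(), xs.max()
```

The cropped region is then fed to the shape-texture feature extraction and the linear SVM.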
4 Experimental Results and Discussion
The proposed algorithm is tested using a 10 class complex background hand posture dataset.
4.1 The Dataset and the Experimental Setup
Different subsets in NUS hand posture dataset-II:
- Subset A: 2000 hand posture color images (40 subjects, 10 classes, 5 images per class per subject, image size: 160×120) with complex backgrounds.
- Subset B: 750 hand posture color images (15 subjects, 10 classes, 5 images per class per subject, image size: 320×240) with noises such as the body/face of the posturer or the presence of a group of humans in the background.
- Subset C: 2000 background images without hand postures (used for testing the hand detection capability).
The proposed algorithm is tested in two aspects: hand posture detection and recognition. The hand posture detection capability is tested using data subsets A and C. The hand posture recognition capability is tested using data subsets A and B.
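The 10-fold cross-validation splits the dataset by subject (each fold holds the 200 images of 4 of the 40 subjects, so test subjects are unseen during training), which can be sketched as:

```python
import numpy as np

def subject_folds(n_subjects=40, per_fold=4, seed=0):
    """Split subjects into 10 folds of 4 subjects each, so each fold's
    200 images (4 subjects x 10 classes x 5 images) come from subjects
    unseen in the other folds. The random seed is an assumption."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_subjects)
    return [order[i:i + per_fold] for i in range(0, n_subjects, per_fold)]

folds = subject_folds()
```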
4.2 Hand Posture Detection
4.3 Hand Region Segmentation
4.4 Hand Posture Recognition
Hand posture recognition accuracies: data subset A
When attention is implemented using shape and texture features, the algorithm provided a good improvement in accuracy (87.72 %) compared to that achieved (75.71 %) by the algorithm proposed in Serre et al. (2007). Color feature attention alone provided lower accuracy (81.75 %) than shape and texture attention. The lower accuracy with color feature attention is due to skin colored pixels in the background. In addition, the color features are extracted using point processing, whereas the shape-texture features are extracted using neighborhood processing, which is another reason for the lower accuracy with color feature attention. However, when the color features are combined with the shape and texture features, the combination resulted in the best accuracy (94.36 %).
Table 7 also shows a comparison of the accuracy of the proposed algorithm with that of the EGM algorithm (Triesch and Malsburg 2001). The EGM algorithm provided only 69.80 % recognition accuracy, in spite of the high computational complexity of graph matching. The EGM algorithm performs poorly when the complex background of the image contains skin colored objects; a majority of the samples misclassified by the EGM algorithm are images with skin colored complex backgrounds. The proposed algorithm is robust to skin colored backgrounds as it utilizes shape and texture patterns along with color features. The shape-texture selectivity of the feature extraction system is improved as the prototype patches are extracted from geometrically significant and textured positions of the hand postures.
4.5 Performance with Human Skin and Body Parts as Noises
4.6 Comparison of the Recognition Time
Comparison of the recognition time
Elastic graph matching (EGM) (Triesch and Malsburg 2001)
An attention based system is proposed for the recognition of hand postures against complex backgrounds. A combination of high and low level image features is utilized to detect the hand, and to focus the attention on the hand region. A saliency map is generated using Bayesian inference. The postures are classified using the shape and texture based features of the hand region with an SVM classifier. The proposed algorithm is tested with a 10 class complex background dataset, the NUS hand posture dataset-II.
The proposed algorithm has a person independent performance. It provided good hand posture detection and recognition accuracy in spite of variations in hand sizes. The algorithm provided reliable performance against cluttered natural environments including skin colored complex backgrounds. The proposed algorithm is tested with color based attention alone, with shape and texture based attention alone, and with the combination of color, shape, and texture attention. On comparison, the algorithm provided the best recognition accuracy when the combination of color, shape, and texture attention is utilized.
The proposed feature attention based algorithm can be extended to the recognition of dynamic gestures and human body postures in cluttered natural environments. The utilization of color features may not be effective in the case of human body postures, due to clothing on the body. However, a body posture provides more reliable texture features compared to a hand posture. Another possible future work is the modification of the algorithm to improve its processing speed and reduce its computational complexity.
Graph matching is considered to be one of the most complex algorithms in vision based object recognition (Bienenstock and Malsburg 1987). The complexity is due to its combinatorial nature.
The dataset is available for free download: http://www.ece.nus.edu.sg/stfpage/elepv/NUS-HandSet/.
V1, V2, V3, V4, and V5 are the visual areas in the visual cortex. V1 is the primary visual cortex. V2 to V5 are the secondary visual areas, and are collectively termed as the extrastriate visual cortex.
The number of prototype patches and orientations are tunable parameters in the system. Computational complexity increases with these parameters. The reported values provided optimal results (considering the accuracy and computational complexity).
The luminance color components are not utilized as these components are sensitive to skin color as well as lighting.
The dataset consists of hand postures by 40 subjects, with different ethnic origins.
400 images (1 image per class per subject) are considered. During the training phase the hand area is selected manually.
The dataset is available for academic research purposes: http://www.ece.nus.edu.sg/stfpage/elepv/NUS-HandSet/.
For cross validation the dataset is divided into 10 subsets each containing 200 images, the data from 4 subjects.
The authors would like to thank Ms. Ma Zin Thu Shein for taking part in the shooting of the NUS hand posture dataset-II. The authors also express their appreciation to all 40 subjects who volunteered for the development of the dataset.