Gradient local auto-correlation features for depth human action recognition

Human action classification is an active research topic in computer vision, with applications in video surveillance, human–computer interaction, and sign-language recognition. This paper presents an approach for the categorization of human actions in depth video. In the approach, enhanced motion and static history images are computed, and a set of 2D auto-correlation gradient feature vectors is extracted from them to describe an action. A Kernel-based Extreme Learning Machine is then applied to the extracted features to distinguish the diverse action types. The proposed approach is thoroughly assessed on three action datasets, namely MSRAction3D, DHA, and UTD-MHAD, achieving accuracies of 97.44%, 99.13%, and 88.37%, respectively. The experimental results and analysis demonstrate that the classification performance of the proposed method is considerable and surpasses state-of-the-art human action classification methods. Moreover, the complexity analysis shows that the method is suitable for real-time operation owing to its low computational cost.
The work proposes to process depth action videos through the 3D Motion Trail Model (3DMTM) to represent a video as a set of 2D motion and motionless images. It improves this action representation by encoding all the 2D motion and motionless images as binary-coded images with the help of the Local Binary Pattern (LBP) algorithm, and it evaluates the use of auto-correlation features extracted from the binary-coded versions of the 2D action representations instead of from the non-binary-coded versions of the earlier action representation.

Human actions refer to distinctive sorts of activities such as walking, jumping, and waving. However, the vivid variations in human body sizes, appearances, postures, motions, clothing, camera motions, viewing angles, and illumination make the action recognition task very challenging. Over the past few years, a large number of researchers have introduced action or activity recognition models using data sensors such as RGB video cameras [10], depth video cameras [2], and wearable sensors [11]. Among the two video data sources, action recognition research based on conventional RGB cameras (e.g. [12]) has achieved great progress in the last few decades. However, using RGB cameras for action recognition raises significant impediments such as lighting variations and cluttered backgrounds [13]. By contrast, depth cameras generate depth images, which are insensitive to lighting variations and make background subtraction and segmentation easier. In addition, body shape and structure characteristics as well as human skeleton information can be obtained from depth images. Many previous attempts at efficient recognition systems can be listed, such as DMM [14], HON4D [15], Super Normal Vector [16], and Skeletons Lie group [17]. However, these existing methods still face some crucial challenges, such as depth video processing, appropriate feature extraction, and reliable classification performance. Considering the aforementioned challenges, this study focuses on building an effective and efficient human action recognition framework for depth action video sequences. The main objective of this work is to enhance classification accuracy by proposing an efficient recognition framework that overcomes the above challenges more effectively. More specifically, an action video is represented through three 2D motion-segment images and three 2D static-segment images of the action.
In fact, the dynamic and motionless maps are derived by applying the 3DMTM [18] to a video. The obtained representations are then enhanced with the LBP [19] operator, which enriches the action representation by encoding the motion and motionless maps into binary patterns. Eventually, the outputs of the LBP are fed into GLAC [20] to generate the auto-correlation gradient vectors: three feature vectors for the motion-segment images and another three for the static-segment images. The first three vectors are concatenated to construct a motion-information-based GLAC vector; similarly, a second GLAC vector is obtained by concatenating the last three. To further boost the proposed method, these two single action representation vectors are concatenated to build the final action description. Finally, the action is recognized by passing this vector to a supervised learning algorithm, the Kernel-based Extreme Learning Machine (KELM) [21].
The main contributions of this paper are:
• We enhance the auto-correlation features for an optimal description of an action. To observe the significance of this feature augmentation, the action is also represented with ordinary auto-correlation features. Our action representation technique significantly addresses the intra-class variation and inter-class similarity problems.
• We report recognition results on three benchmark datasets, namely MSRAction3D [22], DHA [23], and UTD-MHAD [24], and compare them with state-of-the-art handcrafted as well as deep learning methods.
• We compare the recognition results based on the enhanced auto-correlation features with the results obtained using the plain auto-correlation features only. These comparisons are made on the same datasets to fairly evaluate the effectiveness of the enhanced auto-correlation features.
• Finally, we report the computational efficiency in terms of running time and computational requirements.
Based on three publicly available datasets (MSRAction3D [22], DHA [23], and UTD-MHAD [24]), the proposed method is compared extensively with handcrafted and deep learning methods. The computational efficiency assessment indicates that the proposed approach is feasible for real-time implementation. The working flow of the system is illustrated in Fig. 1. This paper is organized as follows: Sect. 2 reviews the related literature, Sect. 3 describes the research methodology, Sect. 4 presents the experimental results and discussion, and Sect. 5 concludes the work.

Related work
Feature extraction is a key step in computer vision research problems such as object localization, human gait recognition, face recognition, action recognition, and text recognition. As a result, researchers have paid close attention to extracting features effectively. For example, for object recognition, Ahmed et al. [25] introduced a saliency map on RGB-D indoor data, which has numerous applications such as vehicle monitoring, violence detection, and driverless driving. Hough voting and distinct features were used to measure the efficiency of that work. To extract human silhouettes from noisy backgrounds, Jalal et al. [26] applied an embedded HMM for activity classification, fusing spatiotemporal body joints and depth silhouettes to improve accuracy. In another work on online human action and activity recognition, Jalal et al. [27] performed multi-feature fusion combining skeleton joints with human shape features. For feature extraction in activity recognition, Tahir et al. [28] applied 1-D LBP and the 1-D Hadamard wavelet transform along with a Random Forest. On depth video sequences, Kamal et al. [29] utilized a modified HMM to fuse temporal joint features and spatial depth shape features. To recognize facial expressions, Rizwan et al. [30] implemented local transform features, using HOG and LBP for feature extraction. Skin-joint features obtained from skin color and self-organized maps were also used for activity recognition [31]. In another work, Kamal et al. [32] employed distance-parameter features and motion features. Yaacob et al. [33] introduced a discrete cosine transform, particularly for gait action recognition.
In developing vision-based handcrafted action recognition, researchers have also invested considerable effort in feature extraction for optimal action representation. Motion features of an action were extracted through simplified depth motion maps in DMM [14], DMM-CT-HOG [34], and DLE [35]. Texture features extracted by LBP were utilized in [36]. Recently, Dhiman et al. [37] introduced Zernike moments and the R-transform to create a powerful feature vector for abnormal action detection. A genetic-algorithm-based system was proposed by Chaaraoui et al. [38] to improve the efficiency of skeleton-joint-based recognition by optimizing the skeleton-joint subset. Vemulapalli et al. [17] represented human actions as curves that encode skeletal action sequences. Gao et al. [39] proposed a model to recognize 3D actions in which they constructed a difference motion history image for RGB and depth sequences, captured motion through multi-perspective projections, extracted the pyramid histogram of oriented gradients, and finally identified the action by combining multi-perspective and multi-modality discriminant and joint representations. In the work by Rahmani et al. [40], features obtained from depth images were combined with skeleton movements encoded by a difference histogram, and a Random Decision Forest (RDF) was applied to obtain discriminant features for action classification. On the other hand, Luo et al. [41] represented features through a sparse-coding-based temporal pyramid matching approach (ScTPM). They also proposed a technique for capturing spatio-temporal features from RGB videos called Center-Symmetric Motion Local Ternary Pattern (CS-Mltp). Finally, they explored feature-level and classifier-level fusion of these features to improve recognition accuracy.
Decisive pose features have also been used for human action detection by applying two further transformations, the Ridgelet and the Gabor Wavelet Transform [42]. Moreover, Wang et al. [43] studied ten Kinect-based methods for cross-view and cross-subject action recognition on six dissimilar datasets and concluded that skeleton-based recognition is superior to the other approaches in the cross-view setting.
Deep learning models usually learn features automatically from raw depth sequences, which are then used to compute high-level semantic representations. For example, 2D-CNN and 3D-CNN were employed by Yang and Yang for deep-learning-based depth action classification [44]. To improve the action representation beyond DMM, Wang et al. [45] proposed Weighted Hierarchical Depth Motion Maps (WHDMM), which were fed into three CNN streams to recognize actions. In another approach, before being passed to a CNN, the depth video was described by Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI), and Dynamic Depth Motion Normal Images (DDMNI) [46]. In [47], a novel notion in action classification was introduced by using RGB-domain features as depth-domain features via domain adaptation. Motion History Images (MHI) from RGB videos and DMMs of depth videos have been utilized together in a four-stream CNN architecture [48]. Using inertial sensor data and depth data, Ahmad et al. [11] expressed a multimodal M² fusion process with the help of a CNN and a multi-class SVM. Very recently, Dhiman et al. [49] merged shape and motion temporal dynamics by proposing a deep view-invariant human action system. To detect human gestures and 3D actions, Weng et al. [50] proposed a pose traversal convolution network which applies joint pattern features from the human body, representing gestures and actions as sequences of 3D poses. A self-supervised alignment method was used for unsupervised domain adaptation (UDA) [51] to recognize human actions. Busto et al. [52] expressed another concept for action recognition and image classification called open set domain adaptation, which works for both unsupervised and semi-supervised domain adaptation.

Proposed system
Our proposed framework consists of feature extraction, action representation, and action classification. In this section, we discuss these three parts in turn. Figure 2 shows the pipeline of the system.

Feature extraction
For each action video, three motion images and three static-information images are first computed by applying the 3DMTM [18] to the video. The 3DMTM yields the set {MHI_XOY, MHI_YOZ, MHI_XOZ} of motion images and the set {SHI_XOY, SHI_YOZ, SHI_XOZ} of static images by simultaneously stacking all the moving and stationary body parts (along the front, side, and top projection views) of an actor in a depth map sequence. The MHIs and SHIs are then converted to binary-coded form by the LBP [19]; these binary-coded versions are more enhanced than the original images. Figure 3 shows an MHI and Fig. 4 the corresponding BC-MHI. It is clear that the motion information of the action is improved in the BC-MHI.
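To make the binary coding step concrete, the following is a minimal sketch of the classic 8-neighbour LBP operator applied to a motion/static history image; it is a generic LBP implementation, not the exact variant or parameterization used in the paper.

```python
import numpy as np

def lbp8(image):
    """Basic 8-neighbour Local Binary Pattern.

    Each interior pixel is re-coded as an 8-bit number: one bit per
    neighbour, set when that neighbour is >= the centre pixel.
    Border pixels are left as 0 for simplicity.
    """
    img = np.asarray(image, dtype=np.float64)
    # Offsets of the 8 neighbours, clockwise from top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(img.shape, dtype=np.uint8)
    centre = img[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:img.shape[0] - 1 + dy,
                    1 + dx:img.shape[1] - 1 + dx]
        codes[1:-1, 1:-1] |= ((neigh >= centre).astype(np.uint8) << bit)
    return codes
```

Applying `lbp8` to an MHI or SHI yields the corresponding BC-MHI or BC-SHI used in the rest of the pipeline.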
The binary-coded motion images (BC-MHIs) are referred to as BC-MHI_XOY, BC-MHI_YOZ, and BC-MHI_XOZ on the three Euclidean planes, whereas the binary-coded static images on those planes are denoted BC-SHI_XOY, BC-SHI_YOZ, and BC-SHI_XOZ. The binary-coded images thus obtained are fed into the GLAC [20] descriptor to extract spatial and orientational auto-correlations for describing the action. This paper extracts the 0th-order and 1st-order auto-correlation features to describe an action. In fact, the auto-correlation features describe an action through the rich texture information of the images, which includes the image gradients and curvatures simultaneously. Overall, the auto-correlation features are more dominant than standard histogram-oriented features, which is why we adopt them in our approach. For an intuitive discussion of applying GLAC to the BC-MHIs/BC-SHIs, let I be a binary-coded motion/static image (i.e., a BC-MHI/BC-SHI). For each pixel of I, image gradient operators yield a gradient vector (∂I/∂x, ∂I/∂y). The magnitude and the orientation of the gradient vector are computed as

m = √((∂I/∂x)² + (∂I/∂y)²),    (1)
θ = arctan((∂I/∂y) / (∂I/∂x)).    (2)

The orientation θ can then be coded into D orientation bins by voting weights to the nearest bins, forming a sparse gradient orientation vector g ∈ ℝ^D.
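The magnitude, orientation, and soft binning described above can be sketched as follows; this is an illustrative implementation (the function name and bin count are assumptions, not from the paper), with the vote split linearly between the two nearest bins.

```python
import numpy as np

def orientation_vectors(image, n_bins=8):
    """Gradient magnitude m and sparse orientation vectors g (sketch).

    The orientation of each pixel's gradient is soft-voted into the two
    nearest of D = n_bins orientation bins, so each g(r) sums to 1.
    """
    img = np.asarray(image, dtype=np.float64)
    gy, gx = np.gradient(img)                  # dI/dy, dI/dx
    m = np.hypot(gx, gy)                       # gradient magnitude
    theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    # Continuous bin position; split the vote between neighbouring bins.
    pos = theta / (2 * np.pi) * n_bins
    lo = np.floor(pos).astype(int) % n_bins
    hi = (lo + 1) % n_bins
    w_hi = pos - np.floor(pos)
    g = np.zeros(img.shape + (n_bins,))
    rows, cols = np.indices(img.shape)
    g[rows, cols, lo] = 1.0 - w_hi
    g[rows, cols, hi] += w_hi
    return m, g
```

Each per-pixel vector g(r) has at most two nonzero entries, which is the sparseness that keeps the later auto-correlation computation cheap.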
Through the gradient orientation vector g and the gradient magnitude m, the Kth-order auto-correlation function of local gradients can be written as

R_K(b_1, …, b_K) = ∫ w[m(r), m(r + b_1), …, m(r + b_K)] g(r) ⊗ g(r + b_1) ⊗ ⋯ ⊗ g(r + b_K) dr,    (3)

where b_1, b_2, …, b_K are shifting vectors from the position vector r (which indicates the position of each pixel in the image). Following the suggestion in [20], K ∈ {0, 1}, the components of b_1 take values in {±Δr, 0}, and w(·) ≡ min(·) are considered in this paper. Here Δr is the displacement interval in both the horizontal and vertical directions; the interval is the same in both directions due to the isotropic property of the image. From Eq. (3), for K ∈ {0, 1}, the 0th-order (F_0) and 1st-order (F_1) GLAC features are

F_0 = Σ_r m(r) g(r),    (4)
F_1(b_1) = Σ_r min[m(r), m(r + b_1)] g(r) g(r + b_1)^T.    (5)

A single mask pattern can be used for Eq. (4), and there are four independent mask patterns for Eq. (5) for computing the auto-correlations. The mask/spatial auto-correlation patterns of (r, r + b_1) are depicted in Fig. 5. Since there is a single mask pattern for F_0 and four mask patterns for F_1, the dimensionality of the GLAC features (F_0 and F_1) is D + 4D². Although this dimensionality is high, the computational cost is low due to the sparseness of g. It is worth noting that the computational cost is invariant to the number of bins D, since the sparseness of g does not depend on D. Figure 7 shows an example of 0th- and 1st-order GLAC features with 8 orientation bins (the bins are shown in Fig. 6). Based on texture features, an action with motion images can be described as a vector EAMF = [EAMF_XOY, EAMF_YOZ, EAMF_XOZ], where EAMF_XOY, EAMF_YOZ, and EAMF_XOZ are obtained by passing the set of binary-coded motion-information images to the 2D GLAC. To represent the static-image action based on texture features, the vector EASF = [EASF_XOY, EASF_YOZ, EASF_XOZ] is obtained by linking the enhanced auto-correlation feature vectors extracted on the multi-view static images.
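The 0th- and 1st-order features of Eqs. (4) and (5) can be sketched as below. This is an illustrative implementation under simplifying assumptions: hard orientation binning is used in place of soft voting, `np.gradient` stands in for the paper's gradient operators, and the four shift patterns for b_1 are the standard GLAC choices (right, down, down-right, down-left) with w(·) = min(·).

```python
import numpy as np

def glac_features(image, n_bins=8, dr=1):
    """0th- and 1st-order GLAC features (sketch).

    Returns a vector of length n_bins + 4 * n_bins**2, matching the
    D + 4*D^2 dimensionality stated in the text.
    """
    img = np.asarray(image, dtype=np.float64)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)                        # gradient magnitude m(r)
    theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    bins = np.minimum((theta / (2 * np.pi) * n_bins).astype(int),
                      n_bins - 1)                 # orientation bin per pixel
    H, W = img.shape
    # Sparse orientation vectors g(r): one-hot over the n_bins bins.
    g = np.zeros((H, W, n_bins))
    g[np.arange(H)[:, None], np.arange(W)[None, :], bins] = 1.0

    # 0th order, Eq. (4): F0[d] = sum_r m(r) g_d(r)
    f0 = np.einsum('ij,ijd->d', mag, g)

    # 1st order, Eq. (5): four independent shift patterns b1.
    shifts = [(0, dr), (dr, 0), (dr, dr), (dr, -dr)]
    f1 = []
    for dy, dx in shifts:
        ys = slice(max(0, -dy), min(H, H - dy))
        xs = slice(max(0, -dx), min(W, W - dx))
        ys2 = slice(max(0, dy), min(H, H + dy))
        xs2 = slice(max(0, dx), min(W, W + dx))
        w = np.minimum(mag[ys, xs], mag[ys2, xs2])   # w(.) = min(.)
        # F1[d, d'] = sum_r w(r) g_d(r) g_d'(r + b1)
        f1.append(np.einsum('ij,ijd,ije->de',
                            w, g[ys, xs], g[ys2, xs2]).ravel())
    return np.concatenate([f0] + f1)
```

With n_bins = 8 the output has 8 + 4·64 = 264 entries per image, illustrating the D + 4D² dimensionality.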
The EAMF is complementary to the EASF; therefore, we fuse these two vectors into a single vector to obtain an optimal representation of an action. In all our experiments, the dimension of this single feature vector is 4752. The feature vector is efficient to compute owing to the sparse vector g. The work in [20] provides more detail on GLAC (Fig. 7).
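The fusion step amounts to simple concatenation, as sketched below; `describe` is a hypothetical parameter standing in for any per-image feature extractor (such as a GLAC implementation), and is not a name used in the paper.

```python
import numpy as np

def action_descriptor(bc_mhis, bc_shis, describe):
    """Fuse motion (EAMF) and static (EASF) vectors into one descriptor.

    bc_mhis / bc_shis: the three binary-coded motion / static images
    (XOY, YOZ, XOZ views); describe: any per-image feature extractor.
    """
    eamf = np.concatenate([describe(img) for img in bc_mhis])  # motion part
    easf = np.concatenate([describe(img) for img in bc_shis])  # static part
    # EAMF and EASF are complementary, so the final action vector is
    # simply their concatenation.
    return np.concatenate([eamf, easf])
```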

Action classification
To obtain a promising classification outcome, the fused version of EAMF and EASF is passed to the Kernel-based Extreme Learning Machine (KELM), which we now discuss in detail. The KELM [21] is an enhancement of the Extreme Learning Machine (ELM) classifier [53]: by associating a suitable kernel with the ELM, the KELM improves the discriminatory power of the classification algorithm. The Radial Basis Function (RBF) kernel is employed in our work. For an intuitive illustration, the classifier is described in Algorithm 1.

Experimental results and discussion
We evaluate the proposed framework on the MSRAction3D [22], DHA [23], and UTD-MHAD [24] datasets. Example depth images of each dataset are illustrated in Fig. 8. From the figure, it is clear that these datasets are ready for direct use without any segmentation process. Like the methods in [22][23][24], we directly input the depth map sequences into the proposed system without applying any preprocessing algorithm to the sequences.

Experiments on MSRAction3D dataset
The MSRAction3D dataset [22] consists of 20 actions performed by 10 different actors and exhibits inter-class similarity among action types. The action samples performed by odd-numbered actors are used for model training, and those performed by even-numbered actors for testing [22]. The KELM uses C = 10⁴ and an RBF kernel parameter of 0.03 for training the classification model; these optimal values were determined by 5-fold cross-validation. Table 1 reports a notable accuracy of 97.44% for our approach and indicates that it achieves considerably better classification accuracy than other existing methods. It can be seen that our method surpasses the deep learning systems described in [44] by 6.34% and 11.34% (see Table 1). To clarify the effectiveness of the feature enhancement, a system based only on the plain auto-correlation features is also evaluated on this dataset; the enhanced auto-correlation features improve the recognition accuracy by 5.5% under the same setup and parameters. Figure 9 shows the confusion matrix with the correct and incorrect classification rates, and Table 2 lists the failure cases of the approach. The table shows that "horizontal wave" is confused with "hammer" by 8.3%; "draw x" is confused with "high wave" by 7.14% and with "draw circle" by 21.43%; and "draw tick" is confused with "draw x" by 16.67%. Overall, among the 20 actions, 17 are classified correctly (i.e., with 100% accuracy) and the remaining 3 are confused with other actions.
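The 5-fold selection of C and the kernel parameter can be sketched generically as below; `make_clf`, the grid contents, and the seed are illustrative placeholders rather than the paper's actual search space.

```python
import numpy as np
from itertools import product

def kfold_grid_search(make_clf, X, y, grid, k=5, seed=0):
    """Pick parameters by k-fold cross-validation (illustrative sketch).

    make_clf: callable mapping a parameter dict to an object with
    fit(X, y) and predict(X); grid: dict of parameter -> candidate list.
    Returns (best_params, best_mean_accuracy).
    """
    X, y = np.asarray(X), np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    best = (None, -1.0)
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        accs = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            clf = make_clf(params).fit(X[train], y[train])
            accs.append(np.mean(clf.predict(X[test]) == y[test]))
        mean_acc = float(np.mean(accs))
        if mean_acc > best[1]:
            best = (params, mean_acc)
    return best
```

In our setting, the grid would range over candidate values of C and the RBF kernel parameter, and the winning pair is then used to train the final model on the full training split.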

Experiments on DHA dataset
The DHA dataset, proposed by Lin et al. [23], contains 23 action categories performed by 21 individuals. The dataset is complex owing to inter-class similarity between action categories such as golf-swing and rod-swing. The training and test instances are separated following the same technique as for the previous dataset [23]. The classification parameters C = 10⁴ and an RBF kernel parameter of 0.06 were selected through 5-fold cross-validation.
Our approach achieves a significant classification accuracy of 99.13% on this dataset, as can be seen from Table 3. For this dataset, the enhanced auto-correlation method outperforms the plain auto-correlation method by 2.17%. The confusion matrix is shown in Fig. 10. Furthermore, Table 4 summarizes the class-wise confusion information: "skip" and "side-clap" are misclassified with low confusion rates, while the other 21 actions are classified with 100% accuracy. The errors occur when "skip" is confused with "jump" by 9.09% and "side-clap" is confused with "side-box" by 9.09%.

Fig. 9 Confusion matrix obtained for the MSRAction3D dataset
Experiments on UTD-MHAD dataset

The experimental evaluation of our approach on the UTD-MHAD dataset is summarized in Table 5. The approach achieves an overall classification accuracy of 88.37% on this dataset. The table also compares our method with other existing methods: ours outperforms [24] (Kinect) by 22.27%, [24] (Kinect + inertial) by 9.27%, [58] by 3.97%, [72] by 2.57%, and [68] by 6.87%. The enhanced auto-correlation system surpasses the plain auto-correlation system by 0.93%. The confusion matrix is shown in Fig. 11; although the overall classification rate is 88.37%, the approach misclassifies some action classes. Due to inter-class similarity, 16 of the 27 action classes show confusion. Table 6, extracted from the confusion matrix, analyzes the experimental results further. The table indicates that "swipe-right" is confused with "swipe-left" at a rate of 15.79%, and "wave" is confused with "draw-circle-CW" at 20.0%. Similarly, the confusion rate between "clap" and "wave" is 20.0%, and between "wave" and "clap" 15.79%. In addition, not all samples of the action classes "basketball shoot", "draw-x", "draw-circle-CW", "draw-circle-CCW", "draw-triangle", "baseball-swing", "push", "knock", "catch", "jog", "stand2sit", and "lunge" are classified perfectly; the misclassified samples are confused with samples of similar body postures of subjects.

System competency
The computational time and the complexity of the key components are considered to examine the system's efficiency.

Computational complexity
In fact, the PCA and the KELM are the key components in the computational complexity of the introduced system. The PCA has a complexity of O(m³ + m²r) [14] and the KELM has a complexity of O(r³) [73]. As a result, the complexity of the system can be expressed as O(m³ + m²r) + O(r³). Table 8 reports the calculated complexity and compares it with the complexities of other existing methods. It can be seen that our method has lower computational complexity than the other methods listed in the table; it is also superior to them from the recognition perspective. Thus, our approach is superior in terms of both recognition accuracy and computational efficiency.

Table 5 Recognition accuracy (%) comparison on the UTD-MHAD dataset

Method | Accuracy (%)
Microsoft Kinect [24] | 66.1
Wearable inertial [24] | 67.2
Microsoft Kinect + wearable inertial [24] | 79.1
3DHoT + MBC [58] | 84.4
Joint trajectory + CNN [72] | 85.8
Hierarchical Gaussian [68] | 81.5
MHF + SHF + MSVM [36] | 83.26
Only auto-correlation features | 87.44
Proposed method (enhanced auto-correlation features) | 88.37

Fig. 11 Confusion matrix obtained for the UTD-MHAD dataset

Conclusion
This paper has introduced an efficacious and efficient human action recognition framework based on enhanced auto-correlation features. The system uses the 3DMTM to derive three motion-information images and three motionless-information images from an action video. These motion and static maps are improved by applying the LBP algorithm to them. The outputs of the LBP are fed into the GLAC descriptor to obtain an action description vector. With the feature vectors obtained from GLAC, the action classes are distinguished through the KELM classification model. The approach is extensively assessed on the MSRAction3D, DHA, and UTD-MHAD datasets. Owing to our action representation strategy, the proposed algorithm exhibits considerable performance gains over existing handcrafted as well as deep learning methods. It is also evident that the enhanced auto-correlation features-based method clearly surpasses the simple auto-correlation features-based method; thus, the feature enhancement contributes significantly to the system. Furthermore, the computational efficiency of the method indicates its suitability for real-time operation.
It is worth mentioning that some misclassifications are observed in our method. Note that the proposed method does not remove noise to improve performance; the system only employs the LBP as a preprocessing method for edge enhancement. Besides the LBP, a noise-removal algorithm could be utilized to address the misclassification issues of the proposed approach and thus further improve the overall recognition accuracy. The descriptor could also be improved to increase the discriminatory power of the approach.
In our future work, we aim to build a deep model using the obtained 2D motion and static images. Besides, the current approach has not been evaluated on large and complex RGB datasets such as UCF101 and HMDB51; with proper modification, the approach will be tested on these datasets in the future. Furthermore, we plan to build a new recognition framework using the GLAC descriptor on RGB and depth datasets jointly.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.