1 Introduction

Recognition of human activity is a crucial and popular area of computer vision research, because it forms the foundation for the development of many applications [1, 2]. Human monitoring-based surveillance systems have many industrial applications [3]. Such systems are also very useful in police investigations after a crime or other illegal action has occurred. Hence, human activity recognition is essential for an effective video surveillance system. Cluttered backgrounds, multiple viewpoints, varying lighting conditions, and similar problems make accurate and efficient human activity recognition a challenging task. The objective of human activity recognition is to automatically analyze ongoing activities in a video (a sequence of frames) and to classify each observed activity into one of a given set of abstract activities, e.g., running, walking, or jumping.

Based on several detailed studies [1, 2, 4, 5], human activity recognition approaches can be classified into various categories. According to the analysis by Aggarwal and Ryoo [2], human activities can be conceptually categorized into four levels, depending on their complexity: gestures, actions, interactions, and group activities. Gestures are elementary movements of a person's body parts, e.g., 'raising a leg' or 'stretching an arm'. Actions are single-person activities that result from multiple gestures arranged temporally, such as 'walking', 'running', and 'jogging'. Interactions involve two or more persons and/or objects; e.g., 'two-person fighting' is an interaction between two humans. Group activities are more complex than the other three types: they are performed by one group composed of multiple persons and/or objects. 'A group of persons marching' is a typical example of a group activity.

Nowadays, machine learning-based approaches to human activity recognition are widely used, because large amounts of unannotated training data are readily available for the unsupervised training of these systems. In this type of approach, after learning from a collection of data, the algorithm tries to answer questions related to that data. The training used in such approaches is either pixel-based or feature-based [6]; the feature-based approach is better than the pixel-based one in terms of execution speed. For any good computer vision application, an activity recognition algorithm should hold two properties: (i) it should perform under real-time constraints, and (ii) it should be able to solve multi-view as well as multi-class problems.

Many methods have been proposed to solve the activity recognition problem in the last few decades. A method for activity recognition using Principal Component Analysis and Hidden Markov Models was proposed by Uddin et al. [7]. Qian et al. [8] proposed an activity recognition method for a smart surveillance system based on background subtraction with shape and motion features. Bobick and Davis [9] proposed an activity recognition method based on motion templates. Hsieh et al. [10] proposed a human recognition method based on the constrained Delaunay triangulation technique, which divides the posture into different triangular meshes. Holte et al. [11] proposed a machine learning-based approach that detects the motion of actors by computing the optical flow in video data.

Most existing human action recognition methods adopt features that work at a single image resolution, whereas an image can have complex structure and contain varying levels of detail. To remedy this issue, multi-resolution analysis (MRA) can be adopted. In MRA, images are analyzed at more than one resolution, so that features left undetected at one level get a chance to be captured at another level. The wavelet transform is the most popular tool for MRA. Wavelet transforms can be categorized into real-valued and complex-valued transforms: a real-valued wavelet transform uses real-valued filters to obtain real-valued coefficients, while a complex wavelet transform uses complex-valued filters to obtain complex-valued coefficients.

The method proposed by Khare et al. [12] uses a real-valued wavelet transform for activity recognition, but real-valued wavelet transforms are not well suited to various computer vision applications. The use of a complex wavelet transform can avoid several shortcomings of real-valued wavelet transforms. Khare et al. [13] proposed a dual-tree complex wavelet transform based approach for human action recognition; however, the dual-tree complex wavelet transform is not a complex wavelet transform in the true sense, as its implementation is based on real-valued wavelet transforms.

Motivated by the work of Khare et al. [12, 13], in this paper we propose a Daubechies complex wavelet transform (DCxWT) based approach for the recognition of human activities. DCxWT is a complex wavelet transform in the true sense and has the advantages of approximate shift-invariance and better edge representation compared to real-valued wavelet transforms. We use a multi-class support vector machine as the classifier for classifying different human activities in video frames. We have evaluated the proposed method at multiple levels of DCxWT and show that its performance improves as we move toward higher levels. Experiments were conducted on different standard action datasets. The proposed method is compared with the state-of-the-art methods proposed by Qian et al. [8], Holte et al. [11], and Khare et al. [12, 13] to show the effectiveness of DCxWT.

The rest of the paper is organized as follows: Sect. 2 describes the DCxWT used as a feature for the recognition of human activities. Section 3 describes the proposed method in detail. Experimental results of the proposed method and other state-of-the-art methods are given in Sect. 4. Finally, conclusions of the work are given in Sect. 5.

2 Daubechies Complex Wavelet Transform

In any recognition algorithm, the selection of an appropriate feature is very important: if the correct feature is selected, the performance of the classifier improves. In the proposed work on recognition of human activities, we use DCxWT coefficients as the feature set. A brief description of the Daubechies complex wavelet transform, and of why it is useful for the recognition of human activities, is given below.

For activity recognition, we require a feature that remains invariant under shift, translation, and rotation of the object, because the object may appear in translated and rotated form across different scenes. Due to its approximate shift-invariance and better edge representation, we use DCxWT as the feature for recognition of human activity.
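To illustrate the shift-sensitivity problem that motivates this choice, the following sketch compares the detail-coefficient energy of a real-valued DWT for a signal and its one-sample shift. It assumes the PyWavelets package (not used in the paper itself); the impulse signal and the 'db4' filter are purely illustrative.

```python
# A minimal sketch of the shift sensitivity of the real-valued DWT,
# which motivates complex wavelet transforms such as DCxWT.
import numpy as np
import pywt

# Two signals that differ only by a one-sample shift.
x = np.zeros(64)
x[30] = 1.0
x_shifted = np.roll(x, 1)

# Single-level real DWT with a Daubechies filter; dwt returns
# (approximation, detail) coefficient arrays.
_, d = pywt.dwt(x, 'db4')
_, d_shifted = pywt.dwt(x_shifted, 'db4')

# The energy of the detail coefficients typically differs under the shift,
# so features built directly on them are not shift-invariant.
print("detail energy (original):", np.sum(d**2))
print("detail energy (shifted): ", np.sum(d_shifted**2))
```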

Any function \( f(t) \) can be decomposed in terms of the complex scaling function and the mother wavelet as:

$$ f(t) = \sum\limits_{k} {c_{k}^{{j_{0} }} } \phi_{{j_{0} ,k}} (t) + \sum\limits_{{j = j_{0} }}^{{j_{\max } - 1}} {\sum\limits_{k} {d_{k}^{j} } } \psi_{j,k} (t) $$
(1)

where \( j_{0} \) is a given low resolution level, and \( \left\{ {c_{k}^{{j_{0} }} } \right\} \) and \( \left\{ {d_{k}^{j} } \right\} \) are the approximation and detail coefficients, respectively. The complex scaling function and mother wavelet are defined by \( \phi (u) = 2\sum\limits_{i} {a_{i} \phi (2u - i)} \) and \( \psi (t) = 2\sum\limits_{n} {( - 1)^{n} \overline{{a_{1 - n} }} \phi (2t - n)} \), where \( \phi (t) \) and \( \psi (t) \) share the same compact support \( [ - L,L + 1] \) and the \( a_{i} \)'s are coefficients satisfying \( \sum {a_{i} = 1} \). The \( a_{i} \)'s can be real- as well as complex-valued.

Daubechies' wavelet bases \( \left\{ {\psi_{j,k} (t)} \right\} \) in one dimension are defined through the above scaling function \( \phi (u) \) and a multiresolution analysis of \( L_{2} (\Re ) \) [14]. If, during the formulation of the general solution, we relax the condition that the \( a_{i} \) be real [14], we obtain a complex-valued scaling function.
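Equation (1) can be checked numerically. The following sketch again assumes PyWavelets; its real-valued 'db4' filter is a stand-in, since PyWavelets does not provide the complex-valued Daubechies filters used here, but the decomposition structure is the same.

```python
# A minimal numerical check of the decomposition in Eq. (1).
import numpy as np
import pywt

t = np.linspace(0, 1, 64)
f = np.sin(2 * np.pi * 5 * t)

# wavedec returns one set of approximation coefficients at the coarsest
# level plus detail coefficients from the coarsest to the finest level.
coeffs = pywt.wavedec(f, 'db4', level=3)

# Reconstructing from all coefficients recovers f, as Eq. (1) states.
f_rec = pywt.waverec(coeffs, 'db4')
print(np.allclose(f, f_rec))
```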

DCxWT holds various properties [14], among which reduced shift sensitivity and better edge representation are the important ones for classification.

3 The Proposed Method

This section describes the proposed method for human activity recognition, in which Daubechies complex wavelet transform coefficients are used as object features. A block diagram of the proposed method is given in Fig. 1.

Fig. 1. Block diagram of the proposed method

Details of the proposed method are described in the following steps:

Step 1: The first step of the proposed method is to collect and represent data for training and testing. We have tested the proposed method on various publicly available standard datasets, e.g., the CASIA dataset [15] and the KTH dataset [16]. Human activities are presented in the form of videos, i.e., sequences of frames. These videos can be used either for training or for testing, depending on the purpose, and each frame of a video can be treated as an image.

Step 2: Videos in a dataset may be of different sizes or have different color formats. Therefore, normalization of these videos is required to reduce complexity and maintain uniformity across all datasets with respect to the algorithm. In the present work, the collected data are scale-normalized to 256 × 256 pixels. After scale normalization, the frames are converted to gray-level format. A minimal sketch of this step is given below.
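This sketch assumes OpenCV and an illustrative file name; the paper does not prescribe a particular library for this step.

```python
# A minimal sketch of the normalization in Step 2: every frame is resized
# to 256 x 256 pixels and converted to gray level.
import cv2

def normalize_frame(frame):
    """Scale-normalize a BGR video frame and convert it to grayscale."""
    resized = cv2.resize(frame, (256, 256))
    return cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)

# Example: read and normalize frames from a video file (path is illustrative).
cap = cv2.VideoCapture("walk_sample.avi")
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(normalize_frame(frame))
cap.release()
```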

Step 3: The third step of the proposed method is feature set computation. Frames of the video dataset are decomposed into complex wavelet coefficients using DCxWT. Each decomposition produces four sub-bands, namely LL, HL, LH, and HH, where LL contains the approximation coefficients and HL, LH, and HH contain the detail coefficients. Each detail coefficient matrix (HL, LH, and HH) is used separately to construct the feature vector. The LL sub-band is excluded from the feature vector; instead, owing to the multi-resolution property of DCxWT, it is used for the next, higher-level decomposition. At level 2, the approximation coefficients (LL sub-band) of level 1 are decomposed again, producing one approximation coefficient matrix and three detail coefficient matrices; the detail features of level 1 are combined with those of level 2 to form the level-2 feature set. The same process is repeated for each successive level of decomposition, as sketched below.
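The following sketch shows this multi-level feature construction. PyWavelets has no Daubechies complex wavelet, so the real 'db4' filter stands in for the paper's DCxWT; with a true DCxWT one would work with the magnitudes of the complex coefficients instead.

```python
# A minimal sketch of the multi-level feature construction in Step 3,
# assuming PyWavelets with a real Daubechies filter as a stand-in for DCxWT.
import numpy as np
import pywt

def wavelet_features(frame, levels=3, wavelet='db4'):
    """Concatenate the detail sub-bands over all decomposition levels."""
    ll = frame.astype(np.float64)
    parts = []
    for _ in range(levels):
        # dwt2 returns the approximation sub-band and the three detail
        # sub-bands (horizontal, vertical, diagonal), corresponding to
        # the paper's LL and HL/LH/HH.
        ll, (d_h, d_v, d_d) = pywt.dwt2(ll, wavelet)
        # Detail coefficients enter the feature vector; LL is only
        # carried forward to the next, coarser decomposition level.
        parts.extend([d_h.ravel(), d_v.ravel(), d_d.ravel()])
    return np.concatenate(parts)
```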

Step 4: The fourth and last step of the proposed method is recognition of human activities, performed as classification. We compute the feature sets of the test and training video datasets using steps 1–3, and then supply both feature sets to a multi-class SVM classifier. The classifier compares the test feature set with the trained feature set and returns the recognized human activity; the same process is repeated for all other video datasets. In the multi-class SVM classifier, we use the radial basis function (RBF) as the kernel function. A sketch of this step follows.
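This sketch assumes scikit-learn; the variable names and random placeholder data are illustrative, standing in for the feature vectors produced in Step 3 and their activity labels.

```python
# A minimal sketch of the classification in Step 4: a multi-class SVM
# with an RBF kernel trained on the wavelet feature sets.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder feature sets and labels (illustrative only); in practice
# these come from Steps 1-3 of the proposed method.
X_train = np.random.rand(100, 512)
y_train = np.random.choice(['walk', 'run', 'bend', 'fight', 'rob'], 100)
X_test = np.random.rand(10, 512)

# SVC handles the multi-class case internally (one-vs-one decomposition).
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
clf.fit(X_train, y_train)
predicted_activities = clf.predict(X_test)
```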

4 Experimental Results

In this section, we compare the experimental results of the proposed method with those of the methods proposed by Qian et al. [8], Holte et al. [11], and Khare et al. [12, 13], in terms of Precision, Recall, F-Score, and Recognition Accuracy. Results for the proposed method and the other state-of-the-art methods are presented for the CASIA action dataset [15] and the KTH dataset [16].

The CASIA dataset is a collection of human activity sequences captured by video cameras in an outdoor environment from different angles of view, with non-uniform backgrounds. It contains a total of 1446 video sequences covering fifteen different types of actions. All video sequences were captured simultaneously with three non-calibrated cameras from different view angles (horizontal view, angle view, and top-down view). The resolution of the videos is 320 × 240 pixels at a frame rate of 25 frames per second. Figure 2 shows different activities from different viewpoints of the CASIA dataset on which we performed experiments. For the experiments, we selected five of the fifteen activities, namely walk, run, bend, fight, and rob. The KTH dataset [16] is one of the largest datasets, with 192 videos for training, 192 videos for validation, and 216 videos for testing, in which six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) are performed by 25 persons. Figure 3 shows the different activities of the KTH dataset [16].

Fig. 2. Examples of different activities in the CASIA Dataset [15]

Fig. 3. Examples of KTH Dataset frames for different activities [16]: (a) Walking, (b) Jogging, (c) Running, (d) Boxing, (e) Hand waving, (f) Hand clapping

From Figs. 2 and 3, one can see that the human activities are differently illuminated and the objects appear at different scales under different camera positions; one can also observe that both frontal and side views of the objects were used in the experiments on human action recognition.

Recognition results on the CASIA dataset [15] and the KTH dataset [16] are shown in Tables 1 and 2, respectively, for the proposed method and the other state-of-the-art methods [8, 11,12,13]. For comparison we used four performance metrics [17]: Precision, Recall, F-Score, and Recognition Accuracy.

Table 1. Performance measure values for the CASIA Dataset [15]
Table 2. Performance measure values for the KTH Dataset [16]

Precision is the average per-class agreement of the data class labels with those of the classifier. Recall is the average per-class effectiveness of the classifier at identifying class labels. F-Score relates the data's positive labels to those given by the classifier on a per-class average; it conveys the balance between precision and recall. Recognition accuracy is the average per-class effectiveness of the classifier. These measures are computed as follows:

$$ Precision = \frac{{\sum\limits_{i = 1}^{L} {\left( {\frac{{TP_{i} }}{{TP_{i} + FP_{i} }}} \right)} }}{L} $$
(2)
$$ Recall = \frac{{\sum\limits_{i = 1}^{L} {\left( {\frac{{TP_{i} }}{{TP_{i} + FN_{i} }}} \right)} }}{L} $$
(3)
$$ F{\text{-}}Score = 2 \cdot \left( {\frac{Precision \times Recall}{Precision + Recall}} \right) $$
(4)
$$ Accuracy = \frac{{\sum\limits_{i = 1}^{L} {\left( {\frac{{TP_{i} + TN_{i} }}{{TP_{i} + FN_{i} + FP_{i} + TN_{i} }}} \right)} }}{L} $$
(5)

where \( L \) is the number of classes, and \( TP_{i} \), \( FP_{i} \), \( FN_{i} \), and \( TN_{i} \) denote the numbers of true positives, false positives, false negatives, and true negatives for class \( i \), respectively.
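As a concrete illustration, the following sketch computes the four macro-averaged measures of Eqs. (2)–(5) from a confusion matrix; the example matrix is purely illustrative.

```python
# A minimal sketch of the macro-averaged metrics of Eqs. (2)-(5), computed
# from a confusion matrix (rows: true class, columns: predicted class).
import numpy as np

def macro_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                      # per-class true positives
    fp = cm.sum(axis=0) - tp              # per-class false positives
    fn = cm.sum(axis=1) - tp              # per-class false negatives
    tn = cm.sum() - (tp + fp + fn)        # per-class true negatives

    precision = np.mean(tp / (tp + fp))                    # Eq. (2)
    recall = np.mean(tp / (tp + fn))                       # Eq. (3)
    f_score = 2 * precision * recall / (precision + recall)  # Eq. (4)
    accuracy = np.mean((tp + tn) / (tp + fp + fn + tn))    # Eq. (5)
    return precision, recall, f_score, accuracy

# Example with an illustrative 3-class confusion matrix.
print(macro_metrics([[8, 1, 1], [0, 9, 1], [1, 0, 9]]))
```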

Tables 1 and 2 show the recognition results in terms of the four performance metrics for the proposed method and the other state-of-the-art methods [8, 11,12,13] on the CASIA dataset [15] and the KTH dataset [16], respectively. From Tables 1 and 2, we can see that the proposed method gives better recognition accuracy than the other state-of-the-art methods [8, 11,12,13]. The method proposed by Holte et al. [11] is designed for multi-view human activity recognition, yet the tables show that the proposed method gives better recognition results; hence, the proposed method works well for multi-view poses as well.

5 Conclusion

In this paper, we presented a new method for the recognition of human activities. The proposed approach uses DCxWT coefficients as features of the activities of human objects. We evaluated results for multiple levels of DCxWT and compared the results of the proposed method with those of other state-of-the-art methods [8, 11,12,13] on the CASIA dataset [15] and the KTH dataset [16], in terms of four performance measures: Precision, Recall, F-Score, and Recognition Accuracy. From the results, we conclude that the proposed method achieves better recognition accuracy at higher levels of DCxWT than DWT- or DTCWT-based methods, and that it outperforms the other methods. The results also indicate that complex wavelet transforms are better suited than real wavelet transforms for action recognition in terms of these performance measures.