1 Introduction

Human Action Classification is a pattern recognition task whose main goal is to identify the action displayed in media content. This task has gained a lot of attention in the past few years, regardless of the media source, such as images, sequences of video frames, raw video data or annotations. Here the focus is on video data, and the main task can be defined as follows: given a video, classify the action displayed into one of a predetermined set of actions using only the content presented in the video.

To achieve this goal, we explore ways to better represent the video content, transforming the original input information into a representation more suitable for the classifier. Researchers usually address this problem in two stages [9]: (i) feature extraction; and (ii) action classification. In a typical approach, feature extraction is performed directly on the raw data, here called low-level description, trying to avoid noise and irrelevant information. Action classification involves learning statistical models from the extracted features and using those models to classify new feature observations.

The most discriminative low-level descriptors available in the literature today rely on identifying regions of interest. Once these regions are identified, the desired features are extracted around them. The output of this process is a set of features, each related to a region, representing the media. In this scenario, one popular approach is to map the set of local descriptors into a single vector used as a global representation, the so-called mid-level representation.

Among the methods for creating a mid-level representation, Bag-of-Words (BoW), Spatial Pyramids and Convolutional Networks stand out for their notable results. As stated in [2], mid-level representations have three steps in common: (i) coding; (ii) pooling; and (iii) concatenation. Coding is a transformation applied locally to the feature vectors, extracting distribution characteristics. Pooling, in turn, explores the spatial relation between these characteristics; and concatenation constructs the final vector representation.

Here we explore a new strategy for the pooling step based on a volumetric partition of a hypersphere centered at each codeword. The goal is to maintain the same probability of assignment to each hyper-region. We argue that this kind of pooling can decrease the quantization error created during coding.

This paper is organized as follows. In Sect. 2, related work involving mid-level representations and human action classification is described. In Sect. 3, a formalization of the traditional BoW is presented, while in Sect. 4, the new mid-level representation is introduced. Experiments on human action recognition taking into account three well-known datasets are presented in Sect. 5, and finally, some conclusions are drawn in Sect. 6.

2 Related Work

Human action recognition is a popular topic in video processing, but it is still an open problem due to the difficulty of creating a representation able to capture and describe action motions in different scenarios. Among the most prominent action descriptors, local spatio-temporal features, as proposed in [6, 15], have been successfully used in several applications. In [6], the Space-Time Interest Point (STIP) descriptor is proposed. In STIP, interest points are detected at multiple scales and associated with a patch. Each patch is described using Histograms of Oriented Gradients (HOG) and Histograms of Optical Flow (HOF). In the Dense Trajectories descriptor, proposed in [15], trajectories are obtained by densely tracking sampled points using optical flow fields. After tracking, feature descriptors are extracted around the trajectories using HOG, HOF and Motion Boundary Histograms (MBH).

These two description approaches represent action videos by a set of local features. Inspired by the success of mid-level representations of local features in image processing, they rely on vector quantization in a BoW scheme to create a global video representation. Although they achieve good results, the loss of information during coding is still an open problem.

In order to deal with quantization errors, a wide range of methodologies for creating mid-level representations has been proposed. Most of these representations, when applied to the action recognition task, follow the vector quantization of the BoW model, but try to preserve spatio-temporal relations during the coding process by taking into account multiple weighted representations.

In [18], the combination of local histograms with body-region histograms was proposed in order to preserve spatio-temporal relations between interest points. In [8], a sparse coding with max pooling framework was proposed and applied in multiple contexts, which are defined according to the spatial scale and nearest neighbors of local features. Moreover, one vocabulary and one histogram are constructed for each defined context. At the end, these data are concatenated into the final representation.

In [17], a method based on multiple hierarchical levels for creating histograms was proposed. The first level is constructed using the descriptors extracted from video cuboids. The other levels are created by applying a neighboring function to the previous level's description and by creating a new codebook and a new histogram for the current level. This scheme is called hierarchical BoW. In [13], a hierarchical BoW is constructed by recursively computing partitions of a depth map sequence in the temporal domain, called Temporal Bag-of-Words.

In order to explore a representation driven by the histogram information, a contextual domain surrounding a spatio-temporal area was defined in [20]. After that, a contextual distance is calculated by adding a penalty value proportional to the probability density function computed from the local descriptors and codewords of the contextual domain. Contextual information is also used in [19]; however, it is obtained by histogram intersections, using both spatial and temporal distances as weighting factors. In [16], the Term Frequency-Inverse Document Frequency (TF-IDF) of visual words is used to create histograms representing video segments. These histograms are applied in a continuous framework using a data stream algorithm to update the system knowledge based on the classification score obtained by the histogram. In [3], contextual information incorporates depth camera data and global frame descriptors in a BoW framework.

In the image processing domain, BoW representations enriched with extra knowledge from the set of local descriptors have been explored in several approaches [11, 21]. However, those works use parametric models, leading to a very high-dimensional representation. On the other hand, the BossaNova model [1], which follows the BoW formalism (coding/pooling), keeps more information than the traditional BoW during the pooling step. It estimates a probability density function by computing a histogram of distances between local descriptors and codewords. In addition to the pooling strategy, [1] also proposed a localized soft-assignment coding that considers only the k-nearest codewords when coding a local descriptor.

3 Traditional Bag-of-Words

In the traditional Bag-of-Words (BoW) model for mid-level representation, the input is a set of unordered local descriptors representing the whole data. The BoW model first requires a dictionary learned from the feature points. The most common approach to create the dictionary is an unsupervised clustering algorithm (e.g., K-means). The dictionary is composed of a set of M codewords. More precisely, let \({\mathbb {X}}=\left\{ { {\mathbf {x}}}_{j}\in {\mathbb {R}}^{ d}\right\} _{j=1}^{N}\) be an unordered set of d-dimensional descriptors \({ {\mathbf {x}}}_{j}\) extracted from the data, and let \(\mathbb {C}=\{\mathbf {c}_m\in {\mathbb {R}}^{ d}\}_{m=1}^{M}\) and \(\mathbb {Z}\in {\mathbb {R}}^{M}\) be the learned dictionary and the final vector representation, respectively. As formalized in [2], the mapping from \(\mathbb {X}\) to \(\mathbb {Z}\) can be decomposed into three successive steps: (i) coding; (ii) pooling; and (iii) concatenation, as follows:

$$\begin{aligned} \alpha _j&=f({ {\mathbf {x}}}_{j}), j\in [1,N]&\text {(coding)} \end{aligned}$$
(1)
$$\begin{aligned} h_m&=g(\{\alpha _{m,j}\}_{j=1}^{N}),m\in [1,M]&\text {(pooling)} \end{aligned}$$
(2)
$$\begin{aligned} z&=[h_1^T,\dots ,h_M^T]&\text {(concatenation)} \end{aligned}$$
(3)
Fig. 1. Matrix \(\mathbf{H}\) of the BoW model, in which the rows and columns are related to the pooling and coding functions, respectively, as presented in [1].

In the traditional BoW framework [14], the coding function f minimizes the distance to the codebook, and the pooling function g computes the sum over the pooling region. As illustrated in Fig. 1, the coding and pooling functions can be visualized in terms of the matrix H, with N columns and M rows. In this example, the coding function f for a given descriptor \({ {\mathbf {x}}}_{j}\) corresponds to the information in the \(j\)-th column, and the pooling function g for a given visual word \(\mathbf {c}_{m}\) corresponds to the \(m\)-th row of the H matrix. More precisely, both functions can be defined as follows:

$$\begin{aligned} \alpha _j\in \{0,1\}^M,\quad \alpha _{m,j}=1 ~\text {iff}~ m=\mathop {\mathbf {arg\,min}}\limits _{m'\le M}{\parallel { {\mathbf {x}}}_{j}-\mathbf {c}_{m'}\parallel }_2^2 \end{aligned}$$
(4)
$$\begin{aligned} h_m&=\frac{1}{N}\sum _{j=1}^{N}\alpha _{m,j} \end{aligned}$$
(5)

in which \(\mathbf {c}_m\) denotes the m-th codeword.
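For concreteness, a minimal NumPy sketch of this hard-assignment coding (Eq. 4) and average pooling (Eq. 5) is given below; the function name and toy data are illustrative only, not part of any cited implementation.

```python
# A minimal sketch of hard-assignment coding (Eq. 4) and average
# pooling (Eq. 5); names and toy data are illustrative.
import numpy as np

def bow_representation(X, C):
    """X: (N, d) local descriptors; C: (M, d) codewords.
    Returns the (M,) normalized BoW histogram z."""
    N, M = X.shape[0], C.shape[0]
    # Squared Euclidean distance between every descriptor and codeword.
    dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (N, M)
    # Coding: each descriptor activates only its nearest codeword.
    alpha = np.zeros((M, N))
    alpha[dists.argmin(axis=1), np.arange(N)] = 1.0
    # Pooling: average of the assignments in each row of the H matrix.
    return alpha.sum(axis=1) / N

rng = np.random.default_rng(0)
z = bow_representation(rng.normal(size=(100, 16)), rng.normal(size=(8, 16)))
assert np.isclose(z.sum(), 1.0)
```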

In [4], some improvements were obtained by smoothing the distribution during the coding step. This approach, called soft-assignment, models an ambiguity concept in the attribution, creating more expressive models for classification. The authors indicate that a large vocabulary increases the probability of multiple relevant visual words representing one feature point. This is called visual word uncertainty and can be formulated as follows:

$$\begin{aligned} \alpha _{m,j} = \frac{\exp (-\beta \parallel { {\mathbf {x}}}_{j}-\mathbf {c}_m\parallel _{2})}{\sum _{k=1}^{M}\exp (-\beta \parallel { {\mathbf {x}}}_{j}-\mathbf {c}_{k}\parallel _{2})} \end{aligned}$$
(6)

where \(\beta \) is a parameter that controls the softness of the soft assignment (hard assignment is the limit when \(\beta \rightarrow \infty \)).
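A sketch of this soft-assignment coding (Eq. 6) is shown below; the helper name is hypothetical, and hard assignment is recovered as \(\beta\) grows.

```python
# A sketch of the soft-assignment coding of Eq. 6; beta controls the
# softness, and hard assignment is the limit as beta -> infinity.
import numpy as np

def soft_assignment(X, C, beta=1.0):
    """X: (N, d) descriptors; C: (M, d) codewords.
    Returns alpha: (M, N); each column sums to 1."""
    dists = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
    w = np.exp(-beta * dists)                     # (N, M) unnormalized weights
    return (w / w.sum(axis=1, keepdims=True)).T   # normalize per descriptor
```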

4 An Extended BoW Formalism

In the traditional BoW framework [14], the pooling function g counts the descriptors over the pooling region; thus, the mid-level representation can be defined by a concatenation of all values related to the codewords. Unfortunately, this pooling strategy is quite poor in terms of information inside each pooling region, mainly regarding the spatial distribution of the descriptors. To cope with this lack of information, we propose a new mid-level representation, so-called BOH (Bag Of local distribution of descriptors on concentric Hyperspheres), which explores the descriptor positions inside the largest hypersphere centered at each codeword to compute the pooling. For that, we propose to divide this hypersphere into equally probable hyper-regions, in which the descriptors inside one hyper-region have similar distances to the codeword.

Let \(S_{i}\) and \(S_{j}\) be two hyperspheres centered at codeword \(\mathbf {c}_m\) with radii \(r_i\) and \(r_j\), respectively, in which \(r_i<r_j\). We define the hyper-region \(R_{{i,j}}\) between the hyperspheres \(S_{i}\) and \(S_{j}\) as the hyper-region given by the difference of \(S_{j}\) and \(S_{i}\). More precisely, a \(d\)-dimensional descriptor belongs to \(R_{{i,j}}\) if its distance to the codeword \(\mathbf {c}_m\) is greater than \(r_i\) and smaller than or equal to \(r_j\).

Two hyper-regions \(R_{{i,j}}\) and \(R_{{i',j'}}\) are considered equally probable if they have the same volume, i.e., \(V_{}^{}(R_{{i,j}})=V_{}^{}(R_{{i',j'}})\). Let \({E}\) be the number of equally probable hyper-regions related to the codeword \(\mathbf {c}_m\). Without loss of generality, let \(S_{{E}}\) and \(S_{1}\) be two hyperspheres with radii \(r_{E}\) and \(r_1\) centered at \(\mathbf {c}_m\); then \(V_{}^{}(R_{{{E}-1,{E}}})=V_{}^{}(R_{{0,1}})\) iff \(V_{}^{}(S_{{E}})={E}\times V_{}^{}(S_{1})\). From this definition, it is easy to show that \(r_{e} = r_{1} \times {\root d \of {e}}\), \(\forall ~e\in [1,{E}]\).
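The radii of the equally probable hyper-regions follow directly from this relation, as the sketch below illustrates (the helper name is ours, for illustration only).

```python
# A sketch of the equally probable radii: the volume of a d-dimensional
# hypersphere grows as r**d, so r_e = r_1 * e**(1/d) yields E shells of
# identical volume inside the largest hypersphere (of radius r_E).
import numpy as np

def equal_volume_radii(r_max, E, d):
    """Radii r_1, ..., r_E of E equal-volume hyper-regions in a d-dim ball."""
    return r_max * (np.arange(1, E + 1) / E) ** (1.0 / d)

# For E=3 shells in a 2-D disk of radius 1, the radii are about
# [0.577, 0.816, 1.0], and each annulus then has area pi/3.
print(equal_volume_radii(1.0, 3, 2))
```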

Fig. 2. Example of the pooling strategies for \(d\)-dimensional descriptors, taking into account BoW, BossaNova and BOH. For BossaNova, the number of hyper-regions is equal to 3; for BOH, the number of equally probable hyper-regions related to each codeword is also equal to 3.

Considering these E equally probable hyper-regions, the proposed pooling strategy is a histogram of distances between the local descriptors and the codewords, taking into account the radius of the largest \(d\)-dimensional hypersphere over the pooling region. Let \({\mathbb {X}}=\left\{ \mathbf {x}_j \right\} \) be an unordered set of \(d\)-dimensional descriptors \({ {\mathbf {x}}}_{j}\) extracted from a video, such that \(j\in [1,N]\). The proposed pooling strategy is defined by:

$$\begin{aligned} h_{m,e}&=\mathrm {card}\left( \left\{ { {\mathbf {x}}}_{j} ~:~ r_{e-1}^{\mathbf {c}_m} < {\parallel { {\mathbf {x}}}_{j}-\mathbf {c}_m\parallel }_2 \le r_{e}^{\mathbf {c}_m} \right\} \right) ,\quad r_{e}^{\mathbf {c}_m}=r_{{E}}^{\mathbf {c}_m}\times \root d \of {e/{E}} \end{aligned}$$
(7)

in which \(r_{{E}}^{\mathbf {c}_m}\) is the radius of the largest \(d\)-dimensional hypersphere centered at codeword \(\mathbf {c}_m\). The final representation \(\mathbf {z}\) is given by:

$$\begin{aligned} \mathbf {z} = \left[ h_{m,e}\right] ^\text {T},~(m,e) \in \{1,...,M\} \times \{1,...,{E}\} \end{aligned}$$
(8)

where \(\mathbf {z}\) is a vector of size \(M\times {E}\).

When the number of equally probable hyper-regions is equal to 1, our pooling strategy is similar to the traditional BoW. As the number of hyper-regions and codewords increases, the vector \(\mathbf {z}\) becomes sparser but better approximates the actual distribution of distances. Thus, there is a trade-off between sparsity and representation size.
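For illustration, the sketch below implements a hard-assignment variant of the BOH pooling under the definitions above; taking \(r_{{E}}^{\mathbf {c}_m}\) as the largest observed distance among the descriptors assigned to \(\mathbf {c}_m\) is an assumption of the sketch, not necessarily the exact procedure used in our experiments.

```python
# A sketch of BOH pooling with hard assignment. Assumption: r_E for each
# codeword is the largest observed distance among its assigned descriptors.
import numpy as np

def boh_pooling(X, C, E):
    """X: (N, d) descriptors; C: (M, d) codewords; E hyper-regions.
    Returns z: (M * E,) concatenated shell histograms (Eq. 8)."""
    N, d = X.shape
    M = C.shape[0]
    dists = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
    nearest = dists.argmin(axis=1)
    H = np.zeros((M, E))
    for m in range(M):
        dm = dists[nearest == m, m]          # distances to codeword m
        if dm.size == 0:
            continue
        r_max = dm.max()                     # radius of largest hypersphere
        if r_max == 0.0:                     # all descriptors sit on c_m
            H[m, 0] += dm.size
            continue
        # Shell index: since r_e = r_max * (e/E)**(1/d), a descriptor at
        # distance r falls into shell ceil(E * (r / r_max)**d).
        e_idx = np.clip(np.ceil(E * (dm / r_max) ** d).astype(int), 1, E) - 1
        np.add.at(H[m], e_idx, 1.0)
    return H.reshape(-1) / N
```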

To exemplify the pooling strategies of the traditional BoW, BossaNova and the proposed method, Fig. 2 illustrates how the regions related to two codewords are divided. In this example, the coding is done by hard-assignment, in which each \(d\)-dimensional descriptor is associated with just one codeword. In the traditional BoW, as illustrated in Fig. 2(a), each codeword is represented by the number of descriptors assigned to it. For BossaNova and BOH, each codeword is represented by a histogram of descriptors, which are quantized according to their distance to the codeword. While the quantization used by BossaNova is based on a linear function of the distance to the codeword, as shown in Fig. 2(b), the quantization used by BOH is based on the volumes of the hyper-regions obtained by hyperspheres centered at the codewords, as illustrated in Fig. 2(c).

5 Experimental Analysis

In this section, we describe the three datasets used, the classification protocols and the experimental setup. Moreover, a quantitative analysis in terms of classification rates, comparing our method with state-of-the-art approaches, is given.

5.1 Datasets and Protocols

In order to validate the proposed method, we tested our approach on three well-known action recognition datasets: (i) KTH [12]; (ii) UCF Sports [10]; and (iii) UCF 11 [7]. These datasets were chosen due to their distinctive characteristics, such as video duration, intra-class variability and noisy scene elements.

The KTH dataset [12] contains six types of human actions: walking, jogging, running, boxing, hand waving and hand clapping. These actions are performed by 25 different subjects in four scenarios: outdoors, outdoors with scale variation, outdoors with different clothes and indoors. Some examples are illustrated in Fig. 3. There are 600 video clips in total, with a frame size of \(160\times 120\) pixels and different durations. We adopt the same experimental setup as in [12, 15], the so-called split, where the videos are divided into a training set (eight subjects), a validation set (eight subjects) and a test set (nine subjects).

Fig. 3. Example of hand waving performed by the same subject in different scenarios.

The UCF Sports dataset [10] contains ten different types of sports actions: swinging, diving, kicking, weight-lifting, horse-riding, running, skateboarding, swinging at the high bar, golf swinging and walking. The dataset consists of 150 real videos with a large intra-class variability. Each action class is performed in different ways, and the frequencies of the various actions also differ considerably, as can be seen in Fig. 4. Contrary to many works that apply their methods on this dataset, we do not extend the dataset with a flipped version of the videos, in order to prevent the classifier from learning the background instead of the actions. We adopt a split dividing the dataset into 103 training and 47 test samples, as in [5].

Fig. 4. Example of intra-class variability in the UCF Sports dataset. (a) and (b) are examples from the running class, (c) and (d) from swinging, (e) and (f) from kicking, and (g) and (h) from walking.

Fig. 5. Examples of UCF11 challenges, such as object appearance in (a) and (b), viewpoint in (c) and (d), cluttered background in (e) and (f), and illumination conditions in (g) and (h).

The UCF11 dataset [7] contains 11 action categories: biking/cycling, diving, golf swinging, soccer juggling, trampoline jumping, horse riding, basketball shooting, volleyball spiking, swinging, tennis swinging, and walking with a dog. This dataset is challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, and illumination conditions. Some examples are illustrated in Fig. 5. The dataset contains a total of 1646 videos. We adopt the original setup [7], using leave-one-out cross-validation over a pre-defined set of 25 folds.

5.2 Experimental Setup

Regarding the feature descriptor, we have chosen an approach based on a dense descriptor (dense trajectories [15]) because it is simple and achieves good results. After the feature extraction step, BoW, BossaNova and BOH are used to organize the low-level features, representing each video with a mid-level representation. We used the following parameter values for computing the BossaNova: \(\lambda _{min}\) = 0.4, \(\lambda _{max}\) = 2, \(knn = 10\) (semi-soft assignment), B = \(\{2,4\}\) and M = \(\{512,2048\}\) (number of visual codewords). For the proposed pooling strategy BOH, we used the following parameter values: \(knn = 10\) (semi-soft assignment), E = \(\{2,4\}\) and M = \(\{512,2048\}\) (number of visual codewords).

For classification, we used a non-linear SVM with an RBF kernel, a classifier widely used in works on human action classification [15], which enables fair comparisons between different approaches.
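For illustration, a minimal scikit-learn sketch of this classification stage is given below; the synthetic data and the C and gamma values are placeholders, not the settings used in our experiments.

```python
# A sketch of the classification stage: an RBF-kernel SVM over mid-level
# vectors. Synthetic data stands in for real BOH vectors; C and gamma
# are placeholder values.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
train_z = rng.random((60, 512 * 4))    # e.g., M = 512 codewords, E = 4
train_y = rng.integers(0, 6, size=60)  # six action classes, as in KTH
test_z = rng.random((20, 512 * 4))

clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(train_z, train_y)
predictions = clf.predict(test_z)
```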

5.3 Comparison with the State-of-the-art

In order to compare the proposed method to state-of-the-art approaches, we adopted the classification rate (also called recognition rate). In the literature, there is often some confusion between classification rate and accuracy. For the sake of clarity, in this work the classification rate is the number of correctly classified videos divided by the total number of videos. In Table 1, a comparison in terms of classification rate is presented. Except for BossaNova, the rates of the compared methods were obtained from the original papers. As one can note, ours gives competitive rates for KTH and UCF 11, and much better results for UCF Sports. When compared to BossaNova, which uses a similar pooling strategy, our results are better on UCF Sports and UCF 11.

Table 1. The classification rates for the compared approaches.
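For clarity, the measure reported in Table 1 reduces to the simple ratio sketched below (the helper name is hypothetical).

```python
# Classification rate as defined above: the number of correctly
# classified videos divided by the total number of videos.
import numpy as np

def classification_rate(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

print(classification_rate([0, 1, 2, 2], [0, 1, 1, 2]))  # -> 0.75
```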

The performances, in terms of classification rates, of the compared approaches applied to KTH, UCF Sports and UCF 11 are illustrated in Fig. 6. As we can see, the rate for BOH increases as the number of codewords and hyper-regions increases. This behavior does not occur for BossaNova, whose rates increase monotonically neither with the number of codewords nor with the number of hyper-regions. Furthermore, both methods are better than the traditional BoW. In terms of time performance, there is no significant difference between BOH, BossaNova and BoW, since the main time-consuming operation, calculating the distances between feature points and codewords during coding, is the same for all methods.

Fig. 6. A comparison between the proposed method and BossaNova concerning the classification rate according to the number of hyperspheres, from 1 to 5. The classification rate for BoW is also illustrated (when the number of hyperspheres is equal to 1).

6 Conclusions and Further Works

In this work, we addressed the task of human action classification using only the information present in the content of the video. We focused on an intermediate stage between feature extraction and classification, using an extended BoW formalism, the so-called BOH, to generate a new mid-level video representation obtained directly from densely sampled features extracted around trajectories.

The idea is to increase the classification rate by careful use of a well-disseminated motion descriptor. We explored a new strategy for the pooling step based on a volumetric partition of hyperspheres centered at the codewords, in order to maintain the same probability of assignment to each hyper-region. The results indicate that this kind of pooling can decrease the quantization error of the descriptors.

Regarding classification protocols, we employed a training/testing split for KTH and UCF Sports, and leave-one-group-out cross-validation for UCF 11. Experimental results demonstrated that our strategy improves the recognition rates with respect to BossaNova, except for KTH.

For further endeavors, we will study different ways of encoding quantization errors into video descriptors. Another interesting research path is to investigate the quality of the video data used during training (and filter it out beforehand) for the classification step, and its relationship with the support vectors needed to produce better accuracy in human action classification.