1 Introduction

Human action recognition is one of the most important research areas in computer vision due to its usefulness in real-world applications such as video surveillance, human-computer interaction, and video archival systems. However, action recognition remains a difficult problem when dealing with unconstrained videos such as web videos, movies and TV shows, and surveillance videos. There is a wide range of issues, from object-based variations such as appearance, view pose, and occlusions, to more complicated scene-related variations such as illumination changes, shadows, and camera motions [1].

While there has been significant progress in solving these problems, the issue of video quality [1, 2] has received much less research attention. The recognition of human actions from low-quality video is highly challenging as valuable visual information is compromised by various internal and external factors such as low resolution, low sampling rate, compression artifacts, motion blur, and camera jitter and shake. Figure 1 shows a few sample frames from videos whose quality has been severely compromised. Many surveillance systems require further video analysis to be performed on compactly stored video data [3], while mobile devices strive to incorporate high-level semantics into real-time streaming [4]. For these reasons, action recognition in low-quality videos should be further investigated, as it offers new insights and challenges to the research community.

Fig. 1

Sample videos from low-quality subsets of the KTH, UCF-YouTube, and HMDB datasets. The top row shows samples from KTH, the middle row samples from UCF-YouTube, and the bottom row samples from HMDB. The videos are severely compromised by many quality-related factors such as low resolution, lossy compression, and camera motion blur. As such, characterizing actions from these videos is much more challenging.

Shape and motion features have recently become popular for their great success in action recognition [5–8]. Existing methods that utilize these features mainly consist of two steps: feature detection and feature description. In feature detection, salient points are detected from a video; in feature description, the visual pattern surrounding each detected point (often called a “patch”) is described. The quality of detected interest points is highly dependent on the quality of the video, as important points may be missed when video quality is poor. Also, shape and motion descriptors such as HOG [9], HOF [5, 6], and MBH [7, 8] become less discriminative when the quality of video deteriorates; noisy image pixels can cause gradient and orientation information to be less consistent across action samples of the same class. Spatio-temporal dynamic textures, particularly local binary patterns extended to three orthogonal planes (LBP-TOP) [10], have also been proposed for action recognition [11–13], but they are not as popular or as widely used as shape and motion features. These methods find statistical regularities to describe visual patterns that lie within video frames. Some evaluations [14] observed that textural features, on their own, do not perform consistently enough in videos with complex scenes. This is an expected outcome since the statistical aggregation of patterns is devoid of any spatial or temporal localization. Very few works in the literature have specifically addressed the issue of recognizing actions in low-quality videos; some do so in the form of experimental extensions [15], others to tackle a specific problem such as frame rate reduction [16]. This is a testament to the potential interest in this issue, but there is presently no systematic investigation into the various video-quality-related issues.

In this paper, we investigate the problem of recognizing human actions from low-quality videos and examine how textural features can be used alongside classical shape and motion features to improve recognition performance under these circumstances. We propose a joint feature utilization framework in which local shape-motion descriptors obtained from contemporary feature detectors are supplemented by spatio-temporal extensions of global texture descriptors. To support this investigation, we perform an extensive evaluation on various low-quality versions or quality-oriented subsets of benchmark action recognition datasets: KTH, UCF-YouTube, and HMDB.

The rest of the paper is organized as follows. In Section 2, we delve into some related works in literature while Section 3 introduces our proposed framework and provides a description of the methods employed in our work. We then report and analyze the experimental results in Section 4. Finally, Section 5 concludes the paper and provides future directions.

2 Related works

Vision-based action recognition is a well-studied problem, and many methods [1, 5–8, 17–19] have been proposed in recent years. There are a number of recent survey papers that offer a good overview of related works from a broad, generic scope [20–22] and from selected perspectives [23, 24]. Here, we concisely describe related methods from the aspect of their feature selection and representation methods.

2.1 Shape and motion features

Local shape and motion are the most popular features [2, 5–8] for action recognition. A variety of methods that extract shape and motion features have been proposed in the literature [5–8, 17, 18]. Inspired by the capability of space-time interest points (STIPs) [25] to capture local variations, Schüldt et al. [26] used them to extract spatio-temporal local features from action videos. They calculated a multi-scale derivative at the center of every local interest point to encode motion information. Their method demonstrated that local motion features are comparatively better than global features. Laptev et al. [27] further improved it by introducing two essential types of local features: shape, in the form of histogram of oriented gradients (HOG), and motion, in the form of histogram of optical flow (HOF). Klaser et al. [28] extended the HOG shape descriptor to three dimensions, i.e., histogram of 3D gradient orientations (HOG3D), which quantizes 3D gradient orientations on regular polyhedrons. However, the detection of interest points in [25] depends solely on the spatial quality and temporal fidelity of the video, and thus may be affected if the video quality deteriorates. Also, these mere spatio-temporal extensions do not consider the temporal relationship between interest points of subsequent frames. Dollár et al. [17] directly considered the temporal domain in the selection process by proposing a Cuboid detector, which selects features surrounding spatio-temporal interest points detected by temporal Gabor filters. The detected cuboids in the video volume are then described by a Cuboid descriptor. Willems et al. [18] used a spatio-temporal Hessian matrix to detect interest points and the extended SURF (ESURF) descriptor to describe features around the detected points. Generally, these methods appear to work well with videos captured in a relatively controlled environment [6], such as the KTH [26] action videos.

For videos captured in a relatively complex environment such as the HMDB [29] dataset, interest-point-based methods such as STIP [25] may sometimes fail to detect important points due to motion clutter from background scenes. To overcome this problem, Wang et al. [6] proposed dense sampling of spatio-temporal video blocks at regular scales and positions and represented them using popular features such as HOG, HOF, and HOG3D. However, the dense sampling strategy is computationally expensive and has a large memory footprint. The authors later proposed Dense Trajectories [7], where densely sampled points are tracked based on the optical flow field. However, feature tracking from dense optical flow fields inadvertently includes camera motion, which may yield less discriminative feature sets. The authors improved their trajectories by removing irrelevant background motion using warped flow [8]. Features are then constructed from the trajectories using HOG, HOF, and a robust descriptor called motion boundary histogram (MBH).

2.2 Textural features

The use of textural features is less common in action recognition; a number of notable works are worth mentioning [11–13, 19, 30] but their reported performances were. Kellokumpu et al. [11] first proposed the use of local spatio-temporal texture features for action recognition. They used local binary patterns on three orthogonal planes (LBP-TOP) [10] to represent an action video in the form of dynamic textures. Their proposed method is capable of capturing the statistical distribution of local neighborhood variations, but the holistic nature of extracting these features means that they may easily be affected by unnecessary background variations and occlusions. Mattivi and Shao [12] proposed to use part-based representations such as interest points to overcome background- and occlusion-related problems. They employed Dollár’s feature detector [17] to extract cuboids from video; each cuboid is subsequently described by the extended LBP-TOP descriptor (an extension of LBP-TOP to nine slices, three for each plane). They also demonstrated that using LBP on gradient images can obtain better performance than using LBP on raw image values, but at the expense of more computation.

Besides directly applying LBP on image frames, alternative strategies in the literature have used it to extract textures from other forms of images. Kellokumpu et al. [19] used local binary patterns (LBP) to describe motion history and motion energy images, which encode shape and motion information respectively. Ahsan et al. [13] used LBP features to describe mixed block-based directional MHI (DMHI) templates [31].

LBP-based methods are sensitive to noise and illumination changes, and they also lack explicit motion encoding. Addressing these issues, Yeffet and Wolf proposed local trinary patterns (LTP) [32], which combine local binary patterns with the appearance and adaptability invariance of patch similarity matching approaches. Their method encodes local motion information by taking into account the self-similarity in three neighborhood circles at a particular spatial position. The LTP produces a notably large feature vector, whose size depends on the number of grid blocks and time slices chosen for a video. More recently, Kataoka et al. [14] thoroughly investigated the performance of various types of features (motion, texture, etc.), including texture-based features such as LBP and LTP, on a dense trajectory framework. In their evaluation, they observed that textural features alone do not perform as well as shape or motion features. However, it remains inconclusive as to whether they are more useful under low-quality conditions.

In another recent work, Baumann et al. [30] proposed the motion binary pattern (MBP), which encodes motion by calculating pixel variations in three consecutive image frames. In order to capture slow and fast motion, they use different time step sizes. Inspired by the volume local binary pattern (VLBP) [10] and optical flow, their method produces a very lengthy feature vector and relies heavily on several free parameters that are crucial to its success. So far, they have tested only on the KTH, Weizmann, and IXMAS datasets, which are all relatively small compared to contemporary datasets.

2.3 Action recognition in low-quality videos

There are only a few works in the literature that specifically address the impact of low video quality on action recognition performance; all are limited to certain factors influencing video quality or were conducted in a rather ad hoc manner [15, 16, 33–35].

Chen and Aggarwal [33] proposed to use supervised PCA-projected time series of histogram of oriented gradients (HOG) and histogram of oriented optical flow (HOOF) descriptors to encode pose and motion information of action videos captured from a far distance. Their work focused on recognizing human actions from a far field of view, where the size of humans is typically not more than 40 pixels. A trained classifier is applied to localize the action in each image frame, and the features are then computed from these localized coordinates. However, their work only considers the problem of spatial resolution and does not address other related issues such as camera motion and video compression. The authors also experimented with downsampled frames (with persons as small as 15 pixels tall) and found that the performance deteriorates greatly.

In another work, Reddy et al. [15] conducted a few sensitivity tests involving varying frame rate, resolution (scale), and translation, to test their effect on action recognition performance. Due to the tedious nature of such experimentation, the authors only conducted these tests on the small UT-Tower dataset [33] and not on other larger databases. Moreover, the test cases to study scale changes were designed in a rather ad hoc manner. More recently, Harjanto et al. [16] investigated the effects of different frame rates with four popular action recognition methods. A key frame selector was used to select important frames in a video. Their evaluation suggests that by selecting a significant number of important frames, it is still possible to obtain a decent level of recognition accuracy. However, the proposed key-frame selection strategy is solely based on interest points and may not work well if the spatial resolution of the video becomes poor. On the other hand, the work by Ahad et al. [36] focuses solely on the problem of low resolution in activity recognition, but not on low frame rate.

Aside from feature design, a few works [1, 2, 35] focus primarily on the formation of feasible frameworks. They leverage existing features while crafting the recognition pipeline in a way that enhances its performance under low-quality conditions. Our preliminary efforts [1, 2] at exploring the problem of recognizing actions in low-quality video resulted in the establishment of a spatial and temporal downsampling protocol, which provides a systematic procedure for investigating the robustness of methods against decreasing resolution and frame rate. Inspired by the recent breakthrough in deep learning, a recent work [35] incorporated frame-level object features from an ImageNet-trained deep convolutional neural network (CNN) as part of the recognition pipeline, achieving promising results.

Another work by Gao et al. [34] followed the low-resolution downsampling protocol mooted in [2] but used Dempster-Shafer theory (DS theory) to model activities. Using the KTH and Weizmann datasets, they showed that using DS theory to combine estimated basic belief assignments at each frame can help achieve better performance than popular encoding techniques such as bag-of-visual-words (BoVW) and key-pose modeling. However, they stop short of examining the impact of decreasing video frame rate; their scheme may generalize poorly in such cases since it models the consecutive changes in video.

3 Methods

In this section, we first introduce our proposed action recognition framework. Then, we discuss the different components involved, such as feature detectors and descriptors.

3.1 Proposed action recognition framework

We propose an action recognition framework based on shape, motion, and texture features, as illustrated in Fig. 2. The main idea revolves around the utilization of textural information together with conventional shape and motion features to improve the recognition of human actions in low-quality videos. In this framework, every input video goes through two distinct extraction steps. In the first step, space-time shape/motion features (derived from interest points or dense trajectories) are extracted by their respective descriptors (i.e., HOG, HOF, MBH). The shape and motion features are then encoded by the bag-of-visual-words (BoVW) method [37] to obtain the local features. In the second extraction step, spatio-temporal textural features (based on BSIF, LBP, LPQ) are obtained by means of the three orthogonal planes (TOP) extension, forming the global features. Finally, both feature vectors are concatenated and a support vector machine (SVM) classifier is used for classification.

Fig. 2

Proposed joint feature utilization framework for human action recognition

3.2 Shape and motion features

Shape and motion are an expressive abstraction of visual patterns, in space and time respectively. They are critical cues for action recognition, as they are sufficiently invariant to represent commonalities of different instances of a particular action type, while preserving sufficient details in order to differentiate them from different types. To provide a comprehensive coverage of state-of-the-art feature detectors, we employ two different methods that have been widely used in literature: space-time interest point (STIP) [5] and space-time trajectories, in the form of improved dense trajectories (iDT) [8]. For description, we used the HOG and HOF descriptors in concert [6] for the STIP, and the motion boundary histogram (MBH) for the iDT. These are the most effective descriptors for each detector, as reported in their original works. A brief description of these detectors and descriptors is given as follows:

Space-time interest point: Given an action video, local space-time interest points (STIP) are detected around locations of large variations of image values, which correspond to motions. Interest points are detected using the Harris3D detector proposed in [5], which is an extension of the popular Harris detector used in the image domain [38]. It can detect a decent number of corner points in the space-time domain and is perhaps one of the most widely used feature detectors for action recognition.

To characterize the shape and motion information accumulated in the space-time neighborhoods of the detected STIPs, we applied the histogram of oriented gradients (HOG) and histogram of optical flow (HOF) feature descriptors as proposed in [26]. The HOG/HOF descriptors are computed over space-time volumes of size \(\Delta_{x}(\sigma)=\Delta_{y}(\sigma)=18\sigma\), \(\Delta_{t}(\tau)=8\tau\), where σ and τ are the spatial and temporal scales. Each volume is subdivided into an \(n_{x}\times n_{y}\times n_{t}\) grid of cells; for each cell, 4-bin histograms of gradient orientations (HOG) and 5-bin histograms of optical flow (HOF) are computed. We use the original implementation from [5] and standard parameter settings from [6], i.e., \(k=0.0005\), \(\sigma^{2}=\{4,8,16,32,64,128\}\), \(\tau^{2}=\{2,4\}\), \(n_{x}=n_{y}=3\), and \(n_{t}=2\).
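As a quick check of the numbers above, the following sketch computes the volume size for a given pair of squared scales and the descriptor lengths implied by the cell grid. It assumes that each descriptor is simply the concatenation of one histogram per cell; the function names are illustrative and not taken from the original implementation.

```python
import math

def volume_size(sigma2, tau2):
    """Space-time volume (dx, dy, dt) for squared spatial/temporal scales."""
    sigma, tau = math.sqrt(sigma2), math.sqrt(tau2)
    return 18 * sigma, 18 * sigma, 8 * tau

def descriptor_length(n_x, n_y, n_t, bins):
    """One histogram with `bins` bins per cell of an n_x x n_y x n_t grid."""
    return n_x * n_y * n_t * bins

# Settings quoted in the text: 3 x 3 x 2 grid, 4-bin HOG, 5-bin HOF.
HOG_LEN = descriptor_length(3, 3, 2, bins=4)
HOF_LEN = descriptor_length(3, 3, 2, bins=5)

print(HOG_LEN, HOF_LEN, HOG_LEN + HOF_LEN)  # 72 90 162
print(volume_size(4, 4))                    # (36.0, 36.0, 16.0)
```

Under this assumption, a HOG/HOF pair yields 72 + 90 = 162 values per interest point before encoding.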

Space-time trajectories: Motion information of a video is captured in a dense manner by sampling interest points at a uniform interval and tracking them over a fixed number of frames. To detect space-time trajectories, we used improved dense trajectories (iDT) [8], an extension of the original dense trajectories [7]. A set of points is densely sampled on a grid at eight different spatial scales with a step size of 5 pixels. Points from homogeneous areas are removed by thresholding small eigenvalues of their respective auto-correlation matrices. Tracking of these sampled points is then performed by applying median filtering to the dense optical flow field computed from Farnebäck’s algorithm [39]. Also, static trajectories that lack motion and trajectories with large displacements due to incorrect optical flow estimation are removed.
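The tracking step can be illustrated with a minimal sketch. It assumes the dense optical flow fields have already been computed and are given as arrays of shape (H, W, 2) holding (dx, dy) per pixel; the 3×3 median window and the function names are assumptions made for illustration, not details taken from [7, 8].

```python
import numpy as np

def median_flow_at(flow, x, y, radius=1):
    """Median (dx, dy) of the flow in a small window around (x, y)."""
    h, w = flow.shape[:2]
    xi, yi = int(round(x)), int(round(y))
    x0, x1 = max(xi - radius, 0), min(xi + radius + 1, w)
    y0, y1 = max(yi - radius, 0), min(yi + radius + 1, h)
    window = flow[y0:y1, x0:x1].reshape(-1, 2)
    return np.median(window, axis=0)

def track_point(flows, x, y, length=15):
    """Follow one sampled point through at most `length` flow fields."""
    trajectory = [(float(x), float(y), 0)]
    for t, flow in enumerate(flows[:length]):
        h, w = flow.shape[:2]
        dx, dy = median_flow_at(flow, x, y)
        x, y = x + dx, y + dy
        if not (0 <= x < w and 0 <= y < h):
            break  # the point left the frame; stop tracking
        trajectory.append((float(x), float(y), t + 1))
    return trajectory

# Toy example: a uniform flow of one pixel per frame to the right.
flows = [np.tile([1.0, 0.0], (10, 40, 1)) for _ in range(15)]
traj = track_point(flows, 5.0, 5.0)
print(len(traj), traj[-1])
```

Pruning of static or erratic trajectories would be applied afterwards; it is omitted here because the thresholds are not stated in the text.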

In contrast to dense trajectories, iDT is capable of boosting recognition performance by considering camera motions in action videos. It characterizes background motion between two consecutive frames by a homography matrix, which can be calculated by finding similarities between the two frames using SURF [40] and optical-flow-based feature matching. After finding feature similarities, the RANSAC algorithm [41] is applied to calculate the homography matrix. Based on that, camera motion is removed from the video frames before re-computing the optical flow. This method, known as warped flow, results in better descriptor formation with motion estimation that is free from camera motion.

For computational tractability, we use iDT on a single scale in our experiments. We observed that tracking points on multiple spatial scales is computationally very expensive, so we only track points at the original spatial scale and extract features around their trajectories. Despite that, using a single scale still offers a decent recognition rate (reportedly 2–3% less than multi-scale in [42]). In brief, given an action video V, we obtain N trajectories:

$$ \mathcal{T}(V)=\{T_{1},T_{2},T_{3}, \ldots,T_{N}\} $$

where \(T_{n}\) is the n-th trajectory at the original spatial scale,

$$ T_{n}\,=\,\left\{ \left(x_{1}^{n},y_{1}^{n},t_{1}^{n}\right),\left(x_{2}^{n},y_{2}^{n},t_{2}^{n}\right), \left(x_{3}^{n},y_{3}^{n},t_{3}^{n}\right), \ldots, \left(x_{L}^{n},y_{L}^{n},t_{L}^{n}\right) \right\} $$

where L is the number of points (x, y, t) on the trajectory.

In our work, we only consider the motion boundary histogram (MBH) to describe features from the detected trajectories. Unlike the HOF descriptor, MBH uses the optical flow field \(I_{w}=(I_{x},I_{y})\) but computes the spatial derivatives separately for its horizontal (MBHx) and vertical (MBHy) components. These are then used to obtain 8-bin histograms for each component. MBH is also robust to camera and background motions and has reported superior results compared to HOG and HOF [8]. In detail, the MBHx/MBHy descriptors are computed over space-time volumes of size \(N\times N\times L\), where N is the spatial size of the volume in pixels and L is the length of the trajectory. Each volume is subdivided into an \(n_{x}\times n_{y}\times n_{t}\) grid of cells; for each cell, 8-bin motion boundary histograms in each direction are then computed. We use the original implementation from [8] and follow standard parameter settings, i.e., \(L=15\), \(W=5\), \(N=32\), \(n_{x}=n_{y}=2\), and \(n_{t}=3\).
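To make the idea behind MBH concrete, the sketch below computes 8-bin orientation histograms of the spatial derivatives of each flow component for a single cell. It is a simplified illustration under assumed choices (magnitude-weighted hard binning, L2 normalization, no cell grid), not the implementation from [8].

```python
import numpy as np

def orientation_histogram(gx, gy, bins=8):
    """Magnitude-weighted histogram of gradient orientations over [0, 2*pi)."""
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    idx = np.floor(angle / (2 * np.pi) * bins).astype(int) % bins
    hist = np.bincount(idx.ravel(), weights=magnitude.ravel(), minlength=bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def mbh(flow, bins=8):
    """Return (MBHx, MBHy) for a flow field of shape (H, W, 2)."""
    hists = []
    for c in range(2):  # 0: horizontal flow component, 1: vertical
        gy, gx = np.gradient(flow[..., c])  # derivatives along rows, columns
        hists.append(orientation_histogram(gx, gy, bins))
    return hists[0], hists[1]

# A constant flow (e.g., pure camera translation) has zero spatial derivatives,
# so both histograms are empty.
constant = np.ones((8, 8, 2))
mbhx_const, mbhy_const = mbh(constant)

# A horizontal flow that increases linearly from left to right.
ramp = np.zeros((8, 8, 2))
ramp[..., 0] = np.arange(8)[None, :]
mbhx_ramp, mbhy_ramp = mbh(ramp)
```

The constant-flow case shows why taking derivatives of the flow suppresses uniform camera motion: only changes in the flow field, such as motion boundaries, contribute to the histograms.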

3.3 Textural features

We evaluate three types of textural features in our experiments: local binary pattern (LBP), local phase quantization (LPQ), and binarized statistical image features (BSIF). We briefly describe these techniques, followed by how they can be extended for the spatio-temporal case by three orthogonal planes (TOP).

LBP features: Local binary pattern (LBP) [43] uses binary patterns calculated over a region to describe the textural properties of an image. The LBP operator describes each image pixel based on the relative gray levels of its neighborhood pixels. If the gray level of a neighboring pixel is higher than or equal to that of the center pixel, the value is set to one, otherwise to zero. The binary pattern is described by considering these binary numbers over the neighborhood as:

$$ {LBP}_{P,R}(x, y)=\sum_{i=0}^{P-1}s(n_{i}-n_{c})2^{i}, \; s(x)=\left\{\begin{array}{ll} 1 & x\geq 0 \\ 0 & \text{otherwise} \end{array}\right. $$

where \(n_{c}\) corresponds to the gray level of the center pixel of a local neighborhood, and \(n_{i}\) to the gray levels of P equally spaced pixels on a circle of radius R. The \({LBP}_{P,R}\) operator produces \(2^{P}\) possible output values, corresponding to the number of binary patterns that can be formed by the P neighborhood pixels. The feature histogram is produced by considering the frequency distribution of the LBP values.
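A minimal sketch of the basic LBP operator follows. For simplicity it assumes P = 8 neighbors taken from the 3×3 square ring around each pixel rather than interpolated samples on a circle of radius R, and it skips the image border; both are simplifying assumptions.

```python
import numpy as np

# (dy, dx) offsets of the 8 neighbours, ordered around the centre pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(img):
    """8-bit LBP code for every interior pixel of a 2D grey-level image."""
    img = np.asarray(img, dtype=np.int64)
    h, w = img.shape
    center = img[1:h - 1, 1:w - 1]
    codes = np.zeros_like(center)
    for i, (dy, dx) in enumerate(OFFSETS):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Bit i is set when the neighbour is >= the centre pixel.
        codes += (neighbour >= center).astype(np.int64) << i
    return codes

def lbp_histogram(img):
    """Frequency of each of the 2^8 = 256 possible codes."""
    return np.bincount(lbp_codes(img).ravel(), minlength=256)

flat = np.full((5, 5), 7)        # every neighbour equals the centre
peak = np.zeros((3, 3)); peak[1, 1] = 9   # centre brighter than all neighbours
print(lbp_codes(flat)[0, 0], lbp_codes(peak)[0, 0])  # 255 0
```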

LPQ features: The local phase quantization (LPQ) [44] operator uses local phase information to produce blur-invariant image features, extracted by computing the short-term Fourier transform (STFT) in rectangular neighborhoods \(N_{x}\):

$$ F(u,x)=\sum_{y\in N_{x}}f(x-y)e^{-j2\pi u^{T}y} = \mathbf{W}^{T}_{u} \mathbf{f}_{x} $$

where \(\mathbf{W}_{u}\) is the basis vector of the discrete Fourier transform (DFT) at the frequency u and \(\mathbf{f}_{x}\) is a vector that contains all image samples from \(N_{x}\). Four complex coefficients corresponding to 2D frequencies are considered for forming LPQ features: \(u_{1}=[a,0]^{T}\), \(u_{2}=[0,a]^{T}\), \(u_{3}=[a,a]^{T}\), and \(u_{4}=[a,-a]^{T}\), where a is a scalar. To express the phase information, a binary coefficient b is formed from the sign of the real and imaginary parts of each of these Fourier coefficients. An image is then produced by representing the eight binary values (obtained from the binary coefficients) as an integer value between 0 and 255. Finally, the LPQ feature histogram is constructed from these output values.
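The following sketch evaluates the four STFT coefficients directly from their definition and packs the signs of their real and imaginary parts into an 8-bit code. It assumes a square window, a = 1/(window size), a threshold at zero, and no decorrelation step; these choices and the function name are assumptions made for illustration.

```python
import numpy as np

def lpq_codes(img, win=5):
    """8-bit LPQ-style codes for the interior pixels of a 2D image."""
    img = np.asarray(img, dtype=np.float64)
    r = win // 2
    a = 1.0 / win
    freqs = [(a, 0.0), (0.0, a), (a, a), (a, -a)]  # u1..u4 as (horizontal, vertical)
    h, w = img.shape
    out_h, out_w = h - 2 * r, w - 2 * r
    coeffs = []
    for u1, u2 in freqs:
        coeff = np.zeros((out_h, out_w), dtype=np.complex128)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                # f(x - y) for the offset y = (dx, dy), over all interior pixels x
                shifted = img[r - dy:r - dy + out_h, r - dx:r - dx + out_w]
                coeff += shifted * np.exp(-2j * np.pi * (u1 * dx + u2 * dy))
        coeffs.append(coeff)
    parts = [c.real for c in coeffs] + [c.imag for c in coeffs]
    codes = np.zeros((out_h, out_w), dtype=np.int64)
    for bit, part in enumerate(parts):
        codes += (part >= 0).astype(np.int64) << bit
    return codes

rng = np.random.default_rng(0)
image = rng.random((12, 12))
codes = lpq_codes(image)
hist = np.bincount(codes.ravel(), minlength=256)
```

The direct double loop is chosen for readability; a practical implementation would typically use separable 1D convolutions instead.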

BSIF features: Binarized statistical image features (BSIF) [45] is a more contemporary method that efficiently encodes texture information, in a similar vein to the aforementioned methods that produce binary codes. Given an image X of size p×p, BSIF applies a linear filter \(F_{i}\), learnt from natural images by independent component analysis (ICA) [46], to the pixel values of X, obtaining the filter response,

$$ r_{i} = \sum_{u,v} F_{i}(u,v)X(u,v) = \mathbf{f}_{i}^{T}\mathbf{x} $$

where \(\mathbf{f}_{i}\) and \(\mathbf{x}\) are the vectorized forms of \(F_{i}\) and X, respectively. The binarized feature \(b_{i}\) is then obtained by thresholding \(r_{i}\) at zero, i.e., \(b_{i}=1\) if \(r_{i}>0\) and \(b_{i}=0\) otherwise. The decomposition of the filter mask \(F_{i}\) allows the independent components or basis vectors to be learnt by ICA. Succinctly, we can learn n linear filters \(W_{i}\) of size l×l, stacked into a matrix \(\mathbf{W}\) such that all responses can be efficiently computed by \(\mathbf{s}=\mathbf{W}\mathbf{x}\), where \(\mathbf{s}\) is a vector containing the responses \(r_{i}\). Thus, an n-bit binary code is produced for each pixel, and these codes build the feature histogram for the image.
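A minimal sketch of the BSIF encoding step is given below. The real filters are learnt from natural images by ICA; here the filter bank is simply passed in as an array, and the toy filters in the example are placeholders chosen so that the result is easy to verify by hand.

```python
import numpy as np

def bsif_codes(img, filters):
    """n-bit codes from a bank of n filters of size l x l (shape (n, l, l))."""
    img = np.asarray(img, dtype=np.float64)
    n, l, _ = filters.shape
    h, w = img.shape
    out_h, out_w = h - l + 1, w - l + 1
    codes = np.zeros((out_h, out_w), dtype=np.int64)
    for i in range(n):
        response = np.zeros((out_h, out_w))
        for u in range(l):
            for v in range(l):
                # Accumulate F_i(u, v) * X(u, v) for every l x l patch at once.
                response += filters[i, u, v] * img[u:u + out_h, v:v + out_w]
        # Bit i is the response thresholded at zero.
        codes += (response > 0).astype(np.int64) << i
    return codes

# Placeholder filters (NOT ICA-learnt): one all-positive, one all-negative.
toy_filters = np.stack([np.ones((3, 3)), -np.ones((3, 3))])
toy_image = np.ones((5, 5))
toy_codes = bsif_codes(toy_image, toy_filters)   # bit 0 set, bit 1 clear
toy_hist = np.bincount(toy_codes.ravel(), minlength=2 ** len(toy_filters))
```

With n filters the histogram has \(2^{n}\) bins, so the setting n = 12 corresponds to 4096 bins per plane.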

Spatio-temporal extension of textural features: Motivated by the success of recent works related to the recognition of dynamic sequences [10, 12], we consider the three orthogonal planes (TOP) approach to extend the 2D textural operators to videos. Given a video (XYT), the TOP approach extracts the texture descriptors along the XY, XT, and YT orthogonal planes, where the XY plane encodes structural information while the XT and YT planes encode space-time transitional information. The histograms of all three planes are concatenated to form the final feature histogram. Generally speaking, the textural histogram of a volumetric space of size X×Y×T can be defined as:

$$ h_{j}^{plane}=\sum_{p\in P}\mathcal{I}\{b(p)=j\} $$

where \(j\in\{1,\ldots,2^{n}\}\), p is a pixel at location (x, y, t) on a particular plane, b is the binarized code, and \(\mathcal {I}\{.\}\) is a function indicating 1 if true and 0 otherwise. The histogram bins of each plane are then normalized to get a coherent description, \(\boldsymbol {\tilde {h}}^{plane}=\left \{\tilde {h}_{1}^{plane},\ldots, \tilde {h}_{2^{n}}^{plane}\right \}\). Finally, we concatenate the histograms of all three planes,

$$ \boldsymbol{H}=\left\{\boldsymbol{\tilde{h}}^{XY},\boldsymbol{\tilde{h}}^{XT},\boldsymbol{\tilde{h}}^{YT}\right\} $$

In this work, we have set the neighborhood and radius parameters for non-uniform pattern LBP-TOP as \(P_{XY}=P_{XT}=P_{YT}=8\) and \(R_{X}=R_{Y}=R_{T}=2\), respectively, following the specifications in [12]. Meanwhile, the neighborhood parameters for LPQ-TOP are set to \(W_{x}=W_{y}=W_{t}=5\), as specified in [47]. For BSIF-TOP, the filter size \(l=9\) and representative bit size \(n=12\) were empirically determined and applied to all three planes.
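The TOP extension can be sketched generically: any 2D operator that maps an image to integer codes is applied to every XY, XT, and YT slice of the video volume, and the three normalized histograms are concatenated. The volume layout (t, y, x), the sum-to-one normalization, and the toy two-bin operator in the example are assumptions made for illustration.

```python
import numpy as np

def top_histogram(volume, code_fn, n_bins):
    """Concatenate normalized code histograms from the XY, XT and YT planes.

    volume  : array indexed as (t, y, x)
    code_fn : maps a 2D array to integer codes in [0, n_bins)
    """
    volume = np.asarray(volume)
    T, Y, X = volume.shape
    planes = {
        "XY": [volume[t, :, :] for t in range(T)],  # appearance
        "XT": [volume[:, y, :] for y in range(Y)],  # horizontal motion
        "YT": [volume[:, :, x] for x in range(X)],  # vertical motion
    }
    parts = []
    for name in ("XY", "XT", "YT"):
        hist = np.zeros(n_bins)
        for plane in planes[name]:
            hist += np.bincount(code_fn(plane).ravel(), minlength=n_bins)
        total = hist.sum()
        parts.append(hist / total if total > 0 else hist)
    return np.concatenate(parts)

def toy_code(plane):
    """Hypothetical 1-bit operator: is each pixel >= its left neighbour?"""
    return (plane[:, 1:] >= plane[:, :-1]).astype(np.int64)

rng = np.random.default_rng(0)
video = rng.random((4, 5, 6))
H = top_histogram(video, toy_code, n_bins=2)
```

In practice `code_fn` would be an LBP, LPQ, or BSIF operator, giving a final histogram three times the length of the single-plane histogram.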

4 Experimental setup

In this section, we discuss the datasets used and their respective evaluation protocols, as well as details on the implemented experimental pipeline.

4.1 Datasets and their evaluation protocols

In order to demonstrate the effectiveness of our proposed methods, we conduct a series of extensive experiments on low-quality versions or subsets of three popular benchmark action recognition datasets: KTH [26], UCF-YouTube [48], and HMDB [29].

The KTH action dataset [26] is one of the most widely used datasets for action recognition. It consists of videos captured in a rather controlled environment, containing 6 action classes performed by 25 actors in 4 different scenarios. There are 599 video samples in total (one subject has one less clip), and each clip is sampled at 25 fps at a frame resolution of 160×120 pixels. We follow the original experimental setup specified in [26], reporting the average accuracy over all classes. Similar to the protocol established in our previous work [1, 2], six downsampled versions of the KTH were created: three for spatial downsampling (\(SD_{\alpha}\)) and three for temporal downsampling (\(TD_{\beta}\)). We limit our experiments to downsampling factors \(\alpha,\beta\in\{2,3,4\}\), which denote spatially or temporally downsampled versions at a half, a third, and a fourth of the original resolution or frame rate.
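The two downsampling modes can be sketched as follows for a grayscale video stored as an array of shape (frames, height, width). Temporal downsampling keeps every β-th frame; for spatial downsampling the resampling filter is not specified in the text, so plain block averaging is used here as an assumption.

```python
import numpy as np

def temporal_downsample(video, beta):
    """Keep every beta-th frame (frame rate reduced by a factor of beta)."""
    return video[::beta]

def spatial_downsample(video, alpha):
    """Reduce each frame dimension by a factor of alpha using block averaging."""
    t, h, w = video.shape
    h2, w2 = h // alpha, w // alpha
    cropped = video[:, :h2 * alpha, :w2 * alpha].astype(np.float64)
    blocks = cropped.reshape(t, h2, alpha, w2, alpha)
    return blocks.mean(axis=(2, 4))

# One second of KTH-sized video: 25 frames of 160 x 120 pixels.
clip = np.zeros((25, 120, 160))
for factor in (2, 3, 4):
    print(factor,
          spatial_downsample(clip, factor).shape,
          temporal_downsample(clip, factor).shape)
```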

Videos that undergo spatial downsampling lose many important spatial details, which may cause interest point detectors to fail. To cope with this issue, we increase the sensitivity to changes of gradient in order to detect a decent number of interest points from each video. Specifically, we use different k parameters for the different downsampled modes, i.e., k = 0.0001, 0.000075, and 0.00005 for \(SD_{2}\), \(SD_{3}\), and \(SD_{4}\), respectively. Since there is no change in frame resolution in the case of temporal downsampling, we keep the value of the k parameter unchanged. For the estimation of feature trajectories, we also use different values for the neighborhood size N and trajectory length L, i.e., N = 11, 8, and 4 for \(SD_{2}\), \(SD_{3}\), and \(SD_{4}\) videos, respectively, and L = 15, 8, and 5 for \(TD_{2}\), \(TD_{3}\), and \(TD_{4}\) videos, respectively. We empirically determined the suitability of these values through prior experiments.

The UCF-YouTube [48], also known as ‘UCF-11’, is another popular dataset for action recognition, consisting of videos captured in uncontrolled and complex environments. It contains 11 action classes, and every class has 25 groups with more than 4 action clips in each group. The video clips that belong to the same group share some common features, such as the same actor, similar background, and similar viewpoint. The videos are compromised by various problems such as camera motion, background clutter, and viewpoint and scale variations. There are 1600 video samples in total, and each clip is sampled at ∼30 fps with a frame resolution of 320×240 pixels. We follow the leave-one-group-out cross-validation (LOGOCV) scheme specified in [48], reporting the average accuracy over all groups. Since we are interested in evaluating low-quality videos, we apply lossy compression to each video sample. Specifically, we re-encode all video samples using the x264 video encoder [49], randomly assigning constant rate factors (crf) that are uniformly distributed across all samples. We used crf values between 23 and 50, where higher values indicate greater compression (and smaller file sizes) and vice versa. For clarity, we call this new version YouTube-LQ, with videos now of low quality due to the effects of lossy compression. Some sample videos created with different crf values are shown in Fig. 3. As we can see from the figure, videos with higher crf values have poorer structural information.
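The re-encoding step could be scripted along the following lines. This sketch only builds the command lines and does not run them. It assumes that the encoder is invoked through ffmpeg's libx264 wrapper, that "uniformly distributed" means drawing an integer crf uniformly at random per video, and that outputs are written next to the inputs; the file names and the helper function are hypothetical.

```python
import random

def build_reencode_commands(video_paths, crf_min=23, crf_max=50, seed=0):
    """Return one ffmpeg/libx264 command (as an argument list) per video."""
    rng = random.Random(seed)  # fixed seed so the assignment is repeatable
    commands = []
    for path in video_paths:
        crf = rng.randint(crf_min, crf_max)  # inclusive on both ends
        stem = path.rsplit(".", 1)[0]
        out_path = f"{stem}_crf{crf}.mp4"    # hypothetical naming scheme
        commands.append(
            ["ffmpeg", "-i", path, "-c:v", "libx264", "-crf", str(crf), out_path]
        )
    return commands

# Hypothetical input files, for illustration only.
commands = build_reencode_commands(["clip_a.avi", "clip_b.avi", "clip_c.avi"])
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list could then be passed to `subprocess.run` to perform the actual re-encoding.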

Fig. 3

Sample video frame encoded with different constant rate factors (crf). Left: crf 29, center: crf 38, right: crf 50. Note the deterioration in quality as the video is compressed more heavily.

The HMDB [29] is one of the largest human action recognition datasets and has become increasingly popular in recent years. It has a total of 6766 videos of 51 action classes collected from movies and YouTube videos. HMDB is a considerably challenging dataset, with videos acquired from uncontrolled environments with large viewpoint, scale, background, and illumination variations. Videos in HMDB are annotated with a rich set of meta-labels including quality information: three quality labels were used, i.e., “good,” “medium,” and “bad.” Three training-testing splits were defined for the purpose of evaluation, and performance is to be reported as the average accuracy over all three splits. In our experiments, we use the same specified splits for training, while testing was done using only videos with “bad” and “medium” labels; for clarity, these two sets will hereafter be denoted as HMDB-BQ and HMDB-MQ, respectively. In the medium quality videos, only large body parts are identifiable, while they are totally unidentifiable in the bad quality videos due to the presence of motion blur and compression artifacts. Bad and medium videos comprise 20.8% and 62.1% of the total number of videos in the entire original database, respectively.

Figure 1 shows some sample frames of various actions from the downsampled KTH dataset, the compressed YouTube dataset, and the “bad” quality subset of the HMDB dataset.

4.2 Evaluation framework setup

Our evaluation framework comprises two main steps: feature representation and classification.

For feature representation, spatio-temporal features are first extracted from each action video before being encoded into a “histogram of visual words” using visual codewords generated by the classic bag-of-visual-words (BoVW) method. In all our experiments, we perform histogram-level concatenation of two types of features: encoded HOG and HOF descriptors for the interest point method (denoted by “STIP”) and encoded MBHx and MBHy descriptors for the trajectory-based method (denoted by “iDT”). Histogram-level concatenation is known to be more effective than descriptor-level concatenation [1, 2]. Feature histograms from various dynamic textural features, such as LBP-TOP, LPQ-TOP, and BSIF-TOP, are also extracted from the videos and then concatenated with their associated encoded features. In our experiments, we set the codebook size to 4000, which has been empirically shown to be effective in obtaining good results across numerous datasets [6]. To decrease the computational overhead during codebook generation, we used a subset of 100,000 features randomly selected from all training samples.
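The encoding and histogram-level concatenation described above can be illustrated with a small sketch. It assumes hard nearest-codeword assignment with Euclidean distance and L1 normalization of each histogram (the normalization is an assumption, not stated in the text); the toy descriptors, two-word codebooks, and function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode_bovw(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword; return an L1-normalized histogram."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: squared_distance(d, codebook[k]))
        hist[nearest] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


def concat_histograms(*histograms):
    """Histogram-level concatenation: encode each descriptor type separately, then join."""
    joined = []
    for h in histograms:
        joined.extend(h)
    return joined


# Hypothetical 2-D toy descriptors and 2-word codebooks (the paper uses 4000 codewords).
hog_codebook = [[0.0, 0.0], [1.0, 1.0]]
hof_codebook = [[0.0, 1.0], [1.0, 0.0]]
hog_desc = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
hof_desc = [[0.0, 0.9], [0.1, 1.0]]
texture_hist = [0.5, 0.25, 0.25]  # stands in for an LBP-TOP / LPQ-TOP / BSIF-TOP histogram

video_repr = concat_histograms(
    encode_bovw(hog_desc, hog_codebook),
    encode_bovw(hof_desc, hof_codebook),
    texture_hist,
)
```

Because each descriptor type is quantized against its own codebook before joining, the final vector length is the sum of the individual histogram lengths.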

To perform classification, we use a multi-class non-linear support vector machine (SVM) with a χ²-kernel defined as:

$$ K(H_{i},H_{j})=\exp(-\gamma D(H_{i},H_{j})) $$

where $H_{i}=\{h_{i1},\ldots,h_{iV}\}$ and $H_{j}=\{h_{j1},\ldots,h_{jV}\}$ denote histograms of visual words. D is the χ² distance function defined as:

$$ D(H_{i},H_{j})=\sum_{n=1}^{V}\frac{(h_{in}-h_{jn})^{2}}{h_{in}+h_{jn}} $$

where V is the size of the codebook and γ is the mean of distances between all training samples. We use a computationally efficient approximation of the non-linear kernel by Vedaldi and Zisserman [50], which allows features to undergo a non-linear kernel map expansion before SVM classification. It gives us the flexibility of deciding which features are to be “kernelized.” We fixed the value of the regularization parameter c to 10 and adopted a one-versus-rest strategy for multi-class classification, where the class with the highest score is taken as the predicted class.
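For illustration, the exact χ² distance and kernel, together with the one-versus-rest decision rule, can be written as a short sketch. This is not the approximate kernel map of [50]; it evaluates the kernel directly. The scale `gamma` is left as an argument because conventions for setting it from the mean training distance differ between implementations, and the decision scores in the example are hypothetical.

```python
import math


def chi2_distance(h_i, h_j, eps=1e-12):
    """D(H_i, H_j) = sum_n (h_in - h_jn)^2 / (h_in + h_jn); empty bin pairs contribute 0."""
    total = 0.0
    for a, b in zip(h_i, h_j):
        denom = a + b
        if denom > eps:
            total += (a - b) ** 2 / denom
    return total


def chi2_kernel(h_i, h_j, gamma):
    """K(H_i, H_j) = exp(-gamma * D(H_i, H_j))."""
    return math.exp(-gamma * chi2_distance(h_i, h_j))


def predict_one_vs_rest(scores):
    """Return the class whose binary classifier gives the highest decision score."""
    return max(scores, key=scores.get)


h1 = [0.5, 0.5, 0.0]
h2 = [0.25, 0.25, 0.5]
d = chi2_distance(h1, h2)
k = chi2_kernel(h1, h2, gamma=1.0)
# Hypothetical decision scores from three binary classifiers.
label = predict_one_vs_rest({"walking": -0.3, "jogging": 0.8, "boxing": 0.1})
```

Skipping bin pairs whose sum is zero avoids a division by zero for codewords that occur in neither histogram, which is common with a 4000-word codebook.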

For benchmarking, we regard “STIP” and “iDT” features as our baseline methods and also make comparisons against other competing methods from the literature.

5 Results and analysis

In this section, we present the results of a comprehensive set of experiments conducted on the three datasets, followed by an in-depth analysis into the results and various influencing factors.

5.1 Experiments on downsampled KTH videos

In this section, we present our experimental results on six downsampled versions of the KTH dataset. We choose the KTH dataset to perform this extensive range of experiments as it is lightweight and also a widely used benchmark in this domain. From the results reported in Table 1 (STIP-based methods) and Table 2 (iDT-based methods), methods that exploit additional textural features clearly demonstrate significant improvement over their respective baseline methods. This is consistent across all six downsampled versions of the KTH dataset. Also, methods that use iDT as the base feature outperform their STIP-based counterparts across all downsampled versions. Among the textural features employed, BSIF-TOP appears to be the most promising choice, clearly outperforming the other textural features. More importantly, the contribution of textural features becomes more significant as video quality deteriorates (particularly for cases SD 4 and TD 4). This exemplifies the robustness of textural features against degradation in video quality.

Table 1 Recognition accuracy (%) of various STIP-based approaches in comparison with other approaches on downsampled versions of the KTH dataset
Table 2 Recognition accuracy (%) of various iDT based approaches in comparison with other approaches on downsampled versions of the KTH dataset

Analysis on experiments: From the results of both STIP and iDT features, it is observed that the decrease in performance is most obvious with respect to spatial resolution. Feature detection in image frames is based on the variation of intensities in local image structures. Hence, a drop in spatial resolution may cause failure in detecting essential image structures. Compared to STIP, the performance of iDT features dropped tremendously, especially for SD 3 and SD 4 videos. Figure 4 gives a closer look at the detected features when videos are downsampled spatially and temporally. While the number of detected iDT features appears to be larger than that of STIP features, they are obviously less salient (Fig. 4 shows many trajectories that were sampled from the background regions).
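To make the two kinds of degradation concrete, the following sketch reduces a clip temporally and spatially by plain decimation. It assumes that SD k and TD k denote reduction by a factor of k, in line with the half and one-third examples shown in Fig. 4; the actual resampling filter used to create the downsampled sets is not specified in this section, and the function names and toy clip are hypothetical.

```python
def temporal_downsample(frames, factor):
    """Keep every factor-th frame (factor 2 halves the frame rate)."""
    return frames[::factor]


def spatial_downsample(frame, factor):
    """Keep every factor-th row and column (factor 2 halves each spatial dimension)."""
    return [row[::factor] for row in frame[::factor]]


# Hypothetical toy clip: 12 frames of 6 x 8 pixels, each pixel set to its frame index.
clip = [[[t] * 8 for _ in range(6)] for t in range(12)]
td3 = temporal_downsample(clip, 3)                 # one-third frame rate
sd2 = [spatial_downsample(f, 2) for f in clip]     # half resolution
```

Spatial decimation discards local intensity structure within every frame, which is the information that interest point and trajectory detectors depend on, whereas temporal decimation leaves each remaining frame intact.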

Fig. 4 Response of detectors when videos are downsampled spatially and temporally. Videos in the first row use the Harris3D detector, and videos in the second row use warped-flow-estimation-based feature tracking. The videos in columns 1, 2, and 3 represent the baseline, half resolution of the baseline, and one-third resolution of the baseline, respectively, while columns 4 and 5 represent one-half and one-third frame rate of the baseline, respectively

Spatio-temporal textures circumvent this feature detection step by relying on statistical regularities across the spatio-temporal cube. Regions in an image that have less textural information, such as background areas, contribute little to the overall statistics. However, previous findings [2, 14] have observed that textural features alone do not offer good performance, although they can serve as a strong supplement to other attention-oriented features such as shape and motion. For instance, in the case of STIP-based methods, BSIF-TOP textures help improve the accuracy for both SD 4 and TD 4 videos by ≈ 6%; for iDT-based methods, the improvement is ≈ 21% for SD 4 and ≈ 0.5% for TD 4 videos.
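The idea of gathering statistics over the whole spatio-temporal cube, without a detection step, can be sketched with a simplified LBP-TOP descriptor. The code assumes the basic 3 × 3, 8-neighbour LBP operator applied on the XY, XT, and YT planes through every interior voxel; it omits the circular sampling and per-axis radii of the full LBP-TOP operator and does not cover LPQ-TOP or BSIF-TOP. All names and the toy volume are hypothetical.

```python
def lbp_code(center, neighbours):
    """Basic LBP: threshold 8 neighbours against the centre and pack the bits into one byte."""
    code = 0
    for bit, value in enumerate(neighbours):
        if value >= center:
            code |= 1 << bit
    return code


# Offsets of the 8 neighbours in a 3 x 3 window, listed in a fixed order.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_top(volume):
    """Concatenate normalized LBP histograms from the XY, XT and YT planes of volume[t][y][x]."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    hists = {"xy": [0] * 256, "xt": [0] * 256, "yt": [0] * 256}
    for t in range(1, T - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                c = volume[t][y][x]
                xy = [volume[t][y + a][x + b] for a, b in OFFSETS]
                xt = [volume[t + a][y][x + b] for a, b in OFFSETS]
                yt = [volume[t + a][y + b][x] for a, b in OFFSETS]
                hists["xy"][lbp_code(c, xy)] += 1
                hists["xt"][lbp_code(c, xt)] += 1
                hists["yt"][lbp_code(c, yt)] += 1
    count = (T - 2) * (H - 2) * (W - 2)  # number of interior voxels
    feature = []
    for plane in ("xy", "xt", "yt"):
        feature.extend(h / count for h in hists[plane])
    return feature


# Hypothetical toy volume: 4 frames of 5 x 5 pixels whose intensity equals the x coordinate.
ramp = [[[x for x in range(5)] for _ in range(5)] for _ in range(4)]
feat = lbp_top(ramp)
```

Every interior voxel contributes to the three histograms, so the descriptor has a fixed length (3 × 256 here) regardless of how many salient structures a detector would have found.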

Among the various textural features used jointly with STIP and iDT features, BSIF-TOP appears to be the most promising choice, as it outperforms the rest. With the degradation of spatial resolution and temporal sampling rate, BSIF-TOP performs comparatively better than LBP-TOP and LPQ-TOP features. Figures 5 and 6 analyze the performance improvement of BSIF-TOP features relative to that of LBP-TOP and LPQ-TOP. For instance, on STIP features, the improvement of BSIF-TOP over LPQ-TOP is ≈ 5% for SD 4 and ≈ 4.8% for TD 4 videos. On iDT features, the improvement of BSIF-TOP over LPQ-TOP is ≈ 10% for SD 3 and ≈ 0.5% for TD 3 videos. Overall, these observations indicate that BSIF-TOP performs relatively well as the video quality drops, an indication of its robustness in this respect.

Fig. 5 Percentage improvement of BSIF-TOP over LBP-TOP and LPQ-TOP when combined with STIP on downsampled KTH

Fig. 6 Percentage improvement of BSIF-TOP over LBP-TOP and LPQ-TOP, when combined with iDT on downsampled KTH

Our best approach, which combines the base features with BSIF-TOP dynamic textures, also performed better than the recent works by Gao et al. [34] and Rahman and See [35]. Notably, the results of [34] are reported without the “s2” videos from KTH, which are videos that contain larger motion and scale variations. This suggests that their method could fare even worse if the omitted “s2” videos were included.

We further provide the confusion matrices for four approaches (STIP, STIP+BSIF-TOP, iDT, and iDT+BSIF-TOP) on the SD 3 videos in Fig. 7. Due to space limitations, we only report these confusion matrices for the SD 3 videos as an example. For both STIP and iDT features, it is clear that the additional use of textural features helps to improve the accuracy of certain action classes such as “walking” and “jogging” by ≈ 20–40%. However, this comes at the expense of a slight drop in accuracy for other classes such as “handclapping” and “handwaving.” We make a special note of the recognition of “boxing” videos using feature trajectories: its somewhat poor performance on low-resolution videos is greatly improved through the introduction of textural features, with an increase of more than 50%.
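For readers less familiar with how class-wise figures are read off a confusion matrix, the sketch below builds one from label lists and computes the per-class accuracy from its diagonal. The labels are hypothetical and chosen only for illustration; they are not results from the experiments.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Diagonal count divided by the row total, i.e. the recall of each class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Hypothetical labels for illustration only; these are not results from the paper.
classes = ["walking", "jogging", "boxing"]
y_true = ["walking", "walking", "jogging", "jogging", "boxing", "boxing"]
y_pred = ["walking", "jogging", "jogging", "jogging", "boxing", "walking"]
cm = confusion_matrix(y_true, y_pred, classes)
acc = per_class_accuracy(cm)
```

Comparing the diagonals of two such matrices, one per method, gives the class-wise gains and losses discussed in the text.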

Fig. 7 Confusion matrices of the KTH-SD 3 dataset. Confusion matrices on the right side show the effects of fusing BSIF-TOP textural features with STIP and iDT features. a STIP. b STIP+BSIF-TOP. c iDT. d iDT+BSIF-TOP

5.2 Experiments on compressed videos of YouTube dataset (YouTube-LQ)

In this section, we repeat our experiments on the compressed videos of the YouTube dataset (YouTube-LQ) to demonstrate the effectiveness of using textural features with both STIP and iDT features. We observe that after applying compression, both the STIP and iDT baseline features struggle to maintain their original performance (i.e., 71.94% for STIP and 81.58% for iDT). From Table 3, it is clear that methods that use additional textural features demonstrate significant improvement over their baselines. Again, iDT-based methods outperform their STIP counterparts, while joint usage with BSIF-TOP once again tops the other textural features by a good measure. However, our best textural feature still falls short of deep object features [35], which are arguably very robust against effects stemming from video compression. In future work, we intend to investigate how textural and deep object features can be combined.

Table 3 Recognition accuracy (%) of various approaches on the YouTube-LQ dataset

Analysis on experiments: After applying compression, the performance of the baseline features becomes lower than that on the original videos, since compression critically affects the spatial quality. With the inclusion of textural features, the performance of both STIP and iDT features increased significantly. Once again, BSIF-TOP is the most promising choice, offering the highest performance improvement in comparison to LBP-TOP and LPQ-TOP. Figure 8 provides a closer look at how BSIF-TOP improves the recognition performance to a much larger extent than the other textural features.

Fig. 8 Percentage improvement of BSIF-TOP over LBP-TOP and LPQ-TOP, when combined with trajectory based features on YouTube-LQ

The confusion matrices shown in Fig. 9 offer more insight into class-wise performance. It is clear that for both STIP- and iDT-based methods, the addition of textural features plays an important role. Many action classes have improved, e.g., “golf swing,” “soccer juggling,” “swing,” “tennis swing,” “trampoline jumping,” and “volleyball spiking.” It is interesting to note that, when combined with BSIF-TOP features, the iDT features performed slightly better than the STIP features on actions with complex scenes such as “volleyball spiking.”

Fig. 9 Confusion matrices of the YouTube-LQ dataset. Confusion matrices on the right side show the effects of fusing BSIF-TOP textural features with STIP and iDT features. a STIP. b STIP+BSIF-TOP. c iDT. d iDT+BSIF-TOP

5.3 Experiments on medium and bad quality subsets from HMDB dataset (HMDB-MQ and HMDB-LQ)

To demonstrate the effectiveness of adding texture information to STIP- and iDT-based features for a larger number of action classes, we also ran our experiments on the low-quality subsets of the HMDB dataset. From Table 4, we can see a significant leap in performance when the textural features are aggregated, particularly BSIF-TOP. The iDT+BSIF-TOP method achieved a very commendable result, with ≈ 14% improvement on the “Bad” subset (HMDB-LQ) and ≈ 4.5% increase on the “Medium” subset (HMDB-MQ). The method that incorporates deep object features [35] remains competitive against our proposed methods. It appears to perform better when STIP is the choice of base feature. Nevertheless, our proposed usage of textural features, particularly BSIF-TOP, still surpasses the performance of the method with deep object features when iDT is the base feature.

Table 4 Recognition accuracy (%) of various feature combinations on HMDB-LQ and HMDB-MQ subsets

Analysis on experiments: On both STIP and iDT base features, the “Bad” quality videos are the most challenging, with a recognition accuracy of just above 20%. The use of textural features helps increase their performance by a very good margin of ≈ 11 and ≈ 14% for STIP and iDT features, respectively. Meanwhile, for “Medium” quality videos, the improvement in performance after supplementing with textural features is not as marked on the iDT base features as on the STIP base features. The addition of BSIF-TOP features offers the most significant jump in performance, almost 9% more than the next best textural feature (LPQ-TOP). Further analysis of the performance improvement of BSIF-TOP over its textural counterparts is shown in Fig. 10. As expected, BSIF-TOP is far superior to LBP-TOP, particularly when STIP features are used as the base features. These differences are much less pronounced when combined with iDT features instead.

Fig. 10 Percentage improvement of BSIF-TOP over LBP-TOP and LPQ-TOP, when combined with interest point based features on HMDB-LQ and HMDB-MQ

Figure 11 shows the confusion matrices for the STIP, STIP+BSIF-TOP, iDT, and iDT+BSIF-TOP features, respectively, based on the first split of the HMDB dataset (both HMDB-LQ and HMDB-MQ included). Since there are 51 classes in total, we can only compare them visually by observing the diagonal patterns in these confusion matrices. The more coloured the diagonals are against the blue background, the better the performance, with fewer false positives. In total, about 15–17 action classes improved after BSIF-TOP is used together with the base features. Action classes that benefit from the use of textural information include “Climb,” “Sword,” “Draw sword,” “Fencing,” and “Golf.”

Fig. 11 Confusion matrices of the first split of HMDB dataset. Confusion matrices on the right side show the effects of fusing BSIF-TOP textural features with STIP and iDT features. a STIP. b STIP+BSIF-TOP. c iDT. d iDT+BSIF-TOP

5.4 Discussion

In this section, we present additional analysis as a result of our investigation into several influencing factors in our recognition pipeline such as the comparison between the textural features considered in this work, feature sampling for codebook generation and the choice of feature encoding method. We also offer a balanced commentary on the potential use of deeply learned features.

Analysis on textural features: To investigate the efficacy of textural features alone, we remove the base shape and motion features (STIP and iDT) for the purpose of this analysis. Figure 12 compares the performance of the three dynamic textural features considered in this paper: LBP-TOP, LPQ-TOP, and BSIF-TOP. This was done for the original and six downsampled versions of the KTH dataset (denoted as SD 2, SD 3, SD 4, TD 2, TD 3, TD 4), the original and compressed UCF-YouTube datasets, and the three subsets of the HMDB dataset (denoted as “Bad,” “Medium,” “Good”). In all cases, the BSIF-TOP emerged as the most robust textural feature, capable of extracting effective global information regardless of the adversity in video quality.

Fig. 12 Performance of various textural features on the original and low quality datasets

Analysis on feature sampling for codebook generation: Determining the appropriate codebook size is important to ensure that the extracted local features are encoded into a codebook of sufficient capacity. Many authors have analyzed this issue in detail [6, 37] and have also suggested appropriate codebook sizes based on their empirical evaluations over many experimental datasets. Following their suggestions, we choose to use a codebook size of 4000 for all our tested datasets, after verification by experiments. However, the number of features that are sampled randomly to build the codebook could potentially be vital to the recognition performance. For consistency in our main experiments (in Sections 5.1, 5.2, and 5.3), we fixed the number of sampled features to 100,000 to obtain a reasonably good level of accuracy while maintaining a manageable computational load for codebook learning.

To see the effect of using more features for codebook generation, we tested two of our best performing methods on all three subsets of the HMDB dataset, with a variety of feature set sizes: 100 k, 150 k, 200 k, and 250 k. In Fig. 13, we observe that the recognition accuracy on various HMDB videos somewhat improves when we use more features to construct the codebook (the codebook size remains the same at 4000). Interestingly, with STIP as the base feature (see Fig. 13 a), we can achieve better accuracy (by up to ≈ 5%) if we use a larger sampling size. The opposite holds for the iDT case (see Fig. 13 b), where the recognition accuracy drops significantly across all three subsets when larger sampling sizes are used. The iDT features are constructed from MBHx and MBHy descriptors, which describe the “gradient” on the temporal dimension of the trajectories, along both horizontal and vertical spatial directions. Hence, if too many features are sampled, the intra-class variations of these trajectories are likely to introduce a variety of perturbations to the descriptors, which in turn increases the ambiguity during clustering. A possible remedy is to increase the codebook size to accommodate a larger variety of trajectories, particularly for more complex scenes such as those in HMDB. Codebooks constructed from less ambiguous features have higher discriminative capacity, which may help to gain better recognition performance.
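The two quantities varied in this analysis, the number of sampled descriptors and the codebook size, can be seen in a minimal sketch of codebook learning. It assumes plain Lloyd's k-means as the clustering step, which is the usual choice for BoVW but is not stated explicitly in the text; the toy data and function names are hypothetical.

```python
import random


def sample_features(features, max_samples, seed=0):
    """Randomly keep at most max_samples descriptors for codebook learning."""
    if len(features) <= max_samples:
        return list(features)
    return random.Random(seed).sample(features, max_samples)


def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids (the visual codewords)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


# Hypothetical toy data: 200 two-dimensional descriptors around two centres.
rng = random.Random(1)
descriptors = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(100)]
descriptors += [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(100)]
subset = sample_features(descriptors, max_samples=50)
codebook = kmeans(subset, k=2)
```

In the experiments above, `max_samples` corresponds to the 100 k to 250 k sampled features and `k` to the fixed codebook size of 4000; the cost of each k-means iteration grows with the product of the two.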

Fig. 13 Recognition performance of STIP+BSIF-TOP and iDT+BSIF-TOP on various subsets of the HMDB dataset. Bars of different colors denote the varying amount of feature descriptors (100 k, 150 k, 200 k, 250 k) that are sampled for codebook generation. a STIP+BSIF-TOP. b iDT+BSIF-TOP

Analysis on shape and motion features encoding: Many feature encoding methods have been proposed in the literature, such as histogram encoding, Fisher Vector encoding, and sparse coding [37]. We choose to use histogram encoding, better known as bag-of-visual-words (BoVW), since it is widely used in many recent action recognition works [2, 6, 7]. Though some authors [1, 8] showed that Fisher Vector (FV) encoding is superior to BoVW encoding, we found that this does not hold most of the time for the evaluated low-quality videos, with the exception of videos from HMDB. Our best methods (STIP+BSIF-TOP and iDT+BSIF-TOP) achieve better accuracy using BoVW encoding on most counts, as can be seen in Table 5. From this observation, we suggest that codebooks for videos with plenty of motion and complex background scenes may be better constructed using FV encoding, which applies soft quantization to features.

Table 5 Recognition accuracy (%) on various datasets with STIP+BSIF-TOP and iDT+BSIF-TOP methods using bag-of-visual-words (BoVW) and Fisher Vector (FV) encoding

Analysis on deeply learned features: Owing to the recent breakthrough in deep learning techniques, particularly deep convolutional neural networks (CNNs), we have made comparisons against a recent work that used object features extracted from a pre-trained CNN model [35]. In that work, the authors used a popular off-the-shelf CNN model, the VGG-16 architecture [51], which was pre-trained on ImageNet for 1000 categories. Object features from the last few layers of the network (concatenation of the fc6 and fc7 layers) were extracted from each frame and average pooled.

Interestingly, the widely acclaimed deeply learned features were not superior in all cases. Experimental results on the downsampled KTH datasets (Tables 1 and 2) and the HMDB low-quality subsets (Table 4) showed that combining the robust BSIF-TOP dynamic textural feature with the base features (STIP or iDT) can surpass the recognition capability of combining them with deeply learned object features. In fact, the baseline performance for some of the downsampled versions (particularly SD 2, TD 2, and TD 3) is also better than that obtained when combined with deep object features. This can be attributed to the pre-trained CNN model’s inability to generalize to videos with distorted spatial information. In addition, average pooling the deep features across all frames [35] to obtain a single video-level descriptor removes valuable temporal cues.
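The loss of temporal cues under average pooling can be seen in a small sketch: the pooled vector is identical whether the frames are played forwards or backwards. The per-frame activations here are hypothetical 2+2-dimensional stand-ins (the fc6 and fc7 layers of VGG-16 are 4096-dimensional each), and the function names are illustrative.

```python
def concat_layers(fc6, fc7):
    """Join the fc6 and fc7 activations of one frame into a single vector."""
    return list(fc6) + list(fc7)


def average_pool(frame_features):
    """Mean over frames, dimension by dimension; returns one vector per video."""
    n = len(frame_features)
    return [sum(dim) / n for dim in zip(*frame_features)]


# Hypothetical activations for three frames, for illustration only.
frames = [
    concat_layers([1.0, 0.0], [0.0, 2.0]),
    concat_layers([3.0, 0.0], [0.0, 4.0]),
    concat_layers([5.0, 3.0], [0.0, 0.0]),
]
pooled = average_pool(frames)
pooled_reversed = average_pool(frames[::-1])
```

Since the mean does not depend on frame order, any cue carried by the sequence of frames is discarded; only the frame-level appearance statistics remain.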

However, for the compressed YouTube-LQ dataset, the use of deep object features performed considerably better than our best approach (see Table 3). The BSIF-TOP feature dimension is also larger than that of the fc6+fc7 deep object features (12,288 versus 8192), which lengthens classifier training.

5.5 Computational complexity

In this section, we compare the various feature descriptors (including the time consumed for their feature detection) based on their total computation speed. This comparison is performed using a sample “Bike riding” video taken from the HMDB dataset, with a 240 × 320 frame resolution and a total of 246 image frames at 30 fps. This estimation of run-time speed was performed on an Intel i7 3.60 GHz machine with 24 GB RAM. Table 6 shows the computational cost of various feature descriptors in seconds per image frame. Among the shape-motion descriptors, the HOG+HOF feature is 0.047 s faster than the MBHx+MBHy feature, which relies on feature tracking and warped flow estimation. Among the textural features compared, LPQ-TOP and BSIF-TOP are the most efficient methods (both much faster than computing shape-motion features), and yet they are also the most promising features for recognizing actions in low-quality video. Between the two feature detectors, extracting the iDT (the better performing method) takes around 0.2 s per frame on a single scale and much longer on multiple scales. In future, we intend to explore the possibility of multi-scale trajectories with the help of parallelized frameworks [52].
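As a quick worked check of what the per-frame figure implies for the sample video, the snippet below converts the approximate single-scale iDT cost into a total processing time and compares it with the clip duration. The three constants come from the text; the helper `seconds_per_frame` is a hypothetical timing harness, not the measurement code used for Table 6.

```python
import time


def seconds_per_frame(extract, frames):
    """Average wall-clock time of extract(frame) over the given frames."""
    start = time.perf_counter()
    for frame in frames:
        extract(frame)
    return (time.perf_counter() - start) / len(frames)


num_frames = 246          # frames in the sample video (from the text)
fps = 30                  # frame rate of the sample video (from the text)
idt_sec_per_frame = 0.2   # approximate single-scale iDT cost per frame (from the text)

video_duration = num_frames / fps              # seconds of footage
idt_total = num_frames * idt_sec_per_frame     # seconds of processing
slowdown = idt_total / video_duration          # processing time relative to real time
```

Under these figures, the 8.2 s clip needs roughly 49 s of single-scale iDT extraction, about six times slower than real time.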

Table 6 Computational cost (detection+description) of various feature descriptors

6 Conclusion

In this paper, we demonstrate that dynamic textural features can help improve the performance of action recognition in low-quality videos by a good margin. In comparison with current methods that mainly rely on shape and motion features, the use of textural features is a novel proposition that is found to be robust against undesirable but often realistic video conditions: low resolution and frame rate, lossy compression, and the presence of motion blurring and artifacts. Our extensive set of experiments marked BSIF-TOP as a promising candidate for textural features to complement conventional shape and motion features.

Even with the advent of deep learning techniques, we see great value in the use of features that directly exemplify a particular image structure such as textures. However, features learned from deep neural networks have also shown great potential, even more so if the network has been carefully tuned for the target domain. Likewise, the filters used in the BSIF approach are fundamentally learnt through ICA in an unsupervised manner. Hence, future directions point towards further exploration of how richer features can be learnt from videos sampled from a wide quality range to enable better generalization.