Dynamic Texture Recognition Using Time-Causal and Time-Recursive Spatio-Temporal Receptive Fields
Abstract
This work presents a first evaluation of using spatio-temporal receptive fields from a recently proposed time-causal spatio-temporal scale-space framework as primitives for video analysis. We propose a new family of video descriptors based on regional statistics of spatio-temporal receptive field responses and evaluate this approach on the problem of dynamic texture recognition. Our approach generalises a previously used method, based on joint histograms of receptive field responses, from the spatial to the spatio-temporal domain and from object recognition to dynamic texture recognition. The time-recursive formulation enables computationally efficient time-causal recognition. The experimental evaluation demonstrates competitive performance compared to the state of the art. In particular, it is shown that binary versions of our dynamic texture descriptors achieve improved performance compared to a large range of similar methods using different primitives, either hand-crafted or learned from data. Further, our qualitative and quantitative investigation into parameter choices and the use of different sets of receptive fields highlights the robustness and flexibility of our approach. Together, these results support the descriptive power of this family of time-causal spatio-temporal receptive fields, validate our approach for dynamic texture recognition and point towards the possibility of designing a range of video analysis methods based on these new time-causal spatio-temporal primitives.
Keywords
Dynamic texture · Receptive field · Spatio-temporal · Time-causal · Time-recursive · Video descriptor · Receptive field histogram · Scale space

1 Introduction
The ability to derive properties of the surrounding world from time-varying visual input is a key function of a general-purpose computer vision system and necessary for any artificial or biological agent that is to use vision for interpreting a dynamic environment. Motion provides additional cues for understanding a scene, and some tasks by necessity require motion information, e.g. distinguishing between events or actions with similar spatial appearance or estimating the speed of moving objects. Motion cues are also helpful when other visual cues are weak or contradictory. Thus, understanding how to represent and incorporate motion information for different video analysis tasks—including the question of what constitute meaningful spatio-temporal features—is a key factor for successful applications in areas such as action recognition, automatic surveillance, dynamic texture and scene understanding, video indexing and retrieval, autonomous driving, etc.
Challenges when dealing with spatio-temporal image data are similar to those present in the static case, that is, high-dimensional data with large intra-class variability caused both by the diverse appearance of conceptually similar objects and by the presence of a large number of identity-preserving visual transformations. From the spatial domain, we inherit the basic image transformations: translations, scalings, rotations, nonlinear perspective transformations and illumination transformations. In addition to these, spatio-temporal data will contain additional sources of variability: differences in motion patterns for conceptually similar motions, events occurring faster or slower, velocity transformations caused by camera motion and time-varying illumination. Moreover, the larger data dimensionality compared to static images presents additional computational challenges.
For biological vision, local image measurements in terms of receptive fields constitute the first processing layers [11, 25]. In the area of scale-space theory, it has been shown that Gaussian derivatives and related operators constitute a canonical model for visual receptive fields [17, 26, 30, 31, 32, 33, 39, 41, 67, 69, 70, 78, 80]. In computer vision, spatial receptive fields based on the Gaussian scale-space concept have been demonstrated to be a powerful front-end for solving a large range of visual tasks. Theoretical properties of scale-space filters enable the design of methods invariant or robust to natural image transformations [41, 42, 43, 44]. Also, early processing layers based on such “ideal” receptive fields can be shared among multiple tasks and thus free resources both for learning higher-level features from data and during online processing.
The most straightforward extension of the Gaussian scale-space concept to the spatio-temporal domain is to use Gaussian smoothing also over the temporal domain. However, for real-time processing or when modelling biological vision, this would violate the fundamental temporal causality constraint present for real-time tasks: it is simply not possible to access the future. The ad hoc solution of using a time-delayed truncated Gaussian temporal kernel would instead imply unnecessarily long temporal delays, which would make the framework less suitable for time-critical applications.
The preferred option is to use truly time-causal visual operators. Recently, a new time-causal spatio-temporal scale-space framework, leading to a new family of time-causal spatio-temporal receptive fields, has been introduced by Lindeberg [44]. In addition to temporal causality, the time-recursive formulation of the temporal smoothing operation offers computational advantages compared to Gaussian filtering over time, in terms of fewer computations and a compact, recursively updated memory of the past.
The idealised spatio-temporal receptive fields derived within that framework also have a strong connection to biology, in the sense that they model the spatial and spatio-temporal receptive field shapes of neurons in the LGN and V1 very well [42, 44]. This similarity further motivates designing algorithms based on these primitives: receptive fields of this shape demonstrably work well in biological vision systems, which points towards their applicability also for artificial vision. An additional motivation, although not actively pursued here, is that designing methods based on primitives similar to biological receptive fields can enable a better understanding of information processing in biological systems.
The purpose of this study is a first evaluation of using this new family of time-causal spatio-temporal receptive fields as visual primitives for video analysis. As a first application, we have chosen the problem of dynamic texture recognition. A dynamic texture or spatio-temporal texture is an extension of the notion of texture to the spatio-temporal domain and can be naively defined as “texture + motion” or more formally as a spatio-temporal pattern that exhibits certain stationarity or self-similar properties over both space and time [6]. Examples of dynamic textures are windblown vegetation, fire, waves, a flock of flying birds or a flag flapping in the wind. Recognising different types of dynamic textures is important for visual tasks such as automatic surveillance (e.g. detecting forest fires), video indexing and retrieval (e.g. retrieving all videos set on the sea) and for enabling artificial agents to understand and interact with the world by interpreting different environments. The main contributions of this study are the following:
 (i)
Qualitative and quantitative effects from varying model parameters, including the spatial and temporal scales, the number of principal components and the number of bins used in the histogram descriptor.
 (ii)
A comparison between the performance of descriptors constructed from different sets of receptive fields.
 (iii)
An extensive comparison with state-of-the-art dynamic texture recognition methods.
Together, these results support the descriptive power of this family of time-causal spatio-temporal receptive fields, validate our approach for dynamic texture recognition and point towards the possibility of designing a range of video analysis methods based on these new time-causal spatio-temporal primitives.
This paper is an extended version of a conference paper [27], in which only a single descriptor version was evaluated. We here present a more extensive experimental evaluation, including results on varying the descriptor parameters and on using different sets of receptive fields. We specifically present new video descriptors with improved performance compared to the previous video descriptor used in [27].
2 Related Work
For dynamic texture recognition, additional challenges compared to the static case include variabilities in motion patterns and speed and much higher data dimensionality. Although spatial texture descriptors can give reasonable performance also in the dynamic case (a human observer can typically distinguish dynamic textures based on a single frame), making use of dynamic information as well provides a consistent advantage over purely spatial descriptors (see, e.g. [1, 3, 24, 56, 87]). Thus, a large range of dynamic texture recognition methods have been developed, exploring different options for representing motion information and combining this with spatial information.
Some of the first methods for dynamic texture recognition were based on optic flow, see, e.g. the pioneering work by Nelson and Polana [52] and later work by Péteri and Chetverikov [54], Lu et al. [49], Fazekas and Chetverikov [16], Fazekas et al. [15] and Crivelli et al. [7].
Compared to the motion patterns of larger rigid objects, however, dynamic textures often feature chaotic nonsmooth motions, multiple directions of motion at the same image point and intensity changes not mediated by rigid motion—consider, e.g. fluttering leaves, turbulent water and fire. Thus, the brightness constancy assumption, usually underlying optical flow estimation, is violated for many dynamic texture types. This implies difficulties estimating optic flow for dynamic textures, although alternative types of assumptions, such as brightness conservation and colour constancy, have later been explored, for example, in the context of dynamic texture detection and segmentation [5, 15].
Another early approach, first proposed by Soatto et al. [66], is to model the time-evolving appearance of a dynamic texture as a linear dynamical system (LDS). Although the approach was originally designed for synthesis, recognition can be performed using the model parameters as features. A drawback of such global LDS models, however, is poor invariance to viewpoint and illumination conditions [81]. To circumvent the heavy dependence of the first LDS models on global spatial appearance, the bag of dynamical systems (BoS) approach was developed by Ravichandran et al. [60]. Here, the global model is replaced by a set of codebook LDS models, each describing a small space–time cuboid, used as local descriptors in a bag-of-words framework. For additional LDS-based approaches, see also [4, 51, 57, 63, 75].
The use of histograms of local image measurements as visual descriptors was pioneered by Swain and Ballard [68], who proposed using 3D colour histograms for object recognition. Different types of histogram descriptors have subsequently proven highly useful for a large range of visual tasks [9, 10, 22, 29, 34, 36, 37, 38, 48, 64, 73, 85].
A large group of dynamic texture recognition methods based on statistics of local image primitives are local binary pattern (LBP)-based approaches. The original spatial LBP descriptor captures the joint binarised distribution of pixel values in local neighbourhoods. The spatio-temporal generalisations of LBP used for dynamic texture recognition do the same either for 3D space–time volumes (VLBP) [86, 87] or by applying a two-dimensional LBP descriptor on three orthogonal planes (LBP-TOP) [87]. Several extensions/versions of LBP-TOP have subsequently been presented, e.g. utilising averaging and principal histogram analysis to obtain more reliable statistics (Enhanced LBP) [61] or multi-clustering of salient features to identify and remove outlier frames (AFS-TOP) [24]. See also the completed volume local binary pattern approach (CVLBP) [71] and multiresolution edge-weighted local structure patterns (MEWLSP) [72].
Contrasting LBP-TOP with VLBP highlights a conceptual difference concerning the generalisation of spatial descriptors to the spatio-temporal domain: whereas the VLBP descriptor is based on full space–time 2D+T features, the LBP-TOP descriptor applies 2D features, originally designed for purely spatial tasks, on several cross sections of a spatio-temporal volume. While the former is in some sense conceptually more appealing and implies more discriminative modelling of the space–time structure, this comes with the drawback of higher descriptor dimensionality and higher computational load.
For LBP-based methods, using 2D descriptors on three orthogonal planes has so far proven more successful, and this approach is frequently used also by non-LBP-based methods [1, 2, 3, 83]. A variant of this idea proposed by Norouznezhad et al. [53] is to replace the three orthogonal planes with nine spatio-temporal symmetry planes and apply histograms of oriented gradients on these planes. Similarly, the directional number transitional graph (DNG) by Rivera et al. [62] is evaluated on nine spatio-temporal cross sections of a video.
A different group of approaches, similarly based on gathering statistics of the local space–time structure but using different primitives, is spatio-temporal filtering-based approaches. Examples are the oriented energy representations by Wildes and Bergen [79] and Derpanis and Wildes [12], where the latter represents the pure dynamics of spatio-temporal textures by capturing space–time orientation using 3D Gaussian derivative filters. Marginal histograms of spatio-temporal Gabor filters were proposed by Gonçalves et al. [20]. Our proposed approach, which is based on joint histograms of time-causal spatio-temporal receptive field responses, also fits into this category. However, in contrast to spatio-temporal filtering-based approaches using marginal histograms, the use of joint histograms, which can capture the covariation of different features, enables distinguishing a larger number of local space–time patterns. Joint statistics of ideal scale-space filters have previously been used for spatial object recognition, see, e.g. [64] and the approach by Linde and Lindeberg [37, 38], which we here generalise to the spatio-temporal domain. The use of time-causal filters in our approach also contrasts with other spatio-temporal filtering-based approaches, and it should also be noted that the time-causal limit kernel used in this paper is time-recursive, whereas no time-recursive formulation is known for the scale-time kernel previously proposed by Koenderink [31].
There also exist a number of related methods similarly based on joint statistics of the local space–time structure but using filters learned from data as primitives. Unsupervised learning-based approaches are, e.g. multiscale binarised statistical image features (MBSIF-TOP) by Arashloo and Kittler [3], which learns filters by means of independent component analysis, and PCANet-TOP by Arashloo et al. [2], where two layers of hierarchical features are learnt by means of principal component analysis (PCA). Approaches instead based on sparse coding for learning filters are orthogonal tensor dictionary learning (OTD) [59], equiangular kernel dictionary learning (SKDL) [58] and manifold regularised slow feature analysis (MR-SFA) [50].
Recently, several deep learning-based approaches for dynamic texture recognition have been proposed. These are typically trained by supervised learning and learn complex and abstract high-level features from data. Andrearczyk and Whelan [1] train a dynamic texture convolutional neural network from scratch to extract features on three orthogonal planes (DT-CNN). Qi et al. [56] instead use a network pretrained on an object recognition task for feature extraction (TCoF), whereas Hong et al. [23] use features from a pretrained network as the basis for a deep dual descriptor (D3). Additional approaches include the high-level feature approach by Wang and Hu [76], which is a hybrid method using deep learning in combination with chaotic features, and the early deep learning approach of Culibrk and Sebe [8] based on temporal dropout of changes. It should be noted that none of these approaches are based on full spatio-temporal deep features. They instead use features extracted from individual frames, from differences between pairs of frames or from orthogonal space–time planes. The best deep learning approaches are among the best-performing dynamic texture recognition methods, but this comes at the price of a lack of understanding and interpretability. The properties of the learned nonlinear mappings and why they prove successful are not fully understood, neither in general nor for a network trained on a specific task. These highly complex “black-box” methods may also suffer more from surprising and unintuitive failure modes. Compared to methods incorporating more prior information about the problem structure, deep learning also requires larger amounts of training data from the specific domain, or from one similar enough for successful transfer learning.
Spatio-temporal transforms have also been applied to dynamic texture recognition. Ji et al. [28] propose a method based on wavelet domain fractal analysis (WMFS), whereas Dubois et al. [13] utilise the 2D+T curvelet transform and Smith et al. [65] use spatio-temporal wavelets for video texture indexing. Fractal dimension-based methods make use of the self-similarity properties of dynamic textures. Xu et al. [83] create a descriptor from the fractal dimension of motion features (DFS), whereas Xu et al. [82] instead extract the power-law behaviour of local multiscale gradient orientation histograms (3D-OTF) (see also [18, 65]).
Additional approaches include using average degrees of complex networks [21] and a total-variation-based approach by El Moubtahij et al. [14]. Wang and Hu [77] instead create a descriptor from chaotic features. There are also approaches combining several different descriptor types or features, such as DL-PEGASOS by Ghanem and Ahuja [19], which combines LBP, PHOG and LDS descriptors with maximum margin distance learning, or the approach of Yang et al. [84], who use ensemble SVMs to combine LBP, shape-invariant co-occurrence patterns (SCOPs) and chromatic information with dynamic information represented by LDS models.
3 Spatio-Temporal Receptive Field Model
We here give a brief description of the time-causal spatio-temporal receptive field framework of Lindeberg [44]. This framework provides the time-causal primitives for describing the local space–time structure—the spatio-temporal receptive fields (or equivalently 2D+T scale-space filters)—which our proposed family of video descriptors is based on.

The spatio-temporal receptive fields in this framework have the form

\(T(x, y, t;\; s, \tau ) = g(x - ut, y - vt;\; s, \varSigma ) \, h(t;\; \tau ),\)

where

(x, y) denotes the image coordinates,

t denotes time,

\(h(t;\; \tau )\) denotes a temporal smoothing kernel with temporal scale parameter \(\tau \), and

\(g(x - ut, y - vt;\; s, \varSigma )\) denotes a spatial affine Gaussian kernel with spatial scale parameter s and spatial covariance matrix \(\varSigma \) that moves with image velocity (u, v).
A more general approach, and one more similar to biological vision, would be to include a family of velocity-adapted receptive fields over a range of image velocities v. Such a set of receptive fields would constitute a Galilean covariant representation enabling fully Galilean invariant recognition according to the general theory presented in [41, Sections 4.1.3–4.1.4] and applied to motion recognition in [35]. Exploring these possibilities is, however, left for future work. In this study, we choose to evaluate how far we can get using space–time separable receptive fields only.
Notably, the family of space–time separable receptive fields can also represent image velocities, since a set of spatial and temporal derivatives of different orders can implicitly encode optic flow. An interesting biological parallel is that the superior colliculus is able to perform basic visual tasks, although it receives its primary inputs from the lateral geniculate nucleus (LGN), where a majority of the receptive fields are space–time separable.
These spatio-temporal receptive field responses can be computed efficiently by space–time separable filtering. The spatial smoothing is done by separable discrete Gaussian filtering over the spatial dimensions, with the spatial extent proportional to the spatial scale parameter in units of the standard deviation \(\sigma _s = \sqrt{s}\). The temporal smoothing is performed by recursive filtering requiring only two additions and one multiplication per pixel and spatio-temporal scale level. Then, scale-space derivatives are computed by applying small-support discrete derivative approximations to the spatio-temporal scale-space representation.
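To make the recursion concrete, a minimal sketch of the time-recursive temporal smoothing is given below, assuming a cascade of first-order recursive filters with time constants \(\mu _k\) in the spirit of [44]; the function name and interface are hypothetical:

```python
import numpy as np

def recursive_temporal_smoothing(frames, mu):
    """Cascade of first-order recursive filters over time (a sketch).

    frames -- iterable of 2D arrays (the video, frame by frame)
    mu     -- time constants mu_k, one per level in the temporal cascade

    Each level is updated as
        L_k(t) = L_k(t-1) + (L_{k-1}(t) - L_k(t-1)) / (1 + mu_k),
    i.e. two additions and one multiplication per pixel and level.
    Yields the smoothed frame at the coarsest temporal scale per time step.
    """
    state = None  # recursively updated memory of the past, one buffer per level
    for f in frames:
        x = np.asarray(f, dtype=float)
        if state is None:
            state = [x.copy() for _ in mu]  # initialise from the first frame
        for k, m in enumerate(mu):
            state[k] = state[k] + (x - state[k]) / (1.0 + m)
            x = state[k]
        yield x
```

Each level needs only its own previous output as state, which is what gives the compact, recursively updated memory of the past referred to above.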
4 Video Descriptors
The proposed video descriptor is computed in the following four steps:
 (i)
Computation of the spatio-temporal scale-space representation at the specified spatial and temporal scales (spatial and temporal smoothing).
 (ii)
Computation of local spatio-temporal receptive field responses using discrete derivative approximation filters over space and time.
 (iii)
Dimensionality reduction of the local spatio-temporal receptive field responses/feature vectors using principal component analysis (PCA).
 (iv)
Aggregation of joint statistics of receptive field responses over a space–time region into a multidimensional histogram.
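The four steps can be sketched as follows under simplifying assumptions: plain Gaussian smoothing over all three axes stands in for the separable spatial smoothing and the time-causal recursive temporal smoothing, a small spatial derivative set stands in for the full receptive field set, and the function name and defaults are hypothetical:

```python
import numpy as np
from scipy import ndimage

def video_descriptor(video, sigma=2.0, n_components=3, n_bins=5):
    """Sketch of the four-step pipeline for a video of shape (T, H, W)."""
    v = np.asarray(video, dtype=float)

    # (i)-(ii) smoothing and derivative responses: a small spatial N-jet,
    # computed with sampled Gaussian derivative filters per (t, y, x) axis
    orders = [(0, 1, 0), (0, 0, 1), (0, 2, 0), (0, 1, 1), (0, 0, 2)]
    F = np.stack([ndimage.gaussian_filter(v, sigma, order=o) for o in orders],
                 axis=-1)
    X = F.reshape(-1, F.shape[-1])            # one feature vector per pixel

    # (iii) dimensionality reduction with PCA (via SVD of centred data)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:n_components].T              # (n_points, M)

    # (iv) joint histogram over the whole video, normalised to sum to one
    edges = [np.linspace(Y[:, j].min(), Y[:, j].max(), n_bins + 1)
             for j in range(Y.shape[1])]
    hist, _ = np.histogramdd(Y, bins=edges)
    return hist / hist.sum()
```

The resulting normalised joint histogram, here of size n_bins**n_components, plays the role of the regional video descriptor described in the following subsections.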
4.1 Spatio-Temporal Scale-Space Representation
The first processing step is spatial and temporal smoothing to compute the time-causal spatio-temporal scale-space representation of the current frame at the chosen spatial and temporal scales. The spatial smoothing is done in a cascade over spatial scales, and the temporal smoothing is performed recursively over time and in a cascade over temporal scales according to [44]. The fully separable convolution kernels and the time-recursive implementation of the temporal smoothing imply that the smoothing can be performed in a computationally efficient way.
4.2 Spatio-Temporal Receptive Field Responses
After the smoothing step, the spatio-temporal receptive field responses \(F = [F_1, F_2, \ldots , F_N]\) are computed densely over the current frame for a set of N chosen scale-space derivative filters. These filters include partial derivatives over a set of different spatial and temporal differentiation orders at single or multiple spatial and temporal scales. Alternatively, for a video descriptor based on differential invariants, the features correspond to such differential invariants computed from the partial derivatives. Using filters at multiple scales enables capturing image structures of different spatial extent and temporal duration. All derivatives are scale-normalised.
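For the spatial domain, scale normalisation of a derivative of total order m + n at scale \(s = \sigma ^2\) amounts to multiplication by \(s^{(m+n)\gamma /2}\). A sketch for the purely spatial case, using sampled Gaussian derivatives in place of the discrete kernels of the time-causal framework (the function name is hypothetical; temporal derivatives are normalised analogously with \(\tau \) and \(\gamma _{\tau }\)):

```python
import numpy as np
from scipy import ndimage

def scale_normalised_derivative(frame, sigma, order_y, order_x, gamma=1.0):
    """Scale-normalised spatial Gaussian derivative at scale s = sigma**2:
    L_norm = s**((order_x + order_y) * gamma / 2) * L_{x^m y^n}."""
    L = ndimage.gaussian_filter(np.asarray(frame, dtype=float), sigma,
                                order=(order_y, order_x))
    s = sigma ** 2
    return s ** ((order_x + order_y) * gamma / 2.0) * L
```

For a linear intensity ramp, the interior response of the normalised first-order x-derivative is then approximately sigma, which makes responses at different scales mutually comparable.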
Previous methods utilising “ideal” (as opposed to learned) spatio-temporal filters have used families of filters created from a single filter type, e.g. by applying third-order filters in different directions in space–time [12] or using a Gabor filter extended by both spatial rotations and velocity transformations [20].
We here take a different approach, using a richer family of receptive fields encompassing different orders of spatial and temporal differentiation, specifically including mixed spatio-temporal derivatives. Using such a set enables representing local intensity variations of different orders. This is not possible if the filters are restricted to a single base filter, even if the filters were extended by spatio-temporal transformations. The spatio-temporal receptive field responses from such a family can thus be used to separate patterns based on a combination of, for example, first-order and second-order intensity variations. It also enables computing measures that place equal weight on first- and second-order changes.
Biological vision also comprises such richer sets of spatio-temporal receptive fields, which typically occur in pairs of odd-shaped and even-shaped receptive fields [74] and which can be well modelled by spatio-temporal scale-space derivatives for different orders of spatial and temporal differentiation [44]. As previously discussed in Sect. 3, a more general model would be to combine these spatio-temporal receptive fields with velocity adaptation over a set of Galilean transformations. This extension is, however, left for future work.
A more detailed discussion concerning the choice of the specific receptive field sets that our video descriptors are based on is given in Sect. 4.6.1.
4.3 Dimensionality Reduction with PCA
When combining a large number of local image measurements, most of the variability in a dataset will be contained in a lower-dimensional subspace of the original feature space. This opens up for dimensionality reduction of the local feature vectors, which enables summarising a larger set of receptive field responses without resulting in a prohibitively large joint histogram descriptor.
For this purpose, a dimensionality reduction step in the form of principal component analysis is added before constructing the joint histogram. The result is a local feature vector \(\tilde{F}(x,y,t) = [\tilde{F}_1, \tilde{F}_2, \ldots , \tilde{F}_M] \in \mathbb {R}^M\) with \(M \le N\), where the new components correspond to linear combinations of the filter responses from the original feature vector F.
The transformation properties of the spatio-temporal scale-space derivatives imply scale covariance for such linear combinations of receptive fields, as well as for the individual receptive field responses, if using the proper scale normalisation with \(\gamma _s = 1\) and \(\gamma _{\tau } = 1\). A more detailed treatment of the scale covariance properties is given in “Appendix A.1”. The main reasons for choosing PCA are its simplicity and the fact that it has empirically been shown to give good results for spatial images [38]. The number of principal components M can be adapted to requirements on descriptor size and the need for detail in modelling the local image structure. The dimensionality reduction step can also be skipped if working with a smaller number of receptive fields.
4.4 Joint Receptive Field Histograms
The approach that we follow to represent videos of dynamic textures is to use histograms of spatio-temporal receptive field responses. Notably, such a histogram representation discards information about the spatial positions and the temporal moments that the feature responses originate from. Because the histograms are computed from spatio-temporal receptive field responses in terms of spatio-temporal derivatives, these feature responses will, however, implicitly code for partial spatial and temporal information, like pieces in a spatio-temporal jigsaw puzzle. By computing these receptive field responses over multiple spatial and temporal scales, we additionally capture primitives of different spatial size and temporal duration. For spatial recognition, related histogram approaches have turned out to be highly successful, leading to approaches such as SIFT [48], HOG [9] and HOF [10] and bag-of-words models. For the task of texture recognition, the loss of information caused by discarding spatial positions is also less critical, since many textures can be expected to possess certain stationarity properties. In this work, we use an extension of this paradigm of histogram-based image descriptors to build a conceptually simple system for video analysis.
The receptive field responses up to a certain order represent information corresponding to the coefficients of a Taylor expansion around each point in space–time. Each histogram cell corresponds to a certain “template” local space–time structure, encoding joint conditions on the magnitudes of the spatio-temporal receptive field responses (here, image derivatives, differential invariants or PCA components). This is somewhat similar to, for example, the templates used in VLBP [87], but notably represented and computed using different primitives. The normalised histogram video descriptor then captures the relative frequency of such space–time structures in a space–time region. The number of different local “templates” is determined by the number of receptive fields/principal components and the number of bins.
A joint histogram video descriptor explicitly models the covariation of different types of image measurements, in contrast to the more common choice of descriptors based on marginal distributions or relative feature strength (see, e.g. [12, 20]). A simple example of this is that a joint histogram over \(L_x\) and \(L_y\) will reflect the orientations of gradients over image space, which would not be sufficiently captured by the corresponding marginal histograms. Using joint histograms similarly implies the ability to represent other types of patterns that correspond to certain relationships between groups of features, such as how receptive field responses covary over different spatial and temporal scales.
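This can be illustrated with a small synthetic example (hypothetical feature values, not actual receptive field responses): two populations of \((L_x, L_y)\) responses, with gradients along the main diagonal versus the anti-diagonal, have identical marginal histograms but clearly different joint histograms:

```python
import numpy as np

# (L_x, L_y) samples: population A has gradients along the main diagonal,
# population B along the anti-diagonal
A = np.array([[1, 1], [-1, -1]] * 500, dtype=float)
B = np.array([[1, -1], [-1, 1]] * 500, dtype=float)

edges = [np.array([-2.0, 0.0, 2.0])] * 2       # two bins per component
joint_A, _ = np.histogramdd(A, bins=edges)
joint_B, _ = np.histogramdd(B, bins=edges)
marg_A = [np.histogram(A[:, j], bins=edges[j])[0] for j in range(2)]
marg_B = [np.histogram(B[:, j], bins=edges[j])[0] for j in range(2)]

# The marginals coincide, so only the joint histogram can tell A from B
```

Here the two gradient orientations occupy different cells of the joint histogram, while each marginal sees the same half-and-half split of positive and negative responses.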
In the general case, these histograms should be computed regionally, over different regions over space and/or time, e.g. to separate regions that contain different types of dynamic textures. For almost all experiments in this study, however, the space–time region for the histogram descriptor will be chosen as the entire video, leading to a single global histogram descriptor per dynamic texture video. The single exception is the experiment presented in Sect. 7.6, where we compute histograms over a smaller number of video frames. The reason for primarily using global histograms is that the videos in the DynTex and UCLA benchmarks are pre-segmented to contain a single dynamic texture class per video. Thereby, we can make this conceptual simplification for the experimental evaluation of our different types of video descriptors.
It should be noted that, if represented naively, a subset of the histogram descriptors evaluated here would be prohibitively large. However, for such high-dimensional descriptors the number of nonzero histogram cells can be considerably lower than the maximum number of cells. This implies that they can be represented compactly using the computationally efficient sparse representation outlined by Linde and Lindeberg [38].
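A dictionary-based sketch of such a sparse histogram representation (in the spirit of [38]; the function name and uniform binning are assumptions for illustration):

```python
import numpy as np

def sparse_joint_histogram(features, n_bins):
    """Sparse joint histogram of local feature vectors.

    features -- array (n_points, M) of, e.g., PCA-reduced receptive field
                responses; each component is binned uniformly between its
                minimum and maximum.
    Returns a dict mapping occupied bin-index tuples to relative frequencies,
    so storage grows with the number of nonzero cells, not with n_bins**M.
    """
    X = np.asarray(features, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    idx = ((X - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)
    hist = {}
    for cell in map(tuple, idx):
        hist[cell] = hist.get(cell, 0) + 1
    n = X.shape[0]
    return {cell: count / n for cell, count in hist.items()}
```

For, e.g., ten components and five bins, the dense histogram would have 5^10 (roughly ten million) cells, whereas the number of occupied cells is bounded by the number of image points.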
4.5 Covariance and Invariance Properties of the Video Descriptor
The scale covariance properties of the spatio-temporal receptive fields and the PCA components, according to the theory presented in “Appendix A.1”, imply that a histogram descriptor constructed from these primitives will be scale-covariant for all nonzero spatial scaling factors and for temporal scaling factors that are integer powers of the distribution parameter c of the time-causal limit kernel. This means that our proposed video descriptors can be used as the building blocks of a scale-invariant recognition framework.
Table 1  The video descriptors investigated in this paper and the receptive field sets they are based on

RF Spatial: \(\{ L_x, L_y, L_{xx}, L_{xy}, L_{yy} \}\)

STRF N-jet: \(\{ L_x, L_y, L_{xx}, L_{xy}, L_{yy} \}\), \(\{ L_t, L_{tt} \}\), \(\{ L_{xt}, L_{yt}, L_{xxt}, L_{xyt}, L_{yyt} \}\), \(\{ L_{xtt}, L_{ytt}, L_{xxtt}, L_{xytt}, L_{yytt} \}\)

STRF RotInv: \(\{ \nabla L, \nabla L_t, \nabla L_{tt} \}\), \(\{ \nabla ^2 L, \nabla ^2 L_t, \nabla ^2 L_{tt} \}\), \(\{ \det \mathscr {H} L, \det \mathscr {H} L_t, \det \mathscr {H} L_{tt} \}\)

STRF N-jet (previous [27]): \(\{ L_x, L_y, L_{xx}, L_{xy}, L_{yy} \}\), \(\{ L_t, L_{tt} \}\), \(\{ L_{xt}, L_{yt}, L_{xxt}, L_{xyt}, L_{yyt} \}\)
4.6 Choice of Receptive Fields and Descriptor Parameters
The basic video descriptor described above will give rise to a family of video descriptors when varying the set of receptive fields and the descriptor parameters. Here, we describe the different options investigated in this work considering: (i) the set of receptive fields, (ii) the number of bins and the number of principal components used for constructing the histogram and (iii) the spatial and temporal scales of the receptive fields.
4.6.1 Receptive Field Sets
The set of receptive fields used as primitives for constructing the histogram will determine the type of information that is represented in the video descriptor. A straightforward example of this is that using rotationally invariant differential invariants will imply a rotationally invariant video descriptor. A second example is that including or excluding purely temporal derivatives will enable or disable capturing temporal intensity changes not mediated by spatial motion. We have chosen to compare video descriptors based on four different receptive field groups as summarised in Table 1.
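As an illustration of the first example, spatial counterparts of the rotationally invariant set in Table 1 could be computed as follows (a sketch using sampled Gaussian derivatives rather than the discrete kernels of the time-causal framework, with the gradient magnitude taken as the rotationally invariant first-order quantity; in the full descriptor, the same invariants are also computed for the temporal derivatives \(L_t\) and \(L_{tt}\)):

```python
import numpy as np
from scipy import ndimage

def rotational_invariants(frame, sigma=2.0):
    """Gradient magnitude, Laplacian and Hessian determinant of the
    Gaussian-smoothed frame -- all invariant under image rotations."""
    f = np.asarray(frame, dtype=float)
    Lx = ndimage.gaussian_filter(f, sigma, order=(0, 1))
    Ly = ndimage.gaussian_filter(f, sigma, order=(1, 0))
    Lxx = ndimage.gaussian_filter(f, sigma, order=(0, 2))
    Lyy = ndimage.gaussian_filter(f, sigma, order=(2, 0))
    Lxy = ndimage.gaussian_filter(f, sigma, order=(1, 1))
    grad_mag = np.sqrt(Lx ** 2 + Ly ** 2)   # |grad L|
    laplacian = Lxx + Lyy                   # nabla^2 L
    det_hessian = Lxx * Lyy - Lxy ** 2      # det(H L)
    return grad_mag, laplacian, det_hessian
```

Rotating the input image then simply rotates each invariant response map, so histograms of these features are unaffected by image rotations.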
First, note that all video descriptors, except STRF Njet (previous), include first and secondorder spatial and temporal derivatives in pairs \(\{L_x, L_{xx}\}\), \(\{L_t, L_{tt}\}\), \(\{L_{xt}, L_{xtt}\}\), \(\{ L_{xxt}, L_{xxtt}\}\), etc. The motivation for this is that first and secondorder derivatives provide complementary information, and by including both, equal weight is put on first and secondorder information. It has specifically been observed that biological receptive fields occur in pairs of oddshaped and evenshaped receptive field profiles that can be well approximated by Gaussian derivatives [32, 42, 74]. In the following, we describe the four video descriptors in more detail and further motivate the choice of their respective receptive field sets.
RF Spatial is a purely spatial descriptor based on the full spatial Njet up to order two. This descriptor will capture the spatial patterns in “snapshots” of the scene (single frames) independent of the presence of movement. Using spatial derivatives up to order two means that each histogram cell template will represent a discretised secondorder approximation of the local spatial image structure. An additional motivation for using this receptive field set is that this descriptor is one of the bestperforming spatial descriptors for the receptive fieldbased object recognition method in [38]. This descriptor is primarily included as a baseline to compare the spatiotemporal descriptors against.
STRF Njet is a directionally selective spatiotemporal descriptor, where the first and secondorder spatial derivatives are complemented with the first and secondorder temporal derivatives of these as well as the first and secondorder temporal derivatives of the smoothed video L. Including purely temporal derivatives means that the descriptor can capture intensity changes not mediated by spatial motion (flicker). The set of mixed spatiotemporal derivatives will on the other hand capture the interplay between changes over the spatial and temporal domains, such as movements of salient spatial patterns. An additional motivation for including mixed spatiotemporal derivatives is that they represent features that are well localised with respect to joint spatiotemporal scales. This implies that when using multiple scales, a descriptor including mixed spatiotemporal derivatives will have better ability to separate spatiotemporal patterns at different spatiotemporal scales.
STRF RotInv is a rotationally invariant video descriptor based on a set of rotationally invariant features over the spatial domain: the spatial gradient magnitude \( |\nabla L| = \sqrt{L_x^2 + L_y^2}\), the spatial Laplacian \(\nabla ^2 L = L_{xx} + L_{yy}\) and the determinant of the spatial Hessian \(\det \mathscr {H} L = L_{xx}L_{yy} - L_{xy}^2\).^{1} These are evaluated on the smoothed video L directly and on the first and secondorder temporal derivatives of the scalespace representation \(L_t\) and \(L_{tt}\). One motivation for choosing these spatial differential invariants is that they are functionally independent and span the space of rotationally invariant first and secondorder differential invariants over the spatial domain. This set of rotationally invariant features was also demonstrated to be the basis of one of the bestperforming spatial descriptors in [38]. By applying these differential operators to the first and secondorder temporal derivatives of the video, the interplay between temporal and spatial intensity variations is captured.
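As a concrete illustration, the three spatial differential invariants can be computed directly from first and secondorder derivative responses. The sketch below approximates the derivatives with central finite differences on an already smoothed frame (the framework itself uses Gaussian derivatives at a given spatial scale, so this is only a minimal stand-in; the function name is ours):

```python
import numpy as np

def spatial_invariants(L):
    """Rotationally invariant spatial differential invariants of a
    (pre-smoothed) frame L: gradient magnitude, Laplacian and the
    determinant of the Hessian. Derivatives are approximated here with
    central finite differences, accurate in the frame interior."""
    Ly, Lx = np.gradient(L)        # first-order derivatives (axis 0 = y)
    Lyy, Lyx = np.gradient(Ly)     # second-order derivatives
    Lxy, Lxx = np.gradient(Lx)
    grad_mag = np.sqrt(Lx**2 + Ly**2)       # |grad L|
    laplacian = Lxx + Lyy                   # Laplacian of L
    det_hessian = Lxx * Lyy - Lxy * Lyx    # det(H L)
    return grad_mag, laplacian, det_hessian
```

For the rotationally symmetric paraboloid \(L = (x-x_0)^2 + (y-y_0)^2\), the interior Laplacian and Hessian determinant are constant (4), independent of how the frame is oriented, which illustrates the rotation invariance of these measurements.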
It can be noted that none of these video descriptors makes use of the full spatiotemporal 4jet. This reflects the philosophy of treating space and time as distinct dimensions, where the most relevant information lies in the interplay between spatial changes (here, of first and secondorder) with temporal changes (here, of first and secondorder). Third and fourthorder information with respect to either the spatial or the temporal domain is thus discarded. Receptive field responses for two videos of dynamic textures are shown for spatiotemporal partial derivatives in Fig. 3 and for rotational differential invariants in Fig. 4.
It should be noted that the recognition framework presented here also allows for using nonseparable receptive fields with nonzero image velocities. Exploring this is, however, left for future work, and in this study we instead focus on evaluating different sets of space–time separable receptive fields (see also the discussion in Sect. 3).
4.6.2 Number of Bins and Principal Components
4.6.3 Binary Histograms
When choosing \(n_\mathrm{bins}=2\), equivalent to a joint binary histogram, the local image structure is described by only the sign of the different image measurements. This will make the descriptor invariant to uniform rescalings of the intensity values, such as multiplicative illumination transformations, or indeed any change that does not affect the sign of the receptive field response. Binary histograms in addition enable combining a larger number of image measurements without a prohibitively large descriptor dimensionality and have proven to be an effective approach, as demonstrated by a large number of LBPinspired methods.
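A minimal sketch of such a joint binary histogram, assuming the receptive field (or principal component) responses have already been sampled into an array of shape (n_fields, n_points); the helper name is ours:

```python
import numpy as np

def binary_joint_histogram(responses):
    """Joint binary histogram over receptive field responses: each
    spatio-temporal sample point is mapped to one of 2**n_fields cells
    according to the signs of its n_fields responses, and the counts
    are L1-normalised into a descriptor."""
    n_fields, _ = responses.shape
    signs = (responses > 0).astype(np.int64)       # keep only the sign
    cell_index = (2 ** np.arange(n_fields)) @ signs
    hist = np.bincount(cell_index, minlength=2 ** n_fields)
    return hist / hist.sum()
```

Because only signs are recorded, the descriptor is unchanged under any positive rescaling of the responses, which is exactly the invariance to multiplicative illumination transformations noted above.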
4.6.4 Spatial and Temporal Scales
A more general approach than using fixed pairs of adjacent scales is to operate on a wider range of spatiotemporal scales in parallel. For example, for breaking water waves that roll onto a beach, the coarser scale receptive fields will respond to the gross motion pattern of the water waves, whereas the finerscale receptive fields will respond to the detailed finescale motion pattern of the water surface. A generalpurpose vision system should have the ability to dynamically operate over such different subsets of spatial and temporal scales, to extract maximum amount of relevant information about a dynamic scene. Specifically, there is interesting potential in determining local spatial and temporal scale levels adaptively from the video data, using recently developed methods for spatiotemporal scale selection [45, 47]. We leave such extensions for future work.
5 Datasets
We evaluate our proposed approach on six standard dynamic texture recognition/classification benchmarks from two widely used dynamic texture datasets: UCLA [66] and DynTex [55]. We here give a brief description of the datasets and the benchmarks. Sample frames from the datasets are shown in Fig. 5 (UCLA) and Fig. 6 (DynTex).
5.1 UCLA
The UCLA dataset was introduced by Soatto et al. [66] and is composed of 200 videos (160 \(\times \) 110 pixels, 15 fps) featuring 50 different dynamic textures with 4 samples from each texture. The UCLA50 benchmark [66] divides the 200 videos into 50 classes with one class per individual texture/scene. It should be noted that this partitioning is not conceptual in the sense of the classes constituting different types of textures such as “fountains”, “sea” or “flowers” but instead targets instancespecific and viewpointspecific recognition. This means that not only different individual fountains but also the same fountain seen from two different viewpoints should be separated from each other.
Since for many applications it is more relevant to recognise different dynamic texture categories, a partitioning of the UCLA dataset into conceptual classes, UCLA9, was introduced by Ravichandran et al. [60] with the following classes: boiling water (8), fire (8), flowers (12), fountains (20), plants (108), sea (12), smoke (4), water (12) and waterfall (16), where the numbers correspond to the number of samples from each class. Because of the large overrepresentation of plant videos for this benchmark, in the UCLA8 benchmark, those are excluded to get a more balanced dataset, leaving 92 videos from eight conceptual classes.
5.2 DynTex
A larger and more diverse dynamic texture dataset, DynTex, was introduced by Péteri et al. [55], featuring a larger variation of dynamic texture types recorded under more diverse conditions (720 \(\times \) 576 pixels, 25 fps). From this dataset, three gradually larger and more challenging benchmarks have been compiled by Dubois et al. [13]. The Alpha benchmark includes 60 dynamic texture videos from three different classes: sea, grass and trees. There are 20 examples of each class and some variations in scale and viewpoint. The Beta benchmark includes 162 dynamic texture videos from ten classes: sea, vegetation, trees, flags, calm water, fountain, smoke, escalator, traffic and rotation. There are 7–20 examples of each class. The Gamma benchmark includes 264 dynamic texture videos from ten classes: flowers, sea, trees without foliage, dense foliage, escalator, calm water, flags, grass, traffic and fountains. There are 7–38 examples of each class, and this benchmark features the largest intraclass variability in terms of scale, orientation, etc.
6 Experimental Setup
6.1 Benchmark CrossValidation Schemes
The standard test setup for the UCLA50 benchmark, which we adopt also here, is fourfold crossvalidation [66]. For each partitioning, three out of four samples from each dynamic texture instance are used for training, while the remaining one is held out for testing.
The standard test setup for the UCLA8 and UCLA9 benchmarks is to report the average accuracy over 20 random partitions, with 50% data used for training and 50% for testing (randomly bisecting each class) [19]. We use the same setup here, except that we report results as an average over 1000 trials to get more reliable statistics. This is because we noted that, because of the small size of the dataset, the specific random partitioning will otherwise affect the result. For all the UCLA benchmarks, in contrast to the most common setup of using manually extracted patches, we use the noncropped videos; thus, our setup could be considered a slightly harder problem.
For the DynTex benchmarks, the experimental setup used is leaveoneout crossvalidation as in [3, 24, 56, 84]. We perform no subsampling of videos but use the full 720 \(\times \) 576 pixel frames.
6.2 Classifiers
We present results of both using a support vector machine (SVM) classifier and a nearest neighbour (NN) classifier, the latter to evaluate the performance also of a simpler classifier without hidden tunable parameters. For NN we use the \(\chi ^2\)distance \(d(x,y)= \sum _{i} (x_i-y_i)^2/ (x_i+y_i)\) to compute the distance between two histogram video descriptors, and for SVM we use the \(\chi ^2\)kernel \(e^{-\gamma d(x,y)}\). Results are quite robust to the choice of the SVM hyperparameters \(\gamma \) and C. We here use \(\gamma = 0.1\) and \(C = 10{,}000\) for all experiments.
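In code, the \(\chi^2\)-distance and the corresponding kernel (with the conventional minus sign in the exponent) can be written as follows; cells where both histograms are empty are skipped to avoid division by zero:

```python
import numpy as np

def chi2_distance(x, y):
    """Chi-squared distance d(x, y) = sum_i (x_i - y_i)^2 / (x_i + y_i)
    between two histogram descriptors; cells with x_i + y_i = 0
    contribute nothing and are skipped."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mask = (x + y) > 0
    return np.sum((x[mask] - y[mask]) ** 2 / (x[mask] + y[mask]))

def chi2_kernel(x, y, gamma=0.1):
    """Chi-squared SVM kernel exp(-gamma * d(x, y)), with gamma = 0.1
    as used in the experiments."""
    return np.exp(-gamma * chi2_distance(x, y))
```

The distance is symmetric and zero for identical descriptors, so the kernel evaluates to 1 on the diagonal of the Gram matrix.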
6.3 Descriptor Parameter Tuning
Comparisons with state of the art and between video descriptors based on different sets of receptive fields are made using binary descriptors. Parameter tuning is performed as a grid search over the number of principal components \(n_\mathrm{comp} \in [2,17]\), spatial scales \(\sigma _s \in \{1, 2, 4, 8, 16\}\) and temporal scales \( \sigma _\tau \in \{50,100, 200, 400\}\). For spatial and temporal scales, we consider both single scales and combinations of two adjacent spatial and temporal scales. The standard evaluation protocols for the respective benchmarks (i.e. slightly different crossvalidation schemes) are used for parameter selection, and results are reported for each video descriptor using the optimal set of parameters.
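The tuning procedure amounts to an exhaustive search over this grid; a sketch, where `evaluate` stands in for running the benchmark's standard cross-validation protocol for one parameter setting (both function names are ours):

```python
import itertools

def tune_descriptor(evaluate):
    """Grid search over the number of principal components and over
    single scales plus pairs of adjacent spatial/temporal scales;
    returns the setting with the highest cross-validation accuracy."""
    n_comp_values = range(2, 18)          # n_comp in [2, 17]
    spatial = [1, 2, 4, 8, 16]
    temporal = [50, 100, 200, 400]
    spatial_opts = [(s,) for s in spatial] + list(zip(spatial, spatial[1:]))
    temporal_opts = [(t,) for t in temporal] + list(zip(temporal, temporal[1:]))
    best = max(
        itertools.product(n_comp_values, spatial_opts, temporal_opts),
        key=lambda p: evaluate(*p))
    return best   # (n_comp, spatial scales, temporal scales)
```

With 16 choices of \(n_\mathrm{comp}\), 9 spatial-scale options and 7 temporal-scale options, the grid contains 1008 settings per benchmark.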
7 Experiments
Our first experiments consider a qualitative and quantitative evaluation of different versions of our video descriptors, where we present results on: (i) varying the number of bins and principal components, (ii) using different spatial and temporal scales for the receptive fields and (iii) comparing descriptors based on different sets of receptive fields. This is followed by (iv) a comparison with stateoftheart dynamic texture recognition methods and finally (v) a qualitative analysis on reasons for errors.
7.1 Number of Bins and Principal Components
The classification performance of the STRF Njet descriptor as a function of the number of bins and the number of principal components used in the histogram descriptor is presented in Fig. 7 for the UCLA8 and UCLA50 benchmarks and in Fig. 8 for the Beta and Gamma benchmarks. A first observation is that, not surprisingly, when using a smaller number of principal components, each dimension needs to be divided into a larger number of bins to achieve good performance, e.g. for \(n_\mathrm{comp} \in {\{2, 3\}}\) the best performance is achieved for \(n_\mathrm{bins} \ge 15\) for all benchmarks. To discriminate between a large number of spatiotemporal patterns using only a few image measurements, these need to be more precisely recorded. A qualitative difference between using an odd or an even number of bins for \(n_\mathrm{bins} \le 8\) can also be noted. This can be explained by a qualitative difference in the handling of feature values close to zero.
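The odd/even difference comes from where zero falls relative to the bin edges; a small sketch with uniformly spaced bins on a symmetric range (the function name and range are ours):

```python
import numpy as np

def bin_index(value, n_bins, lo=-1.0, hi=1.0):
    """Index of the histogram bin containing `value` among n_bins
    uniformly spaced bins on [lo, hi]. With an odd n_bins one bin is
    centred on zero; with an even n_bins zero lies on a bin edge."""
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.searchsorted(edges, value, side='right') - 1
    return int(np.clip(idx, 0, n_bins - 1))
```

Two near-zero responses of opposite sign share the central bin when \(n_\mathrm{bins} = 3\) but fall in different bins when \(n_\mathrm{bins} = 2\), so even-binned histograms respond differently to small fluctuations around zero-crossings.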
At the other end of the spectrum, it can be seen that when using a large number of principal components, fewer bins suffice. Using a large number of spatiotemporal primitives in combination with a small number of bins means that the different qualitative “types” of patterns are more diverse, while at the same time being less “precise” in the sense of being unaffected by small changes in the magnitude of the filter responses. Binary or ternary descriptors are thus less sensitive to variations of the same rough type of space–time structure. Indeed, for binary descriptors only the sign of the receptive field response is recorded and a binary descriptor thus gives full invariance to, for example, multiplicative illumination transformations.
We thus conclude that binary histogram descriptors are a very useful option, combining top performance with simplicity. Therefore, we in the following investigate the effect of varying the remaining descriptor parameters using binary descriptors only.
7.2 Spatial and Temporal Scales
Each dataset will have a set of scales that are better for describing the spatial patterns and the motion patterns present in the videos. The classification performance of the STRF Njet descriptor as a function of the spatial and the temporal scales of the receptive fields for different combinations of a single spatial scale \(\sigma _s \in \{1,2,4,8,16 \}\) and a single temporal scale \(\sigma _\tau \in \{50,100,200,400 \}\) is shown in Fig. 9 for the UCLA benchmarks and in Fig. 10 for the DynTex benchmarks. All results have been obtained with \(n_\mathrm{comp} = 15 \) and \(n_\mathrm{bins} = 2\).
For all the UCLA benchmarks, an approximately unimodal maximum over scales is obtained. For the UCLA8 and UCLA9 benchmarks, the best performance is obtained when combining a smaller spatial scale with a shorter temporal scale. For the UCLA50 benchmark, the best results are instead achieved for shorter temporal scales in combination with larger spatial scales. The observation that a short temporal scale works well for all benchmarks could indicate that fast motions are discriminative. That the best spatial scales are different for UCLA50 is not strange, since this benchmark features instance recognition (e.g. separating 108 different plants) rather than generalising between classes. Although it might feel intuitive that small details should be useful for instance recognition, this will depend on the dataset. For example, plants with similar leaves but different global growth patterns could be easier to separate at a larger spatial scale.
For the DynTex benchmarks, the scale combinations that give the best results are scattered rather than showing a unimodal maximum. This could indicate that the different subsets of dynamic textures are best separated at different (and nonadjacent) scales. Since the DynTex dataset is quite diverse, this would not be strange. It should also be noted that the differences between the best and the second best results are here typically only one or two correctly classified videos. It is, however, clear that using the largest spatial scale in combination with the longest temporal scale gives markedly worse results.
When using \(2\times 2\) scales, we noted a similar performance pattern during scale tuning, with unimodal maxima for the UCLA benchmarks and scattered maxima for the DynTex benchmarks (not shown). Comparing the absolute performance when using single versus multiple scales, whether multiple scales give a consistent advantage depends on the receptive field set. If inspecting the sets of optimal parameters found for the different benchmarks (presented in “Appendix A.2”), it can be noted that, for the STRF Njet descriptor, the best results are sometimes achieved using a single scale and sometimes when using \(2\times 2\) scales. However, STRF Njet includes a quite large number (17) of receptive fields, and when using a smaller set of receptive fields, such as in RF Spatial (5 receptive fields) or STRF RotInv (9 receptive fields), video descriptors using multiple spatiotemporal scales consistently have the best performance. This shows that receptive fields at different scales can contain complementary information.
Table 2  Comparison to state of the art for the UCLA benchmarks
7.3 Receptive Field Sets
We compare video descriptors based on the four different receptive field sets summarised in Table 1:
(i) the new spatiotemporal descriptor STRF Njet,
(ii) the rotationally invariant descriptor STRF RotInv,
(iii) the purely spatial RF Spatial and
(iv) the previous spatiotemporal descriptor STRF Njet (previous) [27].
7.3.1 STRF NJet (Previous) Versus STRF NJet
We note that parameter tuning and adding the secondorder temporal derivatives of the spatial derivatives result in improved performance for our new STRF Njet descriptor compared to the STRF Njet (previous) descriptor [27]. The new descriptor shows improved accuracy for all the benchmarks. We have also observed an improvement from both these changes individually (not explicitly shown here).
7.3.2 SpatioTemporal Versus Spatial Descriptors
Table 3  Comparison to state of the art for the DynTex benchmarks
The largest improvement is obtained for the Gamma benchmark, where adding spatiotemporal receptive fields reduces the error from 9.5 to 4.5% when using an SVM classifier. Smaller improvements are obtained for the UCLA8 and UCLA9 benchmarks, with a reduction in error from 2.2 to 1.0% and from 1.4 to 0.8%, respectively. For the UCLA50 benchmark, the performance saturates at 100% for both descriptor types (rather indicating the relative simplicity of this benchmark). The only exception where RF Spatial shows better performance is for the Beta benchmark using a NN classifier. Here, the purely spatial descriptor achieves 5.6% error versus 6.2% error for STRF Njet.
Competitive performance for purely spatial descriptors on the Beta benchmark has been reported previously [56], and we here make a similar observation. Thus, not surprisingly, for some settings genuine spatiotemporal information is of greater importance than for others. Here, the largest gain is indeed obtained for the most complex task.
7.3.3 Rotationally Invariant Descriptors
The rotationally invariant STRF RotInv descriptor does not achieve quite as good performance as the directionally dependent STRF Njet descriptor for the tested benchmarks. The difference in classification accuracy in favour of the directionally selective descriptor is most pronounced for the more complex DynTex benchmarks: STRF RotInv achieves 7.4 and 6.8% error on the Beta and Gamma benchmarks using an SVM classifier, compared to STRF Njet with 4.9 and 4.5% error, respectively. However, a comparison with state of the art in Tables 2 and 3 reveals that the STRF RotInv descriptor still achieves competitive performance compared to other dynamic texture recognition approaches.
It is of conceptual interest that these good results can be obtained also when disregarding orientation information completely. Indeed, if considering marginal histograms of receptive field responses, the most striking differences between texture classes such as waves, grass and foliage are the typical directions of change. (Waves show a stronger gradient in the vertical direction, grass in the horizontal, and foliage in both.) A qualitative conclusion is that directional information is not the main mode of recognition here; instead, the local space–time structure independent of orientation is highly discriminative. We conclude that our proposed STRF RotInv descriptor could be a viable option for tasks where rotation invariance is of greater importance than for these benchmarks. However, the possible gain from enabling recognition of textures at orientations not present in the training data will have to be balanced against the possible gain from discriminative directional information.
7.4 Comparison to State of the Art
This section presents a comparison between our proposed approach and stateoftheart dynamic texture recognition methods. We include video descriptors constructed from four different sets of receptive fields (see Table 1) and compare against the bestperforming methods found in the literature for each benchmark. We also aim to include a range of different types of approaches with an extra focus on methods similar to ours, i.e. different LBP versions and relatively shallow (max 2 layers) spatiotemporal filteringbased approaches using either handcrafted filters or filters learned from data. Results for all the other methods are taken from the literature, where the relevant references are indicated in the table.
7.4.1 UCLA Datasets
The UCLA benchmark results are presented in Table 2. Our proposed STRF Njet descriptor shows highly competitive performance compared to all the other methods, achieving the highest mean accuracy averaged over all the benchmarks and either the single best or the shared best result on four out of the six benchmarks.
For the UCLA50 benchmark, our three new video descriptors achieve 0% error using both an SVM and a NN classifier. The main difference between these descriptors and the untuned STRF Njet (previous) is the use of a larger spatial scale, which is seen in Sect. 7.2 to be more adequate for this benchmark. Enhanced LBP [61] and Ensemble SVMs [84] also achieve 0% error rate, and there are several methods with error rates below 0.5%. The main conclusions we draw from the UCLA50 results are that recognising the same dynamic texture instance from the same viewpoint is (not surprisingly) a comparatively easier task than separating conceptual classes and that our approach performs on par with the best stateoftheart methods on this task.
For the conceptual UCLA8 and UCLA9 benchmarks using an NN classifier, our STRF Njet descriptor achieves 0.9 and 1.0% error, respectively, which are the single best results among all methods. This demonstrates that our approach is stable and works well with a simple classifier also for a quite highdimensional descriptor. For the UCLA8 benchmark together with a NN classifier, the second bestperforming approach is our rotationally invariant descriptor STRF RotInv with 1.2% error and after that MEWLSP [72] with 2% error. For UCLA9, the second bestperforming approach is MBSIFTOP [3] with 1.2% error followed by STRF RotInv and MEWLSP, which both achieve 1.4% error.
For the UCLA8 benchmark combined with an SVM classifier, the best performing approaches are OTD [59] and 3DOTF [82] both with 0.5% error. For UCLA9, the best method using an SVM classifier is DNGP [62], which achieves 0.4% error. Our STRF Njet descriptor achieves 1.0% error on the UCLA8 benchmark and 0.8% error on the UCLA9 benchmark. It should be noted that OTD, 3DOTF and DNGP simultaneously show considerably worse results on the NN benchmarks and that the standard UCLA protocol (average over 20 trials) can give quite variable results because of the limited number of samples in the benchmarks. Averaging over 1000 trials means that our results are more stable and less likely to include “outliers” for some of the benchmarks.
Our approach shows improved results on all the UCLA benchmarks compared to a large range of similar methods also based on gathering statistics of local space–time structure but using different spatiotemporal primitives. This includes methods that are more complex in the sense of combining several different descriptors or a larger number of feature extracting steps (MEWLSP [72], HOGNSP [53]), methods learning higherlevel hierarchical features (PCANetTOP [2], SKDL [58], temporal dropout DL, DTCNN [1]) and improved and extended LBPbased methods (Enhanced LBP [61], MBSIFTOP [3], MEWLSP [72], CVLBP [71]) as well as the standard LBPTOP [87] and VLBP [86] descriptors. An interesting observation is also that compared to VLBP and CVLBP, which similarly to our approach use binary histograms and full 2D+T primitives, the performance of our approach is 2.1–10.5 percentage points better for all the benchmarks. The most important difference between these methods and our approach is indeed the spatiotemporal primitives used for computing the histogram.
7.4.2 DynTex Datasets
The DynTex benchmark results are presented in Table 3. For this larger and more complex dataset, it can be seen that utilising colour information and supervised hierarchical feature learning seems to give a clear advantage with three deep learning approaches on top. DTCNN, trained from scratch to extract features on three orthogonal planes, demonstrates the best performance with 0% error on the Alpha and Beta benchmarks and 0.4% error on the Gamma benchmark. Two deep learning methods based on feature extraction using pretrained networks (Deep Dual descriptor and stTCoF) also obtain very good results. However, although included for reference, we do not directly aim here to compete with these more conceptually complex methods. The main focus of our work is instead to evaluate the usefulness of the timecausal spatiotemporal primitives without entanglement with a more complicated hierarchical framework.
Our proposed STRF Njet descriptor achieves 0% error on the Alpha benchmark using both an SVM and a NN classifier, 4.9% (SVM) and 6.2% (NN) error on the Beta benchmark and 4.5% (SVM) and 8.8% (NN) error on the Gamma benchmark. This means that we achieve better results than all other nondeep learning methods utilising only greyscale information except one: MRSFA [50], which achieves 1% error on the Beta SVM benchmark and 1.9% error on the Beta NN benchmark. (This method has not been tested on the Alpha and Gamma benchmarks.) It should, however, be noted that MRSFA uses regional descriptors capturing the relative location of image structures and a bagofwords framework on top of the histogram descriptor. This approach is thus significantly more complex compared to our method. Both these extensions would be relatively straightforward to implement also using our proposed video descriptors.
Our results can also be compared to the LBPTOP extension AFSTOP, which shows the most competitive results using an SVM classifier with 1.7, 9.9 and 5.7% error, respectively, on the Alpha, Beta and Gamma benchmarks. Our approach thus achieves better performance on all the tested benchmarks, although AFSTOP includes several added features, such as removing outlier frames. Improvements compared to the basic LBPTOP descriptor are larger, and this can be considered a fairer comparison, since we are testing an early version of our approach. Compared to MBSIFTOP and PCANetTOP, which both learn 2D filters from data and apply those on three orthogonal planes, our approach also achieves better results on all the DynTex benchmarks.
We also show notably better results (on the order of 10–20 percentage points) than those reported from using DFS [83], OTD [59], SKDL [58] and the 2D+T curvelet transform [13]. However, those use a nearest centroid classifier and a different SVM traintest partition, which means that a direct comparison is not possible. We also note that although STRF Njet achieves the best results, the rotationally invariant descriptor version STRF RotInv, the RF Spatial descriptor and the untuned STRF Njet (previous) descriptor also achieve competitive performance. This demonstrates the robustness and flexibility of our approach.
In conclusion, our approach shows highly competitive performance for this larger and more complex benchmark, even though our proposed approach is a conceptually simple method utilising only local information. The STRF Njet descriptor achieves better performance than all other greyscale methods of similar conceptual complexity and better performance compared to both several more complex methods and two methods learning filters from data. We believe this should be considered as strong validation that these timecausal spatiotemporal receptive fields indeed capture useful spatiotemporal information.
7.5 Descriptor Size Versus Performance
The classification accuracy together with the descriptor size for increasing numbers of principal components \(n_\mathrm{comp}\) for the binary STRF Njet, STRF RotInv and RF Spatial descriptors is shown in Fig. 12. We include all combinations of \(2\times 2\) scales (see Sect. 4.6.4) for each value of \(n_\mathrm{comp}\) to illustrate the overall trend rather than focusing on one specific choice of spatial and temporal scales. Table 4 shows the best classification performance (after scale tuning) for the STRF Njet descriptor on the Gamma benchmark for each value of \(n_\mathrm{comp}\) together with the average number of nonempty histogram cells.
For all video descriptors, it can be seen that the performance first increases quickly with the descriptor size (here, equivalently the number of principal components used) up to around \(n_\mathrm{comp} = 10\). For the STRF Njet descriptor, some additional gains are then obtained, primarily between 10 and 14 principal components, after which the performance saturates. For the STRF RotInv descriptor, the performance saturates at a lower level than for the STRF Njet descriptor, indicating that by using only rotationally invariant receptive fields some discriminative information is lost. For the RF Spatial descriptor, the best performance is obtained using all principal components (5 receptive fields \(\times \) 2 spatial scales).
Table 4  Descriptor size versus classification accuracy for the binary STRF Njet descriptor evaluated on the Gamma benchmark
\(n_\mathrm{comp}\)  \(n_\mathrm{cells}\) (filled)  Accuracy (%)
2   \(4.0 \times 10^0\)   45.5
5   \(3.2 \times 10^1\)   86.0
8   \(2.6 \times 10^2\)   90.5
10  \(1.0 \times 10^3\)   93.6
12  \(4.1 \times 10^3\)   93.9
13  \(8.2 \times 10^3\)   94.7
14  \(1.6 \times 10^4\)   95.1
15  \(3.3 \times 10^4\)   94.7
16  \(6.5 \times 10^4\)   95.5
17  \(1.3 \times 10^5\)   95.1
Comparing the number of filled histogram cells with the corresponding powers of two, it can be seen that these binary histograms are almost filled. We found this to be true for the other binary histogram types tested here as well. In contrast to this result, we have noted as low as 0.01% filled histogram cells for some of the larger nonbinary histograms tested in Sect. 7.1.
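The near-full occupancy can be checked directly against the \(2^{n_\mathrm{comp}}\) cells of a binary histogram; a quick computation using a few of the filled-cell counts from Table 4:

```python
# Fraction of the 2**n_comp binary histogram cells that are non-empty,
# for a few (n_comp, filled cells) pairs read from Table 4.
table4 = {10: 1.0e3, 14: 1.6e4, 17: 1.3e5}
fill_rates = {n: filled / 2 ** n for n, filled in table4.items()}
# Every fill rate is above 95%: the binary histograms are almost full,
# so a dense array representation wastes little memory here.
```

This contrasts with the larger non-binary histograms, where fill rates as low as 0.01% motivate a sparse representation instead.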
Indeed, for nonbinary histograms the more precise conditions on the magnitude of the receptive field responses will imply a larger set of very “rare” patterns. Thus, for the larger nonbinary histograms there can be large benefits from using a sparse representation, while for the binary histograms tested here, a nonsparse representation is most advantageous.
7.6 Recognition Using Fewer Frames
It should be noted that the stationarity properties of dynamic textures will often manifest over timescales longer than a few frames. Thus, it is important that local (in time) parts of motion patterns can be matched to other local (in time) patterns, rather than forcing also a short sequence to match the statistics of a full video. We here extract a collection of shorter sequences of length \(n_\mathrm{frames} \in \{1,2,4,8,16,32,64,128\}\) from each video. Then, during training, the full set of shorter sequences for each value of \(n_\mathrm{frames}\) is used as training examples, while for testing the recognition is based on a single shorter sequence.
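A sketch of the subsequence extraction (the function name is ours, and since the exact sampling is not specified above, non-overlapping windows are assumed):

```python
def extract_subsequences(frames, n_frames):
    """Split a video, given as a sequence of frames, into consecutive
    non-overlapping subsequences of length n_frames; a trailing
    remainder shorter than n_frames is dropped."""
    return [frames[i:i + n_frames]
            for i in range(0, len(frames) - n_frames + 1, n_frames)]
```

Each extracted subsequence is then turned into its own histogram descriptor, so a video contributes several training examples and is classified at test time from a single short window.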
We study the performance of the binary STRF N-jet descriptor with \(n_\mathrm{comp} \in \{10, 13\}\) on the Beta and Gamma benchmarks. The results are presented in Fig. 13. It can be seen that for \(n_\mathrm{comp} = 13\), performance close to the baseline can be achieved even when using very few frames. For the Beta benchmark, there is a 0.3% difference between using a single frame and the entire video (92.3 vs. 92.6%), while for the Gamma benchmark this difference is 1.1% (92.8 vs. 93.9%). When using a smaller descriptor with \(n_\mathrm{comp} = 10\), we note that the performance instead improves when training and testing on shorter sequences, where the best values of \(n_\mathrm{frames}\) give an accuracy 1% above the baseline for both benchmarks. Note that, in a real-time scenario, the temporal delay of the time-causal filters will also influence the reaction time, and that at least three frames are needed to compute a discrete second-order temporal derivative approximation.
The reason why using less data (fewer frames) might cause a drop in performance is clear: there is simply less information available for making a decision. The reason why using shorter sequences can instead improve the performance is most likely that the matching becomes more flexible: each shorter sequence of a video only needs a representation similar to that of at least one shorter sequence from another video in the same class, rather than requiring the full videos to match.
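This matching argument can be made concrete with a minimal nearest-neighbour sketch (the 1-NN classifier and the toy two-dimensional descriptors below are illustrative assumptions, not the paper's exact setup): a short test sequence is assigned the class of its single closest training descriptor, so it suffices that one short training sequence of the correct class is similar.

```python
import numpy as np

def nn_classify(test_desc, train_descs, train_labels):
    """1-nearest-neighbour classification of a single short-sequence
    descriptor: it is enough that ONE training descriptor of the
    correct class is close, rather than the full-video statistics
    having to match."""
    dists = np.linalg.norm(train_descs - test_desc, axis=1)
    return train_labels[int(np.argmin(dists))]

# Toy example: two classes, two short-sequence descriptors each.
train = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0], [0.2, 0.1]])
labels = np.array([0, 0, 1, 1])
# This test descriptor is close to only one of the class-0 descriptors,
# yet that single match determines the prediction.
pred = nn_classify(np.array([0.1, 0.0]), train, labels)
print(pred)
```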
In conclusion, these results demonstrate that our approach is robust to using descriptors computed from shorter video sequences and should thus be a viable option also for real-time scenarios where a quick decision is needed.
7.7 Qualitative Results
To gain more insight into the qualitative behaviour of our proposed family of video descriptors, we inspected the confusions and the closest neighbours of correctly classified and misclassified samples. Confusion matrices for the UCLA9, Beta and Gamma benchmarks for the STRF N-jet descriptor are presented in Fig. 14. We note that the main cause of error for UCLA9 is confusing fire with smoke; UCLA8 (not shown here) shares a similar pattern. There is indeed a similarity in dynamics between these textures: both exhibit temporal intensity changes that are not mediated by spatial movements. Confusions between flowers and plants and between fountain and waterfall are most likely caused by similarities in the spatial appearance and the motion patterns of these dynamic texture classes.
When inspecting the confusions between the different classes for the Beta and Gamma benchmarks, no clear pattern is visible. This is probably partly because these benchmarks contain larger intra-class variabilities. Misclassifications seem to be caused by a single video in one class having some similarity to a single video of a different class, rather than by certain classes being consistently mixed up. The largest proportion of misclassified samples occurs for the escalator and traffic classes, which are also the classes with the fewest samples.
More light is shed on the reasons for the misclassifications on the DynTex benchmarks when inspecting the closest neighbours in feature space of misclassified videos (Fig. 15) and of correctly classified videos (Fig. 16). We note that a frequent feature of the misclassified videos is the presence of multiple textures, such as a flag with light foliage in front of it (misclassified as foliage), or a fountain flowing into a pool of calm water (misclassified as calm water). There are also examples of confusion caused by similarity in either the spatial appearance or the temporal dynamics of specific instances of different classes, such as calm water reflecting a small tree being misclassified as foliage, or a field of grass waving in the wind being misclassified as sea.
8 Summary and Discussion
We have presented a new family of video descriptors based on joint histograms of spatiotemporal receptive field responses and evaluated several members of this family on the problem of dynamic texture recognition. This is, to our knowledge, the first video descriptor that uses joint statistics of a set of "ideal" (in the sense of being derived from pure mathematical reasoning) spatiotemporal scale-space filters for video analysis, and the first quantitative performance evaluation of using the family of time-causal scale-space filters derived by Lindeberg [44] as primitives for video analysis. Our proposed approach generalises a previous method by Linde and Lindeberg [37, 38], based on joint histograms of receptive field responses, from the spatial to the spatiotemporal domain and from object recognition to dynamic texture recognition.
Our experimental evaluation on several benchmarks from two widely used dynamic texture datasets demonstrates competitive performance compared to state-of-the-art dynamic texture recognition methods. The best results are achieved by our binary STRF N-jet descriptor, which combines directionally selective spatial and spatiotemporal receptive field responses. For the UCLA benchmarks, the STRF N-jet descriptor achieves the highest mean accuracy averaged over all benchmarks, as well as the shared best or single best result on several benchmarks. In addition, our approach achieves improved results on all the UCLA benchmarks compared to a large range of similar methods that are also based on gathering statistics of local space–time structure but use different spatiotemporal primitives.
For the larger, more complex DynTex benchmarks, deep learning approaches come out on top. However, here too, our proposed video descriptor achieves improved performance compared to a range of similar methods, such as local binary pattern-based methods, including recent extensions and improvements [24, 61, 86, 87], and two methods that learn filters by means of unsupervised learning [2, 3]. The improved performance compared to approaches that learn filters from data (which for the UCLA benchmarks also includes two deep learning-based approaches) shows that designing filters based on structural priors of the world can, for these tasks, be as effective as learning.
It should be noted that the presented approach is not intended as a final point, but rather as a first, conceptually simple approach for performing video analysis based on these new spatiotemporal primitives. We believe that extending our framework, for example by complementing the current single layer of receptive fields with hierarchical features, would indeed be beneficial for addressing more complicated visual tasks. Using a set of localised histograms, which capture the relative locations of different image structures, or learning or designing higher-level hierarchical features on top of such local spatiotemporal derivative responses, would most likely improve the performance. A single global histogram would of course not be appropriate for videos that are not pre-segmented or that do not have the spatial and temporal stationarity properties of dynamic textures; to handle such scenes, regional histograms should instead be computed. Also, in the presence of global camera motion, velocity adaptation would be beneficial. However, the objective here has primarily been to use these new time-causal spatiotemporal primitives within a simple framework, in order to focus on a first evaluation of the spatiotemporal primitives rather than on what is to be built on top of them. Thus, the presented framework is not aimed at directly competing with conceptually more complex methods: our approach does not include combinations of different feature types, ensemble classifiers, regional pooling to capture the relative locations of features, or learned or handcrafted mid-level features. Constructing a more complex framework on top of these spatiotemporal primitives is certainly possible, however, and would with high probability result in additional performance gains.
In summary, our conceptually simple video descriptor achieves highly competitive performance across all benchmarks compared to other greyscale "shallow" methods, and improved performance compared to all other methods of similar conceptual complexity using different spatiotemporal primitives, either handcrafted or learned from data. We consider this strong validation that these time-causal spatiotemporal receptive fields are highly descriptive for modelling local space–time structure, and evidence in favour of their general applicability as primitives for other video analysis tasks.
Our approach could also be implemented using non-causal Gaussian spatiotemporal scale-space kernels. This might give somewhat improved results, since additional information from the future could then also be used at each point in time. However, a time-delayed Gaussian kernel would imply longer temporal delays, making it less suited for time-critical applications, in addition to requiring more computations and larger temporal buffers. The computational advantages of these time-causal filters imply that they can, in fact, be preferable also in settings where temporal causality is not of primary interest.
The scale-space filters used in this work have a strong connection to biology, in the sense that a subset of these receptive fields model both spatial and spatiotemporal receptive fields in the LGN and V1 very well [11, 42, 44]. The receptive fields in V1, V2, V4 and MT serve as input to a large number of higher visual areas. It does indeed appear attractive to be able to use similar filters for early visual processing over a wide range of visual tasks. This study can be seen as a first step in a more general investigation of what can be done with spatiotemporal features similar to those in the primate brain.
Directions for future work include: (i) a multi-scale recognition framework, where each video is represented by a set of descriptors computed at multiple spatial and temporal scales, both during training and testing, which would enable scale-invariant recognition according to the theory presented in "Appendix A.1"; (ii) an extension to colour by including spatio-chromo-temporal receptive fields, which would with high probability improve the performance on tasks where colour information is discriminative; (iii) including non-separable spatiotemporal receptive fields with non-zero image velocities in the video descriptor, according to the more general receptive field model in [44] (such velocity-adapted receptive fields in fact constitute a dominant portion of the receptive fields in areas V1 and MT [11]); (iv) using position-dependent histograms to take the relative locations of features into account; and (v) learning or designing higher-level hierarchical features based on these spatiotemporal primitives.
We propose that the spatiotemporal receptive field framework should be of more general applicability to other video analysis tasks. Time-causal spatiotemporal receptive fields are indeed the visual primitives used for solving a large range of visual tasks by biological agents. The theoretical properties of these spatiotemporal receptive fields imply that they can be used to design methods that are provably invariant or robust to different types of natural image transformations, where such invariances will reduce the sample complexity of learning. We thus see the possibility both to integrate time-causal spatiotemporal receptive fields into current video analysis methods and to design new types of methods based on these primitives.
Footnotes
 1.
To give the determinant of the spatial Hessian the same dimensionality in terms of \([\text{intensity}]\) as the other spatial differential invariants, we transform its magnitude by a square root function while preserving its sign: \((\det \mathscr{H} L)_{\mathrm{transf}} = \operatorname{sign}(\det \mathscr{H} L) \, |\det \mathscr{H} L|^{1/2}\).
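The sign-preserving square-root transform in this footnote can be written as a one-liner (a direct transcription of the formula, here in Python with hypothetical variable names):

```python
import numpy as np

def signed_sqrt(det_hessian):
    """Transform det(H L) to the same dimensionality in terms of
    [intensity] as the other spatial differential invariants: take the
    square root of the magnitude while preserving the sign."""
    return np.sign(det_hessian) * np.sqrt(np.abs(det_hessian))

result = signed_sqrt(np.array([4.0, -9.0, 0.0]))
print(result)  # → [ 2. -3.  0.]
```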
Acknowledgements
We would like to thank Oskar Linde for providing access to his code for computing joint receptive field histograms for spatial object recognition, which has influenced our implementation for spatiotemporal recognition.
References
1. Andrearczyk, V., Whelan, P.F.: Convolutional neural network on three orthogonal planes for dynamic texture classification. Pattern Recognit. 76, 36–49 (2018)
2. Arashloo, S.R., Amirani, M.C., Noroozi, A.: Dynamic texture representation using a deep multi-scale convolutional network. J. Vis. Commun. Image Represent. 43, 89–97 (2017)
3. Arashloo, S.R., Kittler, J.: Dynamic texture recognition using multiscale binarized statistical image features. IEEE Trans. Multimed. 16(8), 2099–2109 (2014)
4. Chan, A.B., Vasconcelos, N.: Classifying video with kernel dynamic textures. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2007), pp. 1–6. IEEE (2007)
5. Chetverikov, D., Fazekas, S., Haindl, M.: Dynamic texture as foreground and background. Mach. Vis. Appl. 22(5), 741–750 (2011)
6. Chetverikov, D., Péteri, R.: A brief survey of dynamic texture description and recognition. In: Kurzyński, M., Puchała, E., Woźniak, M., Żołnierek, A. (eds.) Computer Recognition Systems, pp. 17–26. Springer, Berlin, Heidelberg (2005)
7. Crivelli, T., Cernuschi-Frias, B., Bouthemy, P., Yao, J.F.: Motion textures: modeling, classification, and segmentation using mixed-state Markov random fields. SIAM J. Imaging Sci. 6(4), 2484–2520 (2013)
8. Culibrk, D., Sebe, N.: Temporal dropout of changes approach to convolutional learning of spatiotemporal features. In: Proceedings of 22nd ACM International Conference on Multimedia, pp. 1201–1204 (2014)
9. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 886–893 (2005)
10. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) Proceedings of European Conference on Computer Vision (ECCV 2006), Lecture Notes in Computer Science, vol. 3952, pp. 428–441. Springer, Berlin (2006)
11. DeAngelis, G.C., Ohzawa, I., Freeman, R.D.: Receptive field dynamics in the central visual pathways. Trends Neurosci. 18(10), 451–457 (1995)
12. Derpanis, K.G., Wildes, R.P.: Spacetime texture representation and recognition based on a spatiotemporal orientation analysis. IEEE Trans. Pattern Anal. Mach. Intell. 34(6), 1193–1205 (2012)
13. Dubois, S., Péteri, R., Ménard, M.: Characterization and recognition of dynamic textures based on the 2D+T curvelet transform. Signal Image Video Process. 9(4), 819–830 (2015)
14. El Moubtahij, R., Augereau, B., Fernandez-Maloigne, C., Tairi, H.: A polynomial texture extraction with application in dynamic texture classification. In: Twelfth International Conference on Quality Control by Artificial Vision (2015)
15. Fazekas, S., Amiaz, T., Chetverikov, D., Kiryati, N.: Dynamic texture detection based on motion analysis. Int. J. Comput. Vis. 82(1), 48–63 (2009)
16. Fazekas, S., Chetverikov, D.: Analysis and performance evaluation of optical flow features for dynamic texture recognition. Signal Process. Image Commun. 22(7), 680–691 (2007)
17. Florack, L.M.J.: Image Structure. Springer, Berlin (1997)
18. Ghanem, B., Ahuja, N.: Phase based modelling of dynamic textures. In: Proceedings of International Conference on Computer Vision (ICCV 2007), pp. 1–8. IEEE (2007)
19. Ghanem, B., Ahuja, N.: Maximum margin distance learning for dynamic texture recognition. In: European Conference on Computer Vision (ECCV 2010), Springer LNCS, vol. 6312, pp. 223–236 (2010)
20. Gonçalves, W.N., Machado, B.B., Bruno, O.M.: Spatiotemporal Gabor filters: a new method for dynamic texture recognition. arXiv preprint arXiv:1201.3612 (2012)
21. Gonçalves, W.N., Machado, B.B., Bruno, O.M.: A complex network approach for dynamic texture recognition. Neurocomputing 153, 211–220 (2015)
22. Hernandez, J.A.R., Crowley, J.L., Lux, A., Pietikäinen, M.: Histogram-tensorial Gaussian representations and its applications to facial analysis. In: Local Binary Patterns: New Variants and Applications, pp. 245–268. Springer (2014)
23. Hong, S., Ryu, J., Im, W., Yang, H.S.: D3: recognizing dynamic scenes with deep dual descriptor based on key frames and key segments. Neurocomputing 273, 611–621 (2018)
24. Hong, S., Ryu, J., Yang, H.S.: Not all frames are equal: aggregating salient features for dynamic texture classification. Multidimens. Syst. Signal Process. (2016). https://doi.org/10.1007/s11045-016-0463-7
25. Hubel, D.H., Wiesel, T.N.: Brain and Visual Perception: The Story of a 25-Year Collaboration. Oxford University Press, Oxford (2005)
26. Iijima, T.: Observation theory of two-dimensional visual patterns. Technical report, Papers of Technical Group on Automata and Automatic Control, IECE, Japan (1962) (in Japanese)
27. Jansson, Y., Lindeberg, T.: Dynamic texture recognition using time-causal spatiotemporal scale-space filters. In: Proceedings of Scale Space and Variational Methods for Computer Vision (SSVM 2017), Springer LNCS, vol. 10302, pp. 16–28. Kolding, Denmark (2017)
28. Ji, H., Yang, X., Ling, H., Xu, Y.: Wavelet domain multifractal analysis for static and dynamic texture classification. IEEE Trans. Image Process. 22(1), 286–299 (2013)
29. Kläser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: Proceedings of British Machine Vision Conference, Leeds, UK (2008)
30. Koenderink, J.J.: The structure of images. Biol. Cybernet. 50, 363–370 (1984)
31. Koenderink, J.J.: Scale-time. Biol. Cybernet. 58, 159–162 (1988)
32. Koenderink, J.J., van Doorn, A.J.: Representation of local geometry in the visual system. Biol. Cybernet. 55, 367–375 (1987)
33. Koenderink, J.J., van Doorn, A.J.: Generic neighborhood operators. IEEE Trans. Pattern Anal. Mach. Intell. 14(6), 597–605 (1992)
34. Laptev, I., Lindeberg, T.: Local descriptors for spatio-temporal recognition. In: Proceedings of ECCV'04 Workshop on Spatial Coherence for Visual Motion Analysis, Springer LNCS, vol. 3667, pp. 91–103. Prague, Czech Republic (2004)
35. Laptev, I., Lindeberg, T.: Velocity adaptation of spatiotemporal receptive fields for direct recognition of activities: an experimental study. Image Vis. Comput. 22(2), 105–116 (2004)
36. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of Computer Vision and Pattern Recognition (CVPR'06), pp. 2169–2178. Washington, DC, USA (2006)
37. Linde, O., Lindeberg, T.: Object recognition using composed receptive field histograms of higher dimensionality. In: International Conference on Pattern Recognition, vol. 2, pp. 1–6. Cambridge (2004)
38. Linde, O., Lindeberg, T.: Composed complex-cue histograms: an investigation of the information content in receptive field based image descriptors for object recognition. Comput. Vis. Image Underst. 116, 538–560 (2012)
39. Lindeberg, T.: Scale-Space Theory in Computer Vision. Springer, Berlin (1993)
40. Lindeberg, T.: Feature detection with automatic scale selection. Int. J. Comput. Vis. 30(2), 77–116 (1998)
41. Lindeberg, T.: Generalized Gaussian scale-space axiomatics comprising linear scale-space, affine scale-space and spatio-temporal scale-space. J. Math. Imaging Vis. 40(1), 36–81 (2011)
42. Lindeberg, T.: A computational theory of visual receptive fields. Biol. Cybernet. 107(6), 589–635 (2013)
43. Lindeberg, T.: Invariance of visual operators at the level of receptive fields. PLOS ONE 8(7), e66990 (2013)
44. Lindeberg, T.: Time-causal and time-recursive spatio-temporal receptive fields. J. Math. Imaging Vis. 55(1), 50–88 (2016)
45. Lindeberg, T.: Spatio-temporal scale selection in video data. J. Math. Imaging Vis. (2017). https://doi.org/10.1007/s10851-017-0766-9
46. Lindeberg, T.: Spatio-temporal scale selection in video data. In: Proceedings of Scale Space and Variational Methods in Computer Vision (SSVM 2017), Springer LNCS, vol. 10302, pp. 3–15 (2017)
47. Lindeberg, T.: Dense scale selection over space, time and space–time. SIAM J. Imaging Sci. 11(1), 407–441 (2018)
48. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
49. Lu, Z., Xie, W., Pei, J., Huang, J.: Dynamic texture recognition by spatiotemporal multiresolution histograms. In: Proceedings of Seventh IEEE Workshop on Motion and Video Computing, vol. 2, pp. 241–246. IEEE (2005)
50. Miao, J., Xu, X., Xing, X., Tao, D.: Manifold regularized slow feature analysis for dynamic texture recognition. arXiv preprint arXiv:1706.03015 (2017)
51. Mumtaz, A., Coviello, E., Lanckriet, G.R., Chan, A.B.: A scalable and accurate descriptor for dynamic textures using bag of system trees. IEEE Trans. Pattern Anal. Mach. Intell. 37(4), 697–712 (2015)
52. Nelson, R.C., Polana, R.: Qualitative recognition of motion using temporal texture. CVGIP Image Underst. 56(1), 78–89 (1992)
53. Norouznezhad, E., Harandi, M.T., Bigdeli, A., Baktash, M., Postula, A., Lovell, B.C.: Directional space–time oriented gradients for 3D visual pattern analysis. In: European Conference on Computer Vision, Springer LNCS, vol. 7574, pp. 736–749 (2012)
54. Péteri, R., Chetverikov, D.: Dynamic texture recognition using normal flow and texture regularity. In: Pattern Recognition and Image Analysis, pp. 9–23 (2005)
55. Péteri, R., Fazekas, S., Huiskes, M.J.: DynTex: a comprehensive database of dynamic textures. Pattern Recognit. Lett. 31(12), 1627–1632 (2010)
56. Qi, X., Li, C.G., Zhao, G., Hong, X., Pietikäinen, M.: Dynamic texture and scene classification by transferring deep image features. Neurocomputing 171, 1230–1241 (2016)
57. Qiao, Y., Weng, L.: Hidden Markov model based dynamic texture classification. IEEE Signal Process. Lett. 22(4), 509–512 (2015)
58. Quan, Y., Bao, C., Ji, H.: Equiangular kernel dictionary learning with applications to dynamic texture analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 308–316 (2016)
59. Quan, Y., Huang, Y., Ji, H.: Dynamic texture recognition via orthogonal tensor dictionary learning. In: IEEE International Conference on Computer Vision, pp. 73–81 (2015)
60. Ravichandran, A., Chaudhry, R., Vidal, R.: View-invariant dynamic texture recognition using a bag of dynamical systems. In: Computer Vision and Pattern Recognition, pp. 1651–1657 (2009)
61. Ren, J., Jiang, X., Yuan, J.: Dynamic texture recognition using enhanced LBP features. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2400–2404 (2013)
62. Rivera, A.R., Chae, O.: Spatiotemporal directional number transitional graph for dynamic texture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(10), 2146–2152 (2015)
63. Sagel, A., Kleinsteuber, M.: Alignment distances on systems of bags. IEEE Trans. Circuits Syst. Video Technol. (2017). https://doi.org/10.1109/TCSVT.2017.2715851
64. Schiele, B., Crowley, J.: Recognition without correspondence using multidimensional receptive field histograms. Int. J. Comput. Vis. 36(1), 31–50 (2000)
65. Smith, J.R., Lin, C.Y., Naphade, M.: Video texture indexing using spatiotemporal wavelets. In: Proceedings of IEEE International Conference on Image Processing, vol. 2, pp. 437–440 (2002)
66. Soatto, S., Doretto, G., Wu, Y.N.: Dynamic textures. In: IEEE International Conference on Computer Vision, vol. 2, pp. 439–446 (2001)
67. Sporring, J., Nielsen, M., Florack, L., Johansen, P. (eds.): Gaussian Scale-Space Theory: Proceedings of PhD School on Scale-Space Theory. Series in Mathematical Imaging and Vision. Springer, Copenhagen (1997)
68. Swain, M.J., Ballard, D.H.: Color indexing. Int. J. Comput. Vis. 7(1), 11–32 (1991)
69. ter Haar Romeny, B.: Front-End Vision and Multi-Scale Image Analysis. Springer, Berlin (2003)
70. ter Haar Romeny, B., Florack, L., Nielsen, M.: Scale-time kernels and models. In: Proceedings of International Conference on Scale-Space and Morphology in Computer Vision (Scale-Space'01), Springer LNCS, vol. 2106. Vancouver, Canada (2001)
71. Tiwari, D., Tyagi, V.: Dynamic texture recognition based on completed volume local binary pattern. Multidimens. Syst. Signal Process. 27(2), 563–575 (2016)
72. Tiwari, D., Tyagi, V.: Dynamic texture recognition using multiresolution edge-weighted local structure pattern. Comput. Electr. Eng. (2016). https://doi.org/10.1016/j.compeleceng.2016.11.008
73. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1582–1596 (2010)
74. Valois, R.L.D., Cottaris, N.P., Mahon, L.E., Elfer, S.D., Wilson, J.A.: Spatial and temporal receptive fields of geniculate and cortical cells and directional selectivity. Vis. Res. 40(2), 3685–3702 (2000)
75. Wang, L., Liu, H., Sun, F.: Dynamic texture video classification using extreme learning machine. Neurocomputing 174, 278–285 (2016)
76. Wang, Y., Hu, S.: Exploiting high level feature for dynamic textures recognition. Neurocomputing 154, 217–224 (2015)
77. Wang, Y., Hu, S.: Chaotic features for dynamic textures recognition. Soft Comput. 20(5), 1977–1989 (2016)
78. Weickert, J., Ishikawa, S., Imiya, A.: Linear scale-space has first been proposed in Japan. J. Math. Imaging Vis. 10(3), 237–252 (1999)
79. Wildes, R.P., Bergen, J.R.: Qualitative spatiotemporal analysis using an oriented energy representation. In: European Conference on Computer Vision, Springer LNCS, vol. 1843, pp. 768–784 (2000)
80. Witkin, A.P.: Scale-space filtering. In: Proceedings of 8th International Joint Conference on Artificial Intelligence, pp. 1019–1022. Karlsruhe, Germany (1983)
81. Woolfe, F., Fitzgibbon, A.: Shift-invariant dynamic texture recognition. In: European Conference on Computer Vision, Springer LNCS, vol. 3952, pp. 549–562 (2006)
82. Xu, Y., Huang, S., Ji, H., Fermüller, C.: Scale-space texture description on SIFT-like textons. Comput. Vis. Image Underst. 116(9), 999–1013 (2012)
83. Xu, Y., Quan, Y., Zhang, Z., Ling, H., Ji, H.: Classifying dynamic textures via spatiotemporal fractal analysis. Pattern Recognit. 48(10), 3239–3248 (2015)
84. Yang, F., Xia, G.S., Liu, G., Zhang, L., Huang, X.: Dynamic texture recognition by aggregating spatial and temporal features via ensemble SVMs. Neurocomputing 173, 1310–1321 (2016)
85. Zelnik-Manor, L., Irani, M.: Event-based analysis of video. In: Proceedings of Computer Vision and Pattern Recognition (CVPR'01), pp. II:123–130 (2001)
86. Zhao, G., Pietikäinen, M.: Dynamic texture recognition using volume local binary patterns. In: Proceedings of the Workshop on Dynamical Vision (WDV), Springer LNCS, vol. 4358, pp. 165–177 (2006)
87. Zhao, G., Pietikäinen, M.: Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 915–928 (2007)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.