Dynamic Texture Recognition Using Time-Causal and Time-Recursive Spatio-Temporal Receptive Fields

This work presents a first evaluation of using spatio-temporal receptive fields from a recently proposed time-causal spatio-temporal scale-space framework as primitives for video analysis. We propose a new family of video descriptors based on regional statistics of spatio-temporal receptive field responses and evaluate this approach on the problem of dynamic texture recognition. Our approach generalises a previously used method, based on joint histograms of receptive field responses, from the spatial to the spatio-temporal domain and from object recognition to dynamic texture recognition. The time-recursive formulation enables computationally efficient time-causal recognition. The experimental evaluation demonstrates competitive performance compared to the state of the art. In particular, it is shown that binary versions of our dynamic texture descriptors achieve improved performance compared to a large range of similar methods using different primitives, either handcrafted or learned from data. Further, our qualitative and quantitative investigation into parameter choices and the use of different sets of receptive fields highlights the robustness and flexibility of our approach. Together, these results support the descriptive power of this family of time-causal spatio-temporal receptive fields, validate our approach for dynamic texture recognition and point towards the possibility of designing a range of video analysis methods based on these new time-causal spatio-temporal primitives.


Introduction
The ability to derive properties of the surrounding world from time-varying visual input is a key function of a general purpose computer vision system and necessary for any artificial or biological agent that is to use vision for interpreting a dynamic environment. Motion provides additional cues for understanding a scene and some tasks by necessity require motion information, e.g. distinguishing between events or actions with similar spatial appearance or estimating the speed of moving objects. Motion cues are also helpful when other visual cues are weak or contradictory. Thus, understanding how to represent and incorporate motion information for different video analysis tasks, including the question of what constitutes meaningful spatio-temporal features, is a key factor for successful applications in areas such as action recognition, automatic surveillance, dynamic texture and scene understanding, video-indexing and retrieval, autonomous driving, etc.
Challenges when dealing with spatio-temporal image data are similar to those present in the static case, that is high-dimensional data with large intraclass variability caused both by the diverse appearance of conceptually similar objects and the presence of a large number of identity preserving visual transformations. From the spatial domain, we inherit the basic image transformations: translations, scalings, rotations, nonlinear perspective transformations and illumination transformations. In addition to these, spatio-temporal data will contain additional sources of variability: differences in motion patterns for conceptually similar motions, events occurring faster or slower, velocity transformations caused by camera motion and time-varying illumination. Moreover, the larger data dimensionality compared to static images presents additional computational challenges.
For biological vision, local image measurements in terms of receptive fields constitute the first processing layers (Hubel and Wiesel [26]; DeAngelis et al. [10]). In the area of scale-space theory, it has been shown that Gaussian derivatives and related operators constitute a canonical model for visual receptive fields (Iijima [27]; Witkin [77]; Koenderink [31,32]; Koenderink and van Doorn [33,34]; Lindeberg [39,41]; Florack [16]; Sporring et al. [67]; Weickert et al. [75]; ter Haar Romeny et al. [21,22]). In computer vision, spatial receptive fields based on the Gaussian scale-space concept have been demonstrated to be a powerful front-end for solving a large range of visual tasks. Theoretical properties of scale-space filters enable the design of methods invariant or robust to natural image transformations (Lindeberg [41][42][43][44]). Also, early processing layers based on such "ideal" receptive fields can be shared among multiple tasks and thus free resources both for learning higher level features from data and during on-line processing.
The most straightforward extension of the Gaussian scale-space concept to the spatio-temporal domain is to use Gaussian smoothing also over the temporal domain. However, for real-time processing or when modelling biological vision this would violate the fundamental temporal causality constraint present for real-time tasks: It is simply not possible to access the future. The ad hoc solution of using a time-delayed truncated Gaussian temporal kernel would instead imply unnecessarily long temporal delays, which would make the framework less suitable for time-critical applications. The preferred option is to use truly time-causal visual operators. Recently, a new time-causal spatio-temporal scale-space framework, leading to a new family of time-causal spatio-temporal receptive fields, has been introduced by Lindeberg [44]. In addition to temporal causality, the time-recursive formulation of the temporal smoothing operation offers computational advantages compared to Gaussian filtering over time in terms of fewer computations and a compact recursively updated memory of the past.
The idealised spatio-temporal receptive fields derived within that framework also have a strong connection to biology in the sense of very well modelling both spatial and spatio-temporal receptive field shapes of neurons in the LGN and V1 (Lindeberg [42,44]). This similarity further motivates designing algorithms based on these primitives. They provably work well in a biological system, which points towards their applicability also for artificial vision. An additional motivation, although not actively pursued here, is that designing methods based on primitives similar to biological receptive fields can enable a better understanding of information processing in biological systems.
The purpose of this study is a first evaluation of using this new family of time-causal spatio-temporal receptive fields as visual primitives for video analysis. As a first application, we have chosen the problem of dynamic texture recognition. A dynamic texture or spatio-temporal texture is an extension of texture to the spatio-temporal domain and can be naively defined as "texture + motion" or more formally as a spatio-temporal pattern that exhibits certain stationarity properties and self-similarity over both space and time (Chetverikov and Péteri [5]). Examples of dynamic textures are windblown vegetation, fire, waves, a flock of flying birds or a flag flapping in the wind. Recognising different types of dynamic textures is important for visual tasks such as automatic surveillance (e.g. detecting forest fires), video indexing and retrieval (e.g. return all images set on the sea) and to enable artificial agents to understand and interact with the world by interpreting different environments.
In this work, we start by presenting a new family of video descriptors in the form of joint histograms of spatio-temporal receptive field responses. Our approach generalises a previous method by Linde and Lindeberg [38] from the spatial to the spatio-temporal domain and from object recognition to dynamic texture recognition. We subsequently perform an experimental evaluation on two commonly used dynamic texture datasets and present results on: (i) Qualitative and quantitative effects from varying model parameters including the spatial and the temporal scales, the number of principal components and the number of bins used in the histogram descriptor. (ii) A comparison between the performance of descriptors constructed from different sets of receptive fields. (iii) An extensive comparison with state-of-the-art dynamic texture recognition methods.
Our benchmark results demonstrate competitive performance compared to state-of-the-art dynamic texture recognition methods. This holds even though our approach is a conceptually simple method utilising only local information, with the additional constraint of using time-causal operators. In particular, it is shown that binary versions of our dynamic texture descriptors achieve improved performance compared to a large range of similar methods using different primitives, either handcrafted or learned from data. Further, our qualitative and quantitative investigation into parameter choices and the use of different sets of receptive fields highlights the robustness and flexibility of our approach. Together, these results support the descriptive power of this family of time-causal spatio-temporal receptive fields, validate our approach for dynamic texture recognition and point towards the possibility of designing a range of video analysis methods based on these new time-causal spatio-temporal primitives.
This paper is an extended version of a conference paper [28], in which only a single descriptor version was evaluated. We here present a more extensive experimental evaluation, including results on varying the descriptor parameters and on using different sets of receptive fields. We specifically present new video descriptors with improved performance compared to the previous video descriptor used in [28].

Related work
For dynamic texture recognition, additional challenges compared to the static case include variabilities in motion patterns and speed and much higher data dimensionality. Although spatial texture descriptors can give reasonable performance also in the dynamic case (a human observer can typically distinguish dynamic textures based on a single frame), making use of dynamic information as well provides a consistent advantage over purely spatial descriptors (see e.g. Zhao et al. [84]; Arashloo and Kittler [3]; Hong et al. [25]; Qi et al. [55] and Andrearczyk and Whelan [1]). Thus, a large range of dynamic texture recognition methods have been developed, exploring different options for representing motion information and combining this with spatial information.
Some of the first methods for dynamic texture recognition were based on optic flow, see e.g. the pioneering work by Nelson and Polana [51] and later work by Péteri and Chetverikov [53], Lu et al. [48], Fazekas and Chetverikov [15], Fazekas et al. [14] and Crivelli et al. [6]. However, compared to the motion patterns of larger rigid objects, dynamic textures often feature chaotic non-smooth motions, multiple directions of motion at the same point and intensity changes not mediated by rigid motion; consider e.g. fluttering leaves, turbulent water and fire. This means that the assumptions made for optic flow computations are typically violated, which results in difficulties estimating optic flow for dynamic textures.
Another early approach, first proposed by Soatto et al. [66], is to model the time-evolving appearance of a dynamic texture as a linear dynamical system (LDS). Although originally designed for synthesis, recognition can be done using model parameters as features. A drawback of such global LDS models is, however, poor invariance to viewpoint and illumination conditions (Woolfe and Fitzgibbon [78]). To circumvent the heavy dependence of the first LDS models on global spatial appearance, the bags of dynamical systems (BoS) approach was developed by Ravichandran et al. [59].
Here, the global model is replaced by a set of codebook LDS models, each describing a small space-time cuboid, used as local descriptors in a bag-of-words framework. For additional LDS-based approaches, see also Chan and Vasconcelos [4], Mumtaz et al. [50], Qiao and Weng [56], Wang et al. [72] and Sagel and Kleinsteuber [62].
A large group of dynamic texture recognition methods based on statistics of local image primitives are local binary pattern (LBP) based approaches. The original spatial LBP descriptor captures the joint binarized distribution of pixel values in local neighbourhoods. The spatio-temporal generalisations of LBP used for dynamic texture recognition do the same either for 3D space-time volumes (VLBP) (Zhao et al. [83,84]) or by applying a two-dimensional LBP descriptor on three orthogonal planes (LBP-TOP) (Zhao et al. [84]). Several extensions/versions of LBP-TOP have subsequently been presented, e.g. utilising averaging and principal histogram analysis to get more reliable statistics (Enhanced LBP) (Ren et al. [60]) or multi-clustering of salient features to identify and remove outlier frames (AFS-TOP) (Hong et al. [25]). See also the completed local binary pattern approach (CVLBP) (Tiwari and Tyagi [69]) and multiresolution edge-weighted local structure patterns (MEWLSP) (Tiwari and Tyagi [70]).
Contrasting LBP-TOP with VLBP highlights a conceptual difference concerning the generalisation of spatial descriptors to the spatio-temporal domain: Whereas the VLBP descriptor is based on full space-time 2D+T features, the LBP-TOP descriptor applies 2D features originally designed for purely spatial tasks on several cross sections of a spatio-temporal volume. While the former is in some sense conceptually more appealing and implies more discriminative modelling of the space-time structure, this comes with the drawback of higher descriptor dimensionality and higher computational load.
For LBP-based methods, using 2D descriptors on three orthogonal planes has so far proven more successful and this approach is frequently used also by non-LBP-based methods (Andrearczyk and Whelan [1]; Arashloo and Kittler [3]; Arashloo et al. [2]; Xu et al. [80]). A variant of this idea proposed by Norouznezhad et al. [52] is to replace the three orthogonal planes with nine spatio-temporal symmetry planes and apply a histogram of oriented gradients on these planes. Similarly, the directional number transitional graph (DNG) by Rivera et al. [61] is evaluated on nine spatio-temporal cross sections of a video.
A different group of approaches similarly based on gathering statistics of local space-time structure, but using different primitives, is spatio-temporal filtering based approaches. Examples are the oriented energy representations by Wildes and Bergen [76] and Derpanis and Wildes [11], where the latter represents pure dynamics of spatio-temporal textures by capturing space-time orientation using 3D Gaussian derivative filters. Marginal histograms of spatio-temporal Gabor filters were proposed by Gonçalves et al. [19]. Our proposed approach, which is based on joint histograms of time-causal spatio-temporal receptive field responses, also fits into this category. However, in contrast to spatio-temporal filtering based approaches using marginal histograms, the use of joint histograms, which can capture the covariation of different features, enables distinguishing a larger number of local space-time patterns. Joint statistics of ideal scale-space filters have previously been used for spatial object recognition, see e.g. Schiele and Crowley [64] and the approach by Linde and Lindeberg [37,38], which we here generalise to the spatio-temporal domain.

Fig. 1: Sample frames from the UCLA dataset. The top and bottom rows show different dynamic texture instances from the same conceptual class as used in the UCLA8 and UCLA9 benchmarks. For the UCLA50 benchmark, these instances should instead be separated as different classes. The conceptual classes from left to right: "boiling", "fire", "flower", "fountain", "sea", "smoke", "water" and "waterfall".

The conceptual classes from left to right: "sea", "calm water", "naked trees", "fountains", "flags", "traffic" and "escalator".

The use of time-causal filters in our approach also contrasts with other spatio-temporal filtering based approaches and it should also be noted that the time-causal limit kernel used in this paper is time-recursive, whereas no time-recursive formulation is known for the scale-time kernel previously proposed by Koenderink [32].
There also exist a number of related methods similarly based on joint statistics of local space-time structure but using filters learned from data as primitives. Unsupervised learning based approaches are e.g. multi-scale binarized statistical image features (MBSIF-TOP) by Arashloo and Kittler [3], which learn filters by means of independent component analysis and PCANet-TOP by Arashloo et al. [2], where two layers of hierarchical features are learnt by means of principal component analysis (PCA). Approaches instead based on sparse coding for learning filters are orthogonal tensor dictionary learning (OTD) (Quan et al. [58]), equiangular kernel dictionary learning (SKDL) (Quan et al. [57]) and manifold regularised slow feature analysis (MR-SFA) (Miao et al. [49]).
Recently, several very successful deep learning based approaches for dynamic texture recognition have been proposed. These are typically trained by supervised learning and borrow statistical strength from pretraining on large static image databases. The best deep learning methods currently outperform most other approaches. Andrearczyk and Whelan [1] train a dynamic texture convolutional neural network (DT-CNN) to extract features on three orthogonal planes. Qi et al. [55] instead use a pretrained network for feature extraction (TCoF) and Hong et al. [24] propose a deep dual descriptor (D3). See also the early deep learning approach of Culibrk and Sebe [7] based on temporal dropout of changes and the high-level feature approach by Wang and Hu [73], which combines chaotic features with deep learning. Deep learning based approaches will learn more complex and abstract high-level features, however, with properties less well understood and requiring much more training data compared to more "shallow" approaches.
Spatio-temporal transforms have also been applied for dynamic texture recognition. Ji et al. [29] propose a method based on wavelet domain fractal analysis (WMFS), whereas Dubois et al. [12] utilise the 2D+T curvelet transform and Smith et al. [65] use spatio-temporal wavelets for video texture indexing. Fractal dimension based methods make use of the self-similarity properties of dynamic textures. Xu et al. [80] create a descriptor from the fractal dimension of motion features (DFS), whereas Xu et al. [79] instead extract the power-law behaviour of local multi-scale gradient orientation histograms (3D-OTF) (see also Ghanem and Ahuja [17] and Smith et al. [65]).
Additional approaches include using average degrees of complex networks (Gonçalves et al. [20]) and a total variation based approach by El Moubtahij et al. [13]. Wang and Hu [74] instead create a descriptor from chaotic features. There are also approaches combining several different descriptor types or features, such as DL-PEGASOS by Ghanem and Ahuja [18], which combines LBP, PHOG and LDS descriptors with maximum margin distance learning, or Yang et al. [81] who use ensemble SVMs to combine LBP, shape-invariant co-occurrence patterns (SCOPs) and chromatic information with dynamic information represented by LDS models.

Spatio-temporal receptive field model
We here give a brief description of the time-causal spatio-temporal receptive field framework of Lindeberg [44]. This framework provides the time-causal primitives for describing local space-time structure, the spatio-temporal receptive fields (or equivalently 2D+T scale-space filters), which our proposed family of video descriptors is based on. In this framework, the axiomatically derived scale-space kernel at spatial scale s and temporal scale τ is of the form

$$T(x, y, t;\; s, \tau;\; u, v, \Sigma) = g(x - ut,\; y - vt;\; s, \Sigma)\; h(t;\; \tau)$$

where (x, y) denotes the image coordinates, t denotes time, h(t; τ) denotes a temporal smoothing kernel and g(x − ut, y − vt; s, Σ) denotes a spatial affine Gaussian kernel with spatial covariance matrix Σ that moves with image velocity (u, v).
Here, we restrict ourselves to rotationally symmetric Gaussian kernels over the spatial domain corresponding to Σ = I and to smoothing kernels with image velocity (u, v) = (0, 0) leading to space-time separable receptive fields. The temporal smoothing kernel h(t; τ) used here is the time-causal kernel composed of truncated exponential functions coupled in cascade:

$$h_{\exp}(t;\; \mu_k) = \begin{cases} \frac{1}{\mu_k}\, e^{-t/\mu_k} & t \geq 0 \\ 0 & t < 0 \end{cases}$$

with individual time constants $\mu_k$ and with a composed scale-invariant limit kernel having a Fourier transform of the form ([44, Eq. (38)])

$$\hat{\Psi}(\omega;\; \tau, c) = \prod_{k=1}^{\infty} \frac{1}{1 + i\, c^{-k} \sqrt{c^2 - 1}\, \sqrt{\tau}\, \omega}$$

where c > 1 is the distribution parameter for the logarithmic distribution of intermediate scale levels. For practical purposes, the limit kernel is approximated by a finite number K of recursive filters coupled in cascade. We here use c = 2 and K ≥ 7.
The spatio-temporal receptive fields are in turn defined as partial derivatives of the spatio-temporal scale-space kernel T for different orders $(m_1, m_2, n)$ of spatial and temporal differentiation computed at multiple spatio-temporal scales. The result of convolving a video f(x, y, t) with one of these kernels

$$L_{x^{m_1} y^{m_2} t^n}(\cdot, \cdot, \cdot;\; s, \tau) = \partial_{x^{m_1} y^{m_2} t^n} \left( T(\cdot, \cdot, \cdot;\; s, \tau) * f(\cdot, \cdot, \cdot) \right)$$

is referred to as the receptive field response. A set of receptive field responses will comprise a spatio-temporal N-jet representation of the local space-time structure, in essence corresponding to a truncated Taylor expansion possibly at multiple spatial and temporal scales. Using this representation thus enables capturing more diverse information about the local space-time structure than what can be done using a single filter type. One set of receptive fields considered in this work consists of combinations of the following sets of partial derivatives: (i) the first- and second-order spatial derivatives, (ii) the first- and second-order temporal derivatives of these and (iii) the first- and second-order temporal derivatives of the original smoothed video L:

$$\{\{L_x, L_y, L_{xx}, L_{xy}, L_{yy}\},\; \{L_{xt}, L_{yt}, L_{xxt}, L_{xyt}, L_{yyt}\},\; \{L_{xtt}, L_{ytt}, L_{xxtt}, L_{xytt}, L_{yytt}\},\; \{L_t, L_{tt}\}\}$$

A second set consists of spatio-temporal invariants defined from combinations of these partial derivatives. A subset of the receptive fields/scale-space derivative kernels are shown in Figure 3 (the kernels are three-dimensional over (x, y, t), but are here illustrated using a single spatial dimension x).
All video descriptors are expressed in terms of scale-normalised derivatives [44]:

$$\partial_\xi = s^{\gamma_s/2}\, \partial_x, \quad \partial_\eta = s^{\gamma_s/2}\, \partial_y, \quad \partial_\zeta = \tau^{\gamma_\tau/2}\, \partial_t$$

with the corresponding scale-normalised receptive fields computed as:

$$L_{x^{m_1} y^{m_2} t^n,\mathrm{norm}} = s^{(m_1 + m_2)\gamma_s/2}\; \tau^{n \gamma_\tau/2}\; L_{x^{m_1} y^{m_2} t^n}$$

where $\gamma_s > 0$ and $\gamma_\tau > 0$ are the scale normalisation powers of the spatial and the temporal domains, respectively. We here use $\gamma_s = 1$ and $\gamma_\tau = 1$ corresponding to the maximally scale-invariant case (as described in Appendix A.1). It should be noted that the scale-space representation and receptive field responses are computed "on the fly" using only a compact temporal buffer. The time-recursive formulation for the temporal smoothing [44, Eq. (56)] means that computing the scale-space representation for a new frame at temporal scale $\tau_k$ only requires the representation of the present frame t at the adjacent finer temporal scale $\tau_{k-1}$ and the representation of the preceding frame t − 1 at temporal scale $\tau_k$:

$$L(\cdot, \cdot, t;\; \cdot, \tau_k) = L(\cdot, \cdot, t-1;\; \cdot, \tau_k) + \frac{1}{1 + \mu_k} \left( L(\cdot, \cdot, t;\; \cdot, \tau_{k-1}) - L(\cdot, \cdot, t-1;\; \cdot, \tau_k) \right) \qquad (9)$$

where t represents time and $\tau_k > \tau_{k-1}$ are two adjacent temporal scale levels, where k = 0 corresponds to the original signal (the spatial coordinates and the spatial scale are here suppressed for brevity). This is a clear advantage compared to using a Gaussian kernel (possibly delayed and truncated) over the temporal domain, since it implies fewer computations and smaller temporal buffers. For further details concerning the spatio-temporal scale-space representation and its discrete implementation, we refer to Lindeberg [44].
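The time-recursive temporal smoothing can be illustrated with a minimal sketch, assuming the first-order recursive filter form out(t) = out(t − 1) + (in(t) − out(t − 1)) / (1 + µ_k) of Lindeberg [44, Eq. (56)]. The helper `time_constants`, the class name `RecursiveTemporalSmoother` and the choice of deriving the time constants from a logarithmic distribution of temporal variances with unit frame spacing are assumptions of this sketch, not the implementation used in the experiments.

```python
import numpy as np

def time_constants(tau, c=2.0, K=7):
    """Hypothetical helper: time constants mu_k for K cascaded filters.

    Assumes temporal variances tau_k = c^(2(k-K)) * tau (so tau_K == tau)
    and that a discrete first-order recursive filter with time constant mu
    adds variance mu^2 + mu (unit frame spacing).
    """
    k = np.arange(1, K + 1)
    tau_k = c ** (2.0 * (k - K)) * tau
    dtau = np.diff(np.concatenate(([0.0], tau_k)))  # variance added per stage
    return (np.sqrt(1.0 + 4.0 * dtau) - 1.0) / 2.0

class RecursiveTemporalSmoother:
    """Keeps one buffered frame per temporal scale level (the compact memory)."""

    def __init__(self, mus, frame_shape):
        self.mus = np.asarray(mus, dtype=float)
        self.state = np.zeros((len(self.mus),) + tuple(frame_shape))

    def update(self, frame):
        """Feed one new frame; returns the representations at all K levels."""
        x = np.asarray(frame, dtype=float)      # level k = 0: original signal
        for k, mu in enumerate(self.mus):
            # Level k at time t from level k-1 at time t and level k at t-1.
            self.state[k] += (x - self.state[k]) / (1.0 + mu)
            x = self.state[k]
        return self.state
```

Only the K buffered frames are stored between calls, so memory does not grow with the temporal extent of the kernel; no future frames are accessed.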

Video descriptors
We here describe our proposed family of video descriptors. The histogram descriptor is based on regional statistics of time-causal spatio-temporal receptive field responses and the process of computing the video descriptor can be divided into four main steps: (i) Computation of the spatio-temporal scale-space representation at the specified spatial and temporal scales (spatial and temporal smoothing). (ii) Computation of local spatio-temporal receptive field responses using discrete derivative approximation filters over space and time. (iii) Dimensionality reduction of local spatio-temporal receptive field responses/feature vectors using principal component analysis (PCA). (iv) Aggregation of joint statistics of receptive field responses over a space-time region into a multidimensional histogram.
Note that all computations are performed "on the fly", one frame at a time. In the following, each of these steps is described in more detail. A schematic illustration of our dynamic texture recognition workflow is given in Figure 4.

Spatio-temporal scale-space representation
The first processing step is spatial and temporal smoothing to compute the time-causal spatio-temporal scale-space representation of the current frame at the chosen spatial and temporal scales. The spatial smoothing is done in a cascade over spatial scales and the temporal smoothing is performed recursively over time and in a cascade over temporal scales according to [44]. The fully separable convolution kernels and the time-recursive implementation of the temporal smoothing imply that the smoothing can be done in a computationally efficient way.

Spatio-temporal receptive field responses
After the smoothing step, the spatio-temporal receptive field responses $F = [F_1, F_2, \ldots, F_N]$ are computed densely over the current frame for the N chosen scale-space derivative filters. These filters will include partial derivatives over a set of different spatial and temporal differentiation orders at single or multiple spatial and temporal scales. Alternatively, for a video descriptor based on differential invariants, the features will correspond to such differential invariants computed from the partial derivatives. Using multiple scales enables capturing image structures of different spatial extent and temporal duration. All derivatives are scale-normalised. In contrast to previous methods utilising "ideal" (as opposed to learned) spatio-temporal filters, such as Derpanis and Wildes [11] and Gonçalves et al. [19], our method encompasses using a diverse group of partial derivatives and differential invariants computed from the spatio-temporal N-jet as opposed to a single filter type. This means that our descriptor can capture more diverse information about the local image structure, where receptive field responses up to a certain order contain information corresponding to a truncated Taylor expansion around each image point.
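As a rough illustration of this step, the sketch below computes a few scale-normalised derivative responses from already smoothed frames. It assumes central differences over space, backward (time-causal) differences over time, edge replication at the image border and the normalisation factor $s^{(m_1+m_2)\gamma_s/2}\, \tau^{n\gamma_\tau/2}$; all function names are hypothetical, and the derivative approximation masks are assumptions of this sketch rather than a statement of the exact discretisation used in [44].

```python
import numpy as np

def spatial_derivatives(L):
    """Central-difference derivatives of a smoothed frame L (rows = y, cols = x)."""
    Lp = np.pad(L, 1, mode="edge")  # replicate border pixels (assumption)
    Lx = (Lp[1:-1, 2:] - Lp[1:-1, :-2]) / 2.0
    Ly = (Lp[2:, 1:-1] - Lp[:-2, 1:-1]) / 2.0
    Lxx = Lp[1:-1, 2:] - 2.0 * L + Lp[1:-1, :-2]
    Lyy = Lp[2:, 1:-1] - 2.0 * L + Lp[:-2, 1:-1]
    Lxy = (Lp[2:, 2:] - Lp[2:, :-2] - Lp[:-2, 2:] + Lp[:-2, :-2]) / 4.0
    return Lx, Ly, Lxx, Lxy, Lyy

def temporal_derivatives(L_now, L_prev, L_prev2):
    """Backward differences: only the current and two preceding frames are used."""
    Lt = L_now - L_prev
    Ltt = L_now - 2.0 * L_prev + L_prev2
    return Lt, Ltt

def scale_normalise(D, s, tau, m1, m2, n, gamma_s=1.0, gamma_tau=1.0):
    """Multiply a derivative of order (m1, m2, n) by s^((m1+m2)*gamma_s/2) * tau^(n*gamma_tau/2)."""
    return s ** ((m1 + m2) * gamma_s / 2.0) * tau ** (n * gamma_tau / 2.0) * D
```

Mixed spatio-temporal responses such as $L_{xt}$ follow by applying `spatial_derivatives` to the output of `temporal_derivatives`, since the operators are separable.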

Dimensionality reduction with PCA
When combining a large number of local image measurements, most of the variability in a dataset will be contained in a lower-dimensional subspace of the original feature space. This makes it possible to reduce the dimensionality of the local feature vectors, so that a larger set of receptive field responses can be summarised without resulting in a prohibitively large joint histogram descriptor.
For this purpose, a dimensionality reduction step in the form of principal component analysis is added before constructing the joint histogram. The result will be a local feature vector with M ≤ N components, where the new components correspond to linear combinations of the filter responses from the original feature vector F.
The transformation properties of the spatio-temporal scale-space derivatives imply scale covariance for such linear combinations of receptive fields, as well as for the individual receptive field responses if using the proper scale-normalisation with $\gamma_s = 1$ and $\gamma_\tau = 1$. A more detailed treatment of the scale-covariance properties is given in Appendix A.1.
The main reason for choosing PCA is simplicity and that it has empirically been shown to give good results for spatial images (Linde and Lindeberg [38]). The number of principal components M can be adapted to requirements for descriptor size and need for detail in modelling the local image structure. The dimensionality reduction step can also be skipped if working with a smaller number of receptive fields.
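The dimensionality reduction step can be sketched as follows, assuming the local feature vectors sampled from the training videos are stacked as rows of a matrix. The function names `fit_pca` and `project` are hypothetical and the SVD-based computation is one standard way of obtaining principal components, not necessarily the one used for the experiments.

```python
import numpy as np

def fit_pca(features, n_components):
    """features: (n_samples, N) array of receptive field responses.

    Returns the feature mean and the first n_components principal
    directions (rows), ordered by decreasing explained variance.
    """
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(features, mean, components):
    """Map N-dimensional feature vectors to M principal components."""
    return (features - mean) @ components.T
```

Each output component is a fixed linear combination of the original N filter responses, which is what allows the scale-covariance argument for individual receptive fields to carry over to the reduced feature vector.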

Joint receptive field histograms
To create the joint histogram of receptive field responses, each feature dimension i is partitioned into $n_{bins}$ equidistant bins in the range

$$[\mu_i - d\,\sigma_i,\; \mu_i + d\,\sigma_i]$$

where the mean $\mu_i$ and the standard deviation $\sigma_i$ are computed over the training set and d is a parameter controlling the number of standard deviations that are spanned for each dimension. We here use d = 5. This results in a histogram with $n_{cells} = (n_{bins})^M$ distinct multidimensional cells, where M is the number of principal components used. Such a joint histogram of spatio-temporal filter responses explicitly models the co-variation of different types of image measurements, in contrast to descriptors based on marginal distributions or relative feature strength.
Each histogram cell will correspond to a certain "template" local space-time structure, similar to e.g. VLBP [84] but notably represented and computed using different primitives. The histogram descriptor thus captures the relative frequency of such space-time structures in a space-time region. The number of different local "templates" will be decided by the number of receptive fields/principal components and the number of bins.
If represented naively, a joint histogram could imply a prohibitively large descriptor. However, in practice the number of non-zero bins can be considerably lower than the maximum number of bins, which enables utilising a computationally efficient sparse representation as outlined by Linde and Lindeberg [38].
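A sparse joint histogram of this kind can be sketched as below. The sketch assumes that the binning range is the mean plus/minus d standard deviations per dimension, that values outside the range are clipped into the border bins, and that each multidimensional cell is stored under a single integer key; the function name `joint_histogram` and these conventions are assumptions made for illustration and may differ from the representation in [38].

```python
import numpy as np
from collections import Counter

def joint_histogram(projected, mean, std, n_bins=5, d=5.0):
    """Sparse, normalised joint histogram of M-dimensional feature vectors.

    projected: (n_samples, M) array; mean, std: length-M training statistics.
    Returns {cell_id: relative frequency} containing only non-empty cells.
    """
    lo = mean - d * std
    width = 2.0 * d * std / n_bins
    idx = np.floor((projected - lo) / width).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)        # outliers go to border bins (assumption)
    # Encode the M bin indices of each sample as one integer in base n_bins.
    weights = n_bins ** np.arange(projected.shape[1])
    cells = idx @ weights
    counts = Counter(cells.tolist())
    total = float(len(cells))
    return {cell: n / total for cell, n in counts.items()}
```

With `n_bins=2` every dimension is split at its training mean, which corresponds to a binary descriptor; the dictionary only grows with the number of occupied cells, not with $(n_{bins})^M$.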
In this work, although the statistics are computed one frame at a time, they are later aggregated into a single global histogram per video. It is, however, straightforward to instead compute the video descriptors regionally over both time and space to classify new videos after having seen only a limited number of frames or to use the video descriptor for distinguishing regions containing different motion patterns.

Covariance and invariance properties
The scale-covariance properties of the spatio-temporal receptive fields and the PCA components, according to the theory presented in Appendix A.1, imply that a histogram descriptor constructed from these primitives will be scale covariant over all non-zero spatial scaling factors and for temporal scaling factors that are integer powers of the distribution parameter c of the time-causal limit kernel. This means that our proposed video descriptors can be used as the building blocks of a scale-invariant recognition framework.
A straightforward option for this is to use our proposed video descriptors in a multi-scale recognition framework, where each video is represented by a set of descriptors computed at multiple spatial and temporal scales, both during training and testing. A scale-covariant descriptor then implies that if training is performed for a video sequence at scale $(s_1, \tau_1)$ and if a corresponding video sequence is rescaled by spatial and temporal scaling factors $S_s$ and $S_\tau$, corresponding recognition can be performed at scale $(s_2, \tau_2) = (S_s^2 s_1, S_\tau^2 \tau_1)$. However, for the initial experiments performed in this work this option for scale-invariant recognition has not been explored. Instead, training and classification are performed at the same scale or the same set of scales and the outlined scale-invariant extension is left for future work.
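The scale matching relation can be made concrete with a small worked sketch. It assumes the relation $(s_2, \tau_2) = (S_s^2 s_1, S_\tau^2 \tau_1)$ stated above and the restriction of temporal scaling factors to integer powers of the distribution parameter c; the function names are hypothetical.

```python
import math

def matched_scales(s1, tau1, S_s, S_tau):
    """Scale (s2, tau2) at which a rescaled video matches training scale (s1, tau1)."""
    return S_s ** 2 * s1, S_tau ** 2 * tau1

def is_integer_power(S_tau, c=2.0, tol=1e-9):
    """True if the temporal scaling factor S_tau equals c**j for some integer j."""
    j = math.log(S_tau) / math.log(c)
    return abs(j - round(j)) < tol
```

For example, a video enlarged spatially by a factor of 2 and slowed down by a factor of 2 relative to a training video described at $(s_1, \tau_1) = (4, 1)$ would be matched at $(s_2, \tau_2) = (16, 4)$, and with c = 2 the temporal factor 2 is an admissible integer power of c.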

Choice of receptive fields and descriptor parameters
The basic video descriptor described above will give rise to a family of video descriptors when varying the set of receptive fields and the descriptor parameters. Here, we describe the different options investigated in this work considering: (i) the set of receptive fields, (ii) the number of bins and the number of principal components used for constructing the histogram and (iii) the spatial and temporal scales of the receptive fields.

Receptive field sets
The set of receptive fields used as primitives for constructing the histogram will determine the type of information that is represented in the video descriptor. A straightforward example of this is that using rotationally invariant differential invariants will imply a rotationally invariant video descriptor. A second example is that including or excluding purely temporal derivatives will enable or disable capturing temporal intensity changes not mediated by spatial motion. We have chosen to compare video descriptors based on four different receptive field groups as summarised in Table 1.
Table 1 (columns: Name; Receptive field set): The video descriptors investigated in this paper and the receptive field sets they are based on.
First, note that all video descriptors, except STRF N-jet (previous), include first- and second-order spatial and temporal derivatives in pairs $\{L_x, L_{xx}\}$, $\{L_t, L_{tt}\}$, $\{L_{xt}, L_{xtt}\}$, $\{L_{xxt}, L_{xxtt}\}$ etc. The motivation for this is that first- and second-order derivatives provide complementary information and, by including both, equal weight is put on first- and second-order information. It has specifically been observed that biological receptive fields occur in pairs of odd-shaped and even-shaped receptive field profiles that can be well approximated by Gaussian derivatives (Koenderink and van Doorn [33], De Valois et al. [71], Lindeberg [42]). In the following, we describe the four video descriptors in more detail and further motivate the choice of their respective receptive field sets.
Figure caption (fragment): Snapshots of receptive field responses are shown for two dynamic textures from the DynTex classes "waves" (top) and "traffic" (bottom) ($\sigma_s = 8$, …
RF Spatial is a purely spatial descriptor based on the full spatial N-jet up to order two. This descriptor will capture the spatial patterns in "snapshots" of the scene (single frames) independent of the presence of movement. Using spatial derivatives up to order two means that each histogram cell template will represent a discretized second-order approximation of the local spatial image structure. An additional motivation for using this receptive field set is that this descriptor is one of the best performing spatial descriptors for the receptive field based object recognition method in [38]. This descriptor is primarily included as a baseline to compare the spatio-temporal descriptors against.
STRF N-jet is a directionally selective spatio-temporal descriptor, where the first- and second-order spatial derivatives are complemented with the first- and second-order temporal derivatives of these as well as the first- and second-order temporal derivatives of the smoothed video L. Including purely temporal derivatives means that the descriptor can capture intensity changes not mediated by spatial motion (flicker). The set of mixed spatio-temporal derivatives will on the other hand capture the interplay between changes over the spatial and temporal domains, such as movements of salient spatial patterns. An additional motivation for including mixed spatio-temporal derivatives is that they represent features that are well localised with respect to joint spatio-temporal scales. This implies that when using multiple scales, a descriptor including mixed spatio-temporal derivatives will have better abilities to separate spatio-temporal patterns at different spatio-temporal scales.
STRF RotInv is a rotationally invariant video descriptor based on a set of rotationally invariant features over the spatial domain: the spatial gradient magnitude $|\nabla L| = \sqrt{L_x^2 + L_y^2}$, the spatial Laplacian $\nabla^2 L = L_{xx} + L_{yy}$ and the determinant of the spatial Hessian $\det \mathcal{H} L = L_{xx} L_{yy} - L_{xy}^2$. These are evaluated on the smoothed video $L$ directly and on the first- and second-order temporal derivatives of the scale-space representation, $L_t$ and $L_{tt}$. One motivation for choosing these spatial differential invariants is that they are functionally independent and span the space of rotationally invariant first- and second-order differential invariants over the spatial domain. This set of rotationally invariant features was also demonstrated to be the basis of one of the best performing spatial descriptors in [38]. By applying these differential operators to the first- and second-order temporal derivatives of the video, the interplay between temporal and spatial intensity variations is captured.
STRF N-jet (previous) is our previously published [28] video descriptor. This descriptor is included for reference and differs from STRF N-jet by lacking the second-order temporal derivatives of the first- and second-order spatial derivatives and by not comprising any parameter tuning.
It can be noted that none of these video descriptors makes use of the full spatio-temporal 4-jet. This reflects the philosophy of treating space and time as distinct dimensions, where the most relevant information lies in the interplay between spatial changes (here, of first- and second-order) with temporal changes (here, of first- and second-order). Third- and fourth-order information with respect to either the spatial or the temporal domain is thus discarded. Receptive field responses for two videos of dynamic textures are shown for spatio-temporal partial derivatives in Figure 5 and for rotational differential invariants in Figure 6.
It should be noted that the recognition framework presented here also allows for using non-separable receptive fields with non-zero image velocities, but we have chosen to not investigate that option in this first study.
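The composition of the four receptive field sets described above can be summarised in a small sketch. The string labels (for example `grad_mag` or `det_hessian`) and the helper `temporal` are hypothetical names used for illustration; the count for STRF N-jet (previous) follows from the verbal description rather than from a number stated in the text.

```python
# Spatial N-jet up to order two (the RF Spatial set).
SPATIAL = ["L_x", "L_y", "L_xx", "L_xy", "L_yy"]

def temporal(names, order):
    """Append `order` temporal derivatives to each receptive field name."""
    return [name + "t" * order for name in names]

RF_SETS = {
    "RF Spatial": SPATIAL,
    # Spatial derivatives, their first- and second-order temporal derivatives,
    # and the purely temporal derivatives L_t and L_tt.
    "STRF N-jet": SPATIAL + temporal(SPATIAL, 1) + temporal(SPATIAL, 2) + ["L_t", "L_tt"],
    # As STRF N-jet, but without second-order temporal derivatives
    # of the spatial derivatives.
    "STRF N-jet (previous)": SPATIAL + temporal(SPATIAL, 1) + ["L_t", "L_tt"],
    # Three spatial differential invariants applied to L, L_t and L_tt.
    "STRF RotInv": [f"{inv}({base})"
                    for base in ("L", "L_t", "L_tt")
                    for inv in ("grad_mag", "laplacian", "det_hessian")],
}

for name, fields in RF_SETS.items():
    print(f"{name}: {len(fields)} receptive fields")
```

Listing the sets this way makes it easy to see which primitives each descriptor shares with the others and which are unique to it.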

Number of bins and principal components
Different choices of the number of bins $n_{bins}$ and the number of principal components $n_{comp}$ will give rise to qualitatively different histogram descriptors. Using few principal components in combination with many bins will enable fine-grained recognition of a smaller number of similar pattern templates (separating patterns based on smaller magnitude differences in receptive field responses). On the other hand, using a larger number of principal components in combination with fewer bins will imply a histogram capturing a larger number of more varied but less "precise" patterns. The different options that we have considered in this work are:
$n_{comp} \in [2, 17]$
$n_{bins} \in \{2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25\}$
After a set of initial experiments, where we varied the number of bins and the number of principal components (presented in Section 7.1), we noted that binary histograms with 10-17 principal components achieve highly competitive results for all benchmarks. Binary histogram descriptors also have the appeal of simplicity and of having one less parameter to tune. Therefore, the subsequent experiments (Section 7.2 and forward) were performed using binary histograms only.

Binary histograms
When choosing $n_{bins} = 2$, equivalent to a joint binary histogram, the local image structure is described by only the sign of the different image measurements. This will make the descriptor invariant to uniform rescalings of the intensity values, such as multiplicative illumination transformations, or indeed to any change that does not affect the sign of the receptive field response. Binary histograms in addition enable combining a larger number of image measurements without a prohibitively large descriptor dimensionality, and have proven an effective approach in a large number of LBP-inspired methods.
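The sign-based coding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the image measurements (for example principal components of receptive field responses) are assumed to be precomputed, the function name `binary_histogram` is hypothetical, and both the mapping of exact zeros to the lower bin and the normalisation of the histogram are assumptions.

```python
def binary_histogram(responses):
    """Joint binary histogram over sign patterns.

    `responses` is a list of equal-length tuples, one per space-time point,
    each holding n_comp image measurements. Every tuple is reduced to a bit
    pattern (1 if the measurement is positive, else 0) that indexes one of
    2**n_comp histogram cells. Mapping exact zeros to 0 is an assumption.
    """
    n_comp = len(responses[0])
    hist = [0.0] * (2 ** n_comp)
    for r in responses:
        index = sum((1 << i) for i, value in enumerate(r) if value > 0)
        hist[index] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

# Toy measurements at three space-time points (made-up values).
responses = [(0.3, -1.2, 0.7), (-0.1, 0.4, 0.2), (0.9, -0.5, 0.1)]
# A uniform positive rescaling leaves all signs, and hence the histogram, unchanged.
rescaled = [tuple(2.5 * v for v in r) for r in responses]
print(binary_histogram(responses) == binary_histogram(rescaled))
```

With n_comp measurements the histogram has 2**n_comp cells, which is why a binary coding allows many more measurements to be combined than a coding with many bins per dimension.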

Spatial and temporal scales
Including receptive fields of multiple spatial and temporal scales in the descriptor enables capturing image structures of different spatial extent and temporal duration. Such a multiresolution descriptor also comprises more complex image primitives, since patterns of different spatial and temporal scales can be combined. The spatial and temporal scales, i.e. the standard deviations of the respective scale-space kernels, considered in this work are:
$\sigma_s \in \{1, 2, 4, 8, 16\}$ pixels
$\sigma_\tau \in \{50, 100, 200, 400\}$ ms
We thus consider 20 combinations of a single spatial scale with a single temporal scale and 12 combinations of two spatial scales and two temporal scales.

Datasets
We evaluate our proposed approach on six standard dynamic texture recognition/classification benchmarks from two widely used dynamic texture datasets: UCLA (Soatto et al. [66]) and DynTex (Péteri et al. [54]). We here give a brief description of the datasets and the benchmarks. Sample frames from the datasets are seen in Figure 1 (UCLA) and Figure 2 (DynTex).

UCLA
The UCLA dataset was introduced by Soatto et al. [66] and is composed of 200 videos (160 × 110 pixels, 15 fps) featuring 50 different dynamic textures with 4 samples from each texture. The UCLA50 benchmark [66] divides the 200 videos into 50 classes with one class per individual texture/scene. It should be noted that this partitioning is not conceptual in the sense of the classes constituting different types of textures such as "fountains", "sea" or "flowers" but instead targets instance-specific and viewpoint-specific recognition. This means that not only different individual fountains but also the same fountain seen from two different viewpoints should be separated from each other.
Since for many applications it is more relevant to recognise different dynamic texture categories, a partitioning of the UCLA dataset into conceptual classes, UCLA9, was introduced by Ravichandran et al. [59] with the following classes: boiling water (8), fire (8), flowers (12), fountains (20), plants (108), sea (12), smoke (4), water (12) and waterfall (16), where the numbers correspond to the number of samples from each class. Because plant videos are heavily overrepresented in this benchmark, they are excluded in the UCLA8 benchmark to get a more balanced dataset, leaving 92 videos from eight conceptual classes.

DynTex
A larger and more diverse dynamic texture dataset, DynTex, was introduced by Péteri et al. [54], featuring a larger variation of dynamic texture types recorded under more diverse conditions (720 × 576 pixels, 25 fps). From this dataset, three gradually larger and more challenging benchmarks have been compiled by Dubois et al. [12]. The Alpha benchmark includes 60 dynamic texture videos from three different classes: sea, grass and trees. There are 20 examples of each class and some variations in scale and viewpoint. The Beta benchmark includes 162 dynamic texture videos from ten classes: sea, vegetation, trees, flags, calm water, fountain, smoke, escalator, traffic and rotation. There are 7 to 20 examples of each class. The Gamma benchmark includes 264 dynamic texture videos from ten classes: flowers, sea, trees without foliage, dense foliage, escalator, calm water, flags, grass, traffic and fountains. There are 7 to 38 examples of each class and this benchmark features the largest intraclass variability in terms of scale, orientation, etc.

Experimental setup
This section describes the cross-validation schemes used for the different benchmarks, the classifiers and the use of parameter tuning over the descriptor parameters.

Benchmark cross-validation schemes
The standard test setup for the UCLA50 benchmark, which we adopt also here, is 4-fold cross-validation [66]. For each partitioning, three out of four samples from each dynamic texture instance are used for training, while the remaining one is held out for testing.
The standard test setup for UCLA8 and UCLA9 is to report the average accuracy over 20 random partitions, with 50 % of the data used for training and 50 % for testing (randomly bisecting each class) [18]. We use the same setup here, except that we report results as an average over 1000 trials to get more reliable statistics. We do so because we noted that, given the small size of the dataset, the specific random partitioning will otherwise affect the result. For all the UCLA benchmarks, in contrast to the most common setup of using manually extracted patches, we use the non-cropped videos; our setup could thus be considered a slightly harder problem.
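The repeated random bisection protocol can be sketched as below. This is an illustrative outline under stated assumptions: the function names and the `evaluate` callback (standing in for training a classifier and returning its test accuracy) are hypothetical, and the per-class counts are those listed for UCLA9 with the plants class removed.

```python
import random

# Number of videos per class in UCLA8 (UCLA9 without the "plants" class).
UCLA8_COUNTS = {"boiling water": 8, "fire": 8, "flowers": 12, "fountains": 20,
                "sea": 12, "smoke": 4, "water": 12, "waterfall": 16}

def bisect_each_class(labels, rng):
    """Randomly assign half of each class to training and the rest to testing."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        half = len(idx) // 2
        train += idx[:half]
        test += idx[half:]
    return train, test

def mean_accuracy(labels, evaluate, n_trials=1000, seed=0):
    """Average `evaluate(train, test)` over repeated random bisections."""
    rng = random.Random(seed)
    scores = [evaluate(*bisect_each_class(labels, rng)) for _ in range(n_trials)]
    return sum(scores) / len(scores)

labels = [cls for cls, n in UCLA8_COUNTS.items() for _ in range(n)]
train, test = bisect_each_class(labels, random.Random(0))
print(len(labels), len(train), len(test))
```

With only a handful of videos in some classes, a single partition can change the outcome noticeably, which is the motivation given above for averaging over many trials.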
For the DynTex benchmarks, the experimental setup used is leave-one-out cross-validation as in [25,3,81,55]. We perform no subsampling of videos but use the full 720 × 576 pixels frames.

Classifiers
We present results of using both a support vector machine (SVM) classifier and a nearest neighbour (NN) classifier, the latter to evaluate the performance also of a simpler classifier without hidden tunable parameters. For NN we use the $\chi^2$-distance $d(x, y) = \sum_i (x_i - y_i)^2 / (x_i + y_i)$ and for SVM the $\chi^2$-kernel $e^{-\gamma d(x, y)}$. Results are quite robust to the choice of the SVM hyperparameters $\gamma$ and $C$. We here use $\gamma = 0.1$ and $C = 10{,}000$ for all experiments.
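The distance and kernel above can be written out directly. The sketch below is a plain-Python illustration with hypothetical function names and made-up toy histograms and labels; skipping bins where both histograms are empty is an assumed convention for avoiding division by zero.

```python
import math

def chi2_distance(x, y):
    """d(x, y) = sum_i (x_i - y_i)**2 / (x_i + y_i) for non-negative histograms.

    Bins where both histograms are empty are skipped to avoid division by
    zero; this convention is an assumption.
    """
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def chi2_kernel(x, y, gamma=0.1):
    """exp(-gamma * d(x, y)), with gamma = 0.1 as stated in the text."""
    return math.exp(-gamma * chi2_distance(x, y))

def nearest_neighbour(query, train_hists, train_labels):
    """Label of the training histogram closest to `query` in chi-square distance."""
    best = min(range(len(train_hists)),
               key=lambda i: chi2_distance(query, train_hists[i]))
    return train_labels[best]

# Toy histograms and labels, made up for illustration.
train_hists = [[0.7, 0.2, 0.1, 0.0], [0.1, 0.1, 0.4, 0.4]]
train_labels = ["sea", "fire"]
print(nearest_neighbour([0.6, 0.3, 0.1, 0.0], train_hists, train_labels))
```

A kernel matrix built from `chi2_kernel` could be passed to any SVM implementation that accepts precomputed kernels.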

Descriptor parameter tuning
Comparisons with state-of-the-art and between video descriptors based on different sets of receptive fields are done using binary descriptors. Parameter tuning is performed as a grid search over the number of principal components $n_{comp} \in [2, 17]$, spatial scales $\sigma_s \in \{1, 2, 4, 8, 16\}$ and temporal scales $\sigma_\tau \in \{50, 100, 200, 400\}$. For spatial and temporal scales, we consider both single scales and combinations of two adjacent spatial and temporal scales. The standard evaluation protocols for the respective benchmarks (i.e. slightly different cross-validation schemes) are used for parameter selection and results are reported for each video descriptor using the optimal set of parameters.
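The size of this search space can be made concrete with a short sketch. The helper names and the `evaluate` callback (standing in for the benchmark's cross-validation protocol) are hypothetical; the parameter ranges are those given above, and the resulting 20 single-scale and 12 two-scale combinations match the counts stated earlier in the text.

```python
from itertools import product

N_COMP = range(2, 18)             # number of principal components, 2 to 17
SIGMA_S = [1, 2, 4, 8, 16]        # spatial scales
SIGMA_T = [50, 100, 200, 400]     # temporal scales

def adjacent_pairs(values):
    """Pairs of neighbouring scales, e.g. (1, 2), (2, 4), ..."""
    return list(zip(values, values[1:]))

# Single-scale settings and 2 x 2 settings built from adjacent scale pairs.
single = [((s,), (t,)) for s, t in product(SIGMA_S, SIGMA_T)]
double = list(product(adjacent_pairs(SIGMA_S), adjacent_pairs(SIGMA_T)))
grid = [(n, s, t) for n in N_COMP for s, t in single + double]

def grid_search(evaluate):
    """Return the configuration with the highest score under `evaluate`."""
    return max(grid, key=lambda cfg: evaluate(*cfg))

print(len(single), len(double), len(grid))
```

Each configuration requires a full run of the benchmark protocol, which gives a sense of why nested cross-validation on top of this grid would be costly.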
Not fully separating the test and the training data during parameter selection does introduce the risk of model selection bias, but it should be noted that this is standard practice on these benchmarks. This is due to the fact that a full nested cross-validation scheme is very computationally expensive and that the small size of the benchmarks means that using a subset of videos as a validation set is not a viable option.

Experiments
Our first experiments consider a qualitative and quantitative evaluation of different versions of our video descriptors, where we present results on: (i) varying the number of bins and principal components, (ii) using different spatial and temporal scales for the receptive fields and (iii) comparing descriptors based on different sets of receptive fields. This is followed by (iv) a comparison with state-of-the-art dynamic texture recognition methods and finally (v) a qualitative analysis of the reasons for errors.

Number of bins and principal components
The classification performance of the STRF N-jet descriptor as a function of the number of bins and the number of principal components used in the histogram descriptor is presented in Figure 7 for the UCLA8 and UCLA50 benchmarks and in Figure 8 for the Beta and Gamma benchmarks. A first observation is that, not surprisingly, when using a smaller number of principal components, each dimension needs to be divided into a larger number of bins to achieve good performance, e.g. for $n_{comp} \in \{2, 3\}$ the best performance is achieved for $n_{bins} \geq 15$ for all benchmarks. To discriminate between a large number of spatio-temporal patterns using only a few image measurements, these need to be more precisely recorded. A qualitative difference between using an odd or an even number of bins for $n_{bins} \leq 8$ can also be noted. This can be explained by a qualitative difference in the handling of feature values close to zero.
At the other end of the spectrum, it can be seen that when using a large number of principal components, fewer bins suffice. Using a large number of spatio-temporal primitives in combination with a small number of bins means that the different qualitative "types" of patterns are more diverse, while at the same time being less "precise" in the sense of being unaffected by small changes in the magnitude of the filter responses. Binary or ternary descriptors are thus less sensitive to variations of the same rough type of space-time structure. Indeed, for binary descriptors only the sign of the receptive field response is recorded and a binary descriptor thus gives full invariance to e.g. multiplicative illumination transformations.
For the larger Beta and Gamma benchmarks, it is clear that the descriptors that in this way combine a large number of image measurements with binary or ternary histograms achieve superior performance. This indicates that for these larger, more complex datasets, capturing the essence of a local space-time pattern rather than its more precise appearance is the right trade-off. In fact, the best results can for all benchmarks be achieved using $n_{bins} = 2$ and $n_{comp} \in [10, 17]$, with the single exception of 0.3 percentage points lower error on the UCLA8 dataset if instead using a ternary descriptor. We thus conclude that binary histogram descriptors are a very useful option, combining top performance with simplicity. Therefore, we in the following investigate the effect of varying the remaining descriptor parameters using binary descriptors only.

Spatial and temporal scales
Each dataset will have a set of scales that are better for describing the spatial patterns and motion patterns present in the videos. The classification performance of the STRF N-jet descriptor as a function of the spatial and the temporal scales of the receptive fields, for different combinations of a single spatial scale $\sigma_s \in \{1, 2, 4, 8, 16\}$ and a single temporal scale $\sigma_\tau \in \{50, 100, 200, 400\}$, is shown in Figure 10 for the UCLA benchmarks and in Figure 11 for the DynTex benchmarks. All results have been obtained with $n_{comp} = 15$ and $n_{bins} = 2$.
For all the UCLA benchmarks, an approximately unimodal maximum over scales is obtained. For the UCLA8 and UCLA9 benchmarks, the best performance is obtained when combining a smaller spatial scale with a shorter temporal scale. For the UCLA50 benchmark, the best results are instead achieved for shorter temporal scales in combination with larger spatial scales. The observation that a short temporal scale works well for all benchmarks could indicate that fast motions are discriminative. That the best spatial scales are different for UCLA50 is not strange, since this benchmark features instance recognition (e.g. separating the 108 plant videos into individual instances) rather than generalising between classes. Although it might feel intuitive that small details should be useful for instance recognition, this will depend on the dataset. For example, plants with similar leaves but different global growth patterns could be easier to separate at a larger spatial scale.
For the DynTex benchmarks, the scale combinations that give the best results are scattered rather than showing a unimodal maximum. This could indicate that the different subsets of dynamic textures are best separated at different (and non-adjacent) scales. Since the DynTex dataset is quite diverse, this would not be strange. It should also be noted that the differences between the best and the second best results are here typically only one or two correctly classified videos. It is, however, clear that using the largest spatial scale in combination with the longest temporal scale gives markedly worse results.
When using 2 × 2 scales, we noted a similar performance pattern during scale tuning, with unimodal maxima for the UCLA benchmarks and scattered maxima for the DynTex benchmarks (not shown). Comparing the absolute performance when using single vs multiple scales, it depends on the receptive field set whether using multiple scales gives a consistent advantage. Inspecting the sets of optimal parameters found for the different benchmarks (presented in Appendix A.2), it can be noted that, for the STRF N-jet descriptor, the best results are sometimes achieved using a single scale and sometimes when using 2 × 2 scales. However, STRF N-jet includes a quite large number (17) of receptive fields, and when using a smaller set of receptive fields, such as in RF Spatial (5 receptive fields) or STRF RotInv (9 receptive fields), video descriptors using multiple spatio-temporal scales consistently have the best performance. This shows that receptive fields at different scales can contain complementary information.
We conclude that, although results competitive with many state-of-the-art methods can be obtained for a heuristic choice of spatial and temporal scales, using parameter tuning to find an appropriate scale/set of scales may lead to improved performance of a few percentage points.

Receptive field sets
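In this section, we present results on relative performance between our four proposed video descriptors constructed from different sets of receptive fields (see Section 4.6.1): (i) the new spatio-temporal descriptor STRF N-jet, (ii) the rotationally invariant descriptor STRF RotInv, (iii) the purely spatial descriptor RF Spatial and (iv) our previously published descriptor STRF N-jet (previous).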
A comparison of the classification performance of these four video descriptors across all benchmarks is shown in Figure 9. The performance of all four video descriptors is also compared to state-of-the-art in Table 2 and Table 3.

STRF N-jet (previous) vs STRF N-jet
We note that parameter tuning and adding the second-order temporal derivatives of the spatial derivatives result in improved performance for our new STRF N-jet descriptor compared to the STRF N-jet (previous) descriptor [28]. The new descriptor shows improved accuracy for all the benchmarks. We have also observed an improvement from both these changes individually (not explicitly shown here).

Spatio-temporal vs spatial descriptors
A comparison between the STRF N-jet descriptor and RF Spatial reveals improved accuracy when including spatio-temporal receptive fields for the UCLA8, UCLA9, Alpha and Gamma benchmarks. Note that a comparison to the STRF N-jet (previous) descriptor is less relevant, since that descriptor is, in contrast to the others, not subject to parameter tuning. The largest improvement is obtained for the Gamma benchmark, where adding spatio-temporal receptive fields reduces the error from 9.5 % to 4.5 % when using an SVM classifier. Smaller improvements are obtained for the UCLA8 and UCLA9 benchmarks, with a reduction in error from 2.2 % to 1 % and from 1.4 % to 0.8 %, respectively. For the UCLA50 benchmark, the performance saturates at 100 % for both descriptor types (rather indicating the relative simplicity of this benchmark). The only exception where RF Spatial shows better performance is for the Beta benchmark using a NN classifier. Here, the purely spatial descriptor achieves 5.6 % error vs 6.2 % error for STRF N-jet.
Competitive performance for purely spatial descriptors on the Beta benchmark has been reported previously [55] and we here make a similar observation. Thus, not surprisingly, for some settings genuine spatio-temporal information is of greater importance than for others. Here, the largest gain is indeed obtained for the most complex task.

Rotationally invariant descriptors
The rotationally invariant STRF RotInv descriptor does not achieve quite as good performance as the directionally dependent STRF N-jet descriptor for the tested benchmarks. The difference in classification accuracy in favour of the directionally selective descriptor is most pronounced for the more complex DynTex benchmarks: STRF RotInv achieves 7.4 % and 6.8 % error on the Beta and Gamma benchmarks using an SVM classifier, compared to STRF N-jet with 4.9 % and 4.5 % error, respectively. However, a comparison with state-of-the-art in Table 2 and Table 3 reveals that the STRF RotInv descriptor still achieves competitive performance compared to other dynamic texture recognition approaches.
It is of conceptual interest that these good results can be obtained also when disregarding orientation information completely. Indeed, if considering marginal histograms of receptive field responses, the most striking differences between texture classes such as waves, grass and foliage are the typical directions of change (waves show a stronger gradient in the vertical direction, grass in the horizontal and foliage in both). A qualitative conclusion is that directional information is not the main mode of recognition here; instead, the local space-time structure independent of orientation is highly discriminative. We conclude that our proposed STRF RotInv descriptor could be a viable option for tasks where rotation invariance is of greater importance than for these benchmarks. However, the possible gain from enabling recognition of textures at orientations not present in the training data will have to be balanced against the possible gain from discriminative directional information.

Comparison to state-of-the-art
This section presents a comparison between our proposed approach and state-of-the-art dynamic texture recognition methods. We include video descriptors constructed from four different sets of receptive fields (see Table 1) and compare against the best performing methods found in the literature for each benchmark. We also aim to include a range of different types of approaches with an extra focus on methods similar to ours, i.e. different LBP versions and relatively shallow (max 2 layers) spatio-temporal filtering based approaches using either handcrafted filters or filters learned from data. Results for all the other methods are taken from the literature, where the relevant references are indicated in the table.
Figure caption: For the UCLA8 benchmark, the best results are obtained when using more than eight principal components in combination with binary or ternary histograms. For the UCLA50 benchmark, a large range of different configurations result in 0 % error, indicating that the task of recognising dynamic texture instances is less challenging compared to separating dynamic texture categories. Spatial scales: $(\sigma_{s_1}, \sigma_{s_2}) = (1, 2)$ pixels. Temporal scales: $(\sigma_{\tau_1}, \sigma_{\tau_2}) = (50, 100)$ ms.
Figure caption: The best results are obtained when using more than eight principal components in combination with binary or ternary histograms. Thus, for these more complex benchmarks, discarding too much information in the dimensionality reduction step impairs the performance. Spatial scales: $(\sigma_{s_1}, \sigma_{s_2}) = (4, 8)$ pixels. Temporal scales: $(\sigma_{\tau_1}, \sigma_{\tau_2}) = (100, 200)$ ms.
Fig. 9: The classification performance for video descriptors constructed from different sets of receptive fields. Left: For the UCLA benchmarks. Right: For the DynTex benchmarks. The STRF N-jet descriptor achieves improved performance compared to RF Spatial (with the single exception of the Beta NN benchmark). This shows that the spatio-temporal receptive fields provide complementary information. The rotationally invariant STRF RotInv descriptor is also a competitive option. The new STRF N-jet achieves better results than STRF N-jet (previous) [28]. All descriptors are binary. Additional parameters for each of these benchmark results are given in Appendix A.2.
Figure caption (fragment): … for the descriptor STRF N-jet with $n_{bins} = 2$ and $n_{comp} = 17$. Note the different ranges used for colour coding the maps.
Figure caption (fragment): Several local optima are obtained for all the three benchmarks. This suggests that different dynamic texture classes are best separated at different (non-adjacent) scales, which could be due to the larger diversity of the dynamic texture types present in these benchmarks.
Table 2: Comparison to state-of-the-art for the UCLA benchmarks. Our proposed STRF N-jet descriptor shows consistently very competitive results for all these benchmarks, achieving the highest mean accuracy averaged over all benchmarks as well as the single best result on four out of the six benchmarks. All the STRF and RF descriptors are binary ($n_{bins} = 2$) and parameter tuning has been performed except for STRF N-jet (previous).
Table 3: Comparison to state-of-the-art for the DynTex benchmarks. Our proposed STRF N-jet descriptor ranks at the very top among the greyscale methods, showing better performance than a large range of similar methods using different spatio-temporal primitives. All the STRF and RF descriptors are binary ($n_{bins} = 2$) and parameter tuning has been performed except for STRF N-jet (previous). The parameter values are given in Appendix A.2. Bold font = best greyscale descriptor for each single benchmark, italics font = best colour descriptor, * indicates a different train/test partitioning for SVM and † the use of a nearest centroid classifier. Highlighted rows = our proposed descriptors.

UCLA datasets
The UCLA benchmark results are presented in Table 2. Our proposed STRF N-jet descriptor shows highly competitive performance compared to all the other methods, achieving the highest mean accuracy averaged over all the benchmarks and either the single best or the shared best result on four out of the six benchmarks.
For the UCLA50 benchmark, our three new video descriptors achieve 0 % error using both an SVM and a NN classifier. The main difference between these descriptors and the untuned STRF N-jet (previous) is the use of a larger spatial scale, which was seen in Section 7.2 to be more adequate for this benchmark. Enhanced LBP [60] and Ensemble SVMs [81] also achieve 0 % error rate and there are several methods with error rates below 0.5 %. The main conclusions we draw from the UCLA50 results are that recognising the same dynamic texture instance from the same viewpoint is (not surprisingly) a comparatively easier task than separating conceptual classes and that our approach performs on par with the best state-of-the-art methods on this task.
For the conceptual UCLA8 and UCLA9 benchmarks using an NN classifier, our STRF N-jet descriptor achieves 0.9 % and 1.0 % error, respectively, which are the single best results among all methods. This demonstrates that our approach is stable and works well with a simple classifier also for a quite high-dimensional descriptor. For the UCLA8 benchmark together with a NN classifier, the second best performing approach is our rotationally invariant descriptor STRF RotInv with 1.2 % error and after that MEWLSP [70] with 2 % error. For UCLA9, the second best performing approach is MBSIF-TOP [3] with 1.2 % error followed by STRF RotInv and MEWLSP, which both achieve 1.4 % error.
For the UCLA8 benchmark combined with an SVM classifier, the best performing approaches are OTD [58] and 3D-OTF [79] both with 0.5 % error. For UCLA9, the best method using an SVM classifier is DNGP [61], which achieves 0.4 % error. Our STRF N-jet descriptor achieves 1.0 % error on the UCLA8 benchmark, and 0.8 % error on the UCLA9 benchmark. It should be noted that OTD, 3D-OTF and DNGP simultaneously show considerably worse results on the NN benchmarks and that the standard UCLA protocol (average over 20 trials) can give quite variable results because of the limited number of samples in the benchmarks. Averaging over 1000 trials means that our results are more stable and less likely to include "outliers" for some of the benchmarks.
Our approach shows improved results on all the UCLA benchmarks compared to a large range of similar methods also based on gathering statistics of local space-time structure but using different spatio-temporal primitives. This includes methods that are more complex in the sense of combining several different descriptors or a larger number of feature extracting steps (MEWLSP [70], HOG-NSP [52]), methods learning higher-level hierarchical features (PCANet-TOP [2], SKDL [57], temporal dropout DL, DT-CNN [1]) and improved and extended LBP-based methods (Enhanced LBP [60], MBSIF-TOP [3], MEWLSP [70], CVLBP [69]) as well as the standard LBP-TOP [84] and VLBP [83] descriptors. An interesting observation is also that compared to VLBP and CVLBP, which similar to our approach use binary histograms and full 2D+T primitives, the performance of our approach is 2.1 to 10.5 percentage points better for all the benchmarks. The most important difference between these methods and our approach is indeed the spatio-temporal primitives used for computing the histogram.

DynTex datasets
The DynTex benchmark results are presented in Table 3. For this larger and more complex dataset, it can be seen that utilising colour information and supervised hierarchical feature learning seems to give a clear advantage with three deep learning approaches on top. DT-CNN demonstrates the best performance with 0 % error on the Alpha and Beta benchmarks and 0.4 % error on the Gamma benchmark. However, although included for reference, we do not directly aim here to compete with these more complex methods. The main focus of our work is instead to evaluate the usefulness of the time-causal spatio-temporal primitives without entanglement with a complex hierarchical framework.
Our proposed STRF N-jet descriptor achieves 0 % error on the Alpha benchmark using both an SVM and a NN classifier, 4.9 % (SVM) and 6.2 % (NN) error on the Beta benchmark and 4.5 % (SVM) and 8.8 % error (NN) on the Gamma benchmark. This means that we achieve better results than all other non-deep learning methods utilising only grey scale information except one: MR-SFA [49] which achieves 1 % error on the Beta SVM benchmark and 1.9 % error on the Beta NN benchmark (this method has not been tested on the Alpha and Gamma benchmarks). It should, however, be noted that MR-SFA uses regional descriptors capturing the relative location of image structures and a bag-of-words framework on top of the histogram descriptor. This approach is thus significantly more complex compared to our method. Both these extensions would be relatively straightforward to implement also using our proposed video descriptors.
Our results can also be compared to the LBP-TOP extension AFS-TOP, which shows the most competitive results using an SVM classifier with 1.7 %, 9.9 % and 5.7 % error, respectively, on the Alpha, Beta and Gamma benchmarks. Our approach thus achieves better performance on all the tested benchmarks, although AFS-TOP includes several added features, such as removing outlier frames. Improvements compared to the basic LBP-TOP descriptor are larger and this can be considered a more fair benchmark, since we are testing an early version of our approach. Compared to MBSIF-TOP and PCANet-TOP, both learning 2D filters from data and applying those on three orthogonal planes, our approach also achieves better results on all the DynTex benchmarks.
We also show notably better results (in the order of 10-20 percentage points) than those reported from using DFS [80], OTD [58], SKDL [57] and the 2D+T curvelet transform [12]. However, those use a nearest centroid classifier and a different SVM train-test partition, which means that a direct comparison is not possible. We also note that although STRF N-jet achieves the best results, the rotationally invariant descriptor version STRF RotInv, the RF Spatial descriptor and the untuned STRF N-jet (previous) descriptor also achieve competitive performance. This demonstrates the robustness and flexibility of our approach.
In conclusion, our approach shows highly competitive performance for this larger and more complex benchmark, even though our proposed approach is a conceptually simple method utilising only local information. The STRF N-jet descriptor achieves better performance than all other grey-scale methods of similar complexity and better performance compared to both several more complex methods and two methods learning filters from data. We believe this should be considered as strong validation that these time-causal spatio-temporal receptive fields indeed capture useful spatio-temporal information.

Qualitative results
To gain more insight into the qualitative behaviour of our proposed family of video descriptors, we inspected the confusions and the closest neighbours to correctly classified and misclassified samples. Confusion matrices for the UCLA9, the Beta and the Gamma benchmarks for the STRF N-jet descriptor are presented in Figure 12. We note that the main cause of error for UCLA9 is confusing fire with smoke. UCLA8 (not shown here) shares a similar pattern. There is indeed a similarity in dynamics between these textures in the presence of temporal intensity changes not mediated by spatial movements. Confusions between flowers and plants and between fountain and waterfall are most likely caused by similarities in the spatial appearance and the motion patterns of these dynamic texture classes.
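For readers who want to reproduce this kind of inspection, the following is a minimal sketch of how a confusion matrix can be accumulated and the dominant confusion extracted. The function names and the toy labels are hypothetical and serve only as an illustration; they are not the benchmark data.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns the predicted class."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def most_confused_pair(matrix, classes):
    """Return (count, true_class, predicted_class) for the largest
    off-diagonal entry, or None if there are no misclassifications."""
    best = None
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            if i != j and n > 0 and (best is None or n > best[0]):
                best = (n, classes[i], classes[j])
    return best


# Toy example with made-up labels (not the benchmark results).
classes = ["fire", "smoke", "sea"]
true = ["fire", "fire", "fire", "smoke", "smoke", "sea", "sea"]
pred = ["fire", "smoke", "smoke", "smoke", "fire", "sea", "sea"]

matrix = confusion_matrix(true, pred, classes)
worst = most_confused_pair(matrix, classes)
print(matrix)
print(worst)
```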
When inspecting the confusions between the different classes for the Beta and Gamma benchmarks, there is no clear pattern visible. This is probably partly due to the fact that these benchmarks contain larger intraclass variabilities. Misclassifications seem to be caused by a single video in one class having some similarity to a single video of a different class rather than by certain classes being consistently mixed up. We note the largest ratio of misclassified samples for the escalator and traffic classes, which are also the classes with the fewest samples.
A bit more light is shed on the reasons for misclassifications for the DynTex benchmarks when inspecting the closest neighbours in feature space for misclassified videos (Figure 13) and for correctly classified videos (Figure 14). We note that a frequent feature of the misclassified videos is the presence of multiple textures, such as a flag with light foliage in front of it (misclassified as foliage), or a fountain flowing into a pool of calm water (misclassified as calm water). There are also examples of confusion caused by similarity in either the spatial appearance or the temporal dynamics between specific instances of different classes, such as calm water reflecting a small tree being misclassified as foliage or a field of grass waving in the wind misclassified as sea.
A subset of the misclassifications can likely be resolved by utilising colour information, since colour can be highly discriminative for dynamic textures. The state-of-the-art results summarised in Table 3 also indicate that a "deeper" framework, extracting more high-level abstract features, would most likely improve the performance. Options that may be considered are adding additional layers of abstraction before or after the histograms, for example, by learning hierarchical features from data or using bag-of-words models.

Summary and discussion
We have presented a new family of video descriptors based on joint histograms of spatio-temporal receptive field responses and evaluated several members in this family on the problem of dynamic texture recognition. This is, to our knowledge, the first video descriptor that uses joint statistics of a set of "ideal" (in the sense of derived on the basis of pure mathematical reasoning) spatio-temporal scale-space filters for video analysis and the first quantitative performance evaluation of using the family of time-causal scale-space filters derived by Lindeberg [44] as primitives for video analysis. Our proposed approach generalises a previous method by Linde and Lindeberg [37,38], based on joint histograms of receptive field responses, from the spatial to the spatio-temporal domain and from object recognition to dynamic texture recognition.
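To make the notion of a binary joint histogram of receptive field responses concrete, the following minimal sketch may be helpful. It rests on assumptions that are not stated in this section: each response is binarised by its sign, the resulting bit pattern over the n receptive fields selects one of 2^n histogram cells, and the histogram is normalised to unit sum. The function name and the toy responses are hypothetical.

```python
def binary_joint_histogram(responses):
    """Joint histogram over sign patterns of receptive field responses.

    responses: list of n equally long lists, one per receptive field,
    holding that filter's response at every sampled space-time point.
    Returns a list of 2**n cell frequencies that sum to one.

    Assumption (for illustration only): a response is binarised by
    thresholding at zero, and filter k contributes bit k of the cell index.
    """
    n_filters = len(responses)
    n_points = len(responses[0])
    hist = [0.0] * (2 ** n_filters)
    for p in range(n_points):
        cell = 0
        for k in range(n_filters):
            if responses[k][p] > 0:
                cell |= 1 << k
        hist[cell] += 1.0
    return [h / n_points for h in hist]


# Toy responses of two hypothetical filters at four space-time points.
responses = [
    [0.5, -0.2, 0.1, 0.3],   # filter 0
    [-1.0, -0.4, 0.7, 0.2],  # filter 1
]
hist = binary_joint_histogram(responses)
print(hist)
```

Because the cell index depends on the joint sign pattern of all the filters at a point, the histogram records co-occurrences of responses rather than the marginal statistics of each filter separately.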
Our experimental evaluation on several benchmarks from two widely used dynamic texture datasets demonstrates competitive performance compared to state-of-the-art dynamic texture recognition methods. The best results are achieved for our binary STRF N-jet descriptor, which combines directionally selective spatial and spatio-temporal receptive field responses. For the UCLA benchmarks, the STRF N-jet descriptor achieves the highest mean accuracy averaged over all benchmarks as well as the shared best or single best result for several benchmarks. In addition, our approach achieves improved results on all the UCLA benchmarks compared to a large range of similar methods also based on gathering statistics of local space-time structure but using different spatio-temporal primitives. For the larger, more complex DynTex benchmarks, deep learning approaches come out on top. However, also here our proposed video descriptor achieves improved performance compared to a range of similar methods, such as local binary pattern based methods including recent extensions and improvements (Zhao and Pietikainen [83,84]; Ren et al. [60], Hong et al. [25]) and two methods learning filters by means of unsupervised learning (Arashloo and Kittler [3]; Arashloo et al. [2]).
[Figure 12: confusion matrices for the UCLA9, Beta and Gamma benchmarks for the STRF N-jet descriptor; the axes list the class names with sample counts.]
The improved performance compared to approaches that learn filters from data (where for the UCLA benchmarks this also includes two deep learning based approaches) shows that designing filters based on structural priors of the world can for these tasks be as effective as learning.
It should be noted, however, that the framework presented here is not aimed at directly competing with more complex methods learning higher-level features from data. Our main objective with this study is instead to evaluate the usefulness of these new spatio-temporal primitives using a conceptually simple framework. Our approach does not include combinations of different feature types, ensemble classifiers, regional pooling to capture the relative location of features or learned or handcrafted mid-level features. However, constructing a more complex framework on top of these spatio-temporal primitives is certainly possible and would with high probability result in additional performance gains.
In summary, our conceptually simple video descriptor achieves highly competitive performance across all benchmarks compared to other grey-scale "shallow" methods and improved performance compared to all other methods of similar complexity using different spatio-temporal primitives, either handcrafted or learned from data. We consider this as strong validation that these time-causal spatio-temporal receptive fields are highly descriptive for modelling the local space-time structure and as evidence in favour of their general applicability as primitives for other video analysis tasks.
Our approach could also be implemented using non-causal Gaussian spatio-temporal scale-space kernels. This might give somewhat improved results, since at each point in time, additional information from the future could then also be used. However, a time-delayed Gaussian kernel would imply longer temporal delays, thereby making it less suited for time-critical applications, in addition to requiring more computations and larger temporal buffers. The computational advantages of these time-causal filters imply that they can, in fact, be preferable also in settings where the temporal causality might not be of primary interest.
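The small memory footprint of a time-recursive formulation can be illustrated with a sketch of temporal smoothing by a cascade of first-order recursive filters. This is an illustration under stated assumptions rather than the implementation used in the experiments: the cascade is truncated to a few levels, the scale levels are distributed geometrically with ratio c^2, and each time constant is chosen so that the variances of the discrete filters add up to the target temporal scale. The class and function names are hypothetical.

```python
import math


def time_constants(tau_max, c, n_levels):
    """Time constants of the first-order recursive filters in the cascade.

    Assumptions (for illustration): temporal scale levels
    tau_k = tau_max * c**(2 * (k - n_levels)), k = 1..n_levels, and the
    discrete filter y[t] = y[t-1] + (x[t] - y[t-1]) / (1 + mu) has
    temporal variance mu**2 + mu, which is solved for mu at each level.
    """
    taus = [tau_max * c ** (2 * (k - n_levels)) for k in range(1, n_levels + 1)]
    increments = [taus[0]] + [taus[k] - taus[k - 1] for k in range(1, n_levels)]
    return [(math.sqrt(1.0 + 4.0 * d) - 1.0) / 2.0 for d in increments]


class RecursiveTemporalSmoother:
    """Time-causal smoothing that stores one state value per cascade level."""

    def __init__(self, tau_max, c=2.0, n_levels=4):
        self.mus = time_constants(tau_max, c, n_levels)
        self.state = [0.0] * n_levels

    def update(self, sample):
        """Feed one new sample and return the output at the coarsest scale."""
        value = sample
        for k, mu in enumerate(self.mus):
            self.state[k] += (value - self.state[k]) / (1.0 + mu)
            value = self.state[k]
        return value


# Impulse response of the cascade: causal, unit mass, variance tau_max.
smoother = RecursiveTemporalSmoother(tau_max=16.0)
impulse_response = [smoother.update(1.0 if t == 0 else 0.0) for t in range(400)]
mass = sum(impulse_response)
mean = sum(t * h for t, h in enumerate(impulse_response)) / mass
variance = sum((t - mean) ** 2 * h for t, h in enumerate(impulse_response)) / mass
print(mass, variance)
```

Each update only touches the current sample and one stored value per level, so no buffer of past frames is needed. A truncated Gaussian kernel with the same variance would instead need access to a window of frames on both sides of the current time point.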
The scale-space filters used in this work have a strong connection to biology in the sense that a subset of these receptive fields very well model both spatial and spatio-temporal receptive fields in the LGN and V1 (DeAngelis et al. [10], Lindeberg [42,44]). The receptive fields in V1, V2, V4 and MT serve as input to a large number of higher visual areas. It does indeed appear attractive to be able to use similar filters for early visual processing over a wide range of visual tasks. This study can be seen as a first step in a more general investigation into what can be done with spatio-temporal features similar to those in the primate brain.
Directions for future work include: (i) A multi-scale recognition framework, where each video is represented by a set of descriptors computed at multiple spatial and temporal scales, both during training and testing. This would enable scale-invariant recognition according to the theory presented in Appendix A.1. (ii) An extension to colour by including spatio-chromo-temporal receptive fields would with high probability improve the performance for tasks where colour information is discriminative. (iii) Including non-separable spatio-temporal receptive fields with non-zero image velocities in the video descriptor according to the more general receptive field model in [44]. Such velocity-adapted receptive fields, in fact, constitute a dominant portion of the receptive fields in areas V1 and MT (DeAngelis et al. [10]). (iv) Using position-dependent histograms to take into account the relative locations of features. (v) Adding additional layers of abstraction before or after the histograms, for example, by learning hierarchical features from data or using bag-of-words models.
We propose that the spatio-temporal receptive field framework should be of more general applicability to other video analysis tasks. Time-causal spatio-temporal receptive fields are indeed the visual primitives used for solving a large range of visual tasks for biological agents. The theoretical properties of these spatio-temporal receptive fields imply that they can be used to design methods that are provably invariant or robust to different types of natural image transformations, where such invariances will reduce the sample complexity for learning. We thus see the possibility to both integrate time-causal spatio-temporal receptive fields into current video analysis methods and to design new types of methods based on these primitives.

A Appendices
A.1 Covariance properties of partial derivatives, PCA components and differential invariants over spatio-temporal scales
The video descriptors that we construct are based on combinations of partial derivatives

$$L_{x^{m_1} y^{m_2} t^n} \tag{10}$$

of different orders $(m_1, m_2, n)$ computed at multiple spatio-temporal scales $(s, \tau)$, where $s$ denotes the spatial scale parameter and $\tau$ the temporal scale parameter. Specifically, all video descriptors are expressed in terms of scale-normalised derivatives (Lindeberg [40,44])

$$\partial_\xi = s^{\gamma_s/2} \, \partial_x, \quad \partial_\eta = s^{\gamma_s/2} \, \partial_y, \quad \partial_\zeta = \tau^{\gamma_\tau/2} \, \partial_t \tag{11}$$

written

$$L_{\xi^{m_1} \eta^{m_2} \zeta^n} = s^{(m_1+m_2)\gamma_s/2} \, \tau^{n\gamma_\tau/2} \, L_{x^{m_1} y^{m_2} t^n} \tag{12}$$

where $\gamma_s > 0$ and $\gamma_\tau > 0$ are scale normalisation powers of the spatial and the temporal domains, respectively. Consider an independent scaling transformation of the spatial and the temporal domains

$$f'(x_1', x_2', t') = f(x_1, x_2, t) \tag{13}$$

for

$$(x_1', x_2', t') = (S_s x_1, S_s x_2, S_\tau t) \tag{14}$$

where $S_s$ and $S_\tau$ denote the spatial and temporal scaling factors, respectively. Define the space-time separable spatio-temporal scale-space representations $L$ and $L'$ of $f$ and $f'$, respectively, according to

$$L(x_1, x_2, t; s, \tau) = \left( T(\cdot, \cdot, \cdot; s, \tau) * f(\cdot, \cdot, \cdot) \right)(x_1, x_2, t; s, \tau) \tag{15}$$

$$L'(x_1', x_2', t'; s', \tau') = \left( T(\cdot, \cdot, \cdot; s', \tau') * f'(\cdot, \cdot, \cdot) \right)(x_1', x_2', t'; s', \tau') \tag{16}$$

Then, for the spatio-temporal scale-space kernel defined from the Gaussian kernel $g(x, y; s)$ over the spatial domain and the time-causal limit kernel $\Psi(t; \tau, c)$ over the temporal domain

$$T(x, y, t; s, \tau) = g(x, y; s) \, \Psi(t; \tau, c) \tag{17}$$

these spatio-temporal scale-space representations will be closed under independent scaling transformations of the spatial and the temporal domains

$$L'(x_1', x_2', t'; s', \tau') = L(x_1, x_2, t; s, \tau) \tag{18}$$

provided that the spatio-temporal scale levels are appropriately matched [45, Equation (25)]

$$s' = S_s^2 \, s, \quad \tau' = S_\tau^2 \, \tau. \tag{19}$$

This property holds for all non-zero spatial scaling factors $S_s$.
Because of the discrete nature of the temporal scale levels $\tau_k = \tau_0 c^{2k}$ in the time-causal temporal scale-space representation obtained by convolution with the time-causal limit kernel, this closedness property does, however, only hold for temporal scaling factors $S_\tau$ that are integer powers of the distribution parameter $c$ of the time-causal limit kernel

$$S_\tau = c^i, \quad i \in \mathbb{Z}. \tag{20}$$

Specifically, under an independent scaling transformation of the spatial and the temporal domains (14), the partial derivatives transform according to [46, Equation (9)]

$$L'_{\xi'^{m_1} \eta'^{m_2} \zeta'^n}(x', y', t'; s', \tau') = S_s^{(m_1+m_2)(\gamma_s-1)} \, S_\tau^{n(\gamma_\tau-1)} \, L_{\xi^{m_1} \eta^{m_2} \zeta^n}(x, y, t; s, \tau). \tag{21}$$

In particular, for the specific choice of $\gamma_s = 1$ and $\gamma_\tau = 1$ these partial derivatives will be equal

$$L'_{\xi'^{m_1} \eta'^{m_2} \zeta'^n}(x', y', t'; s', \tau') = L_{\xi^{m_1} \eta^{m_2} \zeta^n}(x, y, t; s, \tau) \tag{22}$$

when computed at matching spatio-temporal scales (19).
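The transformation property of the scale-normalised derivatives can be traced back to the chain rule. The following short derivation is a sketch under the assumption that the scale levels are matched as $s' = S_s^2 s$ and $\tau' = S_\tau^2 \tau$, so that $L'(x', y', t'; s', \tau') = L(x, y, t; s, \tau)$ with $(x', y', t') = (S_s x, S_s y, S_\tau t)$; the intermediate steps are filled in here and are not quoted from the cited references.

```latex
% Step 1: the chain rule for the rescaled coordinates gives
\partial_{x'} = S_s^{-1} \, \partial_x, \qquad
\partial_{y'} = S_s^{-1} \, \partial_y, \qquad
\partial_{t'} = S_\tau^{-1} \, \partial_t ,
% so that the regular partial derivatives are related by
L'_{x'^{m_1} y'^{m_2} t'^{n}}
  = S_s^{-(m_1+m_2)} \, S_\tau^{-n} \, L_{x^{m_1} y^{m_2} t^{n}} .

% Step 2: the scale normalisation factors at the matched scales are
s'^{(m_1+m_2)\gamma_s/2} \, \tau'^{n\gamma_\tau/2}
  = S_s^{(m_1+m_2)\gamma_s} \, S_\tau^{n\gamma_\tau} \,
    s^{(m_1+m_2)\gamma_s/2} \, \tau^{n\gamma_\tau/2} .

% Step 3: multiplying the two relations yields
L'_{\xi'^{m_1} \eta'^{m_2} \zeta'^{n}}
  = S_s^{(m_1+m_2)(\gamma_s-1)} \, S_\tau^{n(\gamma_\tau-1)} \,
    L_{\xi^{m_1} \eta^{m_2} \zeta^{n}} .
```

For $\gamma_s = \gamma_\tau = 1$ both exponents vanish, which is why the scale-normalised derivatives are then equal at matching scales.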
Concerning the dimensionality reduction step, where a large set of scale-normalised partial derivatives $L_{\xi^{m_1} \eta^{m_2} \zeta^n}$ is replaced by a lower-dimensional set of PCA components

$$F_i = \sum_j w_{ij} \, L_{\xi^{m_{1,i,j}} \eta^{m_{2,i,j}} \zeta^{n_{i,j}}} \tag{23}$$

this implies that also the PCA components will be equal at matching spatio-temporal scales (19), although constituting linear combinations of partial derivatives of different spatial orders $(m_{1,i,j}, m_{2,i,j})$ and temporal orders $n_{i,j}$. This property does also extend to scale-normalised spatio-temporal differential invariants computed for $\gamma_s = 1$ and $\gamma_\tau = 1$ as well as to PCA components defined from linear combinations of such scale-normalised spatio-temporal differential invariants.
In these respects, our video descriptors are truly scale covariant under independent scaling transformations of the spatial and the temporal domains.
Corresponding spatio-temporal scale-covariance properties do also hold for video descriptors defined from the non-causal Gaussian spatio-temporal scale-space representation computed based on spatial smoothing with the regular two-dimensional Gaussian kernel and temporal smoothing with the one-dimensional non-causal temporal Gaussian kernel. Then, the scale covariance properties hold for all non-zero spatial scaling constants $S_s$ and all non-zero temporal scaling constants $S_\tau$.