Spatio-Temporal Scale Selection in Video Data

This work presents a theory and methodology for simultaneous detection of local spatial and temporal scales in video data. The underlying idea is that if we process video data by spatio-temporal receptive fields at multiple spatial and temporal scales, we would like to generate hypotheses about the spatial extent and the temporal duration of the underlying spatio-temporal image structures that gave rise to the feature responses. For two types of spatio-temporal scale-space representations, (i) a non-causal Gaussian spatio-temporal scale space for offline analysis of pre-recorded video sequences and (ii) a time-causal and time-recursive spatio-temporal scale space for online analysis of real-time video streams, we express sufficient conditions for spatio-temporal feature detectors in terms of spatio-temporal receptive fields to deliver scale-covariant and scale-invariant feature responses. We present an in-depth theoretical analysis of the scale selection properties of eight types of spatio-temporal interest point detectors in terms of either: (i)–(ii) the spatial Laplacian applied to the first- and second-order temporal derivatives, (iii)–(iv) the determinant of the spatial Hessian applied to the first- and second-order temporal derivatives, (v) the determinant of the spatio-temporal Hessian matrix, (vi) the spatio-temporal Laplacian and (vii)–(viii) the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix. It is shown that seven of these spatio-temporal feature detectors allow for provable scale covariance and scale invariance. Then, we describe a time-causal and time-recursive algorithm for detecting sparse spatio-temporal interest points from video streams and show that it leads to intuitively reasonable results. 
An experimental quantification of the accuracy of the spatio-temporal scale estimates and the amount of temporal delay obtained from these spatio-temporal interest point detectors is given, showing that: (i) the spatial and temporal scale selection properties predicted by the continuous theory are well preserved in the discrete implementation and (ii) the spatial Laplacian or the determinant of the spatial Hessian applied to the first- and second-order temporal derivatives leads to much shorter temporal delays in a time-causal implementation compared to the determinant of the spatio-temporal Hessian or the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix.

A general problem when applying the notion of receptive fields in practice, however, is that the types of responses that are obtained in a specific situation can be strongly dependent on the scale levels at which they are computed. Figure 1 shows receptive field responses over multiple spatial and temporal scales for a video sequence and different types of spatio-temporal features computed from it. Note how qualitatively different types of responses are obtained at different spatio-temporal scales. At some spatio-temporal scales, we get strong responses due to the movements of the paddle or the motion of the paddler in the kayak. At other spatio-temporal scales, we get relatively larger responses because of the movements of the camera, which is not stabilized here. A computer vision system intended to process the visual input from general spatio-temporal scenes therefore needs to decide which responses within the family of spatio-temporal receptive fields over different spatial and temporal scales it should base its analysis on, as well as how the information from different subsets of spatio-temporal scales should be combined.

Fig. 1. Time-causal spatio-temporal scale-space representation $L(x, y, t; s, \tau)$ with its second-order temporal derivative $L_{tt}(x, y, t; s, \tau)$ and the spatial Laplacian of the second-order temporal derivative $\nabla^2_{(x,y)} L_{tt}$ computed from a video sequence in the UCF-101 dataset (Kayaking g01 c01.avi) at 3 × 3 combinations of the spatial scales (bottom row) $\sigma_{s,1} = 2$ pixels, (middle row) $\sigma_{s,2} = 4.6$ pixels and (top row) $\sigma_{s,3} = 10.6$ pixels and the temporal scales (left column) $\sigma_{\tau,1} = 40$ ms, (middle column) $\sigma_{\tau,2} = 160$ ms and (right column) $\sigma_{\tau,3} = 640$ ms, using a logarithmic distribution of the temporal scale levels with distribution parameter $c = 2$. (Image size: 320 × 172 pixels of original 320 × 240 pixels. Frame 90 of 226 frames at 25 frames per second.)
For purely spatial data, the problem of performing spatial scale selection is nowadays rather well understood. Given the spatial Gaussian scale-space concept (Koenderink [6]; Lindeberg [7,8]; Florack [9]; ter Haar Romeny [10]), a general methodology for performing spatial scale selection has been developed based on local extrema over spatial scales of scale-normalized differential entities (Lindeberg [11]). This general methodology has in turn been successfully applied to develop robust methods for image-based matching and recognition that are able to handle large variations of the size of the objects in the image domain.
Much less research has, however, been performed on developing methods for choosing locally appropriate temporal scales for spatio-temporal analysis of video data. While some methods for temporal scale selection have been developed (Lindeberg [12]; Laptev and Lindeberg [13]; Willems et al. [14]), these methods suffer from either theoretical or practical limitations. The subject of this article is to present an extended theory for spatio-temporal scale selection in video data.

Spatio-temporal receptive field model
For processing video data at multiple spatial and temporal scales, we follow the approach with idealized models of spatio-temporal receptive fields as previously derived, proposed and studied in Lindeberg [8,15,16], where we here specifically choose as the temporal smoothing kernel either (i) the non-causal Gaussian kernel or (ii) the time-causal limit kernel (Lindeberg [16, Equation (38)]), which is defined via its Fourier transform. For simplicity, we consider space-time separable receptive fields with image velocity $v = (v_1, v_2) = (0, 0)$ and set the spatial covariance matrix to $\Sigma = I$.
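For orientation, the kernels referred to above can be written out as follows. This is a sketch based on the standard forms in the cited work (Lindeberg [16]) and not a quotation of the displayed equations of this paper; the symbol $h$ for the temporal kernel and $\hat{\Psi}$ for the Fourier transform of the limit kernel are notational assumptions made here.

```latex
% Space-time separable receptive field (v = 0, \Sigma = I):
% a spatial Gaussian multiplied by a temporal smoothing kernel h
T(x_1, x_2, t;\, s, \tau) = g(x_1, x_2;\, s)\, h(t;\, \tau),
\qquad
g(x_1, x_2;\, s) = \frac{1}{2\pi s}\, e^{-(x_1^2 + x_2^2)/(2s)}

% (i) non-causal Gaussian kernel over time
h(t;\, \tau) = g(t;\, \tau) = \frac{1}{\sqrt{2\pi\tau}}\, e^{-t^2/(2\tau)}

% (ii) time-causal limit kernel, specified via its Fourier transform
h(t;\, \tau) = \Psi(t;\, \tau, c),
\qquad
\hat{\Psi}(\omega;\, \tau, c)
  = \prod_{k=1}^{\infty} \frac{1}{1 + i\, c^{-k} \sqrt{c^2 - 1}\, \sqrt{\tau}\, \omega}
```

As a consistency check on the last expression, each factor is a first-order integrator with time constant $\mu_k = c^{-k}\sqrt{c^2-1}\sqrt{\tau}$ and variance $\mu_k^2$, and the variances sum to $(c^2 - 1)\,\tau \sum_{k \geq 1} c^{-2k} = \tau$, so the temporal scale parameter $\tau$ is the variance of the composed kernel.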

Spatio-temporal scale covariance and scale invariance
In the corresponding spatio-temporal scale-space representation

$$L(x_1, x_2, t; s, \tau) = (T(\cdot, \cdot, \cdot; s, \tau) * f(\cdot, \cdot, \cdot))(x_1, x_2, t; s, \tau)$$

we define scale-normalized partial derivatives (Lindeberg [16, Equation (108)]) and consider homogeneous spatio-temporal differential invariants for which the sums of the orders of spatial and temporal differentiation in a certain term,

$$\sum_{j=1}^{J} |\alpha_{ij}| = \sum_{j=1}^{J} (\alpha_{1ij} + \alpha_{2ij}) = M \quad \text{and} \quad \sum_{j=1}^{J} \beta_{ij} = N,$$

do not depend on the index $i$ of that term. Consider next an independent scaling transformation of the spatial and the temporal domains of a video sequence, where $S_s$ and $S_\tau$ denote the spatial and temporal scaling factors, respectively. Then, a homogeneous spatio-temporal derivative expression of the form (7) obeys the scale covariance property (9). This result follows from a combination and generalization of Equation (25) in (Lindeberg [11]) with Equations (10) and (104) in (Lindeberg [17]). With the temporal smoothing performed by the scale-invariant limit kernel according to (3), the temporal scaling transformation property does, however, only hold for temporal scaling transformations that correspond to exact mappings between the discrete temporal scale levels in the time-causal temporal scale-space representation, and thus to temporal scaling factors $S_\tau = c^i$ that are integer powers of the distribution parameter $c$ of the time-causal limit kernel.
Specifically, the scale covariance property (9) implies that if we can detect a spatio-temporal scale level $(\hat{s}, \hat{\tau})$ such that the scale-normalized expression $\mathcal{D}_{\mathrm{norm}} L$ assumes a local extremum at a point $(\hat{x}, \hat{y}, \hat{t}; \hat{s}, \hat{\tau})$ in spatio-temporal scale space, then this local extremum is preserved under independent scaling transformations of the spatial and temporal domains and is transformed in a scale-covariant way (10). The properties (9) and (10) constitute a theoretical foundation for scale-covariant and scale-invariant spatio-temporal feature detection.
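The covariance of a selected scale can be illustrated numerically on a purely spatial toy case. The sketch below assumes a 2-D Gaussian blob of variance `s0`, for which the scale-normalized Laplacian at the blob centre has magnitude $s^{\gamma} / (\pi (s_0 + s)^2)$, and it assumes that a spatial rescaling $x' = S x$ with $f'(x') = f(x)$ turns the blob into one of variance $S^2 s_0$ with amplitude factor $S^2$. The function names and the grid search are illustrative choices, not part of the original method.

```python
import math


def normalized_laplacian_at_centre(s, s0, gamma=1.0, amplitude=1.0):
    """Magnitude of s^gamma * (L_xx + L_yy) at the centre of a 2-D Gaussian
    blob with variance s0, observed at spatial scale s (also a variance)."""
    return amplitude * s**gamma / (math.pi * (s0 + s) ** 2)


def select_scale(s0, gamma=1.0, amplitude=1.0, lo=1e-2, hi=1e4, n=20001):
    """Log-spaced grid search for the scale that maximizes the response.
    Returns (selected scale, response at that scale)."""
    best_s, best_val = None, -1.0
    for i in range(n):
        s = lo * (hi / lo) ** (i / (n - 1))
        val = normalized_laplacian_at_centre(s, s0, gamma, amplitude)
        if val > best_val:
            best_s, best_val = s, val
    return best_s, best_val


def rescaled_blob(s0, S):
    """Blob parameters after x' = S x with f'(x') = f(x):
    variance S^2 * s0 and amplitude factor S^2 (assumption of this sketch)."""
    return S**2 * s0, S**2


s0, S = 4.0, 3.0
s_hat, response = select_scale(s0)
s0_prime, amp = rescaled_blob(s0, S)
s_hat_prime, response_prime = select_scale(s0_prime, amplitude=amp)

# Scale covariance: the selected scale (a variance) is multiplied by S^2.
print(s_hat_prime / s_hat)        # approximately S**2 = 9, up to grid resolution
# Scale invariance for gamma = 1: the magnitude at the extremum is preserved.
print(response_prime / response)  # approximately 1
```

For $\gamma \neq 1$ the selected scale is still covariant in this model, while the magnitude changes by the factor $S^{2(\gamma - 1)}$, which is the kind of power-law factor the covariance property predicts for an expression with $M = 2$ spatial derivatives.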

Scale selection in non-causal Gaussian spatio-temporal scale space
In this section, we will perform a closed-form theoretical analysis of the spatial and temporal scale selection properties that are obtained by detecting simultaneous local extrema over both spatial and temporal scales of different scale-normalized spatio-temporal differential expressions. We will specifically analyze how the spatial and temporal scale estimates $\hat{s}$ and $\hat{\tau}$ are related to the spatial extent $s_0$ and the temporal duration $\tau_0$ of the underlying spatio-temporal image structures.

Differential entities for spatio-temporal interest point detection
Inspired by the way neurons in the lateral geniculate nucleus (LGN) respond to visual input (DeAngelis et al. [2]), which for many LGN cells can be modelled by idealized operations (Lindeberg [15, Equation (108)]), we consider the scale-normalized spatial Laplacian of the first- and second-order temporal derivatives, $\nabla^2_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$ and $\nabla^2_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$.

Inspired by the way the determinant of the spatial Hessian matrix constitutes a better spatial interest point detector than the spatial Laplacian operator (Lindeberg [18]), we consider extensions of the spatial Laplacian of the first- and second-order temporal derivatives into the determinant of the spatial Hessian of the first- and second-order temporal derivatives, $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$ and $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$.

Over three-dimensional joint space-time, we can also define the determinant of the spatio-temporal Hessian, $\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L$, and make an attempt to define a spatio-temporal Laplacian, $\nabla^2_{(x,y,t),\mathrm{norm}} L$, in which we have introduced a parameter $\kappa^2$ to make explicit the arbitrary scaling factor between temporal vs. spatial derivatives that influences any attempt to add derivative expressions of different dimensionality.
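Written out, the six differential entities named above take roughly the following form. This is a sketch under the assumption that each spatial derivative is normalized by a factor $s^{\gamma_s/2}$ and each temporal derivative by a factor $\tau^{\gamma_\tau/2}$, which is the usual convention for scale-normalized derivatives in the cited work; the exact displayed definitions of the original are not reproduced here.

```latex
% Slice-based operators: spatial Laplacian of temporal derivatives
\nabla^2_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}
  = s^{\gamma_s}\, \tau^{\gamma_\tau/2}\, (L_{xxt} + L_{yyt})
\nabla^2_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}
  = s^{\gamma_s}\, \tau^{\gamma_\tau}\, (L_{xxtt} + L_{yytt})

% Slice-based operators: determinant of the spatial Hessian of temporal derivatives
\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}
  = s^{2\gamma_s}\, \tau^{\gamma_\tau}\, (L_{xxt} L_{yyt} - L_{xyt}^2)
\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}
  = s^{2\gamma_s}\, \tau^{2\gamma_\tau}\, (L_{xxtt} L_{yytt} - L_{xytt}^2)

% Genuine 3-D operator: determinant of the spatio-temporal Hessian
\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L
  = s^{2\gamma_s}\, \tau^{\gamma_\tau}\,
    \bigl( L_{xx} L_{yy} L_{tt} + 2 L_{xy} L_{xt} L_{yt}
         - L_{xx} L_{yt}^2 - L_{yy} L_{xt}^2 - L_{tt} L_{xy}^2 \bigr)

% Attempted spatio-temporal Laplacian with relative weight \kappa^2
\nabla^2_{(x,y,t),\mathrm{norm}} L
  = s^{\gamma_s} (L_{xx} + L_{yy}) + \kappa^2\, \tau^{\gamma_\tau} L_{tt}
```

Note that in every expression except the last, all terms carry the same powers of $s$ and $\tau$, so a constant factor $\kappa$ on the temporal derivative only rescales the magnitude. In the spatio-temporal Laplacian the two terms carry different powers, which is why $\kappa^2$ cannot be factored out.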

Scale calibration for a Gaussian blink
Consider a spatio-temporal model signal defined as a Gaussian blink with spatial extent $s_0$ and temporal duration $\tau_0$, for which the spatio-temporal scale-space representation is of the form

$$L(x, y, t; s, \tau) = g(x, y; s_0 + s)\, g(t; \tau_0 + \tau).$$

To calibrate the scale selection properties of the spatio-temporal interest point detectors $\nabla^2_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$, $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$ and $\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L$, we require that at the origin $(x, y, t) = (0, 0, 0)$ the scale-normalized spatio-temporal differential expression should assume its strongest extremum over spatial and temporal scales at the spatio-temporal scale level

$$\hat{s} = s_0, \qquad \hat{\tau} = q^2 \tau_0.$$

Here, we have introduced a parameter $q < 1$ to enforce temporal scale calibration to a finer temporal scale than $\tau_0$, to enable shorter temporal delays and thus faster responses for feature detection performed based on a time-causal spatio-temporal scale-space representation. By calculating the scale-normalized feature response at the origin for each one of the feature detectors, differentiating with respect to the spatial and temporal scale parameters and solving for the zero-crossings over spatial and temporal scales, we find that the spatial and temporal scale-normalization powers $\gamma_s$ and $\gamma_\tau$ in the scale-normalized spatio-temporal derivatives (6) should be set to $\gamma_s = 1$ for the slice-based operators $\nabla^2_{(x,y)} L_{tt}$ and $\det \mathcal{H}_{(x,y)} L_{tt}$, to $\gamma_s = 5/4$ for the genuine 3-D operator $\det \mathcal{H}_{(x,y,t)} L$, and to

$$\gamma_{\tau, \nabla^2_{(x,y)} L_{tt}} = \gamma_{\tau, \det \mathcal{H}_{(x,y)} L_{tt}} = \frac{3 q^2}{2 (q^2 + 1)}, \qquad \gamma_{\tau, \det \mathcal{H}_{(x,y,t)}} = \frac{5 q^2}{2 (q^2 + 1)}.$$
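The temporal calibration can be checked numerically. The sketch below assumes that, at the origin of a Gaussian blink, the temporal dependence of the scale-normalized response is proportional to $\tau^{\gamma_\tau} (\tau_0 + \tau)^{-p}$, with $p = 3/2$ for the slice-based operators applied to $L_{tt}$ and $p = 5/2$ for the determinant of the spatio-temporal Hessian. The ternary search and all function names are illustrative choices.

```python
import math


def gamma_tau_blink(q, three_d=False):
    """Temporal normalization power from the calibration result:
    3 q^2 / (2 (q^2 + 1)) for the slice-based operators on L_tt,
    5 q^2 / (2 (q^2 + 1)) for det H_(x,y,t) L."""
    factor = 2.5 if three_d else 1.5
    return factor * q * q / (q * q + 1.0)


def temporal_signature(tau, tau0, gamma, p):
    """Temporal part of the scale-normalized response at the origin,
    with constant factors dropped (model assumption of this sketch)."""
    return tau**gamma * (tau0 + tau) ** (-p)


def argmax_log_ternary(f, lo, hi, iters=200):
    """Ternary search over log(tau); valid here because the signature
    is unimodal as a function of log(tau)."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if f(math.exp(m1)) < f(math.exp(m2)):
            a = m1
        else:
            b = m2
    return math.exp(0.5 * (a + b))


def selected_temporal_scale(tau0, q, three_d=False):
    """Temporal scale at which the normalized response is maximal."""
    gamma = gamma_tau_blink(q, three_d)
    p = 2.5 if three_d else 1.5
    return argmax_log_ternary(
        lambda tau: temporal_signature(tau, tau0, gamma, p),
        tau0 * 1e-4, tau0 * 1e4)


tau0, q = 1.0, 0.5
print(selected_temporal_scale(tau0, q))                # close to q**2 * tau0
print(selected_temporal_scale(tau0, q, three_d=True))  # close to q**2 * tau0
```

In this model the maximum lies at $\hat{\tau} = \gamma_\tau \tau_0 / (p - \gamma_\tau)$, so inserting the stated values of $\gamma_\tau$ gives $\hat{\tau} = q^2 \tau_0$ for both operator families.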

Scale calibration for a Gaussian onset blob
Consider next a spatio-temporal pattern defined as a Gaussian onset blob with spatial extent $s_0$ and temporal duration $\tau_0$, for which the spatio-temporal scale-space representation is of the form

$$L(x, y, t; s, \tau) = g(x, y; s_0 + s) \int_{u=0}^{t} g(u; \tau_0 + \tau)\, du.$$

To calibrate the scale selection properties of the spatio-temporal interest point detectors $\nabla^2_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$ and $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$, we require that at the origin $(x, y, t) = (0, 0, 0)$ the scale-normalized spatio-temporal differential entity should assume its strongest extremum at the spatio-temporal scale level $(\hat{s}, \hat{\tau}) = (s_0, q^2 \tau_0)$. By again calculating the scale-normalized response at the origin and differentiating with respect to the spatial and temporal scale parameters, we find that the spatial and temporal scale-normalization powers $\gamma_s$ and $\gamma_\tau$ in the scale-normalized spatio-temporal derivatives (6) should for both operators be set to $\gamma_s = 1$ and
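The temporal normalization power for the onset blob can be checked by a short calculation. The sketch below assumes that the calibration target is $\hat{\tau} = q^2 \tau_0$, as for the Gaussian blink, and that a first-order scale-normalized temporal derivative carries the factor $\tau^{\gamma_\tau/2}$. Both are assumptions of this sketch, and the resulting value should be read as a consistency check and not as a quotation of the original result.

```latex
% At t = 0 the temporal derivative of the onset blob is a temporal Gaussian evaluated at zero
L_t(x, y, 0;\, s, \tau) = g(x, y;\, s_0 + s)\; g(0;\, \tau_0 + \tau),
\qquad g(0;\, \tau_0 + \tau) \propto (\tau_0 + \tau)^{-1/2}

% Temporal dependence of the normalized Laplacian response at the origin
\bigl| \nabla^2_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}} \bigr|
  \;\propto\; \tau^{\gamma_\tau/2}\, (\tau_0 + \tau)^{-1/2}

% Stationarity over the temporal scale
\frac{\gamma_\tau}{2\tau} - \frac{1}{2(\tau_0 + \tau)} = 0
\quad\Longrightarrow\quad
\hat{\tau} = \frac{\gamma_\tau}{1 - \gamma_\tau}\, \tau_0

% Requiring \hat{\tau} = q^2 \tau_0
\gamma_\tau = \frac{q^2}{q^2 + 1}
```

The determinant of the spatial Hessian of $L_t$ contains two such temporal factors, so its response is proportional to $\tau^{\gamma_\tau} (\tau_0 + \tau)^{-1}$ and leads to the same stationarity condition. For $q = 1$ the value reduces to $\gamma_\tau = 1/2$.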

Lack of scale covariance for the spatio-temporal Laplacian
For the attempt to define a scale-normalized spatio-temporal Laplacian operator, the equations that determine the spatial and temporal scale estimates are unfortunately hard to solve in closed form for general values of the scale normalization powers $\gamma_s$ and $\gamma_\tau$. For the specific case of $\gamma_s = 1$ and $\gamma_\tau = 1$, we can, however, note that the resulting scale estimate $\hat{s}$ will be explicitly dependent on the relative scaling factor $\kappa^2$ between the derivatives with respect to the temporal vs. the spatial dimensions. This situation is in clear contrast to the other spatio-temporal differential entities treated above, for which a multiplication of the temporal derivative by a factor $\kappa$ does not affect the spatial or temporal scale estimates. This property in turn implies that the spatio-temporal scale estimates will not be covariant under independent relative rescalings of the spatial and temporal dimensions. These theoretical arguments explain why the scale estimates from the spatial and temporal scale selection mechanisms in [13] were later empirically found not to be sufficiently robust.
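The dependence on $\kappa^2$ can be made concrete on a Gaussian blink. The sketch below assumes the blink model $L = g(x, y; s_0 + s)\, g(t; \tau_0 + \tau)$ and the operator $s (L_{xx} + L_{yy}) + \kappa^2 \tau L_{tt}$ with $\gamma_s = \gamma_\tau = 1$, evaluated at the origin, and it locates the strongest response by a plain grid search. The function names, grid size and search range are illustrative assumptions.

```python
import math


def laplacian3d_response(s, tau, s0, tau0, kappa2):
    """Magnitude of s (L_xx + L_yy) + kappa^2 tau L_tt at the centre of a
    Gaussian blink, for gamma_s = gamma_tau = 1 (model assumption)."""
    level = 1.0 / (2.0 * math.pi * (s0 + s)) / math.sqrt(2.0 * math.pi * (tau0 + tau))
    # At the origin: L_xx + L_yy = -2 L / (s0 + s) and L_tt = -L / (tau0 + tau),
    # so both terms have the same sign and the magnitudes add.
    return level * (2.0 * s / (s0 + s) + kappa2 * tau / (tau0 + tau))


def select_scales(s0, tau0, kappa2, n=400, span=1e3):
    """Grid search over log-spaced (s, tau) for the strongest response."""
    s_grid = [s0 / span * span ** (2.0 * i / n) for i in range(n + 1)]
    tau_grid = [tau0 / span * span ** (2.0 * j / n) for j in range(n + 1)]
    best_val, best_s, best_tau = -1.0, None, None
    for s in s_grid:
        for tau in tau_grid:
            val = laplacian3d_response(s, tau, s0, tau0, kappa2)
            if val > best_val:
                best_val, best_s, best_tau = val, s, tau
    return best_s, best_tau


for kappa2 in (1.0, 2.0):
    s_hat, tau_hat = select_scales(1.0, 1.0, kappa2)
    print(kappa2, s_hat, tau_hat)
```

In this model the stationarity conditions can be solved by hand and give $\hat{s} = 2 s_0 / 3$ for $\kappa^2 = 1$ and $\hat{s} = s_0 / 4$ for $\kappa^2 = 2$, so the selected spatial scale changes when only the relative weight of the temporal derivative changes. The grid search reproduces these values up to its resolution.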

Spatio-temporal interest points detected as spatio-temporal scale-space extrema over space-time
In this section, we shall use the above scale-normalized differential entities for detecting spatio-temporal interest points. The overall idea of the most basic form of such an algorithm is to simultaneously detect both spatio-temporal points $(\hat{x}, \hat{y}, \hat{t})$ and spatio-temporal scales $(\hat{s}, \hat{\tau})$ at which the scale-normalized differential entity $(\mathcal{D}_{\mathrm{norm}} L)(x, y, t; s, \tau)$ simultaneously assumes local extrema with respect to both space-time $(x, y, t)$ and spatio-temporal scales $(s, \tau)$.
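A simultaneous extremum test over space-time and scales can be sketched as follows. The sketch assumes that the responses are available through a callable `value` indexed by an integer tuple such as `(x, y, t, s_index, tau_index)`, that the neighbourhood consists of all points differing by at most one step per dimension, and that sub-sample refinement uses a parabola through three samples along one dimension. The names and the neighbourhood choice are illustrative assumptions.

```python
from itertools import product


def is_scale_space_extremum(value, idx, shape, threshold):
    """True if the response at idx is a positive maximum or a negative
    minimum relative to all its nearest neighbours in every dimension."""
    v = value(idx)
    if abs(v) < threshold:
        return False
    # Border points do not have a complete neighbourhood.
    if any(i <= 0 or i >= n - 1 for i, n in zip(idx, shape)):
        return False
    sign = 1.0 if v > 0 else -1.0
    # Visit axis-aligned neighbours first, then diagonal ones.
    offsets = [o for o in product((-1, 0, 1), repeat=len(idx)) if any(o)]
    offsets.sort(key=lambda o: sum(abs(d) for d in o))
    for o in offsets:
        neighbour = tuple(i + d for i, d in zip(idx, o))
        if sign * value(neighbour) >= sign * v:
            return False  # stop at the first neighbour that rules it out
    return True


def parabolic_refinement(v_minus, v_centre, v_plus):
    """Vertex of the parabola through samples at -1, 0 and +1.
    Returns (offset from the centre sample, interpolated extremum value)."""
    curvature = v_minus - 2.0 * v_centre + v_plus
    if curvature == 0.0:
        return 0.0, v_centre
    offset = (v_minus - v_plus) / (2.0 * curvature)
    peak = v_centre - (v_plus - v_minus) ** 2 / (8.0 * curvature)
    return offset, peak


# Toy 5-D response with a single positive maximum at index (2, 2, 2, 2, 2).
shape = (5, 5, 5, 5, 5)
toy = lambda idx: 10.0 - sum((i - 2) ** 2 for i in idx)
print(is_scale_space_extremum(toy, (2, 2, 2, 2, 2), shape, threshold=1.0))
print(is_scale_space_extremum(toy, (1, 2, 2, 2, 2), shape, threshold=1.0))
```

Returning as soon as one neighbour disqualifies the point keeps the cost low, since most points fail after a few comparisons. Whether scale dimensions are interpolated on a linear or logarithmic axis is a design choice that this sketch leaves open.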

Time-causal and time-recursive algorithm for spatio-temporal scale-space extrema detection
By approximating the spatial smoothing operation by convolution with the discrete analogue of the Gaussian kernel over the spatial domain [7], which obeys a semi-group property over spatial scales, and approximating the time-causal limit kernel by a cascade of first-order recursive filters ([16, Equation (56)]), we can state an algorithm for computing the time-causal and time-recursive spatio-temporal scale-space representation and detecting spatio-temporal scale-space extrema of scale-normalized differential invariants from it as follows:

1. Determine a set of temporal scale levels $\tau_k$ and spatial scale levels $s_l$ at which the algorithm is to operate by computing spatio-temporal scale-space representations at these spatio-temporal scales.
2. Compute time constants $\mu_k = (\sqrt{1 + 4 r^2 (\tau_k - \tau_{k-1})} - 1)/2$ according to (Lindeberg [16, Equations (58) and (55)]).
      ii. Initiate the spatio-temporal scale-space representation at the current temporal scale: $L(x, y, t; s_1, \tau_k) = T(x, y; s_1) * B(x, y, k)$.
      iii. Loop over the spatial scale levels $s_l$ in ascending order:
         A. Compute spatio-temporal representations at coarser spatial scales using the semi-group property over spatial scales: $L(\cdot, \cdot, t; s_l, \tau_k) = T(\cdot, \cdot; s_l - s_{l-1}) * L(\cdot, \cdot, t; s_{l-1}, \tau_k)$.
   (b) For all temporal and spatial scales, compute temporal derivatives using backward differences over the short temporal buffers of past frames.
   (c) For all temporal and spatial scales, compute the scale-normalized differential entity $(\mathcal{D}_{\mathrm{norm}} L)(x, y, t; s_l, \tau_k)$ at that spatio-temporal scale.
   (d) For all points and spatio-temporal scales $(x, y; s_l, \tau_k)$ for which the magnitude of the post-normalized differential entity is above a pre-defined threshold, $|(\mathcal{D}_{\mathrm{postnorm}} L)(x, y, t; s_l, \tau_k)| \geq \theta$, determine if the point is either a positive maximum or a negative minimum in comparison to its nearest neighbours over space $(x, y)$, time $t$, spatial scales $s_l$ and temporal scales $\tau_k$. Because the detection of local extrema over time requires a future reference in the temporal direction, this comparison is not done at the most recent frame but at the nearest past frame.
      i. For each detected scale-space extremum, compute more accurate estimates of its spatio-temporal position $(\hat{x}, \hat{y}, \hat{t})$ and spatio-temporal scale $(\hat{s}, \hat{\tau})$ using parabolic interpolation along each dimension [17, Equation (115)]. Also compensate the magnitude estimates by a magnitude correction factor computed for each dimension.
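The temporal part of the algorithm can be sketched for a single pixel. The sketch below assumes that $r$ in the time-constant formula is the frame rate, that $\tau_0 = 0$ precedes the first temporal scale level, that the temporal scale levels are the variances $\tau_k = \sigma_k^2$ with $\sigma_k$ growing by the factor $c$ per level, and that each first-order recursive filter has the normalized update `out += (in - out) / (1 + mu)`. The class and function names are illustrative and not taken from the original.

```python
import math


def temporal_scale_levels(sigma_min, c, num_levels):
    """Temporal scale levels tau_k = sigma_k^2 (in s^2), with standard
    deviations sigma_k = sigma_min * c^k distributed logarithmically."""
    return [(sigma_min * c**k) ** 2 for k in range(num_levels)]


def time_constants(taus, frame_rate):
    """mu_k = (sqrt(1 + 4 r^2 (tau_k - tau_{k-1})) - 1) / 2, with tau_0 = 0."""
    mus, previous = [], 0.0
    for tau in taus:
        delta = frame_rate**2 * (tau - previous)  # variance increment in frames^2
        mus.append((math.sqrt(1.0 + 4.0 * delta) - 1.0) / 2.0)
        previous = tau
    return mus


class TemporalScaleSpace:
    """Cascade of first-order recursive filters; one buffer per scale level,
    so only the current state is stored and no past frames are needed."""

    def __init__(self, mus, first_sample=0.0):
        self.mus = list(mus)
        self.buffers = [first_sample] * len(self.mus)

    def update(self, sample):
        """Feed one new sample and return the smoothed values at all levels."""
        previous = sample
        for k, mu in enumerate(self.mus):
            # Level k is driven by the already updated level k - 1.
            self.buffers[k] += (previous - self.buffers[k]) / (1.0 + mu)
            previous = self.buffers[k]
        return list(self.buffers)


# Seven levels from 40 ms to 2.56 s at 25 frames per second, with c = 2.
taus = temporal_scale_levels(0.04, 2.0, 7)
mus = time_constants(taus, frame_rate=25.0)
scale_space = TemporalScaleSpace(mus)
for frame_value in (0.0, 0.0, 1.0, 1.0, 1.0):
    smoothed = scale_space.update(frame_value)
print([round(math.sqrt(t), 2) for t in taus])
```

Each filter with time constant $\mu$ has a discrete impulse response with variance $\mu^2 + \mu$ in units of squared frames, which is why the time constants follow from solving $\mu_k^2 + \mu_k = r^2 (\tau_k - \tau_{k-1})$. For image data the same update would be applied to every pixel of the buffer $B(x, y, k)$.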
When detecting local extrema with respect to the spatial, temporal and scale dimensions, we order the comparisons with respect to the nearest neighbours in each dimension and stop performing the comparisons at any point in spatio-temporal scale space as soon as it can be stated that a spatio-temporal point $(x, y, t; s, \tau)$ is neither a local maximum nor a local minimum.

Figure 2 shows the result of detecting spatio-temporal interest points in this way from the same video sequence as used for the illustrations in Figure 1. For this experiment, we used 21 spatial scale levels between $\sigma_s = 2$ and 21 pixels and 7 temporal scale levels between $\sigma_\tau = 40$ ms and 2.56 s with distribution parameter $c = 2$ for the time-causal limit kernel. To obtain comparable numbers of features from the different feature detectors, we adapted the thresholds on the scale-normalized differential invariants $\nabla^2_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$, $\nabla^2_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$, $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$, $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$, $\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L$ and $\nabla^2_{(x,y,t),\mathrm{norm}} L$ such that the average number of features from each feature detector was 50 features per frame.

Experimental results
As can be seen from the results, all of these feature detectors respond to regions in the video sequence where there are strong variations in image intensity over space and time. There are, however, also some qualitative differences between the results from the different spatio-temporal interest point detectors. The LGN-inspired feature detectors $\nabla^2_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$ and $\nabla^2_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$ respond both to the motion patterns of the paddler and to the spatio-temporal texture corresponding to the waves on the water surface that lead to temporal flickering effects, and so do the operators $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$ and $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$. The more corner-detector-inspired feature detector $\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L$ responds more to image features where there are simultaneously rapid variations over all the spatial and temporal dimensions.

Covariance and invariance properties
From the spatial scale selection theory in (Lindeberg [11]) and the temporal scale selection theory in (Lindeberg [17]), combined with the scale covariance of the underlying spatio-temporal derivative expressions $\nabla^2_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$, $\nabla^2_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$, $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}$, $\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}$ and $\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L$ described in (Lindeberg [16]), it follows that these spatio-temporal interest point detectors are truly scale covariant under independent scaling transformations of the spatial and the temporal domains, provided that the temporal smoothing is performed by either a non-causal Gaussian kernel $g(t; \tau)$ over the temporal domain or the time-causal limit kernel $\Psi(t; \tau, c)$. From the general proof in Section 3, it follows that the detected spatio-temporal interest points transform in a scale-covariant way under independent scaling transformations of the spatial and the temporal domains.

Summary and discussion
We have presented a theory and a method for performing simultaneous detection of local spatial and temporal scale estimates in video data. The theory comprises both (i) feature detection performed within a non-causal spatio-temporal scale-space representation computed for offline analysis of pre-recorded video data and (ii) feature detection performed from real-time image streams where the future cannot be accessed and memory requirements call for time-recursive algorithms based on only compact buffers of what has occurred in the past.
As a theoretical foundation for spatio-temporal scale selection, we have stated general results regarding covariance and invariance properties of spatio-temporal features defined from video data with independent scaling transformations of the spatial and the temporal domains. For five spatio-temporal differential invariants, (i)-(ii) the spatial Laplacian of the first- and second-order temporal derivatives, (iii)-(iv) the determinant of the spatial Hessian of the first- and second-order temporal derivatives and (v) the determinant of the spatio-temporal Hessian matrix, we have analysed the theoretical scale selection properties of these feature detectors and shown how scale calibration of these feature detectors can be performed to make the spatio-temporal scale estimates reflect the spatial extent and the temporal duration of the underlying spatio-temporal features that gave rise to the feature responses.
For an attempt to define a spatio-temporal Laplacian, we have on the other hand shown that this differential expression is not scale covariant under independent rescalings of the spatial and temporal domains, which explains a previously noted poor robustness of the scale selection step in the spatio-temporal interest point detector based on the spatio-temporal Harris operator [13].
To allow for different trade-offs between the temporal response properties of time-causal spatio-temporal feature detection (shorter temporal delays) in relation to signal detection theory, which would call for detection of image structures at the same spatial and temporal scales as they occur, we have specifically introduced a parameter $q$ to regulate the temporal scale calibration to finer temporal scales $\hat{\tau} = q^2 \tau_0$, as opposed to the more common choice $\hat{s} = s_0$ over the spatial domain. According to a theoretical analysis of scale selection properties in non-causal spatio-temporal scale space, this parameter should reduce the temporal delay by a factor of $q$: $\Delta t \to q\, \Delta t$. An experimental quantification of the scale selection properties in time-causal spatio-temporal scale space in a longer version of this paper confirms that a substantial decrease in temporal delay is obtained. The specific choice of the parameter $q$ should be optimized with respect to the task that the spatio-temporal scale selection and the spatio-temporal features are to be used for, given the specific requirements of the application domain.
We have also presented an explicit algorithm for detecting spatio-temporal interest points in a time-causal and time-recursive context, in which the future cannot be accessed and memory requirements call for only compact buffers to store partial records of what has occurred in the past. We have furthermore presented experimental results of applying this algorithm to real-world video data for the different types of spatio-temporal interest point detectors that we have studied theoretically.