1 Introduction

A basic paradigm for video analysis consists of performing the first layers of visual processing based on successive layers of spatio-temporal receptive fields.

From a mathematical viewpoint, such an approach can be motivated from the fact that the measurement of the image intensity at a single point in space–time does in general not carry any meaningful information, since such a measurement is strongly dependent on external factors, such as the usually unknown illumination of the scene. The relevant information is instead carried by the relative relations between the measurements of image intensities at different points over space and time, which implies that it is natural to perform visual processing of video data based on local neighbourhoods over space and time.

From a biological viewpoint, such an approach can also be motivated from the fact that the first layers of mammalian vision can be modelled in terms of spatio-temporal receptive fields over multiple spatial and temporal scales. Cell recordings from neurones in the primary visual cortex have shown that there are spatio-temporal receptive fields tuned to different sizes and orientations in the image domain, to different integration times over the temporal domain as well as to different image velocities in space–time [12, 13, 32, 33]. Interestingly, the shapes of the spatio-temporal receptive field families that have been measured in biological vision can furthermore be explained by normative theories of visual receptive fields [69, 71, 75, 78], whose axiomatic derivation is based on structural properties of the environment in combination with assumptions about the internal structure of an idealized vision system to ensure a consistent treatment of image representations over multiple spatio-temporal scales.

Based on these or related motivations, a large number of computer vision approaches have been developed in which the first layers of image features are computed based on spatio-temporal receptive field responses [3, 16, 22, 35,36,37, 43, 48, 51, 53, 93, 95, 96, 98, 101,102,103, 108, 116,117,119, 121, 125].

Fig. 1

Time-causal spatio-temporal scale-space representation \(L(x, y, t;\; s, \tau )\) with its first- and second-order temporal derivatives \(L_t(x, y, t;\; s, \tau )\) and \(L_{tt}(x, y, t;\; s, \tau )\) computed from a video sequence in the UCF-101 dataset (Kayaking_g01_c01.avi) at \(3 \times 3\) combinations of the spatial scales (bottom row) \(\sigma _{s,1} = 2~\text{ pixels }\), (middle row) \(\sigma _{s,2} = 4.6~\text{ pixels }\) and (top row) \(\sigma _{s,3} = 10.6~\text{ pixels }\) and the temporal scales (left column) \(\sigma _{\tau ,1} = 40~\text{ ms }\), (middle column) \(\sigma _{\tau ,2} = 160~\text{ ms }\) and (right column) \(\sigma _{\tau ,3} = 640~\text{ ms }\) with the spatial and temporal scale parameters in units of \(\sigma _\mathrm{s} = \sqrt{s}\) and \(\sigma _{\tau } = \sqrt{\tau }\) and using a logarithmic distribution of the temporal scale levels with distribution parameter \(c = 2\) (image size: \(320 \times 172\) pixels of original \(320 \times 240\) pixels; frame 90 of 226 frames at 25 frames/s)

Fig. 2

The spatial Laplacian applied to the first- and second-order temporal derivatives \(\nabla _{(x,y)}^2 L_t\) and \(\nabla _{(x,y)}^2 L_{tt}\) as well as the spatio-temporal Laplacian \(\nabla _{(x, y, t)}^2 L\) computed from a video sequence in the UCF-101 dataset (Kayaking_g01_c01.avi) at \(3 \times 3\) combinations of the spatial scales (bottom row) \(\sigma _{s,1} = 2~\text{ pixels }\), (middle row) \(\sigma _{s,2} = 4.6~\text{ pixels }\) and (top row) \(\sigma _{s,3} = 10.6~\text{ pixels }\) and the temporal scales (left column) \(\sigma _{\tau ,1} = 40~\text{ ms }\), (middle column) \(\sigma _{\tau ,2} = 160~\text{ ms }\) and (right column) \(\sigma _{\tau ,3} = 640~\text{ ms }\) with the spatial and temporal scale parameters in units of \(\sigma _\mathrm{s} = \sqrt{s}\) and \(\sigma _{\tau } = \sqrt{\tau }\) and using a time-causal spatio-temporal scale-space representation with a logarithmic distribution of the temporal scale levels for \(c = 2\) (image size: \(320 \times 172\) pixels of original \(320 \times 240\) pixels; frame 90 of 226 frames at 25 frames/s)

Fig. 3

The determinant of the spatial Hessian applied to the first- and second-order temporal derivatives \(\det \mathcal{H}_{(x,y)} L_t\) and \(\det \mathcal{H}_{(x,y)} L_{tt}\) as well as the determinant of the spatio-temporal Hessian \(\det \mathcal{H}_{(x,y,t)} L\) computed from a video sequence in the UCF-101 dataset (Kayaking_g01_c01.avi) at \(3 \times 3\) combinations of the spatial scales (bottom row) \(\sigma _{s,1} = 2~\text{ pixels }\), (middle row) \(\sigma _{s,2} = 4.6~\text{ pixels }\) and (top row) \(\sigma _{s,3} = 10.6~\text{ pixels }\) and the temporal scales (left column) \(\sigma _{\tau ,1} = 40~\text{ ms }\), (middle column) \(\sigma _{\tau ,2} = 160~\text{ ms }\) and (right column) \(\sigma _{\tau ,3} = 640~\text{ ms }\) with the spatial and temporal scale parameters in units of \(\sigma _\mathrm{s} = \sqrt{s}\) and \(\sigma _{\tau } = \sqrt{\tau }\) and using a time-causal spatio-temporal scale-space representation with a logarithmic distribution of the temporal scale levels for \(c = 2\). The magnitude values of \(\det \mathcal{H}_{(x,y,t)} L\) have been stretched by the monotone function \(\phi (z) = ({\text {sign}} z) \sqrt{|z|}\) (image size: \(320 \times 172\) pixels of original \(320 \times 240\) pixels; frame 90 of 226 frames at 25 frames/s)

Fig. 4

The first- and second-order temporal derivatives of the determinant of the spatial Hessian \(\partial _t(\det \mathcal{H}_{(x,y)} L)\) and \(\partial _{tt}(\det \mathcal{H}_{(x,y)} L)\) computed from a video sequence in the UCF-101 dataset (Kayaking_g01_c01.avi) at \(3 \times 3\) combinations of the spatial scales (bottom row) \(\sigma _{s,1} = 2~\text{ pixels }\), (middle row) \(\sigma _{s,2} = 4.6~\text{ pixels }\) and (top row) \(\sigma _{s,3} = 10.6~\text{ pixels }\) and the temporal scales (left column) \(\sigma _{\tau ,1} = 40~\text{ ms }\), (middle column) \(\sigma _{\tau ,2} = 160~\text{ ms }\) and (right column) \(\sigma _{\tau ,3} = 640~\text{ ms }\) with the spatial and temporal scale parameters in units of \(\sigma _\mathrm{s} = \sqrt{s}\) and \(\sigma _{\tau } = \sqrt{\tau }\) and using a time-causal spatio-temporal scale-space representation with a logarithmic distribution of the temporal scale levels for \(c = 2\). The magnitude values of \(\det \mathcal{H}_{(x,y,t)} L\) have been stretched by the monotone function \(\phi (z) = ({\text {sign}} z) \sqrt{|z|}\) (image size: \(320 \times 172\) pixels of original \(320 \times 240\) pixels; frame 90 of 226 frames at 25 frames/s)

A general problem when applying the notion of receptive fields in practice, however, is that the types of responses that are obtained in a specific situation can be strongly dependent on the scale levels at which they are computed. Figures 1, 2, 3 and 4 illustrate this problem by showing snapshots of spatio-temporal receptive field responses over multiple spatial and temporal scales for a video sequence and for different types of spatio-temporal features computed from it. Note how qualitatively different types of responses are obtained at different spatio-temporal scales. At some spatio-temporal scales, we get strong responses due to the movements of the paddle or the motion of the paddler in the kayak. At other spatio-temporal scales, we get relatively larger responses because of the movements of the here unstabilized camera. The spatio-temporal texture due to the wave patterns on the water surface does also lead to different types of responses at different spatio-temporal scales. A computer vision system intended to process the visual input from general spatio-temporal scenes does therefore need to decide what responses within the family of spatio-temporal receptive fields over different spatial and temporal scales it should base its analysis on as well as how the information from different subsets of spatio-temporal scales should be combined.

For purely spatial data, the problem of performing spatial scale selection is nowadays rather well understood. Given the spatial Gaussian scale-space concept [24, 34, 44, 46, 47, 59, 60, 67, 70, 106, 111, 120, 123], a general methodology for spatial scale selection has been developed based on local extrema over spatial scales of scale-normalized differential entities [62, 64, 65, 72, 73]. This general methodology has in turn been successfully applied to develop robust methods for image-based matching and recognition [5, 41, 52, 68, 74, 84, 86, 87, 89, 90, 112,113,114] that are able to handle large variations of the size of the objects in the image domain and with numerous applications regarding object recognition, object categorization, multi-view geometry, construction of 3-D models from visual input, human–computer interaction, biometrics and robotics. Alternative approaches for spatial scale selection in other problem domains have also been proposed [7, 8, 10, 19, 28, 29, 31, 38,39,40, 54, 55, 66, 82, 83, 85, 91, 92, 105, 109, 115].

Much less research has, however, been performed on developing methods for choosing locally appropriate temporal scales for spatio-temporal analysis of video data. While some methods for temporal scale selection have been developed [49, 63, 122], the earliest methods suffer from either theoretical or practical limitations: the initial work on time-causal temporal scale selection in Lindeberg [63] is primarily developed over the discrete temporal Poisson scale space, which possesses a semi-group property over temporal scales and therefore leads to unnecessarily long temporal delays for reasons explained in Lindeberg [77, Appendix A]. The spatio-temporal scale selection method in Laptev and Lindeberg [49] is based on a spatio-temporal Laplacian operator that is not scale covariant under independent relative scaling transformations of the spatial versus the temporal domains (see Sect. 4.8), which implies that the spatial and temporal scale estimates will not be robust under independent variations of the spatial and temporal scales in video data as arise, for example, when viewing the same scene with two cameras having different sensor characteristics in terms of spatial resolution or temporal frame rate. The spatio-temporal scale selection method for the determinant of the spatio-temporal Hessian in Willems et al. [122] does not make use of the full flexibility of the notion of \(\gamma \)-normalized derivative operators (see Sect. 4.5) and has not been previously developed over a time-causal and time-recursive spatio-temporal domain as is necessary for processing real-time image streams with requirements of short temporal latencies of the feature responses for time-critical applications and complementary requirements about only small compact buffers of past information.

The subject of this article is to develop an extended theory for performing spatio-temporal scale selection in video data, to generate hypotheses about local characteristic spatial and temporal scales in the video data before recognizing the objects or the spatio-temporal events in the scene that the camera is observing. For this domain, we can consider two basic use cases: For offline analysis of pre-recorded video, one may take the liberty of accessing the virtual future in relation to any pre-recorded time moment and make use of symmetric filtering over the temporal domain based on the non-causal Gaussian spatio-temporal scale-space theory [61, 67, 70]. For online analysis of real-time video streams on the other hand, the future cannot be accessed and we will base the analysis on a fully time-causal and time-recursive spatio-temporal scale-space concept for real-time image streams that only requires access to information from the present moment and a very compact buffer of what has occurred in the past [75] and which constitutes an extension of previous temporal scale-space and multi-scale models [23, 27, 45, 81, 110]. Specifically, for performing spatio-temporal feature detection in the latter time-causal scenario, we will build upon a recently developed theory for temporal scale selection in a time-causal scale-space representation [77] and extend that theory to spatio-temporal scale selection for features that are computed based on a time-causal spatio-temporal scale-space representation. The resulting theory that we will arrive at can be seen as an extension of the previously developed spatial scale selection methodology [64, 65, 73] from spatial images to spatio-temporal video and real-time image streams.

We will start by developing our theory for spatio-temporal scale selection with respect to the problem of detecting sparse spatio-temporal interest points [6, 9, 11, 14, 18, 20, 21, 30, 49, 88, 94, 97, 99, 100, 107, 122, 124, 126, 127], which may be regarded as the conceptually simplest problem domain because of the sparsity of spatio-temporal interest points and the close connection between this problem domain and the detection of spatial interest points for which there exists a theoretically well-founded and empirically tested framework regarding scale selection over the spatial domain [1, 4, 5, 15, 17, 25, 42, 65, 72, 74, 84, 89, 90, 112]. Specifically, using a non-causal Gaussian spatio-temporal scale-space model, we will perform a theoretical analysis of the spatio-temporal scale selection properties of eight different types of spatio-temporal interest point detectors and show that seven of them: (i) the spatial Laplacian of the first-order temporal derivative, (ii) the spatial Laplacian of the second-order temporal derivative, (iii) the determinant of the spatial Hessian of the first-order temporal derivative, (iv) the determinant of the spatial Hessian of the second-order temporal derivative, (v) the determinant of the spatio-temporal Hessian matrix, (vi) the first-order temporal derivative of the determinant of the spatial Hessian matrix and (vii) the second-order temporal derivative of the determinant of the spatial Hessian matrix, do all lead to fully scale-covariant spatio-temporal scale estimates and scale-invariant feature responses under independent scaling transformations of the spatial and the temporal domains. For (viii) the spatio-temporal Laplacian, it is on the other hand not possible to achieve scale covariance or scale invariance, which explains the poor robustness of the spatio-temporal interest points computed from the spatio-temporal Harris operator with scale selection based on the spatio-temporal Laplacian [49] on video data in which there are large independent variations in the spatial and temporal scales of the underlying spatio-temporal image structures.

Then, we will show how this theory can be transferred to an implementation based on fully time-causal spatio-temporal receptive fields to enable the detection of spatio-temporal features from real-time image streams in which the future cannot be accessed. Specifically, since any time-causal image measurement at a nonzero temporal scale will be associated with a nonzero temporal delay, we will introduce an additional parameter q to enable scale calibration of the spatio-temporal interest point detectors to deliver a temporal scale estimate at temporal scale \(\hat{\sigma }_{\tau } = q \, \hat{\sigma }_{\tau ,0}\) for \(q \le 1\), as opposed to the choice \(\hat{\sigma }_{s} = \hat{\sigma }_{s,0}\) that is more common over the spatial domain. The purpose of this calibration is to enable shorter temporal delays and therefore the ability to respond faster in time-critical real-time scenarios, motivated by the general observation that the temporal delay can be expected to be proportional to the temporal scale level when expressed in units of the temporal standard deviation of the temporal scale-space kernel.

Whereas the explicit algorithms and experiments in this paper are restricted to spatio-temporal scale selection at sparse interest points over space and time, in a companion paper [76] we develop complementary methods for computing dense maps of spatial and temporal scale estimates in video data based on a structurally similar theory.

1.1 Structure of this Article

As conceptual background to the work, Sect. 2 starts by describing the theoretical model for spatio-temporal receptive fields and the resulting scale-space concepts that we build upon for computing image and video representations over multiple spatial and temporal scales.

When developing a theory for spatio-temporal scale selection, the main questions regarding the foundations concern what properties the scale selection method should possess and how the scale estimates should be computed. In Sect. 3, we show how it is possible to construct a well-founded theory for simultaneous selection of spatial and temporal scales in video data, by detecting local extrema over spatial and temporal scales of appropriately scale-normalized spatio-temporal derivative responses. This theory is generally valid for a large class of homogeneous spatio-temporal differential invariants and beyond the more explicit examples of spatio-temporal feature detectors considered in more detail in later sections. This theory specifically includes a general statement about scale-covariant properties of the resulting spatio-temporal scale estimates, which implies that the scale estimates are guaranteed to adaptively follow variabilities in spatial and temporal scale levels in the data. This theory also comprises scale-invariant properties of the resulting spatio-temporal features and their magnitude strength measures, which imply that similar types of spatio-temporal image features, while at different scales, will be computed, if the data in the video sequence are subject to independent scaling transformations of the spatial and the temporal domains. In these respects, the proposed theory obeys the desirable properties of a spatio-temporal scale selection methodology.

The theory presented so far, does, however, comprise two free parameters, a spatial scale normalization power \(\gamma _\mathrm{s}\) and a temporal scale normalization power \(\gamma _{\tau }\). To understand the behaviour of spatio-temporal feature detectors over multiple scales in more specific situations, Sect. 4 does then show how the scale selection properties of spatio-temporal feature detectors can be analysed by calculating their feature responses at multiple spatio-temporal scales in closed form to determine the scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\).

Specifically, we present an in-depth analysis of the theoretical scale selection properties of eight spatio-temporal derivative expressions that may be considered as candidates for defining spatio-temporal interest point detectors, when applied to idealized model patterns in the form of Gaussian blinks or Gaussian onset blobs of different spatial extent and of different temporal duration. By requiring that the selected spatial and temporal scales should reflect the spatial extent and the temporal duration of the input pattern, we show that seven of these spatio-temporal derivative expressions: (i)–(ii) the spatial Laplacian of the first- and second-order temporal derivatives, (iii)–(iv) the determinant of the spatial Hessian of the first- and second-order temporal derivatives, (v) the determinant of the spatio-temporal Hessian matrix and (vi)–(vii) the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix, can be scale calibrated to reflect the spatial extent and the temporal duration of the underlying spatio-temporal image structures that gave rise to the filter responses. For the eighth expression, an attempt to define a spatio-temporal Laplacian operator, a corresponding scale-invariant scale calibration cannot, however, be done, because of the lack of scale covariance under independent scaling transformations of the spatial and temporal domains. That in turn implies that applying the spatio-temporal Laplacian to video data in which there are unknown spatio-temporal scale variations can be expected to lead to undesirable artefacts.

In Sect. 5, we then present a general algorithm for detecting spatio-temporal interest points from spatio-temporal scale-space extrema of scale-normalized spatio-temporal expressions. Specifically, we present a detailed algorithm for detecting such image features based on a time-causal and time-recursive spatio-temporal scale-space representation. Compared to a corresponding algorithm expressed over a non-causal spatio-temporal scale space, as for the case of using a Gaussian spatio-temporal scale space for analysing pre-recorded video sequences, our time-causal algorithm never accesses information from the future and can therefore be applied in real-time settings on video streams. Additionally, by the time-recursive formulation, the requirements about temporal buffering of past information are much lower and do also imply the need for fewer computations, thus improving the computational efficiency, also if applied in a non-causal setting for analysing pre-recorded video sequences.

As a verification of whether the proposed theory and methods do what they are supposed to do, Sect. 6 presents an experimental quantification of the numerical accuracy of the spatio-temporal scale estimates as well as the amount of temporal delay for the different types of spatio-temporal interest point detectors considered in this work, when applied to idealized spatio-temporal model patterns with ground truth and in the context of a time-causal spatio-temporal scale-space representation. The results do first of all show that the theoretical properties of spatio-temporal feature detectors responding at spatial and temporal scales corresponding to the spatial extent and the temporal duration do with very good approximation transfer to the proposed discrete implementation. Secondly, it is shown that the interest point detectors defined from applying either the spatial Laplacian or the determinant of the spatial Hessian to the first- or second-order temporal derivatives lead to significantly shorter temporal delays compared to the interest point detectors defined from the determinant of the spatio-temporal Hessian or the first- and second-order temporal derivatives of the determinant of the spatial Hessian. For time-critical applications, this implies that the temporal response properties of the first set of spatio-temporal feature detectors will be faster than those of the other set, which in turn allows an autonomous agent to react faster. Finally, Sect. 7 concludes with a summary and discussion.

1.2 Relations to Previous Contributions

This paper constitutes a substantially extended version of a shorter conference paper presented at the SSVM 2017 conference [79] and with substantial additions concerning:

  • the motivations underlying the developments of this theory and the relations to previous work (Sect. 1),

  • more details concerning the underlying spatio-temporal receptive field model (Sect. 2),

  • a more extensive description about the proposed general methodology for spatio-temporal scale selection including: (i) its formulation based on temporal scale normalization by \(L_p\)-normalization of the temporal derivative operators, (ii) the theory for scale-invariant and scale-covariant properties of the resulting spatio-temporal features with their spatio-temporal scale estimates as well as (iii) spatio-temporal scale selection based on spatio-temporal differential invariants expressed in terms of local gauge coordinates that guarantee rotational invariance and which could not be included in the conference paper because of lack of space (Sect. 3),

  • the treatment of two additional spatio-temporal differential invariants, the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix,

  • the detailed theoretical analysis of the scale selection properties of the eight different spatio-temporal differential invariants treated in this paper and showing the explicit derivations of how the spatial and temporal scale normalization \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) should be determined by scale calibration for each feature detector (Sect. 4),

  • more complete details about the composed algorithm for detecting spatio-temporal interest points with spatio-temporal scale selection based on time-causal and time-recursive spatio-temporal receptive fields and including a change of order between the spatial and the temporal smoothing operations that substantially reduces the amount of computations (Sect. 5),

  • an experimental quantification of the accuracy of the scale estimates and the temporal delays for the different types of spatio-temporal feature detectors when applied to idealized spatio-temporal model patterns (Sect. 6) and

  • a detailed description of the corresponding spatial scale-space extrema algorithm on which the spatio-temporal scale-space extrema algorithm is based (“Appendix A”).

In relation to the SSVM 2017 paper, this paper therefore gives a more complete treatment of the subject, including more details about the spatio-temporal scale selection theory, much more complete algorithmic details when applying spatio-temporal scale selection in practice as well as a numerical quantification of the accuracy of the spatio-temporal scale estimates and the temporal response properties (the temporal latencies in a time-causal setting).

2 Spatio-Temporal Receptive Field Model

For processing video data at multiple spatial and temporal scales, we follow the approach with idealized models of spatio-temporal receptive fields of the form

$$\begin{aligned}&T(x_1, x_2, t;\; s, \tau ;\; v, \varSigma )\nonumber \\&\quad = g(x_1 - v_1 t, x_2 - v_2 t;\; s, \varSigma ) \, h(t;\; \tau ) \end{aligned}$$
(1)

as previously derived, proposed and studied in Lindeberg [67, 69, 75, 78], where

  • \(x = (x_1, x_2)^\mathrm{T}\) denotes the image coordinates,

  • t denotes time,

  • s denotes the spatial scale,

  • \(\tau \) denotes the temporal scale,

  • \(v = (v_1, v_2)^\mathrm{T}\) denotes a local image velocity,

  • \(\varSigma \) denotes a spatial covariance matrix determining the spatial shape of a spatial affine Gaussian kernel

    $$\begin{aligned} g(x;\; s, \varSigma ) = \frac{1}{2 \pi s \sqrt{\det \varSigma }} \mathrm{e}^{-x^\mathrm{T} \varSigma ^{-1} x/2s}, \end{aligned}$$
    (2)

  • \(g(x_1 - v_1 t, x_2 - v_2 t;\; s, \varSigma )\) denotes a spatial affine Gaussian kernel that moves with image velocity \(v = (v_1, v_2)\) in space–time and

  • \(h(t;\; \tau )\) is a temporal smoothing kernel over time,

and we specifically here choose as temporal smoothing kernel over time either: (i) the non-causal Gaussian kernel

$$\begin{aligned} h(t;\; \tau ) = g(t;\; \tau ) = \frac{1}{\sqrt{2 \pi \tau }} \mathrm{e}^{-t^2/2\tau } \end{aligned}$$
(3)

or (ii) the time-causal limit kernel [75, Equation (38)]

$$\begin{aligned} h(t;\; \tau ) = \varPsi (t;\; \tau , c) \end{aligned}$$
(4)

defined via its Fourier transform of the form

$$\begin{aligned} \hat{\varPsi }(\omega ;\; \tau , c) = \prod _{k=1}^{\infty } \frac{1}{1 + i \, c^{-k} \sqrt{c^2-1} \sqrt{\tau } \, \omega } \end{aligned}$$
(5)

and corresponding to an infinite cascade of truncated exponential kernels

$$\begin{aligned} h_{\mathrm{exp}}(t;\; \mu _i)= \left\{ \begin{array}{l@{\quad }l} \frac{1}{\mu _i} \mathrm{e}^{-t/\mu _i} &{} t \ge 0 \\ 0 &{} t < 0 \end{array} \right. \end{aligned}$$
(6)

with logarithmically distributed temporal scale levels

$$\begin{aligned} \tau _k = \sum _{i = -\infty }^k \mu _i^2 = c^{2k} \tau _0 \end{aligned}$$
(7)

that cluster infinitely dense near \(\tau \downarrow 0^+\) [75].
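As a hedged illustration of how such a cascade can be approximated in discrete time, the following Python sketch truncates the infinite cascade to K levels, distributes the temporal scale levels logarithmically with ratio \(c^2\) between adjacent levels and couples first-order recursive filters in cascade. The function names, the truncation to K levels and the specific discretization (a first-order recursive filter with time constant \(\mu \), whose impulse response has mean \(\mu \) and variance \(\mu ^2 + \mu \)) are assumptions made for this sketch and should not be read as the implementation used in this work.

```python
import numpy as np


def temporal_scale_levels(tau_max, c=2.0, K=4):
    """Temporal scale levels tau_k = c^(2(k-K)) tau_max for k = 1..K (variances)."""
    k = np.arange(1, K + 1)
    return tau_max * c ** (2.0 * (k - K))


def time_constants(tau_levels):
    """Time constants of the recursive filters between adjacent scale levels.

    Assumption: each discrete first-order recursive filter adds the variance
    mu^2 + mu, so mu solves mu^2 + mu = tau_k - tau_(k-1), with the coarse
    truncation tau_0 = 0 for the finest level.
    """
    dtau = np.diff(np.concatenate(([0.0], tau_levels)))
    return 0.5 * (np.sqrt(1.0 + 4.0 * dtau) - 1.0)


def recursive_cascade(signal, mus):
    """Filter a 1-D signal through first-order recursive filters in cascade."""
    out = np.asarray(signal, dtype=float)
    for mu in mus:
        filtered = np.empty_like(out)
        state = 0.0  # one stored value per level is the only temporal buffer
        for n, value in enumerate(out):
            # out(t) = out(t-1) + (in(t) - out(t-1)) / (1 + mu)
            state += (value - state) / (1.0 + mu)
            filtered[n] = state
        out = filtered
    return out


if __name__ == "__main__":
    taus = temporal_scale_levels(tau_max=16.0, c=2.0, K=4)
    mus = time_constants(taus)
    impulse = np.zeros(400)
    impulse[0] = 1.0
    h = recursive_cascade(impulse, mus)
    n = np.arange(h.size)
    mean = np.sum(n * h)
    print("scale levels:", taus)
    print("temporal delay (mean):", mean)
    print("temporal variance:", np.sum((n - mean) ** 2 * h))
```

Since variances add under convolution, the impulse response of the composed filter has a temporal variance equal to the coarsest scale level, while its temporal mean (one measure of the temporal delay) equals the sum of the time constants. Each level only needs its own previous output, which is what makes the formulation time-recursive.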

Based on this spatio-temporal receptive field model, we define a spatio-temporal scale-space representation of the form [67, 69, 75]

$$\begin{aligned}&L(x_1, x_2, t;\; s, \tau ;\; v, \varSigma ) \nonumber \\&\quad = \left( T(\cdot , \cdot , \cdot ;\; s, \tau ;\; v, \varSigma ) {*} f(\cdot , \cdot , \cdot ) \right) (x_1, x_2, t;\; s, \tau ;\; v, \varSigma ).\nonumber \\ \end{aligned}$$
(8)

When using a one-dimensional Gaussian kernel (3) for smoothing over the temporal domain, we obtain a non-causal Gaussian spatio-temporal scale space. When using the time-causal limit kernel (4) for temporal smoothing, we obtain a time-causal and time-recursive spatio-temporal scale space.

For simplicity, we shall in this treatment henceforth restrict ourselves to space–time separable receptive fields obtained by setting the image velocity to zero \(v = (v_1, v_2)^\mathrm{T} = (0, 0)^\mathrm{T}\) and to receptive fields that are based on rotationally symmetric Gaussian kernels over the spatial domain by setting the spatial covariance matrix to a unit matrix \(\varSigma = I\).
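For the non-causal Gaussian case under these restrictions (\(v = 0\), \(\varSigma = I\)), the convolution in (8) separates into three one-dimensional convolutions, one over time and two over space. The following Python sketch illustrates this under stated assumptions: the video is stored as a NumPy array with axes (t, y, x), the kernels are sampled and truncated Gaussians, and the boundaries are zero-padded. These choices, as well as all function names, are made for illustration only; the scale parameters s and \(\tau \) are variances in units of squared pixels and squared frames.

```python
import numpy as np


def gaussian_kernel_1d(variance, truncate=4.0):
    """Sampled 1-D Gaussian with the given variance, normalized to unit sum."""
    radius = max(1, int(np.ceil(truncate * np.sqrt(variance))))
    u = np.arange(-radius, radius + 1)
    kernel = np.exp(-u**2 / (2.0 * variance))
    return kernel / kernel.sum()


def smooth_along_axis(data, kernel, axis):
    """Convolve every 1-D slice along `axis` with `kernel` (zero-padded borders)."""
    if kernel.size > data.shape[axis]:
        raise ValueError("kernel is longer than the data along this axis")
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, data
    )


def gaussian_scale_space(video, s, tau):
    """Separable smoothing: temporal variance tau, spatial variance s."""
    L = np.asarray(video, dtype=float)
    L = smooth_along_axis(L, gaussian_kernel_1d(tau), axis=0)  # time
    L = smooth_along_axis(L, gaussian_kernel_1d(s), axis=1)    # y
    L = smooth_along_axis(L, gaussian_kernel_1d(s), axis=2)    # x
    return L


if __name__ == "__main__":
    video = np.zeros((41, 33, 33))
    video[20, 16, 16] = 1.0  # a single space-time impulse
    L = gaussian_scale_space(video, s=4.0, tau=9.0)
    print(L.shape, L.sum())
```

Applied to a space-time impulse, the output is the sampled receptive field itself, which gives a simple way of inspecting the effective spatial and temporal extent at a given pair of scale levels.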

Figures 5 and 6 show examples of such space–time separable receptive fields over a 1+1-D space–time, for the main cases when the temporal smoothing is performed using either the non-causal Gaussian kernel or the time-causal limit kernel.

Fig. 5
figure 5

Space–time separable kernels \(T_{x^{m}t^{n}}(x, t;\; s, \tau ) = \partial _{x^m t^n} (g(x;\; s) \, h(t;\; \tau ))\) up to order two obtained as the composition of Gaussian kernels over the spatial domain x and the non-causal Gaussian kernel over the temporal domain (\(s = 1, \tau = 1\)) (horizontal axis: space \(x \in [-3, 3]\); vertical axis: time \(t \in [-3, 3]\))

Fig. 6
figure 6

Space–time separable kernels \(T_{x^{m}t^{n}}(x, t;\; s, \tau ) = \partial _{x^m t^n} (g(x;\; s) \, h(t;\; \tau ))\) up to order two obtained as the composition of Gaussian kernels over the spatial domain x and the time-causal limit kernel over the temporal domain (\(s = 1, \tau = 1, c = 2\)) (horizontal axis: space \(x \in [-3, 3]\); vertical axis: time \(t \in [0, 4]\))
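For the non-causal case, kernels of the type shown in Fig. 5 can be evaluated in closed form, since the derivatives of a Gaussian are polynomials times the Gaussian itself. The sketch below, with hypothetical function names and restricted to derivative orders up to two, evaluates \(\partial _{x^m t^n} (g(x;\; s) \, g(t;\; \tau ))\) on a 1+1-D grid as an outer product of a temporal and a spatial factor. It does not cover the time-causal limit kernel of Fig. 6, for which no correspondingly simple expression is assumed here.

```python
import numpy as np


def gaussian(u, variance):
    """1-D Gaussian g(u; variance) = exp(-u^2 / (2 variance)) / sqrt(2 pi variance)."""
    return np.exp(-u**2 / (2.0 * variance)) / np.sqrt(2.0 * np.pi * variance)


def gaussian_derivative(u, variance, order):
    """Derivative of order 0, 1 or 2 of the 1-D Gaussian with respect to u."""
    g = gaussian(u, variance)
    if order == 0:
        return g
    if order == 1:
        return -u / variance * g
    if order == 2:
        return (u**2 / variance**2 - 1.0 / variance) * g
    raise ValueError("only orders 0, 1 and 2 are implemented")


def separable_kernel(x, t, s, tau, m, n):
    """Space-time separable kernel T_{x^m t^n}; rows index time, columns space."""
    return np.outer(gaussian_derivative(t, tau, n), gaussian_derivative(x, s, m))


if __name__ == "__main__":
    # Same parameter values and axis ranges as stated in the caption of Fig. 5.
    x = np.linspace(-3.0, 3.0, 121)
    t = np.linspace(-3.0, 3.0, 121)
    for m, n in [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]:
        T = separable_kernel(x, t, s=1.0, tau=1.0, m=m, n=n)
        print(m, n, T.shape, float(np.abs(T).max()))
```

The separable form means that each two-dimensional kernel never has to be stored explicitly in an implementation; the spatial and temporal factors can be applied one after the other.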

An alternative model for time-causal temporal smoothing could be to instead use Koenderink’s scale-time kernels [45], which correspond to Gaussian smoothing on a logarithmically transformed temporal domain. For reasons described in detail in Lindeberg [77, Section 2.2], in particular the lack of a known time-recursive formulation for Koenderink’s scale-time kernels, which in turn implies a need for larger temporal buffers and more computational work for the temporal smoothing operation compared to using a time-recursive implementation of the time-causal limit kernel based on a set of recursive filters coupled in cascade [75, Section 6], we use the time-causal limit kernel for modelling the time-causal temporal smoothing operation in this work. As described in Lindeberg [75, Appendix 2], it is also possible to establish an approximate mapping between the parameters of the time-causal limit kernel and Koenderink’s scale-time kernel based on the requirement that the zero-, first- and second-order temporal moments of the kernels in the two families should be equal [75, Equation (161)] and leading to qualitatively similar while not identical temporal receptive fields based on temporal derivatives of the time-causal scale-space kernels from the two families [75, Figure 11].

While yet a third type of ad hoc model for time-causal smoothing could possibly also be formulated based on truncated and time-delayed Gaussian kernels, with the temporal delay determined such that the truncation effects in some sense could be regarded as sufficiently small, we will not develop such an approach here because such a model could be expected to: (i) lead to significantly longer temporal delays and (ii) require significantly larger temporal buffers and more computational work compared to our family of time-causal and time-recursive scale-space kernels. For time-critical applications, where the temporal response properties of the vision system need to be as fast as possible, it should in general be much better to base the temporal processing on an inherently time-causal temporal scale-space concept.

2.1 Scale-Normalized Spatio-Temporal Derivatives

Specifically, a natural way of normalizing the spatio-temporal derivative operators within this space–time separable spatio-temporal scale-space concept

$$\begin{aligned}&L(x_1, x_2, t;\; s, \tau )\nonumber \\&\quad = \left( T(\cdot , \cdot , \cdot ;\; s, \tau ) * f(\cdot , \cdot , \cdot ) \right) (x_1, x_2, t;\; s, \tau ) \end{aligned}$$
(9)

with respect to the spatial and temporal scale parameters is by introducing scale-normalized derivative operators according to Lindeberg [65, 75]

$$\begin{aligned} \partial _{\xi }= & {} \partial _{x,\mathrm{norm}} = s^{\gamma _\mathrm{s} /2} \, \partial _x, \end{aligned}$$
(10)
$$\begin{aligned} \partial _{\eta }= & {} \partial _{y,\mathrm{norm}} = s^{\gamma _\mathrm{s} /2} \, \partial _y, \end{aligned}$$
(11)
$$\begin{aligned} \partial _{\zeta }= & {} \partial _{t,\mathrm{norm}} = \alpha _n(\tau ) \, \partial _t, \end{aligned}$$
(12)

and studying scale-normalized partial derivatives of the form [75, Equation (108)]

$$\begin{aligned} L_{x_1^{m_1} x_2^{m_2} t^n,\mathrm{norm}} = s^{(m_1 + m_2) \gamma _\mathrm{s}/2} \, \alpha _n(\tau ) \, L_{x_1^{m_1} x_2^{m_2} t^n}, \end{aligned}$$
(13)

where the factor \(s^{(m_1 + m_2) \gamma _\mathrm{s}/2}\) transforms the regular spatial partial derivatives to corresponding scale-normalized spatial derivatives with \(\gamma _\mathrm{s}\) denoting the spatial scale normalization parameter [65] and the factor \(\alpha _n(\tau )\) is the scale normalization factor for scale-normalized temporal derivatives determined according to either: (i) variance-based normalization [75, Equation (74)]

$$\begin{aligned} \alpha _n(\tau ) = \tau ^{n \gamma _{\tau }/2} \end{aligned}$$
(14)

or (ii) \(L_p\)-normalization [75, Equation (76)]

$$\begin{aligned} \alpha _n(\tau ) = \frac{\Vert g_{\xi ^n}(\cdot ;\; \tau ) \Vert _p}{\Vert h_{t^n}(\cdot ;\; \tau ) \Vert _p} = \frac{G_{n,\gamma _{\tau }}}{\Vert h_{t^n}(\cdot ;\; \tau ) \Vert _p} \end{aligned}$$
(15)

with \(G_{n,\gamma _{\tau }}\) denoting the \(L_p\)-norm of the non-causal temporal Gaussian derivative kernel for the \(\gamma _{\tau }\)-value for which this \(L_p\)-norm becomes constant over temporal scales (see [75, Equations (80)–(83)]).
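
As a minimal sketch of how the scale normalization factor in (13) could be evaluated, the following Python function covers only the variance-based temporal normalization (14). The function name and argument names are hypothetical and introduced for illustration. The \(L_p\)-normalization (15) is not covered, since it requires the \(L_p\)-norms of the temporal derivative kernels, which are not given in closed form here.

```python
def variance_based_normalization_factor(s, tau, m1, m2, n,
                                        gamma_s=1.0, gamma_tau=1.0):
    """Factor that multiplies L_{x1^m1 x2^m2 t^n} in Eq. (13).

    Uses the variance-based temporal factor alpha_n(tau) = tau**(n*gamma_tau/2)
    from Eq. (14). Hypothetical helper, not part of any library.
    """
    if s <= 0 or tau <= 0:
        raise ValueError("scale parameters s and tau must be positive")
    spatial_factor = s ** ((m1 + m2) * gamma_s / 2.0)
    temporal_factor = tau ** (n * gamma_tau / 2.0)
    return spatial_factor * temporal_factor


# Example: factor for L_{xxtt} with gamma_s = 1 and gamma_tau = 3/4.
factor = variance_based_normalization_factor(4.0, 16.0, 2, 0, 2,
                                             gamma_s=1.0, gamma_tau=0.75)
```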

2.2 Temporal Delays

For the non-causal temporal scale-space concept given by convolution with symmetric temporal Gaussian kernels of the form (3), the temporal delay is always zero. When using time-causal temporal scale-space kernels, there will on the other hand always be a nonzero temporal delay \(\delta \). Unfortunately, because of the lack of a compact closed-form expression for the time-causal limit kernel (4) over the temporal domain, it is non-trivial to derive a compact closed-form expression for its exact temporal delay. Based on a scale-time approximation of the time-causal limit kernel, it is, however, possible to derive the following approximate expression for the temporal maximum of the temporal smoothing kernel [75, Equation (172)]Footnote 1

$$\begin{aligned} \delta \approx \frac{(c+1)^2 \, \sqrt{\tau }}{2 \sqrt{2} \sqrt{(c-1) \, c^3}}. \end{aligned}$$
(16)

From this expression, we can see that the temporal delay \(\delta \) increases linearly with the temporal scale \(\sigma _{\tau } = \sqrt{\tau }\) in units of the standard deviation of the temporal smoothing kernel. Additionally, the temporal delay depends on the distribution parameter c of the time-causal limit kernel in such a way that larger values of \(c > 1\) lead to shorter temporal delays at the cost of a sparser temporal scale sampling.
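
The dependency of the approximate delay (16) on \(\tau \) and c can be illustrated with a short Python sketch. The function name is hypothetical, and the values only reflect the approximation (16), not the exact delay of the time-causal limit kernel.

```python
import math


def approx_temporal_delay(tau, c):
    """Approximate temporal delay according to Eq. (16).

    tau: temporal scale parameter (variance of the temporal kernel), tau > 0.
    c:   distribution parameter of the time-causal limit kernel, c > 1.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    if c <= 1:
        raise ValueError("c must be greater than 1")
    numerator = (c + 1.0) ** 2 * math.sqrt(tau)
    denominator = 2.0 * math.sqrt(2.0) * math.sqrt((c - 1.0) * c ** 3)
    return numerator / denominator


# The delay is linear in sigma_tau = sqrt(tau) and shrinks for larger c.
delay_c2 = approx_temporal_delay(1.0, 2.0)             # 9/8 = 1.125
delay_sqrt2 = approx_temporal_delay(1.0, math.sqrt(2.0))
```

For a fixed temporal scale, comparing `delay_c2` with `delay_sqrt2` shows the trade-off stated above: the larger distribution parameter gives the shorter delay, at the cost of a sparser sampling of the temporal scale levels.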

3 General Spatial-Temporal Scale Selection Methodology

In this section, we will describe a general spatio-temporal scale selection methodology for simultaneous computation of local characteristic spatial and temporal scale estimates from video data, which for appropriate choices of spatio-temporal derivative expressions for feature detection may reflect the spatial extent and the temporal duration of the underlying spatio-temporal image structures that gave rise to the feature responses.

3.1 Homogeneous Spatio-Temporal Differential Expressions

An essential property of the definition of scale-normalized spatio-temporal derivative operators according to (13) is that they will lead to scale-covariant spatio-temporal image features, if the spatial smoothing is performed by a spatial Gaussian kernel (2) and if the temporal smoothing is performed with either a non-causal temporal Gaussian kernel (3) or the time-causal limit kernel (4), provided that the underlying spatio-temporal expression \(\mathcal{D}_{\mathrm{norm}} L\) used for defining the spatio-temporal features is covariant under independent scaling transformations of the spatial and temporal domains.

To express this property compactly, let us introduce multi-index notation for spatio-temporal derivatives

$$\begin{aligned} L_{x^{\alpha } t^{\beta }} = L_{x_1^{\alpha _1} x_2^{\alpha _2} t^{\beta }}, \end{aligned}$$
(17)

where \(x = (x_1, x_2), \alpha = (\alpha _1, \alpha _2)\) and \(|\alpha | = \alpha _1 + \alpha _2\). Then, consider a spatio-temporal differential expression of the form

$$\begin{aligned} \mathcal{D} L = \sum _{i=1}^I \prod _{j=1}^J c_i \, L_{x^{\alpha _{ij}} t^{\beta _{ij}}} = \sum _{i=1}^I \prod _{j=1}^J c_i \, L_{x_1^{\alpha _{1ij}} x_2^{\alpha _{2ij}} t^{\beta _{ij}}}, \end{aligned}$$
(18)

where the sum of the orders of spatial and temporal differentiation in a certain term

$$\begin{aligned}&\sum _{j=1}^J |\alpha _{ij}| = \sum _{j=1}^J \alpha _{1ij} + \alpha _{2ij} = M \end{aligned}$$
(19)
$$\begin{aligned}&\sum _{j=1}^J \beta _{ij} = N \end{aligned}$$
(20)

does not depend on the index i of that term. Such a differential expression is referred to as homogeneous.
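
The homogeneity conditions (19) and (20) can be checked mechanically. The sketch below uses a hypothetical encoding in which a differential expression is a list of terms, each term is a list of factors, and each factor is a tuple `(alpha1, alpha2, beta)` of differentiation orders. The coefficients \(c_i\) are ignored, since they do not affect homogeneity.

```python
def differentiation_orders(term):
    """Return (M_i, N_i): total spatial and temporal orders of one term."""
    spatial_order = sum(a1 + a2 for a1, a2, _ in term)
    temporal_order = sum(b for _, _, b in term)
    return spatial_order, temporal_order


def is_homogeneous(terms):
    """True if all terms share the same (M, N), as required by (19)-(20)."""
    return len({differentiation_orders(term) for term in terms}) == 1


# L_xxtt + L_yytt: spatial Laplacian of the second-order temporal derivative.
laplacian_of_Ltt = [[(2, 0, 2)], [(0, 2, 2)]]
# L_xxtt * L_yytt - L_xytt^2: determinant of the spatial Hessian of L_tt.
det_hessian_of_Ltt = [[(2, 0, 2), (0, 2, 2)], [(1, 1, 2), (1, 1, 2)]]
# L_xx + L_t: mixes different orders, so it is not homogeneous.
mixed_expression = [[(2, 0, 0)], [(0, 0, 1)]]
```

With this encoding, `laplacian_of_Ltt` has \((M, N) = (2, 2)\) and `det_hessian_of_Ltt` has \((M, N) = (4, 4)\), whereas `mixed_expression` fails the test.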

3.2 Transformation Property Under Independent Scaling Transformations of the Spatial and the Temporal Domains

Consider next an independent scaling transformation of the spatial and the temporal domains of a video sequence

$$\begin{aligned} f'\left( x_1', x_2', t'\right) = f(x_1, x_2, t) \end{aligned}$$
(21)

for

$$\begin{aligned} \left( x_1', x_2', t'\right) = (S_\mathrm{s} \, x_1, S_\mathrm{s} \, x_2, S_{\tau } \, t), \end{aligned}$$
(22)

where \(S_\mathrm{s}\) and \(S_{\tau }\) denote the spatial and temporal scaling factors, respectively, and define the space–time separable spatio-temporal scale-space representations L and \(L'\) of f and \(f'\), respectively, according to

$$\begin{aligned}&L(x_1, x_2, t;\; s, \tau ) \nonumber \\&\quad = \left( T(\cdot , \cdot , \cdot ;\; s, \tau ) * f(\cdot , \cdot , \cdot ) \right) (x_1, x_2, t;\; s, \tau ), \end{aligned}$$
(23)
$$\begin{aligned}&L'\left( x'_1, x'_2, t';\; s', \tau '\right) \nonumber \\&\quad = \left( T(\cdot , \cdot , \cdot ;\; s', \tau ') * f'(\cdot , \cdot , \cdot ) \right) \left( x'_1, x'_2, t';\; s', \tau '\right) . \end{aligned}$$
(24)

These spatio-temporal scale-space representations are closed under independent scaling transformations of the spatial and the temporal domains

$$\begin{aligned} L'\left( x'_1, x'_2, t';\; s', \tau '\right) = L(x_1, x_2, t;\; s, \tau ) \end{aligned}$$
(25)

provided that the spatio-temporal scale levels are appropriately matched [67, 75]

$$\begin{aligned} s' = S_\mathrm{s}^2 \, s, \quad \tau ' = S_{\tau }^2 \tau . \end{aligned}$$
(26)

For the non-causal Gaussian spatio-temporal scale space having a continuum of both spatial and temporal scale levels, this closedness relation holds for all spatial scaling factors \(S_\mathrm{s} > 0\) and all temporal scaling factors \(S_{\tau } > 0\). For the time-causal spatio-temporal scale-space representation having a continuum of spatial scale levels, while the temporal scale levels are restricted to be discrete (7), the scaling relation holds for all spatial scaling factors \(S_\mathrm{s} > 0\), whereas the closedness relation under temporal scaling transformations holds only for temporal scaling factors of the form \(S_{\tau } = c^j (j \in {\mathbb {Z}})\) that correspond to exact mappings between the discrete temporal scale levels (7), where \(c > 1\) is the distribution parameter of the time-causal limit kernel (4).
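
The restriction on the temporal scaling factors in the time-causal case can be made concrete with a small sketch. It assumes temporal scale levels of the form \(\tau _k = \tau _0 \, c^{2k}\) and checks whether the matched scale \(\tau ' = S_{\tau }^2 \tau \) from (26) again lands on a level of that form. The function names and the tolerance are hypothetical choices for illustration.

```python
import math


def temporal_scale_levels(tau0, c, count):
    """Discrete temporal scale levels tau_k = tau0 * c**(2*k), k = 0..count-1."""
    return [tau0 * c ** (2 * k) for k in range(count)]


def matched_level_index(tau, s_tau, tau0, c, tol=1e-9):
    """Index k' with tau0 * c**(2*k') == s_tau**2 * tau, or None if no such level.

    An exact match exists when s_tau is an integer power of c and tau is
    itself one of the discrete levels.
    """
    tau_prime = s_tau ** 2 * tau
    k = math.log(tau_prime / tau0) / (2.0 * math.log(c))
    nearest = round(k)
    return nearest if abs(k - nearest) < tol else None


levels = temporal_scale_levels(1.0, 2.0, 4)                 # [1.0, 4.0, 16.0, 64.0]
exact = matched_level_index(levels[1], 2.0, 1.0, 2.0)       # S_tau = c: level 1 -> level 2
no_match = matched_level_index(levels[1], 1.5, 1.0, 2.0)    # S_tau = 1.5: no exact level
```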

Specifically, a homogeneous spatio-temporal derivative expression of the form (18) with the spatio-temporal derivatives \(L_{x_1^{m_1} x_2^{m_2} t^n}\) replaced by scale-normalized spatio-temporal derivatives \(L_{x_1^{m_1} x_2^{m_2} t^n,\mathrm{norm}}\) according to (13) transforms according to

$$\begin{aligned} \mathcal{D'}_{\mathrm{norm}} L' = S_{s}^{M(\gamma _\mathrm{s} - 1)} \, S_{\tau }^{N(\gamma _{\tau } - 1)} \, \mathcal{D}_{\mathrm{norm}} L. \end{aligned}$$
(27)

This result follows from a combination and generalization of Equation (25) in [65], which states that a purely spatial differential expression of the form

$$\begin{aligned} \mathcal{D} L = \sum _{i=1}^I \prod _{j=1}^J c_i \, L_{x^{\alpha _{ij}}} \end{aligned}$$
(28)

when expressed in terms of scale-normalized spatial derivatives transforms according to

$$\begin{aligned} \mathcal{D}'_{\mathrm{norm}} L' = S_{s}^{M(\gamma _\mathrm{s} - 1)} \, \mathcal{D}_{\mathrm{norm}} L \end{aligned}$$
(29)

with Equations (10) and (104) in [77], which state that an nth-order temporal derivative transforms according to

$$\begin{aligned} \partial _{t'^n, {\mathrm{norm}}} L' = S_{\tau }^{n(\gamma _{\tau } - 1)} \, \partial _{t^n, {\mathrm{norm}}} L. \end{aligned}$$
(30)

With the temporal smoothing performed by the scale-invariant limit kernel (4), the temporal scaling transformation property does, however, only hold for temporal scaling transformations that correspond to exact mappings between the discrete temporal scale levels \(\tau _i = \tau _0 \, c^{2 i}\) in the time-causal temporal scale-space representation and thus to temporal scaling factors \(S_{\tau } = c^i\) that are integer powers of the distribution parameter c of the time-causal limit kernel.

The scaling property (27) of homogeneous polynomial spatio-temporal differential invariants also extends to homogeneous rational expressions of spatio-temporal derivatives, i.e., rational expressions formed by ratios of two homogeneous polynomials of the form (18).
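
As a numerical sanity check of (27), consider the scale-normalized spatial Laplacian of the second-order temporal derivative, for which \(M = 2\) and \(N = 2\), applied to a Gaussian blink with peak value C. The sketch below uses the closed-form response at the origin that follows from (45) and (54), and compares the responses for a pattern and its rescaled copy at matched scale levels (26). The function names and test values are hypothetical.

```python
def peak_normalized_blink_response(s, tau, s0, tau0, gamma_s, gamma_tau, C=1.0):
    """Response of s**gamma_s * tau**gamma_tau * (L_xxtt + L_yytt) at the origin
    for a Gaussian blink with peak value C, spatial extent s0, duration tau0."""
    return (2.0 * C * s0 * tau0 ** 0.5 * s ** gamma_s * tau ** gamma_tau
            / ((s + s0) ** 2 * (tau + tau0) ** 1.5))


def predicted_scaling_factor(S_s, S_tau, M, N, gamma_s, gamma_tau):
    """Right-hand-side factor of Eq. (27)."""
    return S_s ** (M * (gamma_s - 1.0)) * S_tau ** (N * (gamma_tau - 1.0))


def observed_scaling_factor(s, tau, s0, tau0, S_s, S_tau, gamma_s, gamma_tau):
    """Ratio between the responses for the rescaled and original patterns,
    with all scale parameters matched according to Eq. (26)."""
    original = peak_normalized_blink_response(s, tau, s0, tau0, gamma_s, gamma_tau)
    rescaled = peak_normalized_blink_response(S_s ** 2 * s, S_tau ** 2 * tau,
                                              S_s ** 2 * s0, S_tau ** 2 * tau0,
                                              gamma_s, gamma_tau)
    return rescaled / original
```

For \(\gamma _\mathrm{s} = \gamma _{\tau } = 1\), both factors reduce to one, in agreement with the statement that the responses are then equal.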

3.3 General Scale-Covariant Property of the Spatio-Temporal Scale Estimates

The scale-covariant property (27) implies that local extrema over spatio-temporal scales are preserved under independent scaling transformations of the spatial and the temporal domains and that local (possibly multi-valued) spatio-temporal scale estimates obtained from local extrema over spatio-temporal scalesFootnote 2

$$\begin{aligned}&\{ (\hat{s}, \hat{\tau }) \}(x, y, t) \nonumber \\&\quad = \hbox {argmaxminlocal}_{s, \tau } \, (\mathcal{D}_{\mathrm{norm}} L)(x, y, t;\; s, \tau ) \end{aligned}$$
(31)

are guaranteed to transform in a scale-covariant way under independent scaling transformations of the spatial and the temporal domains

$$\begin{aligned} \left( \hat{s}', \hat{\tau }'\right) = \left( S_\mathrm{s}^2 \, \hat{s}, S_{\tau }^2 \, \hat{\tau }\right) \end{aligned}$$
(32)

or in units of the standard deviation \((\sigma _\mathrm{s}, \sigma _{\tau }) = (\sqrt{s}, \sqrt{\tau })\) of the spatio-temporal scale-space kernel

$$\begin{aligned} \left( \hat{\sigma }'_\mathrm{s}, \hat{\sigma }'_{\tau }\right) = \left( S_\mathrm{s} \, \hat{\sigma }_\mathrm{s}, S_{\tau } \, \hat{\sigma }_{\tau }\right) \end{aligned}$$
(33)

provided that the spatial positions \((x, y)\) and the temporal moments t are appropriately matched

$$\begin{aligned} \left( x_1', x_2', t'\right) = (S_\mathrm{s} \, x_1, S_\mathrm{s} \, x_2, S_{\tau } \, t). \end{aligned}$$
(34)

Specifically, the scale-covariant property (27) implies that if we can detect a spatio-temporal scale level \((\hat{s}, \hat{\tau })\) such that the scale-normalized expression \(\mathcal{D}_{\mathrm{norm}} L\) assumes a local extremum over both space–time \((x_1, x_2, t)\) and spatio-temporal scales \((s, \tau )\) at some point \((\hat{x}_1, \hat{x}_2, \hat{t};\; \hat{s}, \hat{\tau })\) in spatio-temporal scale space, then this local extremum is preserved under independent scaling transformations of the spatial and temporal domains and is transformed in a scale-covariant way

$$\begin{aligned} (\hat{x}_1, \hat{x}_2, \hat{t};\; \hat{s}, \hat{\tau }) \mapsto \left( S_\mathrm{s} \, \hat{x}_1, S_\mathrm{s} \, \hat{x}_2, S_{\tau } \, \hat{t};\; S_\mathrm{s}^2 \, \hat{s}, S_{\tau }^2 \, \hat{\tau }\right) . \end{aligned}$$
(35)

The properties (27), (32) and (35), which mean that spatio-temporal scale estimates follow local independent spatial and temporal scaling transformations in video data, constitute a theoretical foundation for scale-covariant spatio-temporal scale selection and scale-invariant feature detection.

3.4 General Scale-Covariant and Scale-Invariant Properties of Feature Responses at Local Extrema Over Spatio-Temporal Scales

Additionally, the magnitude of the feature response \((\mathcal{D}_{\mathrm{norm}} L)_{\mathrm{extr}}\) at the spatio-temporal scale-space extremum over spatial and temporal scales will also transform according to a power law

$$\begin{aligned} \left( \mathcal{D}'_{\mathrm{norm}} L'\right) _{\mathrm{extr}} = S_{s}^{M(\gamma _\mathrm{s} - 1)} \, S_{\tau }^{N(\gamma _{\tau } - 1)} \, (\mathcal{D}_{\mathrm{norm}} L)_{\mathrm{extr}}. \end{aligned}$$
(36)

In the special case when the scale normalization powers \(\gamma _\mathrm{s} = 1\) and \(\gamma _{\tau } = 1\), the magnitude responses at the scale-space extrema will be equal.

For reasons that will be explained later in Sect. 4, there are, however, situations where it can be highly motivated to use scale normalization powers not equal to one. Then, the important message is that the magnitude estimates are transformed by a power law and can be compensated for by post-normalization of the magnitude responses that also takes the actual spatio-temporal scale levels into account.

3.5 Spatio-Temporal Scale Selection for Homogeneous Spatio-Temporal Differential Invariants in Terms of Gauge Coordinates

Introduce at every point \((x_1, x_2, t)\) in space–time, local orthonormal gauge coordinate systems \((u, v, t)\) and \((p, q, t)\) oriented such that: (i) the v-direction is parallel to the spatial gradient direction of L and the u-direction is orthogonal in image space with the partial derivative in the u-direction being zero \(L_u = 0\) and (ii) the p- and q-directions are parallel with the eigendirections of the spatial Hessian matrix \(\mathcal{H}_{(x,y)} L\) such that the mixed spatial second-order derivative is zero \(L_{pq} = 0\). Then, consider spatio-temporal differential expressions of the forms

$$\begin{aligned} \mathcal{D} L = \sum _{i=1}^I \prod _{j=1}^J c_i \, L_{u^{\alpha _{1ij}} v^{\alpha _{2ij}} t^{\beta _{ij}}} \end{aligned}$$
(37)

or

$$\begin{aligned} \mathcal{D} L = \sum _{i=1}^I \prod _{j=1}^J c_i \, L_{p^{\alpha _{1ij}} q^{\alpha _{2ij}} t^{\beta _{ij}}} \end{aligned}$$
(38)

that satisfy the homogeneity requirements

$$\begin{aligned}&\sum _{j=1}^J \alpha _{1ij} + \alpha _{2ij} = M \end{aligned}$$
(39)
$$\begin{aligned}&\sum _{j=1}^J \beta _{ij} =N \end{aligned}$$
(40)

for all \(i \in [1, I]\). Then, by the construction from these rotationally invariant gauge coordinates, these spatio-temporal differential expressions are guaranteed to be invariant under global rotations of the spatial domain. Additionally, because of the homogeneity of these expressions in terms of the total orders of spatial and temporal differentiation in each term, simultaneous spatial and temporal scale selection based on corresponding scale-normalized derivatives is guaranteed to lead to scale-covariant scale estimates.

As a consequence, the scale estimates will be guaranteed to be rotationally invariant in the sense that if the spatial domain is globally rotated in image space, then both the spatial and the temporal scale estimates will be rotated in the same way as the spatial image positions. A corresponding rotational invariance property of the spatio-temporal scale estimates does also hold for other types of spatio-temporal differential expressions of the form (18) that are additionally rotationally invariant.
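
The rotational invariance obtained from the \((p, q)\) gauge can be illustrated at a single point with a small sketch. It assumes hypothetical values of the second-order spatial derivatives, rotates the corresponding \(2 \times 2\) spatial Hessian, and compares the derivatives \(L_{pp}\) and \(L_{qq}\) along the eigendirections before and after the rotation. The function names are introduced for illustration only.

```python
import math


def rotate_hessian(lxx, lxy, lyy, angle):
    """Entries of R H R^T for the rotation matrix R = [[c, -s], [s, c]]."""
    c, s = math.cos(angle), math.sin(angle)
    rxx = c * c * lxx - 2.0 * c * s * lxy + s * s * lyy
    rxy = c * s * (lxx - lyy) + (c * c - s * s) * lxy
    ryy = s * s * lxx + 2.0 * c * s * lxy + c * c * lyy
    return rxx, rxy, ryy


def principal_second_derivatives(lxx, lxy, lyy):
    """Eigenvalues of the spatial Hessian, i.e. (L_pp, L_qq) with L_pq = 0,
    returned in descending order."""
    mean = 0.5 * (lxx + lyy)
    radius = math.hypot(0.5 * (lxx - lyy), lxy)
    return mean + radius, mean - radius


# Hypothetical derivative values at one point and an arbitrary rotation angle.
original = principal_second_derivatives(2.0, 0.5, -1.0)
rotated = principal_second_derivatives(*rotate_hessian(2.0, 0.5, -1.0, 0.7))
```

Since `original` and `rotated` agree up to rounding errors, any expression formed from \(L_{pp}\) and \(L_{qq}\), such as the Laplacian \(L_{pp} + L_{qq}\) or the Hessian determinant \(L_{pp} L_{qq}\), takes the same value in the rotated frame.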

What remains in this theory is to choose appropriate scale-normalized spatio-temporal derivative expressions \(\mathcal{D}_{\mathrm{norm}} L\) for different visual tasks and to tune the scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) to additional complementary requirements. In the next section, we will perform a detailed study of this for eight different spatio-temporal differential invariants with respect to the task of detecting spatio-temporal interest points.

4 Spatio-Temporal Scale Selection in Non-Causal Gaussian Spatio-Temporal Scale Space

In this section, we will perform a closed-form theoretical analysis of the spatial and the temporal scale selection properties that are obtained by detecting simultaneous local extrema over both spatial and temporal scales of different scale-normalized spatio-temporal differential expressions. We will specifically analyse: (i) how the spatial and temporal scale estimates \(\hat{s}\) and \(\hat{\tau }\) are related to the spatial extent \(s_0\) and the temporal duration \(\tau _0\) for different types of spatio-temporal model signals for which closed-form theoretical analysis is possible and (ii) how the resulting scale-normalized magnitude responses of the different differential entities at the selected spatio-temporal scales depend upon the spatial extent \(s_0\) and the temporal duration \(\tau _0\) of the underlying image structures as well as upon a complementary parameter q introduced to enable detection of spatio-temporal image features at finer temporal scales than at the temporal scales at which they occur, to in turn enable shorter temporal delays when computing image features based on a time-causal spatio-temporal scale-space concept.

A main goal is to perform scale calibration, to determine suitable values of the spatial and temporal scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) for different types of spatio-temporal feature detectors, in such a way that the selected spatial and temporal scale levels reflect the spatial extent and the temporal duration of the original spatio-temporal image structures that gave rise to the feature response. The methodology we shall follow is to calculate scale-space representations in closed form for Gaussian-based spatio-temporal image patterns for which the non-causal spatio-temporal scale-space representation can be obtained from the semi-group property of the Gaussian kernel. Then, given that explicit expressions can be calculated for the scale-normalized spatio-temporal derivatives, we will solve for the local extrema of the spatio-temporal differential invariant \(\mathcal{D}_{\mathrm{norm}} L\) over spatio-temporal scales, to define equations that determine the scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) from the constraints that the spatio-temporal scale estimates should obey \(\hat{s} = s_0\) and \(\hat{\tau } = q^2 \, \tau _0\).

The spatial assumption \(\hat{s} = s_0\) is similar to the method for scale calibration in the spatial scale selection methodology [64, 65, 72] and corresponds to detecting the image structures at the same scales as they appear, which should be optimal with regard to signal detection theory. Regarding the temporal assumption \(\hat{\tau } = q^2 \, \tau _0\), we do, however, also introduce a parameter \(q < 1\) to enforce temporal scale selection at finer temporal scales, to enable shorter temporal delays of the feature responses. As previously described in Sect. 2.2, for the time-causal scale-space representation the temporal delay can be expected to be proportional to the temporal scale in units of the standard deviation of the temporal smoothing kernel \(\delta \sim \sigma _{\tau } = \sqrt{\tau }\). A first-order prediction is therefore that a value of \(q < 1\) can be expected to reduce the temporal delay by the order of a corresponding factor, to enable an autonomous agent using these features as input to respond faster in a time-critical real-time situation.

4.1 The Spatial Laplacian of the Second-Order Temporal Derivative

Inspired by the way neurones in the lateral geniculate nucleus (LGN) respond to visual input [12, 13], which for many LGN cells can be modelled by idealized operations of the form [69, Equation (108)]

$$\begin{aligned} h_{\mathrm{LGN}}(x, y, t;\; s, \tau )= \pm (\partial _{xx} + \partial _{yy}) \, g(x, y;\; s) \, \partial _{t^n} \, h(t;\; \tau ), \end{aligned}$$
(41)

let us for general values of the spatial and temporal scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) study the scale-normalized spatial Laplacian of the second-order temporal derivative defined according to

$$\begin{aligned} \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}= & {} s^{\gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }} \, \nabla _{(x,y)}^2 L_{tt} \nonumber \\= & {} s^{\gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }} \, \left( L_{xxtt} + L_{yytt} \right) , \end{aligned}$$
(42)

which in turn can be seen as an idealized functional model of a so-called “lagged” LGN neurone (compare with [69, Figure 24, right column]). This operator can be expected to give a strong response when both the spatial Laplacian and the second-order temporal derivative give strong responses, e.g., for blinking blobs.

Consider a spatio-temporal image pattern defined as a Gaussian blink with spatial extent \(s_0\) and temporal duration \(\tau _0\):

$$\begin{aligned} f(x, y, t)= & {} g(x, y;\; s_0) \, g(t;\; \tau _0) \nonumber \\= & {} \frac{1}{(2 \pi )^{3/2} s_0 \sqrt{\tau _0}} \, \mathrm{e}^{-(x^2+y^2)/2s_0} \, \mathrm{e}^{-t^2/2\tau _0}. \end{aligned}$$
(43)

By spatial smoothing with the two-dimensional spatial Gaussian kernel and temporal smoothing with the non-causal one-dimensional Gaussian kernel, the resulting spatio-temporal scale-space representation will be of the form

$$\begin{aligned} L(x, y, t;\; s, \tau ) = g(x, y;\; s_0+s) \, g(t;\; \tau _0+\tau ), \end{aligned}$$
(44)

for which the scale-normalized Laplacian of the second-order temporal derivative at the origin \((x, y, t) = (0, 0, 0)\) is given by

$$\begin{aligned} \left. \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}} \right| _{(0,0,0)} = \frac{s^{\gamma _\mathrm{s}} \tau ^{\gamma _{\tau }}}{\sqrt{2} \pi ^{3/2} (s+s_0)^2 (\tau +\tau _0)^{3/2}}. \end{aligned}$$
(45)

Differentiating this expression with respect to the spatial scale parameter s and the temporal scale parameter \(\tau \) and setting the derivative to zero implies that the local extremum over spatial and temporal scales is given by

$$\begin{aligned} \hat{s}= & {} \frac{\gamma _\mathrm{s} s_0}{2 - \gamma _\mathrm{s}}, \end{aligned}$$
(46)
$$\begin{aligned} \hat{\tau }= & {} \frac{2 \gamma _{\tau } \tau _0}{3 - 2 \gamma _{\tau }}. \end{aligned}$$
(47)
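
The closed-form scale estimates (46) and (47) can be cross-checked numerically. The following sketch evaluates the response (45) and locates its maximum over each scale parameter by a ternary search on a logarithmic scale axis, which is a search strategy chosen here for simplicity and relies on the response being unimodal in each of \(\log s\) and \(\log \tau \) for \(0< \gamma _\mathrm{s} < 2\) and \(0< \gamma _{\tau } < 3/2\). All function names, search bounds and parameter values are hypothetical.

```python
import math


def blink_response_at_origin(s, tau, s0, tau0, gamma_s, gamma_tau):
    """Scale-normalized response at the origin according to Eq. (45)."""
    return (s ** gamma_s * tau ** gamma_tau
            / (math.sqrt(2.0) * math.pi ** 1.5 * (s + s0) ** 2 * (tau + tau0) ** 1.5))


def closed_form_scale_estimates(s0, tau0, gamma_s, gamma_tau):
    """Scale estimates according to Eqs. (46) and (47)."""
    s_hat = gamma_s * s0 / (2.0 - gamma_s)
    tau_hat = 2.0 * gamma_tau * tau0 / (3.0 - 2.0 * gamma_tau)
    return s_hat, tau_hat


def argmax_log(f, lo=1e-4, hi=1e4, iterations=200):
    """Ternary search for the maximum of a unimodal function over [lo, hi],
    performed in the logarithm of the argument."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iterations):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if f(math.exp(m1)) < f(math.exp(m2)):
            a = m1
        else:
            b = m2
    return math.exp(0.5 * (a + b))


s0, tau0, gamma_s, gamma_tau = 3.0, 5.0, 0.8, 0.6
# The response is separable in s and tau, so each maximum can be found
# with the other scale parameter held fixed at an arbitrary value.
s_num = argmax_log(lambda s: blink_response_at_origin(s, 1.0, s0, tau0, gamma_s, gamma_tau))
tau_num = argmax_log(lambda tau: blink_response_at_origin(1.0, tau, s0, tau0, gamma_s, gamma_tau))
s_hat, tau_hat = closed_form_scale_estimates(s0, tau0, gamma_s, gamma_tau)
```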

If we require the spatial and temporal scale estimates to reflect the spatial and temporal extent of the Gaussian blink such that

$$\begin{aligned} \hat{s}&= s_0, \end{aligned}$$
(48)
$$\begin{aligned} \hat{\tau }&= q^2 \tau _0, \end{aligned}$$
(49)

then this implies that we should calibrate the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) according to

$$\begin{aligned} \gamma _\mathrm{s}&= 1, \end{aligned}$$
(50)
$$\begin{aligned} \gamma _{\tau }&= \frac{3 q^2}{2 (q^2+1)}, \end{aligned}$$
(51)

where specifically the choice of \(q = 1\) corresponds to \(\gamma _{\tau } = 3/4\). For these values of \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\), the scale-normalized magnitude expression at the extremum over spatial and temporal scales will be given by

$$\begin{aligned}&\left. \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}} \right| _{(x, y, t)=(0,0,0),s=\hat{s},\tau =\hat{\tau }} \nonumber \\&\quad = \frac{\left( q^2 \tau _0\right) ^{\frac{3 q^2}{2 \left( q^2+1\right) }}}{4 \sqrt{2} \pi ^{3/2} s_0 \left( \left( q^2+1\right) \tau _0\right) ^{3/2}}, \end{aligned}$$
(52)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}} \right| _{(x, y, t)=(0,0,0),s=\hat{s},\tau =\hat{\tau }} = \frac{1}{16 \pi ^{3/2} s_0 \tau _0^{3/4}}. \end{aligned}$$
(53)

If we additionally renormalize the original Gaussian blink to having maximum value equal to C

$$\begin{aligned} f(x, y, t)&= C \, (2 \pi )^{3/2} s_0 \sqrt{\tau _0} \, g(x, y;\; s_0) \, g(t;\; \tau _0)\nonumber \\&= C \, \mathrm{e}^{-(x^2+y^2)/2s_0} \, \mathrm{e}^{-t^2/2\tau _0}, \end{aligned}$$
(54)

then the magnitude value at the extremum over spatio-temporal scales will instead be given by

$$\begin{aligned}&\left. \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}} \right| _{(x, y, t)=(0,0,0),s=\hat{s},\tau =\hat{\tau }} \nonumber \\&\quad = \frac{C \sqrt{\tau _0} \left( q^2 \tau _0\right) ^{\frac{3 q^2}{2 \left( q^2+1\right) }}}{2 \left( \left( q^2+1\right) \tau _0\right) ^{3/2}}, \end{aligned}$$
(55)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}} \right| _{(x, y, t)=(0,0,0),s=\hat{s},\tau =\hat{\tau }} = \frac{C}{4 \sqrt{2} \tau _0^{1/4}} \end{aligned}$$
(56)

and implying that if we want to compare responses between different spatio-temporal scale levels, we should consider the following post-normalized magnitude measure defined to achieve scale-invariant magnitude responses over both spatial and temporal scales

$$\begin{aligned}&\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{postnorm}}\nonumber \\&\quad \quad = \tau ^{\frac{2-q^2}{2(q^2+1)}} \left. \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}} \right| _{\gamma _\mathrm{s}=1,\gamma _{\tau }=\frac{3 q^2}{2 (q^2+1)}} \nonumber \\&\quad \quad = s \tau \left( L_{xxtt} + L_{yytt} \right) . \end{aligned}$$
(57)
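
The calibration (51) and the post-normalization (57) can be checked together for the peak-normalized Gaussian blink (54). The sketch below, with hypothetical function names, evaluates the post-normalized response at the selected scales \((\hat{s}, \hat{\tau }) = (s_0, q^2 \tau _0)\). From (55) and (57), this value should reduce to \(C q^2 / (2 (q^2+1)^{3/2})\), independently of \(s_0\) and \(\tau _0\).

```python
def gamma_tau_for_blink(q):
    """Temporal scale normalization power according to Eq. (51)."""
    return 3.0 * q * q / (2.0 * (q * q + 1.0))


def selected_temporal_scale(tau0, gamma_tau):
    """Temporal scale estimate according to Eq. (47)."""
    return 2.0 * gamma_tau * tau0 / (3.0 - 2.0 * gamma_tau)


def normalized_response(s, tau, s0, tau0, gamma_tau, C=1.0):
    """Response at the origin for the peak-normalized blink (54), gamma_s = 1."""
    return (2.0 * C * s0 * tau0 ** 0.5 * s * tau ** gamma_tau
            / ((s + s0) ** 2 * (tau + tau0) ** 1.5))


def post_normalized_response(s, tau, s0, tau0, q, C=1.0):
    """Post-normalized magnitude measure according to Eq. (57)."""
    exponent = (2.0 - q * q) / (2.0 * (q * q + 1.0))
    return tau ** exponent * normalized_response(s, tau, s0, tau0,
                                                 gamma_tau_for_blink(q), C)


def response_at_selected_scales(s0, tau0, q, C=1.0):
    """Post-normalized response at (s_hat, tau_hat) = (s0, q**2 * tau0)."""
    tau_hat = selected_temporal_scale(tau0, gamma_tau_for_blink(q))
    return post_normalized_response(s0, tau_hat, s0, tau0, q, C)
```

Evaluating `response_at_selected_scales` for blinks of different spatial extent and temporal duration gives the same value for a fixed q, which is the property that makes the post-normalized measure suitable for comparing responses between different spatio-temporal scale levels.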

4.2 The Spatial Laplacian of the First-Order Temporal Derivative

For the spatial Laplacian of the first-order temporal derivative, the corresponding scale-normalized expression is for general values of the spatial and temporal scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) given by

$$\begin{aligned} \nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}&= s^{\gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }/2} \, \nabla _{(x,y)}^2 L_t \nonumber \\&= s^{\gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }/2} \, \left( L_{xxt} + L_{yyt} \right) , \end{aligned}$$
(58)

which can be seen as an idealized functional model of a so-called “non-lagged” LGN neurone (compare with [69, Figure 24, left column]). This operator can be expected to give a strong response when both the spatial Laplacian and the first-order temporal derivative give strong responses, e.g., for onset and offset blobs.

Consider a spatio-temporal image pattern defined as a Gaussian onset blob with spatial extent \(s_0\) and temporal duration \(\tau _0\):

$$\begin{aligned}&f(x, y, t)\nonumber \\&\quad = g(x, y;\; s_0) \int _{u = 0}^t g(u;\; \tau _0) \, \mathrm{d}u \nonumber \\&\quad = \frac{1}{(2 \pi )^{3/2} s_0 \sqrt{\tau _0}} \, \mathrm{e}^{-(x^2+y^2)/2s_0} \int _{u = 0}^t \mathrm{e}^{-u^2/2\tau _0} \, \mathrm{d}u. \end{aligned}$$
(59)

By spatial smoothing with the two-dimensional spatial Gaussian kernel and temporal smoothing with the non-causal one-dimensional Gaussian kernel, the resulting spatio-temporal scale-space representation will be of the form

$$\begin{aligned} L(x, y, t;\; s, \tau ) = g(x, y;\; s_0+s) \, \int _{u = 0}^t g(u;\; \tau _0+\tau ) \, \mathrm{d}u, \end{aligned}$$
(60)

for which the scale-normalized spatial Laplacian of the first-order temporal derivative at the origin \((x, y, t) = (0, 0, 0)\) is given by

$$\begin{aligned} \left. \nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}} \right| _{(0,0,0)} = -\frac{s^{\gamma _\mathrm{s}} \tau ^{\gamma _{\tau }/2}}{\sqrt{2} \pi ^{3/2} (s_0+s)^2 \sqrt{\tau _0 +\tau }}. \end{aligned}$$
(61)

Differentiating this expression with respect to the spatial scale parameter s and the temporal scale parameter \(\tau \) and setting the derivative to zero implies that the local extremum over spatial and temporal scales is given by

$$\begin{aligned} \hat{s}&= \frac{\gamma _\mathrm{s} s_0}{2 - \gamma _\mathrm{s}},\end{aligned}$$
(62)
$$\begin{aligned} \hat{\tau }&= \frac{\gamma _{\tau } \tau _0}{1 - \gamma _{\tau }}. \end{aligned}$$
(63)
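
As a brief sketch of the intermediate step, assuming \(0< \gamma _\mathrm{s} < 2\) and \(0< \gamma _{\tau } < 1\) so that the stationary points are positive and finite, the logarithm of the magnitude of (61) separates into one part that depends only on s and one part that depends only on \(\tau \):

$$\begin{aligned} \log \left| \nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}} \right| = \gamma _\mathrm{s} \log s - 2 \log (s_0+s) + \frac{\gamma _{\tau }}{2} \log \tau - \frac{1}{2} \log (\tau _0+\tau ) + \mathrm{const}. \end{aligned}$$

Setting the partial derivatives with respect to s and \(\tau \) to zero gives

$$\begin{aligned} \frac{\gamma _\mathrm{s}}{s} - \frac{2}{s_0+s} = 0, \quad \frac{\gamma _{\tau }}{2 \tau } - \frac{1}{2 (\tau _0+\tau )} = 0, \end{aligned}$$

that is \(\gamma _\mathrm{s} (s_0 + s) = 2 s\) and \(\gamma _{\tau } (\tau _0 + \tau ) = \tau \), whose solutions are (62) and (63).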

Requiring the spatial and temporal scale estimates to reflect the spatial and temporal extent of the Gaussian onset blob according to

$$\begin{aligned} \hat{s}&= s_0, \end{aligned}$$
(64)
$$\begin{aligned} \hat{\tau }&= q^2 \tau _0, \end{aligned}$$
(65)

implies that we should calibrate the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) according to

$$\begin{aligned} \gamma _\mathrm{s}&= 1, \end{aligned}$$
(66)
$$\begin{aligned} \gamma _{\tau }&= \frac{q^2}{q^2+1}, \end{aligned}$$
(67)

where specifically the choice \(q = 1\) corresponds to \(\gamma _{\tau } = 1/2\). For these values of \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\), the scale-normalized magnitude expression at the extremum over spatial and temporal scales will be given by

$$\begin{aligned}&\left. \nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}} \right| _{(x, y, t)=(0,0,0),s=\hat{s},\tau =\hat{\tau }}\nonumber \\&\quad = -\frac{\left( q^2 \tau _0\right) ^{\frac{q^2}{2 q^2+2}}}{4 \sqrt{2} \pi ^{3/2} s_0 \sqrt{\left( q^2+1\right) \tau _0}}, \end{aligned}$$
(68)

where specifically the case \(q = 1\) corresponds to

$$\begin{aligned}&\left. \nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}} \right| _{(x, y, t)=(0,0,0),s=\hat{s},\tau =\hat{\tau }}\nonumber \\&\quad = -\frac{1}{8 \pi ^{3/2} s_0 \root 4 \of {\tau _0}}. \end{aligned}$$
(69)

If we additionally renormalize the original Gaussian onset blob to having maximum value equal to C

$$\begin{aligned} f(x, y, t)&= 2 \pi \, C \, s_0 \, g(x, y;\; s_0) \int _{u = 0}^t g(u;\; \tau _0) \, \mathrm{d}u \nonumber \\&= \frac{C}{\sqrt{2 \pi \tau _0}} \, \mathrm{e}^{-(x^2+y^2)/2s_0} \, \int _{u = 0}^t \mathrm{e}^{-u^2/2\tau _0} \, \mathrm{d}u, \end{aligned}$$
(70)

then the magnitude value at the extremum over spatio-temporal scales will instead be given by

$$\begin{aligned}&\left. \nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}} \right| _{(x, y, t)=(0,0,0),s=\hat{s},\tau =\hat{\tau }} \nonumber \\&\quad = -\frac{C \, \left( q^2 \tau _0\right) ^{\frac{q^2}{2 q^2+2}}}{2 \sqrt{2 \pi } \sqrt{\left( q^2+1\right) \tau _0}}, \end{aligned}$$
(71)

where specifically the case \(q = 1\) corresponds to

$$\begin{aligned} \left. \nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}} \right| _{(x, y, t)=(0,0,0),s=\hat{s},\tau =\hat{\tau }} = -\frac{C}{4 \sqrt{\pi } \root 4 \of {\tau _0}} \end{aligned}$$
(72)

and implying that if we want to compare responses between different spatio-temporal scale levels, we should consider the following post-normalized magnitude measure to achieve scale-invariant magnitude responses over both spatial and temporal scales

$$\begin{aligned}&\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{postnorm}}\nonumber \\&\quad \quad = \tau ^{\frac{1}{2(q^2+1)}}\left. \nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}} \right| _{\gamma _\mathrm{s}=1,\gamma _{\tau }=\frac{q^2}{q^2+1}} \nonumber \\&\quad \quad = s \sqrt{\tau } \left( L_{xxt} + L_{yyt} \right) . \end{aligned}$$
(73)

4.3 The Determinant of the Spatial Hessian Matrix Applied to the Second-Order Temporal Derivative

Inspired by the way the determinant of the spatial Hessian matrix constitutes a better spatial interest point detector than the spatial Laplacian operator [74], we consider an extension of the spatial Laplacian of the second-order temporal derivative (42) into the determinant of the spatial Hessian applied to the second-order temporal derivative

$$\begin{aligned} \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}&= s^{2 \gamma _\mathrm{s}} \, \tau ^{2 \gamma _{\tau }} \, \det \mathcal{H}_{(x,y)} L_{tt}\nonumber \\&= s^{2 \gamma _\mathrm{s}} \, \tau ^{2 \gamma _{\tau }} \; \left( L_{xxtt} \, L_{yytt} - L_{xytt}^2\right) . \end{aligned}$$
(74)

This operator can be expected to give a strong response when both the second-order temporal derivative and the determinant of the spatial Hessian give strong responses, e.g., when there are strong second-order temporal variations in combination with simultaneously strong spatial variations in two orthogonal spatial directions, such as for blinking blobs or corners.

When applied to a Gaussian blink of the form (43) having a spatio-temporal scale-space representation of the form (44), the scale-normalized determinant of the spatial Hessian applied to the second-order temporal derivative then assumes the following form at the origin

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}} \right| _{(0,0,0)} = \frac{s^{2 \gamma _\mathrm{s}} \tau ^{2 \gamma _{\tau }}}{8 \pi ^3 (s+s_0)^4 (\tau +\tau _0)^3} \end{aligned}$$
(75)

and assumes its extremum over spatial and temporal scales at

$$\begin{aligned} \hat{s}&=\frac{\gamma _\mathrm{s} s_0}{2 - \gamma _\mathrm{s}}, \end{aligned}$$
(76)
$$\begin{aligned} \hat{\tau }&= \frac{2 \gamma _{\tau } \tau _0}{3 - 2 \gamma _{\tau }}. \end{aligned}$$
(77)

If we require the spatial and temporal scale estimates to reflect the spatial and temporal extent of the Gaussian blink according to \(\hat{s} = s_0\) and \(\hat{\tau } = q^2 \tau _0\), then this implies that we should calibrate the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) according to

$$\begin{aligned} \gamma _\mathrm{s}&= 1, \end{aligned}$$
(78)
$$\begin{aligned} \gamma _{\tau }&= \frac{3 q^2}{2 \left( q^2+1\right) }, \end{aligned}$$
(79)

where specifically the choice \(q = 1\) corresponds to \(\gamma _{\tau } = 3/4\). For these values of \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\), the scale-normalized magnitude expression at the extremum over spatial and temporal scales will be given by

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}} \right| _{(0,0,0)} = \frac{\left( q^2 \tau _0\right) ^{\frac{3 q^2}{q^2+1}}}{128 \pi ^3 \left( q^2+1\right) ^3 s_0^2 \tau _0^3}, \end{aligned}$$
(80)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}} \right| _{(0,0,0)} = \frac{1}{1024 \pi ^3 s_0^2 \tau _0^{3/2}}. \end{aligned}$$
(81)
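
The scale selection step above can be reproduced numerically. The following is a minimal sketch, not part of the original analysis: it evaluates the closed-form response (75) on a logarithmic grid of scales, picks the maximizing grid point, and compares the result with the calibrated estimates and with the peak value (81). The function names, the grid range and the test values \(s_0 = 4\), \(\tau _0 = 9\) are illustrative assumptions.

```python
import math

def response_75(s, tau, s0, tau0, gamma_s, gamma_tau):
    """Closed-form response (75) at the origin for a unit-mass Gaussian blink."""
    return (s ** (2 * gamma_s) * tau ** (2 * gamma_tau)
            / (8 * math.pi ** 3 * (s + s0) ** 4 * (tau + tau0) ** 3))

def argmax_log_grid(f, lo, hi, n=4001):
    """Grid point in [lo, hi] (logarithmic spacing) at which f is largest."""
    grid = [lo * (hi / lo) ** (i / (n - 1)) for i in range(n)]
    return max(grid, key=f)

def select_scales(s0, tau0, q):
    """Grid-based scale estimates for the calibrated parameters (78)-(79)."""
    gamma_s = 1.0
    gamma_tau = 3 * q ** 2 / (2 * (q ** 2 + 1))
    # (75) is a product of a function of s and a function of tau,
    # so the two scale parameters can be maximised one at a time.
    s_hat = argmax_log_grid(
        lambda s: response_75(s, tau0, s0, tau0, gamma_s, gamma_tau),
        s0 / 100, s0 * 100)
    tau_hat = argmax_log_grid(
        lambda tau: response_75(s0, tau, s0, tau0, gamma_s, gamma_tau),
        tau0 / 100, tau0 * 100)
    return s_hat, tau_hat

s_hat, tau_hat = select_scales(s0=4.0, tau0=9.0, q=1.0)
peak = response_75(s_hat, tau_hat, 4.0, 9.0, 1.0, 0.75)
closed_form_81 = 1 / (1024 * math.pi ** 3 * 4.0 ** 2 * 9.0 ** 1.5)
print(s_hat, tau_hat, peak / closed_form_81)
```

Up to the grid resolution, the selected scales should agree with \(\hat{s} = s_0\) and \(\hat{\tau } = q^2 \tau _0\), and the ratio printed last should be close to one.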

If we additionally renormalize the original Gaussian blink to having maximum value equal to C according to (54), then the magnitude value at the extremum over spatio-temporal scales will instead be given by

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}} \right| _{(0,0,0)} = \frac{C^2 \left( q^2 \tau _0\right) ^{\frac{3 q^2}{q^2+1}}}{16 \left( q^2+1\right) ^3 \tau _0^2}, \end{aligned}$$
(82)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}} \right| _{(0,0,0)} = \frac{C^2}{128 \sqrt{\tau _0}} \end{aligned}$$
(83)

and implying that if we want to compare responses between different spatio-temporal scale levels, we should consider the following post-normalized magnitude measure to achieve scale invariance over both spatial and temporal scales

$$\begin{aligned}&\det \mathcal{H}_{(x,y),\mathrm{postnorm}} L_{tt,\mathrm{postnorm}} \nonumber \\&\quad = \tau ^{\frac{2-q^2}{q^2+1}} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}} \right| _{\gamma _\mathrm{s}=1,\gamma _{\tau }=\frac{3 q^2}{2 \left( q^2+1\right) }}\nonumber \\&\quad = s^2 \, \tau ^2 \, (L_{xxtt} \, L_{yytt} - L_{xytt}^2). \end{aligned}$$
(84)
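
To illustrate why the post-normalization matters when comparing responses across scales, the sketch below evaluates both magnitude measures for blinks of equal peak value C but different spatial and temporal extents. It is a hedged illustration: equation (54) is not reproduced in this section, so the code assumes that the renormalization multiplies the unit-mass blink by \(C (2 \pi )^{3/2} s_0 \sqrt{\tau _0}\), which is the factor relating (80) to (82). All function names are hypothetical.

```python
import math

def det_hessian_ltt(s, tau, s0, tau0, C):
    """L_xxtt * L_yytt - L_xytt**2 at the origin for a blink with peak value C."""
    # Assumed renormalisation: unit-mass blink times C (2 pi)^(3/2) s0 sqrt(tau0).
    amplitude = C * (2 * math.pi) ** 1.5 * s0 * math.sqrt(tau0)
    unit_mass = 1 / (8 * math.pi ** 3 * (s + s0) ** 4 * (tau + tau0) ** 3)
    return amplitude ** 2 * unit_mass

def magnitudes(s0, tau0, C=1.0, q=1.0):
    """Return (scale-normalised, post-normalised) magnitudes at the selected scales."""
    s_hat, tau_hat = s0, q ** 2 * tau0            # calibrated scale estimates
    gamma_tau = 3 * q ** 2 / (2 * (q ** 2 + 1))   # from (79), with gamma_s = 1
    raw = det_hessian_ltt(s_hat, tau_hat, s0, tau0, C)
    normalised = s_hat ** 2 * tau_hat ** (2 * gamma_tau) * raw
    post_normalised = s_hat ** 2 * tau_hat ** 2 * raw
    return normalised, post_normalised

for s0, tau0 in [(1.0, 1.0), (4.0, 16.0), (25.0, 0.5)]:
    print(s0, tau0, magnitudes(s0, tau0))
```

Under these assumptions, the scale-normalized magnitude varies with \(\tau _0\) in the way described by (83), whereas the post-normalized magnitude \(s^2 \tau ^2 (L_{xxtt} L_{yytt} - L_{xytt}^2)\) takes the same value for all three blinks.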

4.4 The Determinant of the Spatial Hessian Matrix Applied to the First-Order Temporal Derivative

Analogously to the determinant of the spatial Hessian applied to the second-order temporal derivative, we can also apply the determinant of the spatial Hessian to the first-order temporal derivative

$$\begin{aligned} \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}&= s^{2 \gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }} \, \det \mathcal{H}_{(x,y)} L_{t} \nonumber \\&= s^{2 \gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }} \; \left( L_{xxt} L_{yyt} - L_{xyt}^2\right) . \end{aligned}$$
(85)

This operator can be expected to give a strong response when both the first-order temporal derivative and the determinant of the spatial Hessian give strong responses, e.g., when there are strong first-order temporal variations in combination with simultaneously strong spatial variations in two orthogonal spatial directions, such as for onset or offset blobs or corners.

When applied to an onset Gaussian blob of the form (59) having a spatio-temporal scale-space representation of the form (60), the scale-normalized determinant of the spatial Hessian applied to the first-order temporal derivative then assumes the following form at the origin

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}} \right| _{(0,0,0)} = \frac{s^{2 \gamma _\mathrm{s}} \tau ^{\gamma _{\tau }}}{8 \pi ^3 (s+s_0)^4 (\tau +\tau _0)} \end{aligned}$$
(86)

and assumes its extremum over spatial and temporal scales at

$$\begin{aligned} \hat{s}&= \frac{\gamma _\mathrm{s} s_0}{2 - \gamma _\mathrm{s}}, \end{aligned}$$
(87)
$$\begin{aligned} \hat{\tau }&= \frac{\gamma _{\tau } \tau _0}{1 - \gamma _{\tau }}. \end{aligned}$$
(88)

If we require the spatial and temporal scale estimates to reflect the spatial and temporal extent of the Gaussian onset blob according to \(\hat{s} = s_0\) and \(\hat{\tau } = q^2 \tau _0\), then this implies that we should calibrate the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) according to

$$\begin{aligned} \gamma _\mathrm{s}&= 1, \end{aligned}$$
(89)
$$\begin{aligned} \gamma _{\tau }&= \frac{q^2}{q^2+1}, \end{aligned}$$
(90)

where specifically the choice \(q = 1\) corresponds to \(\gamma _{\tau } = 1/2\). For these values of \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\), the scale-normalized magnitude expression at the extremum over spatial and temporal scales will be given by

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}} \right| _{(0,0,0)} = \frac{\left( q^2 \tau _0\right) ^{\frac{q^2}{q^2+1}}}{128 \pi ^3 \left( q^2+1\right) s_0^2 \tau _0}, \end{aligned}$$
(91)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}} \right| _{(0,0,0)} = \frac{1}{256 \pi ^3 s_0^2 \sqrt{\tau _0}}. \end{aligned}$$
(92)

If we additionally renormalize the original Gaussian onset blob to having maximum value equal to C according to (70), then the magnitude value at the extremum over spatio-temporal scales will instead be given by

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}} \right| _{(0,0,0)} = \frac{C^2 \left( q^2 \tau _0\right) ^{\frac{q^2}{q^2+1}}}{32 \pi q^2 \tau _0+32 \pi \tau _0}, \end{aligned}$$
(93)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}} \right| _{(0,0,0)} = \frac{C^2}{64 \pi \sqrt{\tau _0}} \end{aligned}$$
(94)

and implying that if we want to compare responses between different spatio-temporal scale levels, we should consider the following post-normalized magnitude measure to achieve scale invariance over both spatial and temporal scales

$$\begin{aligned}&\det \mathcal{H}_{(x,y),\mathrm{postnorm}} L_{t,\mathrm{postnorm}} \nonumber \\&\quad = \tau ^{\frac{1}{q^2+1}} \left. \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}} \right| _{\gamma _\mathrm{s}=1,\gamma _{\tau }=\frac{q^2}{q^2+1}} \nonumber \\&\quad = s^{2} \tau \left( L_{xxt} L_{yyt} - L_{xyt}^2 \right) . \end{aligned}$$
(95)

4.5 The Determinant of the Spatio-Temporal Hessian Matrix

For general values of the spatial and temporal scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\), the scale-normalized determinant of the spatio-temporal Hessian is given by

$$\begin{aligned}&\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L \nonumber \\&\quad = s^{2 \gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }} \, \left( L_{xx} L_{yy} L_{tt} + 2 L_{xy} L_{xt} L_{yt} \phantom {L_{xy}^2} \right. \nonumber \\&\quad \qquad \qquad \quad \quad \left. -\,L_{xx} L_{yt}^2 - L_{yy} L_{xt}^2 - L_{tt} L_{xy}^2 \right) . \end{aligned}$$
(96)

This operator can be expected to give strong responses when there are simultaneously strong second-order variations in three strongly different directions in joint space–time.

When applied to a Gaussian blink of the form (43) having a spatio-temporal scale-space representation of the form (44), the scale-normalized determinant of the spatio-temporal Hessian at the origin then assumes the form

$$\begin{aligned}&\left. \det (\mathcal{H}_{(x,y,t),\mathrm{norm}} L) \right| _{(0,0,0)} \nonumber \\&\quad = -\frac{s^{2 \gamma _\mathrm{s}} \tau ^{\gamma _{\tau }}}{16 \sqrt{2} \pi ^{9/2} (s+s_0)^5 (\tau +\tau _0)^{5/2}} \end{aligned}$$
(97)

and assumes its extremum over spatial and temporal scales at

$$\begin{aligned} \hat{s}&= \frac{2 \gamma _\mathrm{s} s_0}{5 - 2 \gamma _\mathrm{s}}, \end{aligned}$$
(98)
$$\begin{aligned} \hat{\tau }&= \frac{2 \gamma _{\tau } \tau _0}{5 - 2 \gamma _{\tau }}. \end{aligned}$$
(99)

Requiring the spatial and temporal scale estimates to reflect the spatial and temporal extent of the Gaussian blink according to \(\hat{s} = s_0\) and \(\hat{\tau } = q^2 \tau _0\) implies that we should calibrate the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) according to

$$\begin{aligned} \gamma _\mathrm{s}&= \frac{5}{4}, \end{aligned}$$
(100)
$$\begin{aligned} \gamma _{\tau }&= \frac{5 q^2}{2 (q^2+1)}, \end{aligned}$$
(101)

where specifically the choice \(q = 1\) corresponds to \(\gamma _{\tau } = 5/4\). For these values of \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\), the scale-normalized magnitude expression at the extremum over spatial and temporal scales is given by

$$\begin{aligned}&\left. \det (\mathcal{H}_{(x,y,t),\mathrm{norm}} L) \right| _{(0,0,0)} \nonumber \\&\quad = -\, \frac{\left( q^2 \tau _0\right) ^{\frac{5 q^2}{2 \left( q^2+1\right) }}}{512 \sqrt{2} \pi ^{9/2} s_0^{5/2} \left( \left( q^2+1\right) \tau _0\right) ^{5/2}}, \end{aligned}$$
(102)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \det (\mathcal{H}_{(x,y,t),\mathrm{norm}} L) \right| _{(0,0,0)} = -\frac{1}{4096 \pi ^{9/2} s_0^{5/2} \tau _0^{5/4}}. \end{aligned}$$
(103)

If we additionally renormalize the original Gaussian blink to having maximum value equal to C according to (54), then the magnitude value at the extremum over spatio-temporal scales will instead be given by

$$\begin{aligned} \left. \det (\mathcal{H}_{(x,y,t),\mathrm{norm}} L) \right| _{(0,0,0)} = -\frac{C^3 \, \sqrt{s_0} \, \tau _0^{3/2} \left( q^2 \tau _0\right) ^{\frac{5 q^2}{2 \left( q^2+1\right) }}}{32 \left( \left( q^2+1\right) \tau _0\right) ^{5/2}}, \end{aligned}$$
(104)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \det (\mathcal{H}_{(x,y,t),\mathrm{norm}} L) \right| _{(0,0,0)} = -\frac{C^3 \, \sqrt{s_0} \root 4 \of {\tau _0}}{128 \sqrt{2}} \end{aligned}$$
(105)

and implying that if we want to compare responses between different spatio-temporal scale levels, we should consider the following post-normalized magnitude measure to achieve scale invariance over both spatial and temporal scales

$$\begin{aligned}&\det (\mathcal{H}_{(x,y,t),\mathrm{postnorm}} L) \nonumber \\&\quad = \frac{\tau ^{\frac{2-3 q^2}{2 (q^2+1)}}}{\sqrt{s}} \left. \det (\mathcal{H}_{(x,y,t),\mathrm{norm}} L) \right| _{\gamma _\mathrm{s}=\frac{5}{4},\gamma _{\tau }=\frac{5 q^2}{2 (q^2+1)}} \nonumber \\&\quad = s^{2} \tau \left( L_{xx} L_{yy} L_{tt} + 2 L_{xy} L_{xt} L_{yt} \phantom {L_{xy}^2} \right. \nonumber \\&\qquad \qquad \left. - \,L_{xx} L_{yt}^2 - L_{yy} L_{xt}^2 - L_{tt} L_{xy}^2 \right) . \end{aligned}$$
(106)

In view of these results, it is illuminating to compare to the analysis by Willems et al. [122], who defined a scale-normalized determinant of the Hessian corresponding to (96) based on \(\gamma _\mathrm{s} = 1\) and \(\gamma _{\tau } = 1\), which in turn implies that the spatial and temporal scale estimates were instead given by

$$\begin{aligned} \hat{s}&= \frac{2}{3} s_0, \end{aligned}$$
(107)
$$\begin{aligned} \hat{\tau }&= \frac{2}{3} \tau _0. \end{aligned}$$
(108)

If we would like the features to be detected at the scales at which they occur, such that \(\hat{s} = s_0\) and \(\hat{\tau } = \tau _0\), we should, however, instead choose the scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) according to (100) and (101) for \(q = 1\), so that we achieve maximum similarity between the response property of the spatio-temporal feature detector in relation to the spatio-temporal features we would like to detect. If using a lower value of the parameter \(q < 1\), then this property is sacrificed for the possible gain of obtaining faster temporal responses in a time-causal implementation, where otherwise the detection of image features at coarser temporal scales implies longer temporal delays (compare with Sect. 2.2). Over the spatial domain or over a non-causal temporal domain as used in the original work by Willems et al. [122], it should, however, from signal detection theory be better to calibrate the method such that \(\hat{s} = s_0\) and \(\hat{\tau } = \tau _0\). Notwithstanding the potential gain of achieving a shorter temporal delay by using a lower value of \(q < 1\), from a signal detection theory background there should be no motivation to calibrate the feature detector to choosing finer spatial scale levels than \(s_0\).
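
The comparison with the choice \(\gamma _\mathrm{s} = \gamma _{\tau } = 1\) can be made concrete with a few lines of code. The sketch below uses the fact that the magnitude of (97) is a product of factors of the form \(x^a/(x+x_0)^m\), whose maximum over \(x > 0\) lies at \(x = a x_0/(m-a)\). The helper names and the test values are illustrative assumptions, not part of the original text.

```python
def scale_estimate(numerator_power, denominator_power, x0):
    """Maximiser of x**a / (x + x0)**m over x > 0, valid for 0 < a < m.

    Setting the logarithmic derivative a/x - m/(x + x0) to zero
    gives x = a * x0 / (m - a).
    """
    a, m = numerator_power, denominator_power
    if not 0 < a < m:
        raise ValueError("an interior maximum requires 0 < a < m")
    return a * x0 / (m - a)

def hessian_scale_estimates(gamma_s, gamma_tau, s0, tau0):
    """Scale estimates (98)-(99): the magnitude of (97) has powers 5 in s and 5/2 in tau."""
    s_hat = scale_estimate(2 * gamma_s, 5, s0)
    tau_hat = scale_estimate(gamma_tau, 2.5, tau0)
    return s_hat, tau_hat

s0, tau0 = 3.0, 6.0
print(hessian_scale_estimates(1.0, 1.0, s0, tau0))    # gamma_s = gamma_tau = 1
print(hessian_scale_estimates(1.25, 1.25, s0, tau0))  # (100)-(101) with q = 1
```

With \(\gamma _\mathrm{s} = \gamma _{\tau } = 1\) the estimates come out as two thirds of \(s_0\) and \(\tau _0\), in agreement with (107) and (108), whereas the calibrated values (100) and (101) for \(q = 1\) return \(s_0\) and \(\tau _0\) themselves.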

4.6 The Second-Order Temporal Derivative of the Determinant of the Spatial Hessian Matrix

When using the spatial Laplacian operator over the spatial domain as a basis for defining spatio-temporal interest operators, the spatial Laplacian does because of its linearity commute with the first- and second-order temporal derivatives. Thereby, the spatial Laplacian of the second-order temporal derivative is equal to the second-order temporal derivative of the spatial Laplacian. The determinant of the spatial Hessian is, however, a non-linear operator and does not commute with temporal differentiation. When replacing the Laplacian interest operator in the spatio-temporal interest operator \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\) by the determinant of the spatial Hessian, an alternative to considering the determinant of the spatial Hessian applied to the second-order temporal derivative \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\) is therefore to consider the second-order temporal derivative of the determinant of the spatial Hessian

$$\begin{aligned}&\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \nonumber \\&\quad = s^{2 \gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }} \, \partial _{tt} (\det \mathcal{H}_{(x,y)} L) \nonumber \\&\quad = s^{2 \gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }} \, \left( L_{xxtt} L_{yy} + 2 L_{xxt} L_{yyt} + L_{xx} L_{yytt}\right. \nonumber \\&\left. \quad \phantom {=} \phantom {s^{2 \gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }} \, (} -\,2 L_{xyt}^2 - 2 L_{xy} L_{xytt}\right) . \end{aligned}$$
(109)

This operator can be expected to give strong responses when the spatial slice of joint space–time contains strong second-order variations in two orthogonal spatial directions, and this structure in turn also leads to strong second-order temporal variations as time evolves.

When applied to a Gaussian blink of the form (43) having a spatio-temporal scale-space representation of the form (44), the scale-normalized second-order temporal derivative of the determinant of the spatial Hessian at the origin then assumes the form

$$\begin{aligned}&\left. \partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} \nonumber \\&\quad = -\frac{s^{2 \gamma _\mathrm{s}} \tau ^{\gamma _{\tau }}}{4 \pi ^3 (s+s_0)^4 (\tau +\tau _0)^2} \end{aligned}$$
(110)

and assumes its extremum over spatial and temporal scales at

$$\begin{aligned} \hat{s}&=\frac{\gamma _\mathrm{s} s_0}{2 - \gamma _\mathrm{s}}, \end{aligned}$$
(111)
$$\begin{aligned} \hat{\tau }&= \frac{\gamma _{\tau } \tau _0}{2 - \gamma _{\tau }}. \end{aligned}$$
(112)

If we require the spatial and temporal scale estimates to reflect the spatial and temporal extent of the Gaussian blink according to \(\hat{s} = s_0\) and \(\hat{\tau } = q^2 \tau _0\), then this implies that we should calibrate the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) according to

$$\begin{aligned} \gamma _\mathrm{s}&= 1, \end{aligned}$$
(113)
$$\begin{aligned} \gamma _{\tau }&= \frac{2 q^2}{q^2+1}, \end{aligned}$$
(114)

where specifically the choice \(q = 1\) corresponds to \(\gamma _{\tau } = 1\). For these values of \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\), the scale-normalized magnitude expression at the extremum over spatial and temporal scales will be given by

$$\begin{aligned} \left. \partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} = -\frac{\left( q^2 \tau _0\right) ^{\frac{2 q^2}{q^2+1}}}{64 \pi ^3 \left( q^2+1\right) ^2 s_0^2 \tau _0^2}, \end{aligned}$$
(115)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} = -\frac{1}{256 \pi ^3 s_0^2 \tau _0}. \end{aligned}$$
(116)

If we additionally renormalize the original Gaussian blink to having maximum value equal to C according to (54), then the magnitude value at the extremum over spatio-temporal scales will instead be given by

$$\begin{aligned} \left. \partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} = -\frac{C^2 \, \left( q^2 \tau _0\right) ^{\frac{2 q^2}{q^2+1}}}{8 \left( q^2+1\right) ^2 \tau _0}, \end{aligned}$$
(117)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} = -\frac{C^2}{32} \end{aligned}$$
(118)

and implying that if we want to compare responses between different spatio-temporal scale levels, we should consider the following post-normalized magnitude measure to achieve scale invariance over both spatial and temporal scales

$$\begin{aligned}&\partial _{tt,\mathrm{postnorm}} (\det \mathcal{H}_{(x,y),\mathrm{postnorm}} L) \nonumber \\&\quad = \tau ^{\frac{1-q^2}{1+q^2}} \left. \partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{\gamma _\mathrm{s}=1,\gamma _{\tau }=\frac{2 q^2}{1+q^2}} \nonumber \\&\quad = s^{2} \tau \left( L_{xxtt} L_{yy} + 2 L_{xxt} L_{yyt} + L_{xx} L_{yytt} \phantom {L_{xyt}^2} \right. \nonumber \\&\quad \phantom {=} \phantom {s^{2} \tau } \left. \quad - 2 L_{xyt}^2 - 2 L_{xy} L_{xytt} \right) . \end{aligned}$$
(119)
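
The closed-form response (110) used in this section can be cross-checked numerically. The following sketch, under the assumption that the scale-space representation of the blink is the separable Gaussian \(g(x, y;\; s+s_0) \, g(t;\; \tau +\tau _0)\), compares three quantities without the scale factors \(s^{2 \gamma _\mathrm{s}} \tau ^{\gamma _{\tau }}\): a central second difference in time of the determinant of the spatial Hessian, the expansion (109) evaluated at the origin, and the expression (110). The function names and the step length are illustrative choices.

```python
import math

def gauss1d(t, var):
    """One-dimensional Gaussian with variance var."""
    return math.exp(-t * t / (2 * var)) / math.sqrt(2 * math.pi * var)

def spatial_det_hessian_at_centre(t, S, T):
    """det of the spatial Hessian of L = g(x, y; S) g(t; T) at x = y = 0.

    At the spatial centre L_xx = L_yy = -L / S and L_xy = 0, so det H = (L / S)**2.
    """
    L = gauss1d(t, T) / (2 * math.pi * S)
    return (L / S) ** 2

def second_time_derivative_fd(S, T, h=1e-3):
    """Central second difference in t of det H at the origin."""
    d = lambda t: spatial_det_hessian_at_centre(t, S, T)
    return (d(h) - 2 * d(0.0) + d(-h)) / (h * h)

def expansion_109_at_origin(S, T):
    """Bracket of (109) at the origin, without the scale factors.

    There L_xy = L_xyt = L_xxt = L_yyt = 0, leaving L_xxtt L_yy + L_xx L_yytt.
    """
    L0 = gauss1d(0.0, T) / (2 * math.pi * S)
    L_xx = L_yy = -L0 / S
    L_xxtt = L_yytt = L0 / (S * T)
    return L_xxtt * L_yy + L_xx * L_yytt

def closed_form_110(S, T):
    """Expression (110) without the factor s**(2 gamma_s) * tau**gamma_tau."""
    return -1 / (4 * math.pi ** 3 * S ** 4 * T ** 2)

S, T = 1.0 + 4.0, 2.0 + 7.0   # S = s + s0, T = tau + tau0
print(second_time_derivative_fd(S, T), expansion_109_at_origin(S, T), closed_form_110(S, T))
```

The three printed values should agree up to the discretization error of the finite difference.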

4.7 The First-Order Temporal Derivative of the Determinant of the Spatial Hessian Matrix

Analogously to the second-order temporal derivative of the determinant of the spatial Hessian, we can also define the first-order temporal derivative of the determinant of the spatial Hessian

$$\begin{aligned}&\partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \nonumber \\&\quad = s^{2 \gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }/2} \, \partial _{t} (\det \mathcal{H}_{(x,y)} L) \nonumber \\&\quad = s^{2 \gamma _\mathrm{s}} \, \tau ^{\gamma _{\tau }/2} \; \left( L_{xxt} L_{yy} + L_{xx} L_{yyt} - 2 L_{xy} L_{xyt}\right) . \end{aligned}$$
(120)

This operator can be expected to give strong responses when the spatial slice of joint space–time contains strong second-order variations in two orthogonal spatial directions, and this structure in turn also leads to strong first-order temporal variations as time evolves.

When applied to an onset Gaussian blob of the form (59) having a spatio-temporal scale-space representation of the form (60), the first-order temporal derivative of the determinant of the spatial Hessian at the origin then assumes the form

$$\begin{aligned}&\left. \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} \nonumber \\&\quad = \frac{s^{2 \gamma _\mathrm{s}} \tau ^{\gamma _{\tau }/2}}{4 \sqrt{2} \pi ^{5/2} (s+s_0)^4 \sqrt{\tau +\tau _0}} \end{aligned}$$
(121)

and assumes its extremum over spatial and temporal scales at

$$\begin{aligned} \hat{s}&= \frac{\gamma _\mathrm{s} s_0}{2 - \gamma _\mathrm{s}}, \end{aligned}$$
(122)
$$\begin{aligned} \hat{\tau }&= \frac{\gamma _{\tau } \tau _0}{1 - \gamma _{\tau }}. \end{aligned}$$
(123)

If we require the spatial and temporal scale estimates to reflect the spatial and temporal extent of the Gaussian onset blob according to \(\hat{s} = s_0\) and \(\hat{\tau } = q^2 \tau _0\), then this implies that we should calibrate the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) according to

$$\begin{aligned} \gamma _\mathrm{s}&= 1, \end{aligned}$$
(124)
$$\begin{aligned} \gamma _{\tau }&= \frac{q^2}{q^2+1}, \end{aligned}$$
(125)

where specifically the choice \(q = 1\) corresponds to \(\gamma _{\tau } = 1/2\). For these values of \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\), the scale-normalized magnitude expression at the extremum over spatial and temporal scales will be given by

$$\begin{aligned}&\left. \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} \nonumber \\&\quad = \frac{\left( q^2 \tau _0\right) ^{\frac{q^2}{2 q^2+2}}}{64 \sqrt{2} \pi ^{5/2} s_0^2 \sqrt{\left( q^2+1\right) \tau _0}}, \end{aligned}$$
(126)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} = \frac{1}{128 \pi ^{5/2} s_0^2 \root 4 \of {\tau _0}}. \end{aligned}$$
(127)

If we additionally renormalize the original Gaussian onset blob to having maximum value equal to C according to (70), then the magnitude value at the extremum over spatio-temporal scales will instead be given by

$$\begin{aligned} \left. \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} = \frac{C^2 \, \left( q^2 \tau _0\right) ^{\frac{q^2}{2 q^2+2}}}{16 \sqrt{2 \pi } \sqrt{\left( 1+q^2\right) \tau _0}}, \end{aligned}$$
(128)

where specifically the choice \(q = 1\) corresponds to

$$\begin{aligned} \left. \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{(0,0,0)} = \frac{C^2}{32 \sqrt{\pi } \root 4 \of {\tau _0}} \end{aligned}$$
(129)

and implying that if we want to compare responses between different spatio-temporal scale levels, we should consider the following post-normalized magnitude measure to achieve scale invariance over both spatial and temporal scales

$$\begin{aligned}&\partial _{t,\mathrm{postnorm}} (\det \mathcal{H}_{(x,y),\mathrm{postnorm}} L) \nonumber \\&\quad = \tau ^{\frac{1}{2 \left( q^2+1\right) }} \left. \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L) \right| _{\gamma _\mathrm{s}=1,\gamma _{\tau }=\frac{q^2}{q^2+1}} \nonumber \\&\quad = s^{2} \sqrt{\tau } \left( L_{xxt} L_{yy} + L_{xx} L_{yyt} - 2 L_{xy} L_{xyt} \right) . \end{aligned}$$
(130)
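
The temporal calibrations derived in Sects. 4.3–4.7 all follow one pattern, which the sketch below makes explicit. It assumes that the temporal factor of each closed-form response has the form \(\tau ^{k \gamma _{\tau }}/(\tau +\tau _0)^n\), with the pairs (k, n) read off from (75), (86), (97), (110) and (121); requiring the maximum to lie at \(\hat{\tau } = q^2 \tau _0\) then gives \(\gamma _{\tau } = n q^2/(k (q^2+1))\). The dictionary keys and function name are illustrative.

```python
from fractions import Fraction

def calibrated_gamma_tau(k, n, q2):
    """gamma_tau for which tau**(k * gamma_tau) / (tau + tau0)**n peaks at q2 * tau0.

    The stationarity condition k * gamma_tau / tau = n / (tau + tau0) with
    tau = q2 * tau0 gives gamma_tau = n * q2 / (k * (q2 + 1)).
    """
    return Fraction(n) * Fraction(q2) / (Fraction(k) * (Fraction(q2) + 1))

# (k, n) read off from the temporal factors of the closed-form responses.
TEMPORAL_POWERS = {
    "det H L_tt, (75)": (2, 3),
    "det H L_t, (86)": (1, 1),
    "det H_(x,y,t) L, (97)": (1, Fraction(5, 2)),
    "d_tt det H L, (110)": (1, 2),
    "d_t det H L, (121)": (Fraction(1, 2), Fraction(1, 2)),
}

for name, (k, n) in TEMPORAL_POWERS.items():
    print(name,
          calibrated_gamma_tau(k, n, q2=1),
          calibrated_gamma_tau(k, n, q2=Fraction(1, 4)))
```

For \(q = 1\) the first printed column should reproduce the values 3/4, 1/2, 5/4, 1 and 1/2 stated in the respective sections; the second column shows how the calibration changes for \(q = 1/2\).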

4.8 The Spatio-Temporal Laplacian

If aiming at defining a spatio-temporal analogue of the Laplacian operator, one does, however, need to consider that the most straightforward way of defining such an operator

$$\begin{aligned} \nabla _{(x, y, t)}^2 L = L_{xx} + L_{yy} + L_{tt} \end{aligned}$$
(131)

is not covariant under independent scaling transformations of the spatial and temporal domains as occurs if observing the same scene with cameras having independently different spatial and temporal sampling rates. Therefore, if attempting to define a spatio-temporal analogue of the Laplacian of the Gaussian operator, one could in principle consider introducing an arbitrary scaling factor \(\varkappa ^2\) between the temporal versus the spatial derivatives

$$\begin{aligned} \nabla _{(x, y, t)}^2 L = L_{xx} + L_{yy} + \varkappa ^2 L_{tt}. \end{aligned}$$
(132)

This operator can be expected to give a strong response when there is strong second-order variation in at least one spatial dimension or in the temporal dimension. It is, however, not necessary that there are simultaneously strong variations over both space and time, implying that this operator cannot be expected to be as selective as the other seven spatio-temporal interest point detectors studied above.

With the previously introduced recipe of replacing spatial and temporal derivatives by corresponding scale-normalized derivatives, the corresponding scale-normalized expression then becomes

$$\begin{aligned} \nabla _{(x, y, t),\mathrm{norm}}^2 L = s^{\gamma _\mathrm{s}} (L_{xx} + L_{yy}) + \varkappa ^2 \tau ^{\gamma _{\tau }} L_{tt}, \end{aligned}$$
(133)

which, however, is not within the family of spatio-temporal differential invariants (18) guaranteed to lead to scale-covariant spatio-temporal scale selection.

When applied to a Gaussian blink of the form (43) having a spatio-temporal scale-space representation of the form (44), the scale-normalized spatio-temporal Laplacian at the origin then assumes the form

$$\begin{aligned} \left. \nabla _{(x, y, t),\mathrm{norm}}^2 L \right| _{(0,0,0)} = \frac{-2 s^{\gamma _\mathrm{s}} (\tau +\tau _0)-\varkappa ^2 \tau ^{\gamma _{\tau }} (s+s_0)}{2 \sqrt{2} \pi ^{3/2} (s+s_0)^2 (\tau +\tau _0)^{3/2}}. \end{aligned}$$
(134)

Unfortunately, the algebraic equations that determine the spatial and temporal scale estimates as function of \(s_0\) and \(\tau _0\)

$$\begin{aligned}&-2 (\gamma _\mathrm{s}-2) s^{\gamma _\mathrm{s}+1} (\tau +\tau _0)-2 \gamma _\mathrm{s} s_0 s^{\gamma _\mathrm{s}} (\tau +\tau _0) \nonumber \\&\quad +s^2 \varkappa ^2 \tau ^{\gamma _{\tau }}+s s_0 \varkappa ^2 \tau ^{\gamma _{\tau }} = 0, \end{aligned}$$
(135)
$$\begin{aligned}&2 \tau s^{\gamma _\mathrm{s}} (\tau +\tau _0)-s \varkappa ^2 \tau ^{\gamma _{\tau }} ((2 \gamma _{\tau }-3) \tau +2 \gamma _{\tau } \tau _0) \nonumber \\&\quad -s_0 \varkappa ^2 \tau ^{\gamma _{\tau }} ((2 \gamma _{\tau }-3) \tau +2 \gamma _{\tau } \tau _0) = 0, \end{aligned}$$
(136)

are hard to solve for general values of the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\). By solving these equations in the specific case of \(\gamma _\mathrm{s} = 1\) and \(\gamma _{\tau } = 1\), we can, however, note that the resulting scale estimates

$$\begin{aligned} \hat{s} =&s_0 \left( \frac{5}{2+\varkappa ^2}-1\right) , \end{aligned}$$
(137)
$$\begin{aligned} \hat{\tau } =&2 \tau _0 \left( 2-\frac{5}{2+\varkappa ^2}\right) , \end{aligned}$$
(138)

will be explicitly dependent on the relative scaling factor \(\varkappa ^2\) between the derivatives with respect to the temporal versus the spatial domains. This situation is in clear contrast to the previously considered spatio-temporal differential invariants for spatio-temporal scale selection: (i)–(ii) the spatial Laplacian of the first- and second-order temporal derivatives (58) and (42), (iii)–(iv) the determinant of the spatial Hessian applied to the first- and second-order temporal derivatives (85) and (74), (v) the determinant of the spatio-temporal Hessian (96) and (vi)–(vii) the first- and second-order temporal derivatives of the determinant of the spatial Hessian (120) and (109), for which a corresponding multiplication of the temporal derivative operator by a temporal rescaling factor \(\varkappa \) does not affect the spatio-temporal scale estimates.
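
The dependence of the scale estimates on \(\varkappa ^2\) can be inspected numerically. The sketch below evaluates (137) and (138) for a few values of \(\varkappa ^2\) and checks, by central differences in the logarithms of the scale parameters, that the response (134) with \(\gamma _\mathrm{s} = \gamma _{\tau } = 1\) is stationary there. Function names and test values are illustrative assumptions; note that the estimates are positive only for \(1/2< \varkappa ^2 < 3\).

```python
import math

def laplacian_response(s, tau, s0, tau0, kappa2):
    """Expression (134) with gamma_s = gamma_tau = 1."""
    num = -2 * s * (tau + tau0) - kappa2 * tau * (s + s0)
    den = 2 * math.sqrt(2) * math.pi ** 1.5 * (s + s0) ** 2 * (tau + tau0) ** 1.5
    return num / den

def scale_estimates(s0, tau0, kappa2):
    """Closed-form estimates (137)-(138); positive only for 1/2 < kappa2 < 3."""
    s_hat = s0 * (5 / (2 + kappa2) - 1)
    tau_hat = 2 * tau0 * (2 - 5 / (2 + kappa2))
    return s_hat, tau_hat

def relative_gradient(s, tau, s0, tau0, kappa2, eps=1e-5):
    """Central-difference log-scale gradient of (134), divided by the response."""
    f = lambda a, b: laplacian_response(a, b, s0, tau0, kappa2)
    ds = (f(s * (1 + eps), tau) - f(s * (1 - eps), tau)) / (2 * eps)
    dtau = (f(s, tau * (1 + eps)) - f(s, tau * (1 - eps))) / (2 * eps)
    return ds / f(s, tau), dtau / f(s, tau)

s0, tau0 = 4.0, 9.0
for kappa2 in (0.75, 1.0, 2.0):
    s_hat, tau_hat = scale_estimates(s0, tau0, kappa2)
    print(kappa2, s_hat / s0, tau_hat / tau0,
          relative_gradient(s_hat, tau_hat, s0, tau0, kappa2))
```

The ratios \(\hat{s}/s_0\) and \(\hat{\tau }/\tau _0\) change with \(\varkappa ^2\), while the relative gradients should vanish up to numerical error, confirming that (137) and (138) are stationary points of (134).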

The underlying theoretical reason for this lack of spatial and temporal scale invariance is that the attempt to define a spatio-temporal Laplacian operator according to (132) is not covariant under independent rescaling transformations of the spatial and temporal domains. The spatial Laplacian of the first- and second-order temporal derivatives, the determinant of the Hessian of the first- and second-order temporal derivatives and the determinant of the spatio-temporal Hessian are on the other hand truly covariant under such independent relative scaling transformations of the spatial and temporal domains.

The corresponding magnitude estimate at the extremum over spatio-temporal scales is for \(\gamma _\mathrm{s} = 1\) and \(\gamma _{\tau } = 1\) given by

$$\begin{aligned} \left. \nabla _{(x, y, t),\mathrm{norm}}^2 L \right| _{(0,0,0)} = -\frac{3 \sqrt{\frac{3}{10}} \left( 2+\varkappa ^2\right) }{25 \pi ^{3/2} s_0 \sqrt{\tau _0}}. \end{aligned}$$
(139)

If we additionally renormalize the original Gaussian blink to having maximum value equal to C according to (54), then the magnitude value at the extremum over spatio-temporal scales will instead be given by

$$\begin{aligned} \left. \nabla _{(x, y, t),\mathrm{norm}}^2 L \right| _{(0,0,0)} = -\frac{6}{25} \sqrt{\frac{3}{5}} \left( 2+\varkappa ^2\right) \, C \end{aligned}$$
(140)

and also dependent on the in principle arbitrary relative weighting factor \(\varkappa ^2\) between the temporal and spatial derivatives.

To illustrate the practical consequence of the lack of spatio-temporal scale covariance for a differential entity used for spatio-temporal scale selection, let us consider two different video cameras that are observing the same scene. Let us for simplicity assume that the sensors in the two video cameras have the same spatial resolution, whereas the temporal resolutions differ by say a factor of two. If we define a spatio-temporal Laplacian operator for each video domain based on the native coordinate system of each respective individual camera, then the spatio-temporal Laplacian operator in the first video domain will correspond to a spatio-temporal Laplacian operator in the second video domain that differs by a factor of two in the value of \(\varkappa \). Thus, if we perform spatio-temporal scale selection by detecting local extrema over spatio-temporal scales of the spatio-temporal Laplacian, we will detect extrema in effective spatio-temporal differential expressions that differ between the two video domains. Specifically, this implies that we cannot exactly interrelate the spatio-temporal Laplacian responses between the two domains in the way necessary to carry out a proof of scale invariance for general classes of spatio-temporal image structures. Although the scale estimates could for another form of scale normalization be computed for the specific spatio-temporal image model of a Gaussian blink [49], corresponding scale selection properties are then not guaranteed to generalize to more general spatio-temporal image structures beyond the specific subfamily of image structures for which the scale calibration was performed. 
Because of the covariance properties of the spatio-temporal differential invariants \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\), \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\), such interrelations can, however, be carried out for those differential operators between two video domains with undetermined relative scaling factors between the spatial and temporal domains. Consequently, these differential entities are much better suited for spatio-temporal scale selection than the attempt to define a spatio-temporal Laplacian operator.

Additionally, if one would attempt to rank image features based on the corresponding scale-normalized magnitude measure \(\nabla _{(x, y, t),\mathrm{norm}}^2 L\), then the relative ranking of the image features could also be different between the two domains of the two video cameras, whereas the corresponding relative ranking of image features is preserved for spatio-temporal scale selection based on the differential invariants \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}},\) \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}},\) \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\), \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\).

In the spatio-temporal interest point detector proposed in [49], a scale-normalized spatio-temporal Laplacian operator corresponding to the specific choice of \(\varkappa = 1\) was indeed used for spatio-temporal scale selection, although with a different form of scale normalization of the form

$$\begin{aligned} \nabla _{(x, y, t),\mathrm{norm}}^2 L = s^{a} \tau ^b (L_{xx} + L_{yy}) + s^c \tau ^d L_{tt} \end{aligned}$$
(141)

for the specific choices of \(a = 1, b = 1, c = 1/2\) and \(d = 3/4\). In addition to the above-mentioned fundamental limitation of using a spatio-temporal Laplacian operator for spatio-temporal scale selection, the mixed scale normalization in (141), with the temporal scale parameter \(\tau \) affecting the spatial derivative expressions \(L_{xx}\) and \(L_{yy}\) and the spatial scale parameter s affecting the temporal derivative expression \(L_{tt}\), makes it impossible to establish a relation between such spatio-temporal Laplacian operators in different spatio-temporal domains that are affected by independent relative rescalings of the spatial and temporal domains. Specifically, it is therefore not possible to establish a covariance relation between two such independently rescaled spatio-temporal image domains, as would be needed to prove scale covariance of the spatial and temporal scale estimates for general spatio-temporal image structures according to the spatio-temporal scale selection theory in Sect. 3. By these theoretical arguments, we can therefore explain why the scale estimates from the spatial and temporal selection mechanisms in [49] were later empirically found to not be sufficiently robust.

If a scale-normalized spatio-temporal Laplacian operator is to be used for spatio-temporal feature detection anyway, the scale normalization according to (133) should, however, lead to better experimental results than the scale normalization according to (141), since the partial derivatives with respect to the different dimensions of space and time in the scale-normalized differential expression (141) are not added in terms of dimensionless scale-normalized differential entities for the given values of a, b, c and d, whereas the partial derivatives with respect to space versus time are added in a dimensionless manner in the scale-normalized differential expression (133) if \(\gamma _\mathrm{s} = 1\) and \(\gamma _{\tau } = 1\) (and corresponding to \(a = 1, b = 0, c = 0\) and \(d = 1\) in (141) for the specific choice of \(\varkappa = 1\)).
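A short dimensional check may make this argument more concrete. It is a sketch under the assumption, not restated in this section, that the scale parameters are variances, so that s has dimension \([\text{length}]^2\) and \(\tau \) has dimension \([\text{time}]^2\); the symbol \([L]\) for the dimension of the image intensity is introduced here for illustration only. The two terms in (141) then have the dimensions

$$\begin{aligned} \left[ s^{a} \tau ^{b} L_{xx} \right]&= [\text{length}]^{2a-2} \, [\text{time}]^{2b} \, [L], \\ \left[ s^{c} \tau ^{d} L_{tt} \right]&= [\text{length}]^{2c} \, [\text{time}]^{2d-2} \, [L]. \end{aligned}$$

For \(a = 1, b = 1, c = 1/2\) and \(d = 3/4\), the spatial term has dimension \([\text{time}]^{2} \, [L]\) and the temporal term has dimension \([\text{length}] \, [\text{time}]^{-1/2} \, [L]\), so the two terms carry different units and respond differently to independent rescalings of space and time. For \(a = 1, b = 0, c = 0\) and \(d = 1\), both terms reduce to \([L]\).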

4.9 Scale Normalization Powers of Spatio-Temporal Interest Point Detectors

To summarize, the analysis of scale calibration in Sects. 4.1–4.7 shows that the scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) for the different spatio-temporal interest point detectors should be determined according to Table 1.

Table 1 Scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) as determined from scale calibration of the seven spatio-temporal interest point detectors \(\nabla _{(x,y) ,\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y) ,\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) that are guaranteed to lead to scale-covariant spatio-temporal scale estimates
Table 2 Relations between magnitude thresholds for seven of the spatio-temporal interest point detectors studied in this paper in terms of a common local contrast parameter C

4.10 Relating Magnitude Thresholds Between Different Spatio-Temporal Feature Detectors

By considering the scale-normalized magnitude responses (55), (71), (82), (93), (104), (117) and (128) of the above scale-covariant spatio-temporal feature detectors and applying post-normalization of these entities to make the feature responses fully scale-invariant, we can express relations between their magnitude responses in terms of the contrast C of the spatio-temporal image pattern that gave rise to the feature response according to Table 2. These relations can in turn be used for expressing coarse relations between magnitude thresholds for the different types of spatio-temporal interest operators.

5 Spatio-Temporal Interest Points Detected as Spatio-Temporal Scale-Space Extrema Over Space–Time

In this section, we shall use the scale-normalized differential entities \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L), \partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\nabla _{(x, y, t),\mathrm{norm}}^2 L\) according to (42), (58), (74), (85), (96), (109), (120) and (133) for detecting spatio-temporal interest points. The overall idea of the most basic form of such an algorithm is to simultaneously detect both spatio-temporal points \((\hat{x}, \hat{y}, \hat{t})\) and spatio-temporal scales \((\hat{s}, \hat{\tau })\) at which the scale-normalized differential entity \((\mathcal{D}_{\mathrm{norm}} L)(x, y, t;\; s, \tau )\) simultaneously assumes local extrema with respect to both space–time \((x, y, t)\) and spatio-temporal scales \((s, \tau )\).

For the use case of offline analysis of pre-recorded video using a non-causal spatio-temporal scale-space representation, such a spatio-temporal scale-space extrema algorithm could be expressed as a straightforward generalization of the corresponding spatial scale-space extrema algorithm proposed in Lindeberg [65] and summarized in more compact form in “Appendix A”. The only major conceptual differences are that: (i) the image data should be expanded over both a spatial and a temporal scale parameter instead of just a spatial scale parameter and (ii) the local comparisons for detecting local extrema should be performed over a \(3 \times 3 \times 3 \times 3 \times 3\)-neighbourhood over \((x, y, t;\; s, \tau )\) instead of over a \(3 \times 3 \times 3\)-neighbourhood over \((x, y;\; s)\).

A computational problem when expanding a video sequence over both spatial and temporal scales, however, is that the amount of data may become very large if the video data are expanded into the full 5-D spatio-temporal scale-space representation over the spatial domain \((x, y)\), the temporal domain t and the spatio-temporal scale parameters \((s, \tau )\). For this reason, we shall instead consider a time-recursive implementation that steps forward over time t and only maintains a much more compact time-recursive memory of past information, as a 4-D representation over the spatial image coordinates \((x, y)\) and the spatio-temporal scale levels \((s, \tau )\) at each time moment t. The time-recursive implementation thereby avoids expanding the internal memory over the temporal dimension and also applies directly to a time-causal situation in which the future cannot be accessed.
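To give a rough feeling for the difference in memory, the following sketch counts the number of stored values for the two representations. It uses the image size, video length and numbers of scale levels reported for the table tennis video in Sect. 5.3; the function name `n_values` and the choice of 4 bytes per value are assumptions made here for illustration only.

```python
def n_values(width, height, n_spatial_scales, n_temporal_scales, n_frames=1):
    """Number of stored values in a scale-space representation of the given extent."""
    return width * height * n_spatial_scales * n_temporal_scales * n_frames

# 320 x 240 pixels, 21 spatial and 7 temporal scale levels, 178 frames (Sect. 5.3).
full_5d = n_values(320, 240, 21, 7, n_frames=178)  # over (x, y, t; s, tau)
recursive_4d = n_values(320, 240, 21, 7)           # over (x, y; s, tau) at one time moment

bytes_per_value = 4  # assumption: single-precision floating point
print(f"5-D: {full_5d * bytes_per_value / 1e9:.2f} GB")
print(f"4-D: {recursive_4d * bytes_per_value / 1e6:.2f} MB")
```

In practice, the time-recursive representation additionally needs a small number of buffers for the nearest past frames, so the actual saving is somewhat smaller than the plain factor given by the number of frames.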

5.1 Time-Causal and Time-Recursive Algorithm for Spatio-Temporal Scale-Space Extrema Detection

Let us approximate the spatial smoothing operation in the continuous spatio-temporal scale-space representation according to (9) by smoothing with the discrete analogue of the Gaussian kernel over the spatial domain [56]

$$\begin{aligned} T(n_1, n_2;\; s) = \mathrm{e}^{-2s} I_{n_1}(s) \, I_{n_2}(s), \end{aligned}$$
(142)

which obeys the semi-group property over spatial scales

$$\begin{aligned} T(\cdot , \cdot ;\; s_1) * T(\cdot , \cdot ;\; s_2) = T(\cdot , \cdot ;\; s_1+s_2) \end{aligned}$$
(143)

and where \(I_n\) denotes the modified Bessel functions of integer order [2].
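As a minimal sketch of how the kernel in (142) and the semi-group property (143) can be checked numerically, the following code evaluates the one-dimensional factor \(\mathrm{e}^{-s} I_n(s)\) from the integral representation of the modified Bessel functions, approximated by an equidistant sum over one period. The function names, the truncation radii and the number of quadrature samples are assumptions made here; a library routine such as `scipy.special.ive` could be used instead.

```python
import math

def discrete_gaussian(s, radius, samples=512):
    """1-D discrete analogue of the Gaussian, T(n; s) = exp(-s) I_n(s), for |n| <= radius.

    Uses exp(-s) I_n(s) = (1 / 2 pi) * integral of exp(s (cos(theta) - 1)) cos(n theta)
    over one period, approximated by an equidistant sum.
    """
    thetas = [2.0 * math.pi * j / samples for j in range(samples)]
    weights = [math.exp(s * (math.cos(th) - 1.0)) for th in thetas]
    return [sum(w * math.cos(n * th) for w, th in zip(weights, thetas)) / samples
            for n in range(-radius, radius + 1)]

def convolve(a, b):
    """Full discrete convolution of two finite kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

k1 = discrete_gaussian(1.0, 12)
k2 = discrete_gaussian(2.0, 12)
k3 = discrete_gaussian(3.0, 24)

# Semi-group property: T(.; 1) * T(.; 2) should agree with T(.; 3) up to truncation.
semigroup_error = max(abs(u - v) for u, v in zip(convolve(k1, k2), k3))
print(semigroup_error)
```

The two-dimensional kernel in (142) is the separable product of two such one-dimensional kernels, so the spatial smoothing can be carried out as a row-wise convolution followed by a column-wise convolution.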

Let us additionally approximate the time-causal limit kernel, which can be described by a cascade of first-order integrators [75, Equation (15)]

$$\begin{aligned} \partial _t L(t;\; \tau _k) = \frac{1}{\mu _k} \left( L(t;\; \tau _{k-1}) - L(t;\; \tau _k) \right) , \end{aligned}$$
(144)

by a cascade of first-order recursive filters of the form [75, Equation (56)]

$$\begin{aligned} f_{out}(t) - f_{out}(t-1) = \frac{1}{1 + \mu _k} \, (f_{in}(t) - f_{out}(t-1)). \end{aligned}$$
(145)
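The recursive filter (145) can be illustrated by a few lines of code. The sketch below applies one filter stage to a unit impulse and measures the mass, the mean and the variance of the resulting impulse response; the value of the time constant and the names `recursive_filter`, `mass`, `mean` and `variance` are hypothetical choices made for this illustration.

```python
def recursive_filter(signal, mu):
    """One first-order stage: f_out(t) = f_out(t-1) + (f_in(t) - f_out(t-1)) / (1 + mu)."""
    out, state = [], 0.0  # assumption: the filter state starts at zero
    for value in signal:
        state += (value - state) / (1.0 + mu)
        out.append(state)
    return out

mu = 3.0  # hypothetical time constant, in units of frames
impulse = [1.0] + [0.0] * 399
h = recursive_filter(impulse, mu)

mass = sum(h)
mean = sum(n * v for n, v in enumerate(h))
variance = sum((n - mean) ** 2 * v for n, v in enumerate(h))
print(mass, mean, variance)
```

The impulse response is a geometric sequence with unit mass, mean \(\mu \) and variance \(\mu ^2 + \mu \). Since variances add under cascade coupling, the variance of the composed kernel can be controlled by choosing the time constants of the individual stages.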

Assuming that the input video data \(f(x, y, t)\) is acquired at spatial scale \(s_0\) and temporal scale \(\tau _0\), we can then state a basic algorithm for computing the time-causal and time-recursive spatio-temporal scale-space representation and for detecting spatio-temporal scale-space extrema of scale-normalized differential invariants from it as follows:

  1.

    Determine a set of logarithmically distributed temporal scale levels \(\tau _k\) and spatial scale levels \(s_l\) at which the algorithm is to operate by computing spatio-temporal scale-space representations at these spatio-temporal scales.

  2.

    Compute time constants \(\mu _k = (\sqrt{1 + 4 \, r^2 \, (\tau _k - \tau _{k-1})} - 1)/2\) according to Lindeberg [75, Equations (58) and (55)] for approximating the time-causal limit kernel by a finite number of recursive filters, where r denotes the frame rate and the temporal scale levels \(\tau _k\) are given in units of \([\text{ seconds }]^2\).

  3.

    Expand the first image frame \(f(x, y, 0)\) into its purely spatial scale-space representation \(L(x, y, 0;\; s, \tau _0)\) over the spatial scale levels \(s_l\) at the finest temporal scale \(\tau _0\) using the semi-group property of the discrete analogue of the Gaussian kernel

    $$\begin{aligned} L(\cdot , \cdot , 0;\; s_l, \tau _0) = T(\cdot , \cdot ;\; s_l-s_{l-1}) * L(\cdot , \cdot , 0;\; s_{l-1}, \tau _0) \end{aligned}$$
    (146)

    with initial condition \(L(x, y, 0;\; s_0, \tau _0) = f(x, y, 0)\) at the finest spatial scale \(s_0\).

  4.

    For each temporal scale level \(\tau _k\), initiate a temporal buffer for temporal scale-space smoothing at this temporal scale using the purely spatial scale-space representation of the first frame as initial condition \(B(x, y, k, l) = L(x, y, 0;\; s_l, \tau _0)\).

  5.

    For each spatial and temporal scale level, initiate a small number of temporal buffers for the nearest past frames. (This number should be equal to the maximum order of temporal differentiation.)

  6.

    Loop forwards over time t (in units of time steps):

    (a)

      Given a new image frame \(f(x, y, t)\), expand this frame into its purely spatial scale-space representation \(L(x, y, t;\; s, \tau _0)\) at the finest temporal scale \(\tau _0\)

      $$\begin{aligned} L(\cdot , \cdot , t;\; s_l, \tau _0) = T(\cdot , \cdot ;\; s_l-s_{l-1}) * L(\cdot , \cdot , t;\; s_{l-1}, \tau _0) \end{aligned}$$
      (147)

      with initial condition \(L(x, y, t;\; s_0, \tau _0) = f(x, y, t)\) at the finest spatial scale \(s_0\).

    (b)

      Loop over the temporal scale levels k in ascending order:

      i.

        For each spatio-temporal scale level \((k, l)\), perform temporal smoothing according to (with \(B(x, y, 0, l) = L(x, y, t;\; s_l, \tau _0)\))

        $$\begin{aligned} B(x, y, k, l) := B(x, y, k, l) + \frac{1}{1 + \mu _k} \, (B(x, y, k-1, l) - B(x, y, k, l)). \end{aligned}$$
        (148)
    (c)

      For all temporal and spatial scales, compute temporal derivatives using backward differences over the buffers from past frames.

    (d)

      For all temporal and spatial scales, compute the scale-normalized differential entity \((\mathcal{D}_{\mathrm{norm}} L)(x, y, t;\; s_l, \tau _k)\) at that spatio-temporal scale.

    (e)

      For all points and spatio-temporal scales \((x, y;\; s_l, \tau _k)\) for which the magnitude of the post-normalized differential entity is above a pre-defined threshold

      $$\begin{aligned} |(\mathcal{D}_{\mathrm{postnorm}} L)(x, y, t;\; s_l, \tau _k)| \ge \theta , \end{aligned}$$
      (149)

      and optionally, if using complementary thresholding [74], the sign of a complementary differential expression \(\bar{\mathcal{D}} L\) (see footnote 3) is additionally positive

      $$\begin{aligned} (\bar{\mathcal{D}} L)(x, y, t;\; s_l, \tau _k) \ge 0, \end{aligned}$$
      (150)

      determine if the point is either a positive maximum or a negative minimum in comparison with its nearest neighbours over space \((x, y)\), time t, spatial scales \(s_l\) and temporal scales \(\tau _k\). Because the detection of local extrema over time requires a future reference in the temporal direction, this comparison is not done at the most recent frame but at the nearest past frame.

      i.

        For each detected scale-space extremum, compute more accurate estimates of its spatio-temporal position \((\hat{x}, \hat{y}, \hat{t})\) and spatio-temporal scale \((\hat{s}, \hat{\tau })\) using parabolic interpolation along each dimension according to Lindeberg [77, Equation (115)]. Do also compensate the magnitude estimates by a magnitude correction factor computed for each dimension.
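The refinement in the last step can be sketched as follows for a single dimension. The code fits a parabola through the value at the detected extremum and its two nearest neighbours along that dimension and returns the sub-sample offset and the interpolated extremum value. This is the standard three-point formula and is given here as an assumption; the exact expression in Lindeberg [77, Equation (115)] may differ, and the function name `parabolic_refinement` is hypothetical.

```python
def parabolic_refinement(f_minus, f_centre, f_plus):
    """Vertex of the parabola through three equidistant samples at -1, 0 and +1.

    Returns (offset, value): the offset of the vertex relative to the centre sample
    and the interpolated value at the vertex.
    """
    curvature = f_minus - 2.0 * f_centre + f_plus
    if curvature == 0.0:
        return 0.0, f_centre  # degenerate case: the samples lie on a straight line
    offset = 0.5 * (f_minus - f_plus) / curvature
    value = f_centre - 0.125 * (f_plus - f_minus) ** 2 / curvature
    return offset, value

# Samples of f(x) = 5 - 2 (x - 0.3)^2 at x = -1, 0, 1.
samples = [5.0 - 2.0 * (x - 0.3) ** 2 for x in (-1.0, 0.0, 1.0)]
offset, value = parabolic_refinement(*samples)
print(offset, value)
```

Applied once per dimension, the offsets refine \((\hat{x}, \hat{y}, \hat{t})\) directly. For geometrically distributed scale levels, one possible choice (again an assumption) is to apply the offset to the index of the scale level, and to use the ratio between the interpolated value and the centre value as the magnitude correction factor for that dimension.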

When detecting local extrema with respect to the spatial, temporal and scale dimensions, we stop performing the comparisons at any point in spatio-temporal scale-space as soon as it can be stated that a spatio-temporal point \((x, y, t;\; s, \tau )\) is neither a local maximum nor a local minimum.
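A minimal sketch of such a comparison with early termination is given below. The scale-space data are accessed through a caller-supplied function `value_at`, which is a hypothetical interface introduced here; in an actual implementation it would index into the spatial, temporal and scale buffers. The sign conditions follow step (e) of the algorithm, where only positive maxima and negative minima are accepted.

```python
from itertools import product

def classify_extremum(value_at, point):
    """Return 'max', 'min' or None for `point`, given its 3^d - 1 nearest neighbours."""
    centre = value_at(point)
    can_be_max = can_be_min = True
    for offset in product((-1, 0, 1), repeat=len(point)):
        if not any(offset):
            continue  # skip the centre point itself
        neighbour = value_at(tuple(p + o for p, o in zip(point, offset)))
        if neighbour >= centre:
            can_be_max = False
        if neighbour <= centre:
            can_be_min = False
        if not (can_be_max or can_be_min):
            return None  # neither a maximum nor a minimum: stop comparing
    if can_be_max and centre > 0:
        return "max"
    if can_be_min and centre < 0:
        return "min"
    return None

def blob(p):
    """Toy 5-D response over (x, y, t; s, tau) with a positive peak at (2, 2, 2, 2, 2)."""
    return 1.0 - 0.1 * sum((c - 2) ** 2 for c in p)

print(classify_extremum(blob, (2, 2, 2, 2, 2)))
print(classify_extremum(blob, (1, 2, 2, 2, 2)))
```

For points that are not extrema, the loop usually terminates after a few comparisons, so the average cost per point is much lower than the 242 comparisons of a full \(3 \times 3 \times 3 \times 3 \times 3\)-neighbourhood.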

Note specifically that by performing the spatial smoothing in the outer loop over spatio-temporal scales, the computationally more demanding spatial smoothing is performed only once for each spatial scale level, whereas the computationally more efficient temporal smoothing is performed in the inner loop over all combinations of spatial and temporal scales. The algorithm is also inherently parallel over spatio-temporal scale levels and lends itself to parallel implementation over a multi-core architecture.
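The temporal part of this loop structure can be sketched for a single pixel and a single spatial scale as follows. The code computes the time constants \(\mu _k\) from the temporal scale levels and then runs the buffer update (148) on a unit impulse. It makes several simplifying assumptions that are not part of the algorithm as stated: the raw input is treated as having temporal scale zero, the additional pre-scales are omitted, the buffers start at zero, and each buffer is a scalar rather than an image. All function names are hypothetical.

```python
import math

def temporal_variances(sigma_min, c, n_levels):
    """tau_k = sigma_k**2 with sigma_k = sigma_min * c**k, in seconds squared."""
    return [(sigma_min * c ** k) ** 2 for k in range(n_levels)]

def time_constants(taus, frame_rate):
    """mu_k = (sqrt(1 + 4 r^2 (tau_k - tau_{k-1})) - 1) / 2, assuming tau = 0 for the input."""
    mus, previous = [], 0.0
    for tau in taus:
        delta = frame_rate ** 2 * (tau - previous)  # variance increment in frames^2
        mus.append((math.sqrt(1.0 + 4.0 * delta) - 1.0) / 2.0)
        previous = tau
    return mus

def update_buffers(buffers, mus, new_sample):
    """One time step: each level relaxes towards the already updated level below it."""
    below = new_sample
    for k, mu in enumerate(mus):
        buffers[k] += (below - buffers[k]) / (1.0 + mu)
        below = buffers[k]

def variance(h):
    """Variance of a non-negative impulse response over the frame index."""
    mass = sum(h)
    mean = sum(n * v for n, v in enumerate(h)) / mass
    return sum((n - mean) ** 2 * v for n, v in enumerate(h)) / mass

taus = temporal_variances(sigma_min=0.04, c=2.0, n_levels=7)  # 40 ms up to 2.56 s
mus = time_constants(taus, frame_rate=25.0)

buffers = [0.0] * len(mus)  # the only temporal state kept between frames
responses = [[] for _ in mus]
for t in range(4000):
    update_buffers(buffers, mus, 1.0 if t == 0 else 0.0)  # unit impulse at t = 0
    for k, value in enumerate(buffers):
        responses[k].append(value)

variances_in_frames = [variance(h) for h in responses]
print(variances_in_frames)
```

With these settings, the measured variances of the impulse responses agree with \(r^2 \tau _k\) in units of squared frames, which is the property that the formula for \(\mu _k\) is designed to ensure. The update touches each buffer once per frame and needs no access to earlier frames.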

5.2 Post-filtering of Spatio-Temporal Scale-Space Extrema

Additionally, to handle the different amounts of temporal delay at adjacent temporal scales, which may strongly affect the detection of local extrema over temporal scales by nearest neighbour processing of temporal scales in a time-causal context, we perform a post-filtering step of the spatio-temporal scale-space extrema as an extension of the post-filtering method proposed for temporal scale-space extrema of a purely temporal time-causal scale-space representation as detailed in Lindeberg [77, Section 7.1]:

  • To post-filter spatio-temporal scale-space extrema with respect to the nearest finer temporal scale, we introduce buffers for keeping a short-term memory of purely temporal extrema of the scale-normalized differential expression \((\mathcal{D}_{\mathrm{norm}} L)(x, y, t;\; s, \tau )\). If a point \((x, y, t;\; s, \tau )\) is a local maximum (minimum) over time t, then keep this point in the buffer of local maxima (minima) as long as the values monotonically decrease (increase) with time to later time moments. When a point has been detected as a candidate for a spatio-temporal scale-space maximum (minimum), check if there are active buffers of local maxima (minima) in a local spatial \(3 \times 3\)-neighbourhood over space at the nearest finer temporal scale. If there is such a maximum (minimum) having a higher (lower) value than the current spatio-temporal scale-space maximum (minimum), then the current point is not allowed to become a scale-space extremum.

  • To post-filter spatio-temporal scale-space extrema with respect to the nearest coarser temporal scale, we put a record of the spatio-temporal scale-space extremum in a spatial \(3 \times 3\)-neighbourhood over space at the nearest coarser temporal scale. If the original point was a scale-space maximum (minimum), then the short-term memory is kept active as long as the scale-normalized differential expression \((\mathcal{D}_{\mathrm{norm}} L)(x, y, t;\; s, \tau )\) continues to increase (decrease) over time. If the scale-normalized magnitude additionally increases above the scale-normalized magnitude of the original candidate scale-space extremum, then the original candidate for a scale-space extremum is disregarded.

With these two mechanisms running in parallel to the above time-causal and time-recursive spatio-temporal scale-space extrema detection algorithm, we can compensate for the different temporal delays at adjacent temporal scale levels, which implies that a spatio-temporal event in the world will appear as an extremum earlier in the time-causal scale-space representation at finer temporal scales in relation to the time-causal scale-space representation at coarser temporal scales.

Fig. 7

Spatio-temporal interest points computed from a video sequence in the UCF-101 dataset (Kayaking_g01_c01.avi, cropped) for different scale-normalized spatio-temporal entities and using the presented time-causal and time-recursive spatio-temporal scale-space extrema detection algorithm with the temporal scale-space smoothing performed by a time-discrete approximation of the time-causal limit kernel for \(c = 2\) and temporal scale calibration based on \(q =1\): (top left) The spatial Laplacian of the first-order temporal derivative \(\nabla _{(x, y)}^2 L_t\). (top right) The spatial Laplacian of the second-order temporal derivative \(\nabla _{(x, y)}^2 L_{tt}\). (middle row left) The determinant of the spatial Hessian of the first-order temporal derivative \(\det \mathcal{H}_{(x, y)} L_t\). (middle row right) The determinant of the spatial Hessian of the second-order temporal derivative \(\det \mathcal{H}_{(x, y)} L_{tt}\). (bottom row left) The determinant of the spatio-temporal Hessian \(\det \mathcal{H}_{(x, y, t)} L\). (bottom row right) The spatio-temporal Laplacian \(\nabla _{(x, y, t)}^2 L\). Each figure shows a snapshot at frame 90 with a threshold on the magnitude of the scale-normalized differential expression determined such that the average number of features is 50 features per frame. The radius of each circle reflects the spatial scale of the spatio-temporal scale-space extremum (image size: \(320 \times 172\) pixels of original \(320 \times 240\) pixels; frame 90 of 226 frames at 25 frames/s)

Fig. 8

Spatio-temporal interest points computed from a video sequence in the UCF-101 dataset (TableTennisShot_g10_c01.avi) for different scale-normalized spatio-temporal entities and using the presented time-causal and time-recursive spatio-temporal scale-space extrema detection algorithm with the temporal scale-space smoothing performed by a time-discrete approximation of the time-causal limit kernel for \(c = 2\) and temporal scale calibration based on \(q =1\): (Top left) The spatial Laplacian of the first-order temporal derivative \(\nabla _{(x, y)}^2 L_t\). (Top right) The spatial Laplacian of the second-order temporal derivative \(\nabla _{(x, y)}^2 L_{tt}\). (Middle row left) The determinant of the spatial Hessian of the first-order temporal derivative \(\det \mathcal{H}_{(x, y)} L_t\). (Middle row right) The determinant of the spatial Hessian of the second-order temporal derivative \(\det \mathcal{H}_{(x, y)} L_{tt}\). (Bottom row left) The determinant of the spatio-temporal Hessian \(\det \mathcal{H}_{(x, y, t)} L\). (Bottom row right) The spatio-temporal Laplacian \(\nabla _{(x, y, t)}^2 L\). Each figure shows a snapshot at frame 37 with a threshold on the magnitude of the scale-normalized differential expression determined such that the average number of features is 30 features per frame. The radius of each circle reflects the spatial scale of the spatio-temporal scale-space extremum (image size: \(320 \times 240\) pixels; frame 37 of 178 frames at 25 frames/s)

Fig. 9

Spatio-temporal interest points computed from a video sequence in the UCF-101 dataset (Archery_g01_c07.avi) for different scale-normalized spatio-temporal entities and using the presented time-causal and time-recursive spatio-temporal scale-space extrema detection algorithm with the temporal scale-space smoothing performed by a time-discrete approximation of the time-causal limit kernel for \(c = 2\) and temporal scale calibration based on \(q =1\): (Top left) The spatial Laplacian of the first-order temporal derivative \(\nabla _{(x, y)}^2 L_t\). (Top right) The spatial Laplacian of the second-order temporal derivative \(\nabla _{(x, y)}^2 L_{tt}\). (Middle row left) The determinant of the spatial Hessian of the first-order temporal derivative \(\det \mathcal{H}_{(x, y)} L_t\). (Middle row right) The determinant of the spatial Hessian of the second-order temporal derivative \(\det \mathcal{H}_{(x, y)} L_{tt}\). (Bottom row left) The determinant of the spatio-temporal Hessian \(\det \mathcal{H}_{(x, y, t)} L\). (Bottom row right) The spatio-temporal Laplacian \(\nabla _{(x, y, t)}^2 L\). Each figure shows a snapshot at frame 71 with a threshold on the magnitude of the scale-normalized differential expression determined such that the average number of features is 30 features per frame. The radius of each circle reflects the spatial scale of the spatio-temporal scale-space extremum (image size: \(320 \times 240\) pixels; frame 71 of 143 frames at 25 frames/s)

Fig. 10

Spatio-temporal interest points computed from three video sequences in the UCF-101 dataset (Kayaking_g01_c01.avi, cropped, TableTennisShot_g10_c01.avi and Archery_g01_c07.avi) for different scale-normalized spatio-temporal entities and using the presented time-causal and time-recursive spatio-temporal scale-space extrema detection algorithm with the temporal scale-space smoothing performed by a time-discrete approximation of the time-causal limit kernel for \(c = 2\) and temporal scale calibration based on \(q =1\): (Left column) The first-order temporal derivative of the determinant of the spatial Hessian \(\partial _t (\det \mathcal{H}_{(x, y)} L)\). (Right column) The second-order temporal derivative of the determinant of the spatial Hessian \(\partial _{tt} (\det \mathcal{H}_{(x, y)} L)\). Each figure shows a snapshot at a given frame with a threshold on the magnitude of the scale-normalized differential expression determined such that the average number of features is 50 features per frame for the kayak video and 30 features per frame for the table tennis and archery videos. The radius of each circle reflects the spatial scale of the spatio-temporal scale-space extremum (Image size: \(320 \times 172\) pixels of original \(320 \times 240\) pixels. Top row: frame 90 of 226 frames. Middle row: frame 37 of 178 frames. Bottom row: frame 71 of 143 frames. All videos with 25 frames/s)

Fig. 11

Spatio-temporal interest points computed from a video sequence in the KITTI dataset (tracking test video nr 16) for different scale-normalized spatio-temporal entities and using the presented time-causal and time-recursive spatio-temporal scale-space extrema detection algorithm with the temporal scale-space smoothing performed by a time-discrete approximation of the time-causal limit kernel for \(c = 2\) and temporal scale calibration based on \(q =1\): (Top left) The spatial Laplacian of the first-order temporal derivative \(\nabla _{(x, y)}^2 L_t\). (Top right) The spatial Laplacian of the second-order temporal derivative \(\nabla _{(x, y)}^2 L_{tt}\). (Middle row left) The determinant of the spatial Hessian of the first-order temporal derivative \(\det \mathcal{H}_{(x, y)} L_t\). (Middle row right) The determinant of the spatial Hessian of the second-order temporal derivative \(\det \mathcal{H}_{(x, y)} L_{tt}\). (bottom row left) The determinant of the spatio-temporal Hessian \(\det \mathcal{H}_{(x, y, t)} L\). (Bottom row right) The spatio-temporal Laplacian \(\nabla _{(x, y, t)}^2 L\). Each figure shows a snapshot at frame 77 with a threshold on the magnitude of the scale-normalized differential expression determined such that the average number of features is 200 features per frame. The radius of each circle reflects the spatial scale of the spatio-temporal scale-space extremum (image size: \(1242 \times 375\) pixels; frame 77 of 509 frames at 10 frames/s). (A temporal sampling rate of 10 frames/s is, however, too sparse for this type of local differential analysis of such fast changes in the image structures over time)

5.3 Experimental Results

Figures 7, 8, 9, 10 and 11 show the result of detecting spatio-temporal scale-space extrema in this way for three video sequences from the UCF-101 dataset [104] and one video sequence from the KITTI dataset [26]. For these experiments, we used 21 spatial scale levels between \(\sigma _\mathrm{s} = 2\) and 21 pixels and 7 temporal scale levels between \(\sigma _{\tau } = 40~\text{ ms }\) and 2.56 s with seven additional pre-scales and distribution parameter \(c = 2\) for the time-causal limit kernel. To obtain comparable numbers of features from the different types of feature detectors, we adapted the thresholds on the scale-normalized differential invariants such that the average number of features from each feature detector was 50 features per frame for the kayaking video, a lower number of 30 features per frame for the videos of the table tennis player and the archer where the background is static, and a larger number of 200 features per frame for the driving scene, where the camera is moving relative to a cluttered environment.
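The text does not state how the thresholds were adapted to reach a given average number of features per frame. One possible way, sketched below as an assumption rather than as the procedure actually used, is to collect the post-normalized magnitudes of all candidate extrema over the video and to select the threshold as the magnitude of the candidate at the rank given by the target count. The function name `threshold_for_target_rate` and the toy values are hypothetical.

```python
def threshold_for_target_rate(magnitudes, n_frames, features_per_frame):
    """Threshold theta such that about n_frames * features_per_frame candidates have |D| >= theta."""
    target = int(round(n_frames * features_per_frame))
    ranked = sorted((abs(m) for m in magnitudes), reverse=True)
    if target <= 0:
        return float("inf")  # keep nothing
    if target >= len(ranked):
        return 0.0           # keep every candidate
    return ranked[target - 1]

# Toy example: six candidate responses collected over two frames.
candidates = [0.5, -3.0, 2.0, 1.0, -0.1, 4.0]
theta = threshold_for_target_rate(candidates, n_frames=2, features_per_frame=1.5)
kept = [m for m in candidates if abs(m) >= theta]
print(theta, len(kept))
```

If several candidates share the same magnitude as the selected threshold, slightly more features than the target number are retained.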

Figure 7 and the first row of Fig. 10 show results computed from the same video of a kayaker as used for the illustrations in Figs. 1, 2, 3 and 4. As can be seen from the results, all eight feature detectors respond to regions in the video sequence where there are strong variations in image intensity over space and time. There are, however, also some qualitative differences between the results from the different spatio-temporal interest point detectors. The LGN-inspired feature detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\) and \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\) respond both to the motion patterns of the paddler and to the spatio-temporal texture corresponding to the waves on the water surface that lead to temporal flickering effects and so do the operators \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}\) and \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\). The more corner detector inspired feature detectors \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) respond more to image features where there are simultaneously rapid variations over both of the spatial dimensions and the temporal dimension.

Figure 8 and the second row of Fig. 10 show corresponding results of detecting spatio-temporal scale-space extrema from a video sequence with a table tennis player. Here, we can note that the seven spatio-temporal interest point detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}},\) \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) do all give rise to rather rich distributions of feature responses corresponding to the motion pattern of the tennis player. (The responses on the left part of the table tennis table are caused by cast shadows of the tennis player from the lamp in the ceiling). The LGN-inspired feature detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\) and \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\) do both specifically generate responses when the ball flies over the net and so do the determinant of the spatio-temporal Hessian as well as the first- and second-order temporal derivatives of the spatial Laplacian. The responses due to the spatio-temporal Laplacian are, however, less specific to specific motion events, and with numerous responses from the almost static background. Incorporating also the theoretical limitations of the spatio-temporal Laplacian described in Sect. 4.8 as well as other limitations that will be described below, we conclude that this operator should therefore not be considered as a suitable feature detector for processing video data.

Figure 9 and the third row of Fig. 10 show the results of detecting corresponding spatio-temporal scale-space extrema from a video sequence with an archer. Here, we can note that the five spatio-temporal interest point detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}},\) \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}},\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\) and \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) do in a corresponding manner give rise to rather rich distributions of feature responses corresponding to the motion pattern of the archer. For the determinant of the spatio-temporal Hessian, which operates like a three-dimensional corner detector, there are, however, many more responses along the edges of the archer than for the other four feature detectors. The four feature detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\) based on first- or second-order temporal derivatives do all generate multiple responses when the arrow hits the cloth on the wall. The response of the determinant of the spatio-temporal Hessian is, however, delayed and not as strong as for the other competing spatio-temporal events in the scene.

Figure 11 shows results of applying these spatio-temporal scale extrema detection algorithms to a scene with a car driving along a road. Because image feature detection based on space–time separable spatio-temporal receptive fields is here applied to a scene where the camera is moving relative to the environment, static spatial image features in the world that move relative to the motion direction will here lead to spatio-temporal receptive field responses.

For the six basic spatio-temporal interest point detectors that constitute combinations of differential entities used for spatial interest point detection with temporal derivatives: (i)–(ii) the spatial Laplacian applied to the first- and second-order temporal derivatives, (iii)–(iv) the determinant of the spatial Hessian applied to the first- and second-order temporal derivatives and (v)–(vi) the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix, we can note that all these spatio-temporal interest point detectors lead to feature responses for the parked cars that have qualitative similarities to the responses from applying spatial interest point detectors to a static scene, with the additional constraint that there should also be relative motions between the camera and the environment. For (vii) the genuine 3-D determinant of the spatio-temporal Hessian, the responses are on the other hand more selective, while for (viii) the spatio-temporal Laplacian, the responses are far less selective and less informative.

An alternative way of handling spatio-temporal scenes with dominant relative motions between the camera and the environment, in contrast to this use of space–time separable receptive fields for only image velocity \(v = 0\), is by exploiting the full structure of the spatio-temporal receptive field model (1), by considering spatio-temporal receptive fields with nonzero image velocities \(v \ne 0\), which can be locally adapted to the local motion direction corresponding to velocity adaptation [50, 51, 61] or alternatively performing local, regional or global image stabilization. Then, the image operations can be made truly covariant under local, regional or global Galilean image transformations [67, 71] and allow for a more explicit separation of spatio-temporal receptive field responses that correspond to more complex spatio-temporal image structures than local Galilean motions.

5.4 Covariance and Invariance Properties

From the theoretical scale selection properties of the spatial scale-normalized derivative operators according to the spatial scale selection theory in Lindeberg [65] in combination with the temporal scale selection properties of the temporal scale selection theory in Lindeberg [77] with the scale covariance of the underlying spatio-temporal derivative expressions \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) described in Lindeberg [75], it follows that these spatio-temporal interest point detectors are truly scale covariant under independent scaling transformations of the spatial and the temporal domains if the temporal smoothing is performed by either a non-causal Gaussian kernel \(g(t;\; \tau )\) over the temporal domain or the time-causal limit kernel \(\varPsi (t;\; \tau , c)\). From the general proof in Sect. 3, it follows that the selected spatio-temporal scale levels transform in a scale-covariant way under independent scaling transformations of the spatial and the temporal domains. Additionally, the post-normalized magnitude estimates from these seven spatio-temporal differential invariants are truly scale invariant.

6 Quantifying the Accuracy of the Scale Estimates and the Amounts of Temporal Delays

The theoretical analysis of the scale selection properties of the different types of spatio-temporal interest point detectors presented in Sect. 4 was performed for a non-causal Gaussian spatio-temporal concept and using model signals based on Gaussian or integrated Gaussian intensity profiles over time. While it was conceptually shown in Lindeberg [77] that important scale selection properties in terms of temporal scale-invariance transfer from a non-causal Gaussian temporal scale-space concept to the time-causal temporal scale-space concept based on the time-causal limit kernel, it is of interest to also quantify the numerical properties in terms of the spatio-temporal scale estimates and the temporal delays obtained from a truly time-causal scale-space concept and a time-causal implementation.

In this section, we will experimentally quantify: (i) how well the spatio-temporal scale selection properties transfer to a discrete implementation, specifically how accurate the spatial and temporal scale estimates are for idealized model patterns with ground truth, as well as (ii) how the different spatio-temporal feature detectors differ in their ability to respond fast with regard to time-critical applications.

Table 3 Numerical quantification of the spatio-temporal scale selection properties of four spatio-temporal interest point detectors when applied to model signals defined as time-causal Gaussian blinks of spatial extent \(\sigma _{s,0} = 8~\text{ pixels }\) and different temporal durations \(\sigma _{\tau ,0} = 40\), 80, 160, 320 and 640 ms
Table 4 Numerical quantification of the spatio-temporal scale selection properties of three spatio-temporal interest point detectors when applied to model signals defined as time-causal Gaussian onset blobs of spatial extent \(\sigma _{s,0} = 8~\text{ pixels }\) and different temporal durations \(\sigma _{\tau ,0} = 40\), 80, 160, 320 and 640 ms

6.1 Time-Causal Gaussian Blink

To quantify the transfer of the spatio-temporal scale selection properties to a time-causal spatio-temporal domain, we first generated a set of videos with time-causal Gaussian blinks obtained by filtering a discrete delta function with a discrete Gaussian kernel over the spatial domain and a discrete approximation of the time-causal limit kernel over the temporal domain. Such video sequences were generated with spatial extent \(\sigma _{s,0} = 8~\text{ pixels }\) and temporal durations of \(\sigma _{\tau ,0} = 40\), 80, 160, 320 and 640 ms at a frame rate of 50 frames/s and for distribution parameter \(c = 2\) of the time-causal limit kernel. The reason for not varying the spatial scale parameter in this experiment is that the properties of the spatial scale selection mechanism have already been sufficiently well established and tested.
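The construction of such a model signal can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments. It assumes the commonly used parameterization of the truncated time-causal limit kernel, with temporal scale levels \(\tau _k = c^{2(k-K)} \tau _K\) and discrete first-order recursive filters whose time constants \(\mu _k\) satisfy \(\mu _k^2 + \mu _k = \tau _k - \tau _{k-1}\); neither relation is stated in this section. The function names, the number of cascade levels (8) and the video dimensions are hypothetical.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function


def discrete_gaussian(sigma, radius):
    """Discrete analogue of the Gaussian, T(n; s) = exp(-s) I_n(s), s = sigma^2."""
    n = np.arange(-radius, radius + 1)
    kernel = ive(np.abs(n), sigma ** 2)
    return kernel / kernel.sum()  # renormalise after truncating the support


def limit_kernel_time_constants(sigma_tau_frames, c=2.0, levels=8):
    """Time constants mu_k (in frames) of a truncated cascade.

    Assumptions (not from this section): tau_k = c^(2(k-K)) tau_K, and a
    discrete first-order recursive filter with time constant mu adds
    variance mu^2 + mu.
    """
    tau_K = sigma_tau_frames ** 2
    k = np.arange(1, levels + 1)
    tau = tau_K * c ** (2.0 * (k - levels))
    delta_tau = np.diff(np.concatenate(([0.0], tau)))
    return (np.sqrt(1.0 + 4.0 * delta_tau) - 1.0) / 2.0


def recursive_cascade(signal, mus):
    """Apply first-order recursive filters coupled in cascade along axis 0."""
    out = np.asarray(signal, dtype=float)
    for mu in mus:
        filtered = np.zeros_like(out)
        prev = np.zeros_like(out[0])  # zero initial state: nothing before t = 0
        for t in range(out.shape[0]):
            prev = prev + (out[t] - prev) / (1.0 + mu)
            filtered[t] = prev
        out = filtered
    return out


def time_causal_gaussian_blink(sigma_s=8.0, sigma_tau_ms=160.0, fps=50.0,
                               size=65, frames=200, c=2.0):
    """Video (frames, size, size): a delta smoothed over space and time."""
    sigma_tau_frames = sigma_tau_ms * fps / 1000.0  # ms -> frames
    g = discrete_gaussian(sigma_s, radius=size // 2)
    spatial = np.outer(g, g)  # separable 2-D discrete Gaussian
    impulse = np.zeros(frames)
    impulse[0] = 1.0  # temporal delta at the first frame
    mus = limit_kernel_time_constants(sigma_tau_frames, c)
    temporal = recursive_cascade(impulse, mus)
    return temporal[:, None, None] * spatial[None, :, :]
```

Because both the spatial and the temporal kernels are normalized and the delta is separable, the video is the outer product of the two smoothed profiles. Under the stated assumptions, the temporal variance of the resulting profile equals \((\sigma _{\tau ,0} \cdot \text{fps}/1000)^2\) in units of frames squared.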

Then, we detected spatio-temporal scale-space extrema of: (i) the spatial Laplacian of the second-order temporal derivative \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\), (ii) the determinant of the spatial Hessian of the second-order temporal derivative \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\), (iii) the determinant of the spatio-temporal Hessian \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) and (iv) the second-order temporal derivative of the determinant of the spatial Hessian \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) for each one of these videos, and recorded (i) the selected spatial scale \(\hat{\sigma }_\mathrm{s}\) in units of pixels, (ii) the selected temporal scale \(\hat{\sigma }_{\tau }\) in units of milliseconds and (iii) a measure of the effective temporal delay \(\delta = \hat{t} - t_{\mathrm{max}}\) defined as the time difference between the time moment \(\hat{t}\) at which the spatio-temporal scale-space extremum is detected and the time moment \(t_{\mathrm{max}}\) at which the spatio-temporal maximum in the input function occurred. The motivation for the latter choice is that because of the time-causal model, each spatio-temporal pattern is associated with an inherent temporal delay. By compensating for this delay, the intention is that the compensated delay score should better reflect the additional amount of temporal delay caused by the time-causal feature detection method.
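As a simplified illustration of the delay measure \(\delta = \hat{t} - t_{\mathrm{max}}\), the following sketch computes it for purely temporal 1-D signals sampled at a given frame rate. It is an assumption-laden reduction: the actual detection is performed over space, time and both scale dimensions, whereas here the extremum is taken along the temporal axis only, and the function name is hypothetical.

```python
import numpy as np


def effective_delay_ms(response, input_signal, fps=50.0):
    """Delay between the peak of |response| and the peak of the input, in ms.

    Simplification: only the temporal axis is searched; the full method
    detects extrema over space, time and scales.
    """
    t_hat = int(np.argmax(np.abs(response)))   # frame of the detected extremum
    t_max = int(np.argmax(input_signal))       # frame of the input maximum
    return (t_hat - t_max) * 1000.0 / fps      # frames -> milliseconds
```

At 50 frames/s, delays measured at whole-frame positions in this way are multiples of 20 ms.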

The results of these experiments are given in Table 3 for two different settings of the temporal scale calibration parameter q. Note that (i) the spatial scale estimates \(\hat{\sigma }_\mathrm{s}\) are highly accurate and that (ii) when using \(q = 1\) the temporal scale estimates \(\hat{\sigma }_{\tau }\) do also give good estimates of the temporal duration of the underlying spatio-temporal image structures considering the coarse sampling of the temporal scale levels induced by a distribution parameter of \(c = 2\), which means that the ratio between adjacent temporal scale levels is equal to two in units of dimension \([\text{ time }]\) and which in turn limits the effective resolution of the temporal scale estimates. Additionally, the implementation differs from the presented scale selection theory in the respects that: (i) the theoretical analysis has been performed based on the non-causal Gaussian temporal scale-space model, whereas the experiments are performed using the time-causal scale-space model, (ii) the spatio-temporal scale selection theory is continuous, whereas the discrete implementation is based on the discrete analogue of the Gaussian kernel [56] over space and recursive filters over time and (iii) for shorter temporal scales, the temporal scales of the model signals are close to the inner temporal scale in the video, determined by the frame rate of 50 fps corresponding to 20 ms between adjacent frames, implying that the temporal discretization effects at shorter temporal scales become stronger.
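The effect of the coarse temporal scale sampling can be illustrated with a small sketch. It assumes, hypothetically, that the temporal scale levels start at 20 ms and that an estimate is snapped to the nearest level on a logarithmic scale; the levels actually used and any interpolation between them are not specified in this section.

```python
import numpy as np


def temporal_scale_levels_ms(sigma_min_ms=20.0, c=2.0, levels=7):
    """Geometric series of temporal scale levels (hypothetical start value)."""
    return sigma_min_ms * c ** np.arange(levels)


def snap_to_level(sigma_ms, levels_ms):
    """Nearest scale level on a logarithmic scale (no interpolation assumed)."""
    idx = np.argmin(np.abs(np.log(levels_ms) - np.log(sigma_ms)))
    return float(levels_ms[idx])


levels = temporal_scale_levels_ms()
print(levels)                        # 20, 40, 80, ..., 1280 ms
print(snap_to_level(100.0, levels))  # 80.0
```

With a ratio of two between adjacent levels, a snapped estimate can differ from the true duration by a factor of up to \(\sqrt{2}\) under these assumptions.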

For this family of model signals, the spatial Laplacian of the second-order temporal derivative \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\) and the determinant of the Hessian of the second-order temporal derivative \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\) respond very fast to the onset of a spatio-temporal Gaussian blob when using \(q = 1\). For the determinant of the spatio-temporal Hessian \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) and the second-order temporal derivative of the determinant of the spatial Hessian \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\), the temporal delays are, however, substantial when using \(q = 1\). By instead setting the temporal scale calibration parameter q to a lower value of \(q = 3/4\), the effective temporal delays can be substantially reduced, in many cases by up to nearly 50% for the determinant of the spatio-temporal Hessian \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) and the second-order temporal derivative of the determinant of the spatial Hessian \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\), at the cost of less accurate but still not completely unreasonable estimates of the temporal duration of the underlying spatio-temporal image structures.

A general conclusion that we can draw from these experiments is that the operators \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\) and \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\) that operate directly on temporal derivatives respond significantly faster compared to the operator \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) that operates on the joint space–time structure and the operator \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) that operates on temporal derivatives of a nonlinear spatial differential invariant.

6.2 Time-Causal Gaussian Onset Blob

To quantify the transfer of the spatio-temporal scale selection properties for another class of model signals, we then generated a set of videos with time-causal Gaussian onset blobs obtained by filtering the tensor product between a discrete delta function over the spatial domain and a discrete Heaviside function over the temporal domain with a discrete Gaussian kernel over the spatial domain and a discrete approximation of the time-causal limit kernel over the temporal domain. Such video sequences were generated with spatial extent \(\sigma _{s,0} = 8~\text{ pixels }\) and temporal durations of \(\sigma _{\tau ,0} = 40\), 80, 160, 320 and 640 ms at a frame rate of 50 frames/s and for distribution parameter \(c = 2\) of the time-causal limit kernel.
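Since the temporal smoothing is linear and time-invariant, and the discrete Heaviside function is the cumulative sum of the discrete delta function, the temporal profile of an onset blob is the cumulative sum of the temporal profile of the corresponding blink. The following sketch checks this relation for the temporal dimension only. The time constants are hypothetical values chosen for illustration, not those of the time-causal limit kernel used in the experiments.

```python
import numpy as np


def first_order_recursive(signal, mu):
    """Discrete first-order recursive filter with time constant mu (frames)."""
    out = np.empty(len(signal))
    prev = 0.0  # zero initial state
    for t, x in enumerate(signal):
        prev += (x - prev) / (1.0 + mu)
        out[t] = prev
    return out


def cascade(signal, mus):
    """Couple first-order recursive filters in cascade."""
    for mu in mus:
        signal = first_order_recursive(signal, mu)
    return signal


frames = 200
mus = [0.5, 1.2, 2.7, 5.8]   # hypothetical time constants, in frames

impulse = np.zeros(frames)
impulse[20] = 1.0            # discrete delta at frame 20
step = np.zeros(frames)
step[20:] = 1.0              # discrete Heaviside switching on at frame 20

blink_profile = cascade(impulse, mus)   # temporal profile of a blink
onset_profile = cascade(step, mus)      # temporal profile of an onset

# The onset profile is the running sum of the blink profile.
print(np.allclose(onset_profile, np.cumsum(blink_profile)))
```

The onset profile therefore rises monotonically from 0 towards 1, with its steepest increase where the blink profile has its maximum.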

Then, we detected spatio-temporal scale-space extrema of: (i) the spatial Laplacian of the first-order temporal derivative \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\), (ii) the determinant of the spatial Hessian of the first-order temporal derivative \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}\) and (iii) the first-order temporal derivative of the determinant of the spatial Hessian \(\partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) for each one of these videos, and recorded (i) the selected spatial scale \(\hat{\sigma }_\mathrm{s}\) in units of pixels, (ii) the selected temporal scale \(\hat{\sigma }_{\tau }\) in units of milliseconds and (iii) a measure of the effective temporal delay \(\delta = \hat{t} - t_{\mathrm{max}}\) defined as the time difference between the time moment \(\hat{t}\) at which the spatio-temporal scale-space extremum is detected and the time moment \(t_{\mathrm{max}}\) at which the spatio-temporal maximum of the spatio-temporal scale-space kernel at the same spatio-temporal scale \(\sigma _{\tau ,0}\) occurs.

The results of these experiments are given in Table 4 for two different settings of the temporal scale calibration parameter q. Note that (i) again the spatial scale estimates \(\hat{\sigma }_\mathrm{s}\) are highly accurate and that (ii) when using \(q = 1\) the temporal scale estimates \(\hat{\sigma }_{\tau }\) do also give good estimates of the temporal duration of the underlying spatio-temporal signals, again considering the coarse sampling of the temporal scale levels resulting from the distribution parameter of \(c = 2\) for the time-causal limit kernel, which in turn means that the ratio between adjacent temporal scale levels is equal to two in units of dimension \([\text{ time }]\) and which again limits the effective resolution of the temporal scale estimates. For this problem of onset detection, the temporal delays are, however, longer than for the previous problem of detecting blinks. By instead setting the temporal scale calibration parameter q to a lower value of \(q = 3/4\), the effective temporal delay can be substantially reduced, in some cases by up to nearly 50% for the spatial Laplacian of the first-order temporal derivative \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\) and the determinant of the spatial Hessian of the first-order temporal derivative \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}\), at the cost of less accurate but still not completely unreasonable estimates of the temporal duration of the underlying spatio-temporal image structures.

7 Summary and Discussion

We have presented a general theory and methodology for performing simultaneous detection of local characteristic spatial and temporal scale estimates in video data. The theory comprises both (i) feature detection performed within a non-causal spatio-temporal scale-space representation computed for offline analysis of pre-recorded video data and (ii) feature detection performed from real-time image streams where the future cannot be accessed and memory requirements call for time-recursive algorithms based on only compact buffers of what has occurred in the past.

As a theoretical foundation for spatio-temporal scale selection, we have stated general sufficiency results regarding scale-covariant spatio-temporal scale estimates and complementary invariance properties of spatio-temporal features defined from video data in which there may be independent scaling transformations of the spatial and the temporal domains. For a wide class of homogeneous spatio-temporal differential expressions, the spatio-temporal scale estimates obtained from the presented theory and methodology have been shown to obey the basic property that they adaptively follow independent local spatial and temporal scaling transformations in the video data, which constitutes a basic requirement on a spatio-temporal scale selection mechanism. In other words, if the spatial size of the image structures changes by a factor \(S_\mathrm{s}\) in the spatial domain and/or the temporal duration of the spatio-temporal image structures changes by a factor \(S_{\tau }\), then the spatial scale parameter in units of \(\sigma _\mathrm{s} = \sqrt{s}\) and the temporal scale parameter in units of \(\sigma _{\tau } = \sqrt{\tau }\) of the detected spatio-temporal image features will change by corresponding factors. Additionally, we have shown that the magnitude estimates either are automatically invariant under spatio-temporal scaling transformations or can be compensated to become so by post-normalization, depending on the specific values of the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\). These properties together imply that the presented theory and methodology obey the necessary properties to handle video data in which there may be large spatial and temporal scaling variations in the spatio-temporal image structures.
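Written out, with primes denoting quantities measured in the rescaled video (a notation assumed here for illustration), the scale covariance property reads

\[
\hat{\sigma }_\mathrm{s}' = S_\mathrm{s} \, \hat{\sigma }_\mathrm{s}, \quad
\hat{\sigma }_{\tau }' = S_{\tau } \, \hat{\sigma }_{\tau }
\quad \Longleftrightarrow \quad
\hat{s}' = S_\mathrm{s}^2 \, \hat{s}, \quad
\hat{\tau }' = S_{\tau }^2 \, \hat{\tau }.
\]

As a hypothetical numerical example, if a feature is detected at \(\hat{\sigma }_\mathrm{s} = 8\) pixels and \(\hat{\sigma }_{\tau } = 160\) ms, and the video is spatially enlarged by \(S_\mathrm{s} = 1.5\) and played at half speed, \(S_{\tau } = 2\), then the corresponding feature should be detected at \(\hat{\sigma }_\mathrm{s}' = 12\) pixels and \(\hat{\sigma }_{\tau }' = 320\) ms.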

For seven specific spatio-temporal differential invariants: (i)–(ii) the spatial Laplacian of the first- and second-order temporal derivatives, (iii)–(iv) the determinant of the spatial Hessian of the first- and second-order temporal derivatives, (v) the determinant of the spatio-temporal Hessian matrix and (vi)–(vii) the first- and second-order temporal derivatives of the determinant of the spatial Hessian, we have performed an in-depth analysis of their theoretical scale selection properties and shown how scale calibration can be performed to determine the spatial and temporal scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) such that the selected spatio-temporal scale levels reflect the spatial extent and the temporal duration of the underlying spatio-temporal features that gave rise to the feature responses. These spatio-temporal differential invariants can all be used for formulating spatio-temporal interest point detectors. Theoretically and experimentally, we have described and illustrated their properties and shown that they lead to intuitively reasonable results.

For one spatio-temporal differential expression, an attempt to define a spatio-temporal Laplacian, we have on the other hand shown that this differential expression is not scale covariant under independent rescalings of the spatial and temporal domains, which explains a previously noted poor robustness of the scale selection step in the spatio-temporal interest point detector based on the spatio-temporal Harris operator [49].

Whereas the presented spatio-temporal scale selection theory is fully continuous over space and time, we have by quantitative experiments on model signals with ground truth shown that the numerical accuracy of the spatio-temporal scale estimates carries over to a carefully designed discrete implementation, based on the discrete analogue of the Gaussian over space and a cascade of first-order recursive filters over time.

To allow for different trade-offs between the temporal response properties of time-causal spatio-temporal feature detection (shorter temporal delays) in relation to signal detection theory, which would call for detection of image structures at the same spatial and temporal scales as they occur, we have specifically introduced a parameter q to regulate the temporal scale calibration to finer temporal scales \(\hat{\tau } = q^2 \, \tau _0\) as opposed to the more common choice \(\hat{s} = s_0\) over the spatial domain. According to the presented theoretical analysis of scale selection properties in non-causal spatio-temporal scale space, the results predict that this parameter should reduce the temporal delay by a factor of q: \(\Delta t \mapsto q \, \Delta t\). Our numerical experiments with scale selection properties in time-causal spatio-temporal scale space confirm that a substantial decrease in temporal delay is obtained. The specific choice of the parameter q should be optimized with respect to the task that the spatio-temporal selection and the spatio-temporal features are to be used for and given specific requirements of the application domain.
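As a worked example of this calibration, using \(\sigma _{\tau } = \sqrt{\tau }\) and the setting \(q = 3/4\) from the experiments, with a model signal of duration \(\sigma _{\tau ,0} = 160\) ms chosen for illustration:

\[
\hat{\tau } = q^2 \, \tau _0 = \tfrac{9}{16} \, \tau _0
\quad \Longrightarrow \quad
\hat{\sigma }_{\tau } = q \, \sigma _{\tau ,0} = 0.75 \times 160~\text{ms} = 120~\text{ms},
\qquad
\Delta t \mapsto q \, \Delta t = 0.75 \, \Delta t .
\]

The non-causal analysis thus predicts a reduction of the temporal delay by 25 % for \(q = 3/4\). The reductions measured in the time-causal experiments of Sect. 6 reach nearly 50 % for some of the detectors, so the non-causal prediction should be read as indicating the direction of the effect rather than its exact size in the time-causal case.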

We have also presented an explicit algorithm for detecting spatio-temporal interest points in a time-causal and time-recursive context in which the future cannot be accessed and memory requirements call for only compact buffers to store partial records of what has occurred in the past and presented experimental results of applying this algorithm to real-world video data for the different types of spatio-temporal interest point detectors that we have studied theoretically.

Experimentally, we have shown that four of the presented spatio-temporal interest operators: (i)–(ii) the spatial Laplacian of the first- and second-order temporal derivatives and (iii)–(iv) the determinant of the Hessian of the first- and second-order temporal derivatives, lead to significantly shorter temporal delays than (v) the determinant of the spatio-temporal Hessian matrix or (vi)–(vii) the first- and second-order temporal derivatives of the determinant of the spatial Hessian.

While the experimental results in this paper have been presented solely based on a time-causal and time-recursive spatio-temporal concept, the overall methodology can also be implemented based on a non-causal Gaussian spatio-temporal scale-space concept [67]. Such an implementation would, however, require more computations and larger temporal buffers compared to using the time-causal and time-recursive receptive fields based on first-order integrators coupled in cascade that constitute the temporal smoothing model underlying the implementation reported in this work. Additionally, an ad hoc use of time-delayed truncated Gaussian kernels instead would be expected to lead to less rapid temporal responses for time-critical applications compared to the truly time-causal scale-space kernels used for the experiments in this work. For offline analysis of pre-recorded data on an architecture where computational and memory resources do not constitute a bottle-neck, such a non-causal implementation would on the other hand have the potential of computing more accurate image features, since the method could then also make use of information from the future in relation to any pre-recorded time moment, which is not permitted for these time-causal operations.

We propose that the spatio-temporal scale selection mechanism presented in this paper should be far more general than the more specific applications developed here for detecting spatio-temporal interest points. Concerning extensions of the approach, a first natural extension concerns extending the sparse spatio-temporal scale selection into dense spatio-temporal scale selection, which is addressed in a companion paper [76]. A second natural extension is to extend the current use of a space–time separable spatio-temporal scale-space representation based on spatio-temporal receptive fields (1) with image velocity zero to incorporate mechanisms for velocity-adapted spatio-temporal receptive fields with nonzero image velocities and/or image stabilization.