Once feature points have been detected, the next stage is to extract descriptors to encode the characteristics of these salient regions for classification. The descriptors can be based on various types of information, including appearance, motion and saliency, but we wish to also include our additional depth information.
RMD
The Relative Motion Descriptor (RMD) introduced by Oshin et al. (2011) has been shown to perform well on a wide range of action recognition datasets, while making use of only the saliency information obtained during interest point detection. A spatio-temporal volume \(\eta \) is created, containing the interest point detections and their strengths. The saliency content of a sub-cuboid with origin (u, v, w) and dimensions \((\hat{u},\hat{v},\hat{w})\) is defined as
$$\begin{aligned} c(u,v,w) = \displaystyle \sum _{{\varvec{\gamma }}=0}^{(\hat{u},\hat{v},\hat{w})} \eta (\left[ u,v,w\right] +{\varvec{\gamma }}). \end{aligned}$$
(18)
For efficiency this is implemented as an integral volume. The descriptor \(\delta \) of the saliency distribution at a position (u, v, w) can then be formed by performing N comparisons of the contents of two randomly offset spatio-temporal sub-cuboids, with origins at \((u,v,w)+{\varvec{\beta }}\) and \((u,v,w)+{\varvec{\beta }'}\):
$$\begin{aligned} \delta (u,v,w) = \displaystyle \sum _{n=0}^{N} \left\{ \begin{array}{ll} 2^{n} &amp; \text {if } c(\left[ u,v,w\right] + {\varvec{\beta }}_{n}) > c(\left[ u,v,w\right] + {\varvec{\beta }'}_{n}) \\ 0 &amp; \text {otherwise} \\ \end{array} \right. \end{aligned}$$
(19)
Note that the collections of offsets \({\varvec{\beta }}_{0..N}\) and \({\varvec{\beta }'}_{0..N}\) are randomly selected prior to training and then held fixed, rather than being re-drawn for each clip.
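The construction of Eqs. 18 and 19 can be sketched in a few lines of numpy. This is a minimal illustration only: the toy random saliency volume, the fixed sub-cuboid size and the small offset range are all assumptions, as the paper leaves these as free parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy saliency volume eta: interest-point strengths on a (U, V, W) grid
U, V, W = 16, 16, 12
eta = rng.random((U, V, W))

# Integral volume: after cumulative sums along every axis (plus a zero
# border), any sub-cuboid sum c(u, v, w) needs only 8 lookups (Eq. 18)
integral = np.pad(eta.cumsum(0).cumsum(1).cumsum(2),
                  ((1, 0), (1, 0), (1, 0)))

def c(u, v, w, du, dv, dw):
    """Saliency content of the sub-cuboid at (u, v, w), size (du, dv, dw)."""
    i, u1, v1, w1 = integral, u + du, v + dv, w + dw
    return (i[u1, v1, w1] - i[u, v1, w1] - i[u1, v, w1] - i[u1, v1, w]
            + i[u, v, w1] + i[u, v1, w] + i[u1, v, w] - i[u, v, w])

# Offset pairs beta, beta' are drawn once, before training, and reused
N, size, span = 8, 3, 5
beta = rng.integers(0, span, size=(N, 3))
beta_p = rng.integers(0, span, size=(N, 3))

def delta(u, v, w):
    """N binary sub-cuboid comparisons packed into one code (Eq. 19)."""
    code = 0
    for n in range(N):
        a = c(u + beta[n, 0], v + beta[n, 1], w + beta[n, 2],
              size, size, size)
        b = c(u + beta_p[n, 0], v + beta_p[n, 1], w + beta_p[n, 2],
              size, size, size)
        if a > b:
            code += 2 ** n
    return code

# Histogram of codes over all valid positions encodes the sequence
lim = span + size
codes = [delta(u, v, w) for u in range(U - lim)
         for v in range(V - lim) for w in range(W - lim)]
hist = np.bincount(codes, minlength=2 ** N)
```

With N = 8 this yields a 256-bin histogram; in practice the offsets would be stored with the trained model so that test sequences are encoded with identical comparisons.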
By extracting \(\delta \) at every location in the sequence, a histogram may be constructed, which encodes the occurrences of relative saliency distributions within the sequence, without requiring appearance data or motion estimation. Increasing the number of comparisons N leads to improved descriptiveness; however, the resulting histograms also become sparser. A common alternative is to compute several \(\delta \) histograms, each using different collections of random offsets \({\varvec{\beta }}_{0..N}\) and \({\varvec{\beta }'}_{0..N}\). The resulting histograms are then concatenated, encoding more information without sparsifying any single histogram. However, this comes at the cost of independence between bins, i.e. it may introduce some redundancy.
We propose extending the standard RMD described above, by storing the saliency measurements within a 4D integral hyper-volume, so as to encode the behaviour of the interest point distribution across the 3D scene, rather than within the image plane. The 4D integral volume can be populated by extracting the depth measurements at each detected interest point. RMD-4D descriptors can then be extracted, using comparisons between pairs of sub-hypercuboids. The resulting histogram encodes relative distributions of saliency, both temporally, and in terms of 3D spatial location. As with the original RMD, the descriptor can be applied in conjunction with any interest point detector and is not restricted to the extended interest point detectors described in Section 7 (provided that a depth video is available during descriptor extraction).
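The 4D extension can be sketched as follows, assuming a toy set of detections with per-point depth already quantised into a small number of bins (the bin count and all dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy detections: image position (u, v), frame w, quantised depth bin z,
# plus a saliency strength taken from the detector
U, V, W, Z = 12, 12, 6, 4
pts = rng.integers(0, [U, V, W, Z], size=(50, 4))
strengths = rng.random(50)

# Populate the 4D saliency hyper-volume from the detections
eta4 = np.zeros((U, V, W, Z))
for (u, v, w, z), s in zip(pts, strengths):
    eta4[u, v, w, z] += s

# 4D integral hyper-volume: cumulative sums along all four axes
integral4 = eta4
for ax in range(4):
    integral4 = integral4.cumsum(ax)
integral4 = np.pad(integral4, [(1, 0)] * 4)

def c4(origin, size):
    """Sub-hypercuboid sum via 4D inclusion-exclusion (16 corner terms)."""
    total = 0.0
    for corner in range(16):
        idx = tuple(origin[d] + (size[d] if corner >> d & 1 else 0)
                    for d in range(4))
        parity = 4 - bin(corner).count("1")   # number of lower bounds used
        total += (-1) ** parity * integral4[idx]
    return total
```

Pairs of such sub-hypercuboid sums are then compared exactly as in Eq. 19 to build the RMD-4D code.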
Bag of Visual Words
One of the most successful approaches in action recognition is to concatenate a range of local descriptors and to calculate a bag of words representation. Laptev et al. (2008) used this approach to great effect to combine HOG and HOF descriptors (denoted G and F). Both histograms are computed over a small window, storing coarsely quantized image gradient and optical flow vectors, respectively. This provides a descriptor \(\rho \) of the visual appearance and local motion around the salient point at I(u, v, w).
$$\begin{aligned} \rho (u,v,w) = \left( G\left( I\left( u,v,w\right) \right) ,F\left( I\left( u,v,w\right) \right) \right) \end{aligned}$$
(20)
When accumulating \(\rho \) over space and time, a Bag of Words (BOW) approach is employed. Clustering is performed on all \(\rho \) obtained during training, creating a codebook of distinctive descriptors. During recognition, all newly extracted descriptors are assigned to the nearest cluster center from the codebook, and the frequency of each cluster's occurrences is accumulated. In this work K-Means clustering is used, with a Euclidean distance function as in Laptev et al. (2008). To extend \(\rho \) to 4D, we include a Histogram of Oriented Depth Gradients (HODG):
$$\begin{aligned} \rho (u,v,w) = \Big ( G\big (I\left( u,v,w\right) \big ), F\big (I\left( u,v,w\right) \big ), G\big (D\left( u,v,w\right) \big ) \Big ). \end{aligned}$$
(21)
Thus the descriptor encapsulates local structural information, in addition to local appearance and motion. The bag of words approach is applied to this extended descriptor, as in the original scheme. Importantly, this descriptor is not dependent on the interest point detector, provided the HODG can be calculated from the depth stream D. By normalising these local descriptors, we are able to resolve the scale ambiguity that remained after our auto-calibration of Section 5.
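A compact sketch of this pipeline is given below: toy HOG/HOF/HODG blocks (the block sizes are placeholders, not the dimensions used by Laptev et al.), L2 normalisation of each local descriptor, a small Lloyd's k-means codebook, and nearest-center bag-of-words encoding.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy local descriptors: for each interest point, concatenate a HOG,
# a HOF and a HODG block (Eq. 21); sizes here are illustrative only
def make_rho(n):
    hog  = rng.random((n, 8))   # quantised image gradients
    hof  = rng.random((n, 9))   # quantised optical flow
    hodg = rng.random((n, 8))   # quantised depth gradients
    rho = np.hstack([hog, hof, hodg])
    # L2-normalise each descriptor to remove the depth-scale ambiguity
    return rho / np.linalg.norm(rho, axis=1, keepdims=True)

# Build a K-word codebook with a few rounds of Lloyd's k-means
train = make_rho(200)
K = 10
centers = train[rng.choice(len(train), K, replace=False)]
for _ in range(10):
    assign = np.argmin(((train[:, None] - centers) ** 2).sum(-1), axis=1)
    for k in range(K):
        if np.any(assign == k):
            centers[k] = train[assign == k].mean(0)

# Bag-of-words encoding of a new clip: nearest-center frequencies
clip = make_rho(50)
words = np.argmin(((clip[:, None] - centers) ** 2).sum(-1), axis=1)
bow = np.bincount(words, minlength=K) / len(clip)
```

The resulting `bow` vector is the fixed-length clip representation passed to the classifier.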
3D Motion Descriptors
The inclusion of structural (depth) features into the bag of words descriptor does not fully exploit the additional information in the Hollywood 3D dataset. During pre-processing we also extracted the 3D motion fields for the dataset, which can be used as a replacement for the optical flow features F. We refer to these 3D motion descriptors as “Histograms of Oriented Scene-flows” (HOS).
Given the dense 3D flow field (\(\dot{x},\,\dot{y},\,\dot{z}\)), we can extract a local 3D motion descriptor using the spherical co-ordinate system
$$\begin{aligned} {\gamma } = \text {atan}\left( \frac{\dot{y}}{\dot{x}} \right) \text { and } {\delta } = \text {atan}\left( \frac{\dot{z}}{\dot{y}} \right) , \end{aligned}$$
(22)
to describe the 3D orientation of flow vectors. Note that \( {\gamma } \) refers to the "in plane" orientation (from the viewpoint of the left camera), i.e. when \( {\gamma } \) is \(0^{\circ }\) the motion is toward the top of the image, when \( {\gamma } \) is \(90^{\circ }\) the motion is toward the right of the image, etc. In contrast, \( {\delta } \) refers to the "out of plane" orientation, i.e. how much the motion is angled away from, or towards, the camera.
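In code, the two angles of Eq. 22 might be computed as below; using arctan2 rather than plain atan is an implementation choice here, made to recover the full angular range.

```python
import numpy as np

def flow_orientations(dx, dy, dz):
    """In-plane angle gamma and out-of-plane angle delta (Eq. 22)."""
    gamma = np.arctan2(dy, dx)   # orientation within the image plane
    delta = np.arctan2(dz, dy)   # how far the motion tilts off the plane
    return gamma, delta

# Motion purely within the image plane has no out-of-plane component,
# while motion straight along the optical axis is fully out-of-plane
g_plane, d_plane = flow_orientations(np.array([0.0]), np.array([1.0]),
                                     np.array([0.0]))
g_axis, d_axis = flow_orientations(np.array([0.0]), np.array([0.0]),
                                   np.array([1.0]))
```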
We encode the distribution of 3D orientations in a region around each interest point, capturing the nature of the local motion field using a spherical histogram \( \mathbf{H}\) (see Fig. 6), which can be included in the bag of words descriptor \(\rho \). This is similar to the approach used for shape context (Belongie et al. 2002), but in the velocity domain. The contribution of each flow vector to the histogram is weighted by its magnitude. As with HOG, HOF and HODG, this histogramming discards much of the spatial information. However, some general attributes are maintained by separating the region into several neighbouring blocks, and encoding each of them independently as \( \mathbf{H}_{1\ldots n}\). These sub-region spherical histograms are then combined to form the overall descriptor \( \mathbf{H}\). It should be noted that placing histogram bins at regular angular intervals in this way leads to the bins covering unequal areas of the sphere's surface. An exaggerated version of this effect can be seen in Fig. 6a, although in practice fewer bins are used and the difference is less pronounced. In the future, regular or semi-regular sphere tessellations could be considered to mitigate this (Saff and Kuijlaars 1997).
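The subregion histogramming can be sketched as follows, using a regular 2D binning of \(({\gamma },{\delta })\) as a stand-in for the spherical histogram of Fig. 6, and a 2x2 block layout; both choices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def spherical_hist(gamma, delta, mag, n_bins=8):
    """Magnitude-weighted histogram over (gamma, delta) orientations."""
    h, _, _ = np.histogram2d(gamma, delta, bins=n_bins,
                             range=[[-np.pi, np.pi], [-np.pi, np.pi]],
                             weights=mag)
    return h.ravel()

# Toy local flow region split into 2x2 spatial blocks, each encoded
# independently as H_1..n and concatenated into the descriptor H
blocks = []
for _ in range(4):
    dx, dy, dz = rng.normal(size=(3, 30))
    gamma = np.arctan2(dy, dx)
    delta = np.arctan2(dz, dy)
    mag = np.sqrt(dx**2 + dy**2 + dz**2)
    blocks.append(spherical_hist(gamma, delta, mag))
H = np.concatenate(blocks)

# Normalised descriptor (Eq. 23) resolves the overall scale ambiguity
H_bar = H / np.linalg.norm(H)
```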
As above, we normalise the descriptors to resolve the scale ambiguity between sequences. Thus, even though motion fields are consistent only up to a similarity transform, the normalised spherical histograms,
$$\begin{aligned} {\bar{\mathbf{H}}} = \frac{ \mathbf{H}}{| \mathbf{H}|}\,, \end{aligned}$$
(23)
are consistent up to a 3D rotation, making these 3D motion descriptors much more comparable between camera configurations, and thus suitable for "in the wild" recognition. In addition, the normalised features provide invariance to the speed at which actions are performed, as only the shape of the motion field, and not its overall magnitude, is encoded. This is again very important for "in the wild" recognition, with many different actors, each of whom has their own action style.
Rotational Invariance
Next we look at including viewpoint invariance in our 3D motion features (i.e. removing the final 3D rotation ambiguity and making the descriptors completely consistent). This is one of the biggest challenges for "in the wild" action recognition: the same action viewed from different angles looks completely different. However, as we are using the underlying 3D motion field, it is possible to modify our feature encoding to be invariant to such changes.
We first encode invariance to camera roll (i.e. rotation around the z axis) by cycling the order of the subregion histograms \( \mathbf{H}_{1\ldots n}\) such that the subregion containing the largest amount of motion occurs first [similar to the orientation normalisation used in shape context (Belongie et al. 2002), SIFT (Lowe 2004), Uniform LBPs (Ojala et al. 2002) etc.]. This re-arranged, roll-invariant, descriptor is referred to as \({\bar{\mathbf{H}}^{\mathbf{r}}} \) (see Fig. 7).
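A minimal sketch of the cyclic re-ordering; it assumes the n subregions are stored in rotational order around the interest point, so that a camera roll corresponds to a cyclic shift of the subregion list.

```python
import numpy as np

def roll_invariant(subhists):
    """Cycle subregion histograms H_1..n so the one with the most
    motion energy comes first (cf. SIFT orientation normalisation)."""
    subhists = np.asarray(subhists)
    start = int(np.argmax(subhists.sum(axis=1)))
    return np.roll(subhists, -start, axis=0)

# Toy descriptor: 4 subregions of 2 bins each; any cyclic shift of
# the rows (a simulated camera roll) maps to the same canonical form
a = np.array([[0.1, 0.2], [0.9, 0.8], [0.3, 0.1], [0.2, 0.2]])
canonical = roll_invariant(a)
```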
We can follow a similar approach for the flow vectors within the subregion histograms, to make the directions of the motions, as well as their positions, rotationally invariant. If we find the strongest motion vector in \( \mathbf{H}\) and label its 3D orientation as \( \hat{\phi },\hat{\psi } \), then we can redefine our local orientations in relation to this flow vector,
$$\begin{aligned} {\gamma }^p = \text {atan}\left( \frac{\dot{y}}{\dot{x}} \right) - \hat{\phi } \text { and } {\delta }^p = \text {atan}\left( \frac{\dot{z}}{\dot{y}} \right) - \hat{\psi }. \end{aligned}$$
(24)
The resulting descriptor \( {\bar{\mathbf{H}}^{\mathbf{p}}} \), obtained when encoding \( {\gamma }^p\) and \( {\delta }^p\), makes the flow vectors robust to camera pitch (rotation around the x axis) in addition to roll, as shown in Fig. 8.
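The re-referencing of Eq. 24 might be implemented as below; wrapping the angle differences back into \((-\pi ,\pi ]\) and using arctan2 are implementation choices of this sketch.

```python
import numpy as np

def relative_orientations(dx, dy, dz):
    """Angles of each flow vector relative to the region's strongest
    vector (Eq. 24), for robustness to camera pitch as well as roll."""
    gamma = np.arctan2(dy, dx)
    delta = np.arctan2(dz, dy)
    k = int(np.argmax(dx**2 + dy**2 + dz**2))   # dominant flow vector
    phi_hat, psi_hat = gamma[k], delta[k]
    # wrap the angle differences back into (-pi, pi]
    wrap = lambda a: np.angle(np.exp(1j * a))
    return wrap(gamma - phi_hat), wrap(delta - psi_hat)

# The dominant (third) vector maps to zero relative orientation
dx = np.array([1.0, 0.0, 3.0])
dy = np.array([0.0, 1.0, 0.0])
dz = np.array([0.0, 0.0, 0.0])
g, d = relative_orientations(dx, dy, dz)
```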
However, due to the separation of \( {\gamma } \) and \( {\delta } \), our descriptors are still not invariant to camera pans (rotation around the y axis, which at 90\(^{\circ }\) causes the \( {\gamma } \) orientation to become the \( {\delta } \) orientation). In addition, normalising based on the maximum flow vector is sensitive to outliers in the flow field. As such, our final approach is to perform PCA on the local region of the motion field, extracting three new basis vectors \(\dot{x}',\dot{y}',\dot{z}'\) (the eigenvectors of the motion field covariance). Computing orientation using these basis vectors,
$$\begin{aligned} {\gamma } '= \text {atan}\left( \frac{\dot{y}'}{\dot{x}'} \right) \text { and } {\delta } ' = \text {atan}\left( \frac{\dot{z}'}{\dot{y}'} \right) , \end{aligned}$$
(25)
leads to a descriptor \( {\bar{\mathbf{H}}'}\) which is invariant to all 3 types of camera viewpoint change, and also robust to outlier motions. See Fig. 9 for an illustration.
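The PCA step might be sketched as follows: the eigenvectors of the local motion-field covariance supply the basis \((\dot{x}',\dot{y}',\dot{z}')\), and orientations are measured in that basis. Mean-centering the field and ordering the axes by descending variance are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

def pca_orientations(flow):
    """Orientations of a local flow field in its own PCA basis (Eq. 25)."""
    centered = flow - flow.mean(axis=0)
    # Eigenvectors of the 3x3 motion-field covariance; eigh returns
    # them in ascending eigenvalue order, so reverse for x', y', z'
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    proj = centered @ vecs[:, ::-1]     # columns: most to least variance
    x, y, z = proj[:, 0], proj[:, 1], proj[:, 2]
    return np.arctan2(y, x), np.arctan2(z, y)

# Anisotropic toy flow field: variance differs along each direction,
# so the recovered basis is well defined (up to eigenvector sign flips,
# which a full implementation would fix with a sign convention)
flow = rng.normal(size=(40, 3)) * [3.0, 2.0, 1.0]
g, d = pca_orientations(flow)
```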