Repetition Estimation
Abstract
Visual repetition is ubiquitous in our world. It appears in human activity (sports, cooking), animal behavior (a bee’s waggle dance), natural phenomena (leaves in the wind) and in urban environments (flashing lights). Estimating visual repetition from realistic video is challenging as periodic motion is rarely perfectly static and stationary. To better deal with realistic video, we alleviate the static and stationary assumptions often made by existing work. Our spatiotemporal filtering approach, established on the theory of periodic motion, effectively handles a wide variety of appearances and requires no learning. Starting from motion in 3D we derive three periodic motion types by decomposition of the motion field into its fundamental components. In addition, three temporal motion continuities emerge from the field’s temporal dynamics. For the 2D perception of 3D motion we consider the viewpoint relative to the motion; what follows are 18 cases of recurrent motion perception. To estimate repetition under all circumstances, our theory implies constructing a mixture of differential motion maps: \(\mathbf{F}\), \(\nabla\mathbf{F}\), \(\nabla\cdot\mathbf{F}\) and \(\nabla\times\mathbf{F}\). We temporally convolve the motion maps with wavelet filters to estimate repetitive dynamics. Our method is able to spatially segment repetitive motion directly from the temporal filter responses densely computed over the motion maps. For experimental verification of our claims, we use our novel dataset for repetition estimation, better reflecting reality with non-static and non-stationary repetitive motion. On the task of repetition counting, we obtain favorable results compared to a deep learning alternative.
Keywords
Video analysis · Motion · Periodicity · Repetition counting · Wavelet transform · Motion segmentation
1 Introduction
Visual repetitive motion is common in our everyday experience as it appears in sports, music-making, cooking and other daily activities. In natural scenes, it appears as leaves in the wind, waves in the sea or the drumming of a woodpecker, whereas our encounters of visual repetition in urban environments include blinking lights, the spinning of wind turbines or a waving pedestrian. In this work we reconsider the theory of periodic motion and propose a method for estimating repetition in real-world video.
To understand the origin and appearance of visual repetition we rethink the theory of periodic motion inspired by existing work (Pogalin et al. 2008; Davis et al. 2000). We follow a differential geometric approach, starting from the divergence, gradient and curl components of the 3D flow field. From the decomposition of the motion field and its temporal dynamics, we derive three motion types and three motion continuities to arrive at \(3\times 3\) fundamental cases of intrinsic periodicity in 3D. For the 2D perception of 3D intrinsic periodicity, the observer’s viewpoint can be somewhere in the continuous range between two viewpoint extremes. Finally, we arrive at 18 fundamental cases for the 2D perception of 3D intrinsic periodic motion.
Estimating repetition in practice remains challenging. First and foremost, repetition appears in many forms due to its diversity in motion types and motion continuities (Fig. 1). Sources of variation in motion appearance include the action class, origin of motion and the observer’s viewpoint. Moreover, the motion appearance is often non-static due to a moving camera or as the observed phenomenon develops over time. In practice, repetitions are rarely perfectly periodic but rather non-stationary. Existing literature (Levy and Wolf 2015; Pogalin et al. 2008) generally assumes static and stationary repetitive motion. As reality is more complex, we here address the challenges involved with non-static and non-stationary repetition by proposing a novel method for estimating repetition in real-world video.
To deal with the diverse and possibly non-static motion appearance in realistic video, our theory implies representing the video with a mixture of first-order differential motion maps. For non-stationary temporal dynamics the fixed-period Fourier transform (Cutler and Davis 2000; Pogalin et al. 2008) is not suitable. Instead, we handle complex temporal dynamics by decomposing the motion into a time-frequency distribution using the continuous wavelet transform. To increase robustness and to be able to handle camera motion, we combine the wavelet power of all motion representations. Finally, we alleviate the need for explicit tracking (Pogalin et al. 2008) or motion segmentation (Runia et al. 2018) by segmenting repetitive motion directly from the wavelet power. On the task of repetition counting, our method performs well on an existing video dataset and our novel QUVA Repetition dataset, which emphasizes more realistic video.
We rethink the theory of periodic motion to arrive at a classification of periodic motion. Starting from the 3D motion field induced by an object periodically moving through space, we decompose the motion into three elementary components: divergence, curl and shear. From the motion field decomposition and the field’s temporal dynamics, we identify 9 fundamental cases of periodic motion in 3D. For the 2D perception of 3D periodic motion we consider the observer’s viewpoint relative to the motion. Two viewpoint extremes are identified, from which 18 cases of 2D repetitive appearance emerge.
Our spatiotemporal filtering method addresses the wide variety of repetitive appearances and effectively handles non-stationary motion. Specifically, the diversity in motion appearance is handled by representing video as six differential motion maps that emerge from the theory. To identify the repetitive dynamics in the possibly non-stationary video, we use the continuous wavelet transform to produce a time-frequency distribution densely over the video. Directly from the wavelet responses we localize the repetitive motion and determine the repetitive contents.
Extending beyond the video dataset of Levy and Wolf (2015), we propose a new dataset for repetition estimation that is more realistic and challenging in terms of non-static and non-stationary videos. To encourage further research on video repetition, we will make the dataset and source code available for download.
2 Related Work
2.1 Repetition Estimation
The seminal work of Cutler and Davis (2000) uses normalized autocorrelation to obtain similarity matrices and proceeds by repetition estimation using Fourier analysis. Pogalin et al. (2008) estimate the frequency of motion in video by tracking an object, performing principal component analysis over the tracked regions and also employing the Fourier-based periodogram. From the spectral decomposition, the dominant frequencies can be identified by peak detection and non-trivial separation of fundamental and harmonic frequencies. While Fourier-based methods provide a good estimate of strongly periodic motion, they are neither suitable nor intended to deal with more realistic non-stationary repetition; see the accelerating rower in Fig. 2.
While strongly periodic motion has received serious attention, less effort has been devoted to non-stationary repetition in video. Briassouli and Ahuja (2007) use the Short-Time Fourier Transform for estimating the time-varying spectral components in video to distinguish multiple periodically moving objects. The filtering-based approach of Burghouts and Geusebroek (2006) uses a time-causal filter bank from Koenderink (1988) to detect quasi-periodic motion in video. Their method works online and shows good results when filter response frequencies are tuned correctly. In this work, we employ the continuous wavelet transform over multiple temporal scales to estimate repetition in complex video.
The deep learning method of Levy and Wolf (2015) is different from all other work but resembles our work in counting-based evaluation over a large video dataset. The general idea is to train a convolutional neural network for predicting the motion period in short video clips. As training data is not available, the network is optimized on synthetic video sequences in which moving squares exhibit periodic motion of four motion types from Pogalin et al. (2008). At test time, the method takes a stack of video frames, performs explicit motion localization to obtain a region of interest and then classifies the motion period by forwarding the frame crops through the network. The system is evaluated on the task of repetition counting and shows near-perfect performance on their YTSegments dataset. The 100 videos are a good initial benchmark but as the majority of videos have a static viewpoint and exhibit stationary periodic motion, we propose a new dataset. Our dataset better reflects reality by including more non-static and non-stationary examples.
Increased video complexity in terms of motion appearance, scene complexity and camera motion demands intricate spatiotemporal localization of salient motion. While many methods for periodic motion analysis incorporate some form of tracking or motion segmentation (Polana and Nelson 1997; Pogalin et al. 2008; Levy and Wolf 2015), few approaches specifically address the challenge of repetitive motion segmentation. Goldenberg et al. (2005) estimate the repetitive foreground motion to leverage its center-of-mass trajectory for classifying human behavior. More closely related is the work of Lindeberg (2017) in which scale selection over space and time leads to an effective temporal scale map. Inspired by this, we perform spatial segmentation of repetitive motion directly from the spectral power maps obtained through the continuous wavelet transform. This is appealing, as it connects localization to the temporal dynamics rather than relying on decoupled localization by state-of-the-art motion segmentation, e.g. Tokmakov et al. (2017).
2.2 Categorization of Motion Types
In real-world video, periodic motion emerges in a wide variety of appearances (see Fig. 3). We reconsider the theory of periodic motion by proposing a classification of fundamental periodic motion types starting from the 3D motion field tied to a moving object. Using first-order differential analysis, we decompose the motion field into its primitive components. The work of Koenderink and van Doorn (1975) delivered inspiration for our theoretical derivation of repetitive motion types from the flow field. Related is the Helmholtz–Hodge decomposition (Abraham et al. 1988), which, like the decomposition into the eigenvalues of the flow field’s Jacobian matrix, finds use in flow field topology for fluid dynamics and electrodynamics. Although our work is similar in its differential decomposition of the motion field, we use it to reach a novel classification of periodic motion patterns. We use these insights to establish our repetition estimation method.
Although not directly related to our work, first-order differential geometric motion representations have been used extensively as spatiotemporal video descriptors. Klaser et al. (2008) propose a spatial multi-scale motion descriptor based on first-order differential motion and use integral videos for efficient computation. Along similar lines, MoSIFT (Chen and Hauptmann 2009) uses spatial interest points and enforces sufficient temporal dynamics to eliminate candidate points. In terms of motion descriptors, our work bears resemblance to the Divergence–Curl–Shear descriptor proposed by Jain et al. (2013). Their favorable action classification results associated with the differential-based descriptor support our findings for periodic motion estimation.
3 Repetitive Motion
Visual repetition is defined as a reoccurring pattern over space or time in the 3D world. In this work, we focus on temporally repetitive motion rather than spatially repetitive patterns such as a texture. Consequently, the 3D motion field induced by a moving object is the right starting point for our theoretical analysis.
3.1 Motion Field Decomposition
3.2 Intrinsic Periodic Motion in 3D
3.2.1 Motion Types
3.2.2 Motion Continuities
3.2.3 Categorization of Periodic Motion
The intrinsic periodicity in 3D does not cover all perceived recurrence in an image sequence. For the trivial cases of constant translation and constant expansion in 3D, the perceived recurrence will appear when a repetitive chain of objects (conveyor) or a repetitive appearance (texture on a car tire) on the object is aligned with the motion. In such cases, the recurrence will also be observed in the field of view. For constant rotation, the restriction is that the appearance cannot be constant over the surface, as otherwise no motion, let alone recurrent motion, would be observed. In the rotational case, any rotational symmetry in appearance will induce a higher order recurrence as a multiplication of the symmetry and the rotational speed.
For the categorization of periodic motion, the nine cases organize into a \(3\times 3\) Cartesian table of basic motion type times motion continuity, see Fig. 5a. Corresponding examples of these nine cases are given in Fig. 5b. This is the list of fundamental cases, where a mixture of types is permitted. In practice, some cases are ubiquitous, while for others it is hard to find examples at all.
3.3 Visual Recurrence in 2D
3.4 Non-static Repetition
Relative motion between the moving object and the observer adds another dimension of complexity. In particular, with recurrent motion (1) the camera may move because it is mounted on the moving object itself, (2) the camera may be following the target of interest, or (3) the camera may be in motion independent of the motion of the object. For the first two cases, the camera motion reflects the periodic dynamics of the object’s motion. The flow field may lie outside the object, but otherwise it displays a complementary repetitive pattern.
In the first case, the periodically moving camera will produce a global repetitive flow field as opposed to local repetitive flow when the object itself is moving. The third case particularly demands the removal of the camera motion prior to the repetitive motion analysis. In practice, this situation occurs frequently. Therefore, particular attention needs to be paid to camera motion independent of the target’s motion. When the viewpoint changes from frontal to side view due to camera motion, the analysis will be inevitably hard. Figure 6 illustrates the dramatic changes in the flow field when the camera changes from one extreme viewpoint (side) to the other (frontal), or vice versa. Our method handles such appearance changes by simultaneously using multiple motion representations and summing temporal filter responses.
3.5 Non-stationary Repetition
4 Method
In this section we present our method for estimating repetition in video. The method takes as input a sequence of RGB frames and outputs a frequency distribution densely computed over space and time. Subsequently, the spectral power distribution, which we obtain from the continuous wavelet transform, is used for repetition counting, motion segmentation or other frequency-based measurements. We target the general case in which moving objects may exhibit non-stationary periodicity or have a non-static appearance due to camera motion or repetition superposed on translation. Our method, summarized in Fig. 8, comprises motion estimation and two consecutive filtering steps: first we spatially filter the motion fields to arrive at first-order differential geometric motion maps, and then we determine the video’s repetitive contents by applying the continuous wavelet transform densely over the motion maps. Task-dependent post-processing steps may give the desired output; here we focus on repetition counting as it enables straightforward evaluation of our method in the presence of non-stationary repetitions.
4.1 Differential Geometric Motion Maps
Figure 9 displays an example frame with four of the six motion maps (the other two are omitted here). The six motion maps represent the video for each moment in time and address the diversity in repetitive motion. In our experiments, we will evaluate the individual and joint representative power associated with the motion maps. A priori it is unknown which motion type we are dealing with; we return to this later by combining the temporal responses of all motion maps.
4.2 Dense Temporal Filtering
So far we have only considered spatial filtering to obtain the motion maps for a moment in time. Here we include time and proceed by temporal filtering of the motion maps to estimate the video’s repetitive motion. This is where the current method diverges from our previous work. In (Runia et al. 2018), we relied on the same motion maps but performed max-pooling over the foreground motion segmentation obtained separately from Papazoglou and Ferrari (2013). The max-pooled values over time construct a one-dimensional signal acting as a surrogate for the dynamics in a particular motion map. Spectral decomposition for each of the signals led to six (possibly contrasting) time-frequency estimates. To select the most discriminative representation, we employed a self-quality assessment based on the spectral power in the signals.
We found two problems with this approach: (1) the decoupled motion segmentation may not be optimal for estimating repetitive motion dynamics, and (2) max-pooling over the foreground motion mask discards most information and is unable to deal with multiple moving parts. We here address these problems by dense temporal filtering over all locations in the motion map instead of operating on the max-pooled signals. Spatially dense estimation of the local spectral power enables us to localize regions likely containing repetitive motion. The temporal filtering can be implemented in several ways, for example, as Fourier transform through temporal convolution. To handle non-stationary video dynamics, we perform the continuous wavelet transform by convolution to obtain a time-varying spectral decomposition.
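As a minimal sketch of this temporal filtering step, the per-location decomposition can be implemented as a frequency-domain Morlet wavelet convolution in the style of Torrence and Compo (1998). The function below is illustrative rather than the paper's exact implementation (the name, defaults and normalization are our assumptions); it operates on a single one-dimensional signal, which in the dense setting would be the time series at one motion-map location.

```python
import numpy as np

def morlet_power(signal, scales, dt=1.0, omega0=6.0):
    """Time-frequency power via the continuous wavelet transform.

    Convolves a 1D signal with complex Morlet wavelets at the given
    temporal scales, implemented in the frequency domain.
    Returns |W|^2 of shape (len(scales), len(signal)).
    """
    n = len(signal)
    sig_hat = np.fft.fft(signal - np.mean(signal))
    omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)   # angular frequencies
    power = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Morlet wavelet in the frequency domain (analytic: positive
        # frequencies only), scaled to temporal scale s.
        psi_hat = (np.pi ** -0.25) * np.sqrt(2 * np.pi * s / dt) \
                  * np.exp(-0.5 * (s * omega - omega0) ** 2) * (omega > 0)
        power[i] = np.abs(np.fft.ifft(sig_hat * psi_hat)) ** 2
    return power
```

Applied per pixel of a motion map, this yields the dense time-frequency distribution; the peak over scales at each time step follows a time-varying frequency, which a single global periodogram cannot represent.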
4.3 Continuous Wavelet Transform
4.4 Combining Spectral Power Maps
We compute the time-localized frequency estimates by temporal convolution densely over the six individual motion representations. For each representation this produces a time-varying maximum power map and scale map. The power map contains the spatial distribution of maximum wavelet power over all temporal scales; the scale map holds the temporal scales corresponding to the wavelets with maximum power. What remains is combining the wavelet responses from all motion representations.
Rather than selecting the single most discriminative representation (Runia et al. 2018), we combine the spectral power maps by summation on a per-frame basis. To illustrate this, we visualize four (out of six) individual power maps and their combined response in Fig. 11. Summation of the spectral power maps has a number of attractive properties. Most importantly, the motion maps with the strongest repetitive appearance will contribute most to the final power map, whereas weakly periodic motion maps will have a negligible contribution. This effectively serves as a dynamic selection of the most discriminative motion representation. Moreover, as the spectral power is time-localized, the relative contribution per motion representation evolves over time. This is appealing because motion appearance can be non-static in realistic video due to camera motion or gradual change in motion type.
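The combination step can be sketched as follows, assuming the per-representation wavelet power at one time step has been stacked into a single array; the array shapes and the convention for deriving the effective scale map are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def combine_power_maps(power_stack, scales):
    """Combine per-representation wavelet power into one map.

    power_stack: array (R, S, H, W) of wavelet power for R motion
    representations over S temporal scales, at one time step.
    Returns the summed power map (H, W) and the effective scale map
    (H, W) holding the scale with maximum combined power.
    """
    # Per-pixel maximum power over temporal scales, per representation.
    max_power = power_stack.max(axis=1)          # (R, H, W)
    combined = max_power.sum(axis=0)             # (H, W), summed over reps
    # Effective scale: the scale with the largest total power across reps.
    total_per_scale = power_stack.sum(axis=0)    # (S, H, W)
    scale_map = scales[np.argmax(total_per_scale, axis=0)]
    return combined, scale_map
```

Because the summation is per frame, a representation that dominates early in the video can hand over to another one later, which is exactly the behavior needed under viewpoint change.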
4.5 Spatial Segmentation
The combined wavelet power map gives a time-varying spatial distribution of spectral power over all motion representations, whereas the corresponding effective scale map relates to the temporal scale with maximum spectral power. We propose to use the spatial distribution of spectral power for segmentation of the regions with strongest repetitive appearance. Subsequently, we use the scale map to infer the dominant temporal scale (related to the motion frequency) over the localized region.
The spatial segmentation of repetitive motion is performed in a straightforward manner. For a moment in time, we simply mean-threshold the combined wavelet power map to obtain a binary segmentation mask associated with regions containing significant spectral power. More precisely, the wavelet-based motion segmentation will attend to regions in which the maximum spectral power over all temporal scales is significant. Figure 9 (bottom row) illustrates this by displaying the combined power map and corresponding scale map. In general, performing motion segmentation directly from the spatial distribution of spectral power is appealing as it couples the localization and subsequent frequency measurements. Our experiments will verify this claim and compare them with specialized motion segmentation methods. We would like to mention that our segmentation method leaves the door open for multiple repetitively moving objects whereas most state-of-the-art segmentation methods assume a single dominant foreground motion (Tokmakov et al. 2017).
4.6 Repetition Counting
To obtain an instantaneous frequency estimate of the salient motion, we median-pool the temporal wavelet scales over the segmentation mask. Median-pooling is preferred over mean-pooling as it is relatively robust to outliers and will produce a better estimate of the dominant frequency. The corresponding temporal wavelet scale is then converted to an instantaneous frequency using Eq. 18. For a moment in time, this delivers a frequency estimate for the salient repetitive motion. Counting the number of repetitions then follows from temporal integration of the consecutive frequency measurements, with the temporal sampling spacing inferred from the video’s frame rate.
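The segmentation and counting steps can be sketched together as follows. This is a simplified version under stated assumptions: the scale-to-frequency conversion shown is the standard Morlet relation from Torrence and Compo (1998), which we take to correspond to Eq. 18, and the median filtering over time described in the implementation details is omitted for brevity.

```python
import numpy as np

def count_repetitions(power_maps, scale_maps, dt, omega0=6.0):
    """Count repetitions from per-frame wavelet power and scale maps.

    power_maps, scale_maps: arrays (T, H, W); dt: time between frames.
    Per frame: mean-threshold the power map into a foreground mask,
    median-pool the temporal scales over the mask, convert the pooled
    scale to an instantaneous frequency, then integrate over time.
    """
    freqs = []
    for P, S in zip(power_maps, scale_maps):
        mask = P > P.mean()                       # spatial segmentation
        s = np.median(S[mask])                    # robust dominant scale
        # Morlet scale-to-frequency conversion (Torrence & Compo).
        freqs.append((omega0 + np.sqrt(2 + omega0 ** 2)) / (4 * np.pi * s))
    return float(np.sum(freqs) * dt)              # integrate f(t) dt
```

For a stationary signal the pooled scale is constant, so the count reduces to frequency times duration; for non-stationary motion the per-frame frequencies vary and the integral tracks the actual number of cycles.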
We emphasize our method’s ability to count the number of cycles in non-stationary video. For a stationary periodic signal, the median-pooled temporal scales will be constant over time, while non-stationary motion produces time-varying frequency estimates. Although the videos considered in our experiments are temporally segmented, the time-localized wavelet responses could also be used for temporal localization of repetitive actions. Moreover, although the current approach performs median-pooling over the motion segmentation mask, the spatial distribution of wavelet power also enables the identification of multiple periodically moving parts.
5 Experiments
We perform experiments to show the effectiveness of our method on the task of counting repetitions in video. Prior to evaluating our full method, we demonstrate the strength of the continuous wavelet transform for estimating repetition in non-stationary signals, show the need for diversified motion maps to deal with the wide variety in motion appearance, and investigate our method’s ability to handle dynamic viewpoints. Before discussing the actual experiments, we introduce the video datasets for testing, give implementation details and specify our counting evaluation metrics.
5.1 Datasets and Evaluation
The main experiments consider two video datasets: the existing YTSegments and our new QUVA Repetition dataset, both collected for the purpose of evaluating repetition estimation in video. The two real-world datasets contain only a single dominant repetitive motion for ease of evaluation. Additionally, we perform a controlled experiment on viewpoint invariance with synthetic video that we generated through 3D modeling in Blender.
YTSegments Dataset For the purpose of evaluating repetition counting in video, Levy and Wolf (2015) introduced a new video benchmark. The 100 videos downloaded from YouTube are purely for evaluation purposes, as training of the network is performed with synthesized videos. A wide range of actions appears in the videos: several sports, cooking and animal movement. Each video is temporally segmented such that only the repetitive action is covered. The clips are annotated with a total repetition count. While the dataset serves as a good initial benchmark for repetition estimation, it is limited in terms of cycle length variation (non-stationarity), motion appearances and camera motion. As our goal is to evaluate our method on more realistic video, we introduce a new video dataset that is more challenging in terms of non-stationarity, motion appearance, camera motion and background clutter.
QUVA Repetition Dataset In Runia et al. (2018) we introduced a more realistic video benchmark for repetition estimation. The QUVA Repetition dataset consists of 100 videos displaying a wide variety of repetitive video dynamics, including various kinds of sports, music-making, cooking, grooming, construction and animal behavior. The videos are collected from YouTube with emphasis on creating a diverse collection suitable for evaluating our method’s ability to deal with non-stationary motion, camera motion and significant evolution of motion appearance over the course of a video.
Table 1 Dataset statistics of YTSegments and QUVA Repetition

| | YTSegments | QUVA Repetition |
| --- | --- | --- |
| Number of videos | 100 | 100 |
| Duration min/max (s) | 2.1/68.9 | 2.5/64.2 |
| Duration avg. (s) | \(14.9 \pm 9.8\) | \(17.6 \pm 13.3\) |
| Count avg. ± SD | \(10.8 \pm 6.5\) | \(12.5 \pm 10.4\) |
| Count min/max | 4/51 | 4/63 |
| Cycle length variation | 0.22 | 0.36 |
| Camera motion | 21 | 53 |
| Superposed translation | 7 | 27 |
The characteristics of both datasets are reported in Table 1. It is apparent that our videos have more variability in cycle length, motion appearance, camera motion and background clutter. The increased difficulty in both appearance and temporal dynamics gives a more realistic benchmark for repetition estimation in the wild. Figure 12 displays a number of examples from both datasets. The project page^{1} contains the dataset download link and several video previews.
5.2 Implementation Details
Motion Segmentation Complex videos with background clutter or camera motion demand segmentation of the foreground motion prior to further analysis. Although our method directly performs localization from the densely computed wavelet power, we also evaluate with state-of-the-art motion segmentation methods. The fast video segmentation method of Papazoglou and Ferrari (2013) is chosen as the classical approach and was also used in Runia et al. (2018). This approach separates foreground objects from the background in a video by combining motion boundaries followed by segmentation refinement. We also evaluate the more recent deep learning based method of Tokmakov et al. (2017). The method trains a two-stream convolutional neural network with a long short-term memory (LSTM) module to capture the evolution over time. The network parameters are optimized using the large FlyingThings3D dataset (Mayer et al. 2016). The motion masks from the trained network are refined with a conditional random field. For both methods we use the official implementations made available by the authors. While both methods generally attain excellent segmentations, we observed that segmentation fails completely for some of the more difficult frames (either all or no pixels selected as foreground). To remedy incorrect segmentation masks, we reuse the segmentation of the previous frame if the fraction of foreground pixels is less than 1% of the entire frame.
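The 1% fallback heuristic can be sketched as follows; the function name and structure are illustrative, not taken from the released code.

```python
import numpy as np

def stabilize_masks(masks, min_fg_fraction=0.01):
    """Fallback rule for failed motion segmentations.

    masks: list of binary (H, W) arrays, one per frame. If a frame's
    foreground covers less than `min_fg_fraction` of the frame, the
    previous frame's mask is reused instead (the paper's 1% heuristic).
    """
    fixed = [masks[0]]
    for m in masks[1:]:
        if m.mean() < min_fg_fraction:   # near-empty mask: segmentation failed
            m = fixed[-1]                # reuse previous frame's mask
        fixed.append(m)
    return fixed
```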
Differential Geometric Motion Maps To compute the motion maps we perform spatial filtering by first-order Gaussian kernels. The filtering is implemented in PyTorch and runs in large batches on the GPU to accelerate computation. Spatial convolution is performed with \(\sigma = 4\) for all experiments. We also evaluated \(\sigma = \{2,8,16\}\) but found only minor variation in performance. In practice, a combination of multiple spatial scales may produce the best results. Once the spatial first-order derivatives \(\nabla _x F_x, \nabla _y F_x, \nabla _x F_y\) and \(\nabla _y F_y\) have been obtained through convolution, the differential motion maps are computed as specified in Sect. 4.1.
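A minimal sketch of this computation with first-order Gaussian filtering is given below. We use SciPy on the CPU for clarity, whereas the paper's implementation runs batched in PyTorch on the GPU; the divergence, curl and shear combinations shown are one common convention, and the exact set of six maps follows Sect. 4.1.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def differential_motion_maps(Fx, Fy, sigma=4.0):
    """Differential motion maps from a 2D flow field.

    Fx, Fy: horizontal and vertical flow components (H x W arrays).
    Returns divergence, curl and two shear components, each obtained
    from first-order Gaussian-derivative filtering at scale `sigma`.
    """
    # First-order Gaussian derivatives of each flow component.
    dFx_dx = gaussian_filter(Fx, sigma, order=(0, 1))  # d/dx (axis 1)
    dFx_dy = gaussian_filter(Fx, sigma, order=(1, 0))  # d/dy (axis 0)
    dFy_dx = gaussian_filter(Fy, sigma, order=(0, 1))
    dFy_dy = gaussian_filter(Fy, sigma, order=(1, 0))

    div = dFx_dx + dFy_dy      # divergence: expansion/contraction
    curl = dFy_dx - dFx_dy     # curl: rotation
    shear1 = dFx_dx - dFy_dy   # shear components (one common convention)
    shear2 = dFy_dx + dFx_dy
    return div, curl, shear1, shear2
```

For a purely radial flow field the divergence map is constant and the curl map vanishes, which is a convenient sanity check of the filtering.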
Continuous Wavelet Transform We use the continuous wavelet filtering implementation as outlined in Torrence and Compo (1998). In comparison to the previous version of our work, we now also perform temporal filtering on the GPU^{2} resulting in a considerable speed-up. This enables us to apply the wavelet transform in large batches over all spatial locations in the video. As previously mentioned, we use a Morlet wavelet (\(\omega _0 = 6\)) with logarithmic scales (\(\delta j = 0.125\), \(s_0 = 2\delta t\)). We limit the range of J corresponding to a minimum of four repetitions by setting \(s_{\min }\) and \(s_{\max }\) accordingly in (16) and (17). Depending on the video length, there are typically between 50 and 60 temporal scale levels. When the compute budget is tight, computational efficiency can be improved by pruning the filter bank with scale selection, for example using the maximum response of a Laplacian filter (Lindeberg 2017). Alternatively, learning could be employed to infer the relationship between motion speed and relevant wavelet scale levels to prune the filter bank.
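The logarithmic scale levels can be constructed as follows. This is a sketch of the Torrence–Compo convention \(s_j = s_0\, 2^{j\,\delta j}\); tying \(s_{\max}\) directly to the minimum of four repetitions is our simplification of Eqs. 16 and 17 (it ignores the Morlet scale-to-period factor, which is close to 1).

```python
import numpy as np

def wavelet_scales(dt, duration, num_cycles_min=4, dj=0.125):
    """Logarithmic scale levels for the wavelet filter bank.

    Follows s_j = s0 * 2**(j * dj) with s0 = 2 * dt (Torrence & Compo).
    The largest scale is bounded so that at least `num_cycles_min`
    repetitions fit in a video of length `duration` seconds.
    """
    s0 = 2 * dt
    s_max = duration / num_cycles_min   # longest admissible cycle length
    J = int(np.floor(np.log2(s_max / s0) / dj))
    return s0 * 2.0 ** (dj * np.arange(J + 1))
```

For a typical clip (30 fps, around 20 s) this yields on the order of 50 scale levels, consistent with the range reported above.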
Repetition Counting The instantaneous frequency estimates are obtained from the dense wavelet power by pooling over the motion foreground mask. As detailed in Sect. 4.6, the frequencies are integrated over time to arrive at a final repetition count. To remove frequency estimate outliers inconsistent with adjacent frames, we apply a median filter of 9 timesteps (frames) to enforce local smoothness. This gives a slight improvement on both video datasets. The final count predictions are not rounded; hence evaluation metrics may be slightly off due to incomplete cycles.
Reimplementation of Baselines We compare our method against two existing works for repetition estimation. The method of Pogalin et al. (2008) is chosen to represent the class of Fourier-based methods. Our reimplementation uses a more recent object tracker (Henriques et al. 2012) but is identical otherwise. The tracker is initialized by manually drawing a box on the first frame. Converting the frequency to a count is trivial using the video length and frame rate. Additionally, we compare with the deep learning method of Levy and Wolf (2015) using their publicly available code and pretrained model without any modifications.
5.3 Temporal Filtering: Fourier Versus Wavelets
Results From the results in Fig. 15 it is clear that wavelet-based counting outperforms the periodogram on idealized signals. As expected, we observe that the Fourier-based measurements generally fail on videos with significant cycle length variation as they give a global frequency prediction. Wavelets naturally handle non-stationary repetition and are less sensitive to cycle length variability. We also tried adding a substantial amount of Gaussian noise (\(\sigma = 0.5\)) to the signals; this resulted in a minor negative effect on both methods (data not shown). This controlled experiment shows the effectiveness of wavelets for repetition estimation assuming a clear signal can be distilled from the videos.
5.4 Viewpoint Invariance
Setup The theory of repetition considers two viewpoint extremes (Fig. 6). In this experiment we evaluate our method’s ability to handle a continuous transition from one viewpoint extreme to the other. The designated mechanism for this is the use of multiple motion representations and the summation of their spectral power obtained from the continuous wavelet transform. To test this, we set up a controlled experiment in which we synthesize a video clip from 3D modeled data in Blender. This enables full control over the object’s motion and the viewpoint. Specifically, we choose to build a simple 3D scene containing a ball periodically bouncing on the floor as displayed in the top row of Fig. 16. Initially, the camera captures the bouncing ball from the side view but after a number of full motion cycles, the camera smoothly transitions to the frontal view (case 3 to case 6 in Fig. 6). We record the median-pooled vertical flow and divergence over the foreground region to obtain two time-varying signals. The spectral power for both signals is individually estimated using the continuous wavelet transform, after which we combine the power by summation. In addition to the synthetic experiment, we also include the result of a real-world video with significant dynamic viewpoint change (previously shown in Fig. 7).
Results Figures 16 and 17 plot the two median-pooled flow signals and their joint wavelet power obtained by summation. Initially, as the moving object is captured from the side view, vertical flow is best measurable. Upon the viewpoint transition, vertical flow vanishes while the divergent flow becomes dominant. As a result of the camera motion, the measurement of the spectral power for both individual signals will only give a strong response for either the first or second half of the video. However, the summation of the spectra gives a clear measurement over the complete video as is apparent from the combined wavelet power spectrum. This illustrates our method’s ability to handle viewpoint changes by the combination of the wavelet power contained in multiple motion representations. By summation of the spectra, the best measurable motion representation will naturally give the largest contribution to the combined power. Therefore, this mechanism acts as a replacement of the global representation selection used in (Runia et al. 2018) by dynamically leveraging information in all representations.
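The summation mechanism can be sketched on two synthetic signals that mimic the viewpoint transition. The linear blend profile, the 0.8 Hz bounce frequency and the Morlet parametrization below are illustrative assumptions standing in for the actual Blender experiment:

```python
import numpy as np

def wavelet_power(x, period, fs, omega0=6.0):
    """Power of a complex Morlet response at one fixed period (a single
    row of a scalogram); the 1/s factor normalizes across scales."""
    s = period * (omega0 + np.sqrt(2.0 + omega0**2)) / (4.0 * np.pi)
    tw = np.arange(-4.0 * s, 4.0 * s, 1.0 / fs)
    psi = np.exp(1j * omega0 * tw / s) * np.exp(-0.5 * (tw / s) ** 2) / s
    return np.abs(np.convolve(x - x.mean(), psi, mode='same')) ** 2

# A ball bouncing at 0.8 Hz; the camera pans from side to frontal view
# between t = 4 s and t = 8 s, so vertical flow fades out while
# divergence fades in (mimicking the transition of Fig. 16).
fs, f0 = 30, 0.8
t = np.arange(0.0, 12.0, 1.0 / fs)
blend = np.clip((t - 4.0) / 4.0, 0.0, 1.0)
vertical_flow = (1.0 - blend) * np.sin(2 * np.pi * f0 * t)
divergence = blend * np.sin(2 * np.pi * f0 * t)

p_v = wavelet_power(vertical_flow, 1.0 / f0, fs)
p_d = wavelet_power(divergence, 1.0 / f0, fs)
p_sum = p_v + p_d  # combined power stays strong across the transition

early, late = 2 * fs, 10 * fs  # frames at t = 2 s and t = 10 s
print(p_v[early] > p_v[late])            # True: side view only helps early on
print(p_sum[late] > 0.5 * p_sum[early])  # True: the sum covers both halves
```

Each individual power signal collapses on one half of the clip, while the summed power remains strong throughout, which is the behavior the experiment above verifies on real renders.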
5.5 Diversity in Motion Maps
Setup As wavelets prove effective for repetition estimation and multiple representations show value on a synthetic video, we now assess the value of a diversified video representation on real videos of our QUVA Repetition dataset. We hypothesize that, due to the high variability in motion pattern and viewpoint, no single representation is effective on its own, whereas their joint diversity is. To test this, we perform repetition counting over all individual motion maps listed in Eq. (13). Instead of summing the wavelet power over all representations, we test the performance of the six motion representations individually. For each representation we densely compute the wavelet power and count the number of repetitions as outlined in the method section. For a fair comparison, we exclude our motion segmentation mechanism based on wavelet power and instead use the motion segmentation proposed by Papazoglou and Ferrari (2013). Again, we evaluate repetition counting on our QUVA Repetition dataset. To obtain a lower bound on the error, we also select the best representation per video in an oracle fashion.
Value of diversity in six motion maps for videos from QUVA Repetition

| Motion map | MAE | OBOA | # Selected |
|---|---|---|---|
| \({\varvec{\nabla }}{\varvec{\cdot }} \mathbf {F}\) | \(77.8 \pm 90.8\) | 0.21 | 10 |
| \({\varvec{\nabla }}{\varvec{\times }} \mathbf {F}\) | \(53.0 \pm 65.5\) | 0.32 | 11 |
| \(\nabla _x F_x\) | \(58.1 \pm 63.5\) | 0.29 | 15 |
| \(\nabla _y F_y\) | \(59.5 \pm 68.4\) | 0.31 | 9 |
| \(F_x\) | \(49.6 \pm 48.0\) | 0.35 | 25 |
| \(F_y\) | \(42.0 \pm 45.3\) | 0.43 | 30 |
| Oracle best | \(24.1 \pm 33.5\) | 0.63 | 100 |
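Under our reading of the metrics in this table (MAE as the absolute count error relative to the ground-truth count, reported in percent, and OBOA as the fraction of videos counted to within one repetition), the oracle lower bound can be sketched as follows; the counts below are hypothetical, not dataset values:

```python
import numpy as np

def mae(pred, true):
    """Mean absolute count error relative to the ground truth, in percent
    (our reading of the tables' MAE metric)."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return 100.0 * np.mean(np.abs(pred - true) / true)

def oboa(pred, true):
    """Off-by-one accuracy: fraction of videos counted to within +/- 1."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.mean(np.abs(np.round(pred) - true) <= 1)

def oracle(per_map_preds, true):
    """Per video, pick the motion map whose count is closest to the truth."""
    preds = np.asarray(per_map_preds, float)          # (n_maps, n_videos)
    errors = np.abs(preds - np.asarray(true, float))  # error per map, video
    best = errors.argmin(axis=0)                      # best map per video
    return preds[best, np.arange(preds.shape[1])], best

# Hypothetical counts from two motion maps over four videos:
true = [10, 8, 12, 5]
preds = [[10, 16, 11, 5],    # map A: good except on video 2
         [20, 8, 12, 9]]     # map B: good on videos 2 and 3
best_pred, chosen = oracle(preds, true)
print(list(best_pred))                              # [10.0, 8.0, 12.0, 5.0]
print(mae(best_pred, true), oboa(best_pred, true))  # 0.0 1.0
```

No single map is right on every video, but the per-video best selection is, which mirrors the gap between the individual rows and the oracle row in the table.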
5.6 Video Acceleration Sensitivity
Setup In this experiment, we examine our method's sensitivity to acceleration by artificially speeding up videos. Starting from the YTSegments dataset, in which most videos exhibit strong periodic motion, we induce significant non-stationarity by artificially accelerating the videos halfway. More precisely, we modify the videos such that after the midpoint frame, the speed is doubled by dropping every second frame. What follows are 100 videos with a \(2\times \) acceleration starting halfway. We compare against the deep learning method of Levy and Wolf (2015), which handles non-stationarity by running the period-predicting convolutional neural network in sliding-window fashion over the video. Fourier-based analysis was left out as it will inevitably fail on this task.
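The acceleration protocol is straightforward to reproduce; a minimal sketch on a stand-in list of frames (the function name is ours):

```python
def accelerate_halfway(frames):
    """Speed up a video 2x after its midpoint by dropping every second
    frame, inducing a non-stationary cycle length."""
    mid = len(frames) // 2
    return frames[:mid] + frames[mid::2]

video = list(range(20))           # stand-in for a list of frames
print(accelerate_halfway(video))  # [0, ..., 9, 10, 12, 14, 16, 18]
```

After the midpoint, any fixed period estimate is off by a factor of two on one of the halves, which is exactly the non-stationarity the experiment probes.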
5.7 Motion Segmentation
Setup In this experiment we investigate the effectiveness of the motion segmentations obtained directly from the wavelet power for repetition estimation. We visually compare the motion segmentations and test whether replacing our localization mechanism with a state-of-the-art motion segmentation method improves repetition estimation performance. We keep the method identical except for the segmentation method used to obtain the motion mask. Besides our wavelet-based motion segmentation, which yields the discriminative motion mask, we compare our method's performance without any localization (full-frame), with the video segmentation method of Papazoglou and Ferrari (2013), and with the deep learning approach of Tokmakov et al. (2017).
Results We visually compare the three different motion segmentation methods in Fig. 20. For most videos, our method is able to localize the repetitive motion. As the emphasis of our work is on repetition estimation, where the segmentation masks are a byproduct, the state-of-the-art methods specifically devoted to foreground motion segmentation naturally produce the visually best results and the lowest intersection-over-union error with respect to the ground-truth mask. Our intention is to obtain a motion mask best suited for repetition estimation, which does not necessarily overlap with the foreground motion. By thresholding the wavelet power maps, our method emphasizes the regions with the most discriminative repetitive motion. This is best recognizable from the bottom two rows, where the motion segmentation includes background regions that periodically change due to the motion. If maximum intersection-over-union overlap with respect to the ground-truth foreground motion mask is desired, we observe a number of failure cases. For the rower (bottom row), the periodicity contained in the movement of the paddles yields a significantly stronger wavelet response than the body itself; hence the body is excluded from the segmentation mask due to mean-thresholding of the wavelet power. In the case of football keep-ups (third row), the dominant repetitive motion is the football moving up and down, but the actor also rotates around its axis, which is not revealed in the static images. However, the oscillating ball dominates the scene and our segmentation masks should not include the actor's torso for this reason. The threshold is currently fixed to the mean wavelet power; setting it higher or adaptively could improve the segmentation masks.
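Mean-thresholding a spatial map of wavelet power reduces to a one-line comparison; the toy power map below is illustrative:

```python
import numpy as np

def segmentation_mask(power_map):
    """Binary motion mask by mean-thresholding a spatial map of wavelet
    power: keep pixels whose repetitive response exceeds the frame mean."""
    return power_map > power_map.mean()

# Toy power map: one strongly repetitive region amid weak background.
power = 0.1 * np.ones((6, 8))
power[2:4, 3:6] = 5.0  # discriminative repetitive motion
mask = segmentation_mask(power)
print(mask.sum())      # 6: only the strong region survives
```

Because a few very strong responses pull the mean up, weaker but genuinely repetitive regions (the rower's body in Fig. 20) fall below the threshold, which is the failure mode discussed above.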
5.8 Comparison to the State-of-the-Art
Setup In this experiment, we perform a full comparison on the task of repetition counting for both video datasets. We compare against the Fourier-based method of Pogalin et al. (2008) and the deep learning approach of Levy and Wolf (2015).
Repetition counting results of our method with different motion segmentation mechanisms

| Motion segmentation method | YTSegments MAE \(\downarrow \) | YTSegments OBOA \(\uparrow \) | QUVA Repetition MAE \(\downarrow \) | QUVA Repetition OBOA \(\uparrow \) |
|---|---|---|---|---|
| Full-frame | \(46.0 \pm 67.2\) | 0.28 | \(60.8 \pm 49.4\) | 0.22 |
| Papazoglou and Ferrari (2013) | \(13.1 \pm 20.3\) | 0.78 | \(42.6 \pm 49.2\) | 0.44 |
| Tokmakov et al. (2017) | \(21.6 \pm 57.2\) | 0.76 | \(38.9 \pm 39.2\) | 0.42 |
| Differential geometry (this paper) | \(\varvec{9.4 \pm 17.4}\) | 0.89 | \(\varvec{26.1 \pm 39.6}\) | 0.62 |
Comparison with the state-of-the-art on repetition counting for the YTSegments and our QUVA Repetition dataset

| Method | YTSegments MAE \(\downarrow \) | YTSegments OBOA \(\uparrow \) | QUVA Repetition MAE \(\downarrow \) | QUVA Repetition OBOA \(\uparrow \) |
|---|---|---|---|---|
| Pogalin et al. (2008) | \(21.9 \pm 30.1\) | 0.68 | \(38.5 \pm 37.6\) | 0.49 |
| Levy and Wolf (2015) | \(\mathbf {6.5 \pm \phantom {0}9.2}\) | 0.90 | \(48.2 \pm 61.5\) | 0.45 |
| This paper | \(9.4 \pm 17.4\) | 0.89 | \(\mathbf {26.1 \pm 39.6}\) | 0.62 |
The results change dramatically when considering our challenging QUVA Repetition dataset; notably, the deep learning approach of Levy and Wolf (2015) now performs worst, with an MAE of 48.2. This may be explained by the fact that their network considers only four motion types during training, or by the convolutional network's fixed temporal input dimension, which constrains the effective motion periods (ranging from 0.2 to 2.33 seconds). Dealing with motion periods outside this range most likely requires retraining the network. The Fourier-based method of Pogalin et al. (2008) scores an MAE of 38.5, whereas we obtain an average error of 26.1. On the YTSegments dataset our simplified method slightly improves over the MAE of \(10.3 \pm 19.8\) reported in Runia et al. (2018), while giving results comparable to the previously reported MAE of \(23.2 \pm 34.4\) on the QUVA Repetition dataset. The Fourier-based and deep learning-based approaches are unable to effectively handle the increased non-stationarity and motion complexity found in our challenging video dataset. The method proposed here improves the ability to handle such difficult videos without relying on explicit motion segmentation methods.
Sensitivity of our method with respect to different optical flow methods

| Optical flow method | YTSegments MAE \(\downarrow \) | YTSegments OBOA \(\uparrow \) | QUVA Repetition MAE \(\downarrow \) | QUVA Repetition OBOA \(\uparrow \) |
|---|---|---|---|---|
| TV-L\(^1\) | \(9.8 \pm 17.9\) | 0.89 | \(26.5 \pm 67.5\) | 0.67 |
| EpicFlow | \(9.7 \pm 17.9\) | 0.88 | \(30.8 \pm 38.2\) | 0.55 |
| FlowNet 2.0 | \(\mathbf {9.4 \pm 17.4}\) | 0.89 | \(\mathbf {26.1 \pm 39.6}\) | 0.62 |
To gain a better understanding of our method's characteristics, we study success and failure cases. We observe that our wavelet-based motion segmentation struggles with scenes containing dynamic textures such as sand or water (e.g. Fig. 12, bottom row). Based on our analysis, we believe the reason for this is twofold: (1) for such regions, motion estimation using optical flow is difficult (Adelson 2001); and (2) dynamic textures produce repetitive visual dynamics, resulting in a strong wavelet response over their entire surface. Consequently, motion segmentation by mean-thresholding of the spectral power inevitably fails, and subsequent measurements over the foreground motion mask will be incorrect as well. For such videos, we observe an enormous over-count as the frequency estimates correspond to the high-frequency rippling water. The error associated with these videos explains the limited improvement over our previous method (Runia et al. 2018), which relied on Papazoglou and Ferrari (2013) for motion segmentation and is therefore less prone to such segmentation failures. To remedy the problem of coarse and inaccurate segmentation masks, a post-processing step (e.g. a conditional random field) is likely to improve the overall segmentation quality.
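As a much cheaper stand-in for a CRF, spurious blobs could already be suppressed by keeping only the largest connected component of the mask. A pure-NumPy sketch of this heuristic follows; note it is our illustration, not part of the method, and it would be wrong when the repetitive motion genuinely spans several disjoint regions (as for the rower):

```python
from collections import deque
import numpy as np

def largest_component(mask):
    """Keep only the largest 4-connected component of a binary mask,
    removing isolated false-positive blobs via breadth-first search."""
    mask = np.asarray(mask, bool)
    seen = np.zeros_like(mask)
    best, best_size = np.zeros_like(mask), 0
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        comp, queue = [], deque([(sy, sx)])
        seen[sy, sx] = True
        while queue:  # flood-fill one component
            y, x = queue.popleft()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(comp) > best_size:
            best_size = len(comp)
            best = np.zeros_like(mask)
            ys, xs = zip(*comp)
            best[list(ys), list(xs)] = True
    return best

noisy = np.zeros((40, 40), bool)
noisy[10:30, 10:30] = True          # true foreground blob (400 px)
noisy[2, 2] = noisy[35, 5] = True   # isolated false positives
print(largest_component(noisy).sum())  # 400: small blobs removed
```

A CRF would additionally refine the mask boundary against image evidence; this sketch only removes disconnected outliers.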
We also observe that all methods make a common mistake: over-counting videos by a factor of two. What these videos have in common is that one full cycle contains the exact same motion performed first with one arm (or leg) and then with the other (e.g. walking lunges or swimming front crawl). As the perceived motion is almost identical for both limbs, the estimated temporal dynamics are twice as fast. Again, the significant over-estimate of the motion frequency produces a large count error for all methods. Solving this problem is not easy, as the repetition estimates in these cases are arguably also correct; however, the human annotators define the salient motion as a full cycle involving both limbs.
6 Conclusion
We have categorized 3D intrinsic periodic motion as translation, rotation or expansion, depending on the first-order differential decomposition of the motion field. Additionally, we distinguish three periodic motion continuities: constant, intermittent and oscillatory motion. For the 2D perception of 3D periodicity, the camera will be somewhere in the continuous range between two viewpoint extremes. What follows are 18 fundamentally different cases of repetitive motion appearance in 2D. The practical challenges associated with repetition estimation are the wide variety in motion appearance, non-stationary temporal dynamics and camera motion. Our method addresses all these challenges by computing a diversified motion representation, employing the continuous wavelet transform, and combining the power spectra of all representations to support viewpoint invariance. Whereas related work explicitly localizes the foreground motion, our method performs repetitive motion segmentation directly from the wavelet power maps, resulting in a simplified approach. We verify our claims by improving the state-of-the-art on the task of repetition counting on our challenging new video dataset. The method requires no training and only a minimal number of hyper-parameters, which are fixed throughout the paper. We envision applications beyond repetition estimation, as the wavelet power and scale maps can support localization of low- and high-frequency regions, suitable for region pruning or action classification.
References
- Abraham, R., Marsden, J. E., & Ratiu, T. (1988). Manifolds, tensor analysis, and applications (Vol. 75). Berlin: Springer.
- Adelson, E. H. (2001). On seeing stuff: The perception of materials by humans and machines. In Human vision and electronic imaging (Vol. 4299). International Society for Optics and Photonics.
- Albu, A. B., Bergevin, R., & Quirion, S. (2008). Generic temporal segmentation of cyclic human motion. Pattern Recognition, 41(1), 6–21.
- Azy, O., & Ahuja, N. (2008). Segmentation of periodically moving objects. In Proceedings of the IEEE international conference on pattern recognition (pp. 1–4).
- Belongie, S., & Wills, J. (2006). Structure from periodic motion. In W. James MacLean (Ed.), Spatial coherence for visual motion analysis. Berlin, Heidelberg: Springer.
- Briassouli, A., & Ahuja, N. (2007). Extraction and analysis of multiple periodic motions in video sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7), 1244–1261.
- Burghouts, G. J., & Geusebroek, J. M. (2006). Quasi-periodic spatiotemporal filtering. IEEE Transactions on Image Processing, 15(6), 1572–1582.
- Chen, M. Y., & Hauptmann, A. (2009). MoSIFT: Recognizing human actions in surveillance videos. Technical Report CMU-CS-09-161, Carnegie Mellon University.
- Chetverikov, D., & Fazekas, S. (2006). On motion periodicity of dynamic textures. In Proceedings of the British machine vision conference (pp. 167–176).
- Cutler, R., & Davis, L. S. (2000). Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 781–796.
- Davis, J., Bobick, A., & Richards, W. (2000). Categorical representation and recognition of oscillatory motion patterns. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1, 628–635.
- Goldenberg, R., Kimmel, R., Rivlin, E., & Rudzsky, M. (2005). Behavior classification by eigendecomposition of periodic motions. Pattern Recognition, 38(7), 1033–1043.
- Grossmann, A., & Morlet, J. (1984). Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM Journal on Mathematical Analysis, 15(4), 723–736.
- Henriques, J. F., Caseiro, R., Martins, P., & Batista, J. (2012). Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European conference on computer vision.
- Huang, S., Ying, X., Rong, J., Shang, Z., & Zha, H. (2016). Camera calibration from periodic motion of a pedestrian. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., & Brox, T. (2017). FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Jain, M., Jegou, H., & Bouthemy, P. (2013). Better exploiting motion for better action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2555–2562).
- Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14, 201–211.
- Klaser, A., Marszałek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3d-gradients. In Proceedings of the British machine vision conference (pp. 275–1).
- Koenderink, J., & van Doorn, A. (1975). Invariant properties of the motion parallax field due to the movement of rigid bodies relative to an observer. Optica Acta: International Journal of Optics, 9, 773–791.
- Koenderink, J. J. (1988). Scale-time. Biological Cybernetics, 58(3), 159–162.
- Laptev, I., Belongie, S. J., Perez, P., & Wills, J. (2005). Periodic motion detection and segmentation via approximate sequence alignment. Proceedings of the IEEE International Conference on Computer Vision, 1, 816–823.
- Levy, O., & Wolf, L. (2015). Live repetition counting. In Proceedings of the IEEE international conference on computer vision.
- Li, X., Li, H., Joo, H., Liu, Y., & Sheikh, Y. (2018). Structure from recurrent motion: From rigidity to recurrency. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3032–3040).
- Lindeberg, T. (2017). Dense scale selection over space, time and space-time. Journal on Imaging Sciences, 11(1), 438–451.
- Liu, F., & Picard, R. W. (1998). Finding periodicity in space and time. In Proceedings of the IEEE international conference on computer vision (pp. 376–383).
- Lu, C., & Ferrier, N. J. (2004). Repetitive motion analysis: Segmentation and event classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 258–263.
- Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4040–4048).
- Papazoglou, A., & Ferrari, V. (2013). Fast object segmentation in unconstrained video. In Proceedings of the IEEE international conference on computer vision (pp. 1777–1784).
- Pogalin, E., Smeulders, A. W. M., & Thean, A. H. C. (2008). Visual quasi-periodicity. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Polana, R., & Nelson, R. C. (1997). Detection and recognition of periodic, nonrigid motion. International Journal of Computer Vision, 23(3), 261–282.
- Ran, Y., Weiss, I., Zheng, Q., & Davis, L. S. (2007). Pedestrian detection via periodic motion analysis. International Journal of Computer Vision, 71(2), 143–160.
- Revaud, J., Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Runia, T. F. H., Snoek, C. G. M., & Smeulders, A. W. M. (2018). Real-world repetition estimation by div, grad and curl. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Sarel, B., & Irani, M. (2005). Separating transparent layers of repetitive dynamic behaviors. In Proceedings of the IEEE international conference on computer vision.
- Thangali, A., & Sclaroff, S. (2005). Periodic motion detection and estimation via space-time sampling. In Proceedings of the IEEE workshops on application of computer vision.
- Tokmakov, P., Alahari, K., & Schmid, C. (2017). Learning motion patterns in videos. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Torrence, C., & Compo, G. P. (1998). A practical guide to wavelet analysis. Bulletin of the American Meteorological Society, 79(1), 61–78.
- Tralie, C. J., & Perea, J. A. (2018). (Quasi)periodicity quantification in video data, using topology. SIAM Journal on Imaging Sciences, 11(2), 1049–1077.
- Tsai, P. S., Shah, M., Keiter, K., & Kasparis, T. (1994). Cyclic motion detection for motion based recognition. Pattern Recognition, 27(12), 1591–1603.
- Zach, C., Pock, T., & Bischof, H. (2007). A duality based approach for realtime TV-L\(^1\) optical flow. In Pattern recognition, LNCS (Vol. 4713, pp. 214–223). Berlin: Springer.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.