1 Introduction

The field of optical flow estimation is making steady progress as evidenced by the increasing accuracy of current methods on the Middlebury optical flow benchmark (Baker et al. 2007). After over 30 years of research, these methods have obtained an impressive level of reliability and accuracy (Wedel et al. 2008b, 2009; Werlberger et al. 2009; Xu et al. 2012; Zimmer et al. 2009). But what has led to this progress? The majority of today’s methods strongly resemble the original formulation of Horn and Schunck (HS, 1981). They combine a data term that assumes constancy of some image property with a spatial term that models how the flow is expected to vary across the image. An objective function combining these two terms is then optimized. Given that this basic structure is unchanged since HS, what has enabled the performance gains of modern approaches?

The paper has three parts. In the first, we perform a study of recent optical flow methods and models. The most accurate methods on the Middlebury flow dataset make different choices about how to model the objective function, how to approximate this model to make it computationally tractable, and how to optimize it. Since most published methods change all of these properties at once, it can be difficult to know which choices are most important. To address this, we define a baseline algorithm that is “classical”, in that it is a direct descendant of the original HS formulation, and then systematically vary the model and method using different techniques from the art. The results are surprising. We find that only a small number of key choices produce statistically significant improvements and that they can be combined into a very simple method that achieves reasonable accuracy. More importantly, our analysis reveals what makes current flow methods work so well.

Part two examines the principles behind this success. We find that one algorithmic choice produces the most significant improvements: applying a median filter to intermediate flow values during incremental estimation and warping (Wedel et al. 2008b, 2009). While this heuristic improves the accuracy of the recovered flow fields, it actually increases the energy of the objective function. This suggests that what is being optimized is actually a new and different objective. Using observations about median filtering and L1 energy minimization from Li and Osher (2009), we formulate a new non-local term that is added to the original, classical objective. This new term goes beyond standard local (pairwise) smoothness to robustly integrate information over large spatial neighborhoods. We show that minimizing this new energy approximates the original optimization with the heuristic median filtering step. Note, however, that the new objective falls outside our definition of classical methods.

Once the median filtering heuristic is formulated as a non-local term in the objective, we immediately recognize how to modify and improve it. In part three we show how information about image structure and flow boundaries can be incorporated into a weighted version of the non-local term to prevent over-smoothing across boundaries. By incorporating structure from the image, this weighted version does not suffer from some of the errors produced by median filtering and better preserves motion boundaries. Figure 1 illustrates optical flow estimates for a range of methods from a “basic” HS method to our proposed Classic+NL method.

Fig. 1

Estimated optical flow on the Middlebury test “Army” sequence. Left to right: a an old implementation of the Horn and Schunck (HS) method (Sun et al. 2008), b a new implementation with current practices, c a modern implementation of a robust version, d an improved model that uses a non-local spatial term to robustly integrate information over a large spatial neighborhood, e ground truth from the Middlebury website (downsampled and JPEG compressed; original ground truth is withheld), and f the first frame. Color coding as in (Baker et al. 2007), shown in Fig. 4c. Average end-point error (EPE): a 0.22, b 0.12, c 0.09, and d 0.08

Finally we observe that the classical methods all go beyond the original HS algorithm by using a spatial pyramid to cope with large motions. The classical pyramid downsamples the image equally in the horizontal and vertical directions, typically until some minimum image dimension is reached. With today’s wide-aspect-ratio video, we point out that an asymmetric approach can be employed, resulting in a pyramid that downsamples more in the horizontal direction than in the vertical one. This effectively allows the estimation of larger horizontal motions. This simple change results in significant improvements on the wide-aspect-ratio video in the KITTI (Geiger et al. 2012) and MPI Sintel (Butler et al. 2012) datasets.

At the time of writing our previous conference paper (Sun et al. 2010a, March), the resulting approach was ranked 1st in both angular and end-point errors in the Middlebury evaluation. At the writing of this paper (Sep. 2012), the method, Classic+NL, ranks 13th in both AAE and EPE. Several recent and high-ranking methods directly build on Classic+NL, such as layered models (Sun et al. 2010b, 2012, 2013), methods with more advanced motion prior models (Chen et al. 2012; Jia et al. 2011), efficient optimization schemes for the non-local term (Krähenbühl and Koltun 2012), and better initialization to deal with large displacement optical flow (Chen et al. 2013).

Compared to the conference version (Sun et al. 2010a), this paper includes many more detailed results and analyses. In addition to an expanded literature review we compare our proposed method to the closely related non-local total variation method (Werlberger et al. 2010). We discuss the limitations of our method in dealing with occlusions and fast moving objects. We report results on the MIT HAMA data set (Liu et al. 2008) and find that the results are consistent with those on Middlebury. We also test our methods on the MPI Sintel (Butler et al. 2012) and KITTI (Geiger et al. 2012) datasets, which offer greater challenges. Using the same parameters tuned on the Middlebury training set, our method performs well on these new datasets, particularly using an asymmetric pyramid.

In summary, the contributions of this paper are to (1) analyze current flow models and methods to understand which design choices matter; (2) formulate and compare several classical objectives descended from HS using modern methods; (3) formalize one of the key heuristics and derive a new objective function that includes a non-local spatial smoothness term; (4) modify this new objective to produce a competitive method; (5) extend spatial pyramids to exploit the extra width of high-definition and letterbox videos. In doing so, we provide a “recipe” for others studying optical flow that can guide their design choices. Finally, to enable comparison and further innovation, we provide a public Matlab implementation (http://www.cs.brown.edu/people/dqsun; last accessed 24 July 2013).

2 Previous Work

It is important to separately analyze the contributions of the objective function that defines the problem (Sect. 2.1) and the optimization algorithm and implementation used to minimize it (Sect. 2.2). The HS formulation, for example, has long been thought to be highly inaccurate. Barron et al. (1994) reported an average angular error (AAE) of \(\sim 30^{\circ }\) on the “Yosemite” sequence. This confounds the objective function with the particular optimization method proposed by Horn and Schunck. Horn and Schunck noted that the correct way to optimize their objective is by solving a system of linear equations, as is common today. This was impractical on the computers of the day, hence they used a heuristic method. In fact, Barron et al. note that the original HS derivatives were implemented crudely and report a modified version of HS with AAE around \(11^{\circ }\). When optimized with today’s methods, the HS objective achieves surprisingly competitive results (Geiger et al. 2012) despite the expected over-smoothing and sensitivity to outliers. The reported accuracy of a method is jointly determined by the objective function, the optimization techniques, the implementation details, and the parameter tuning/learning (cf. Marr 1982; Szeliski 2010). We review related research in the context of the first three aspects below.

2.1 Models

The global formulation of optical flow introduced by Horn and Schunck (1981) relies on both brightness constancy and spatial smoothness assumptions, but suffers from the fact that their quadratic formulation is not robust to outliers. Shulman and Herve (1989) use an L1 penalty instead to preserve flow discontinuities. Black and Anandan (1996) introduce a robust framework to deal with outliers in both the data and the spatial terms. Subsequently, many different robust functions have been explored (Brox et al. 2004; Lempitsky et al. 2008; Sun et al. 2008) and it remains unclear which is best. We refer to all these spatially-discrete formulations derived from HS as “classical.” We systematically explore variations in the formulation and optimization of these approaches. The surprise is that the classical model, appropriately implemented, remains fairly competitive.

There are many formulations beyond the classical ones that we do not consider here. Significant ones use oriented smoothness (Nagel and Enkelmann 1986; Sun et al. 2008; Wedel et al. 2009; Zimmer et al. 2011, 2009), rigidity constraints (Wedel et al. 2008a, 2009), an over-parameterized smoothness term (Nir et al. 2008), or image segmentation (Black and Jepson 1996; Lei and Yang 2009; Xu et al. 2008; Zitnick et al. 2005). While they deserve similar careful consideration, we expect many of our conclusions to carry forward. Note that one can select among a set of models or methods for a given sequence (Mac Aodha et al. 2010), instead of finding a “best” model for all the sequences.

2.2 Methods

Many of the implementation details that are thought to be important date back to the early days of optical flow. Current best practices include coarse-to-fine estimation to deal with large motions (Bergen et al. 1992; Brox et al. 2004), texture decomposition (Wedel et al. 2008a, b) or high-order filter constancy (Adelson et al. 1984; Brox et al. 2004; Glazer et al. 1983; Lempitsky et al. 2010; Zimmer et al. 2009) to reduce the influence of lighting changes, incremental warping (Bergen et al. 1992), warping with bicubic interpolation (Lempitsky et al. 2008; Wedel et al. 2008b), temporal averaging of image derivatives (Horn 1986; Wedel et al. 2008b), graduated non-convexity (Blake and Zisserman 1987) to minimize non-convex energies (Black and Anandan 1996; Sun et al. 2008), and median filtering after each incremental estimation step to remove outliers (Wedel et al. 2008b).

This median filtering heuristic is of particular interest as it makes non-robust methods more robust and improves the accuracy of all methods we tested. The effect on the objective function and the underlying reason for its success have not previously been analyzed. Least median squares estimation can be used to robustly reject outliers in flow estimation (Bab-Hadiashar and Suter 1998), but previous work has focused on the data term.

Related to median filtering, and our new non-local term, is the use of bilateral filtering to prevent smoothing across motion boundaries (Xiao et al. 2006). This approach separates a variational method into two filtering update stages, and replaces the original anisotropic diffusion process with multi-cue driven bilateral filtering. As with median filtering, the bilateral filtering step changes the original energy function.

Models that are formulated with an L1 robust penalty are often coupled with specialized total variation (TV) optimization methods (Zach et al. 2007). Here we focus on generic optimization methods that can apply to most models and find that the estimated flow fields are as accurate as the reported results for specialized methods.

Despite recent algorithmic advances, there is a lack of publicly available, easy to use, and accurate flow estimation software. The GPU4Vision project (http://gpu4vision.icg.tugraz.at; last accessed 24 July 2013) has made a substantial effort to change this and provides executable files for several accurate methods (Wedel et al. 2008a, b, 2009; Werlberger et al. 2009). The dependence on the GPU and the lack of source code are limitations. Since the publication of our conference paper, our public Matlab code has been used both by researchers to develop new optical flow algorithms (Adato et al. 2011; Chen et al. 2012, 2013; Jia et al. 2011; Krähenbühl and Koltun 2012) and by practitioners who apply optical flow to different applications (Humayun et al. 2011; Lin and Fisher 2012; Niu et al. 2012). Other currently available optical-flow software includes http://lmb.informatik.uni-freiburg.de/resources/software.php, http://people.csail.mit.edu/celiu/OpticalFlow/, and http://www.cse.cuhk.edu.hk/leojia/projects/flow/ (all last accessed 24 July 2013).

3 Classical Models

As is common to “classical” methods we only address the two-frame optical flow estimation problem. We write the classical optical flow objective function in its spatially discrete form as

$$\begin{aligned} E(\mathbf{u },\mathbf{v })&= \sum _{i, j}\big \{\rho _D(I_1(i, j) \!-\! I_2(i\!+\!u_{i,j}, j\!+\!v_{i,j}))\nonumber \\&\qquad +\, \lambda [\rho _S(u_{i,j}\!-\!u_{i+1,j})\! +\! \rho _S(u_{i,j}\!-\!u_{i,j+1})\nonumber \\&\qquad +\, \rho _S(v_{i,j}\!-\!v_{i+1,j}) \!+\! \rho _S(v_{i,j}\!-\!v_{i,j+1})] \big \}, \end{aligned}$$
(1)

where \(\mathbf{u }\) and \(\mathbf{v }\) are the horizontal and vertical components of the optical flow field to be estimated from images \(I_1\) and \(I_2\), \(i,j\) indexes a particular image pixel location, \(u_{i,j}\) and \(v_{i,j}\) are elements of \(\mathbf{u }\) and \(\mathbf{v }\) respectively, \(\lambda \) is a regularization parameter, and \(\rho _D\) and \(\rho _S\) are the data and spatial penalty functions. We consider three different penalty functions: (1) the quadratic HS penalty \(\rho (x) = x^2\); (2) the Charbonnier penalty \(\rho (x) = \sqrt{x^2 + \epsilon ^2}\) (Bruhn et al. 2005), a differentiable variant of the absolute value, the most robust convex function; and (3) the Lorentzian \(\rho (x) = \log (1+\frac{x^2}{2 \sigma ^2})\), which is a non-convex robust penalty used by Black and Anandan (1996). We refer to the robust formulation with the Lorentzian penalty as BA (short for Black and Anandan). Note that this classical model is related to a standard pairwise Markov random field (MRF) based on a 4-neighborhood (Geman and Geman 1984).
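
For concreteness, these penalties can be written as simple Matlab function handles. The following sketch is illustrative only; the \(\epsilon \) and \(\sigma \) values are those reported in Sect. 3.2.

```matlab
% Illustrative Matlab handles for the penalty functions in Eq. (1).
rho_quad    = @(x) x.^2;                                 % quadratic (HS)
rho_charb   = @(x, epsilon) sqrt(x.^2 + epsilon^2);      % Charbonnier
rho_lorentz = @(x, sigma) log(1 + x.^2 ./ (2*sigma^2));  % Lorentzian

% Example: spatial penalty of a flow difference of 0.7 pixels.
du = 0.7;
eC = rho_charb(du, 0.001);    % epsilon for the Charbonnier (Sect. 3.2)
eL = rho_lorentz(du, 0.03);   % sigma for the Lorentzian spatial term
```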

In the remainder of this section we define a baseline method using several techniques from the literature. This is not the “best” method, but includes modern techniques and will be used for comparison. We only briefly describe the main choices, which are explored in more detail in the following section and the cited references.

Quantitative results are presented throughout the remainder of the text. In all cases we report the average end-point error (EPE) on the Middlebury training or test set, depending on the experiment.

3.1 Baseline Methods

To gain robustness against lighting changes, we follow Wedel et al. (2008b) and apply the Rudin–Osher–Fatemi (ROF; Rudin et al. 1992) structure texture decomposition method to pre-process the input sequences and linearly combine the texture and structure components (in the proportion 20:1). The parameters are set according to Wedel et al. (2008b).
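
A minimal sketch of this pre-processing step is given below. The routine rof_denoise is a hypothetical placeholder for any implementation of the ROF denoiser (Rudin et al. 1992), and the 20:1 blend is written up to an overall scale factor.

```matlab
% Sketch of the structure-texture pre-processing (Wedel et al. 2008b).
% rof_denoise is a hypothetical placeholder for an ROF/total-variation
% denoiser, not a built-in function.
I         = rand(480, 640);          % stand-in grayscale frame in [0, 1]
structure = rof_denoise(I);          % low-frequency structure component
texture   = I - structure;           % high-frequency texture component
blended   = texture + structure/20;  % 20:1 texture-structure blend
% (in our experiments the blended result is then rescaled to [0, 255])
```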

Optimization is performed using a standard incremental multi-resolution technique (e.g., Black and Anandan 1996; Brox et al. 2004) to estimate flow fields with large displacements. The optical flow estimated at a coarse level is used to warp the second image toward the first at the next finer level, and a flow increment is calculated between the first image and the warped second image. The standard deviation of the Gaussian anti-aliasing filter is set to \(\frac{1}{\sqrt{2 d}}\), where \(d\) denotes the downsampling factor, and each pyramid level is obtained by recursively downsampling the next finer level. In building the pyramid, the downsampling factor is not critical, as shown in the next section; here we use the settings of Sun et al. (2008), which use a factor of 0.8 in the final stages of the optimization. For the basic pyramid scheme, we adaptively determine the number of pyramid levels so that the top level has a width or height of around 20–30 pixels. At each pyramid level, we perform 10 warping steps to compute the flow increment.
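
The pyramid construction can be sketched as follows, assuming the Image Processing Toolbox. The target top-level size of 24 pixels is one illustrative choice within the 20–30 pixel range mentioned above; the exact constants in our released code may differ.

```matlab
% Sketch of the coarse-to-fine pyramid construction. d is the downsampling
% factor (e.g., 0.5 or 0.8); each level is smoothed with a Gaussian of
% standard deviation 1/sqrt(2d) and then resampled.
I       = rand(480, 640);                        % stand-in first frame
d       = 0.5;                                   % downsampling factor
minDim  = min(size(I, 1), size(I, 2));
nLevels = 1 + floor(log(minDim/24) / log(1/d));  % top level ~24 px
sigma   = 1 / sqrt(2*d);
g       = fspecial('gaussian', 2*ceil(2*sigma) + 1, sigma);

pyr    = cell(nLevels, 1);
pyr{1} = I;
for l = 2:nLevels
    smoothed = imfilter(pyr{l-1}, g, 'replicate');  % anti-aliasing
    pyr{l}   = imresize(smoothed, d, 'bilinear');   % recursive downsampling
end
```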

At each warping step, we linearize the data term once, which involves computing terms of the type \(\frac{\partial }{ \partial x} I_2(i+u^k_{i,j}, j+v^k_{i,j})\), where \({\partial }/\partial x\) denotes the partial derivative in the horizontal direction, and \(u^k\) and \(v^k\) denote the current flow estimate at iteration \(k\). As suggested by Wedel et al. (2008b), we compute the derivatives of the second image using the 5-point derivative filter \(\frac{1}{12}[-1 \ 8 \ 0 \ -8 \ 1]\), and warp the second image and its derivatives toward the first using the current flow estimate and bicubic interpolation. We then compute the spatial derivatives of the first image, average them with the corresponding warped derivatives of the second image (cf. Álvarez et al. 2007; Horn 1986), and use the result in place of \(\frac{\partial I_2}{\partial x}\). For pixels moving out of the image boundaries, we set both their corresponding temporal and spatial derivatives to zero. After each warping step, the flow update is computed, and then we apply a \(5\times 5\) median filter to the newly computed flow field to remove outliers (Wedel et al. 2008b).
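
The following sketch illustrates the derivative computation, warping, and median filtering within a single warping step, assuming grayscale frames and the Image Processing Toolbox. The linear system solve for the flow increment is omitted (a zero placeholder increment is used), so the snippet is schematic rather than a complete solver.

```matlab
% Sketch of one warping step with stand-in frames and a zero flow estimate.
I1 = rand(100, 120);  I2 = rand(100, 120);
u  = zeros(100, 120); v  = zeros(100, 120);

h5  = [-1 8 0 -8 1] / 12;                        % 5-point derivative filter
I2x = imfilter(I2, h5,  'replicate', 'conv');    % horizontal derivative of I2
I2y = imfilter(I2, h5', 'replicate', 'conv');    % vertical derivative of I2

[X, Y]  = meshgrid(1:size(I1, 2), 1:size(I1, 1));
warpI2  = interp2(I2,  X + u, Y + v, 'cubic');   % bicubic warping toward I1
warpI2x = interp2(I2x, X + u, Y + v, 'cubic');
warpI2y = interp2(I2y, X + u, Y + v, 'cubic');

I1x = imfilter(I1, h5,  'replicate', 'conv');
I1y = imfilter(I1, h5', 'replicate', 'conv');
Ix  = 0.5 * (I1x + warpI2x);                     % temporal averaging of the
Iy  = 0.5 * (I1y + warpI2y);                     % spatial derivatives
It  = warpI2 - I1;                               % temporal difference

out = isnan(warpI2);                             % pixels that left the image
Ix(out) = 0;  Iy(out) = 0;  It(out) = 0;         % zero their derivatives

% The flow increment (du, dv) would come from solving the linearized system
% built from Ix, Iy, It; a zero placeholder is used here for illustration.
du = zeros(size(u));  dv = zeros(size(v));
u  = medfilt2(u + du, [5 5], 'symmetric');       % 5x5 median filtering of
v  = medfilt2(v + dv, [5 5], 'symmetric');       % the updated flow field
```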

For the Charbonnier (Classic-C) and Lorentzian (Classic-L) penalty functions, we use a graduated non-convexity scheme (GNC; Blake and Zisserman 1987) as described by Sun et al. (2008). First, we replace the robust penalty functions by quadratic penalty functions and obtain a quadratic formulation of the objective function, \(E_Q(\mathbf{u }, \mathbf{v })\). Then we linearly combine the quadratic penalty function with the desired robust penalty function, gradually changing the weighting of the two terms until only the robust penalty remains. In practice, we use a three-stage GNC scheme, with the objective functions for the first, second, and third stages being \(E_Q(\mathbf{u }, \mathbf{v })\), \(\frac{1}{2}\big ( E_Q(\mathbf{u }, \mathbf{v }) + E(\mathbf{u }, \mathbf{v }) \big )\), and \(E(\mathbf{u }, \mathbf{v })\) respectively. The output of each stage serves as the initialization for the next. The standard deviation of the corresponding quadratic penalty function is set to 1 for the Charbonnier penalty and, for the Lorentzian, to the same \(\sigma \) value used in the Lorentzian function. The same regularization weight \(\lambda \) is used for both the quadratic and the robust objective functions.
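
The staging can be viewed as a linear blend of the quadratic and robust penalties; the schematic sketch below conveys only the blending schedule, with the inner incremental-warping optimization left as a comment.

```matlab
% Schematic three-stage GNC: the stage objective uses the blended penalty
% (1 - alpha)*rho_quad + alpha*rho_robust with alpha = 0, 0.5, 1.
rho_quad   = @(x) x.^2;                  % quadratic stand-in
rho_robust = @(x) sqrt(x.^2 + 0.001^2);  % target Charbonnier penalty
for alpha = [0, 0.5, 1]
    rho_gnc = @(x) (1 - alpha)*rho_quad(x) + alpha*rho_robust(x);
    % ... run the incremental multi-resolution optimization of Sect. 3.1
    % with rho_gnc, initialized with the flow from the previous stage ...
end
```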

3.2 Baseline Results

The regularization parameter \(\lambda \) is selected among a set of candidate values to achieve the best average end-point error (EPE) on the Middlebury training set. For the Charbonnier penalty function, the candidate set is \([1,3,5,8,10]\) and \(5\) is optimal. The Charbonnier penalty uses \(\epsilon =0.001\) for both the data and the spatial term in Eq.  1. The Lorentzian uses \(\sigma = 1.5\) for the data term, \(\sigma = 0.03\) for the spatial term, and \(\lambda =0.06\). These parameters are fixed throughout the experiments, except where mentioned.

Table 1 summarizes the EPE results of the basic model with three different penalty functions on the Middlebury test set, along with the two top-performing published methods at the time the evaluation table was generated. Table 2 provides detailed results for each sequence. The classical formulations with the two non-quadratic penalty functions, Classic-C and Classic-L, achieve competitive results despite their simplicity. The baseline optimization of HS and BA (Classic-L) results in significantly better accuracy than previously reported for these models (Sun et al. 2008). Note that the analysis also holds for the training set (Table 3).

Table 1 Models: average rank and end-point error (EPE) on the Middlebury test set using different penalty functions
Table 2 Models: average end-point error (EPE) on the Middlebury optical flow benchmark (test set)
Table 3 Pre-processing: average end-point error (EPE) on the Middlebury training set for the baseline method (Classic-C) using different image pre-processing techniques

Because Classic-C performs quite well despite its simplicity, we set it as the baseline below. Note that our baseline implementation of HS has a lower average EPE than many more sophisticated methods. The HS implementation here incorporates many algorithmic and implementation details not present in the original HS method; the core idea of quadratic data and spatial terms however remains the same. In our naming convention, one can think of the HS method here as Classic-Q, meaning that it is the same as the Classic-C method except that the data and spatial penalty terms are quadratic.

4 Practices Explored

We now systematically vary the baseline approach by incorporating different ideas that have appeared in the literature, with the goal of illuminating which of these ideas are significant. This analysis is performed on the Middlebury training set by changing only one property at a time. Statistical significance is determined using a Wilcoxon signed rank test (Wilcoxon 1945) between each modified method and the baseline Classic-C method; a \(p\) value less than 0.05 indicates a significant difference. Each section below presents detailed comparisons of all these methods and then summarizes the results in a simple “take away message” about what we think are the “best practices” based on the data.
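
In Matlab, such a test can be performed with the signrank function from the Statistics Toolbox; the sketch below uses made-up per-sequence EPE values purely for illustration.

```matlab
% Paired Wilcoxon signed rank test between a variant and the baseline.
% The EPE values are placeholders, not measured results.
epeBaseline = [0.22 0.16 0.08 0.28 0.41 0.19 0.45 0.62];
epeVariant  = [0.21 0.17 0.08 0.27 0.43 0.18 0.44 0.60];
p = signrank(epeBaseline, epeVariant);
isSignificant = p < 0.05;
```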

4.1 Image Pre-Processing

While it is common to talk about the brightness constancy assumption as a core feature of most optical flow algorithms, in practice many other constancy assumptions have been used. It is common, for example, to pre-filter the images in a variety of ways ranging from simple smoothing to edge detection. For each method, we optimize the regularization parameter \(\lambda \) for the training sequences. The results are summarized in Table 3, with details of the methods applied to individual training sequences given in Tables 4 and 5. The baseline uses a non-linear pre-filtering of the images (ROF) to reduce the influence of illumination changes between frames (Wedel et al. 2008b). Table 3 shows the effect of using no pre-processing, resulting in the standard brightness constancy model (*-brightness). Classic-C-brightness actually achieves lower EPE on the training set than does Classic-C but significantly higher error on the test set (Table 1). This disparity suggests overfitting to the training data and leaves open the question as to whether the standard brightness constancy assumption, formulated robustly, may still compete with various types of filter/structure constancy given appropriate training data.

Table 4 Models and pre-processing: average end-point error (EPE) on the Middlebury training set for the classical model and different penalty functions
Table 5 Pre-processing: average end-point error (EPE) on the Middlebury training set for the baseline method (Classic-C) using different pre-processing techniques

Simpler alternatives, such as filter response (or high-order) constancy (Brox et al. 2004; Bruhn and Weickert 2005; Sun et al. 2008), can serve the same purpose as the ROF texture decomposition. A variety of pre-filters have been used in the literature, including derivative filters, Laplacians (Burt et al. 1982; Lempitsky et al. 2010), and Gaussians. Edges have also been emphasized using the Sobel edge magnitude (Vaudrey and Klette 2009).

Gradient only imposes constancy of the gradient vector at each pixel, as proposed by Brox et al. (2004); i.e., it robustly penalizes the Euclidean distance between image gradients. We use central difference filters (\(Dx = [-0.5 \ 0 \ 0.5]\) and \(Dy = Dx^T\)). Gaussian+Dx+Dy assumes separate brightness, horizontal derivative, and vertical derivative constancy; a weighted combination of robust functions applied to each term is used, as by Sun et al. (2008). Neither of these methods differs significantly from the baseline texture decomposition (Classic-C). Two methods are significantly worse: the Sobel edge magnitude (Vaudrey and Klette 2009) and the Laplacian pre-filtering (\(5\times 5\)) used by Lempitsky et al. (2010). The Sobel edge magnitude appears not to work well on some of the sequences, particularly the synthetic ones, and may not be suitable for a general flow estimation method. The Laplacian pre-filtering produces good results on “RubberWhale”, but poor ones on the synthetic sequences. Note that the parameters of the FusionFlow method (Lempitsky et al. 2010) were mainly tuned on the “RubberWhale” sequence; the evaluation results suggest that a better pre-processing technique could improve FusionFlow. Gaussian pre-filtering (\(\sigma = 0.5\)) performs well on the synthetic sequences, but poorly on the real ones. Finally, the texture-structure blending ratio is 20:1 in Wedel et al. (2008b) but 4:1 in Werlberger et al. (2009). We find that Texture4:1 performs better (though not significantly) on the synthetic sequences, with a little degradation on the real ones. The blended result of the texture decomposition is normalized to \([-1, 1]\) by Wedel et al. (2008b) and to \([0, 255]\) in our experiments; omitting this normalization (Unnormalized texture) has little effect.

For the Laplacian pre-filtering, we find that combining the filtered image with the original image in the proportion 1:1 improves accuracy significantly (Laplacian1:1). Similar to the ROF texture decomposition, such an approach boosts the high-frequency components while suppressing the low-frequency components that contain the lighting changes.

Good Practices: Some form of image filtering is useful but simple derivative constancy is nearly as good as the more sophisticated texture decomposition method.

4.2 Coarse-to-Fine Estimation and Graduated Non-Convexity (GNC)

We vary the number of warping steps per pyramid level and find that 3 warping steps give results similar to the baseline 10 (Table 6), except on “Urban3”, which is dominated by large motion and occlusions (see Table 7 for sequence-specific results). For the coarse-to-fine pyramid, Sun et al. (2008) use a downsampling factor of 0.8 during non-convex optimization. A traditional downsampling factor of 0.5 (Down-\(0.5\)), however, gives nearly identical performance. Note that a larger factor means that adjacent pyramid levels are more similar in size and, for a pyramid whose top and bottom levels have fixed sizes, results in more pyramid levels.

Table 6 Model and methods: average end-point error (EPE) on the Middlebury training set for the baseline method (Classic-C) using different algorithm and modeling choices
Table 7 Model and methods: average end-point error (EPE) on the Middlebury training set for the baseline model (Classic-C) using different algorithm and modeling choices

Previously, Brox et al. (2004) have reported that a downsampling factor of 0.95 produces much better results than 0.5. Note that for each iterative warping estimation step, Brox et al. use successive over-relaxation (SOR) to iteratively solve their linear system of equations and stop the iteration before convergence. With a downsampling factor of 0.95, they effectively increase the number of iterative warping steps performed by the algorithm, and this likely helps the overall algorithm converge. For our implementation, we solve the linear system of equations using the Matlab built-in backslash function and obtain converged results for each iterative warping estimation step. Under such a setting, we find that the downsampling factor has little influence on the performance.

Removing the GNC procedure for the Charbonnier penalty function (w/o GNC) results in higher EPE on most sequences and higher energy on all sequences (Table 8). This suggests that the GNC method is helpful even for the convex Charbonnier penalty function due to the nonlinearity of the data term.

Table 8 Energy (\(\times 10^6\), Eq. 1) for the optical flow fields computed on the Middlebury training set, evaluated using convolution-based bicubic interpolation (Keys 1981)

Good Practices: The downsampling factor does not matter when using a convex penalty; a standard factor of 0.5 is fine. Some form of GNC is useful even for a convex robust penalty like Charbonnier because of the nonlinear data term.

4.3 Interpolation Method and Derivatives

We find that the baseline bicubic interpolation is more accurate than bilinear (Table 6, Bilinear), as already reported in previous work (Wedel et al. 2008b). Removing temporal averaging of the gradients (w/o TAVG), using a Central difference filter \([-1\ 0\ 1]/2\), or using a 7-point derivative filter \([-1 \ 9 \ -45 \ 0 \ 45 \ -9 \ 1]/60\) (Bruhn et al. 2005) all reduce accuracy compared to the baseline, but not significantly.

The baseline method computes the image derivative by first computing the derivative of the second image, warping the intermediate result toward the first image, and then averaging the warped result with the spatial derivative of the first image. Another approach is to first warp the second image toward the first image, compute the derivatives of the warped image, and then perform the temporal averaging with the spatial derivatives of the first image (Bruhn et al. 2005). We find the second approach produces similar results (Deriv-warp). However, the derivatives computed in either way are inconsistent with those implicitly interpolated by the bicubic interpolation. Bicubic interpolation interpolates not only the image but also the derivatives (Press et al. 2002). Because the Matlab built-in function interp2 is based on cubic convolution (Keys 1981) and does not provide the derivatives used in interpolation, we use the spline-based implementation by Press et al. (2002). With the new implementation (Bicubic-II), the three different ways to compute the derivatives give very similar EPE results, all better than the Matlab built-in function. However, the one with consistent derivatives (Bicubic-II) gives the lowest energy solution, as shown in Table 9.

Table 9 Energy (\(\times 10^6\), Eq. 1) for the optical flow fields computed on the Middlebury training set, evaluated using spline-based bicubic interpolation (Press et al. 2002)

Good Practices: Use spline-based bicubic interpolation with a 5-point filter. Compute the derivatives during the interpolation to obtain the lowest energy solutions. Temporal averaging of the derivatives is probably worthwhile for a small computational expense.

4.4 Penalty Functions

We find that the convex Charbonnier penalty performs better than the more robust, non-convex Lorentzian on both the training and test sets. We test using the Charbonnier for the data term and Lorentzian for the spatial term (C–L) and vice versa (L–C). The two approaches perform better than using the Lorentzian for both terms but worse than using the Charbonnier for both terms.

One reason might be that non-convex functions are more difficult to optimize, causing the optimization scheme to find a poor local optimum. Another reason might be that the MAP estimator actually favors the “wrong” penalty functions (Nikolova 2007; Schmidt et al. 2010).

We investigate a generalized Charbonnier penalty function \( \rho (x) = (x^2 + \epsilon ^2)^a\) that is equal to the Charbonnier penalty when \(a=0.5\), and non-convex when \(a< 0.5\) (see Fig. 2). We optimize the regularization parameter \(\lambda \) again. We find a slightly non-convex penalty with \(a=0.45\) (GC-0.45) performs consistently better than the Charbonnier penalty, whereas more non-convex penalties (GC-0.25 with \(a=0.25\)) show no improvement.

Fig. 2

Different penalty functions for the spatial terms: Charbonnier (\(\epsilon =0.001\)), generalized Charbonnier (\(a=0.45\) and \(a=0.25\)), and Lorentzian (\(\sigma =0.03\))

Good Practices: The less-robust Charbonnier is preferable to the highly non-convex Lorentzian and a slightly non-convex penalty function (GC-0.45) is better still.

4.5 Median Filtering

Figure 3 illustrates the median filtering step within the coarse-to-fine incremental estimation process. The baseline \(5\times 5\) median filter (MF \(5\times 5\) ) is better than both MF \(3\times 3\) (Wedel et al. 2008b) and MF \(7\times 7\), but the difference is not significant (Table 6). When we perform \(5\times 5\) median filtering twice (\(2\times \) MF) or five times (\(5\times \) MF) per warping step, the results are worse. Finally, removing the median filtering step (w/o MF) makes the computed flow significantly less accurate with larger outliers as shown in Table 6 and Fig. 4.

Fig. 3

The median filtering is performed after every incremental warping step (i. e., once at every image pyramid level). The output of the median filtering is upsampled and used as the initial estimate for the next larger pyramid level

Fig. 4

Estimated flow fields on sequence “RubberWhale” using Classic-C with and without (w/o MF) the median filtering step. a (w/ MF) energy 502,387, b (w/o MF) energy 449,290, c color key (Baker et al. 2007). The median filtering step helps reach a solution free from outliers but with a higher energy. The flow fields have been normalized by their maximum magnitude, resulting in different contrasts. The outliers in the result without median filtering (b) make the flow appear lower contrast

One interesting result with HS is that repeatedly applying median filtering (20 times) at every warping step improves the HS formulation and the improvement is statistically significant (HS \(20 \times \) MF in Table 10).

Table 10 Additional results for HS: average end-point error (EPE) on the Middlebury training set

Good Practices: Median filtering the intermediate flow results once after every warping iteration is the single most important implementation detail here; \(5\times 5\) is a good filter size.

4.6 Best Practices

Combining the analysis above into a single approach means modifying the baseline to use the slightly non-convex generalized Charbonnier and the spline-based bicubic interpolation. This leads to a statistically significant improvement over the baseline (Table 6, Classic++). This method is directly descended from HS and BA, yet updated with the current best optimization practices known to us. This simple method ranks 32nd out of 73 methods in both EPE and AAE on the Middlebury test set at the writing of the paper (Sep. 2012). However, as we will see shortly, this method is not as “simple” as it appears: with the median filtering step, a different objective is being optimized than the original one. The same is true for the reported results of both HS and BA.

5 Models Underlying Median Filtering

Our analysis reveals the practical importance of median filtering during optimization. This effectively denoises the intermediate flow fields, preventing gross outliers and making even non-robust methods like HS more robust. We ask whether there is a principle underlying this heuristic.

One interesting observation is that flow fields obtained with median filtering have substantially higher energy than those without (Table 8; Fig. 4). If the median filter is helping to optimize the objective, it should lead to lower energies. Higher energies and more accurate estimates suggest that incorporating median filtering changes the objective function being optimized.

The insight that follows from this is that the median filtering heuristic is related to the minimization of an objective function that differs from the classical one. In particular the optimization of Eq.  1, with interleaved median filtering, approximately minimizes

$$\begin{aligned} E(\mathbf{u },\mathbf{v })&= \sum _{i, j} \Big \{\rho _D(I_1(i, j) - I_2(i+u_{i,j}, j+v_{i,j}))\nonumber \\&\qquad + \lambda [\rho _S(u_{i,j}-u_{i+1,j}) \!+ \rho _S(u_{i,j}-u_{i,j+1})\nonumber \\&\qquad + \rho _S(v_{i,j}-v_{i+1,j}) + \rho _S(v_{i,j}-v_{i,j+1})]\Big \} \nonumber \\&\qquad + \lambda _N \sum _{i,j} \sum _{(i',j')\in \mathcal{N }_{i,j}} (|{u}_{i,j} \!-\!{u}_{i',j'}| \!+\! |{v}_{i,j}-{v}_{i',j'}|), \end{aligned}$$
(2)

where \(\mathcal{N }_{i,j}\) is the set of neighbors of pixel \((i,j)\) in a possibly large area and \(\lambda _N\) is a scalar weight. The term in braces is the same as the flow energy from Eq. 1, while the last term is new. This non-local term (Buades et al. 2005; Gilboa and Osher 2008) imposes a particular smoothness assumption within a specified region of the flow field. Here we take this term to be a \(5\times 5\) rectangular region to match the size of the median filter in Classic-C. Figure 5 shows the neighborhood for the standard pairwise model and the non-local term.

Fig. 5

From left to right, neighborhood structure for the center (red) pixel for the standard pairwise model, the unweighted non-local model, the unweighted non-local model with a larger neighborhood, and the weighted non-local model. The standard pairwise model connects a center pixel with its nearest neighbors, while the non-local term connects a pixel with many pixels in a large spatial neighborhood. By assigning larger weights (thicker red edges) to neighbors that are more likely to be on the same surface (blue circles), the weighted non-local model incorporates spatial scene structure information

It is usually difficult to directly optimize the objective (2) with a large spatial term. A common practice is to relax the objective with an auxiliary flow field as

$$\begin{aligned} E_A(\mathbf{u },\mathbf{v }, \hat{\mathbf{u }}, \hat{\mathbf{v }})&= \sum _{i, j} \Big \{\rho _D(I_1(i, j) - I_2(i+u_{i,j}, j+v_{i,j}))\nonumber \\&\quad +\lambda [\rho _S(u_{i,j}-u_{i+1,j}) + \rho _S(u_{i,j}-u_{i,j+1}) \nonumber \\&\quad +\rho _S(v_{i,j}-v_{i+1,j}) + \rho _S(v_{i,j}-v_{i,j+1})]\Big \} \nonumber \\&\quad +\lambda _C (||\mathbf{u }-\hat{\mathbf{u }}||^2 + ||\mathbf{v }-\hat{\mathbf{v }}||^2)\nonumber \\&\quad + \lambda _N \sum _{i,j} \sum _{(i',j')\in \mathcal{N }_{i,j}} (|\hat{u}_{i,j}-\hat{u}_{i',j'}| + |\hat{v}_{i,j}-\hat{v}_{i',j'}|), \end{aligned}$$
(3)

where \(\hat{\mathbf{u }}\) and \(\hat{\mathbf{v }}\) denote an auxiliary flow field and \(\lambda _C\) is a scalar weight. A third (coupling) term encourages \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\) and \(\mathbf{u }, \mathbf{v }\) to be the same (cf. Wedel et al. 2009; Zach et al. 2007). Here the notation implies a pixelwise sum of squared errors between the auxiliary and main flow fields.

The connection to median filtering (as a denoising method) derives from the fact that there is a direct relationship between the median and L1 minimization. Consider a simplified version of Eq.  3 with just the coupling and non-local terms, where

$$\begin{aligned} E(\hat{\mathbf{u }}) \! = \! \lambda _C ||\mathbf{u }\!-\!\hat{\mathbf{u }}||^2 \!+\! \lambda _N \sum _{i,j} \sum _{(i',j')\in \mathcal{N }_{i,j}} |\hat{u}_{i,j}\!-\!\hat{u}_{i',j'}|. \end{aligned}$$
(4)

While minimizing this is similar to median filtering \(\mathbf{u }\), there are two differences. First, the non-local term minimizes the L1 distance between the central value and all flow values in its neighborhood except itself. Second, Eq.  4 incorporates information about the data term through the coupling equation; median filtering the flow ignores the data term.

The formal connection between Eq. 4 and median filtering is provided by Li and Osher (2009), who show that minimizing Eq. 4 is related to a different median computation

$$\begin{aligned} \hat{u}_{i,j}^{(k+1)} = \text {median}({\mathrm{Neighbors}}^{(k)} \cup {\mathrm{Data}}) \end{aligned}$$
(5)

where \(\mathrm{Neighbors}^{(k)}=\{\hat{u}^{(k)}_{i',j'} \}\) for \((i',j')\in \mathcal{N }_{i,j}\) and \(\hat{\mathbf{u }}^{(0)}=\mathbf{u }\) as well as

$$\begin{aligned} \mathrm{Data} = \{ u_{i,j}, u_{i,j} \pm \tfrac{\lambda _N}{\lambda _C}, u_{i,j} \pm \tfrac{2\lambda _N}{\lambda _C}, \ldots , u_{i,j} \pm \tfrac{|\mathcal{N }_{i,j}| \lambda _N}{2\lambda _C}\}, \end{aligned}$$

where \(|\mathcal{N }_{i,j}|\) denotes the (even) number of neighbors of \((i,j)\). Note that the set of “data” values is balanced with an equal number of elements on either side of the value \(u_{i,j}\) and that information about the data term is included through \(u_{i,j}\). Repeated application of Eq.  5 converges rapidly (Li and Osher 2009).
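
For illustration, one iteration of Eq. 5 at a single pixel can be written directly in Matlab; the weights and neighbor values below are arbitrary stand-ins.

```matlab
% One iteration of Eq. (5) at a single pixel for a 5x5 neighborhood
% (|N| = 24 neighbors, excluding the center).
lambdaN   = 1;    lambdaC = 0.01;           % stand-in weights
uC        = 0.3;                            % data value u_{i,j}
neighbors = uC + 0.05*randn(24, 1);         % current estimates \hat{u}^{(k)}

k    = (1:12)';                             % |N|/2 = 12 offsets per side
data = [uC; uC + k*lambdaN/lambdaC; uC - k*lambdaN/lambdaC];
uHatNew = median([neighbors; data]);        % Eq. (5)
```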

Observe that, as \(\lambda _N/\lambda _C\) increases, the weighted data values on either side of \(u_{i,j}\) move away from the values of \(\mathrm{Neighbors}\) and cancel each other out. As this happens, Eq.  5 approximates the median at the first iteration

$$\begin{aligned} \hat{u}_{i,j}^{(1)}\approx \text {median}( \mathrm{Neighbors}^{(0)} \cup \{ u_{i,j} \}). \end{aligned}$$
(6)

Equation 3 thus combines the original objective with an approximation to the median, the influence of which is controlled by \(\lambda _N/\lambda _C\). Note in practice the weight \(\lambda _C\) on the coupling term is usually small or is steadily increased from small values (Wedel et al. 2008b; Zach et al. 2007). We optimize the new objective (3) by alternately minimizing

$$\begin{aligned} E_O(\mathbf{u },\mathbf{v })&= \sum _{i, j} \Big \{\rho _D(I_1(i, j) - I_2(i+u_{i,j}, j+v_{i,j})) \nonumber \\&\qquad \ \ \,+ \lambda [\rho _S(u_{i,j}-u_{i+1,j}) \!+ \rho _S(u_{i,j}-u_{i,j+1}) \nonumber \\&\qquad \ \ \, + \rho _S(v_{i,j}-v_{i+1,j}) + \rho _S(v_{i,j}-v_{i,j+1})]\Big \} \nonumber \\&\qquad \ \ \, + \lambda _C ( ||\mathbf{u }-\hat{\mathbf{u }}||^2 + ||\mathbf{v }-\hat{\mathbf{v }}||^2) \end{aligned}$$
(7)

and

$$\begin{aligned} E_M(\hat{\mathbf{u }}, \hat{\mathbf{v }})&= \lambda _C (||\mathbf{u }\!-\!\hat{\mathbf{u }}||^2 \!+\! ||\mathbf{v }-\hat{\mathbf{v }}||^2)\nonumber \\&\quad + \lambda _N \sum _{i,j} \sum _{(i',j')\in \mathcal{N }_{i,j}} (|\hat{u}_{i,j}\!-\!\hat{u}_{i',j'}| \!+\! |\hat{v}_{i,j}\!-\!\hat{v}_{i',j'}|). \end{aligned}$$
(8)

We find that optimizing the coupled set of equations is superior, in terms of EPE performance, to directly optimizing the objective in Eq. 2.

The alternating optimization strategy first holds \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\) fixed and minimizes Eq.  7 w. r. t. \(\mathbf{u }, \mathbf{v }\). Then, with \(\mathbf{u }, \mathbf{v }\) fixed, we minimize Eq.  8 w. r. t. \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\). Note that Eqs.  4 and 8 can be minimized by repeated application of Eq.  5; we use this approach with five iterations. We perform 10 steps of alternating optimizations at every pyramid level and change \(\lambda _C\) logarithmically from \(10^{-4}\) to \(10^{2}\). During the first and second GNC stages, we set \(\mathbf{u }, \mathbf{v }\) to be \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\) after every warping step (this replacement step helps reach solutions with lower energy and EPE than without performing this step; see Classic-C–A-noRep in Tables 11, 12). In the end, we take \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\) as the final flow field estimate. The other parameters are \(\lambda = 5, \lambda _N = 1\).
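
The following skeleton summarizes this alternation at one pyramid level. Here solve_flow_increment and li_osher_median are hypothetical placeholders for the sub-problems of Eqs. 7 and 8 (they are not library routines), so the sketch conveys the schedule rather than a runnable solver.

```matlab
% Skeleton of the alternating optimization at one pyramid level.
lambdaC_schedule = logspace(-4, 2, 10);     % lambda_C from 1e-4 to 1e2
lambdaN = 1;

for step = 1:10
    lambdaC = lambdaC_schedule(step);
    % Eq. (7): update (u, v) with the quadratic coupling to (uHat, vHat).
    [u, v] = solve_flow_increment(I1, I2, u, v, uHat, vHat, lambdaC);
    % Eq. (8): update (uHat, vHat) by five applications of Eq. (5).
    for it = 1:5
        [uHat, vHat] = li_osher_median(u, v, uHat, vHat, lambdaC, lambdaN);
    end
    u = uHat;  v = vHat;   % replacement step (first two GNC stages only)
end
```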

Table 11 Average end-point error (EPE) on the Middlebury training set is shown for the new model with alternating optimization (Classic-C–A)
Table 12 Energy (\(\times 10^6\), Eq. 3) for the computed flow fields on the Middlebury training set
Table 13 Average end-point error (EPE) on the Middlebury training set for the proposed new objective with the non-local term and alternating optimization (Classic-C–A) and its improved models

Alternately optimizing this new objective function (Classic-C–A) leads to results similar to the baseline Classic-C (Tables 11, 13). We also compare the energy of these solutions under the new objective and find that the alternating optimization produces the lowest-energy solutions, as shown in Table 12.

We find that approximately optimizing the new objective by changing \(\lambda _C\) logarithmically from \(10^{-4}\) to \(10^{-1}\) has slightly better EPE results but higher energy solutions (Classic-C–A-II). We also try replacing the absolute value by the Charbonnier penalty function and using the conjugate gradient descent method (http://www.gaussianprocess.org/gpml/code/matlab/util/minimize.m; last accessed 24 July 2013) to solve Eq.  4 but obtain results with slightly worse EPE performance and higher energy.

In summary, we show that the heuristic median filtering step in Classic-C can now be viewed as energy minimization of a new objective with a non-local term. The explicit formulation emphasizes the value of robustly integrating information over large neighborhoods and enables the improved model described below.

6 Improved Model

By formalizing the median filtering heuristic as an explicit objective function, we can find ways to improve it. While median filtering in a large neighborhood has advantages as we have seen, it also has problems. A neighborhood centered on a corner or thin structure is dominated by the surround and computing the median results in oversmoothing as illustrated in Fig. 1.

Examining the non-local term suggests a solution. For a given pixel, if we know which other pixels in the area belong to the same surface, we can weight them more highly. The modification to the objective function is achieved by introducing a weight into the non-local term (Buades et al. 2005; Gilboa and Osher 2008):

$$\begin{aligned} \sum _{i,j} \sum _{(i',j')\in N_{i,j}} w_{i,j}^{i',j'}(|\hat{u}_{i,j}-\hat{u}_{i',j'}| + |\hat{v}_{i,j}-\hat{v}_{i',j'}|), \end{aligned}$$
(9)

where \(w_{i,j}^{i',j'}\) represents how likely pixel \(i',j'\) is to belong to the same surface as \(i, j\).

Of course, we do not know \(w_{i,j}^{i',j'}\), but we can approximate it. Drawing ideas from Sand and Teller (2008), Xiao et al. (2006), and Yoon and Kweon (2006), we define the weights according to the spatial distance, the color-value distance, and the occlusion state of the neighboring pixels as

$$\begin{aligned} w_{i,j}^{i',j'}\! \propto \! \exp \!\Big \{ \!-\! \tfrac{|i-i'|^2 \!+\! |j-j'|^2}{2\sigma ^2_1} -\tfrac{| \mathbf I (i,j) \!- \! \mathbf I (i',j')|^2}{2\sigma ^2_2 n_c} \! \Big \} \frac{o(i',j')}{o(i,j)}, \end{aligned}$$
(10)

where \(\mathbf I (i,j)\) is the color vector in the Lab space, \(n_c\) is the number of color channels, \(\sigma _1 = 7, \sigma _2 = 7\), and the occlusion variable \(o(i,j)\) is calculated using Eq.  22 in Sand and Teller (2008) as

$$\begin{aligned} o(i,j) \!= \! \exp \! \left\{ \! -\frac{ d^2(i,j) }{2\sigma ^2_d} \!-\! \frac{ \big (I(i,j) \!-\! I(i\!+\!u_{i,j}, j\! +\! v_{i,j}) \big )^2 }{2\sigma ^2_e} \! \right\} \!, \end{aligned}$$
(11)

where \(d(i,j)\) is the one-sided divergence function, defined as

$$\begin{aligned} d(i,j) = \left\{ \begin{array}{ll} \mathrm{div}(i,j), &{} \mathrm{div}(i,j) < 0\\ 0,&{} \mathrm{otherwise}\\ \end{array}\right. \end{aligned}$$
(12)

in which the flow divergence \(\mathrm{div}(i,j)\) is

$$\begin{aligned} \mathrm{div}(i,j) = \frac{\partial }{\partial x} u(i,j) + \frac{\partial }{\partial y} v(i,j), \end{aligned}$$
(13)

where \(\frac{\partial }{\partial x}\) and \(\frac{\partial }{\partial y}\) are respectively the horizontal and vertical flow derivatives. The occlusion variable \(o(i,j)\) is near zero for occluded pixels and near one for non-occluded pixels. We set the parameters in Eq. 11 as \(\sigma _d = 0.3\) and \(\sigma _e = 20\), the same values as Sand and Teller (2008). Note that the occlusion state depends nonlinearly on the unknown flow field, so we calculate it using the latest flow estimate.
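
A sketch of this weight computation is given below, using stand-in data and the stated parameter values; the color-difference computation assumes implicit expansion (Matlab R2016b or later), and the weights are computed up to the omitted normalization constant.

```matlab
% Sketch of the weights of Eqs. (10)-(13) with sigma1 = sigma2 = 7,
% sigma_d = 0.3, sigma_e = 20 (the latter presumes a 0-255 intensity range).
H = 100;  W = 120;
lab = rand(H, W, 3);                    % image in Lab space (stand-in)
I1  = rand(H, W);    I2 = rand(H, W);   % grayscale frames (stand-in)
u   = zeros(H, W);   v  = zeros(H, W);  % current flow estimate

% Flow divergence (Eq. 13) and its one-sided version (Eq. 12).
[ux, ~] = gradient(u);  [~, vy] = gradient(v);
d = min(ux + vy, 0);

% Occlusion variable (Eq. 11), using the warped second image.
[X, Y] = meshgrid(1:W, 1:H);
warpI2 = interp2(I2, X + u, Y + v, 'cubic', 0);
o = exp(-d.^2 / (2*0.3^2) - (I1 - warpI2).^2 / (2*20^2));

% Weights (Eq. 10) over the 15x15 neighborhood of pixel (i, j).
i = 50;  j = 60;  sigma1 = 7;  sigma2 = 7;  nc = 3;
[jj, ii] = meshgrid(j-7:j+7, i-7:i+7);
spatial  = exp(-((ii - i).^2 + (jj - j).^2) / (2*sigma1^2));
colDiff  = sum((lab(i-7:i+7, j-7:j+7, :) - lab(i, j, :)).^2, 3);
color    = exp(-colDiff / (2*sigma2^2*nc));
w = spatial .* color .* o(i-7:i+7, j-7:j+7) / o(i, j);
```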

Examples of such weights are shown for several \(15\times 15\) neighborhoods in Fig. 6; bright values indicate higher weights. Note the neighborhood labeled d, corresponding to the rifle. Since pixels on the rifle are in the minority, an unweighted median oversmooths (Classic++ in Fig. 1). The weighted term instead robustly estimates the motion using values on the rifle. A closely related piece of work is by Ren (2008), who uses the intervening contour to define affinities among neighboring pixels for the local Lucas and Kanade (1981) method. However, that method uses this scheme only to estimate motion at sparse points and then interpolates the dense flow field.

Fig. 6

Neighbor weights of the proposed weighted non-local term at different positions in the “Army” sequence. We use color, spatial distance, and occlusion cues to determine whether the neighboring pixels are likely to belong to the same surface. Among these cues, color is the most powerful (see Table 14 and text for an evaluation of the cues)

We approximately solve for \(\hat{\mathbf{u }}\) (and similarly \(\hat{\mathbf{v }}\)) using the following weighted median problem

$$\begin{aligned} \min _{\hat{u}_{i,j}} \sum _{(i',j')\in N_{i,j} \cup \{ i,j \} } w_{i,j}^{i',j'}|\hat{u}_{i,j}-u_{i',j'}|, \end{aligned}$$
(14)

which we solve using formula (3.13) of Li and Osher (2009) at every pixel (Classic+NL-Full). Note that if all the weights are equal, the solution is simply the median. In practice, we can adopt a fast version (Classic+NL) without performance loss: given a current estimate of the flow, we detect motion boundaries using a Sobel edge detector and dilate these edges with a \(5\times 5\) mask to obtain flow boundary regions. In these regions we use the weighting in Eq. 10 in a \(15\times 15\) neighborhood; in the non-boundary regions, we use equal weights in a \(5 \times 5\) neighborhood to compute the median.
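
The weighted median subproblem can be solved by sorting the values and accumulating the weights; the function below is a generic sketch of this computation (it minimizes the same objective as formula (3.13) of Li and Osher 2009, which we use in practice). For the fast version, the flow boundary mask could be obtained, for example, with edge applied to the flow components followed by imdilate with a \(5\times 5\) structuring element.

```matlab
% Generic weighted median solver for Eq. (14): the minimizer of
% sum_k w_k |x - u_k| over x. Save as weighted_median.m; usage at (i, j):
%   uHat(i,j) = weighted_median(uWin(:), wWin(:))
% where uWin holds the flow values over N_{i,j} (plus the center) and
% wWin the corresponding weights from Eq. (10).
function m = weighted_median(u, w)
    [uSorted, order] = sort(u(:));
    wSorted = w(order);
    wSorted = wSorted(:) / sum(wSorted);   % normalize the weights
    c   = cumsum(wSorted);
    idx = find(c >= 0.5, 1, 'first');      % first value past half the mass
    m   = uSorted(idx);
end
```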

To further reduce the computation, we can adopt a two-stage GNC process and perform three warping steps per pyramid level. This fast version (Classic+NL-Fast) has nearly the same overall performance, with a slight decline in performance on the “Urban3” sequence, which has large motions; with an iterative warping scheme, large motions require more iterations.

Tables 14 and 15 show that the weighted non-local term (Classic+NL) improves the accuracy on both the training and the test sets, especially in the motion boundary regions. Note that the fine detail of the “rifle” is preserved in Fig. 1e. At the writing of this paper (Sep. 2012), Classic+NL ranks 13th in both AAE and EPE. Figures 7 and 8 show some of the results on the Middlebury dataset.

Table 14 Average end-point error (EPE) on the Middlebury training set is shown for the improved model and its variants
Table 15 Average end-point error (EPE) on the Middlebury test set for the Classic++ model with two different preprocessing techniques and its improved model
Fig. 7

Results on the Middlebury test set. Top to bottom: “Teddy”, “Wooden”, and “Grove”. Classic+NL uses information from the color image to detect and preserve fine motion details. Note that the ground truth visualization from the Middlebury website has been compressed and has lower quality than the actual ground truth

Fig. 8

Results on other Middlebury test sequences. Top left “Mequon”; top right “Schefflera”; bottom left “Urban”; bottom right “Yosemite”

We study some variants of the weighted non-local term (Classic+NL). Tables 14 and 16 show the importance of each cue in determining the weights and the influence of the parameter settings on the final results. Using a different color space results in some performance decline. Using grayscale pixel values (Gray) or not using the static image information at all (w/o color) degrades performance significantly. Removing the occlusion (w/o occ) or spatial distance (w/o spa) cue does not degrade the performance significantly. The method is robust to the setting of \(\sigma _2\) for the color cue: 5 and 10 perform similarly to the default 7. The default \(\lambda \) is 3, while 1 and 9 result in some loss in performance. We also study the maximum size of the neighborhood for the non-local term and find that \(11\times 11\) gives similar performance, while \(19\times 19\) is slightly better.

Table 16 Average end-point error (EPE) on the Middlebury training set for the proposed new objective with the weighted non-local term and its variants

6.1 Closely-Related Work

Werlberger et al. (2010) independently propose a non-local regularization term for optical flow estimation that is similar to our non-local term. They use zero-mean normalized cross correlation as the data term to deal with lighting changes. Their work is motivated by the success of non-local regularization (Buades et al. 2005) in image restoration and stereo, whereas ours is inspired by the success of the heuristic median filtering step in flow estimation, which we formalize as a non-local regularization term. Their GPU-based C++ implementation is faster than our Matlab implementation. Classic+NL has lower average EPE on the Middlebury test sequences: 0.319 versus 0.388 (cf. Table 2). Readers can visually compare the results of both methods on the Middlebury website.

6.2 Results on the MIT Dataset

To test the robustness of these models on other data, we applied HS, Classic-C, and Classic+NL to sequences from the MIT dataset (Liu et al. 2008) and compared the estimated flow fields to the human-labeled ground truth. Note that only five of the eight test sequences of Liu et al. (2008) are available online; these are the ones tested here.

Figure 9 and Table 17 show the results on these sequences, which are very different in nature from the Middlebury set and include an outdoor scene as well as a scene of a fish tank. The results are compared with the CLG method (Bruhn et al. 2005) used by Liu et al. (2008). It is important to point out that the CLG method was tuned to obtain optimal results on these test sequences. Our methods had no such tuning; we used the same parameters as in all the other experiments. This suggests that training on the Middlebury data yields a method that generalizes to other sequences. The only place where this fails is the “Fish” sequence, where there is transparent motion in a liquid medium; the statistics of this sequence are very different from those of the Middlebury training data.

Fig. 9

Results on MIT sequences. Top to bottom “Table”, “Hand”, “Toy”, “Fish”, and “CameraMotion”

6.3 Performance on MPI Sintel and KITTI Datasets

We evaluate the methods above (corresponding to our publicly released code) on the MPI Sintel (Butler et al. 2012) and the KITTI (Geiger et al. 2012) datasets using the default parameter settings from our conference paper (Sun et al. 2010a). As summarized in Tables 18, 19, and 20, the conclusions contradict our findings reported above. On the MPI Sintel dataset, HS outperforms Classic++, which in turn outperforms Classic+NL-fast. The only consistent result is Classic+NL, which achieves the best performance. On the KITTI dataset, HS outperforms Classic+NL.

Table 17 Results on the MIT dataset (Liu et al. 2008)
Table 18 Average end-point error (EPE) on the MPI Sintel training set
Table 19 Average end-point error (EPE) on the MPI Sintel test set

We ask how these datasets differ from both Middlebury and the MIT dataset. What could lead to these inconsistent conclusions? One answer, surprisingly, lies in the unequal width and height of the images.

6.4 Asymmetric Pyramids for Wide-Aspect-Ratio Video

Our original implementation downsamples the image equally in the horizontal and vertical dimensions. The method automatically determines the number of pyramid levels from the smaller of the height and width of the input image. This scheme works well when the width-to-height ratio is close to 1, as in the Middlebury sequences. In contrast, the MPI Sintel images are \(1,024 \times 436\) and the KITTI images are around \(1,226 \times 370\). The small vertical dimension limits the number of pyramid levels, while the large horizontal dimension means that the sequences contain very large horizontal motions. As a result, at the top level of the pyramid, the horizontal motions can still be much larger than a pixel.

To address this, we use an unequal downsampling factor in each direction to ensure that the motion at the top pyramid level is small in both directions (or at least similar). For the MPI Sintel and KITTI data sets, we use a downsampling factor of 0.5 in the horizontal direction and determine the vertical downsampling factor and the number of pyramid levels so that the size of the top pyramid level is around \(16\times 16\).
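
One plausible way to compute the asymmetric pyramid parameters is sketched below for the MPI Sintel frame size; the exact rounding in our released code may differ, but this choice reproduces the 7-level pyramid discussed next.

```matlab
% Asymmetric pyramid parameters for a 1024x436 frame (illustrative).
H  = 436;  W = 1024;
dx = 0.5;                                    % horizontal downsampling factor
nLevels = 1 + floor(log(W/16) / log(1/dx));  % shrink the width to ~16 px
dy      = (16/H)^(1/(nLevels - 1));          % vertical factor for ~16 px
topW    = W * dx^(nLevels - 1);              % = 16
topH    = H * dy^(nLevels - 1);              % = 16, with nLevels = 7
```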

For MPI Sintel and KITTI this scheme results in a 7-level pyramid (instead of the 5-level pyramid produced by the standard symmetric scheme). This yields a significant improvement on both the MPI Sintel and KITTI datasets, as summarized in Tables 18, 19, and 20. We denote methods that use the new asymmetric pyramid by appending a “P” to the name.

Table 20 Percentage of pixels with EPE larger than 3 pixels in non-occluded (Out-Noc) and all (Out-All) areas, and average EPE in non-occluded (Avg-Noc) and all (Avg-All) areas, on the KITTI test set

On MPI Sintel, the results of the four methods are consistent with those on the Middlebury dataset. Note that even Classic++P outperforms the previous Classic+NL. Classic+NLP outperforms MDP-flow2 (Xu et al. 2012) on the final set, but not on the clean set. MDP-flow2 uses feature matching to deal with fast-moving objects; feature matching tends to work well on the clean set but not on the final set, due to the motion and optical blur in the latter. Figure 10 shows a visual comparison of Classic+NLP and Classic+NL. The asymmetric pyramid leads to a significant improvement in regions that undergo large motions.

Fig. 10
figure 10

Example results on the MPI Sintel dataset. From top to bottom: first frame, second frame, results of Classic+NL (5-level), results of Classic+NLP (7-level), and ground truth. The asymmetric pyramid leads to a significant improvement in large regions undergoing large motion (the head of the dragon on the left and the background on the right). EPE results: “temple2” (left), 18.04 for Classic+NL (5-level) and 12.92 for Classic+NLP (7-level); “cave2” (right), 52.208 for Classic+NL (5-level) and 26.565 for Classic+NLP (7-level). Note that the estimated motion for fast-moving objects still contains large errors

On the KITTI set, Classic++P performs best among all our tested methods, on both the training and the test sets. Note that the KITTI sequences were collected from a moving vehicle in an urban environment; the flow fields tend to be smooth with few flow boundaries, and the image-independent smoothness assumption in Classic++P is better suited to such data. Figure 11 shows some results for Classic+NL-FastP and Classic+NL-fast; note the dramatic improvement resulting from the asymmetric pyramid.

Fig. 11
figure 11

Example results on the KITTI dataset. From top to bottom: first frame, second frame, results by Classic+NL-fast (5-level), results by Classic+NL-FastP (7-level), and ground truth for the non-occluded feature points. EPE results in non-occluded sparse feature points: “000002” (left), 12.124 by Classic+NL-fast (5-level) and 2.444 by Classic+NL-FastP (7-level); “000030” (right), 20.554 by Classic+NL-fast (5-level) and 0.615 by Classic+NL-FastP (7-level)

It is important to note that, apart from the change of pyramid scheme, all other parameters remain the same and are trained on the Middlebury training sequences.

6.5 Computational Time

Table 21 summarizes the running time of the evaluated methods on typical sequences from three different datasets, in Matlab on a 64-bit Linux desktop with 8 GB of memory. The additional cost from HS to Classic++ comes from the GNC stage and the non-convex penalty function. The additional cost from Classic++ to Classic+NL comes from the weighted median filtering step applied around detected motion boundaries. Applying the weighted median operation to all pixels (Classic+NL-Full) more than triples the running time with little performance gain. Using fewer iterations (Classic+NL-Fast) significantly reduces the computational cost with little performance loss, especially on sequences with small motion. Note that we solve the weighted median problem at each pixel individually and do not reuse the sorting results from neighboring pixels. Future work should consider reformulating the weighted median filtering so that a convolution-type operation can be used to reduce the computational cost.
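For reference, the weighted median at a single pixel can be computed by sorting the neighborhood values and returning the value at which the cumulative weight first reaches half of the total weight; the sketch below (our own illustration, with hypothetical names) shows this per-pixel computation, whose repeated sorting is what makes the step expensive.

    import numpy as np

    def weighted_median(values, weights):
        """Weighted median of a 1-D neighborhood: the value at which the
        cumulative weight first reaches half of the total weight.
        A per-pixel illustration only; the interface is ours."""
        values = np.asarray(values, dtype=float)
        weights = np.asarray(weights, dtype=float)
        order = np.argsort(values)
        v, w = values[order], weights[order]
        cdf = np.cumsum(w)
        return v[np.searchsorted(cdf, 0.5 * cdf[-1])]

    # With uniform weights this reduces to the ordinary (unweighted) median.
    print(weighted_median([3.0, 1.0, 4.0, 1.0, 5.0], [1, 1, 1, 1, 1]))  # 3.0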

Table 21 Running time (in minutes) for computing one optical flow field from an image pair from different benchmark datasets using different methods, in Matlab on a 64-bit Linux desktop with 8 GB of memory

6.6 Limitations

Classic+NL produces larger errors in occlusion regions on some sequences, such as “Schefflera”, shown in Fig. 12. The classical flow formulation assumes that every pixel in the current frame has a corresponding pixel in the next frame. This assumption breaks down in regions of occlusion: pixels occluded by foreground objects in one frame have no corresponding pixels in the next, resulting in large errors for classical formulations. In contrast, a layered model (Wang and Adelson 1994) may provide a principled way to reason about occlusions. The motion model developed in this paper has enabled a recent layered approach (Sun et al. 2010b) to achieve a consistent improvement over the Classic+NL method, particularly near occlusion and motion boundary regions.

Fig. 12
figure 12

Occlusions are not explicitly modeled by Classic+NL and may cause problems in the estimated flow field. Dark pixels in the ground truth indicate occlusions

Small, fast-moving objects also cause problems for the classical coarse-to-fine estimation used by Classic+NL, as shown in Fig. 10. The work by Brox and Malik (2011) on large-displacement optical flow has inspired recent work (Chen et al. 2013; Steinbrücker et al. 2009; Xu et al. 2012) that embeds feature matching into the coarse-to-fine estimation framework. Chen et al. (2013) show that, with proper initialization, Classic+NL can also handle large-displacement optical flow on the Middlebury dataset.

7 Conclusions

When implemented using modern practices, classical optical flow formulations can produce fairly competitive results on existing datasets. To understand the techniques that help such basic formulations work well, we quantitatively studied various aspects of flow approaches from the literature, including their implementation details. Among the best practices, we found that using median filtering to denoise the flow after every warping step is key to improving accuracy, but that this increases the energy of the final result. Exploiting connections between median filtering and L1-based denoising, we showed that algorithms relying on a median filtering step are approximately optimizing a different objective that regularizes the flow field over a large spatial neighborhood. Understanding this enables us to design and optimize improved models that weight the neighbors adaptively in an extended image region. The Matlab code is publicly available at http://www.cs.brown.edu/people/dqsun; last accessed 24 July 2013.

There has been much debate about whether methods that perform well on Middlebury will generalize to other sequences. Here we tuned the parameters of the method on the Middlebury training set and tested on Middlebury, MIT HAMA, MPI Sintel, and KITTI. The conclusions on the Middlebury dataset are consistent with those on the MIT HAMA dataset. The one significant difference we found between Middlebury and the MPI Sintel and KITTI datasets was the aspect ratio of the images. This allowed us to modify the method by introducing a novel asymmetric image pyramid that downsamples more rapidly in the horizontal direction than in the vertical direction. With only this change, we found that our conclusions on Middlebury hold for MPI Sintel as well. The KITTI dataset is somewhat different in nature and seems to favor methods with more spatial smoothing. As a result, the image-independent Classic++, which produces smoother flow fields, performs slightly better than the image-dependent Classic+NL, with its sharp boundaries. It remains an open question whether these conclusions will hold for data captured under very different conditions, such as medical images. While the results on Middlebury generalize surprisingly well, we suspect that training the parameters for a specific dataset would improve results further.