1 Introduction

The field of optical flow estimation is making steady progress as evidenced by the increasing accuracy of current methods on the Middlebury optical flow benchmark (Baker et al. 2007). After over 30 years of research, these methods have obtained an impressive level of reliability and accuracy (Wedel et al. 2008b, 2009; Werlberger et al. 2009; Xu et al. 2012; Zimmer et al. 2009). But what has led to this progress? The majority of today’s methods strongly resemble the original formulation of Horn and Schunck (HS, 1981). They combine a data term that assumes constancy of some image property with a spatial term that models how the flow is expected to vary across the image. An objective function combining these two terms is then optimized. Given that this basic structure is unchanged since HS, what has enabled the performance gains of modern approaches?

The paper has three parts. In the first, we perform a study of recent optical flow methods and models. The most accurate methods on the Middlebury flow dataset make different choices about how to model the objective function, how to approximate this model to make it computationally tractable, and how to optimize it. Since most published methods change all of these properties at once, it can be difficult to know which choices are most important. To address this, we define a baseline algorithm that is “classical”, in that it is a direct descendant of the original HS formulation, and then systematically vary the model and method using different techniques from the art. The results are surprising. We find that only a small number of key choices produce statistically significant improvements and that they can be combined into a very simple method that achieves reasonable accuracy. More importantly, our analysis reveals what makes current flow methods work so well.

Part two examines the principles behind this success. We find that one algorithmic choice produces the most significant improvements: applying a median filter to intermediate flow values during incremental estimation and warping (Wedel et al. 2008b, 2009). While this heuristic improves the accuracy of the recovered flow fields, it actually increases the energy of the objective function. This suggests that what is being optimized is actually a new and different objective. Using observations about median filtering and L1 energy minimization from Li and Osher (2009), we formulate a new non-local term that is added to the original, classical objective. This new term goes beyond standard local (pairwise) smoothness to robustly integrate information over large spatial neighborhoods. We show that minimizing this new energy approximates the original optimization with the heuristic median filtering step. Note, however, that the new objective falls outside our definition of classical methods.

Once the median filtering heuristic is formulated as a non-local term in the objective, we immediately recognize how to modify and improve it. In part three we show how information about image structure and flow boundaries can be incorporated into a weighted version of the non-local term to prevent over-smoothing across boundaries. By incorporating structure from the image, this weighted version does not suffer from some of the errors produced by median filtering and better preserves motion boundaries. Figure 1 illustrates optical flow estimates for a range of methods from a “basic” HS method to our proposed Classic+NL method.

Fig. 1

Estimated optical flow on the Middlebury test “Army” sequence. Left to right: a an old implementation of the Horn and Schunck (HS) method (Sun et al. 2008), b a new implementation with current practices, c a modern implementation of a robust version, d an improved model that uses a non-local spatial term to robustly integrate information over a large spatial neighborhood, e ground truth from the Middlebury website (downsampled and JPEG compressed; original ground truth is withheld), and f the first frame. Color coding as in (Baker et al. 2007), shown in Fig. 4c. Average end-point error (EPE): a 0.22, b 0.12, c 0.09, and d 0.08

Finally we observe that the classical methods all go beyond the original HS algorithm by using a spatial pyramid to cope with large motions. The classical pyramid downsamples the image equally in the horizontal and vertical directions, typically until some minimum image dimension is reached. With today’s wide-aspect-ratio video, we point out that an asymmetric approach can be employed, resulting in a pyramid that downsamples more in the horizontal direction than in the vertical one. This effectively allows the estimation of larger horizontal motions. This simple change results in significant improvements on the wide-aspect-ratio video in the KITTI (Geiger et al. 2012) and MPI Sintel (Butler et al. 2012) datasets.

At the time of writing our previous conference paper (Sun et al. 2010a, March), the resulting approach was ranked 1st in both angular and end-point errors in the Middlebury evaluation. At the writing of this paper (Sep. 2012), the method, Classic+NL, ranks 13th in both AAE and EPE. Several recent and high-ranking methods directly build on Classic+NL, such as layered models (Sun et al. 2010b, 2012, 2013), methods with more advanced motion prior models (Chen et al. 2012; Jia et al. 2011), efficient optimization schemes for the non-local term (Krähenbühl and Koltun 2012), and better initialization to deal with large displacement optical flow (Chen et al. 2013).

Compared to the conference version (Sun et al. 2010a), this paper includes many more detailed results and analyses. In addition to an expanded literature review we compare our proposed method to the closely related non-local total variation method (Werlberger et al. 2010). We discuss the limitations of our method in dealing with occlusions and fast moving objects. We report results on the MIT HAMA data set (Liu et al. 2008) and find that the results are consistent with those on Middlebury. We also test our methods on the MPI Sintel (Butler et al. 2012) and KITTI (Geiger et al. 2012) datasets, which offer greater challenges. Using the same parameters tuned on the Middlebury training set, our method performs well on these new datasets, particularly using an asymmetric pyramid.

In summary, the contributions of this paper are to (1) analyze current flow models and methods to understand which design choices matter; (2) formulate and compare several classical objectives descended from HS using modern methods; (3) formalize one of the key heuristics and derive a new objective function that includes a non-local spatial smoothness term; (4) modify this new objective to produce a competitive method; (5) extend spatial pyramids to exploit the extra width of high-definition and letterbox videos. In doing so, we provide a “recipe” for others studying optical flow that can guide their design choices. Finally, to enable comparison and further innovation, we provide a public Matlab implementation (http://www.cs.brown.edu/people/dqsun; last accessed 24 July 2013).

2 Previous Work

It is important to separately analyze the contributions of the objective function that defines the problem (Sect. 2.1) and the optimization algorithm and implementation used to minimize it (Sect. 2.2). The HS formulation, for example, has long been thought to be highly inaccurate. Barron et al. (1994) reported an average angular error (AAE) of \(\sim 30^{\circ }\) on the “Yosemite” sequence. This confounds the objective function with the particular optimization method proposed by Horn and Schunck. Horn and Schunck noted that the correct way to optimize their objective is by solving a system of linear equations, as is common today. This was impractical on the computers of the day, hence they used a heuristic method. In fact, Barron et al. note that the original HS derivatives were implemented crudely and report a modified version of HS with AAE around \(11^{\circ }\). When optimized with today’s methods, the HS objective achieves surprisingly competitive results (Geiger et al. 2012) despite the expected over-smoothing and sensitivity to outliers. The reported accuracy of a method is jointly determined by the objective function, the optimization techniques, the implementation details, and the parameter tuning/learning (cf. Marr 1982; Szeliski 2010). We review related research in the context of the first three aspects below.

2.1 Models

The global formulation of optical flow introduced by Horn and Schunck (1981) relies on both brightness constancy and spatial smoothness assumptions, but suffers from the fact that their quadratic formulation is not robust to outliers. Shulman and Herve (1989) use an L1 penalty instead to preserve flow discontinuities. Black and Anandan (1996) introduce a robust framework to deal with outliers in both the data and the spatial terms. Subsequently, many different robust functions have been explored (Brox et al. 2004; Lempitsky et al. 2008; Sun et al. 2008) and it remains unclear which is best. We refer to all these spatially-discrete formulations derived from HS as “classical.” We systematically explore variations in the formulation and optimization of these approaches. The surprise is that the classical model, appropriately implemented, remains fairly competitive.

There are many formulations beyond the classical ones that we do not consider here. Significant ones use oriented smoothness (Nagel and Enkelmann 1986; Sun et al. 2008; Wedel et al. 2009; Zimmer et al. 2011, 2009), rigidity constraints (Wedel et al. 2008a, 2009), an over-parameterized smoothness term (Nir et al. 2008), or image segmentation (Black and Jepson 1996; Lei and Yang 2009; Xu et al. 2008; Zitnick et al. 2005). While they deserve similar careful consideration, we expect many of our conclusions to carry forward. Note that one can select among a set of models or methods for a given sequence (Mac Aodha et al. 2010), instead of finding a “best” model for all the sequences.

2.2 Methods

Many of the implementation details that are thought to be important date back to the early days of optical flow. Current best practices include coarse-to-fine estimation to deal with large motions (Bergen et al. 1992; Brox et al. 2004), texture decomposition (Wedel et al. 2008a, b) or high-order filter constancy (Adelson et al. 1984; Brox et al. 2004; Glazer et al. 1983; Lempitsky et al. 2010; Zimmer et al. 2009) to reduce the influence of lighting changes, incremental warping (Bergen et al. 1992), warping with bicubic interpolation (Lempitsky et al. 2008; Wedel et al. 2008b), temporal averaging of image derivatives (Horn 1986; Wedel et al. 2008b), graduated non-convexity (Blake and Zisserman 1987) to minimize non-convex energies (Black and Anandan 1996; Sun et al. 2008), and median filtering after each incremental estimation step to remove outliers (Wedel et al. 2008b).

This median filtering heuristic is of particular interest as it makes non-robust methods more robust and improves the accuracy of all methods we tested. The effect on the objective function and the underlying reason for its success have not previously been analyzed. Least median squares estimation can be used to robustly reject outliers in flow estimation (Bab-Hadiashar and Suter 1998), but previous work has focused on the data term.

Related to median filtering, and our new non-local term, is the use of bilateral filtering to prevent smoothing across motion boundaries (Xiao et al. 2006). This approach separates a variational method into two filtering update stages, and replaces the original anisotropic diffusion process with multi-cue driven bilateral filtering. As with median filtering, the bilateral filtering step changes the original energy function.

Models that are formulated with an L1 robust penalty are often coupled with specialized total variation (TV) optimization methods (Zach et al. 2007). Here we focus on generic optimization methods that can apply to most models and find that the estimated flow fields are as accurate as the reported results for specialized methods.

Despite recent algorithmic advances, there is a lack of publicly available, easy to use, and accurate flow estimation software. The GPU4Vision project (http://gpu4vision.icg.tugraz.at; last accessed 24 July 2013) has made a substantial effort to change this and provides executable files for several accurate methods (Wedel et al. 2008a, b, 2009; Werlberger et al. 2009). The dependence on the GPU and the lack of source code are limitations. Since the publication of our conference paper, our public Matlab code has been used both by researchers to develop new optical flow algorithms (Adato et al. 2011; Chen et al. 2012, 2013; Jia et al. 2011; Krähenbühl and Koltun 2012) and by practitioners who apply optical flow to different applications (Humayun et al. 2011; Lin and Fisher 2012; Niu et al. 2012). Other currently available optical-flow software includes http://lmb.informatik.uni-freiburg.de/resources/software.php, http://people.csail.mit.edu/celiu/OpticalFlow/, and http://www.cse.cuhk.edu.hk/leojia/projects/flow/ (all last accessed 24 July 2013).

3 Classical Models

As is common to “classical” methods we only address the two-frame optical flow estimation problem. We write the classical optical flow objective function in its spatially discrete form as

$$\begin{aligned} E(\mathbf{u },\mathbf{v })&= \sum _{i, j}\big \{\rho _D(I_1(i, j) \!-\! I_2(i\!+\!u_{i,j}, j\!+\!v_{i,j}))\nonumber \\&\qquad +\, \lambda [\rho _S(u_{i,j}\!-\!u_{i+1,j})\! +\! \rho _S(u_{i,j}\!-\!u_{i,j+1})\nonumber \\&\qquad +\, \rho _S(v_{i,j}\!-\!v_{i+1,j}) \!+\! \rho _S(v_{i,j}\!-\!v_{i,j+1})] \big \}, \end{aligned}$$
(1)

where \(\mathbf{u }\) and \(\mathbf{v }\) are the horizontal and vertical components of the optical flow field to be estimated from images \(I_1\) and \(I_2\), \(i,j\) indexes a particular image pixel location, \(u_{i,j}\) and \(v_{i,j}\) are elements of \(\mathbf{u }\) and \(\mathbf{v }\) respectively, \(\lambda \) is a regularization parameter, and \(\rho _D\) and \(\rho _S\) are the data and spatial penalty functions. We consider three different penalty functions: (1) the quadratic HS penalty \(\rho (x) = x^2\); (2) the Charbonnier penalty \(\rho (x) = \sqrt{x^2 + \epsilon ^2}\) (Bruhn et al. 2005), a differentiable variant of the absolute value, the most robust convex function; and (3) the Lorentzian \(\rho (x) = \log (1+\frac{x^2}{2 \sigma ^2})\), which is a non-convex robust penalty used by Black and Anandan (1996). We refer to the robust formulation with the Lorentzian penalty as BA (short for Black and Anandan). Note that this classical model is related to a standard pairwise Markov random field (MRF) based on a 4-neighborhood (Geman and Geman 1984).
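
For concreteness, these penalties can be written as simple Matlab function handles. The following sketch is illustrative only; the \(\epsilon \) and \(\sigma \) values are those reported in Sect. 3.2.

```matlab
% Illustrative Matlab handles for the penalty functions in Eq. (1).
rho_quad    = @(x) x.^2;                                 % quadratic (HS)
rho_charb   = @(x, epsilon) sqrt(x.^2 + epsilon^2);      % Charbonnier
rho_lorentz = @(x, sigma) log(1 + x.^2 ./ (2*sigma^2));  % Lorentzian

% Example: spatial penalty of a flow difference of 0.7 pixels.
du = 0.7;
eC = rho_charb(du, 0.001);    % epsilon for the Charbonnier (Sect. 3.2)
eL = rho_lorentz(du, 0.03);   % sigma for the Lorentzian spatial term
```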

In the remainder of this section we define a baseline method using several techniques from the literature. This is not the “best” method, but includes modern techniques and will be used for comparison. We only briefly describe the main choices, which are explored in more detail in the following section and the cited references.

Quantitative results are presented throughout the remainder of the text. In all cases we report the average end-point error (EPE) on the Middlebury training or test set, depending on the experiment.

3.1 Baseline Methods

To gain robustness against lighting changes, we follow Wedel et al. (2008b) and apply the Rudin–Osher–Fatemi (ROF; Rudin et al. 1992) structure texture decomposition method to pre-process the input sequences and linearly combine the texture and structure components (in the proportion 20:1). The parameters are set according to Wedel et al. (2008b).
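
A minimal sketch of this pre-processing step is given below. The routine rof_denoise is a hypothetical placeholder for any implementation of the ROF denoiser (Rudin et al. 1992), and the 20:1 blend is written up to an overall scale factor.

```matlab
% Sketch of the structure-texture pre-processing (Wedel et al. 2008b).
% rof_denoise is a hypothetical placeholder for an ROF/total-variation
% denoiser, not a built-in function.
I         = rand(480, 640);          % stand-in grayscale frame in [0, 1]
structure = rof_denoise(I);          % low-frequency structure component
texture   = I - structure;           % high-frequency texture component
blended   = texture + structure/20;  % 20:1 texture-structure blend
% (in our experiments the blended result is then rescaled to [0, 255])
```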

Optimization is performed using a standard incremental multi-resolution technique (e.g., Black and Anandan 1996; Brox et al. 2004) to estimate flow fields with large displacements. The optical flow estimated at a coarse level is used to warp the second image toward the first at the next finer level, and a flow increment is calculated between the first image and the warped second image. The standard deviation of the Gaussian anti-aliasing filter is set to \(\frac{1}{\sqrt{2 d}}\), where \(d\) denotes the downsampling factor, and each pyramid level is obtained by recursively downsampling the next finer level. In building the pyramid, the downsampling factor is not critical, as shown in the next section; here we use the settings of Sun et al. (2008), which use a factor of 0.8 in the final stages of the optimization. For the basic pyramid scheme, we adaptively determine the number of pyramid levels so that the top level has a width or height of around 20–30 pixels. At each pyramid level, we perform 10 warping steps to compute the flow increment.
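
The pyramid construction can be sketched as follows, assuming the Image Processing Toolbox. The target top-level size of 24 pixels is one illustrative choice within the 20–30 pixel range mentioned above; the exact constants in our released code may differ.

```matlab
% Sketch of the coarse-to-fine pyramid construction. d is the downsampling
% factor (e.g., 0.5 or 0.8); each level is smoothed with a Gaussian of
% standard deviation 1/sqrt(2d) and then resampled.
I       = rand(480, 640);                        % stand-in first frame
d       = 0.5;                                   % downsampling factor
minDim  = min(size(I, 1), size(I, 2));
nLevels = 1 + floor(log(minDim/24) / log(1/d));  % top level ~24 px
sigma   = 1 / sqrt(2*d);
g       = fspecial('gaussian', 2*ceil(2*sigma) + 1, sigma);

pyr    = cell(nLevels, 1);
pyr{1} = I;
for l = 2:nLevels
    smoothed = imfilter(pyr{l-1}, g, 'replicate');  % anti-aliasing
    pyr{l}   = imresize(smoothed, d, 'bilinear');   % recursive downsampling
end
```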

At each warping step, we linearize the data term once, which involves computing terms of the type \(\frac{\partial }{ \partial x} I_2(i+u^k_{i,j}, j+v^k_{i,j})\), where \({\partial }/\partial x\) denotes the partial derivative in the horizontal direction, and \(u^k\) and \(v^k\) denote the current flow estimate at iteration \(k\). As suggested by Wedel et al. (2008b), we compute the derivatives of the second image using the 5-point derivative filter \(\frac{1}{12}[-1 \ 8 \ 0 \ -8 \ 1]\), and warp the second image and its derivatives toward the first using the current flow estimate and bicubic interpolation. We then compute the spatial derivatives of the first image, average them with the corresponding warped derivatives of the second image (cf. Álvarez et al. 2007; Horn 1986), and use the result in place of \(\frac{\partial I_2}{\partial x}\). For pixels moving out of the image boundaries, we set both their corresponding temporal and spatial derivatives to zero. After each warping step, the flow update is computed, and then we apply a \(5\times 5\) median filter to the newly computed flow field to remove outliers (Wedel et al. 2008b).
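
The following sketch illustrates the derivative computation, warping, and median filtering within a single warping step, assuming grayscale frames and the Image Processing Toolbox. The linear system solve for the flow increment is omitted (a zero placeholder increment is used), so the snippet is schematic rather than a complete solver.

```matlab
% Sketch of one warping step with stand-in frames and a zero flow estimate.
I1 = rand(100, 120);  I2 = rand(100, 120);
u  = zeros(100, 120); v  = zeros(100, 120);

h5  = [-1 8 0 -8 1] / 12;                        % 5-point derivative filter
I2x = imfilter(I2, h5,  'replicate', 'conv');    % horizontal derivative of I2
I2y = imfilter(I2, h5', 'replicate', 'conv');    % vertical derivative of I2

[X, Y]  = meshgrid(1:size(I1, 2), 1:size(I1, 1));
warpI2  = interp2(I2,  X + u, Y + v, 'cubic');   % bicubic warping toward I1
warpI2x = interp2(I2x, X + u, Y + v, 'cubic');
warpI2y = interp2(I2y, X + u, Y + v, 'cubic');

I1x = imfilter(I1, h5,  'replicate', 'conv');
I1y = imfilter(I1, h5', 'replicate', 'conv');
Ix  = 0.5 * (I1x + warpI2x);                     % temporal averaging of the
Iy  = 0.5 * (I1y + warpI2y);                     % spatial derivatives
It  = warpI2 - I1;                               % temporal difference

out = isnan(warpI2);                             % pixels that left the image
Ix(out) = 0;  Iy(out) = 0;  It(out) = 0;         % zero their derivatives

% The flow increment (du, dv) would come from solving the linearized system
% built from Ix, Iy, It; a zero placeholder is used here for illustration.
du = zeros(size(u));  dv = zeros(size(v));
u  = medfilt2(u + du, [5 5], 'symmetric');       % 5x5 median filtering of
v  = medfilt2(v + dv, [5 5], 'symmetric');       % the updated flow field
```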

For the Charbonnier (Classic-C) and Lorentzian (Classic-L) penalty functions, we use a graduated non-convexity scheme (GNC; Blake and Zisserman 1987) as described by Sun et al. (2008). First, we replace the robust penalty functions by quadratic penalty functions and obtain a quadratic formulation of the objective function, \(E_Q(\mathbf{u }, \mathbf{v })\). Then we linearly combine the quadratic penalty function with the desired robust penalty function, gradually changing the weighting of the two terms until only the robust penalty remains. In practice, we use a three-stage GNC scheme, with the objective functions for the first, second, and third stages being \(E_Q(\mathbf{u }, \mathbf{v })\), \(\frac{1}{2}\big ( E_Q(\mathbf{u }, \mathbf{v }) + E(\mathbf{u }, \mathbf{v }) \big )\), and \(E(\mathbf{u }, \mathbf{v })\) respectively. The output of each stage serves as the initialization for the next. The standard deviation of the corresponding quadratic penalty function is set to 1 for the Charbonnier penalty and, for the Lorentzian, to the same \(\sigma \) value used in the Lorentzian function. The same regularization weight \(\lambda \) is used for both the quadratic and the robust objective functions.
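
The staging can be viewed as a linear blend of the quadratic and robust penalties; the schematic sketch below conveys only the blending schedule, with the inner incremental-warping optimization left as a comment.

```matlab
% Schematic three-stage GNC: the stage objective uses the blended penalty
% (1 - alpha)*rho_quad + alpha*rho_robust with alpha = 0, 0.5, 1.
rho_quad   = @(x) x.^2;                  % quadratic stand-in
rho_robust = @(x) sqrt(x.^2 + 0.001^2);  % target Charbonnier penalty
for alpha = [0, 0.5, 1]
    rho_gnc = @(x) (1 - alpha)*rho_quad(x) + alpha*rho_robust(x);
    % ... run the incremental multi-resolution optimization of Sect. 3.1
    % with rho_gnc, initialized with the flow from the previous stage ...
end
```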

3.2 Baseline Results

The regularization parameter \(\lambda \) is selected among a set of candidate values to achieve the best average end-point error (EPE) on the Middlebury training set. For the Charbonnier penalty function, the candidate set is \([1,3,5,8,10]\) and \(5\) is optimal. The Charbonnier penalty uses \(\epsilon =0.001\) for both the data and the spatial term in Eq.  1. The Lorentzian uses \(\sigma = 1.5\) for the data term, \(\sigma = 0.03\) for the spatial term, and \(\lambda =0.06\). These parameters are fixed throughout the experiments, except where mentioned.

Table 1 summarizes the EPE results of the basic model with three different penalty functions on the Middlebury test set, along with the two top-performing published methods at the time the evaluation table was generated. Table 2 provides detailed results for each sequence. The classical formulations with the two non-quadratic penalty functions, Classic-C and Classic-L, achieve competitive results despite their simplicity. The baseline optimization of HS and BA (Classic-L) results in significantly better accuracy than previously reported for these models (Sun et al. 2008). Note that the analysis also holds for the training set (Table 3).

Table 1 Models: average rank and end-point error (EPE) on the Middlebury test set using different penalty functions
Table 2 Models: average end-point error (EPE) on the Middlebury optical flow benchmark (test set)
Table 3 Pre-processing: average end-point error (EPE) on the Middlebury training set for the baseline method (Classic-C) using different image pre-processing techniques

Because Classic-C performs quite well despite its simplicity, we set it as the baseline below. Note that our baseline implementation of HS has a lower average EPE than many more sophisticated methods. The HS implementation here incorporates many algorithmic and implementation details not present in the original HS method; the core idea of quadratic data and spatial terms however remains the same. In our naming convention, one can think of the HS method here as Classic-Q, meaning that it is the same as the Classic-C method except that the data and spatial penalty terms are quadratic.

4 Practices Explored

We now systematically vary the baseline approach by incorporating different ideas that have appeared in the literature, with the goal of illuminating which of these ideas are significant. This analysis is performed on the Middlebury training set by changing only one property at a time. Statistical significance is determined using a Wilcoxon signed rank test (Wilcoxon 1945) between each modified method and the baseline Classic-C method; a \(p\) value less than 0.05 indicates a significant difference. Each section below presents detailed comparisons of all these methods and then summarizes the results in a simple “take away message” about what we think are the “best practices” based on the data.
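
In Matlab, such a test can be performed with the signrank function from the Statistics Toolbox; the sketch below uses made-up per-sequence EPE values purely for illustration.

```matlab
% Paired Wilcoxon signed rank test between a variant and the baseline.
% The EPE values are placeholders, not measured results.
epeBaseline = [0.22 0.16 0.08 0.28 0.41 0.19 0.45 0.62];
epeVariant  = [0.21 0.17 0.08 0.27 0.43 0.18 0.44 0.60];
p = signrank(epeBaseline, epeVariant);
isSignificant = p < 0.05;
```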

4.1 Image Pre-Processing

While it is common to talk about the brightness constancy assumption as a core feature of most optical flow algorithms, in practice many other constancy assumptions have been used. It is common, for example, to pre-filter the images in a variety of ways ranging from simple smoothing to edge detection. For each method, we optimize the regularization parameter \(\lambda \) for the training sequences. The results are summarized in Table 3, with details of the methods applied to individual training sequences given in Tables 4 and 5. The baseline uses a non-linear pre-filtering of the images (ROF) to reduce the influence of illumination changes between frames (Wedel et al. 2008b). Table 3 shows the effect of using no pre-processing, resulting in the standard brightness constancy model (*-brightness). Classic-C-brightness actually achieves lower EPE on the training set than does Classic-C but significantly higher error on the test set (Table 1). This disparity suggests overfitting to the training data and leaves open the question as to whether the standard brightness constancy assumption, formulated robustly, may still compete with various types of filter/structure constancy given appropriate training data.

Table 4 Models and pre-processing: average end-point error (EPE) on the Middlebury training set for the classical model and different penalty functions
Table 5 Pre-processing: average end-point error (EPE) on the Middlebury training set for the baseline method (Classic-C) using different pre-processing techniques

Simpler alternatives, such as filter response (or high-order) constancy (Brox et al. 2004; Bruhn and Weickert 2005; Sun et al. 2008), can serve the same purpose as the ROF texture decomposition. A variety of pre-filters have been used in the literature, including derivative filters, Laplacians (Burt et al. 1982; Lempitsky et al. 2010), and Gaussians. Edges have also been emphasized using the Sobel edge magnitude (Vaudrey and Klette 2009).

Gradient only imposes constancy of the gradient vector at each pixel, as proposed by Brox et al. (2004); i.e., it robustly penalizes the Euclidean distance between image gradients. We use central difference filters (\(Dx = [-0.5 \ 0 \ 0.5]\) and \(Dy = Dx^T\)). Gaussian+Dx+Dy assumes separate brightness, horizontal derivative, and vertical derivative constancy; a weighted combination of robust functions applied to each term is used, as by Sun et al. (2008). Neither of these methods differs significantly from the baseline texture decomposition (Classic-C). Two methods are significantly worse: the Sobel edge magnitude (Vaudrey and Klette 2009) and the Laplacian pre-filtering (\(5\times 5\)) used by Lempitsky et al. (2010). The Sobel edge magnitude appears not to work well on some of the sequences, particularly the synthetic ones, and may not be suitable for a general flow estimation method. The Laplacian pre-filtering produces good results on “RubberWhale”, but poor ones on the synthetic sequences. Note that the parameters of the FusionFlow method (Lempitsky et al. 2010) were mainly tuned on the “RubberWhale” sequence; the evaluation results suggest that a better pre-processing technique could improve FusionFlow. Gaussian pre-filtering (\(\sigma = 0.5\)) performs well on the synthetic sequences, but poorly on the real ones. Finally, the texture-structure blending ratio is 20:1 in Wedel et al. (2008b) but 4:1 in Werlberger et al. (2009). We find that Texture4:1 performs better (though not significantly) on the synthetic sequences, with a little degradation on the real ones. The blended result of the texture decomposition is normalized to \([-1, 1]\) by Wedel et al. (2008b) and to \([0, 255]\) in our experiments; omitting this normalization (Unnormalized texture) has little effect.

For the Laplacian pre-filtering, we find that combining the filtered image with the original image in the proportion 1:1 improves accuracy significantly (Laplacian1:1). Similar to the ROF texture decomposition, such an approach boosts the high-frequency components while suppressing the low-frequency components that contain the lighting changes.

Good Practices: Some form of image filtering is useful but simple derivative constancy is nearly as good as the more sophisticated texture decomposition method.

4.2 Coarse-to-Fine Estimation and Graduated Non-Convexity (GNC)

We vary the number of warping steps per pyramid level and find that 3 warping steps give results similar to the baseline 10 (Table 6), except on “Urban3”, which is dominated by large motion and occlusions (see Table 7 for sequence-specific results). For the coarse-to-fine pyramid, Sun et al. (2008) use a downsampling factor of 0.8 during non-convex optimization. A traditional downsampling factor of 0.5 (Down-\(0.5\)), however, gives nearly identical performance. Note that a larger factor means that adjacent pyramid levels are more similar in size and, for a pyramid whose top and bottom levels have fixed sizes, results in more pyramid levels.

Table 6 Model and methods: average end-point error (EPE) on the Middlebury training set for the baseline method (Classic-C) using different algorithm and modeling choices
Table 7 Model and methods: average end-point error (EPE) on the Middlebury training set for the baseline model (Classic-C) using different algorithm and modeling choices

Previously, Brox et al. (2004) have reported that a downsampling factor of 0.95 produces much better results than 0.5. Note that for each iterative warping estimation step, Brox et al. use successive over-relaxation (SOR) to iteratively solve their linear system of equations and stop the iteration before convergence. With a downsampling factor of 0.95, they effectively increase the number of iterative warping steps performed by the algorithm, and this likely helps the overall algorithm converge. For our implementation, we solve the linear system of equations using the Matlab built-in backslash function and obtain converged results for each iterative warping estimation step. Under such a setting, we find that the downsampling factor has little influence on the performance.

Removing the GNC procedure for the Charbonnier penalty function (w/o GNC) results in higher EPE on most sequences and higher energy on all sequences (Table 8). This suggests that the GNC method is helpful even for the convex Charbonnier penalty function due to the nonlinearity of the data term.

Table 8 Energy (\(\times 10^6\), Eq. 1) for the optical flow fields computed on the Middlebury training set, evaluated using convolution-based bicubic interpolation (Keys 1981)

Good Practices: The downsampling factor does not matter when using a convex penalty; a standard factor of 0.5 is fine. Some form of GNC is useful even for a convex robust penalty like Charbonnier because of the nonlinear data term.

4.3 Interpolation Method and Derivatives

We find that the baseline bicubic interpolation is more accurate than bilinear (Table 6, Bilinear), as already reported in previous work (Wedel et al. 2008b). Removing temporal averaging of the gradients (w/o TAVG), using a Central difference filter \([-1\ 0\ 1]/2\), or using a 7-point derivative filter \([-1 \ 9 \ -45 \ 0 \ 45 \ -9 \ 1]/60\) (Bruhn et al. 2005) all reduce accuracy compared to the baseline, but not significantly.

The baseline method computes the image derivative by first computing the derivative of the second image, warping the intermediate result toward the first image, and then averaging the warped result with the spatial derivative of the first image. Another approach is to first warp the second image toward the first image, compute the derivatives of the warped image, and then perform the temporal averaging with the spatial derivatives of the first image (Bruhn et al. 2005). We find the second approach produces similar results (Deriv-warp). However, the derivatives computed in either way are inconsistent with those implicitly interpolated by the bicubic interpolation. Bicubic interpolation interpolates not only the image but also the derivatives (Press et al. 2002). Because the Matlab built-in function interp2 is based on cubic convolution (Keys 1981) and does not provide the derivatives used in interpolation, we use the spline-based implementation by Press et al. (2002). With the new implementation (Bicubic-II), the three different ways to compute the derivatives give very similar EPE results, all better than the Matlab built-in function. However, the one with consistent derivatives (Bicubic-II) gives the lowest energy solution, as shown in Table 9.

Table 9 Energy (\(\times 10^6\), Eq. 1) for the optical flow fields computed on the Middlebury training set, evaluated using spline-based bicubic interpolation (Press et al. 2002)

Good Practices: Use spline-based bicubic interpolation with a 5-point filter. Compute the derivatives during the interpolation to obtain the lowest energy solutions. Temporal averaging of the derivatives is probably worthwhile for a small computational expense.

4.4 Penalty Functions

We find that the convex Charbonnier penalty performs better than the more robust, non-convex Lorentzian on both the training and test sets. We test using the Charbonnier for the data term and Lorentzian for the spatial term (C–L) and vice versa (L–C). The two approaches perform better than using the Lorentzian for both terms but worse than using the Charbonnier for both terms.

One reason might be that non-convex functions are more difficult to optimize, causing the optimization scheme to find a poor local optimum. Another reason might be that the MAP estimator actually favors the “wrong” penalty functions (Nikolova 2007; Schmidt et al. 2010).

We investigate a generalized Charbonnier penalty function \( \rho (x) = (x^2 + \epsilon ^2)^a\) that is equal to the Charbonnier penalty when \(a=0.5\), and non-convex when \(a< 0.5\) (see Fig. 2). We optimize the regularization parameter \(\lambda \) again. We find a slightly non-convex penalty with \(a=0.45\) (GC-0.45) performs consistently better than the Charbonnier penalty, whereas more non-convex penalties (GC-0.25 with \(a=0.25\)) show no improvement.

Fig. 2

Different penalty functions for the spatial terms: Charbonnier (\(\epsilon =0.001\)), generalized Charbonnier (\(a=0.45\) and \(a=0.25\)), and Lorentzian (\(\sigma =0.03\))

Good Practices: The less-robust Charbonnier is preferable to the highly non-convex Lorentzian and a slightly non-convex penalty function (GC-0.45) is better still.

4.5 Median Filtering

Figure 3 illustrates the median filtering step within the coarse-to-fine incremental estimation process. The baseline \(5\times 5\) median filter (MF \(5\times 5\) ) is better than both MF \(3\times 3\) (Wedel et al. 2008b) and MF \(7\times 7\), but the difference is not significant (Table 6). When we perform \(5\times 5\) median filtering twice (\(2\times \) MF) or five times (\(5\times \) MF) per warping step, the results are worse. Finally, removing the median filtering step (w/o MF) makes the computed flow significantly less accurate with larger outliers as shown in Table 6 and Fig. 4.

Fig. 3

The median filtering is performed after every incremental warping step (i. e., once at every image pyramid level). The output of the median filtering is upsampled and used as the initial estimate for the next larger pyramid level

Fig. 4

Estimated flow fields on sequence “RubberWhale” using Classic-C with and without (w/o MF) the median filtering step. a (w/ MF) energy 502,387, b (w/o MF) energy 449,290, c color key (Baker et al. 2007). The median filtering step helps reach a solution free from outliers but with a higher energy. The flow fields have been normalized by their maximum magnitude, resulting in different contrasts. The outliers in the result without median filtering (b) make the flow appear lower contrast

One interesting result with HS is that repeatedly applying median filtering (20 times) at every warping step improves the HS formulation and the improvement is statistically significant (HS \(20 \times \) MF in Table 10).

Table 10 Additional results for HS: average end-point error (EPE) on the Middlebury training set

Good Practices: Median filtering the intermediate flow results once after every warping iteration is the single most important implementation detail here; \(5\times 5\) is a good filter size.

4.6 Best Practices

Combining the analysis above into a single approach means modifying the baseline to use the slightly non-convex generalized Charbonnier and the spline-based bicubic interpolation. This leads to a statistically significant improvement over the baseline (Table 6, Classic++). This method is directly descended from HS and BA, yet updated with the current best optimization practices known to us. This simple method ranks 32nd out of 73 methods in both EPE and AAE on the Middlebury test set at the writing of the paper (Sep. 2012). However, as we will see shortly, this method is not as “simple” as it appears: with the median filtering step, a different objective is being optimized than the original one. The same is true for the reported results of both HS and BA.

5 Models Underlying Median Filtering

Our analysis reveals the practical importance of median filtering during optimization. This effectively denoises the intermediate flow fields, preventing gross outliers and making even non-robust methods like HS more robust. We ask whether there is a principle underlying this heuristic.

One interesting observation is that flow fields obtained with median filtering have substantially higher energy than those without (Table 8; Fig. 4). If the median filter is helping to optimize the objective, it should lead to lower energies. Higher energies and more accurate estimates suggest that incorporating median filtering changes the objective function being optimized.

The insight that follows from this is that the median filtering heuristic is related to the minimization of an objective function that differs from the classical one. In particular the optimization of Eq.  1, with interleaved median filtering, approximately minimizes

$$\begin{aligned} E(\mathbf{u },\mathbf{v })&= \sum _{i, j} \Big \{\rho _D(I_1(i, j) - I_2(i+u_{i,j}, j+v_{i,j}))\nonumber \\&\qquad + \lambda [\rho _S(u_{i,j}-u_{i+1,j}) \!+ \rho _S(u_{i,j}-u_{i,j+1})\nonumber \\&\qquad + \rho _S(v_{i,j}-v_{i+1,j}) + \rho _S(v_{i,j}-v_{i,j+1})]\Big \} \nonumber \\&\qquad + \lambda _N \sum _{i,j} \sum _{(i',j')\in \mathcal{N }_{i,j}} (|{u}_{i,j} \!-\!{u}_{i',j'}| \!+\! |{v}_{i,j}-{v}_{i',j'}|), \end{aligned}$$
(2)

where \(\mathcal{N }_{i,j}\) is the set of neighbors of pixel \((i,j)\) in a possibly large area and \(\lambda _N\) is a scalar weight. The term in braces is the same as the flow energy from Eq. 1, while the last term is new. This non-local term (Buades et al. 2005; Gilboa and Osher 2008) imposes a particular smoothness assumption within a specified region of the flow field. Here we take this term to be a \(5\times 5\) rectangular region to match the size of the median filter in Classic-C. Figure 5 shows the neighborhood for the standard pairwise model and the non-local term.

Fig. 5

From left to right, neighborhood structure for the center (red) pixel for the standard pairwise model, the unweighted non-local model, the unweighted non-local model with a larger neighborhood, and the weighted non-local model. The standard pairwise model connects a center pixel with its nearest neighbors, while the non-local term connects a pixel with many pixels in a large spatial neighborhood. By assigning larger weights (thicker red edges) to neighbors that are more likely to be on the same surface (blue circles), the weighted non-local model incorporates spatial scene structure information

It is usually difficult to directly optimize the objective (2) with a large spatial term. A common practice is to relax the objective with an auxiliary flow field as

$$\begin{aligned} E_A(\mathbf{u },\mathbf{v }, \hat{\mathbf{u }}, \hat{\mathbf{v }})&= \sum _{i, j} \Big \{\rho _D(I_1(i, j) - I_2(i+u_{i,j}, j+v_{i,j}))\nonumber \\&\quad +\lambda [\rho _S(u_{i,j}-u_{i+1,j}) + \rho _S(u_{i,j}-u_{i,j+1}) \nonumber \\&\quad +\rho _S(v_{i,j}-v_{i+1,j}) + \rho _S(v_{i,j}-v_{i,j+1})]\Big \} \nonumber \\&\quad +\lambda _C (||\mathbf{u }-\hat{\mathbf{u }}||^2 + ||\mathbf{v }-\hat{\mathbf{v }}||^2)\nonumber \\&\quad + \lambda _N \sum _{i,j} \sum _{(i',j')\in \mathcal{N }_{i,j}} (|\hat{u}_{i,j}-\hat{u}_{i',j'}| + |\hat{v}_{i,j}-\hat{v}_{i',j'}|), \end{aligned}$$
(3)

where \(\hat{\mathbf{u }}\) and \(\hat{\mathbf{v }}\) denote an auxiliary flow field and \(\lambda _C\) is a scalar weight. A third (coupling) term encourages \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\) and \(\mathbf{u }, \mathbf{v }\) to be the same (cf. Wedel et al. 2009; Zach et al. 2007). Here the notation implies a pixelwise sum of squared errors between the auxiliary and main flow fields.

The connection to median filtering (as a denoising method) derives from the fact that there is a direct relationship between the median and L1 minimization. Consider a simplified version of Eq.  3 with just the coupling and non-local terms, where

$$\begin{aligned} E(\hat{\mathbf{u }}) \! = \! \lambda _C ||\mathbf{u }\!-\!\hat{\mathbf{u }}||^2 \!+\! \lambda _N \sum _{i,j} \sum _{(i',j')\in \mathcal{N }_{i,j}} |\hat{u}_{i,j}\!-\!\hat{u}_{i',j'}|. \end{aligned}$$
(4)

While minimizing this is similar to median filtering \(\mathbf{u }\), there are two differences. First, the non-local term minimizes the L1 distance between the central value and all flow values in its neighborhood except itself. Second, Eq.  4 incorporates information about the data term through the coupling equation; median filtering the flow ignores the data term.

The formal connection between Eq. 4 and median filtering is provided by Li and Osher (2009), who show that minimizing Eq. 4 is related to a different median computation

$$\begin{aligned} \hat{u}_{i,j}^{(k+1)} = \text {median}({\mathrm{Neighbors}}^{(k)} \cup {\mathrm{Data}}) \end{aligned}$$
(5)

where \(\mathrm{Neighbors}^{(k)}=\{\hat{u}^{(k)}_{i',j'} \}\) for \((i',j')\in \mathcal{N }_{i,j}\) and \(\hat{\mathbf{u }}^{(0)}=\mathbf{u }\) as well as

$$\begin{aligned} \mathrm{Data} = \{ u_{i,j}, u_{i,j} \pm \tfrac{\lambda _N}{\lambda _C}, u_{i,j} \pm \tfrac{2\lambda _N}{\lambda _C}, \ldots , u_{i,j} \pm \tfrac{|\mathcal{N }_{i,j}| \lambda _N}{2\lambda _C}\}, \end{aligned}$$

where \(|\mathcal{N }_{i,j}|\) denotes the (even) number of neighbors of \((i,j)\). Note that the set of “data” values is balanced with an equal number of elements on either side of the value \(u_{i,j}\) and that information about the data term is included through \(u_{i,j}\). Repeated application of Eq.  5 converges rapidly (Li and Osher 2009).
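
For illustration, one iteration of Eq. 5 at a single pixel can be written directly in Matlab; the weights and neighbor values below are arbitrary stand-ins.

```matlab
% One iteration of Eq. (5) at a single pixel for a 5x5 neighborhood
% (|N| = 24 neighbors, excluding the center).
lambdaN   = 1;    lambdaC = 0.01;           % stand-in weights
uC        = 0.3;                            % data value u_{i,j}
neighbors = uC + 0.05*randn(24, 1);         % current estimates \hat{u}^{(k)}

k    = (1:12)';                             % |N|/2 = 12 offsets per side
data = [uC; uC + k*lambdaN/lambdaC; uC - k*lambdaN/lambdaC];
uHatNew = median([neighbors; data]);        % Eq. (5)
```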

Observe that, as \(\lambda _N/\lambda _C\) increases, the weighted data values on either side of \(u_{i,j}\) move away from the values of \(\mathrm{Neighbors}\) and cancel each other out. As this happens, Eq.  5 approximates the median at the first iteration

$$\begin{aligned} \hat{u}_{i,j}^{(1)}\approx \text {median}( \mathrm{Neighbors}^{(0)} \cup \{ u_{i,j} \}). \end{aligned}$$
(6)

Equation 3 thus combines the original objective with an approximation to the median, the influence of which is controlled by \(\lambda _N/\lambda _C\). Note in practice the weight \(\lambda _C\) on the coupling term is usually small or is steadily increased from small values (Wedel et al. 2008b; Zach et al. 2007). We optimize the new objective (3) by alternately minimizing

$$\begin{aligned} E_O(\mathbf{u },\mathbf{v })&= \sum _{i, j} \Big \{\rho _D(I_1(i, j) - I_2(i+u_{i,j}, j+v_{i,j})) \nonumber \\&\qquad \ \ \,+ \lambda [\rho _S(u_{i,j}-u_{i+1,j}) \!+ \rho _S(u_{i,j}-u_{i,j+1}) \nonumber \\&\qquad \ \ \, + \rho _S(v_{i,j}-v_{i+1,j}) + \rho _S(v_{i,j}-v_{i,j+1})]\Big \} \nonumber \\&\qquad \ \ \, + \lambda _C ( ||\mathbf{u }-\hat{\mathbf{u }}||^2 + ||\mathbf{v }-\hat{\mathbf{v }}||^2) \end{aligned}$$
(7)

and

$$\begin{aligned} E_M(\hat{\mathbf{u }}, \hat{\mathbf{v }})&= \lambda _C (||\mathbf{u }\!-\!\hat{\mathbf{u }}||^2 \!+\! ||\mathbf{v }-\hat{\mathbf{v }}||^2)\nonumber \\&\quad + \lambda _N \sum _{i,j} \sum _{(i',j')\in \mathcal{N }_{i,j}} (|\hat{u}_{i,j}\!-\!\hat{u}_{i',j'}| \!+\! |\hat{v}_{i,j}\!-\!\hat{v}_{i',j'}|). \end{aligned}$$
(8)

We find that optimizing the coupled set of equations is superior, in terms of EPE performance, to directly optimizing the objective in Eq. 2.

The alternating optimization strategy first holds \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\) fixed and minimizes Eq.  7 w. r. t. \(\mathbf{u }, \mathbf{v }\). Then, with \(\mathbf{u }, \mathbf{v }\) fixed, we minimize Eq.  8 w. r. t. \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\). Note that Eqs.  4 and 8 can be minimized by repeated application of Eq.  5; we use this approach with five iterations. We perform 10 steps of alternating optimizations at every pyramid level and change \(\lambda _C\) logarithmically from \(10^{-4}\) to \(10^{2}\). During the first and second GNC stages, we set \(\mathbf{u }, \mathbf{v }\) to be \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\) after every warping step (this replacement step helps reach solutions with lower energy and EPE than without performing this step; see Classic-C–A-noRep in Tables 11, 12). In the end, we take \(\hat{\mathbf{u }}, \hat{\mathbf{v }}\) as the final flow field estimate. The other parameters are \(\lambda = 5, \lambda _N = 1\).
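
The following skeleton summarizes this alternation at one pyramid level. Here solve_flow_increment and li_osher_median are hypothetical placeholders for the sub-problems of Eqs. 7 and 8 (they are not library routines), so the sketch conveys the schedule rather than a runnable solver.

```matlab
% Skeleton of the alternating optimization at one pyramid level.
lambdaC_schedule = logspace(-4, 2, 10);     % lambda_C from 1e-4 to 1e2
lambdaN = 1;

for step = 1:10
    lambdaC = lambdaC_schedule(step);
    % Eq. (7): update (u, v) with the quadratic coupling to (uHat, vHat).
    [u, v] = solve_flow_increment(I1, I2, u, v, uHat, vHat, lambdaC);
    % Eq. (8): update (uHat, vHat) by five applications of Eq. (5).
    for it = 1:5
        [uHat, vHat] = li_osher_median(u, v, uHat, vHat, lambdaC, lambdaN);
    end
    u = uHat;  v = vHat;   % replacement step (first two GNC stages only)
end
```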

Table 11 Average end-point error (EPE) on the Middlebury training set is shown for the new model with alternating optimization (Classic-C–A)
Table 12 Energy (\(\times 10^6\), Eq. 3) for the computed flow fields on the Middlebury training set
Table 13 Average end-point error (EPE) on the Middlebury training set for the proposed new objective with the non-local term and alternating optimization (Classic-C–A) and its improved models

Alternately optimizing this new objective function (Classic-C–A) leads to results similar to the baseline Classic-C (Tables 11, 13). We also compare the energy of these solutions under the new objective and find that the alternating optimization produces the lowest-energy solutions, as shown in Table 12.

We find that approximately optimizing the new objective by changing \(\lambda _C\) logarithmically from \(10^{-4}\) to \(10^{-1}\) has slightly better EPE results but higher energy solutions (Classic-C–A-II). We also try replacing the absolute value by the Charbonnier penalty function and using the conjugate gradient descent method (http://www.gaussianprocess.org/gpml/code/matlab/util/minimize.m; last accessed 24 July 2013) to solve Eq.  4 but obtain results with slightly worse EPE performance and higher energy.

In summary, we show that the heuristic median filtering step in Classic-C can now be viewed as energy minimization of a new objective with a non-local term. The explicit formulation emphasizes the value of robustly integrating information over large neighborhoods and enables the improved model described below.

6 Improved Model

By formalizing the median filtering heuristic as an explicit objective function, we can find ways to improve it. While median filtering in a large neighborhood has advantages as we have seen, it also has problems. A neighborhood centered on a corner or thin structure is dominated by the surround and computing the median results in oversmoothing as illustrated in Fig. 1.

Examining the non-local term suggests a solution. For a given pixel, if we know which other pixels in the area belong to the same surface, we can weight them more highly. The modification to the objective function is achieved by introducing a weight into the non-local term (Buades et al. 2005; Gilboa and Osher 2008):

$$\begin{aligned} \sum _{i,j} \sum _{(i',j')\in N_{i,j}} w_{i,j}^{i',j'}(|\hat{u}_{i,j}-\hat{u}_{i',j'}| + |\hat{v}_{i,j}-\hat{v}_{i',j'}|), \end{aligned}$$
(9)

where \(w_{i,j}^{i',j'}\) represents how likely pixel \(i',j'\) is to belong to the same surface as \(i, j\).

Of course, we do not know \(w_{i,j}^{i',j'}\), but we can approximate it. Drawing ideas from Sand and Teller (2008), Xiao et al. (2006), and Yoon and Kweon (2006), we define the weights according to the spatial distance, the color-value distance, and the occlusion state of the neighboring pixels as

$$\begin{aligned} w_{i,j}^{i',j'}\! \propto \! \exp \!\Big \{ \!-\! \tfrac{|i-i'|^2 \!+\! |j-j'|^2}{2\sigma ^2_1} -\tfrac{| \mathbf I (i,j) \!- \! \mathbf I (i',j')|^2}{2\sigma ^2_2 n_c} \! \Big \} \frac{o(i',j')}{o(i,j)}, \end{aligned}$$
(10)

where \(\mathbf I (i,j)\) is the color vector in the Lab space, \(n_c\) is the number of color channels, \(\sigma _1 = 7, \sigma _2 = 7\), and the occlusion variable \(o(i,j)\) is calculated using Eq.  22 in Sand and Teller (2008) as

$$\begin{aligned} o(i,j) \!= \! \exp \! \left\{ \! -\frac{ d^2(i,j) }{2\sigma ^2_d} \!-\! \frac{ \big (I(i,j) \!-\! I(i\!+\!u_{i,j}, j\! +\! v_{i,j}) \big )^2 }{2\sigma ^2_e} \! \right\} \!, \end{aligned}$$
(11)

where \(d(i,j)\) is the one-sided divergence function, defined as

$$\begin{aligned} d(i,j) = \left\{ \begin{array}{ll} \mathrm{div}(i,j), &{} \mathrm{div}(i,j) < 0\\ 0,&{} \mathrm{otherwise}\\ \end{array}\right. \end{aligned}$$
(12)

in which the flow divergence \(\mathrm{div}(i,j)\) is

$$\begin{aligned} \mathrm{div}(i,j) = \frac{\partial }{\partial x} u(i,j) + \frac{\partial }{\partial y} v(i,j), \end{aligned}$$
(13)

where \(\frac{\partial }{\partial x}\) and \(\frac{\partial }{\partial y}\) are respectively the horizontal and vertical flow derivatives. The occlusion variable \(o(i,j)\) is near zero for occluded pixels and near one for non-occluded pixels. We set the parameters in Eq. 11 as \(\sigma _d = 0.3\) and \(\sigma _e = 20\), the same values as Sand and Teller (2008). Note that the occlusion state depends nonlinearly on the unknown flow field, so we calculate it using the latest flow estimate.
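
A sketch of this weight computation is given below, using stand-in data and the stated parameter values; the color-difference computation assumes implicit expansion (Matlab R2016b or later), and the weights are computed up to the omitted normalization constant.

```matlab
% Sketch of the weights of Eqs. (10)-(13) with sigma1 = sigma2 = 7,
% sigma_d = 0.3, sigma_e = 20 (the latter presumes a 0-255 intensity range).
H = 100;  W = 120;
lab = rand(H, W, 3);                    % image in Lab space (stand-in)
I1  = rand(H, W);    I2 = rand(H, W);   % grayscale frames (stand-in)
u   = zeros(H, W);   v  = zeros(H, W);  % current flow estimate

% Flow divergence (Eq. 13) and its one-sided version (Eq. 12).
[ux, ~] = gradient(u);  [~, vy] = gradient(v);
d = min(ux + vy, 0);

% Occlusion variable (Eq. 11), using the warped second image.
[X, Y] = meshgrid(1:W, 1:H);
warpI2 = interp2(I2, X + u, Y + v, 'cubic', 0);
o = exp(-d.^2 / (2*0.3^2) - (I1 - warpI2).^2 / (2*20^2));

% Weights (Eq. 10) over the 15x15 neighborhood of pixel (i, j).
i = 50;  j = 60;  sigma1 = 7;  sigma2 = 7;  nc = 3;
[jj, ii] = meshgrid(j-7:j+7, i-7:i+7);
spatial  = exp(-((ii - i).^2 + (jj - j).^2) / (2*sigma1^2));
colDiff  = sum((lab(i-7:i+7, j-7:j+7, :) - lab(i, j, :)).^2, 3);
color    = exp(-colDiff / (2*sigma2^2*nc));
w = spatial .* color .* o(i-7:i+7, j-7:j+7) / o(i, j);
```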

Examples of such weights are shown for several \(15\times 15\) neighborhoods in Fig. 6; bright values indicate higher weights. Note the neighborhood labeled d, corresponding to the rifle. Since pixels on the rifle are in the minority, an unweighted median oversmooths (Classic++ in Fig. 1). The weighted term instead robustly estimates the motion using values on the rifle. A closely related piece of work is by Ren (2008), who uses the intervening contour to define affinities among neighboring pixels for the local Lucas and Kanade (1981) method. However, that method uses this scheme only to estimate motion at sparse points and then interpolates the dense flow field.

Fig. 6

Neighbor weights of the proposed weighted non-local term at different positions in the “Army” sequence. We use color, spatial distance, and occlusion cues to determine whether the neighboring pixels are likely to belong to the same surface. Among these cues, color is the most powerful (see Table 14 and text for an evaluation of the cues)

We approximately solve for \(\hat{\mathbf{u }}\) (and similarly \(\hat{\mathbf{v }}\)) using the following weighted median problem

$$\begin{aligned} \min _{\hat{u}_{i,j}} \sum _{(i',j')\in N_{i,j} \cup \{ i,j \} } w_{i,j}^{i',j'}|\hat{u}_{i,j}-u_{i',j'}|, \end{aligned}$$
(14)

which we solve using formula (3.13) of Li and Osher (2009) at every pixel (Classic+NL-Full). Note that if all the weights are equal, the solution is simply the median. In practice, we can adopt a fast version (Classic+NL) without performance loss: given a current estimate of the flow, we detect motion boundaries using a Sobel edge detector and dilate these edges with a \(5\times 5\) mask to obtain flow boundary regions. In these regions we use the weighting in Eq. 10 in a \(15\times 15\) neighborhood; in the non-boundary regions, we use equal weights in a \(5 \times 5\) neighborhood to compute the median.
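
The weighted median subproblem can be solved by sorting the values and accumulating the weights; the function below is a generic sketch of this computation (it minimizes the same objective as formula (3.13) of Li and Osher 2009, which we use in practice). For the fast version, the flow boundary mask could be obtained, for example, with edge applied to the flow components followed by imdilate with a \(5\times 5\) structuring element.

```matlab
% Generic weighted median solver for Eq. (14): the minimizer of
% sum_k w_k |x - u_k| over x. Save as weighted_median.m; usage at (i, j):
%   uHat(i,j) = weighted_median(uWin(:), wWin(:))
% where uWin holds the flow values over N_{i,j} (plus the center) and
% wWin the corresponding weights from Eq. (10).
function m = weighted_median(u, w)
    [uSorted, order] = sort(u(:));
    wSorted = w(order);
    wSorted = wSorted(:) / sum(wSorted);   % normalize the weights
    c   = cumsum(wSorted);
    idx = find(c >= 0.5, 1, 'first');      % first value past half the mass
    m   = uSorted(idx);
end
```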

To further reduce the computation, we can adopt a two-stage GNC process and perform three warping steps per pyramid level. This fast version (Classic+NL-Fast) has nearly the same overall performance, with a slight decline in performance on the “Urban3” sequence, which has large motions; with an iterative warping scheme, large motions require more iterations.

Tables 14 and 15 show that the weighted non-local term (Classic+NL) improves the accuracy on both the training and the test sets, especially in the motion boundary regions. Note that the fine detail of the “rifle” is preserved in Fig. 1e. At the writing of this paper (Sep. 2012), Classic+NL ranks 13th in both AAE and EPE. Figures 7 and 8 show some of the results on the Middlebury dataset.

Table 14 Average end-point error (EPE) on the Middlebury training set is shown for the improved model and its variants
Table 15 Average end-point error (EPE) on the Middlebury test set for the Classic++ model with two different preprocessing techniques and its improved model
Fig. 7

Results on the Middlebury test set. Top to bottom: “Teddy”, “Wooden”, and “Grove”. Classic+NL uses information from the color image to detect and preserve fine motion details. Note that the ground truth visualization from the Middlebury website has been compressed and has lower quality than the actual ground truth

Fig. 8

Results on other Middlebury test sequences. Top left “Mequon”; top right “Schefflera”; bottom left “Urban”; bottom right “Yosemite”

We study some variants of the weighted non-local term (Classic+NL). Tables 14 and 16 show the importance of each cue in determining the weights and the influence of the parameter settings on the final results. Using a different color space results in some performance decline. Using grayscale pixel values (Gray) or not using the static image information at all (w/o color) degrades performance significantly. Removing the occlusion (w/o occ) or spatial distance (w/o spa) cue does not degrade the performance significantly. The method is robust to the setting of \(\sigma _2\) for the color cue: 5 and 10 perform similarly to the default 7. The default \(\lambda \) is 3, while 1 and 9 result in some loss in performance. We also study the maximum size of the neighborhood for the non-local term and find that \(11\times 11\) gives similar performance, while \(19\times 19\) is slightly better.

Table 16 Average end-point error (EPE) on the Middlebury training set for the proposed new objective with the weighted non-local term and its variants

6.1 Closely-Related Work

Werlberger et al. (2010) independently propose a non-local regularization term for optical flow estimation that is similar to our non-local term. They use zero-mean normalized cross correlation as the data term to deal with lighting changes. Their work is motivated by the success of non-local regularization (Buades et al. 2005) in image restoration and stereo, whereas ours is inspired by the success of the heuristic median filtering step in flow estimation, which we formalize as a non-local regularization term. Their GPU-based C++ implementation is faster than our Matlab implementation. Classic+NL has lower average EPE on the Middlebury test sequences: 0.319 versus 0.388 (cf. Table 2). Readers can visually compare the results of both methods on the Middlebury website.

6.2 Results on the MIT Dataset

To test the robustness of these models on other data, we applied HS, Classic-C, and Classic+NL to sequences from the MIT dataset (Liu et al. 2008) and compared the estimated flow fields to the human-labeled ground truth. Note that only five of the eight test sequences of Liu et al. (2008) are available online; these are the ones tested here.

Figure 9 and Table 17 show the results on these sequences, which are very different in nature from the Middlebury set and include an outdoor scene as well as a scene of a fish tank. The results are compared with the CLG method (Bruhn et al. 2005) used by Liu et al. (2008). It is important to point out that the CLG method was tuned to obtain optimal results on these test sequences. Our methods had no such tuning; we used the same parameters as in all the other experiments. This suggests that training on the Middlebury data yields a method that generalizes to other sequences. The only place where this fails is the “Fish” sequence, where there is transparent motion in a liquid medium; the statistics of this sequence are very different from those of the Middlebury training data.

Fig. 9

Results on MIT sequences. Top to bottom “Table”, “Hand”, “Toy”, “Fish”, and “CameraMotion”

6.3 Performance on MPI Sintel and KITTI Datasets

We evaluate the methods above (corresponding to our publicly released code) on the MPI Sintel (Butler et al. 2012) and the KITTI (Geiger et al. 2012) datasets using the default parameter settings from our conference paper (Sun et al. 2010a). As summarized in Tables 18, 19, and 20, the conclusions contradict our findings reported above. On the MPI Sintel dataset, HS outperforms Classic++, which in turn outperforms Classic+NL-fast. The only consistent result is Classic+NL, which achieves the best performance. On the KITTI dataset, HS outperforms Classic+NL.

Table 17 Results on the MIT dataset (Liu et al. 2008)
Table 18 Average end-point error (EPE) on the MPI Sintel training set
Table 19 Average end-point error (EPE) on the MPI Sintel test set

We ask how these datasets differ from both Middlebury and the MIT dataset. What could lead to these inconsistent conclusions? One answer, surprisingly, lies in the unequal width and height of the images.

6.4 Asymmetric Pyramids for Wide-Aspect-Ratio Video

Our original implementation downsamples the image equally in the horizontal and vertical dimensions. The method automatically determines the number of pyramid levels from the smaller of the height and width of the input image. This scheme works well when the width-to-height ratio is close to 1, as in the Middlebury sequences. In contrast, the MPI Sintel images are \(1,024 \times 436\) and the KITTI images are around \(1,226 \times 370\). The small vertical dimension limits the number of pyramid levels, while the large horizontal dimension means that the sequences contain very large horizontal motions. As a result, at the top level of the pyramid, the horizontal motions can still be much larger than a pixel.

To address this, we use an unequal downsampling factor in each direction to ensure that the motion at the top pyramid level is small in both directions (or at least similar). For the MPI Sintel and KITTI data sets, we use a downsampling factor of 0.5 in the horizontal direction and determine the vertical downsampling factor and the number of pyramid levels so that the size of the top pyramid level is around \(16\times 16\).
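
One plausible way to compute the asymmetric pyramid parameters is sketched below for the MPI Sintel frame size; the exact rounding in our released code may differ, but this choice reproduces the 7-level pyramid discussed next.

```matlab
% Asymmetric pyramid parameters for a 1024x436 frame (illustrative).
H  = 436;  W = 1024;
dx = 0.5;                                    % horizontal downsampling factor
nLevels = 1 + floor(log(W/16) / log(1/dx));  % shrink the width to ~16 px
dy      = (16/H)^(1/(nLevels - 1));          % vertical factor for ~16 px
topW    = W * dx^(nLevels - 1);              % = 16
topH    = H * dy^(nLevels - 1);              % = 16, with nLevels = 7
```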

For MPI Sintel and KITTI this scheme results in a 7-level pyramid (instead of the 5-level pyramid produced by the standard symmetric scheme). This yields a significant improvement on both the MPI Sintel and KITTI datasets, as summarized in Tables 18, 19, and 20. We denote methods that use the new asymmetric pyramid by appending a “P” to the name.

Table 20 Percentage of pixels with EPE larger than 3 pixels in non-occluded (Out-Noc) and all (Out-All) areas, and average EPE in non-occluded (Avg-Noc) and all (Avg-All) areas, on the KITTI test set

On MPI Sintel, the results of the four methods are consistent with those on the Middlebury dataset. Note that even Classic++P outperforms the previous Classic+NL. Classic+NLP outperforms MDP-flow2 (Xu et al. 2012) on the final set, but not on the clean set. MDP-flow2 uses feature matching to deal with fast-moving objects; feature matching tends to work well on the clean set but not on the final set, due to the motion and optical blur in the latter. Figure 10 shows a visual comparison of Classic+NLP and Classic+NL. The asymmetric pyramid leads to a significant improvement in regions that undergo large motions.

Fig. 10
figure 10

Example results on the MPI Sintel dataset. From top to bottom: first frame, second frame, results of Classic+NL (5-level), results of Classic+NLP (7-level), and ground truth. The asymmetric pyramid leads to a significant improvement in large regions undergoing large motion (the head of the dragon on the left and the background on the right). EPE results: “temple2” (left), 18.04 for Classic+NL (5-level) and 12.92 for Classic+NLP (7-level); “cave2” (right), 52.208 for Classic+NL (5-level) and 26.565 for Classic+NLP (7-level). Note that the estimated motion for fast-moving objects still contains large errors

On the KITTI set, Classic++P performs best among all our tested methods, on both the training and the test sets. Note that the KITTI sequences were collected from a moving vehicle in an urban environment; the flow fields tend to be smooth with few flow boundaries, and the image-independent smoothness assumption in Classic++P is better suited to such data. Figure 11 shows some results for Classic+NL-FastP and Classic+NL-fast; note the dramatic improvement resulting from the asymmetric pyramid.

Fig. 11
figure 11

Example results on the KITTI dataset. From top to bottom: first frame, second frame, results by Classic+NL-fast (5-level), results by Classic+NL-FastP (7-level), and ground truth for the non-occluded feature points. EPE results in non-occluded sparse feature points: “000002” (left), 12.124 by Classic+NL-fast (5-level) and 2.444 by Classic+NL-FastP (7-level); “000030” (right), 20.554 by Classic+NL-fast (5-level) and 0.615 by Classic+NL-FastP (7-level)

It is important to note that, apart from the change of pyramid scheme, all other parameters remain the same and are trained on the Middlebury training sequences.

6.5 Computational Time

Table 21 summarizes the running time of the evaluated methods on typical sequences from three different datasets, in Matlab on a 64-bit Linux desktop with 8 GB of memory. The additional cost from HS to Classic++ comes from the GNC stage and the non-convex penalty function. The additional cost from Classic++ to Classic+NL comes from the weighted median filtering step applied around detected motion boundaries. Applying the weighted median operation to all pixels (Classic+NL-Full) more than triples the running time with little performance gain. Using fewer iterations (Classic+NL-Fast) significantly reduces the computational cost with little performance loss, especially on sequences with small motion. Note that we solve the weighted median problem at each pixel individually and do not reuse the sorting results from neighboring pixels. Future work should consider reformulating the weighted median filtering so that a convolution-type operation can be used to reduce the computational cost.
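For reference, the weighted median at a single pixel can be computed by sorting the neighborhood values and returning the value at which the cumulative weight first reaches half of the total weight; the sketch below (our own illustration, with hypothetical names) shows this per-pixel computation, whose repeated sorting is what makes the step expensive.

    import numpy as np

    def weighted_median(values, weights):
        """Weighted median of a 1-D neighborhood: the value at which the
        cumulative weight first reaches half of the total weight.
        A per-pixel illustration only; the interface is ours."""
        values = np.asarray(values, dtype=float)
        weights = np.asarray(weights, dtype=float)
        order = np.argsort(values)
        v, w = values[order], weights[order]
        cdf = np.cumsum(w)
        return v[np.searchsorted(cdf, 0.5 * cdf[-1])]

    # With uniform weights this reduces to the ordinary (unweighted) median.
    print(weighted_median([3.0, 1.0, 4.0, 1.0, 5.0], [1, 1, 1, 1, 1]))  # 3.0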

Table 21 Running time (in minutes) for computing one optical flow field from an image pair from different benchmark datasets using different methods, in Matlab on a 64-bit Linux desktop with 8 GB of memory

6.6 Limitations

Classic+NL produces larger errors in occlusion regions on some sequences, such as “Schefflera”, shown in Fig. 12. The classical flow formulation assumes that every pixel in the current frame has a corresponding pixel in the next frame. This assumption breaks down in regions of occlusion: pixels occluded by foreground objects in one frame have no corresponding pixels in the next, resulting in large errors for classical formulations. In contrast, a layered model (Wang and Adelson 1994) may provide a principled way to reason about occlusions. The motion model developed in this paper has enabled a recent layered approach (Sun et al. 2010b) to achieve a consistent improvement over the Classic+NL method, particularly near occlusion and motion boundary regions.

Fig. 12
figure 12

Occlusions are not explicitly modeled by Classic+NL and may cause problems in the estimated flow field. Dark pixels in the ground truth indicate occlusions

Small, fast-moving objects also cause problems for the classical coarse-to-fine estimation used by Classic+NL, as shown in Fig. 10. The work by Brox and Malik (2011) on large-displacement optical flow has inspired recent work (Chen et al. 2013; Steinbrücker et al. 2009; Xu et al. 2012) that embeds feature matching into the coarse-to-fine estimation framework. Chen et al. (2013) show that, with proper initialization, Classic+NL can also handle large-displacement optical flow on the Middlebury dataset.

7 Conclusions

When implemented using modern practices, classical optical flow formulations can produce fairly competitive results on existing datasets. To understand the techniques that help such basic formulations work well, we quantitatively studied various aspects of flow approaches from the literature, including their implementation details. Among the best practices, we found that using median filtering to denoise the flow after every warping step is key to improving accuracy, but that this increases the energy of the final result. Exploiting connections between median filtering and L1-based denoising, we showed that algorithms relying on a median filtering step are approximately optimizing a different objective that regularizes the flow field over a large spatial neighborhood. Understanding this enables us to design and optimize improved models that weight the neighbors adaptively in an extended image region. The Matlab code is publicly available at http://www.cs.brown.edu/people/dqsun; last accessed 24 July 2013.

There has been much debate about whether methods that perform well on Middlebury will generalize to other sequences. Here we tuned the parameters of the method on the Middlebury training set and tested on Middlebury, MIT HAMA, MPI Sintel, and KITTI. The conclusions on the Middlebury dataset are consistent with those on the MIT HAMA dataset. The one significant difference we found between Middlebury and the MPI Sintel and KITTI datasets was the aspect ratio of the images. This allowed us to modify the method by introducing a novel asymmetric image pyramid that downsamples more rapidly in the horizontal direction than in the vertical direction. With only this change, we found that our conclusions on Middlebury hold for MPI Sintel as well. The KITTI dataset is somewhat different in nature and seems to favor methods with more spatial smoothing. As a result, the image-independent Classic++, which produces smoother flow fields, performs slightly better than the image-dependent Classic+NL, with its sharp boundaries. It remains an open question whether these conclusions will hold for data captured under very different conditions, such as medical images. While the results on Middlebury generalize surprisingly well, we suspect that training the parameters for a specific dataset would improve results further.