
3D Scene Flow Estimation with a Piecewise Rigid Scene Model

International Journal of Computer Vision

Abstract

3D scene flow estimation aims to jointly recover dense geometry and 3D motion from stereoscopic image sequences, thus generalizing classical disparity and 2D optical flow estimation. To realize its conceptual benefits and overcome the limitations of many existing methods, we propose to represent the dynamic scene as a collection of rigidly moving planes, into which the input images are segmented. Geometry and 3D motion are then jointly recovered alongside an over-segmentation of the scene. This piecewise rigid scene model is significantly more parsimonious than conventional pixel-based representations, yet retains the ability to represent real-world scenes with independent object motion. Furthermore, it enables us to define suitable scene priors, perform occlusion reasoning, and leverage discrete optimization schemes toward stable and accurate results. Assuming that the rigid motion persists approximately over time additionally enables us to incorporate multiple frames into the inference. To that end, each view holds its own representation, which is encouraged to be consistent across all other viewpoints and frames in a temporal window. We show that such a view-consistent multi-frame scheme significantly improves accuracy, especially in the presence of occlusions, and increases robustness against adverse imaging conditions. Our method currently achieves leading performance on the KITTI benchmark, for both optical flow and stereo.


Notes

  1. “Left” and “right” are only used for intuition and do not necessarily correspond to the geometric configuration of the rig.

  2. Compared to KITTI images, the less challenging lighting conditions allow us to refrain from using our usual census data cost. Instead, we use brightness constancy with \(\rho (a,b) \!=\! \min (|a\!-\!b|,\zeta )\), truncated at \(\zeta \!=\!10\%\) of the intensity range (see the sketch below). We use 3D regularization with rather aggressive truncation parameters (\(\eta _G\!=\!\eta _M\!=\!1\)). The other deviating parameters were set to \(\lambda \!=\!0.1\), \(\mu \!=\!0.1\), \(\theta _\text {occ}\!=\!0.03\).
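
The following is a minimal sketch of this truncated data cost, assuming intensities normalized to \([0,1]\) so that \(\zeta = 0.1\); the function name and the sample values are purely illustrative.

```python
import numpy as np

# Truncated brightness-constancy penalty from Note 2:
#   rho(a, b) = min(|a - b|, zeta), with zeta = 10% of the intensity range.
# Assumes intensities normalized to [0, 1].
def rho(a, b, zeta=0.1):
    return np.minimum(np.abs(a - b), zeta)

# Residuals above zeta are clipped, limiting the influence of occlusions
# and violations of brightness constancy on the data cost.
print(rho(np.array([0.50, 0.50]), np.array([0.52, 0.90])))  # -> [0.02 0.1 ]
```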

References

  • Adiv, G. (1985). Determining three-dimensional motion and structure from optical flow generated by several moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(4), 384–401.

  • Ali, A. M., Farag, A. A., & Gimel’Farb, G. L. (2008). Optimizing binary MRFs with higher order cliques. In European Conference on Computer Vision.

  • Badino, H., & Kanade, T. (2011). A head-wearable short-baseline stereo system for the simultaneous estimation of structure and motion. In IAPR Conference on Machine Vision Application (pp 185–189).

  • Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. J., & Szeliski, R. (2011). A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1), 1–31. vision.middlebury.edu/flow

  • Barnes, C., Shechtman, E., Finkelstein, A., & Goldman, D. B. (2009). PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics, 28(3), 24:1–24:11.

  • Basha, T., Moses, Y., & Kiryati, N. (2010). Multi-view scene flow estimation: A view centered variational approach. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Black, M. J., & Anandan, P. (1991). Robust dynamic motion estimation over time. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Bleyer, M., Rother, C., & Kohli, P. (2010). Surface stereo with soft segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Bleyer, M., Rhemann, C., & Rother, C. (2011a). PatchMatch stereo: Stereo matching with slanted support windows. In British Machine Vision Conference.

  • Bleyer, M., Rother, C., Kohli, P., Scharstein, D., & Sinha, S. N. (2011b). Object stereo: Joint stereo matching and object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Braux-Zin, J., Dupont, R., & Bartoli, A. (2013). A general dense image matching framework combining direct and feature-based costs. In IEEE International Conference on Computer Vision.

  • Brox, T., & Malik, J. (2011). Large displacement optical flow: Descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3), 500–513.

  • Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision.

  • Carceroni, R. L., & Kutulakos, K. N. (2002). Multi-view scene capture by surfel sampling: From video streams to non-rigid 3D motion, shape and reflectance. International Journal of Computer Vision, 49, 175–214.

  • Courchay, J., Pons, J. P., Monasse, P., & Keriven, R. (2009). Dense and accurate spatio-temporal multi-view stereovision. In Asian Conference on Computer Vision.

  • Demetz, O., Stoll, M., Volz, S., Weickert, J., & Bruhn, A. (2014). Learning brightness transfer functions for the joint recovery of illumination changes and optical flow. In European Conference on Computer Vision.

  • Devernay, F., Mateus, D., & Guilbert, M. (2006). Multi-camera scene flow by tracking 3-D points and surfels. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Einecke, N., & Eggert, J. (2014). Block-matching stereo with relaxed fronto-parallel assumption. In IEEE Intelligent Vehicles Symposium Proceedings (pp 700–705).

  • Furukawa, Y., & Ponce, J. (2008). Dense 3D motion capture from synchronized video streams. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Garg, R., Roussos, A., & Agapito, L. (2013). A variational approach to video registration with subspace constraints. International Journal of Computer Vision, 104(3), 286–314.

  • Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? In IEEE Conference on Computer Vision and Pattern Recognition. www.cvlibs.net/datasets/kitti/.

  • Gorelick, L., Veksler, O., Boykov, Y., Ben Ayed, I., & Delong, A. (2014). Local submodular approximations for binary pairwise energies. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Hirschmüller, H. (2008). Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 328–341.

  • Hornacek, M., Fitzgibbon, A., & Rother, C. (2014). SphereFlow: 6 DoF scene flow from RGB-D pairs. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Huguet, F., & Devernay, F. (2007). A variational method for scene flow estimation from stereo sequences. In IEEE International Conference on Computer Vision.

  • Hung, C. H., Xu, L., & Jia, J. (2013). Consistent binocular depth and scene flow with chained temporal profiles. International Journal of Computer Vision, 102(1–3), 271–292.

  • Irani, M. (2002). Multi-frame correspondence estimation using subspace constraints. International Journal of Computer Vision, 48(3), 173–194.

  • Ishikawa, H. (2009). Higher-order clique reduction in binary graph cut. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Kolmogorov, V., & Zabih, R. (2001). Computing visual correspondence with occlusions using graph cuts. In IEEE International Conference on Computer Vision (pp 508–515).

  • Lempitsky, V., Roth, S., & Rother, C. (2008). FusionFlow: Discrete-continuous optimization for optical flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Lempitsky, V., Rother, C., Roth, S., & Blake, A. (2010). Fusion moves for Markov random field optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8), 1392–1405.

  • Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. International Joint Conference on Artificial Intelligence, 81, 674–679.

  • Meister, S., Jähne, B., & Kondermann, D. (2012). Outdoor stereo camera system for the generation of real-world benchmark data sets. Optical Engineering, 51(2), 021107-1.

  • Müller, T., Rannacher, J., Rabe, C., & Franke, U. (2011). Feature- and depth-supported modified total variation optical flow for 3D motion field estimation in real scenes. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Murray, D. W., & Buxton, B. F. (1987). Scene segmentation from visual motion using global optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(2), 220–228.

  • Nir, T., Bruckstein, A., & Kimmel, R. (2008). Over-parameterized variational optical flow. International Journal of Computer Vision, 76(2), 205–216.

  • Park, J., Oh, T. H., Jung, J., Tai, Y. W., & Kweon, I. S. (2012). A tensor voting approach for multi-view 3D scene flow estimation and refinement. In European Conference on Computer Vision.

  • Rabe, C., Müller, T., Wedel, A., & Franke, U. (2010). Dense, robust, and accurate motion field estimation from stereo image sequences in real-time. In European Conference on Computer Vision.

  • Ranftl, R., Pock, T., & Bischof, H. (2013). Minimizing TGV-based variational models with non-convex data terms. In International Conference on Scale Space and Variational Methods in Computer Vision.

  • Ranftl, R., Bredies, K., & Pock, T. (2014). Non-local total generalized variation for optical flow estimation. In European Conference on Computer Vision.

  • Rother, C., Kolmogorov, V., Lempitsky, V., & Szummer, M. (2007). Optimizing binary MRFs via extended roof duality. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Rother, C., Kohli, P., Feng, W., & Jia, J. (2009). Minimizing sparse higher order energy functions of discrete variables. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Schoenemann, T., & Cremers, D. (2008). High resolution motion layer decomposition using dual-space graph cuts. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Spangenberg, R., Langner, T., & Rojas, R. (2013). Weighted semi-global matching and center-symmetric census transform for robust driver assistance. In International Conference on Computer Analysis of Images and Patterns.

  • Sun, D., Sudderth, E. B., & Black, M. J. (2010). Layered image motion with explicit occlusions, temporal consistency, and depth ordering. In Conference on Neural Information Processing Systems.

  • Sun, D., Wulff, J., Sudderth, E., Pfister, H., & Black, M. (2013). A fully-connected layered model of foreground and background flow. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Tao, H., & Sawhney, H. S. (2000). Global matching criterion and color segmentation based stereo. In IEEE Workshop on Applications of Computer Vision.

  • Unger, M., Werlberger, M., Pock, T., & Bischof, H. (2012). Joint motion estimation and segmentation of complex scenes with label costs and occlusion modeling. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Valgaerts, L., Bruhn, A., Zimmer, H., Weickert, J., Stoll, C., & Theobalt, C. (2010). Joint estimation of motion, structure and geometry from stereo sequences. In European Conference on Computer Vision.

  • Vaudrey, T., Rabe, C., Klette, R., & Milburn, J. (2008). Differences between stereo and motion behaviour on synthetic and real-world stereo sequences. In International Conference on Image and Vision Computing New Zealand.

  • Vedula, S., Baker, S., Collins, R., Kanade, T., & Rander, P. (1999). Three-dimensional scene flow. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Veksler, O., Boykov, Y., & Mehrani, P. (2010). Superpixels and supervoxels in an energy optimization framework. In European Conference on Computer Vision.

  • Vogel, C., Schindler, K., & Roth, S. (2011). 3D scene flow estimation with a rigid motion prior. In IEEE International Conference on Computer Vision.

  • Vogel, C., Roth, S., & Schindler, K. (2013a). An evaluation of data costs for optical flow. In Pattern Recognition (Proc. of GCPR) (pp 343–353).

  • Vogel, C., Schindler, K., & Roth, S. (2013b). Piecewise rigid scene flow. In IEEE International Conference on Computer Vision.

  • Vogel, C., Roth, S., & Schindler, K. (2014). View-consistent 3D scene flow estimation over multiple frames. In European Conference on Computer Vision.

  • Volz, S., Bruhn, A., Valgaerts, L., & Zimmer, H. (2011). Modeling temporal coherence for optical flow. In IEEE International Conference on Computer Vision.

  • Wang, J. Y. A., & Adelson, E. H. (1994). Representing moving images with layers. IEEE Transactions on Image Processing, 3, 625–638.

  • Wedel, A., Rabe, C., Vaudrey, T., Brox, T., Franke, U., & Cremers, D. (2008). Efficient dense scene flow from sparse or dense stereo data. In European Conference on Computer Vision.

  • Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D., & Bischof, H. (2009). Anisotropic Huber-L1 optical flow. In British Machine Vision Conference.

  • Yamaguchi, K., Hazan, T., McAllester, D., & Urtasun, R. (2012). Continuous Markov random fields for robust stereo estimation. In European Conference on Computer Vision.

  • Yamaguchi, K., McAllester, D., & Urtasun, R. (2013). Robust monocular epipolar flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Yamaguchi, K., McAllester, D., & Urtasun, R. (2014). Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In European Conference on Computer Vision.

  • Zabih, R., & Woodfill, J. (1994). Non-parametric local transforms for computing visual correspondence. In European Conference on Computer Vision.


Acknowledgments

SR was supported in part by the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement No. 307942, as well as by the EU FP7 project “Harvest4D” (No. 323567).

Author information

Correspondence to Christoph Vogel.

Additional information

Communicated by Phil Torr, Steve Seitz, Yi Ma and Kiriakos Kutulakos.

Appendix: Higher-Order Reductions for Occlusion Handling with a Reference View

Here we describe how to convert the occlusion-sensitive data term from Eq. (17) into a quadratic pseudo-Boolean function. Note that the only interesting case is \(|\fancyscript{O}_\mathbf {p}^0| \!\ge \! 2\), that is, there are two or more potentially occluding pixels. Otherwise, the problem is already in quadratic form \((|\fancyscript{O}_\mathbf {p}^0| \!=\! 1)\), or there is no occluding pixel at all and only the (unary) data term is required \((|\fancyscript{O}_\mathbf {p}^0| \!=\! 0)\).

Recall that Eq. (17) is defined as part of a single \(\alpha \)-expansion step, i.e., a pixel can only be assigned one of two labels (\(\alpha \) or its previous label). For simplicity we restrict the analysis to the case \(i \!=\! 0\). We thus consider the term

$$\begin{aligned} \hat{u}^0_{\mathbf {p}} [x_\mathbf {p}=0] \!\! \prod _{(\mathbf {q},j) \in \fancyscript{O}_\mathbf {p}^0} \!\! [x_\mathbf {q}\ne j]. \end{aligned}$$
(25)

The reduction for \(i \!=\! 1\) is analogous.

First, let us consider the special case in which some pixel \(\mathbf {q}\) occludes pixel \(\mathbf {p}\) under both possible assignments of \(x_{\mathbf {q}}\), i.e., \((\mathbf {q},0) \in \fancyscript{O}_\mathbf {p}^0\) and \((\mathbf {q},1) \in \fancyscript{O}_\mathbf {p}^0\). In that case, pixel \(\mathbf {p}\) is always occluded and Eq. (25) vanishes. For the remaining cases, we distinguish between \(\hat{u}^0_{\mathbf {p}} \!<\! 0\) and \(\hat{u}^0_{\mathbf {p}} \!>\! 0\).
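
Before turning to the two cases, the special case above can be caught by a simple pre-processing test. The following sketch is our own illustration, not part of the paper; it assumes \(\fancyscript{O}_\mathbf {p}^0\) is given as a set of \((\mathbf {q},j)\) pairs, and the helper name is hypothetical.

```python
# Sketch: if some pixel q occludes p under both of its possible labels,
# i.e. (q,0) and (q,1) are both in O_p^0, the product in Eq. (25) is
# identically zero and the whole term can be dropped.
def term_vanishes(occluders):
    """occluders: set of (q, j) pairs representing O_p^0."""
    return any((q, 1 - j) in occluders for (q, j) in occluders)

assert term_vanishes({("q1", 0), ("q1", 1), ("q2", 0)})
assert not term_vanishes({("q1", 0), ("q2", 1)})
```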

Case \(\hat{u}^0_{\mathbf {p}} <0\): We can substitute the whole term with the help of at most one non-submodular term of weight \(\hat{u}^0_{\mathbf {p}}\). No non-submodular term is introduced if all Boolean variables in the term appear inverted, i.e. \(j\equiv 1\). In that case Eq. (25) becomes

$$\begin{aligned} \hat{u}^0_{\mathbf {p}} (1-x_\mathbf {p}) \!\! \prod _{(\mathbf {q},1) \in \fancyscript{O}_\mathbf {p}^0} \!\!(1-x_\mathbf {q}). \end{aligned}$$
(26)

Introducing an additional variable \(z\), the polynomial in Eq. (26) can be replaced by

$$\begin{aligned} \begin{aligned} \min _{z} \hat{u}^0_{\mathbf {p}} \Bigg ( 1-z - (1-z)x_\mathbf {p} - \!\!\!\! \sum _{(\mathbf {q},1)\in \fancyscript{O}_\mathbf {p}^0}\!\! (1-z)x_\mathbf {q} \Bigg ) \end{aligned} \end{aligned}$$
(27)

in quadratic form.

If \(x_\mathbf {p} \!=\! 0\) and the other variables encode a constellation in which \(\mathbf {p}\) is not occluded, then the expression equals \(\hat{u}^0_{\mathbf {p}}\) (by setting \(z \!=\! 0\)). Otherwise, the minimum of \(0\) is attained (with \(z \!=\! 1\)).
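
Because all variables are Boolean, the reduction can be verified by exhaustive enumeration. The following sketch is our own check, not from the paper; it confirms that the quadratic form of Eq. (27) reproduces the higher-order term of Eq. (26) on all assignments, for a hypothetical \(\hat{u}^0_{\mathbf {p}} = -1\) and three occluders.

```python
from itertools import product

u_hat = -1.0   # hypothetical negative data cost
n_occ = 3      # number of potential occluders, all with j = 1

for x in product((0, 1), repeat=n_occ + 1):   # x[0] = x_p, x[1:] = x_q
    higher_order = u_hat * (1 - x[0]) * all(1 - xq for xq in x[1:])  # Eq. (26)
    quadratic = min(                                                 # Eq. (27)
        u_hat * ((1 - z) - (1 - z) * x[0] - sum((1 - z) * xq for xq in x[1:]))
        for z in (0, 1)
    )
    assert higher_order == quadratic
print("Eq. (27) matches Eq. (26) on all assignments.")
```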

If instead there is some \((\mathbf {q},0)\in \fancyscript{O}_\mathbf {p}^0\), we follow the scheme introduced by Rother et al. (2009). Introducing two auxiliary variables \(z_0,z_1\), we replace the product in Eq. (25) by

$$\begin{aligned} \begin{aligned} \min _{z_0,z_1}&- \hat{u}^0_{\mathbf {p}}\big ( z_0 z_1 - z_1 +(1-z_0)x_\mathbf {p}\big ) \\&- \hat{u}^0_{\mathbf {p}} \!\!\!\sum _{(\mathbf {q},0)\in \fancyscript{O}_\mathbf {p}^0}\!\!\! z_1(1-x_\mathbf {q}) \;-\; \hat{u}^0_{\mathbf {p}} \!\!\!\sum _{(\mathbf {q},1)\in \fancyscript{O}_\mathbf {p}^0}\!\!\! (1-z_0)\, x_\mathbf {q}. \end{aligned} \end{aligned}$$
(28)

Here, the term \(-\hat{u}^0_{\mathbf {p}} z_0 z_1\) is the only one that is not submodular. As in the previous case, if the variables do not encode an occlusion and \(x_\mathbf {p}\!=\!0\), the minimum is \(\hat{u}^0_{\mathbf {p}}\) (attained at \(z_0\!=\!0\), \(z_1\!=\!1\)). Otherwise the minimum is 0 (attained at \(z_0\!=\!1\), \(z_1\!=\!0\)).
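
Again, only Boolean assignments are involved, so the reduction can be checked exhaustively. The sketch below is our own verification, not part of the paper; it confirms that Eq. (28) agrees with Eq. (25) for a hypothetical occluder set containing both label cases.

```python
from itertools import product

u_hat = -1.0                    # hypothetical negative data cost
occ = [(1, 0), (2, 1), (3, 1)]  # O_p^0 as (index into x, j); x[0] = x_p

for x in product((0, 1), repeat=4):
    higher_order = u_hat * (1 - x[0]) * all(x[q] != j for q, j in occ)  # Eq. (25)
    quadratic = min(                                                    # Eq. (28)
        -u_hat * (z0 * z1 - z1 + (1 - z0) * x[0])
        - u_hat * sum(z1 * (1 - x[q]) for q, j in occ if j == 0)
        - u_hat * sum((1 - z0) * x[q] for q, j in occ if j == 1)
        for z0, z1 in product((0, 1), repeat=2)
    )
    assert higher_order == quadratic
print("Eq. (28) matches Eq. (25) on all assignments.")
```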

Case \(\hat{u}^0_{\mathbf {p}} >0\): We approach this problem using a series of substitutions. Following Ali et al. (2008), we replace a product of two variables in Eq. (25), \(x_{\mathbf {q}_1} x_{\mathbf {q}_2}\), with a new variable \(z\), and add

$$\begin{aligned} \min _z \hat{u}^0_{\mathbf {p}} (x_{\mathbf {q}_1} x_{\mathbf {q}_2} - 2 x_{\mathbf {q}_1} z - 2 x_{\mathbf {q}_2} z + 3 z), \end{aligned}$$
(29)

such that after the substitution Eq. (25) becomes

$$\begin{aligned} \begin{aligned}&\hat{u}^0_{\mathbf {p}} (x_{\mathbf {q}_1} x_{\mathbf {q}_2} - 2 x_{\mathbf {q}_1} z - 2 x_{\mathbf {q}_2} z + 3 z ) + \\&\hat{u}^0_{\mathbf {p}} (1-x_\mathbf {p}) z \!\! \!\!\!\! \prod _{\begin{array}{c} (\mathbf {q},j) \in \fancyscript{O}_\mathbf {p}^0 \setminus \\ \{(\mathbf {q}_1,0), (\mathbf {q}_2,0)\} \end{array}} \!\!\!\! [x_\mathbf {q}\ne j]. \end{aligned} \end{aligned}$$
(30)

Two inverted Boolean variables can be replaced in the same manner. Note that we are not restricted to replacing only variables from \(\fancyscript{O}_\mathbf {p}^0\), but can also substitute \(1-x_\mathbf {p}\) itself.

The substitution introduces one non-submodular term with weight \(\hat{u}^0_{\mathbf {p}}\). To arrive at a quadratic polynomial, one needs to replace all but two literals of the product in this manner; for a product of \(n\) literals this requires \(n-2\) substitutions and hence yields \(n-2\) non-submodular terms, or \(n-1\) if the remaining quadratic term is itself non-submodular.
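
The following sketch (again our own check, not from the paper) illustrates the substitution on the smallest non-trivial example, a term with three literals: minimizing the penalty of Eq. (29) plus the substituted remainder of Eq. (30) over the auxiliary \(z\) recovers the original product, for a hypothetical \(\hat{u}^0_{\mathbf {p}} > 0\).

```python
from itertools import product

u_hat = 1.0  # hypothetical positive data cost

# Term: u_hat * (1 - x_p) * x_q1 * x_q2  (both occluders with j = 0).
# Substituting z for the pair x_q1 * x_q2 leaves the minimum unchanged.
for xp, x1, x2 in product((0, 1), repeat=3):
    higher_order = u_hat * (1 - xp) * x1 * x2
    quadratic = min(
        u_hat * (x1 * x2 - 2 * x1 * z - 2 * x2 * z + 3 * z)  # penalty, Eq. (29)
        + u_hat * (1 - xp) * z                               # remainder, Eq. (30)
        for z in (0, 1)
    )
    assert higher_order == quadratic
print("Pairwise substitution preserves the minimum.")
```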


Cite this article

Vogel, C., Schindler, K. & Roth, S. 3D Scene Flow Estimation with a Piecewise Rigid Scene Model. Int J Comput Vis 115, 1–28 (2015). https://doi.org/10.1007/s11263-015-0806-0
