International Journal of Computer Vision, Volume 105, Issue 3, pp 269–297

Variational Recursive Joint Estimation of Dense Scene Structure and Camera Motion from Monocular High Speed Traffic Sequences

  • Florian Becker
  • Frank Lenzen
  • Jörg H. Kappes
  • Christoph Schnörr


We present an approach to jointly estimating camera motion and the dense structure of a static scene, represented as depth maps, from monocular image sequences in driver-assistance scenarios. At each time instant, only two consecutive frames are processed as input to a joint estimator that fully exploits second-order information of the corresponding optimization problem and copes effectively with the non-convexity caused by both the imaging geometry and the manifold of motion parameters. In addition, carefully designed Gaussian approximations enable probabilistic inference for the high-dimensional depth map estimation, based on locally varying confidence and globally varying sensitivity due to the epipolar geometry. Embedding the resulting joint estimator in an online recursive framework yields a pronounced spatio-temporal filtering effect and robustness. We evaluate the approach on hundreds of images taken from a car moving at speeds of up to 100 km/h, which are part of a publicly available benchmark data set. The results compare favorably with two alternative settings: stereo-based scene reconstruction and camera motion estimation in batch mode using multiple frames. These alternatives, however, require a calibrated camera pair or storage for more than two frames, which is less attractive from a technical viewpoint than the proposed monocular and recursive approach. In addition to real data, we consider a synthetic sequence that provides reliable ground truth.
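
As a rough illustration of the recursive, confidence-weighted filtering idea summarized above, the sketch below fuses a predicted depth map with a new two-frame depth measurement, pixel by pixel, under independent Gaussian assumptions. This is a generic precision-weighted (Kalman-style) update and not the estimator of the paper, which is variational and also estimates camera motion jointly. The function name `fuse_depth`, the per-pixel independence, and the toy values are assumptions made only for this example.

```python
import numpy as np

def fuse_depth(prior_mean, prior_var, meas_mean, meas_var):
    """Fuse two independent Gaussian depth estimates per pixel.

    All arguments are arrays of the same shape; variances must be positive.
    Returns the posterior mean and variance of the product of the two Gaussians.
    """
    prior_prec = 1.0 / prior_var          # precision = inverse variance
    meas_prec = 1.0 / meas_var
    post_var = 1.0 / (prior_prec + meas_prec)
    post_mean = post_var * (prior_prec * prior_mean + meas_prec * meas_mean)
    return post_mean, post_var

# Toy 1x2 "depth map" (hypothetical values, in meters).
prior_mean = np.array([[10.0, 20.0]])
prior_var = np.array([[4.0, 1.0]])    # second pixel: confident prediction
meas_mean = np.array([[12.0, 30.0]])
meas_var = np.array([[4.0, 9.0]])     # second pixel: unreliable measurement

post_mean, post_var = fuse_depth(prior_mean, prior_var, meas_mean, meas_var)
print(post_mean)  # posterior depth per pixel
print(post_var)   # posterior variance per pixel
```

In this toy case the first pixel, where prediction and measurement are equally uncertain, moves to the midpoint, while the second pixel stays close to the confident prediction. The posterior variance is always smaller than either input variance, which is the mechanism behind the temporal smoothing effect in any such recursive scheme.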


Keywords

  • Structure from motion
  • Variational approach
  • Recursive formulation
  • Dense depth map

Supplementary material

  • Supplementary material 1: 11263_2013_639_MOESM1_ESM.mpg (MPG, 3218 KB)
  • Supplementary material 2: 11263_2013_639_MOESM2_ESM.mpg (MPG, 1598 KB)
  • Supplementary material 3: 11263_2013_639_MOESM3_ESM.mpg (MPG, 2094 KB)
  • Supplementary material 4: 11263_2013_639_MOESM4_ESM.mpg (MPG, 3196 KB)
  • Supplementary material 5: 11263_2013_639_MOESM5_ESM.mpg (MPG, 4718 KB)
  • Supplementary material 6: 11263_2013_639_MOESM6_ESM.mpg (MPG, 1944 KB)
  • Supplementary material 7: 11263_2013_639_MOESM7_ESM.mpg (MPG, 3200 KB)


Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Florian Becker (1)
  • Frank Lenzen (1)
  • Jörg H. Kappes (2)
  • Christoph Schnörr (2)

  1. Heidelberg Collaboratory for Image Processing, University of Heidelberg, Heidelberg, Germany
  2. Image and Pattern Analysis Group, University of Heidelberg, Heidelberg, Germany
