Joint Semantic Segmentation and 3D Reconstruction from Monocular Video

  • Abhijit Kundu
  • Yin Li
  • Frank Dellaert
  • Fuxin Li
  • James M. Rehg
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8694)


We present an approach for joint inference of 3D scene structure and semantic labeling for monocular video. Starting with a monocular image stream, our framework produces a 3D volumetric semantic and occupancy map, which is much more useful than the series of 2D semantic label images or the sparse point cloud produced by traditional semantic segmentation and Structure from Motion (SfM) pipelines, respectively. We derive a Conditional Random Field (CRF) model defined in 3D space that jointly infers the semantic category and occupancy of each voxel. Joint inference in the 3D CRF paves the way for more informed priors and constraints, which are not available when the two problems are solved separately in their traditional frameworks. We make use of class-specific semantic cues that constrain the 3D structure in areas where multiview constraints are weak. Our model comprises higher order factors, which help when the depth is unobservable. We also make use of class-specific semantic cues to reduce the degree of such higher order factors or, where possible, to approximate them with unaries. We demonstrate improved 3D structure and temporally consistent semantic segmentation for difficult, large-scale, forward-moving monocular image sequences.
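As a rough illustration of what a joint label space over voxels can look like, the Python sketch below scores a labeling in which each voxel is either free space or one of K semantic classes. It is not the model from the paper. The label encoding (index 0 for free space), the Potts smoothness term over 6-connected neighbours, and the names `joint_energy`, `occupancy`, and `pairwise_weight` are all assumptions made for this example, and the higher order factors mentioned in the abstract are left out.

```python
import numpy as np

FREE = 0  # assumed encoding: label 0 is empty space, labels 1..K are semantic classes


def joint_energy(labels, unary, pairwise_weight=1.0):
    """Energy of a joint occupancy + semantic labeling on a 3D voxel grid.

    labels: int array of shape (X, Y, Z) with values in 0..K
    unary:  float array of shape (X, Y, Z, K + 1), cost of each label per voxel
    """
    # Unary term: look up the cost of the chosen label at every voxel.
    e_unary = np.take_along_axis(unary, labels[..., None], axis=-1).sum()

    # Pairwise Potts term (an assumption of this sketch): count neighbouring
    # voxel pairs along each grid axis whose labels disagree.
    e_pair = 0
    for axis in range(3):
        a = np.swapaxes(labels, 0, axis)
        e_pair += np.count_nonzero(a[1:] != a[:-1])

    return float(e_unary + pairwise_weight * e_pair)


def occupancy(labels):
    """Occupancy follows from the joint label: any non-free voxel is occupied."""
    return labels != FREE
```

Because occupancy is read directly off the semantic label, a single labeling answers both questions at once, which is the sense in which the two problems share one set of variables in this toy version.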


Conditional Random Field; High Order Factor; Structure From Motion; Semantic Label; Conditional Random Field Model
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Supplementary material

Electronic Supplementary Material: 978-3-319-10599-4_45_MOESM1_ESM.pdf (PDF, 142 KB)



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Abhijit Kundu (1)
  • Yin Li (1)
  • Frank Dellaert (1)
  • Fuxin Li (1)
  • James M. Rehg (1)

  1. Georgia Institute of Technology, Atlanta, USA
