Towards Scene Understanding with Detailed 3D Object Representations

International Journal of Computer Vision

Abstract

Current approaches to semantic image and scene understanding typically employ rather simple object representations such as 2D or 3D bounding boxes. While such coarse models are robust and allow for reliable object detection, they discard much of the information about objects’ 3D shape and pose, and thus do not lend themselves well to higher-level reasoning. Here, we propose to base scene understanding on a high-resolution object representation. An object class—in our case cars—is modeled as a deformable 3D wireframe, which enables fine-grained modeling at the level of individual vertices and faces. We augment that model to explicitly include vertex-level occlusion, and embed all instances in a common coordinate frame, in order to infer and exploit object-object interactions. Specifically, from a single view we jointly estimate the shapes and poses of multiple objects in a common 3D frame. A ground plane in that frame is estimated by consensus among different objects, which significantly stabilizes monocular 3D pose estimation. The fine-grained model, in conjunction with the explicit 3D scene model, further allows one to infer part-level occlusions between the modeled objects, as well as occlusions by other, unmodeled scene elements. To demonstrate the benefits of such detailed object class models in the context of scene understanding we systematically evaluate our approach on the challenging KITTI street scene dataset. The experiments show that the model’s ability to utilize image evidence at the level of individual parts improves monocular 3D pose estimation w.r.t. both location and (continuous) viewpoint.
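The idea of estimating a ground plane by consensus among several objects can be illustrated with a small sketch. This is not the authors' estimation procedure, which the abstract does not spell out; it is a generic exhaustive consensus fit under stated assumptions. Each object is assumed to contribute one hypothetical 3D "base point" (for example, the bottom center of its wireframe) in the common frame, with the second coordinate taken as height. The function names and the tolerance `tol` are illustrative only.

```python
from itertools import combinations
from math import sqrt


def plane_from_points(p, q, r):
    """Plane through three points as (unit normal n, offset d) with n . x = d.

    Returns None when the points are (nearly) collinear.
    """
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    # Cross product u x v gives the plane normal.
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sqrt(sum(c * c for c in n))
    if norm < 1e-9:
        return None
    n = [c / norm for c in n]
    d = sum(n[i] * p[i] for i in range(3))
    return n, d


def consensus_ground_plane(base_points, tol=0.1):
    """Pick the plane through three base points that the most points agree with.

    base_points: hypothetical per-object 3D points assumed to lie on the ground.
    tol: assumed maximum point-to-plane distance for a point to count as agreeing.
    """
    best_plane, best_inliers = None, []
    for p, q, r in combinations(base_points, 3):
        plane = plane_from_points(p, q, r)
        if plane is None:
            continue
        n, d = plane
        inliers = [x for x in base_points
                   if abs(sum(n[i] * x[i] for i in range(3)) - d) <= tol]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Four objects resting at height 0 and one badly estimated object at height 1.5.
points = [(0.0, 0.0, 5.0), (2.0, 0.0, 8.0), (-3.0, 0.0, 12.0),
          (1.0, 0.0, 20.0), (4.0, 1.5, 10.0)]
plane, inliers = consensus_ground_plane(points)
```

In this toy setting the poorly localized object is outvoted by the other four, which mirrors the intuition that agreement among several objects can stabilize a single unreliable monocular estimate.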

Notes

  1. While in the earlier work they were scaled to the same size, so as to keep the deformations from the mean shape small.

  2. In practice this amounts to a look-up in the precomputed response maps.

  3. Note, there is no 3D counterpart to this part-level evaluation, since we see no way to obtain sufficiently accurate 3D part annotations.
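Note 2 mentions a look-up in precomputed response maps. The surrounding text is not available here, so the following is only a sketch of what such a look-up could look like, assuming that each part has a 2D grid of detector scores computed once per image and that a hypothesis is scored by reading those grids at the projected part locations. The function name, the dictionary layout, and the handling of occluded parts are all assumptions made for illustration.

```python
def lookup_part_scores(response_maps, projected_parts, visible):
    """Sum precomputed detector responses at projected part locations.

    response_maps: hypothetical dict, part name -> 2D list of scores [row][col].
    projected_parts: hypothetical dict, part name -> (u, v) image coordinates.
    visible: hypothetical dict, part name -> bool (False means occluded).
    """
    total = 0.0
    for part, (u, v) in projected_parts.items():
        if not visible.get(part, False):
            continue  # assumed: occluded parts contribute no image evidence
        grid = response_maps[part]
        row, col = int(round(v)), int(round(u))
        # Parts projected outside the image are skipped rather than scored.
        if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
            total += grid[row][col]
    return total


maps = {"wheel": [[0.1, 0.2], [0.3, 0.9]],
        "mirror": [[0.5, 0.0], [0.0, 0.0]]}
projected = {"wheel": (1.2, 0.8), "mirror": (0.0, 0.0)}
score = lookup_part_scores(maps, projected, {"wheel": True, "mirror": False})
```

Because the maps are computed once, evaluating a new shape or pose hypothesis costs only a handful of array reads, which is what makes repeated evaluation during inference affordable.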

Acknowledgments

This work has been supported by the Max Planck Center for Visual Computing & Communication.

Author information

Correspondence to M. Zeeshan Zia.

Additional information

Communicated by Derek Hoiem, James Hays, Jianxiong Xiao and Aditya Khosla.

About this article

Cite this article

Zia, M.Z., Stark, M. & Schindler, K. Towards Scene Understanding with Detailed 3D Object Representations. Int J Comput Vis 112, 188–203 (2015). https://doi.org/10.1007/s11263-014-0780-y

  • DOI: https://doi.org/10.1007/s11263-014-0780-y
