Semantic Structure from Motion: A Novel Framework for Joint Object Recognition and 3D Reconstruction

  • Conference paper
Outdoor and Large-Scale Real-World Scene Analysis

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 7474)

Abstract

Conventional rigid structure from motion (SFM) addresses the problem of recovering the camera parameters (motion) and the 3D locations of scene points (structure), given observed 2D image feature points. In this chapter, we propose a new formulation called Semantic Structure From Motion (SSFM). In addition to the geometrical constraints provided by SFM, SSFM takes advantage of both the semantic and the geometrical properties associated with objects in a scene. These properties allow us to jointly estimate the structure of the scene, the camera parameters, and the 3D locations, poses, and categories of the objects in the scene. We cast this as a maximum-likelihood problem in which geometry (cameras, points, objects) and semantic information (object classes) are estimated simultaneously. The key intuition is that, in addition to image features, the measurements of objects across views provide geometrical constraints that relate the cameras and the scene parameters. These constraints make the geometry estimation process more robust and, in turn, make object detection more accurate. Our framework has the unique ability to: i) estimate camera poses from object detections alone, ii) improve camera pose estimation compared to feature-point-based SFM algorithms, and iii) improve object detection given multiple uncalibrated images, compared to detecting objects independently in single images. Extensive quantitative results on three datasets (LiDAR cars, street-view pedestrians, and Kinect office desktop) verify our theoretical claims.
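
As a rough illustration of the maximum-likelihood formulation described in the abstract, the joint objective can be sketched as follows. The symbols are assumptions introduced here for illustration, not notation confirmed by this page: C denotes the cameras, Q the 3D scene points, O the 3D objects (location, pose, and category), q the observed 2D feature points, and o the object detections in the images. The factorization into two terms further assumes that point measurements and object measurements are conditionally independent given the cameras and the scene.

```latex
\{\hat{C}, \hat{Q}, \hat{O}\}
  = \arg\max_{C,\,Q,\,O} \Pr(q, o \mid C, Q, O)
  = \arg\max_{C,\,Q,\,O} \Pr(q \mid C, Q)\,\Pr(o \mid C, O)
```

Under these assumptions, the first factor plays the role of the usual point-based SFM term, while the second is where object detections across views contribute additional constraints on the cameras.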

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bao, S.Y., Savarese, S. (2012). Semantic Structure from Motion: A Novel Framework for Joint Object Recognition and 3D Reconstruction. In: Dellaert, F., Frahm, JM., Pollefeys, M., Leal-Taixé, L., Rosenhahn, B. (eds) Outdoor and Large-Scale Real-World Scene Analysis. Lecture Notes in Computer Science, vol 7474. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34091-8_17

  • DOI: https://doi.org/10.1007/978-3-642-34091-8_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34090-1

  • Online ISBN: 978-3-642-34091-8

  • eBook Packages: Computer Science (R0)
