Semantic Structure from Motion: A Novel Framework for Joint Object Recognition and 3D Reconstruction

Bao, Sid Yingze; Savarese, Silvio

doi:10.1007/978-3-642-34091-8_17

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 7474)

Abstract

Conventional rigid structure from motion (SFM) addresses the problem of recovering the camera parameters (motion) and the 3D locations (structure) of scene points, given observed 2D image feature points. In this chapter, we propose a new formulation called Semantic Structure From Motion (SSFM). In addition to the geometrical constraints provided by SFM, SSFM takes advantage of both the semantic and the geometrical properties associated with objects in a scene. These properties allow us to jointly estimate the structure of the scene, the camera parameters, and the 3D locations, poses, and categories of the objects in the scene. We cast this as a maximum-likelihood problem in which geometry (cameras, points, objects) and semantic information (object classes) are estimated simultaneously. The key intuition is that, in addition to image features, the measurements of objects across views provide geometrical constraints that relate cameras and scene parameters. These constraints make the geometry estimation process more robust and, in turn, make object detection more accurate. Our framework has the unique ability to: i) estimate camera poses from object detections alone, ii) improve camera pose estimation compared to feature-point-based SFM algorithms, and iii) improve object detection given multiple uncalibrated images, compared to detecting objects independently in single images. Extensive quantitative results on three datasets (LiDAR cars, street-view pedestrians, and Kinect office desktop) verify our theoretical claims.
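
The joint maximum-likelihood idea in the abstract can be illustrated with a toy sketch. This is not the chapter's actual likelihood: it assumes Gaussian measurement noise, cameras with identity rotation, and objects reduced to their 3D centres (no pose, scale, or category). All names (project, joint_nll, sigma_q, sigma_o) and the scene values are hypothetical. The sketch only shows that object detections add a second term that constrains the cameras, alongside the usual point reprojection term.

```python
def project(point, cam_center, focal=1.0):
    """Pinhole projection for a camera with identity rotation (toy assumption)."""
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (focal * x / z, focal * y / z)


def sq_err(a, b):
    """Squared Euclidean distance between two 2D image locations."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def joint_nll(cams, points, objects, point_obs, object_obs,
              sigma_q=1.0, sigma_o=1.0):
    """Negative log-likelihood (up to a constant) of all measurements.

    point_obs[(k, i)]  : observed 2D location of point i in camera k
    object_obs[(k, j)] : observed 2D detection centre of object j in camera k
    Gaussian noise with standard deviations sigma_q, sigma_o is an assumption.
    """
    nll = 0.0
    # Point term: the reprojection error used in conventional SFM.
    for (k, i), q in point_obs.items():
        nll += sq_err(project(points[i], cams[k]), q) / (2 * sigma_q ** 2)
    # Object term: extra constraint from object detections across views.
    for (k, j), o in object_obs.items():
        nll += sq_err(project(objects[j], cams[k]), o) / (2 * sigma_o ** 2)
    return nll


# Hypothetical scene: two cameras, two feature points, one object centre.
true_cams = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
points = [(0.5, 0.2, 4.0), (-0.3, 0.1, 5.0)]
objects = [(0.2, -0.1, 6.0)]

# Noise-free measurements generated from the true configuration.
point_obs = {(k, i): project(p, c)
             for k, c in enumerate(true_cams)
             for i, p in enumerate(points)}
object_obs = {(k, j): project(o, c)
              for k, c in enumerate(true_cams)
              for j, o in enumerate(objects)}

# A wrong hypothesis for the second camera's position.
wrong_cams = [(0.0, 0.0, 0.0), (1.3, 0.0, 0.0)]

nll_true = joint_nll(true_cams, points, objects, point_obs, object_obs)
nll_wrong = joint_nll(wrong_cams, points, objects, point_obs, object_obs)
# With no point measurements at all, the object term still penalises
# the wrong camera hypothesis.
object_only_wrong = joint_nll(wrong_cams, points, objects, {}, object_obs)
```

In a full estimator the cameras, points, and objects would all be unknowns searched over jointly; here they are fixed so that the two likelihood terms can be compared directly.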

Author information

Authors and Affiliations

The University of Michigan, Ann Arbor, MI, USA
Sid Yingze Bao & Silvio Savarese

Editor information

Editors and Affiliations

College of Computing Building, Georgia Institute of Technology, 801 Atlantic Drive, 30332-0280, Atlanta, GA, USA
Frank Dellaert
Department of Computer Science, University of North Carolina at Chapel Hill, Brooks Computer Science Building, CB#3175, 27599, Chapel Hill, NC, USA
Jan-Michael Frahm
CVG - Institute of Visual Computing, Department of Computer Science, ETH Zurich, CNB G105, Universitaetstrasse 6, 8092, Zurich, Switzerland
Marc Pollefeys
Institute for Information Processing (TNT), Leibniz Universität, Appelstr. 9A, 30167, Hannover, Germany
Laura Leal-Taixé
Institute for Information Processing (TNT), Leibniz Universität, Appelstr. 9A, 30167, Hannover, Germany
Bodo Rosenhahn

About this paper

Cite this paper

Bao, S.Y., Savarese, S. (2012). Semantic Structure from Motion: A Novel Framework for Joint Object Recognition and 3D Reconstruction. In: Dellaert, F., Frahm, JM., Pollefeys, M., Leal-Taixé, L., Rosenhahn, B. (eds) Outdoor and Large-Scale Real-World Scene Analysis. Lecture Notes in Computer Science, vol 7474. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34091-8_17

DOI: https://doi.org/10.1007/978-3-642-34091-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34090-1
Online ISBN: 978-3-642-34091-8
eBook Packages: Computer Science, Computer Science (R0)
