Abstract
Conventional rigid structure from motion (SFM) addresses the problem of recovering the camera parameters (motion) and the 3D locations (structure) of scene points, given observed 2D image feature points. In this chapter, we propose a new formulation called Semantic Structure From Motion (SSFM). In addition to the geometrical constraints provided by SFM, SSFM takes advantage of both semantic and geometrical properties associated with objects in a scene. These properties allow us to jointly estimate the structure of the scene, the camera parameters, and the 3D locations, poses, and categories of the objects in the scene. We cast this as a maximum-likelihood problem in which geometry (cameras, points, objects) and semantic information (object classes) are estimated simultaneously. The key intuition is that, in addition to image features, the measurements of objects across views provide geometrical constraints that relate camera and scene parameters. These constraints make the geometry estimation process more robust and, in turn, make object detection more accurate. Our framework has the unique ability to: i) estimate camera poses from object detections alone, ii) improve camera pose estimation compared to feature-point-based SFM algorithms, and iii) improve object detection given multiple uncalibrated images, compared to detecting objects independently in single images. Extensive quantitative results on three datasets (LiDAR cars, street-view pedestrians, and Kinect office desktop) verify our theoretical claims.
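The maximum-likelihood formulation described above can be written down roughly as follows. This is only a sketch: the symbols are assumed notation introduced here for illustration and do not appear in the abstract, and the factorization in the second equality additionally assumes that point measurements and object detections are conditionally independent given the cameras.

```latex
% Assumed notation (illustrative, not taken from the abstract):
%   C = camera parameters, Q = 3D scene points,
%   O = 3D objects (location, pose, category),
%   q = observed 2D feature points, o = observed 2D object detections.
\begin{equation}
  \{\hat{C}, \hat{Q}, \hat{O}\}
  = \arg\max_{C,\,Q,\,O} \Pr(q, o \mid C, Q, O)
  = \arg\max_{C,\,Q,\,O} \Pr(q \mid C, Q)\,\Pr(o \mid C, O)
\end{equation}
```

Under this reading, the cameras C appear in both factors, which is how object detections can constrain camera pose and how point-based geometry can, in turn, sharpen object detection.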
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bao, S.Y., Savarese, S. (2012). Semantic Structure from Motion: A Novel Framework for Joint Object Recognition and 3D Reconstruction. In: Dellaert, F., Frahm, JM., Pollefeys, M., Leal-Taixé, L., Rosenhahn, B. (eds) Outdoor and Large-Scale Real-World Scene Analysis. Lecture Notes in Computer Science, vol 7474. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34091-8_17
DOI: https://doi.org/10.1007/978-3-642-34091-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34090-1
Online ISBN: 978-3-642-34091-8
eBook Packages: Computer Science, Computer Science (R0)