Co-inference for Multi-modal Scene Analysis

  • Daniel Munoz
  • James Andrew Bagnell
  • Martial Hebert
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7577)


We address the problem of understanding scenes from multiple sources of sensor data (e.g., a camera and a laser scanner) in the case where there is no one-to-one correspondence across modalities (e.g., pixels and 3-D points). This is an important scenario that frequently arises in practice not only when two different types of sensors are used, but also when the sensors are not co-located and have different sampling rates. Previous work has addressed this problem by restricting interpretation to a single representation in one of the domains, with augmented features that attempt to encode the information from the other modalities. Instead, we propose to analyze all modalities simultaneously while propagating information across domains during the inference procedure. In addition to the immediate benefit of generating a complete interpretation in all of the modalities, we demonstrate that this co-inference approach also improves performance over the canonical approach.
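The idea of propagating information across modalities that lack a one-to-one correspondence can be illustrated with a toy sketch. This is not the authors' algorithm: it is a simple fixed-weight mixing scheme, assuming each pixel and each 3-D point already has local class scores and that a many-to-many link table between pixels and points is given. The names `co_inference`, `pix_to_pts`, `rounds`, and `weight` are hypothetical and chosen only for illustration.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


def mean_belief(beliefs, indices):
    """Average the belief vectors at `indices`; None if there are no links."""
    if not indices:
        return None
    n_classes = len(beliefs[0])
    return [sum(beliefs[i][k] for i in indices) / len(indices)
            for k in range(n_classes)]


def blend(local, context, weight):
    """Mix local evidence with cross-modal context; keep local if unlinked."""
    if context is None:
        return list(local)
    return [(1.0 - weight) * l + weight * c for l, c in zip(local, context)]


def co_inference(pixel_scores, point_scores, pix_to_pts, rounds=3, weight=0.5):
    """Toy cross-modal propagation (illustrative assumption, not the paper's method).

    pix_to_pts maps a pixel index to a list of linked 3-D point indices.
    The mapping may be many-to-many, and some elements may have no link.
    """
    # Invert the links so information can flow in both directions.
    pts_to_pix = {j: [] for j in range(len(point_scores))}
    for i, linked in pix_to_pts.items():
        for j in linked:
            pts_to_pix[j].append(i)

    local_pix = [normalize(s) for s in pixel_scores]
    local_pts = [normalize(s) for s in point_scores]
    pix, pts = local_pix, local_pts

    for _ in range(rounds):
        # Synchronous update: both modalities read the previous round's beliefs.
        new_pix = [blend(local_pix[i], mean_belief(pts, pix_to_pts.get(i, [])), weight)
                   for i in range(len(local_pix))]
        new_pts = [blend(local_pts[j], mean_belief(pix, pts_to_pix[j]), weight)
                   for j in range(len(local_pts))]
        pix, pts = new_pix, new_pts
    return pix, pts


if __name__ == "__main__":
    pixel_scores = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
    point_scores = [[0.6, 0.4], [0.1, 0.9]]
    links = {0: [0], 1: [0, 1], 2: []}  # pixel 2 has no 3-D correspondence
    print(co_inference(pixel_scores, point_scores, links))
```

In this sketch both modalities receive a full labeling, and an element with ambiguous local evidence (pixel 1 above) is influenced by the elements it is linked to in the other modality, while an unlinked element keeps its local prediction.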


Keywords (machine-generated, not supplied by the authors): Point Cloud; Local Binary Pattern; Contextual Feature; Point Cloud Data; Global Reference Frame



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Daniel Munoz (1)
  • James Andrew Bagnell (1)
  • Martial Hebert (1)
  1. The Robotics Institute, Carnegie Mellon University, USA
