International Journal of Computer Vision, Volume 76, Issue 1, pp 53–69

3-D Depth Reconstruction from a Single Still Image

Open Access
Article

Abstract

We consider the task of 3-D depth estimation from a single still image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured indoor and outdoor environments that include forests, sidewalks, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the value of the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a hierarchical, multiscale Markov Random Field (MRF) that incorporates multiscale local and global image features, and models both the depths and the relations between depths at different points in the image. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. We further propose a model that incorporates both monocular cues and stereo (triangulation) cues, to obtain significantly more accurate depth estimates than is possible using either monocular or stereo cues alone.
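To illustrate the idea of an MRF that models both the depths and the relations between depths at neighboring points, a minimal sketch follows. The abstract does not give the model's functional form, so everything here is an assumption made for illustration: a single-scale grid of patches, quadratic (Gaussian) unary terms that tie each depth to a per-patch prediction, quadratic pairwise terms that tie neighboring depths together, and hypothetical names (`grid_laplacian`, `map_depths`, `sigma1`, `sigma2`). Under those assumptions the most likely depthmap is the solution of a linear system.

```python
import numpy as np


def grid_laplacian(rows, cols):
    """Graph Laplacian of a 4-connected rows x cols grid of patches."""
    n = rows * cols
    lap = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            # Link each patch to its right and lower neighbor (if any).
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    j = rr * cols + cc
                    lap[i, i] += 1.0
                    lap[j, j] += 1.0
                    lap[i, j] -= 1.0
                    lap[j, i] -= 1.0
    return lap


def map_depths(unary, rows, cols, sigma1=1.0, sigma2=1.0):
    """Minimize sum_i (d_i - unary_i)^2 / (2 sigma1^2)
              + sum_{i~j} (d_i - d_j)^2 / (2 sigma2^2)   (assumed energy).

    Setting the gradient to zero gives
        (I / sigma1^2 + L / sigma2^2) d = unary / sigma1^2,
    where L is the grid Laplacian.
    """
    lap = grid_laplacian(rows, cols)
    a = np.eye(rows * cols) / sigma1**2 + lap / sigma2**2
    return np.linalg.solve(a, unary / sigma1**2)


# Hypothetical usage: per-patch predictions from a linear model on
# made-up features, then smoothed by the pairwise terms.
rng = np.random.default_rng(0)
features = rng.normal(size=(12, 5))   # 3 x 4 patches, 5 features each
theta = rng.normal(size=5)            # stand-in for learned parameters
depths = map_depths(features @ theta, rows=3, cols=4, sigma2=0.5)
```

A smaller `sigma2` pulls neighboring depths closer together, while a very large `sigma2` leaves the per-patch predictions essentially unchanged. The multiscale and hierarchical structure mentioned in the abstract is not modeled in this sketch.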

Keywords

Monocular vision; Learning depth; 3D reconstruction; Dense reconstruction; Markov random field; Depth estimation; Monocular depth; Stereo vision; Hand-held camera; Visual modeling

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Ashutosh Saxena (1)
  • Sung H. Chung (1)
  • Andrew Y. Ng (1)
  1. Computer Science Department, Stanford University, Stanford, USA
