Disambiguating Monocular Depth Estimation with a Single Transient

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12366)


Monocular depth estimation algorithms successfully predict the relative depth order of objects in a scene. However, because of the fundamental scale ambiguity associated with monocular images, these algorithms fail at correctly predicting true metric depth. In this work, we demonstrate how a depth histogram of the scene, which can be readily captured using a single-pixel time-resolved detector, can be fused with the output of existing monocular depth estimation algorithms to resolve the depth ambiguity problem. We validate this novel sensor fusion technique experimentally and in extensive simulation. We show that it significantly improves the performance of several state-of-the-art monocular depth estimation algorithms.
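The core idea described above is to use a scene-wide depth histogram, measured by a single-pixel time-resolved detector, to resolve the scale ambiguity of a monocular depth map. One natural way to realize such a fusion is histogram matching: rank the pixels of the relative depth map and remap them so the result follows the measured metric-depth histogram. The sketch below illustrates this idea under stated assumptions; the function name `match_depth_histogram` and its interface are hypothetical, and this is not the paper's exact fusion algorithm.

```python
import numpy as np

def match_depth_histogram(rel_depth, hist_counts, bin_edges):
    """Remap a relative (scale-ambiguous) depth map so that its
    histogram matches a measured depth histogram, e.g. one captured
    by a single-pixel time-resolved detector.

    rel_depth   : 2D array of relative depths (any monotonic scale)
    hist_counts : 1D array, photon counts per depth bin
    bin_edges   : 1D array of len(hist_counts) + 1 metric bin edges
    """
    flat = rel_depth.ravel()
    order = np.argsort(flat)  # rank pixels from nearest to farthest

    # Normalized cumulative distribution of the measured histogram.
    cdf = np.cumsum(hist_counts).astype(float)
    cdf /= cdf[-1]

    # Quantile of each pixel within the sorted relative depths.
    q = (np.arange(flat.size) + 0.5) / flat.size

    # Invert the target CDF at each quantile to assign metric depths;
    # this preserves the monocular depth ordering exactly.
    metric_sorted = np.interp(q, np.concatenate(([0.0], cdf)), bin_edges)

    out = np.empty_like(flat, dtype=float)
    out[order] = metric_sorted
    return out.reshape(rel_depth.shape)
```

Because pixels are only reordered and remapped through the inverse CDF, the relative depth order predicted by the monocular network is preserved while the absolute scale is anchored to the transient measurement.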


Keywords: Depth estimation · Time-of-flight imaging



D.L. was supported by a Stanford Graduate Fellowship. C.M. was supported by an ORISE Intelligence Community Postdoctoral Fellowship. G.W. was supported by an NSF CAREER Award (IIS 1553333), a Sloan Fellowship, by the KAUST Office of Sponsored Research through the Visual Computing Center CCF grant, and a PECASE by the ARL.

Supplementary material 1 (zip, 33.6 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

Stanford University, Stanford, USA
