DDRNet: Depth Map Denoising and Refinement for Consumer Depth Cameras Using Cascaded CNNs

  • Shi Yan
  • Chenglei Wu
  • Lizhen Wang
  • Feng Xu
  • Liang An
  • Kaiwen Guo
  • Yebin Liu (corresponding author)
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)


Consumer depth sensors are increasingly popular and are entering daily life, marked by their recent integration in the iPhone X. However, they still suffer from heavy noise, which limits their applications. Although much progress has been made in reducing noise and recovering geometric detail, the problem remains far from solved due to its inherent ill-posedness and the real-time requirement. We propose a cascaded Depth Denoising and Refinement Network (DDRNet) to tackle this problem by leveraging multi-frame fused geometry and the accompanying high-quality color image through a joint training strategy. The rendering equation is exploited in our network in an unsupervised manner; in detail, we impose an unsupervised loss based on light transport to extract the high-frequency geometry. Experimental results indicate that our network achieves real-time single-depth enhancement on various categories of scenes. Thanks to the clean decoupling of low- and high-frequency information in the cascaded network, we achieve superior performance over state-of-the-art techniques.
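The unsupervised refinement loss described above couples depth geometry to the observed image via a shading model. The paper's exact formulation is not reproduced here; the following is a minimal NumPy sketch of the general idea, assuming a first-order spherical-harmonics lighting model and normals estimated from the depth map by finite differences (the function names and the fixed SH coefficients are illustrative, not from the paper).

```python
import numpy as np

def normals_from_depth(depth, fx=1.0, fy=1.0):
    """Approximate per-pixel surface normals from a depth map via finite differences."""
    dzdx = np.gradient(depth, axis=1) * fx
    dzdy = np.gradient(depth, axis=0) * fy
    n = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    # Normalize each normal vector to unit length.
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    return n

def sh_shading(normals, coeffs):
    """First-order spherical-harmonics shading: L(n) = c0 + c1*nx + c2*ny + c3*nz."""
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    return coeffs[0] + coeffs[1] * nx + coeffs[2] * ny + coeffs[3] * nz

def shading_loss(depth, intensity, coeffs):
    """Unsupervised photometric loss: mean squared error between the shading
    rendered from the current depth estimate and the observed image intensity.
    Gradients of this loss (in an autodiff framework) push high-frequency
    detail from the color image into the refined depth."""
    rendered = sh_shading(normals_from_depth(depth), coeffs)
    return float(np.mean((rendered - intensity) ** 2))
```

In a training setting this loss would be written in an autodiff framework (e.g. TensorFlow or PyTorch) so that its gradient with respect to the network's depth output can be backpropagated; the NumPy version only shows the forward computation.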


Depth enhancement · Consumer depth camera · Unsupervised learning · Convolutional neural networks · DynamicFusion



This work is supported by the National Key Foundation for Exploring Scientific Instruments of China No. 2013YQ140517, and the National NSF of China grants No. 61522111, No. 61531014, No. 61671268 and No. 61727808.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Shi Yan (1)
  • Chenglei Wu (2)
  • Lizhen Wang (1)
  • Feng Xu (1)
  • Liang An (1)
  • Kaiwen Guo (3)
  • Yebin Liu (1) (corresponding author)

  1. Tsinghua University, Beijing, China
  2. Facebook Reality Labs, Pittsburgh, USA
  3. Google Inc., Mountain View, USA
