DVI: Depth Guided Video Inpainting for Autonomous Driving

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12366)


To get clear street-view and photo-realistic simulation in autonomous driving, we present an automatic video inpainting algorithm that can remove traffic agents from videos and synthesize missing regions with the guidance of depth/point cloud. By building a dense 3D map from stitched point clouds, frames within a video are geometrically correlated via this common 3D map. In order to fill a target inpainting area in a frame, it is straightforward to transform pixels from other frames into the current one with correct occlusion. Furthermore, we are able to fuse multiple videos through 3D point cloud registration, making it possible to inpaint a target video with multiple source videos. The motivation is to solve the long-time occlusion problem where an occluded area has never been visible in the entire video. To our knowledge, we are the first to fuse multiple videos for video inpainting. To verify the effectiveness of our approach, we build a large inpainting dataset in the real urban road environment with synchronized images and Lidar data including many challenge scenes, e.g., long time occlusion. The experimental results show that the proposed approach outperforms the state-of-the-art approaches for all the criteria, especially the RMSE (Root Mean Squared Error) has been reduced by about \(\mathbf{13} \%\).


Video inpainting Autonomous driving Depth Image synthesis Simulation 


  1. 1.
    Ballester, C., Bertalmio, M., Caselles, V., Sapiro, G., Verdera, J.: Filling-in by joint interpolation of vector fields and gray levels. Trans. Img. Proc. 10(8), 1200–1211 (2001). Scholar
  2. 2.
    Bertalmio, M., Vese, L., Sapiro, G., Osher, S.: Simultaneous structure and texture image inpainting. Trans. Img. Proc. 12(8), 882–889 (2003). Scholar
  3. 3.
    Cheng, X., Wang, P., Yang, R.: Depth estimation via affinity learned with convolutional spatial propagation network. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 103–119 (2018)Google Scholar
  4. 4.
    Darabi, S., Shechtman, E., Barnes, C., Goldman, D.B., Sen, P.: Image melding: combining inconsistent images using patch-based synthesis. ACM Trans. Graph. (TOG) (Proc. SIGGRAPH 2012), 31(4), 82:1–82:10 (2012)Google Scholar
  5. 5.
    Ebdelli, M., Le Meur, O., Guillemot, C.: Video inpainting with short-term windows: application to object removal and error concealment. IEEE Trans. Image Process. 24, 3034–3047 (2015). Scholar
  6. 6.
    Efros, A.A., Freeman, W.T.: Image quilting for texture synthesis and transfer. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 341–346. SIGGRAPH 2001, ACM, New York, NY, USA (2001).,
  7. 7.
    Goodfellow, I., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680. Curran Associates, Inc. (2014).
  8. 8.
    Huang, J.B., Kang, S.B., Ahuja, N., Kopf, J.: Temporally coherent completion of dynamic video. ACM Trans. Graph. (TOG) 35(6), 196 (2016)Google Scholar
  9. 9.
    Iizuka, S., Simo-Serra, E., Ishikawa, H.: Globally and locally consistent image completion. ACM Trans. Graph. (Proc. SIGGRAPH 2017) 36(4), 107:1–107:14 (2017)Google Scholar
  10. 10.
    Izadi, S., et al.: Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pp. 559–568. UIST 2011, ACM, New York, NY, USA (2011). DOIurl,
  11. 11.
    Li, W., et al.: Aads: augmented autonomous driving simulation using data-driven algorithms. Sci. Robot. 4(28) (2019).,
  12. 12.
    Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W., Manocha, D.: Trafficpredict: trajectory prediction for heterogeneous traffic-agents. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 6120–6127 (2019).
  13. 13.
    Newson, A., Almansa, A., Fradet, M., Gousseau, Y., Pérez, P.: Towards fast, generic video inpainting. In: Proceedings of the 10th European Conference on Visual Media Production, pp. 7:1–7:8. CVMP 2013, ACM, New York, NY, USA (2013).,
  14. 14.
    Newson, A., Almansa, A., Fradet, M., Gousseau, Y., Pérez, P.: Video inpainting of complex scenes. SIAM J. Imaging Sci. 7, 1993–2019 (2014). Scholar
  15. 15.
    Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.: Context encoders: Feature learning by inpainting. In: Computer Vision and Pattern Recognition (CVPR) (2016)Google Scholar
  16. 16.
    Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732 (2016)Google Scholar
  17. 17.
    Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. In: ACM SIGGRAPH 2003 Papers, pp. 313–318. SIGGRAPH 2003, ACM, New York, NY, USA (2003).,
  18. 18.
    Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413 (2017)
  19. 19.
    Ren, J.S., Xu, L., Yan, Q., Sun, W.: Shepard convolutional neural networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 901–909. Curran Associates, Inc. (2015).
  20. 20.
    Shih, T.K., Tang, N.C., Hwang, J.N.: Exemplar-based video inpainting without ghost shadow artifacts by maintaining temporal continuity. IEEE Trans. Cir. and Sys. for Video Technol. 19(3), 347–360 (2009).,
  21. 21.
    Simakov, D., Caspi, Y., Shechtman, E., Irani, M.: Summarizing visual data using bidirectional similarity. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)Google Scholar
  22. 22.
    Steinbrücker, F., Sturm, J., Cremers, D.: Real-time visual odometry from dense RGB-D images. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 719–722, November 2011.
  23. 23.
    Wang, C., Huang, H., Han, X., Wang, J.: Video inpainting by jointly learning temporal structure and spatial details. In: Proceedings of the 33th AAAI Conference on Artificial Intelligence (2019)Google Scholar
  24. 24.
    Xu, R., Li, X., Zhou, B., Loy, C.C.: Deep flow-guided video inpainting. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019Google Scholar
  25. 25.
    Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892 (2018)
  26. 26.
    Zhang, J., Singh, S.: Loam: lidar odometry and mapping in real-time. In: Robotics: Science and Systems Conference, July 2014Google Scholar
  27. 27.
    Zhang, R., et al.: Autoremover: automatic object removal for autonomous driving videos. arXiv preprint arXiv:1911.12588 (2019)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.Baidu Research, Baidu Inc.BeijingChina

Personalised recommendations