World-Consistent Video-to-Video Synthesis

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12353)

Abstract

Video-to-video synthesis (vid2vid) aims to convert high-level semantic inputs into photorealistic videos. While existing vid2vid methods can achieve short-term temporal consistency, they fail to ensure long-term consistency. This is because they lack knowledge of the 3D world being rendered and generate each frame based only on the past few frames. To address this limitation, we introduce a novel vid2vid framework that efficiently and effectively utilizes all past generated frames during rendering. This is achieved by condensing the 3D world rendered so far into a physically grounded estimate of the current frame, which we call the guidance image. We further propose a novel neural network architecture to take advantage of the information stored in the guidance images. Extensive experimental results on several challenging datasets verify the effectiveness of our approach in achieving world consistency: the output video is consistent within the entire rendered 3D world.
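The abstract gives no implementation details, so the following is only a minimal NumPy sketch of the guidance-image idea it describes: previously generated pixels are kept as a colored 3D point cloud (using per-frame depth and camera poses, which are assumptions here, not stated in the paper) and re-projected into the current camera with a z-buffer, yielding a partial, physically grounded estimate of the current frame. All names (GuidanceBuffer, add_frame, render_guidance, the pinhole intrinsics K, and the 4x4 pose matrices) are hypothetical and for illustration only.

```python
# Minimal sketch: accumulate generated pixels as a world-space point cloud
# and re-project them into the current view to form a "guidance image".
# This is an assumption-laden illustration, not the authors' implementation.
import numpy as np


class GuidanceBuffer:
    """Stores previously generated pixels as a colored 3D point cloud."""

    def __init__(self):
        self.points = np.empty((0, 3))   # world-space XYZ
        self.colors = np.empty((0, 3))   # RGB in [0, 1]

    def add_frame(self, rgb, depth, K, cam_to_world):
        """Back-project a generated frame (with estimated depth) into world space."""
        h, w, _ = rgb.shape
        v, u = np.mgrid[0:h, 0:w]                      # pixel coordinates
        z = depth.ravel()
        x = (u.ravel() - K[0, 2]) * z / K[0, 0]        # pinhole unprojection
        y = (v.ravel() - K[1, 2]) * z / K[1, 1]
        pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
        pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
        self.points = np.vstack([self.points, pts_world])
        self.colors = np.vstack([self.colors, rgb.reshape(-1, 3)])

    def render_guidance(self, K, world_to_cam, h, w):
        """Project all stored points into the current view; nearest point wins."""
        guidance = np.zeros((h, w, 3))
        zbuf = np.full((h, w), np.inf)
        pts = np.hstack([self.points, np.ones((len(self.points), 1))])
        cam = (world_to_cam @ pts.T).T
        z = cam[:, 2]
        valid = z > 1e-6                                # keep points in front of camera
        u = np.round(K[0, 0] * cam[valid, 0] / z[valid] + K[0, 2]).astype(int)
        v = np.round(K[1, 1] * cam[valid, 1] / z[valid] + K[1, 2]).astype(int)
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # Naive per-point z-buffer splatting; slow but keeps the sketch simple.
        for ui, vi, zi, ci in zip(u[inside], v[inside], z[valid][inside],
                                  self.colors[valid][inside]):
            if zi < zbuf[vi, ui]:
                zbuf[vi, ui] = zi
                guidance[vi, ui] = ci
        return guidance  # partial estimate; holes remain where the world is unseen
```

In a full system along these lines, the generator would be conditioned on the current semantic label map together with this guidance image (and a validity mask marking its holes), so that regions already rendered in the 3D world are reproduced consistently while unseen regions are synthesized from scratch.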

Keywords

Neural rendering · Video synthesis · GAN

Acknowledgements

We would like to thank Jan Kautz, Guilin Liu, Andrew Tao, and Bryan Catanzaro for their feedback, and Sabu Nadarajan, Nithya Natesan, and Sivakumar Arayandi Thottakara for helping us with the compute, without which this work would not have been possible.

Supplementary material

Supplementary material 1: 504445_1_En_22_MOESM1_ESM.pdf (PDF 245 KB)
Supplementary material 2: 504445_1_En_22_MOESM2_ESM.pdf (PDF 3,355 KB)
Supplementary material 3: 504445_1_En_22_MOESM3_ESM.pdf (PDF 3,219 KB)
Supplementary material 4: 504445_1_En_22_MOESM4_ESM.pdf (PDF 14,918 KB)
Supplementary material 5: 504445_1_En_22_MOESM5_ESM.pdf (PDF 732 KB)
Supplementary material 6: 504445_1_En_22_MOESM6_ESM.pdf (PDF 3,777 KB)
Supplementary material 7: 504445_1_En_22_MOESM7_ESM.pdf (PDF 2,643 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

NVIDIA, Santa Clara, USA