Abstract
We present a method for learning to generate unbounded flythrough videos of natural scenes starting from a single view. This capability is learned from a collection of single photographs, without requiring camera poses or even multiple views of each scene. To achieve this, we propose a novel self-supervised view generation training paradigm where we sample and render virtual camera trajectories, including cyclic camera paths, allowing our model to learn stable view generation from a collection of single views. At test time, despite never having seen a video, our approach can take a single image and generate long camera trajectories comprising hundreds of new views with realistic and diverse content. We compare our approach with recent state-of-the-art supervised view generation methods that require posed multi-view videos and demonstrate superior performance and synthesis quality. Our project webpage, including video results, is at https://infinite-nature-zero.github.io.
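To make the cyclic-path idea concrete, the sketch below illustrates one way such self-supervision could be set up: sample a virtual camera trajectory that ends exactly where it starts, then generate views autoregressively along it. Because the final pose coincides with the initial one, the starting photograph itself can supervise the last generated frame, so no multi-view data is needed. This is a minimal, hypothetical illustration, not the paper's actual implementation; the function names, the circular path shape, and the `render_refine` callback are all assumptions for exposition.

```python
import numpy as np

def sample_cyclic_camera_path(n_steps=8, radius=1.0, seed=0):
    """Sample a virtual camera path that returns to its starting pose.

    Hypothetical sketch: a jittered circular loop of camera centers.
    The endpoints are forced to match exactly, so the first input view
    can serve as the reconstruction target for the final frame.
    """
    rng = np.random.default_rng(seed)
    # Angles sweep a full circle and close back on the start.
    angles = np.linspace(0.0, 2.0 * np.pi, n_steps + 1)
    jitter = rng.normal(scale=0.01, size=(n_steps + 1, 2))
    jitter[0] = jitter[-1] = 0.0  # endpoints must coincide exactly
    xy = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1) + jitter
    z = np.zeros((n_steps + 1, 1))
    return np.concatenate([xy, z], axis=1)  # (n_steps + 1, 3) camera centers

def generate_flythrough(start_view, path, render_refine):
    """Autoregressively generate views along the path (schematic).

    `render_refine` stands in for a warp-and-refine step that maps the
    previous frame plus a pose change to the next frame; it is assumed
    here, not defined by the paper's abstract.
    """
    views = [start_view]
    for pose_prev, pose_next in zip(path[:-1], path[1:]):
        views.append(render_refine(views[-1], pose_prev, pose_next))
    return views
```

On a cyclic path, a training loss could compare `views[-1]` against `start_view`, turning a collection of single photographs into a self-supervised signal for long-trajectory generation.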
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Li, Z., Wang, Q., Snavely, N., Kanazawa, A. (2022). InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19768-0
Online ISBN: 978-3-031-19769-7