We present a method for learning to generate unbounded flythrough videos of natural scenes starting from a single view. This capability is learned from a collection of single photographs, without requiring camera poses or even multiple views of each scene. To achieve this, we propose a novel self-supervised view generation training paradigm in which we sample and render virtual camera trajectories, including cyclic camera paths, allowing our model to learn stable view generation from a collection of single views. At test time, despite never having seen a video, our approach can take a single image and generate long camera trajectories composed of hundreds of new views with realistic and diverse content. We compare our approach with recent state-of-the-art supervised view generation methods that require posed multi-view videos, and demonstrate superior performance and synthesis quality. Our project webpage, including video results, is at https://infinite-nature-zero.github.io.
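To make the idea of a cyclic virtual camera path concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified pose of the form (x, z, yaw) on a ground plane; the function names `sample_forward_path` and `make_cyclic` and the parameters `step_size` and `max_turn` are hypothetical. A path that returns to its starting pose is useful because the view rendered at the end can, in principle, be compared against the original input photograph, which is a training signal available from a single image.

```python
import math
import random


def sample_forward_path(num_steps, step_size=0.1, max_turn=0.05, seed=None):
    """Sample a smooth random camera path starting at the origin.

    Each pose is a hypothetical (x, z, yaw) tuple: a position on the
    ground plane plus a heading angle in radians.
    """
    rng = random.Random(seed)
    x, z, yaw = 0.0, 0.0, 0.0
    path = [(x, z, yaw)]
    for _ in range(num_steps):
        # Perturb the heading slightly, then step forward along it.
        yaw += rng.uniform(-max_turn, max_turn)
        x += step_size * math.sin(yaw)
        z += step_size * math.cos(yaw)
        path.append((x, z, yaw))
    return path


def make_cyclic(path):
    """Append the reversed path so that the final pose equals the first."""
    # path[-2::-1] walks back from the second-to-last pose to the start,
    # so the turning point is not duplicated.
    return path + path[-2::-1]


forward = sample_forward_path(num_steps=5, seed=0)
cyclic = make_cyclic(forward)
print(len(forward), len(cyclic))
print(cyclic[0] == cyclic[-1])
```

In this sketch the return leg simply retraces the forward leg; other ways of closing the loop are possible, and the abstract does not specify which is used.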
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Z., Wang, Q., Snavely, N., Kanazawa, A. (2022). InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_30
DOI: https://doi.org/10.1007/978-3-031-19769-7_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19768-0
Online ISBN: 978-3-031-19769-7
eBook Packages: Computer Science (R0)