Video Extrapolation in Space and Time

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13676)

Abstract

Novel view synthesis (NVS) and video prediction (VP) are typically considered disjoint tasks in computer vision. However, they can both be seen as ways to observe the spatial-temporal world: NVS aims to synthesize a scene from a new point of view, while VP aims to see a scene from a new point of time. These two tasks provide complementary signals for obtaining a scene representation, as viewpoint changes from spatial observations inform depth, and temporal observations inform the motion of cameras and individual objects. Inspired by these observations, we propose to study the problem of Video Extrapolation in Space and Time (VEST). We present a model that leverages the self-supervision and complementary cues from both tasks, whereas existing methods can only solve one of them. Experiments show that our method achieves performance better than or comparable to several state-of-the-art NVS and VP methods on indoor and outdoor real-world datasets. (Project page: https://cs.stanford.edu/~yzzhang/projects/vest/.)

Notes

  1. The LPIPS scores we compute for MINE [11] are slightly worse than those reported in the original paper due to a bug in the evaluation script of their public codebase, where tensors in the range [0, 1] are fed into an LPIPS package that expects inputs in the range \([-1, 1]\).
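
For concreteness, here is a minimal sketch of the range mismatch and its fix, using the widely available lpips PyTorch package (an illustration only, not the MINE evaluation script):

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')
img0 = torch.rand(1, 3, 128, 384)    # images in [0, 1]
img1 = torch.rand(1, 3, 128, 384)

# Buggy: the metric expects inputs in [-1, 1], so feeding [0, 1] tensors
# directly skews the reported scores.
score_buggy = loss_fn(img0, img1)

# Fixed: rescale to [-1, 1] first (the package also offers normalize=True,
# which performs the same rescaling internally).
score_fixed = loss_fn(img0 * 2 - 1, img1 * 2 - 1)
```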

References

  1. Bei, X., Yang, Y., Soatto, S.: Learning semantic-aware dynamics for video prediction. In: CVPR (2021)

  2. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)

  3. Du, Y., Zhang, Y., Yu, H.X., Tenenbaum, J.B., Wu, J.: Neural radiance flow for 4D view synthesis and video processing. In: ICCV (2021)

  4. Flynn, J., et al.: DeepView: view synthesis with learned gradient descent. In: CVPR (2019)

  5. Gao, H., Xu, H., Cai, Q.Z., Wang, R., Yu, F., Darrell, T.: Disentangling propagation and generation for video prediction. In: ICCV (2019)

  6. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)

  7. Girdhar, R., Ramanan, D.: CATER: a diagnostic dataset for compositional actions and temporal reasoning. In: ICLR (2020)

  8. Hu, R., Ravi, N., Berg, A.C., Pathak, D.: Worldsheet: wrapping the world in a 3D sheet for view synthesis from a single image. In: ICCV (2021)

  9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)

  10. Lai, Z., Liu, S., Efros, A.A., Wang, X.: Video autoencoder: self-supervised disentanglement of static 3D structure and motion. In: ICCV (2021)

  11. Li, J., Feng, Z., She, Q., Ding, H., Wang, C., Lee, G.H.: MINE: towards continuous depth MPI with NeRF for novel view synthesis. In: ICCV (2021)

  12. Li, Z., Niklaus, S., Snavely, N., Wang, O.: Neural scene flow fields for space-time view synthesis of dynamic scenes. In: CVPR (2021)

  13. Lin, K.E., Xiao, L., Liu, F., Yang, G., Ramamoorthi, R.: Deep 3D mask volume for view synthesis of dynamic scenes. In: ICCV (2021)

  14. Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N., Kanazawa, A.: Infinite nature: perpetual view generation of natural scenes from a single image. In: ICCV (2021)

  15. Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: ICCV (2017)

  16. Lotter, W., Kreiman, G., Cox, D.: Deep predictive coding networks for video prediction and unsupervised learning. In: ICLR (2017)

  17. Lu, E., et al.: Layered neural rendering for retiming people in video. In: SIGGRAPH Asia (2020)

  18. Lu, E., Cole, F., Dekel, T., Zisserman, A., Freeman, W.T., Rubinstein, M.: Omnimatte: associating objects and their effects in video. In: CVPR (2021)

  19. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24

  20. Park, K., et al.: Nerfies: deformable neural radiance fields. In: ICCV (2021)

  21. Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-NeRF: neural radiance fields for dynamic scenes. In: CVPR (2021)

  22. Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., Chopra, S.: Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604 (2014)

  23. Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)

  24. Schönberger, J.L., Zheng, E., Frahm, J.-M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 501–518. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_31

  25. Shade, J., Gortler, S., He, L.W., Szeliski, R.: Layered depth images. In: SIGGRAPH (1998)

  26. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: NeurIPS (2015)

  27. Shih, M.L., Su, S.Y., Kopf, J., Huang, J.B.: 3D photography using context-aware layered depth inpainting. In: CVPR (2020)

  28. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)

  29. Srinivasan, P.P., Tucker, R., Barron, J.T., Ramamoorthi, R., Ng, R., Snavely, N.: Pushing the boundaries of view extrapolation with multiplane images. In: CVPR (2019)

  30. Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015)

  31. Tretschk, E., Tewari, A., Golyanik, V., Zollhöfer, M., Lassner, C., Theobalt, C.: Non-rigid neural radiance fields: reconstruction and novel view synthesis of a dynamic scene from monocular video. In: ICCV (2021)

  32. Tucker, R., Snavely, N.: Single-view view synthesis with multiplane images. In: CVPR, pp. 551–560 (2020)

  33. Tulsiani, S., Tucker, R., Snavely, N.: Layer-structured 3D scene inference via view synthesis. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 311–327. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_19

  34. Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for natural video sequence prediction. In: ICLR (2017)

  35. Wang, J.Y.A., Adelson, E.H.: Layered representation for motion analysis. In: CVPR (1993)

  36. Wang, Y., Wu, H., Zhang, J., Gao, Z., Wang, J., Yu, P., Long, M.: PredRNN: a recurrent neural network for spatiotemporal predictive learning. IEEE TPAMI (2022)

  37. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP 13(4), 600–612 (2004)

  38. Wiles, O., Gkioxari, G., Szeliski, R., Johnson, J.: SynSin: end-to-end view synthesis from a single image. In: CVPR (2020)

  39. Wu, Y., Gao, R., Park, J., Chen, Q.: Future video synthesis with object motion prediction. In: CVPR (2020)

  40. Xian, W., Huang, J.B., Kopf, J., Kim, C.: Space-time neural irradiance fields for free-viewpoint video. In: CVPR (2021)

  41. Yoon, J.S., Kim, K., Gallo, O., Park, H.S., Kautz, J.: Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In: CVPR, pp. 5336–5345 (2020)

  42. Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: neural radiance fields from one or few images. In: CVPR (2021)

  43. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  44. Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: learning view synthesis using multiplane images. In: SIGGRAPH (2018)

  45. Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A.: View synthesis by appearance flow. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 286–301. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_18

Acknowledgement

We thank Angjoo Kanazawa, Hong-Xing (Koven) Yu, Huazhe (Harry) Xu, Noah Snavely, Ruohan Zhang, Ruohan Gao, and Shangzhe (Elliott) Wu for detailed feedback on the paper, and Kaidi Cao for collecting the cloud dataset. This work is in part supported by the Stanford Institute for Human-Centered AI (HAI), the Stanford Center for Integrated Facility Engineering (CIFE), the Samsung Global Research Outreach (GRO) Program, and Amazon, Autodesk, Meta, Google, Bosch, and Adobe.

Author information

Corresponding author

Correspondence to Yunzhi Zhang.

Appendices

A Architecture Details

The architecture used for the MPI encoder is specified in Table 4.

Table 4. MP2 denotes max pooling with stride 2, Up2 denotes nearest-neighbor upsampling with scale factor 2, and + denotes concatenation. Reshape transforms a tensor with \(C \times D\) channels into C channels by merging D into the batch dimension; ReshapeBack is the reverse operation. All layers up to up1b use ReLU activations, while conv1, conv2, and conv3 use LeakyReLU with a negative slope of 0.2; there is no activation after the final layer. All layers use Instance Norm for activation normalization and Spectral Norm for weight normalization.
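
As an illustration of these conventions, below is a minimal PyTorch sketch of the building blocks named in the caption; the channel counts and overall layer layout come from Table 4 and are not reproduced here, and the channel grouping in Reshape is an assumption rather than the released implementation:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, negative_slope=None):
    # 3x3 convolution with Spectral Norm on weights and Instance Norm on
    # activations, followed by ReLU (default) or LeakyReLU(0.2) for conv1-3.
    act = nn.ReLU() if negative_slope is None else nn.LeakyReLU(negative_slope)
    return nn.Sequential(
        nn.utils.spectral_norm(nn.Conv2d(in_ch, out_ch, 3, padding=1)),
        nn.InstanceNorm2d(out_ch),
        act,
    )

mp2 = nn.MaxPool2d(kernel_size=2, stride=2)          # MP2
up2 = nn.Upsample(scale_factor=2, mode='nearest')    # Up2
# "+" in the table is channel-wise concatenation: torch.cat([a, b], dim=1)

def reshape(x, D):
    # Reshape: (B, C*D, H, W) -> (B*D, C, H, W), merging D into the batch
    # dimension; assumes the C*D channels are grouped as D blocks of C.
    B, CD, H, W = x.shape
    return x.reshape(B * D, CD // D, H, W)

def reshape_back(x, D):
    # ReshapeBack: the inverse operation, (B*D, C, H, W) -> (B, C*D, H, W).
    BD, C, H, W = x.shape
    return x.reshape(BD // D, D * C, H, W)
```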

B Implementation Details

For better gradient flow, similar to Tucker et al. [32], we add a harmonious bias of 1/i to the alpha channel prediction of the i-th plane, so that the compositing weights \(w_i\) from Eq. (11) are uniformly 1/D at initialization. We also add an identity bias to \(f^\theta \) such that each MPI plane is associated with zero motion at initialization.
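
As a sanity check, the sketch below (not the released code) verifies that an alpha value of 1/i for the i-th plane at initialization yields uniform compositing weights of 1/D, assuming Eq. (11) is the standard back-to-front over-compositing rule \(w_i = \alpha_i \prod_{j>i}(1-\alpha_j)\) with plane 1 taken as the farthest plane:

```python
import numpy as np

D = 16
alpha = 1.0 / np.arange(1, D + 1)     # alpha_i = 1/i; plane 1 = farthest

# Transmittance through all planes in front of plane i (indices j > i).
T = np.ones(D)
for p in range(D - 2, -1, -1):
    T[p] = T[p + 1] * (1.0 - alpha[p + 1])

w = alpha * T                          # w_i = alpha_i * prod_{j>i} (1 - alpha_j)
assert np.allclose(w, 1.0 / D)         # uniformly 1/D at initialization
```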

In all experiments, we set the number of MPI planes to \(D = 16\). The plane depths are spaced linearly in inverse depth (disparity), with \(d_1 = 1000\) and \(d_D = 1\).
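
Concretely, such a plane spacing can be generated as in the short sketch below (hypothetical variable names):

```python
import numpy as np

D, d_far, d_near = 16, 1000.0, 1.0
disparities = np.linspace(1.0 / d_far, 1.0 / d_near, D)  # linear in inverse depth
depths = 1.0 / disparities                               # d_1 = 1000, ..., d_D = 1
```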

Table 5. Ablation on the number of MPI planes D. Increasing the plane count improves performance but also increases training time. We adopt \(D=16\) in the main paper, since further increasing D yields diminishing returns.

C Training Details

C.1 KITTI

Since videos from KITTI are captured by stereo cameras with fixed relative poses, the depth scale is consistent across scenes, and we therefore set it to a constant \(\sigma = 1\). We use \(\lambda _{1}^\text {space} = 1000\), \(\lambda _\text {perc}^\text {space} = 100\), \(\lambda _{1}^\text {time} = 1000\), and \( \lambda _\text {perc}^\text {time} = 10\). We use the Adam optimizer [9] with an initial learning rate of 0.0002, which we decay exponentially by a factor of 0.8 every 5 epochs. We train our model for 200K iterations on two NVIDIA TITAN RTX GPUs, which takes about two days. During training, we apply horizontal flips with 50% probability and color jittering as data augmentation.
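
For reference, here is a hedged PyTorch sketch of the optimizer, schedule, and augmentation described above; the model is a placeholder and the color-jitter strengths are assumptions, not the released configuration:

```python
import torch
import torchvision.transforms as T

model = torch.nn.Conv2d(3, 3, 3)  # placeholder for the actual VEST model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# Decay the learning rate by a factor of 0.8 every 5 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)

# Loss weights from C.1: lambda_1^space, lambda_perc^space, lambda_1^time, lambda_perc^time.
loss_weights = {"l1_space": 1000.0, "perc_space": 100.0,
                "l1_time": 1000.0, "perc_time": 10.0}

# Data augmentation: horizontal flip with 50% probability plus color jittering
# (jitter strengths below are illustrative).
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
])
```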

C.2 RealEstate10K

We train our model for 200K iterations on one NVIDIA GeForce RTX 3090 GPU, which takes about one day. We use \(\lambda _1^\text {space} = 10\), \(\lambda _\text {perc}^\text {space} = 10\), \(\lambda _1^\text {time} = 10\), and \(\lambda _\text {perc}^\text {time} = 0\). We use the Adam optimizer [9] with a constant learning rate of 0.0002.

C.3 Ablations on the Number of MPI Planes

To study the effect of the number of MPI planes, we perform an ablation study on the KITTI [6] dataset at a resolution of \(128\times 384\). As shown in Table 5, a small number of MPI planes (\(D=4\) or 8) results in degraded performance. Further increasing the number of planes from 16 to 32 yields only a marginal performance gain, at the cost of \(2.1\times \) slower training. Therefore, we use \(D=16\) for all other experiments.

Table 6. Results of next-frame prediction on CATER [7]. Our model achieves better performance compared to PredRNN [36].

C.4 Modeling Dynamic Scenes

To test whether our method can model more dynamic scenes, we evaluate it on CATER [7], a dataset of scenes with 5–10 individually moving objects. We show a quantitative comparison with the video prediction baseline PredRNN [36]. As shown in Table 6, our model achieves better performance across all three metrics.

Qualitatively, our method makes temporal predictions consistent with the ground-truth object motions on this dataset. In Fig. 12, the model correctly recovers the purple object and the gold object occluded by the blue cone. Our model effectively handles object occlusions by warping from neighboring pixels with similar RGB values.

Fig. 12. Model prediction on an example scene with occlusion. (a) and (b) are the two historical frames given as model inputs; (c) and (d) are the predicted and ground-truth next frames, respectively. The top-left corners of the subfigures show zoomed-in views of the occluded regions.

C.5 Discussions

While we focus on demonstrating the possibility of simultaneous extrapolation in both space and time, specific modules can be further optimized for each task. For example, it is possible to improve the dynamic scene representation to better handle video prediction with long horizons or highly complex motion, or to synthesize novel views with a large viewpoint change.

At the same time, while our method is designed for natural scenes and has many potential positive applications, such as interactive scene exploration for family entertainment, like all visual content generation methods it could be exploited by malicious users, with potential negative impacts. We expect such impacts to be minimal, as our method is not designed to work with human videos. In our code release, we will explicitly specify allowable uses of our system with appropriate licenses, and we will use techniques such as watermarking to label visual content generated by our system.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, Y., Wu, J. (2022). Video Extrapolation in Space and Time. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13676. Springer, Cham. https://doi.org/10.1007/978-3-031-19787-1_18

  • DOI: https://doi.org/10.1007/978-3-031-19787-1_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19786-4

  • Online ISBN: 978-3-031-19787-1

  • eBook Packages: Computer Science, Computer Science (R0)
