Abstract
Novel view synthesis (NVS) and video prediction (VP) are typically considered disjoint tasks in computer vision. However, they can both be seen as ways to observe the spatial-temporal world: NVS aims to synthesize a scene from a new point of view, while VP aims to see a scene from a new point of time. These two tasks provide complementary signals to obtain a scene representation, as viewpoint changes from spatial observations inform depth, and temporal observations inform the motion of cameras and individual objects. Inspired by these observations, we propose to study the problem of Video Extrapolation in Space and Time (VEST). We propose a model that leverages the self-supervision and the complementary cues from both tasks, while existing methods can only solve one of them. Experiments show that our method achieves performance better than or comparable to several state-of-the-art NVS and VP methods on indoor and outdoor real-world datasets. (Project page: https://cs.stanford.edu/~yzzhang/projects/vest/.)
Notes
1. The LPIPS scores we compute for MINE [11] are slightly worse than those reported in the original paper due to a bug in the evaluation script of their public codebase: tensors in the range \([0, 1]\) are fed into an LPIPS package that expects inputs in the range \([-1, 1]\).
References
Bei, X., Yang, Y., Soatto, S.: Learning semantic-aware dynamics for video prediction. In: CVPR (2021)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
Du, Y., Zhang, Y., Yu, H.X., Tenenbaum, J.B., Wu, J.: Neural radiance flow for 4D view synthesis and video processing. In: ICCV (2021)
Flynn, J., et al.: DeepView: view synthesis with learned gradient descent. In: CVPR (2019)
Gao, H., Xu, H., Cai, Q.Z., Wang, R., Yu, F., Darrell, T.: Disentangling propagation and generation for video prediction. In: ICCV (2019)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)
Girdhar, R., Ramanan, D.: CATER: a diagnostic dataset for compositional actions and temporal reasoning. In: ICLR (2020)
Hu, R., Ravi, N., Berg, A.C., Pathak, D.: Worldsheet: wrapping the world in a 3D sheet for view synthesis from a single image. In: ICCV (2021)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Lai, Z., Liu, S., Efros, A.A., Wang, X.: Video autoencoder: self-supervised disentanglement of static 3D structure and motion. In: ICCV (2021)
Li, J., Feng, Z., She, Q., Ding, H., Wang, C., Lee, G.H.: MINE: towards continuous depth MPI with NeRF for novel view synthesis. In: ICCV (2021)
Li, Z., Niklaus, S., Snavely, N., Wang, O.: Neural scene flow fields for space-time view synthesis of dynamic scenes. In: CVPR (2021)
Lin, K.E., Xiao, L., Liu, F., Yang, G., Ramamoorthi, R.: Deep 3D mask volume for view synthesis of dynamic scenes. In: ICCV (2021)
Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N., Kanazawa, A.: Infinite nature: perpetual view generation of natural scenes from a single image. In: ICCV (2021)
Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: ICCV (2017)
Lotter, W., Kreiman, G., Cox, D.: Deep predictive coding networks for video prediction and unsupervised learning. In: ICLR (2017)
Lu, E., et al.: Layered neural rendering for retiming people in video. In: SIGGRAPH Asia (2020)
Lu, E., Cole, F., Dekel, T., Zisserman, A., Freeman, W.T., Rubinstein, M.: Omnimatte: associating objects and their effects in video. In: CVPR (2021)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24
Park, K., et al.: Nerfies: deformable neural radiance fields. In: ICCV (2021)
Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-NeRF: neural radiance fields for dynamic scenes. In: CVPR (2021)
Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., Chopra, S.: Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604 (2014)
Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)
Schönberger, J.L., Zheng, E., Frahm, J.-M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 501–518. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_31
Shade, J., Gortler, S., He, L.w., Szeliski, R.: Layered depth images. In: SIGGRAPH (1998)
Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: NeurIPS (2015)
Shih, M.L., Su, S.Y., Kopf, J., Huang, J.B.: 3D photography using context-aware layered depth inpainting. In: CVPR (2020)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
Srinivasan, P.P., Tucker, R., Barron, J.T., Ramamoorthi, R., Ng, R., Snavely, N.: Pushing the boundaries of view extrapolation with multiplane images. In: CVPR (2019)
Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015)
Tretschk, E., Tewari, A., Golyanik, V., Zollhöfer, M., Lassner, C., Theobalt, C.: Non-rigid neural radiance fields: reconstruction and novel view synthesis of a dynamic scene from monocular video. In: ICCV (2021)
Tucker, R., Snavely, N.: Single-view view synthesis with multiplane images. In: CVPR, pp. 551–560 (2020)
Tulsiani, S., Tucker, R., Snavely, N.: Layer-structured 3D scene inference via view synthesis. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 311–327. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_19
Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for natural video sequence prediction. In: ICLR (2017)
Wang, J.Y.A., Adelson, E.H.: Layered representation for motion analysis. In: CVPR (1993)
Wang, Y., Wu, H., Zhang, J., Gao, Z., Wang, J., Yu, P., Long, M.: PredRNN: a recurrent neural network for spatiotemporal predictive learning. IEEE TPAMI (2022)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP 13(4), 600–612 (2004)
Wiles, O., Gkioxari, G., Szeliski, R., Johnson, J.: SynSin: end-to-end view synthesis from a single image. In: CVPR (2020)
Wu, Y., Gao, R., Park, J., Chen, Q.: Future video synthesis with object motion prediction. In: CVPR (2020)
Xian, W., Huang, J.B., Kopf, J., Kim, C.: Space-time neural irradiance fields for free-viewpoint video. In: CVPR (2021)
Yoon, J.S., Kim, K., Gallo, O., Park, H.S., Kautz, J.: Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In: CVPR, pp. 5336–5345 (2020)
Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: neural radiance fields from one or few images. In: CVPR (2021)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep networks as a perceptual metric. In: CVPR (2018)
Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: learning view synthesis using multiplane images. In: SIGGRAPH (2018)
Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A.: View synthesis by appearance flow. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 286–301. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_18
Acknowledgement
We thank Angjoo Kanazawa, Hong-Xing (Koven) Yu, Huazhe (Harry) Xu, Noah Snavely, Ruohan Zhang, Ruohan Gao, and Shangzhe (Elliott) Wu for detailed feedback on the paper, and Kaidi Cao for collecting the cloud dataset. This work is in part supported by the Stanford Institute for Human-Centered AI (HAI), the Stanford Center for Integrated Facility Engineering (CIFE), the Samsung Global Research Outreach (GRO) Program, and Amazon, Autodesk, Meta, Google, Bosch, and Adobe.
Appendices
A Architecture Details
The architecture used for the MPI encoder is specified in Table 4.
B Implementation Details
To improve gradient flow, similar to Tucker et al. [32], we add a harmonic bias 1/i to the alpha channel prediction, so that \(w_i\) from Eq. (11) is uniformly 1/D at initialization. We also add an identity bias to \(f^\theta \) so that each MPI plane is associated with zero motion at initialization.
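As a sanity check on this initialization, the short sketch below assumes that Eq. (11) is the standard MPI "over" compositing with plane \(i = 1\) farthest from the camera (our assumption, since the equation is not reproduced here), and verifies that alpha values of \(1/i\) give uniform compositing weights \(w_i = 1/D\):

```python
import torch

D = 16  # number of MPI planes

# Assumed indexing: plane i = 1 is farthest from the camera, i = D nearest.
i = torch.arange(1, D + 1, dtype=torch.float32)
alpha = 1.0 / i  # harmonic bias; the farthest plane starts fully opaque

# "Over" compositing weights: w_i = alpha_i * prod_{j > i} (1 - alpha_j),
# i.e. each plane is attenuated by the transparency of the planes in front of it.
one_minus = 1.0 - alpha
incl = torch.cumprod(one_minus.flip(0), dim=0).flip(0)   # prod_{j >= i} (1 - alpha_j)
trans = torch.cat([incl[1:], torch.ones(1)])             # prod_{j > i}  (1 - alpha_j)
w = alpha * trans

print(w)  # all 16 entries equal 1 / D = 0.0625
```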
In all experiments, we set the number of MPI planes to \(D = 16\). The depth values of the MPI planes are spaced linearly in inverse depth, with \(d_1 = 1000\) and \(d_D = 1\).
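A minimal sketch of this plane spacing, sampling disparities (inverse depths) uniformly and inverting them (variable names are ours):

```python
import numpy as np

D, d_far, d_near = 16, 1000.0, 1.0
# Uniform samples in disparity (inverse depth), then invert to get plane depths.
disparity = np.linspace(1.0 / d_far, 1.0 / d_near, D)
depths = 1.0 / disparity  # depths[0] = 1000 (farthest plane), depths[-1] = 1 (nearest)
```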
C Training Details
C.1 KITTI
Since videos from KITTI are taken by stereo cameras with fixed relative poses, the depth scale is consistent across scenes, and we therefore set it to a constant \(\sigma = 1\). We use \(\lambda _{1}^\text {space} = 1000\), \(\lambda _\text {perc}^\text {space} = 100\), \(\lambda _{1}^\text {time} = 1000\), and \( \lambda _\text {perc}^\text {time} = 10\). We use the Adam optimizer [9] with an initial learning rate of 0.0002, decayed exponentially by a factor of 0.8 every 5 epochs. We train our model for 200K iterations on two NVIDIA TITAN RTX GPUs for about two days. During training, we apply horizontal flips with 50% probability and color jittering as data augmentation.
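For concreteness, a PyTorch sketch of this optimizer, schedule, and augmentation setup; the network is a placeholder and the color-jitter strengths are our assumptions, not values from the paper:

```python
import torch
import torchvision.transforms as T

# Placeholder standing in for the full model; only the training setup is the point here.
model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)

# Adam with initial learning rate 2e-4, decayed by 0.8 every 5 epochs
# (scheduler.step() is called once per epoch).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)

# Training-time augmentation: 50% horizontal flip and color jittering
# (jitter strengths are illustrative assumptions).
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
])
```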
C.2 RealEstate10K
We train our model for 200K iterations on one NVIDIA GeForce RTX 3090 GPU, which takes about one day. We use \(\lambda _1^\text {space} = 10\), \(\lambda _\text {perc}^\text {space} = 10\), \(\lambda _1^\text {time} = 10\), and \(\lambda _\text {perc}^\text {time} = 0\). We use the Adam optimizer [9] with a constant learning rate of 0.0002.
C.3 Ablations on the Number of MPI Planes
To study the effect of the number of MPI planes, we perform an ablation study on the KITTI [6] dataset at resolution \(128\times 384\). As shown in Table 5, a small number of MPI planes (\(D=4\) or 8) degrades performance. Further increasing the number of planes from 16 to 32 yields only marginal gains at the cost of \(2.1\times \) longer training time. Therefore, we use \(D=16\) for all other experiments.
C.4 Modeling Dynamic Scenes
To test whether our method can model more dynamic scenes, we evaluate it on CATER [7], a dataset of scenes with 5–10 individually moving objects. We compare quantitatively against the video prediction baseline PredRNN [36]. As shown in Table 6, our model achieves better performance on all three metrics.
Qualitatively, our method makes temporal predictions consistent with the ground-truth object motions on this dataset. In Fig. 12, the model correctly recovers the purple and gold objects occluded by the blue cone; it handles occlusions effectively by warping from neighboring pixels with similar RGB values.
C.5 Discussions
While we focus on demonstrating the possibility of simultaneous extrapolation in both space and time, specific modules can be further optimized for each task. For example, it is possible to improve the dynamic scene representation to better handle video prediction with long horizons or highly complex motion, or to synthesize novel views with a large viewpoint change.
At the same time, although our method targets natural scenes and has many potential positive applications, such as interactive scene exploration for family entertainment, like all visual content generation methods it could be exploited by malicious users with negative impacts. We expect such impacts to be minimal, as our method is not designed to work with human videos. In our code release, we will explicitly specify allowable uses of our system with appropriate licenses, and we will use techniques such as watermarking to label visual content generated by our system.