
High-Quality Video Generation from Static Structural Annotations


This paper proposes a novel unsupervised video generation method conditioned on a single structural annotation map. In contrast to prior conditioned video generation approaches, it provides a good balance between motion flexibility and visual quality in the generation process. Unlike end-to-end approaches that model scene appearance and dynamics in a single shot, we decompose this difficult task into two easier sub-tasks in a divide-and-conquer fashion, thereby achieving strong overall results. The first sub-task is an image-to-image (I2I) translation task that synthesizes a high-quality starting frame from the input structural annotation map. The second is an image-to-video (I2V) generation task that uses the synthesized starting frame and the associated structural annotation map to animate the scene dynamics, producing a photorealistic and temporally coherent video. We employ a cycle-consistent, flow-based conditional variational autoencoder to capture long-term motion distributions; the learned bi-directional flows ensure the physical reliability of the predicted motions and provide explicit occlusion handling in a principled manner. Integrating structural annotations into the flow prediction also improves structural awareness in the I2V generation process. Quantitative and qualitative evaluations on autonomous driving and human action datasets demonstrate the effectiveness of the proposed approach over state-of-the-art methods. The code has been released:




  1. Or other kinds of structural annotation maps, such as sketches or human skeletons. Intuitively, a reference image works as well.

  2. An early version of this work was published as Pan et al. (2019). Compared with that version, this paper makes substantial extensions, including a new probabilistic formulation of the whole framework, a more comprehensive bi-directional flow-based video generation scheme with occlusion-aware image synthesis, a new ablation study section, and more experiments on human action datasets.

  3. The forward flow \(\mathbf {W}_t^f\) is warped according to the backward flow \(\mathbf {W}_t^b\), via the backward warping operation (2).

  4. The UCF-101 and KTH Action datasets are first divided into per-class sub-datasets.

  5. Our results are rescaled to match the resolutions of those by Wang et al. (2018a).
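
Footnote 3 refers to warping the forward flow by the backward flow through a backward warping operation. The paper's equation (2) is not reproduced on this page, so the following is only a minimal NumPy sketch of generic bilinear backward warping, together with a forward-backward consistency check of the kind commonly used to flag occlusions. The function names `backward_warp` and `occlusion_mask`, the `(dx, dy)` channel order, the border clamping, and the threshold `tol` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def backward_warp(src, flow):
    """Bilinearly sample `src` at positions displaced by `flow`.

    src:  (H, W, C) array to warp (an image or another flow field).
    flow: (H, W, 2) array; flow[y, x] = (dx, dy) points from target
          pixel (x, y) to the location in `src` to read from (assumed layout).
    """
    h, w = flow.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling coordinates, clamped to the image border (an assumption).
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * src[y0, x0] + wx * src[y0, x1]
    bottom = (1 - wx) * src[y1, x0] + wx * src[y1, x1]
    return (1 - wy) * top + wy * bottom

def occlusion_mask(flow_fwd, flow_bwd, tol=0.5):
    """Flag pixels where forward and backward flows disagree.

    The forward flow is warped by the backward flow; for a visible pixel
    the two displacements should cancel. `tol` (in pixels) is hypothetical.
    """
    warped_fwd = backward_warp(flow_fwd, flow_bwd)
    residual = np.linalg.norm(flow_bwd + warped_fwd, axis=-1)
    return residual > tol

# Usage: a uniform 2-pixel shift to the right is perfectly cycle-consistent.
fwd = np.zeros((4, 6, 2)); fwd[..., 0] = 2.0
bwd = -fwd
mask = occlusion_mask(fwd, bwd)
```

In this sketch, pixels where the mask is true would be treated as occluded and filled in by the image synthesis stage rather than copied by warping; how the paper actually uses its occlusion estimate is not stated in this excerpt.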


  • Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A. (2017). Deep reinforcement learning: a brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.

  • Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R.H., Levine, S. (2017). Stochastic variational video prediction. In: ICLR.

  • Balakrishnan, G., Zhao, A., Dalca, A.V., Durand, F., Guttag, J. (2018). Synthesizing images of humans in unseen poses. In: CVPR, IEEE.

  • Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D. (2017). Unsupervised pixel-level domain adaptation with generative adversarial networks. In: CVPR, IEEE.

  • Carreira, J., Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR, IEEE.

  • Chen, B., Wang, W., Wang, J. (2017). Video imagination from a single image with transformation generation. In: ACM MM, ACM, pp 358–366.

  • Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In: CVPR, IEEE.

  • Denton, E., Fergus, R. (2018). Stochastic video generation with a learned prior. In: ICML.

  • Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. In: ICCV, IEEE, pp 2758–2766.

  • Finn, C., Goodfellow, I., Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In: NIPS, pp 64–72.

  • Ganin, Y., Kononenko, D., Sungatullina, D., Lempitsky, V. (2016). Deepwarp: Photorealistic image resynthesis for gaze manipulation. In: ECCV, Springer, pp 311–326.

  • Geiger, A., Lenz, P., Stiller, C., Urtasun, R. (2013). Vision meets robotics: The kitti dataset. IJRR.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. (2014). Generative adversarial nets. In: NIPS, pp 2672–2680.

  • Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NIPS, pp 6626–6637.

  • Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A. (2017). Image-to-image translation with conditional adversarial networks. In: CVPR.

  • Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In: NIPS, pp 2017–2025.

  • Jiang, H., Sun, D., Jampani, V., Yang, M.H., Learned-Miller, E., Kautz, J. (2018). Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In: CVPR, IEEE.

  • Johnson, J., Alahi, A., Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In: ECCV, Springer, pp 694–711.

  • Johnson, J., Gupta, A., Fei-Fei, L. (2018). Image generation from scene graphs. In: CVPR, IEEE.

  • Kalchbrenner, N., Oord, Avd., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., Kavukcuoglu, K. (2016). Video pixel networks. arXiv preprint arXiv:1610.00527.

  • Kingma, D.P., Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

  • Laptev, I., Caputo, B., et al. (2004). Recognizing human actions: a local SVM approach. In: ICPR, IEEE, pp 32–36.

  • Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H. (2018). Flow-grounded spatial-temporal video prediction from still images. In: ECCV, Springer.

  • Liang, X., Lee, L., Dai, W., Xing, E.P. (2017). Dual motion GAN for future-flow embedded video prediction. In: ICCV, IEEE.

  • Liu, G., Reda, F.A., Shih, K.J., Wang, T.C., Tao, A., Catanzaro, B. (2018). Image inpainting for irregular holes using partial convolutions. In: ECCV, Springer.

  • Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A. (2017). Video frame synthesis using deep voxel flow. In: ICCV, IEEE.

  • Luo, Z., Peng, B., Huang, D.A., Alahi, A., Fei-Fei, L. (2017). Unsupervised learning of long-term motion dynamics for videos. arXiv preprint arXiv:1701.01821.

  • Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L. (2017). Pose guided person image generation. In: NIPS, pp 406–416.

  • Mathieu, M., Couprie, C., LeCun, Y. (2015). Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.

  • Meister, S., Hur, J., Roth, S. (2018). UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In: AAAI, New Orleans, Louisiana.

  • Oord, Avd., Kalchbrenner, N., Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In: ICML.

  • Pan, J., Wang, C., Jia, X., Shao, J., Sheng, L., Yan, J., Wang, X. (2019). Video generation from single semantic label map. In: CVPR, IEEE.

  • Patraucean, V., Handa, A., Cipolla, R. (2015). Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309.

  • Pintea, S.L., van Gemert, J.C., Smeulders, A.W.M. (2014). Dejavu: Motion prediction in static images. In: ECCV, Springer.

  • Radford, A., Metz, L., Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

  • Saito, M., Matsumoto, E., Saito, S. (2017). Temporal generative adversarial nets with singular value clipping. In: ICCV, IEEE.

  • Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R. (2017). Learning from simulated and unsupervised images through adversarial training. In: CVPR, IEEE.

  • Sohn, K., Lee, H., Yan, X. (2015). Learning structured output representation using deep conditional generative models. In: NIPS, pp 3483–3491.

  • Soomro, K., Zamir, A.R., Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

  • Srivastava, N., Mansimov, E., Salakhudinov, R. (2015). Unsupervised learning of video representations using LSTMs. In: ICML, pp 843–852.

  • Sun, D., Yang, X., Liu, M.Y., Kautz, J. (2018). PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: CVPR.

  • Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J. (2018). MoCoGAN: Decomposing motion and content for video generation. In: CVPR, IEEE.

  • Uria, B., Côté, M.A., Gregor, K., Murray, I., Larochelle, H. (2016). Neural autoregressive distribution estimation. JMLR, 17(1), 7184–7220.

  • Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H. (2017a). Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033.

  • Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., Lee, H. (2017b). Learning to generate long-term future via hierarchical prediction. In: ICML.

  • Vondrick, C., Torralba, A. (2017). Generating the future with adversarial transformers. In: CVPR, IEEE.

  • Vondrick, C., Pirsiavash, H., Torralba, A. (2016a). Anticipating visual representations from unlabeled video. In: CVPR, IEEE, pp 98–106.

  • Vondrick, C., Pirsiavash, H., Torralba, A. (2016b). Generating videos with scene dynamics. In: NIPS, pp 613–621.

  • Walker, J., Doersch, C., Gupta, A., Hebert, M. (2016). An uncertain future: Forecasting from static images using variational autoencoders. In: ECCV, Springer, pp 835–851.

  • Walker, J., Gupta, A., Hebert, M. (2014). Patch to the future: Unsupervised visual prediction. In: CVPR, IEEE, pp 3302–3309.

  • Walker, J., Gupta, A., Hebert, M. (2015). Dense optical flow prediction from a static image. In: ICCV, IEEE, pp 2443–2451.

  • Wang, T.C., Liu, M.Y., Zhu, J.Y., Liu, G., Tao, A., Kautz, J., Catanzaro, B. (2018a). Video-to-video synthesis. In: NeurIPS.

  • Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B. (2018b). High-resolution image synthesis and semantic manipulation with conditional GANs. In: CVPR, IEEE.

  • Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P. (2004). Image quality assessment: from error visibility to structural similarity. TIP, 13(4), 600–612.

  • Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T. (2017). Video enhancement with task-oriented flow. arXiv preprint arXiv:1711.09078.

  • Xue, T., Wu, J., Bouman, K., Freeman, B. (2016). Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In: NIPS, pp 91–99.

  • Yin, Z., Shi, J. (2018). Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In: CVPR, IEEE.

  • Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N. (2017). StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: ICCV, pp 5907–5915.

  • Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D. (2018). Learning to forecast and refine residual motion for image-to-video generation. In: ECCV, Springer.

  • Zheng, Z., Zheng, L., Yang, Y. (2017). Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In: ICCV, pp 3754–3762.

  • Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A. (2016). View synthesis by appearance flow. In: ECCV, Springer, pp 286–301.

  • Zhu, J.Y., Park, T., Isola, P., Efros, A.A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV, IEEE.



This work was supported in part by the National Natural Science Foundation of China under Grant No. 61906012, and in part by Singapore MOE AcRF Tier 1 (2018-T1-002-056), NTU SUG, and NTU NAP.

Author information



Corresponding author

Correspondence to Lu Sheng.

Additional information

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Lu Sheng and Junting Pan contributed equally.


About this article


Cite this article

Sheng, L., Pan, J., Guo, J. et al. High-Quality Video Generation from Static Structural Annotations. Int J Comput Vis 128, 2552–2569 (2020).
