High-Quality Video Generation from Static Structural Annotations


This paper proposes a novel unsupervised video generation approach conditioned on a single structural annotation map, which, in contrast to prior conditioned video generation approaches, provides a good balance between motion flexibility and visual quality in the generation process. Unlike end-to-end approaches that model scene appearance and dynamics in a single shot, we decompose this difficult task into two easier sub-tasks in a divide-and-conquer fashion. The first sub-task is an image-to-image (I2I) translation task that synthesizes a high-quality starting frame from the input structural annotation map. The second sub-task is an image-to-video (I2V) generation task that uses the synthesized starting frame and the associated structural annotation map to animate the scene dynamics, generating a photorealistic and temporally coherent video. We employ a cycle-consistent flow-based conditional variational autoencoder to capture the long-term motion distributions; the learned bi-directional flows ensure the physical reliability of the predicted motions and provide explicit occlusion handling in a principled manner. Integrating structural annotations into the flow prediction also improves the structural awareness of the I2V generation process. Quantitative and qualitative evaluations on autonomous driving and human action datasets demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods. The code has been released:



  1. Or other kinds of structural annotation maps, such as sketches or human skeletons. Intuitively, a reference image works as well.

  2. Note that an early version of this work was published in (Pan et al. 2019). Compared with that version, this paper makes substantial extensions, including a new probabilistic formulation of the whole framework, a more comprehensive bi-directional flow-based video generation scheme with occlusion-aware image synthesis, a new ablation study section, and more experiments on human action datasets.

  3. The forward flow \(\mathbf {W}_t^f\) is warped according to the backward flow \(\mathbf {W}_t^b\), via the backward warping operation (2).

  4. The UCF-101 and KTH Action datasets are first divided into per-class sub-datasets.

  5. Our results are rescaled to match the resolutions of those by Wang et al. (2018a).
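
The warping described in footnote 3 can be illustrated with a minimal NumPy sketch. It is not the released implementation, and the paper's backward warping operation (2) is not reproduced on this page, so the sketch makes several assumptions: bilinear sampling with border clamping, a flow layout of `(H, W, 2)` storing `(dx, dy)` per pixel, and a forward-backward consistency test with a hypothetical tolerance `tol` as the occlusion criterion. The function names `backward_warp` and `occlusion_mask` are illustrative only.

```python
import numpy as np


def backward_warp(src, flow):
    """Sample `src` at positions displaced by `flow`.

    Assumptions (not taken from the paper): bilinear interpolation,
    coordinates clamped to the image border.

    src:  (H, W) or (H, W, C) array, e.g. an image or a flow field.
    flow: (H, W, 2) array; flow[y, x] = (dx, dy) means that output pixel
          (x, y) is read from source location (x + dx, y + dy).
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)

    # Source coordinates for every output pixel, clamped to the border.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each source coordinate.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Bilinear weights; add a channel axis when `src` has channels.
    wx = sx - x0
    wy = sy - y0
    if src.ndim == 3:
        wx = wx[..., None]
        wy = wy[..., None]

    top = (1 - wx) * src[y0, x0] + wx * src[y0, x1]
    bottom = (1 - wx) * src[y1, x0] + wx * src[y1, x1]
    return (1 - wy) * top + wy * bottom


def occlusion_mask(flow_fwd, flow_bwd, tol=0.5):
    """Flag pixels where forward and backward flows disagree.

    The forward flow is warped by the backward flow (as in footnote 3).
    For a non-occluded pixel the two flows should roughly cancel, so a
    large residual is treated as occlusion. `tol` (in pixels) is a
    hypothetical threshold chosen for illustration.
    """
    warped_fwd = backward_warp(flow_fwd, flow_bwd)
    residual = np.linalg.norm(warped_fwd + flow_bwd, axis=-1)
    return residual > tol
```

Under these assumptions, a pair of flows that exactly cancel yields an empty mask, while a forward flow paired with a zero backward flow is flagged everywhere. A learned model would typically use a differentiable sampler in place of this NumPy version.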


This work was supported in part by the National Natural Science Foundation of China under Grant No. 61906012, and in part by Singapore MOE AcRF Tier 1 (2018-T1-002-056), NTU SUG, and NTU NAP.

Lu Sheng and Junting Pan contributed equally.

