Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models


Deep video action recognition models have been highly successful in recent years but require large quantities of manually-annotated data, which are expensive and laborious to obtain. In this work, we investigate the generation of synthetic training data for video action recognition, as synthetic data have been successfully used to supervise models for a variety of other computer vision tasks. We propose an interpretable parametric generative model of human action videos that relies on procedural generation, physics models and other components of modern game engines. With this model we generate a diverse, realistic, and physically plausible dataset of human action videos, called PHAV for “Procedural Human Action Videos”. PHAV contains a total of 39,982 videos, with more than 1000 examples for each of 35 action categories. Our video generation approach is not limited to existing motion capture sequences: 14 of these 35 categories are procedurally-defined synthetic actions. In addition, each video is represented with 6 different data modalities, including RGB, optical flow and pixel-level semantic labels. These modalities are generated almost simultaneously using the Multiple Render Targets feature of modern GPUs. In order to leverage PHAV, we introduce a deep multi-task (i.e., one that considers action classes from multiple datasets) representation learning architecture that is able to simultaneously learn from synthetic and real video datasets, even when their action categories differ. Our experiments on the UCF-101 and HMDB-51 benchmarks suggest that combining our large set of synthetic videos with small real-world datasets can boost recognition performance. Our approach also significantly outperforms video representations produced by fine-tuning state-of-the-art unsupervised generative models of videos.
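The multi-task idea described above, one shared representation feeding a separate classification head per dataset so that synthetic and real label spaces never mix, can be sketched in plain Python. The class, dimensions, and task names below are hypothetical illustrations, not the paper's actual architecture (which is a deep network, not a linear model):

```python
import math
import random

random.seed(0)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

class MultiTaskHead:
    """Toy stand-in for multi-task learning: one shared feature vector
    feeds separate classification heads, one per dataset, so synthetic
    (e.g. PHAV, 35 classes) and real (e.g. UCF-101, 101 classes) label
    spaces stay disjoint while the representation is shared."""

    def __init__(self, feat_dim, classes_per_task):
        # one weight matrix per task/dataset (hypothetical dimensions)
        self.heads = {
            task: [[random.gauss(0, 0.1) for _ in range(feat_dim)]
                   for _ in range(n_classes)]
            for task, n_classes in classes_per_task.items()
        }

    def forward(self, shared_features, task):
        # only the head matching the sample's source dataset is used
        logits = [sum(w * x for w, x in zip(row, shared_features))
                  for row in self.heads[task]]
        return softmax(logits)

model = MultiTaskHead(feat_dim=8, classes_per_task={"phav": 35, "ucf101": 101})
probs = model.forward([0.5] * 8, task="phav")
print(len(probs), abs(sum(probs) - 1.0) < 1e-9)  # → 35 True
```

During training, each mini-batch sample would only update the head of the dataset it came from, while gradients from all datasets flow into the shared features.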




  1. Dataset and tools are available for download in

  2. RootMotion’s PuppetMaster is an advanced active ragdoll physics asset for Unity\(^{\textregistered }\). For more details, please see

  3. The Accord.NET Framework is a framework for image processing, computer vision, machine learning, statistics, and general scientific computing in .NET. It is available for most .NET platforms, including Unity\(^{\textregistered }\). For more details, see

  4. Please note that a base motion can be assigned to more than one category, and therefore columns of this matrix do not necessarily sum up to one. An example is “car hit”, which could use motions that may belong to almost any other category (e.g., “run”, “walk”, “clap”) as long as the character gets hit by a car during its execution.
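As a toy illustration of this footnote (the categories, motions, and binary entries below are invented for the example, not taken from the paper's actual assignment matrix), here is a motion-to-category matrix whose columns sum to more than one because “car hit” reuses the other base motions:

```python
# Rows are action categories, columns are base motions. An entry of 1
# means the base motion can be used when generating that category.
motions = ["run", "walk", "clap"]
categories = ["run", "walk", "clap", "car hit"]
A = {
    "run":     {"run": 1, "walk": 0, "clap": 0},
    "walk":    {"run": 0, "walk": 1, "clap": 0},
    "clap":    {"run": 0, "walk": 0, "clap": 1},
    # "car hit" may reuse almost any base motion, so its row adds an
    # extra 1 to every column:
    "car hit": {"run": 1, "walk": 1, "clap": 1},
}
for m in motions:
    col_sum = sum(A[c][m] for c in categories)
    print(m, col_sum)  # each column sums to 2, not 1
```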




Acknowledgements

Antonio M. López acknowledges the financial support from the Spanish project TIN2017-88709-R (MINECO/AEI/FEDER, UE) and from ICREA under the ICREA Academia Program. As a CVC/UAB researcher, Antonio also acknowledges the Generalitat de Catalunya CERCA Program and its ACCIO agency.

Author information



Corresponding author

Correspondence to César Roberto de Souza.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Xavier Alameda-Pineda, Elisa Ricci, Albert Ali Salah, Nicu Sebe, Shuicheng Yan.



In this appendix, we include random frames (Figs. 20, 21, 22, 23, and 24) for a subset of the action categories in PHAV, followed by a table of pixel colors (Table 10) used in our semantic segmentation ground-truth.

The frames below show the effect of different variables and motion variations being used (cf. Table 4). Each frame below is marked with a label indicating the value for different variables during the execution of the video, using the legend shown in Fig. 19.
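The parameter variation shown in these frames can be sketched as a simple sampler. The variable names and value lists below are hypothetical stand-ins for the generation variables of Table 4; the actual model samples them from an interpretable probabilistic graphical model rather than independently and uniformly, as this simplified sketch does:

```python
import random

random.seed(42)

# Hypothetical value lists standing in for PHAV's generation variables
# (cf. Table 4). These names and values are illustrative, not the
# dataset's real parameter space.
VARIABLES = {
    "environment":  ["urban", "stadium", "park", "house"],
    "phase_of_day": ["dawn", "day", "dusk", "night"],
    "weather":      ["clear", "rain", "fog", "snow"],
    "human_model":  ["model_a", "model_b", "model_c"],
}

def sample_video_parameters():
    """Draw one configuration for a procedural video (independent
    uniform sampling here, purely for illustration)."""
    return {name: random.choice(values) for name, values in VARIABLES.items()}

params = sample_video_parameters()
print(sorted(params.keys()))
```

Each generated video would record the sampled configuration as its label, which is what the legend of Fig. 19 displays on the frames below.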

Fig. 20

Changing environments. Top: kick ball, bottom: synthetic car hit

Fig. 21

Changing phases of the day. Top: run, bottom: golf

Fig. 22

Changing weather. Top: walk, bottom: kick ball

Fig. 23

Changing motion variations. Top: kick ball, bottom: synthetic car hit

Fig. 24

Changing human models. Top: walk, bottom: golf

Table 10 Pixel-wise object-level classes in PHAV


About this article


Cite this article

de Souza, C.R., Gaidon, A., Cabon, Y. et al. Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models. Int J Comput Vis 128, 1505–1536 (2020).



Keywords

  • Procedural generation
  • Human action recognition
  • Synthetic data
  • Physics