Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12362)


Procedural models are widely used to synthesize scenes for graphics and gaming, and to create (labeled) synthetic datasets for ML. In order to produce realistic and diverse scenes, a number of parameters governing the procedural models have to be carefully tuned by experts. These parameters control both the structure of the generated scenes (e.g. how many cars are in the scene) and the placement of objects in valid configurations. Meta-Sim aimed to tune these parameters automatically, given a target collection of real images, in an unsupervised way. In Meta-Sim2, we aim to learn the scene structure in addition to the parameters, which is a challenging problem due to its discrete nature. Meta-Sim2 proceeds by learning to sequentially sample rule expansions from a given probabilistic scene grammar. Because the problem is discrete, we use Reinforcement Learning to train our model, and design a feature space divergence between our synthesized and target images that is key to successful training. Experiments on a real driving dataset show that, without any supervision, we can successfully learn to generate data that captures discrete structural statistics of objects, such as their frequency, in real images. We also show that this leads to downstream improvement in the performance of an object detector trained on our generated dataset as opposed to other baseline simulation methods. Project page:
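
To make "sequentially sampling rule expansions from a probabilistic scene grammar" concrete, the following is a minimal sketch and not the authors' implementation. The toy grammar, its symbols (Scene, Road, Lane, Cars, car), the fixed rule probabilities, and the function name sample_scene are all hypothetical. In Meta-Sim2 the expansion probabilities would come from a learned model rather than a fixed table; the accumulated log-probability returned here is the kind of quantity a score-function estimator such as REINFORCE would weight by a reward, which is an assumption about how the Reinforcement Learning step could be set up.

```python
import math
import random

# Hypothetical toy grammar: each non-terminal maps to a list of
# (probability, expansion) pairs. Symbols missing from the dict are terminals.
GRAMMAR = {
    "Scene": [(1.0, ["Road"])],
    "Road": [(1.0, ["Lane", "Lane"])],
    "Lane": [(1.0, ["Cars"])],
    # Recursive rule: add one more car with probability 0.6, else stop.
    "Cars": [(0.6, ["car", "Cars"]), (0.4, [])],
}


def sample_scene(grammar, start="Scene", rng=None, max_steps=1000):
    """Expand non-terminals one at a time.

    Returns the list of terminals and the log-probability of the sampled
    sequence of rule expansions. Expansion stops early (leaving the scene
    incomplete) if max_steps symbols have been processed.
    """
    rng = rng or random.Random()
    stack = [start]
    terminals = []
    log_prob = 0.0
    for _ in range(max_steps):
        if not stack:
            break
        symbol = stack.pop()
        if symbol not in grammar:
            terminals.append(symbol)
            continue
        probs = [p for p, _ in grammar[symbol]]
        expansions = [e for _, e in grammar[symbol]]
        # Sample one rule expansion for this non-terminal.
        idx = rng.choices(range(len(expansions)), weights=probs)[0]
        log_prob += math.log(probs[idx])
        # Push children in reverse so the leftmost child is expanded first.
        stack.extend(reversed(expansions[idx]))
    return terminals, log_prob


if __name__ == "__main__":
    scene, lp = sample_scene(GRAMMAR, rng=random.Random(0))
    print(len(scene), "cars, log-probability", lp)
```

The number of "car" terminals varies from sample to sample, which is the discrete structural statistic the abstract refers to. Changing the 0.6 and 0.4 weights changes the distribution over car counts without touching any continuous placement parameters.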




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. NVIDIA, Waterloo, Canada
  2. University of Toronto, Toronto, Canada
  3. University of Waterloo, Waterloo, Canada
  4. Vector Institute, Toronto, Canada
