Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation

  • Conference paper

Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12362)

Abstract

Procedural models are widely used to synthesize scenes for graphics and gaming, and to create (labeled) synthetic datasets for ML. In order to produce realistic and diverse scenes, a number of parameters governing the procedural models have to be carefully tuned by experts. These parameters control both the structure of the scenes being generated (e.g. how many cars are in the scene) and the placement of objects in valid configurations. Meta-Sim aimed to automatically tune these parameters given a target collection of real images in an unsupervised way. In Meta-Sim2, we aim to learn the scene structure in addition to the parameters, which is a challenging problem due to its discrete nature. Meta-Sim2 proceeds by learning to sequentially sample rule expansions from a given probabilistic scene grammar. Due to the discrete nature of the problem, we use Reinforcement Learning to train our model, and design a feature-space divergence between our synthesized and target images that is key to successful training. Experiments on a real driving dataset show that, without any supervision, we can successfully learn to generate data that captures discrete structural statistics of objects, such as their frequency, in real images. We also show that this leads to downstream improvements in the performance of an object detector trained on our generated data compared to other baseline simulation methods. Project page: https://nv-tlabs.github.io/meta-sim-structure/.
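The abstract outlines the core loop: sample discrete rule expansions from a probabilistic scene grammar, render the resulting scenes, and score them against real images with a feature-space divergence that drives a reinforcement-learning update. The sketch below is an illustration of that idea only, not the authors' implementation; it assumes a small PyTorch policy over rule expansions, an RBF-kernel MMD as the divergence, and a hypothetical render_and_featurize function standing in for the renderer plus a frozen feature extractor.

```python
# Illustrative sketch only -- not the paper's code. Assumed pieces: a GRU policy over
# grammar-rule expansions, an RBF-kernel MMD as the feature-space divergence, and a
# user-supplied render_and_featurize(rules) that renders the sampled scene and
# returns features from a frozen network.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between feature sets x (n, d) and y (m, d)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()


class RulePolicy(nn.Module):
    """Autoregressive policy that picks one grammar-rule expansion per step."""
    def __init__(self, n_rules, hidden=32):
        super().__init__()
        self.n_rules = n_rules
        self.rnn = nn.GRUCell(n_rules, hidden)
        self.head = nn.Linear(hidden, n_rules)

    def sample(self, n_steps):
        h = torch.zeros(1, self.rnn.hidden_size)
        x = torch.zeros(1, self.n_rules)
        rules, log_prob = [], 0.0
        for _ in range(n_steps):
            h = self.rnn(x, h)
            dist = Categorical(logits=self.head(h))
            r = dist.sample()
            rules.append(int(r))
            log_prob = log_prob + dist.log_prob(r).sum()
            x = F.one_hot(r, self.n_rules).float()
        return rules, log_prob


def reinforce_step(policy, optimizer, render_and_featurize, real_feats,
                   n_steps=8, batch=4):
    """One update: reward = -MMD(synthetic features, real features)."""
    losses = []
    for _ in range(batch):
        rules, log_prob = policy.sample(n_steps)
        synth_feats = render_and_featurize(rules)    # non-differentiable render step
        reward = -rbf_mmd2(synth_feats, real_feats)
        losses.append(-reward.detach() * log_prob)   # REINFORCE (no baseline for brevity)
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()
```

In practice a variance-reducing baseline and a carefully chosen feature extractor matter; the point of the sketch is only that the non-differentiable sampling and rendering steps can still be trained against an image-level divergence via the score-function (REINFORCE) estimator.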

J. Devaranjan and A. Kar contributed equally. Work done during J. Devaranjan's internship at NVIDIA.

Notes

  1. This equality does not hold in general for rendering, but it worked well in practice.

  2. We did not explore sampling from a continuous relaxation of the discrete variable here.
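
Footnote 2 refers to continuous relaxations of discrete variables as an unexplored alternative to REINFORCE. Purely to clarify the term (this is not something the paper uses), the snippet below shows one standard relaxation, the Gumbel-Softmax; the temperature and shapes are arbitrary assumptions.

```python
# Illustration of a continuous relaxation of a discrete choice (not used in the paper):
# a Gumbel-Softmax sample is a point on the simplex close to a one-hot vector, so
# gradients can flow back to the logits instead of requiring a score-function estimator.
import torch
import torch.nn.functional as F

logits = torch.randn(4, requires_grad=True)           # scores over 4 hypothetical rule expansions
relaxed = F.gumbel_softmax(logits, tau=0.5)           # soft, differentiable "sample"
relaxed.sum().backward()                              # gradients reach the logits
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)   # straight-through one-hot variant
```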

Author information

Corresponding author

Correspondence to Amlan Kar.

Electronic supplementary material

Supplementary material 1 (PDF 24,767 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Devaranjan, J., Kar, A., Fidler, S. (2020). Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12362. Springer, Cham. https://doi.org/10.1007/978-3-030-58520-4_42

  • DOI: https://doi.org/10.1007/978-3-030-58520-4_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58519-8

  • Online ISBN: 978-3-030-58520-4

  • eBook Packages: Computer Science, Computer Science (R0)
