Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation

Devaranjan, Jeevan; Kar, Amlan; Fidler, Sanja

doi:10.1007/978-3-030-58520-4_42

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12362)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Procedural models are widely used to synthesize scenes for graphics and gaming, and to create (labeled) synthetic datasets for ML. To produce realistic and diverse scenes, experts must carefully tune a number of parameters governing the procedural models. These parameters control both the structure of the generated scenes (e.g. how many cars are in the scene) and the placement of objects in valid configurations. Meta-Sim aimed to tune parameters automatically, in an unsupervised way, given a target collection of real images. In Meta-Sim2, we aim to learn the scene structure in addition to the parameters, which is a challenging problem due to its discrete nature. Meta-Sim2 proceeds by learning to sequentially sample rule expansions from a given probabilistic scene grammar. Because the problem is discrete, we use Reinforcement Learning to train our model, and design a feature space divergence between our synthesized and target images that is key to successful training. Experiments on a real driving dataset show that, without any supervision, we can successfully learn to generate data that captures discrete structural statistics of objects, such as their frequency, in real images. We also show that this leads to downstream improvement in the performance of an object detector trained on our generated dataset as opposed to other baseline simulation methods. Project page: https://nv-tlabs.github.io/meta-sim-structure/.
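
The idea of sequentially sampling rule expansions from a probabilistic grammar and training the sampler with a score-function (REINFORCE-style) gradient can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the two-rule grammar, the names `GRAMMAR`, `sample_scene`, and `reinforce_step`, and above all the reward, which here is a simple per-scene car-count mismatch standing in for the feature space divergence between synthesized and target images that the abstract describes.

```python
import math
import random

# Hypothetical toy grammar: each non-terminal maps to its possible expansions.
GRAMMAR = {
    "Scene": [["Road"]],
    "Road": [["Car", "Road"], []],  # add one more car and recurse, or stop
}


def init_logits():
    """One learnable logit per rule expansion, all starting at zero."""
    return {symbol: [0.0] * len(rules) for symbol, rules in GRAMMAR.items()}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_scene(logits, rng, max_steps=50):
    """Expand non-terminals one at a time; return terminals and the choices made."""
    stack, terminals, choices = ["Scene"], [], []
    steps = 0
    while stack and steps < max_steps:
        symbol = stack.pop()
        if symbol not in GRAMMAR:
            terminals.append(symbol)  # terminal symbol: an object in the scene
            continue
        probs = softmax(logits[symbol])
        idx = rng.choices(range(len(probs)), weights=probs)[0]
        choices.append((symbol, idx))
        stack.extend(reversed(GRAMMAR[symbol][idx]))
        steps += 1
    return terminals, choices


def reinforce_step(logits, rng, target_cars, lr=0.1, batch=32):
    """One score-function gradient step; returns the mean reward of the batch."""
    samples = [sample_scene(logits, rng) for _ in range(batch)]
    # Stand-in reward (assumption): negative car-count mismatch per scene.
    rewards = [-abs(t.count("Car") - target_cars) for t, _ in samples]
    baseline = sum(rewards) / batch
    grads = {s: [0.0] * len(v) for s, v in logits.items()}
    for (_, choices), reward in zip(samples, rewards):
        for symbol, idx in choices:
            probs = softmax(logits[symbol])
            for j, p in enumerate(probs):
                # d log softmax(idx) / d logit_j = 1[j == idx] - p_j
                grads[symbol][j] += (reward - baseline) * ((j == idx) - p) / batch
    for s in logits:
        for j in range(len(logits[s])):
            logits[s][j] += lr * grads[s][j]
    return baseline


if __name__ == "__main__":
    rng = random.Random(0)
    logits = init_logits()
    for _ in range(200):
        mean_reward = reinforce_step(logits, rng, target_cars=3)
    print("mean reward of last batch:", mean_reward)
    print("P(add car), P(stop):", softmax(logits["Road"]))
```

The score-function estimator is used here because the choice of which rule to expand is discrete, so the reward cannot be differentiated through the sample; only the log-probability of the chosen expansions is differentiated.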

J. Devaranjan and A. Kar contributed equally; work done during JD’s internship at NVIDIA.

Notes

1. This equality does not hold in general for rendering, but it worked well in practice.
2. We did not explore sampling from a continuous relaxation of the discrete variable here.

Author information

Authors and Affiliations

NVIDIA, Waterloo, Canada
Jeevan Devaranjan, Amlan Kar & Sanja Fidler
University of Toronto, Toronto, Canada
Amlan Kar & Sanja Fidler
University of Waterloo, Waterloo, Canada
Jeevan Devaranjan
Vector Institute, Toronto, Canada
Amlan Kar & Sanja Fidler

Corresponding author

Correspondence to Amlan Kar.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

Cite this paper

Devaranjan, J., Kar, A., Fidler, S. (2020). Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12362. Springer, Cham. https://doi.org/10.1007/978-3-030-58520-4_42

DOI: https://doi.org/10.1007/978-3-030-58520-4_42
Published: 19 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58519-8
Online ISBN: 978-3-030-58520-4
eBook Packages: Computer Science, Computer Science (R0)