Playing for Data: Ground Truth from Computer Games

  • Stephan R. RichterEmail author
  • Vibhav Vineet
  • Stefan Roth
  • Vladlen Koltun
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9906)


Recent progress in computer vision has been driven by high-capacity models trained on large datasets. Unfortunately, creating large datasets with pixel-level labels has been extremely costly due to the amount of human effort required. In this paper, we present an approach to rapidly creating pixel-accurate semantic label maps for images extracted from modern computer games. Although the source code and the internal operation of commercial games are inaccessible, we show that associations between image patches can be reconstructed from the communication between the game and the graphics hardware. This enables rapid propagation of semantic labels within and across images synthesized by the game, with no access to the source code or the content. We validate the presented approach by producing dense pixel-level semantic annotations for 25 thousand images synthesized by a photorealistic open-world computer game. Experiments on semantic segmentation datasets show that using the acquired data to supplement real-world images significantly increases accuracy and that the acquired data enables reducing the amount of hand-labeled real-world data: models trained with game data and just \(\tfrac{1}{3}\) of the CamVid training set outperform models trained on the complete CamVid training set.


Association Rule Association Rule Mining Graphic Hardware Visual Odometry Semantic Label 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



SRR was supported in part by the German Research Foundation (DFG) within the GRK 1362. Additionally, SRR and SR were supported in part by the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement No. 307942. Work on this project was conducted in part while SRR was an intern at Intel Labs.


  1. 1.
    Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: VLDB (1994)Google Scholar
  2. 2.
    Akenine-Möller, T., Haines, E., Hoffman, N.: Real-Time Rendering, 3rd edn. A K Peters, Natick (2008)CrossRefGoogle Scholar
  3. 3.
    Appleby, A.: Murmurhash.
  4. 4.
    Aubry, M., Maturana, D., Efros, A.A., Russell, B.C., Sivic, J.: Seeing 3D chairs: exemplar part-based 2D–3D alignment using a large dataset of CAD models. In: CVPR (2014)Google Scholar
  5. 5.
    Aubry, M., Russell, B.C.: Understanding deep features with computer-generated imagery. In: ICCV (2015)Google Scholar
  6. 6.
    Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. IJCV 92(1), 1–31 (2011)CrossRefGoogle Scholar
  7. 7.
    Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. IJCV 12(1), 43–77 (1994)CrossRefGoogle Scholar
  8. 8.
    Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: a high-definition ground truth database. Pattern Recogn. Lett. 30(2), 88–97 (2009)CrossRefGoogle Scholar
  9. 9.
    Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7577, pp. 611–625. Springer, Heidelberg (2012). doi: 10.1007/978-3-642-33783-3_44 Google Scholar
  10. 10.
    Chen, C., Seff, A., Kornhauser, A.L., Xiao, J.: DeepDriving: learning affordance for direct perception in autonomous driving. In: ICCV (2015)Google Scholar
  11. 11.
    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)Google Scholar
  12. 12.
    Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: learning optical flow with convolutional networks. In: ICCV (2015)Google Scholar
  13. 13.
    Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Rob. Res. 32(11), 1231–1237 (2013)CrossRefGoogle Scholar
  14. 14.
    Haltakov, V., Unger, C., Ilic, S.: Framework for generation of synthetic ground truth data for driver assistance applications. In: Weickert, J., Hein, M., Schiele, B. (eds.) GCPR 2013. LNCS, vol. 8142, pp. 323–332. Springer, Heidelberg (2013). doi: 10.1007/978-3-642-40602-7_35 CrossRefGoogle Scholar
  15. 15.
    Handa, A., Pătrăucean, V., Badrinarayanan, V., Stent, S., Cipolla, R.: Understanding real world indoor scenes with synthetic data. In: CVPR (2016)Google Scholar
  16. 16.
    Handa, A., Whelan, T., McDonald, J., Davison, A.J.: A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In: ICRA (2014)Google Scholar
  17. 17.
    He, X., Zemel, R.S., Carreira-Perpiñán, M.: Multiscale conditional random fields for image labeling. In: CVPR (2004)Google Scholar
  18. 18.
    Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artif. Intell. 17(1–3), 185–203 (1981)CrossRefGoogle Scholar
  19. 19.
    Hunt, G., Brubacher, D.: Detours: binary interception of Win32 functions. In: 3rd USENIX Windows NT Symposium (1999)Google Scholar
  20. 20.
    Intel: Intel Graphics Performance Analyzers.
  21. 21.
    Kaneva, B., Torralba, A., Freeman, W.T.: Evaluation of image features using a photorealistic virtual world. In: ICCV (2011)Google Scholar
  22. 22.
    Karlsson, B.: RenderDoc.
  23. 23.
    Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS (2011)Google Scholar
  24. 24.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)Google Scholar
  25. 25.
    Kundu, A., Vineet, V., Koltun, V.: Feature space optimization for semantic video segmentation. In: CVPR (2016)Google Scholar
  26. 26.
    Liebelt, J., Schmid, C., Schertler, K.: Viewpoint-independent object class detection using 3D feature maps. In: CVPR (2008)Google Scholar
  27. 27.
    Liu, B., He, X.: Multiclass semantic video segmentation with object-level active inference. In: CVPR (2015)Google Scholar
  28. 28.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)Google Scholar
  29. 29.
    Marín, J., Vázquez, D., Gerónimo, D., López, A.M.: Learning appearance in virtual scenarios for pedestrian detection. In: CVPR (2010)Google Scholar
  30. 30.
    Massa, F., Russell, B.C., Aubry, M.: Deep exemplar 2D–3D detection by adapting from real to rendered views. In: CVPR (2016)Google Scholar
  31. 31.
    Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: CVPR (2016)Google Scholar
  32. 32.
    McCane, B., Novins, K., Crannitch, D., Galvin, B.: On benchmarking optical flow. CVIU 84(1), 126–143 (2001)zbMATHGoogle Scholar
  33. 33.
    Papon, J., Schoeler, M.: Semantic pose using deep networks trained on synthetic RGB-D. In: ICCV (2015)Google Scholar
  34. 34.
    Peng, X., Sun, B., Ali, K., Saenko, K.: Learning deep object detectors from 3D models. In: ICCV (2015)Google Scholar
  35. 35.
    Pepik, B., Stark, M., Gehler, P.V., Schiele, B.: Multi-view and 3D deformable part models. PAMI 37(11), 2232–2245 (2015)CrossRefGoogle Scholar
  36. 36.
    Ren, X., Malik, J.: Learning a classification model for segmentation. In: ICCV (2003)Google Scholar
  37. 37.
    Richter, S.R., Roth, S.: Discriminative shape from shading in uncalibrated illumination. In: CVPR (2015)Google Scholar
  38. 38.
    Rockstar Games: Policy on posting copyrighted Rockstar Games material.
  39. 39.
    Ros, G., Ramos, S., Granados, M., Bakhtiary, A., Vázquez, D., López, A.M.: Vision-based offline-online perception paradigm for autonomous driving. In: WACV (2015)Google Scholar
  40. 40.
    Saito, T., Takahashi, T.: Comprehensible rendering of 3-D shapes. In: SIGGRAPH (1990)Google Scholar
  41. 41.
    Sharp, T., Keskin, C., Robertson, D.P., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A.W., Izadi, S.: Accurate, robust, and flexible real-time hand tracking. In: CHI (2015)Google Scholar
  42. 42.
    Shotton, J., Girshick, R.B., Fitzgibbon, A.W., Sharp, T., Cook, M., Finocchio, M., Moore, R., Kohli, P., Criminisi, A., Kipman, A., Blake, A.: Efficient human pose estimation from single depth images. PAMI 35(12), 2821–2840 (2013)CrossRefGoogle Scholar
  43. 43.
    Shotton, J., Winn, J.M., Rother, C., Criminisi, A.: Textonboost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV 81(1), 2–23 (2009)CrossRefGoogle Scholar
  44. 44.
    Stark, M., Goesele, M., Schiele, B.: Back to the future: learning shape models from 3D CAD data. In: BMVC (2010)Google Scholar
  45. 45.
    Sturgess, P., Alahari, K., Ladicky, L., Torr, P.H.S.: Combining appearance and structure from motion features for road scene understanding. In: BMVC (2009)Google Scholar
  46. 46.
    Taylor, G.R., Chosak, A.J., Brewer, P.C.: OVVV: using virtual worlds to design and evaluate surveillance systems. In: CVPR (2007)Google Scholar
  47. 47.
    Tighe, J., Lazebnik, S.: Superparsing – scalable nonparametric image parsing with superpixels. IJCV 101(2), 329–349 (2013)MathSciNetCrossRefGoogle Scholar
  48. 48.
    Tripathi, S., Belongie, S., Hwang, Y., Nguyen, T.Q.: Semantic video segmentation: exploring inference efficiency. In: ISOCC (2015)Google Scholar
  49. 49.
    Vázquez, D., López, A.M., Marín, J., Ponsa, D., Gomez, D.G.: Virtual and real world adaptation for pedestrian detection. PAMI 36(4), 797–809 (2014)CrossRefGoogle Scholar
  50. 50.
    Xie, J., Kiefel, M., Sun, M.T., Geiger, A.: Semantic instance annotation of street scenes by 3D to 2D label transfer. In: CVPR (2016)Google Scholar
  51. 51.
    Xu, J., Vázquez, D., López, A.M., Marín, J., Ponsa, D.: Learning a part-based pedestrian detector in a virtual world. IEEE Trans. Intell. Transp. Syst. 15(5), 2121–2131 (2014)CrossRefGoogle Scholar
  52. 52.
    Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)Google Scholar
  53. 53.
    Veeravasarapu, V.S.R., Rothkopf, C., Ramesh, V.: Model-driven simulations for deep convolutional neural networks (2016). arXiv preprint arXiv:1605.09582
  54. 54.
    Shafaei, A., Little, J.J., Schmidt, M.: Play and learn: using video Games to train computer vision models. In: BMVC (2016)Google Scholar

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Stephan R. Richter
    • 1
    Email author
  • Vibhav Vineet
    • 2
  • Stefan Roth
    • 1
  • Vladlen Koltun
    • 2
  1. 1.TU DarmstadtDarmstadtGermany
  2. 2.Intel LabsSanta ClaraUSA

Personalised recommendations