International Journal of Computer Vision, Volume 126, Issue 9, pp 1027–1044

Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance

Can We Learn Pedestrian Detectors and Pose Estimators Without Real Data?
  • Hironori Hattori
  • Namhoon Lee
  • Vishnu Naresh Boddeti
  • Fares Beainy
  • Kris M. Kitani
  • Takeo Kanade


We consider scenarios where no real pedestrian data is available (e.g., a newly installed surveillance system in a novel location for which neither labeled nor unlabeled real data exists yet) and a pedestrian detector must be developed before any pedestrians have been observed. Given a single image and auxiliary scene information in the form of camera parameters and the geometric layout of the scene, our approach generates a large variety of geometrically and photometrically accurate images of synthetic pedestrians, together with perfectly accurate ground-truth labels, using a computer graphics rendering engine. We first present an efficient discriminative learning method that takes these synthetic renders and generates a unique spatially-varying and geometry-preserving pedestrian appearance classifier customized for every possible location in the scene. To extend our approach to further analysis (i.e., estimating the pose and segmentation of pedestrians in addition to detection), we build a more general model that employs a fully convolutional neural network architecture for multi-task learning, leveraging the "free" ground-truth annotations that can be obtained from our pedestrian synthesizer. We demonstrate that when real human-annotated data is scarce or non-existent, our data generation strategy can provide an excellent solution for an array of human activity analysis tasks, including detection, pose estimation and segmentation. Experimental results show that our approach (1) outperforms classical models and hybrid synthetic-real models, (2) outperforms various combinations of off-the-shelf state-of-the-art pedestrian detectors and pose estimators that are trained on real data, and (3) surprisingly, when trained on purely synthetic data, is able to outperform models trained on real scene-specific data when that data is limited.
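
To illustrate why known camera parameters and scene geometry yield ground-truth labels at no annotation cost, the following is a minimal sketch and not the authors' implementation. It assumes a simple pinhole camera above a flat ground plane, and every name in it (`Camera`, `project`, `pedestrian_box`, `label_grid`) and every numeric value is hypothetical. For each ground location it computes the image bounding box that a pedestrian of fixed size standing there would occupy, which is the kind of location-specific label a synthesizer can emit alongside each render.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    """Hypothetical pinhole camera above a flat ground plane (an assumption)."""
    focal_px: float   # focal length in pixels
    cx: float         # principal point, x (pixels)
    cy: float         # principal point, y (pixels)
    height_m: float   # camera height above the ground (metres)
    pitch_rad: float  # downward tilt; 0 means looking at the horizon


def project(cam, x, y, z):
    """Project a world point (x right, y up, z forward; metres) to pixel (u, v).

    The camera sits at (0, cam.height_m, 0) in world coordinates.
    """
    dy = y - cam.height_m
    s, c = math.sin(cam.pitch_rad), math.cos(cam.pitch_rad)
    yc = -c * dy - s * z  # component along the image "down" axis
    zc = -s * dy + c * z  # depth along the optical axis
    if zc <= 0:
        raise ValueError("point is behind the camera")
    return cam.focal_px * x / zc + cam.cx, cam.focal_px * yc / zc + cam.cy


def pedestrian_box(cam, x, z, person_h=1.7, person_w=0.6):
    """Bounding box (u_min, v_min, u_max, v_max) for a person standing at (x, z)."""
    u_foot, v_foot = project(cam, x, 0.0, z)
    u_head, v_head = project(cam, x, person_h, z)
    height_px = v_foot - v_head
    width_px = height_px * person_w / person_h  # fixed aspect ratio (assumption)
    u_mid = 0.5 * (u_foot + u_head)
    return (u_mid - width_px / 2, v_head, u_mid + width_px / 2, v_foot)


def label_grid(cam, xs, zs):
    """Map each ground location to the box a pedestrian there would occupy."""
    return {(x, z): pedestrian_box(cam, x, z) for x in xs for z in zs}


# Illustrative values only: a camera 5 m above the ground, looking at the horizon.
cam = Camera(focal_px=1000.0, cx=640.0, cy=360.0, height_m=5.0, pitch_rad=0.0)
labels = label_grid(cam, xs=[-2.0, 0.0, 2.0], zs=[10.0, 20.0])
```

In this sketch the box shrinks as the pedestrian moves away from the camera, which is one reason a classifier tailored to each scene location can differ from a generic one. A real pipeline would render textured 3D pedestrian models into the scene rather than project two points.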


Training with synthetic data; Pedestrian detection; Pose estimation



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. The Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
  2. Volvo Construction Equipment, Göthenburg, Sweden
  3. Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
  4. Engineering Science Department, University of Oxford, Oxford, UK
  5. Computer Science and Engineering, Michigan State University, Lansing, USA
