Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance

Can We Learn Pedestrian Detectors and Pose Estimators Without Real Data?


We consider scenarios where we have zero instances of real pedestrian data (e.g., a newly installed surveillance system in a novel location where neither labeled nor unlabeled real data exists yet) and a pedestrian detector must be developed prior to any observations of pedestrians. Given a single image and auxiliary scene information in the form of camera parameters and the geometric layout of the scene, our approach infers and generates a large variety of geometrically and photometrically accurate potential images of synthetic pedestrians, along with exact ground-truth labels, using a computer graphics rendering engine. We first present an efficient discriminative learning method that takes these synthetic renders and generates a unique spatially-varying and geometry-preserving pedestrian appearance classifier customized for every possible location in the scene. To extend our approach to further analysis (i.e., estimating the pose and segmentation of pedestrians in addition to detection), we build a more generalized model that employs a fully convolutional neural network architecture for multi-task learning, leveraging the "free" ground-truth annotations that can be obtained from our pedestrian synthesizer. We demonstrate that when real human-annotated data is scarce or non-existent, our data generation strategy can provide an excellent solution for an array of human activity analysis tasks, including detection, pose estimation and segmentation. Experimental results show that our approach (1) outperforms classical models and hybrid synthetic-real models, (2) outperforms various combinations of off-the-shelf state-of-the-art pedestrian detectors and pose estimators that are trained on real data, and (3) surprisingly, despite using purely synthetic data, outperforms models trained on real scene-specific data when that data is limited.
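To make concrete how camera parameters can yield location-specific labels at no annotation cost, the following is a minimal sketch and not the authors' rendering pipeline. It assumes a simple pinhole camera above a flat ground plane; the names `Camera`, `project` and `pedestrian_box`, the 1.7 m person height, and the 0.41 width-to-height ratio are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    """Hypothetical pinhole camera above a flat ground plane (Y = 0)."""
    focal_px: float   # focal length in pixels
    cx: float         # principal point, horizontal (pixels)
    cy: float         # principal point, vertical (pixels)
    height_m: float   # camera height above the ground (meters)
    pitch_rad: float  # downward tilt of the optical axis (radians)


def project(cam: Camera, x: float, y: float, z: float) -> tuple[float, float]:
    """Project a world point (x right, y up, z forward) to pixel (u, v)."""
    dy = cam.height_m - y
    s, c = math.sin(cam.pitch_rad), math.cos(cam.pitch_rad)
    depth = dy * s + z * c           # distance along the optical axis
    if depth <= 0:
        raise ValueError("point is behind the camera")
    down = dy * c - z * s            # offset along the image's downward axis
    u = cam.cx + cam.focal_px * x / depth
    v = cam.cy + cam.focal_px * down / depth
    return u, v


def pedestrian_box(cam: Camera, x: float, z: float,
                   person_height_m: float = 1.7,
                   aspect: float = 0.41) -> tuple[float, float, float, float]:
    """Return (left, top, right, bottom) for a person standing at (x, 0, z).

    The box height comes from projecting the foot and head points; the
    width is an assumed fixed fraction of that height.
    """
    foot_u, foot_v = project(cam, x, 0.0, z)
    _, head_v = project(cam, x, person_height_m, z)
    height_px = foot_v - head_v
    half_w = 0.5 * aspect * height_px
    return (foot_u - half_w, head_v, foot_u + half_w, foot_v)
```

Under these assumptions, every ground location maps to its own expected pedestrian size and position in the image, which is the kind of geometric prior a spatially-varying classifier can exploit. A full synthesizer would additionally render appearance, pose and occlusion by scene geometry.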




  1. The connectivity graph considered in this paper is a Markov Random Field over all regions, ignoring the regions defined as walls and obstacles.



Author information



Corresponding author

Correspondence to Hironori Hattori.

Additional information

Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.


About this article


Cite this article

Hattori, H., Lee, N., Boddeti, V.N. et al. Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance. Int J Comput Vis 126, 1027–1044 (2018).



  • Training with synthetic data
  • Pedestrian detection
  • Pose estimation