Pedestrian Detection via Structure-Sensitive Deep Representation Learning
Pedestrian detection is a fundamental task in a wide range of computer vision applications. Detecting the head-shoulder region is an attractive approach to pedestrian detection, especially in crowded scenes or in scenes with heavy occlusion or large camera tilt angles. However, the head-shoulder region contains less information than the full human body, so effective detection requires better feature extraction. This paper proposes a head-shoulder detection method based on a convolutional neural network (CNN). To exploit the characteristics of the head and shoulders, our method integrates a structure-sensitive ROI pooling layer into the CNN architecture. The proposed CNN is trained in a multi-task scheme with classification and localization outputs. Furthermore, the convolutional layers of the network are pre-trained using a triplet loss to capture better features of the head-shoulder appearance. Extensive experiments show that the proposed method achieves an average accuracy of 89.6% at an IoU threshold of 0.5. Our method obtains results close to those of the state-of-the-art method Faster R-CNN while outperforming it in speed. Even when the number of extracted candidate regions increases, the increase in detection time is negligible. In addition, when the IoU threshold is greater than 0.6, the average accuracy of our method is higher than that of Faster R-CNN, which indicates that our detections overlap the ground truth more closely.
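The results above are reported at specific IoU (intersection over union) thresholds. As a minimal sketch of how such a threshold is typically applied, the following computes the IoU of two axis-aligned boxes and checks it against a threshold. The `(x1, y1, x2, y2)` box format, the function names `iou` and `is_match`, and the use of `>=` for the comparison are assumptions made for illustration; the paper's own evaluation code may differ.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed format).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero
    # so that disjoint boxes yield an empty intersection.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """Treat a detection as correct if its IoU with the ground truth
    reaches the threshold (the >= comparison is an assumption)."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    pred = (0, 0, 10, 10)
    gt = (0, 0, 10, 8)
    print(iou(pred, gt))
    print(is_match(pred, gt, threshold=0.5))
    print(is_match(pred, gt, threshold=0.9))
```

Raising the threshold, as in the comparison above 0.6, accepts only detections that are localized more tightly around the ground-truth box.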
This research is supported by the Natural Science Foundation of Guangdong Province (2014A030310348, 2014A030313154), the National Natural Science Foundation of China (61472455, 61402120), the Guangdong Provincial Department of Science and Technology (GDST16EG04) 2016A050503024, and the Startup Program in Guangdong University of Foreign Studies (299-X5122029).