Robust real-time pedestrian detection in surveillance videos

Abstract

Detecting objects of different categories in images and video is one of the fundamental tasks in computer vision research. Pedestrian detection is a hot research topic, with applications including robotics, surveillance, and automotive safety. We address the problem of detecting pedestrians in surveillance videos. In this paper, we present a new feature extraction method based on the multi-scale center-symmetric local binary pattern operator. All modules of the proposed method (foreground segmentation, feature pyramid, training, occlusion handling) are described in detail, covering both design and implementation. Experiments on CAVIAR and other sequences show that the presented system can detect pedestrians in surveillance videos effectively, accurately, and in real-time.

Introduction

Detecting objects of one or more classes is a fundamental interest in computer vision. Pedestrian detection is more complicated than most other object detection problems because of the high articulation of the human body. The appearance of a pedestrian depends on viewpoint, illumination, clothing, and occlusion. Pedestrian detection is the first step for a number of applications such as smart video surveillance, driving assistance systems, human-robot interaction, people-finding for military applications, and intelligent digital management.

There is extensive literature on pedestrian detection algorithms. An extensive review on these algorithms is beyond the scope of this paper. We refer readers to comprehensive surveys (Benenson et al. 2014; Dollar et al. 2012) for more details about existing detectors. In this section, we review only the works related to our method.

Broadly speaking, there are three major types of approaches to visual pedestrian detection: model-based, part-based, and feature-classifier-based.

In model-based pedestrian detection, an explicit pedestrian model is defined first. The algorithm then searches the video frames for positions that match the model in order to detect pedestrians. Model-based pedestrian detection corresponds to generative models in pattern recognition. Most matching processes operate within the framework of Bayesian theory, estimating the maximum posterior probability of the object class. Gavrila (2007) presented a probabilistic approach to hierarchical, exemplar-based shape matching. A template tree was constructed to represent and match the variety of shape exemplars. This tree was generated offline by a bottom-up clustering approach using stochastic optimization. Applying a coarse-to-fine probabilistic matching strategy, the Chamfer distance was used as the similarity measure between two contours. Similarly, Lin and Davis (2010) used a part-based exemplar tree, combined with pose-specific detectors to validate the matching results.

Part-based models have a long history in computer vision for object detection in general and for pedestrian detection in particular (Forsyth and Fleck 1997; Leibe et al. 2005). There are two main components in part-based models. The first uses low-level features or classifiers to model individual parts of a pedestrian. The second models the topology of the pedestrian to enable the accumulation of part evidence (Seemann et al. 2006). Though this approach is attractive, part detection is itself a difficult task. Implementations follow a standard procedure for processing the image data: creating a densely sampled image pyramid, computing features at each scale, performing classification at all possible locations, and finally performing non-maximum suppression to generate the final set of bounding boxes.

In feature-classifier-based pedestrian detection, detection windows are first extracted from the video frames, usually by a sliding-window search. Next, features are extracted from each detection window. A classifier, trained on a large number of training samples, then classifies each feature vector as pedestrian or non-pedestrian.
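The feature-classifier pipeline above can be sketched as a plain sliding-window scan. The feature extractor and classifier are left as pluggable stand-ins (any feature/classifier pair from the literature fits); the window size, step, and score threshold below are illustrative choices, not those of any particular detector:

```python
import numpy as np

def sliding_window_detect(image, window=(128, 64), step=8,
                          extract_features=None, classify=None):
    """Scan `image` with a fixed-size window and collect positive boxes.
    `extract_features` and `classify` are placeholders for any
    feature/classifier pair (e.g. HOG + linear SVM)."""
    H, W = image.shape[:2]
    wh, ww = window
    boxes = []
    for y in range(0, H - wh + 1, step):
        for x in range(0, W - ww + 1, step):
            patch = image[y:y + wh, x:x + ww]
            score = classify(extract_features(patch))
            if score > 0:  # positive side of the decision boundary
                boxes.append((x, y, ww, wh, score))
    return boxes
```

In a full detector this scan is repeated over an image pyramid, and the resulting boxes are merged by non-maximum suppression.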

An early work in pedestrian detection is the VJ detector (Viola and Jones 2004). Using simple Haar-like features and a cascade of boosted classifiers, this method achieved very fast detection. Papageorgiou and Poggio (2000) introduced an over-complete dictionary of Haar wavelet features in combination with a support vector machine. The histogram of oriented gradients (HOG), proposed by Dalal and Triggs (2005), became a very popular feature in computer vision. Dollár et al. (2009a) introduced Integral Channel Features (ICF), comprising multiple feature types (gray-scale, LUV color channels, gradient magnitude, etc.) that can be computed quickly using the integral image trick. ICF is widely used in many state-of-the-art pedestrian detectors. Based on HOG, Felzenszwalb et al. (2010) proposed the deformable part-based model (DPM), which was a breakthrough in pedestrian detection.

The rest of this paper is organized as follows. Section 2 describes the proposed detection method and introduces all modules of the system: training, foreground segmentation, feature extraction, the feature pyramid, and occlusion handling. Section 3 presents the experimental results and analysis of the pedestrian detection algorithm. We draw conclusions in Sect. 4 and outline possible directions of further research in Sect. 5.

Our system architecture

Figure 1 presents an overview of our system. The input video frames are segmented to determine the foreground. With the help of the foreground segmentation, we rapidly filter out negative regions while keeping the positive ones. The detection system scans the image at all relevant positions and scales to detect pedestrians. A so-called feature pyramid is derived from the standard image pyramid in order to accelerate feature extraction: the detection window scans the feature pyramid and the feature vector is extracted from it directly. The feature component encodes the visual appearance of the pedestrian, while the classifier component determines, for each sliding window independently, whether it contains a pedestrian or not.

To train our classifier, we gathered a set of 14,525 gray-scale sample images of pedestrians taken from public datasets (Enzweiler and Gavrila 2009; Ess et al. 2008; Hwang et al. 2013) and from our own surveillance videos. We also built a database of negative samples consisting of 16,800 non-pedestrian images. To improve the performance of the detector, we added 8000 images of vertical structures such as poles, trees, and street signs to the negative samples, since vertical structures are common false positives in pedestrian detection. To train the classifier, we used the open-source LIBSVM (Chang and Lin 2011).

Fig. 1

Architecture of our pedestrian detection system

Foreground segmentation

To reduce detection time and eliminate false detections from the background, a multi-scale wavelet transformation (WT) of frame differences is applied to segment the foreground. The low-pass and high-pass filters of the WT decompose a signal into approximation (low-pass) and detail (high-pass) sub-signals (Guan 2010).

The HSV color space is related to human color perception and separates chromaticity from luminosity, which is why it is used here. We define a foreground mask in the following way (Guan 2010; Xu et al. 2015):

$$\begin{aligned} P_f={\left\{ \begin{array}{ll} 1, &\quad E_{\Delta V}\ge T_{\Delta V}\wedge E_{\Delta S}\ge T_{\Delta S}\\ 0, &\quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

where \(\Delta V\) and \(\Delta S\) are the differences between two successive frames in the value and saturation components, respectively; \(E_{\Delta V}\) and \(E_{\Delta S}\) stand for the multi-scale WT of \(\Delta V\) and \(\Delta S\), respectively; and \(T_{\Delta V}\), \(T_{\Delta S}\) are the corresponding threshold values.

To remove ghosting effects, WT-based edge detection is used to extract the edges of the current frame:

$$\begin{aligned} P_e={\left\{ \begin{array}{ll} 1 &{} E_{V}\ge T_{V} \\ 0 &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(2)

where V is the value component of the current frame, \(E_V\) is the multi-scale WT of V, and \(T_V\) is the corresponding threshold.

A bitwise AND operation is applied on \(P_f\) and \(P_e\) to extract the whole foreground region mask:

$$\begin{aligned} P=P_f\cdot P_e. \end{aligned}$$
(3)
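A simplified sketch of Eqs. (1)–(3) is given below. For brevity, plain absolute frame differences and a gradient-magnitude edge map stand in for the multi-scale WT responses \(E_{\Delta V}\), \(E_{\Delta S}\), and \(E_V\); the thresholds are illustrative, not the tuned values used in our system:

```python
import numpy as np

def foreground_mask(prev_hsv, curr_hsv, t_dv=0.1, t_ds=0.1, t_v=0.2):
    """Sketch of Eqs. (1)-(3) on float HSV frames in [0, 1].
    Absolute differences approximate E_dV and E_dS, and a
    gradient-magnitude edge map approximates E_V; the paper
    computes these with a multi-scale wavelet transform."""
    dv = np.abs(curr_hsv[..., 2] - prev_hsv[..., 2])
    ds = np.abs(curr_hsv[..., 1] - prev_hsv[..., 1])
    p_f = (dv >= t_dv) & (ds >= t_ds)          # Eq. (1): motion mask
    gy, gx = np.gradient(curr_hsv[..., 2])
    p_e = np.hypot(gx, gy) >= t_v              # Eq. (2): edge mask
    return (p_f & p_e).astype(np.uint8)        # Eq. (3): bitwise AND
```

The AND with the edge mask suppresses the "ghost" regions that pure frame differencing leaves behind a moving object.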

Multi-scale center-symmetric local binary pattern operator

The local binary pattern (LBP) is a simple but very efficient texture operator which labels the pixels of an image by thresholding the neighborhood of each pixel and treating the result as a binary number (Pietikäinen 2005). The LBP feature performs well in various applications, including texture classification and segmentation (Topi et al. 2000), surface analysis (Ojala et al. 2002), face recognition (Ahonen et al. 2004), and visual descriptors (Havasi et al. 2014). The original LBP operator labels the pixels of an image by thresholding the 3-by-3 neighborhood of each pixel with the central pixel value, taking the result as a binary number. Later extensions of the LBP operator use neighborhoods of different sizes. The notation (P, R) is used to describe the neighborhood, where P is the number of sampling points on a circle of radius R. The descriptor encodes the result over the neighborhood as a binary pattern:

$$\begin{aligned} LBP_{P,R}(x, y)=\sum _{i=0}^{P-1} s(u_i-u_c)\cdot 2^i, \quad s(x)={\left\{ \begin{array}{ll} 1 &{} x\ge 0 \\ 0 &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(4)

where \(u_c\) corresponds to the gray-level of the center pixel and \(u_i\) to the gray-levels of P equally spaced pixels on a circle of radius R. A histogram of the labeled image \(f_l(x,y)\) can be calculated:

$$\begin{aligned} H_i=\sum _{x,y} I\{f_l(x,y)=i\}, i=0,...,n-1, \end{aligned}$$
(5)

where \(n=2^P\) is the number of different labels and

$$\begin{aligned} I\{A\}={\left\{ \begin{array}{ll} 1 &{} \text {if } A \text { is true} \\ 0 &{} \text {if } A \text { is false.} \end{array}\right. } \end{aligned}$$
(6)
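Eqs. (4)–(6) can be implemented directly. A minimal sketch follows, using nearest-neighbour rounding of the circular sample positions (implementations often interpolate sub-pixel positions instead):

```python
import numpy as np

def lbp(image, P=8, R=1):
    """LBP_{P,R} per Eq. (4): compare P samples on a circle of
    radius R against the centre pixel and pack the signs into an
    integer label. Border pixels within R of the edge are skipped."""
    H, W = image.shape
    yc, xc = np.mgrid[R:H - R, R:W - R]          # centre coordinates
    out = np.zeros(yc.shape, dtype=np.int64)
    for i in range(P):
        angle = 2 * np.pi * i / P
        dy = int(round(-R * np.sin(angle)))
        dx = int(round(R * np.cos(angle)))
        s = (image[yc + dy, xc + dx] >= image[yc, xc]).astype(np.int64)
        out += s << i                             # s(u_i - u_c) * 2^i
    return out

def lbp_histogram(labels, P=8):
    """Histogram of Eq. (5) over the n = 2**P possible labels."""
    return np.bincount(labels.ravel(), minlength=2 ** P)
```

On a flat patch every comparison yields 1, so all labels equal \(2^P-1\), which is why uniform regions collapse into a single histogram bin.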

The center-symmetric local binary pattern (CS-LBP) was introduced in (Heikkilä et al. 2006). In CS-LBP, pixel values are compared not to the center pixel but to the opposing pixel, symmetric with respect to the center. For eight neighbors, LBP produces \(2^8=256\) different binary patterns, whereas for CS-LBP this number is only \(2^4=16\).

The idea of the multi-scale center-symmetric local binary pattern is based on the simple principle of varying the radius R of the CS-LBP operator and combining the resulting histograms (Varga et al. 2015). The neighborhood is described with the parameters \(P, R=\{R_1, R_2, ..., R_{n_R}\}\), where \(n_R\) is the number of radii used in the calculation. Each pixel in the multi-scale CS-LBP image is described by \(n_R\) values. The multi-scale CS-LBP histogram for \(R=\{R_1, R_2, ..., R_{n_R}\}\) is obtained by summing the vectors \(H^{(1)}, H^{(2)}, ..., H^{(n_R)}\):

$$\begin{aligned} H=\sum _{i=1}^{n_R} H^{(i)}. \end{aligned}$$
(7)

In our experiments we used the following parameters: \(P=8\), \(R_1=1\), \(R_2=2\), \(R_3=3\), and \(n_R=3\).
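A compact sketch of the multi-scale CS-LBP with these parameters, again assuming nearest-neighbour sampling on the circle:

```python
import numpy as np

def cs_lbp(image, P=8, R=1):
    """CS-LBP: compare the P/2 centre-symmetric pixel pairs on a
    circle of radius R, giving 2**(P/2) possible labels."""
    H, W = image.shape
    yc, xc = np.mgrid[R:H - R, R:W - R]
    out = np.zeros(yc.shape, dtype=np.int64)
    for i in range(P // 2):
        angle = 2 * np.pi * i / P
        dy = int(round(-R * np.sin(angle)))
        dx = int(round(R * np.cos(angle)))
        # the opposing sample sits at (-dy, -dx) from the centre
        s = image[yc + dy, xc + dx] >= image[yc - dy, xc - dx]
        out += s.astype(np.int64) << i
    return out

def multiscale_cs_lbp_hist(image, P=8, radii=(1, 2, 3)):
    """Eq. (7): sum of the CS-LBP histograms over the radii."""
    n_bins = 2 ** (P // 2)
    h = np.zeros(n_bins, dtype=np.int64)
    for r in radii:
        labels = cs_lbp(image, P=P, R=r)
        h += np.bincount(labels.ravel(), minlength=n_bins)
    return h
```

With \(P=8\) and radii \(\{1,2,3\}\) each histogram has only 16 bins, which keeps the final descriptor short compared with the 256-bin plain LBP.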

Feature extraction

In this subsection, we introduce the implementation details of our feature extraction method. The key steps are as follows.

  1.

    We normalize the gray-levels of the input image to reduce illumination variance across images. After the normalization, all input images have gray-levels in the range 0 to 1.

  2.

    We obtain four layers of the input image in the following way: first, we compute the gradient magnitude at each pixel of the input gray-scale image (the detection window); then we repeat this calculation three times, each time on the previous output image. For speed, we compute an approximation of the gradients using the Sobel operator.

  3.

    The detection window and each of its four layers are split into equally sized overlapping blocks, with an overlap of 50% between neighboring blocks. In our case, the size of the detection window is \(64\times 128\) and the size of the blocks is \(16 \times 16\).

  4.

    We take the detection window and extract the multi-scale CS-LBP histogram from each block independently. Let \(v_i\) be the unnormalized descriptor of the ith block and f the descriptor of the detection window. We obtain f in the following way:

    • \(f=[v_1, v_2, ..., v_N]\);

    • \(l_1\)-norm, \(f\leftarrow f/(\mid \mid f \mid \mid _{1} +\epsilon )\);

    • \(l_1\)-norm followed by square root, \(f\leftarrow \sqrt{f/(\mid \mid f \mid \mid _{1} +\epsilon )}\).

  5.

    We take each layer one after the other and extract the multi-scale CS-LBP histograms from each block independently. Let \(v_{i,j}\) be the unnormalized descriptor of the ith block in the jth layer, and \(g_j\) the descriptor of the jth layer. We obtain \(g_j\) in the following way:

    • \(g_j=[v_{1,j}, v_{2,j}, ..., v_{N,j}]\);

    • \(l_1\)-norm, \(g_j\leftarrow g_j/(\mid \mid g_j \mid \mid _{1} +\epsilon )\);

    • \(l_1\)-norm followed by square root, \(g_j\leftarrow \sqrt{g_j/(\mid \mid g_j \mid \mid _{1} +\epsilon )}\).

  6.

    We obtain the feature vector of the detection window in the following way:

    $$\begin{aligned} F=f+\sum _{j=1}^{4} \frac{1}{j+1}g_j \end{aligned}$$
    (8)

The overall length of the feature vector for a \(128\times 64\) detection window is \(7\times 15\times 16=1680\), because each window is represented by \(7\times 15\) blocks. Experiments on the INRIA Person dataset show that the proposed feature with a linear support vector machine (SVM) performs well. In Sect. 3, we report on the effect of the number of layers; beyond four layers we found no significant performance improvement.
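The six steps above can be sketched as follows. Two simplifications are made for illustration: a plain intensity histogram stands in for the multi-scale CS-LBP block descriptor, and `np.gradient` approximates the Sobel gradients; the block geometry, normalization, and the weighting of Eq. (8) follow the text:

```python
import numpy as np

def block_descriptors(window, block=16, stride=8, n_bins=16):
    """Split a window into 50%-overlapping 16x16 blocks and compute a
    16-bin descriptor per block. A plain intensity histogram stands in
    here for the multi-scale CS-LBP histogram of the paper."""
    H, W = window.shape
    descs = []
    for y in range(0, H - block + 1, stride):
        for x in range(0, W - block + 1, stride):
            patch = window[y:y + block, x:x + block]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0.0, 1.0))
            descs.append(hist.astype(np.float64))
    return np.concatenate(descs)

def l1_sqrt(f, eps=1e-7):
    """L1 normalisation followed by an elementwise square root."""
    return np.sqrt(f / (np.abs(f).sum() + eps))

def window_feature(window, n_layers=4):
    """Eq. (8): descriptor of the window plus down-weighted descriptors
    of its gradient-magnitude layers (np.gradient approximates Sobel)."""
    F = l1_sqrt(block_descriptors(window))
    layer = window
    for j in range(1, n_layers + 1):
        gy, gx = np.gradient(layer)
        layer = np.hypot(gx, gy)                 # next gradient layer
        F = F + l1_sqrt(block_descriptors(layer)) / (j + 1)
    return F
```

For a \(128\times 64\) window with \(16\times 16\) blocks at stride 8, this yields \(15\times 7=105\) blocks of 16 bins, i.e. the 1680-dimensional vector stated above.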

Feature representation

In many applications such as video surveillance, detection speed is as important as accuracy. A standard pipeline for multi-scale detection is to create a densely sampled image pyramid, which the detection system then scans at every scale to detect pedestrians. In order to accelerate the scanning process, we define a feature pyramid on top of the standard image pyramid. We obtain the four layers of each image of the standard pyramid as described in Sect. 2.3. The multi-scale CS-LBP operator (\(P=8\), \(R_1=1\), \(R_2=2\), \(R_3=3\), \(n_R=3\)) is applied to the image and its four layers. In this way, five values are assigned to each pixel of the image, so an image of the standard pyramid can be replaced by a \((W-2\cdot R_3)\times (H-2\cdot R_3)\times 5\) array, where W and H stand for the width and height of the image. Using the feature pyramid derived from the standard image pyramid, the time of feature extraction, and thereby of the scanning process, can be reduced.
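The pyramid construction can be sketched as below. The per-pixel labelling operator is left pluggable (the multi-scale CS-LBP in our system, assumed here to preserve the image shape), and nearest-neighbour downsampling keeps the sketch dependency-free; the minimum size matches the detection window and the scale factor is illustrative:

```python
import numpy as np

def feature_pyramid(image, label_op, n_layers=4,
                    scale=1.0905, min_size=(128, 64)):
    """Precompute per-pixel label arrays at every pyramid scale so
    the sliding window only gathers precomputed values. `label_op`
    is any shape-preserving per-pixel labelling operator."""
    pyramid = []
    img = image.astype(np.float64)
    while img.shape[0] >= min_size[0] and img.shape[1] >= min_size[1]:
        # one channel for the image plus one per gradient-magnitude layer
        layers = [img]
        for _ in range(n_layers):
            gy, gx = np.gradient(layers[-1])
            layers.append(np.hypot(gx, gy))
        pyramid.append(np.stack([label_op(l) for l in layers], axis=-1))
        # nearest-neighbour downsample by the scale factor
        h = int(img.shape[0] / scale)
        w = int(img.shape[1] / scale)
        ys = (np.arange(h) * img.shape[0] / h).astype(int)
        xs = (np.arange(w) * img.shape[1] / w).astype(int)
        img = img[np.ix_(ys, xs)]
    return pyramid
```

The detection window then reads its feature values out of these arrays instead of recomputing the operator for every overlapping window.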

Occlusion handling

The linear SVM finds the optimal hyperplane that separates the positive and negative samples. Let \(\mathbf x \in \mathbf R ^n\) be a new input; then the decision function of the holistic classifier is defined as:

$$\begin{aligned} H(\mathbf x )=\beta +\mathbf w ^T \mathbf x , \end{aligned}$$
(9)

where \(\mathbf w\) stands for the weighting vector, and \(\beta\) represents the constant bias of the learned hyperplane.

In our occlusion handling method, we first determine whether the score of the holistic classifier is ambiguous. The response of a linear SVM classifier is ambiguous if it is close to 0. When the output is ambiguous, an occlusion inference process is applied (Fig. 2).

Fig. 2

Occlusion handling scheme

The body of the pedestrian is divided into five parts: head-shoulder, torso, legs, left side, and right side (Fig. 3). We trained five part detectors using the original database, with the images annotated according to the corresponding part of the pedestrian. If the output of the holistic detector is ambiguous, the responses of the part detectors are combined to infer occluded pedestrians.

Fig. 3

Part partition of a pedestrian
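The scheme of Figs. 2 and 3 can be sketched as follows. The ambiguity threshold `tau` and the uniform averaging of part scores are illustrative assumptions, not the exact fusion rule of the system:

```python
import numpy as np

def detect_with_occlusion_handling(x, holistic, parts, tau=0.5,
                                   part_weights=None):
    """Trust the holistic linear-SVM score H(x) = beta + w.x (Eq. (9))
    when it is far from the hyperplane; fall back to a combination of
    part-detector scores when |H(x)| < tau. `holistic` and each entry
    of `parts` are (w, beta) pairs."""
    w, beta = holistic
    score = beta + w @ x                       # Eq. (9)
    if abs(score) >= tau:
        return score > 0                       # unambiguous response
    # ambiguous: infer occlusion from the five part detectors
    # (head-shoulder, torso, legs, left side, right side)
    part_scores = np.array([b + wp @ x for wp, b in parts])
    if part_weights is None:
        part_weights = np.ones(len(parts)) / len(parts)
    return float(part_weights @ part_scores) > 0
```

An occluded pedestrian typically drives the holistic score toward zero while the visible parts still respond strongly, which is exactly the case the fallback branch covers.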

Experimental results

We performed the experiments on CAVIAR sequences (Caviar 2007) and on our own video sequences. For the CAVIAR project, a number of video clips were recorded acting out different scenarios of interest, including pedestrians walking alone, meeting others, window shopping, entering and exiting shops, and leaving a package in a public place.

In this paper, we use per-image performance, plotting detection rate versus false positives per image (FPPI). As pointed out in (Dollár et al. 2009b), per-window performance can fail to predict per-image performance, because per-window evaluation does not measure errors caused by detections at incorrect scales or positions, or arising from false detections on body parts. Figure 4 presents the FPPI curve of our proposed system measured on CAVIAR. Some sample detections on CAVIAR sequences can be seen in Fig. 6. Figure 5 demonstrates the effect of the number of layers; beyond four layers we experienced no significant performance improvement.

Fig. 4

Detection rate versus false positive per-image (FPPI) curve for the proposed detector. \(4\times 4\) is the step size and 1.0905 is the scale factor of the sliding-window

Fig. 5

Detection rate versus false positive per-image (FPPI) curves with respect to the number of the layers in the proposed detector. \(2\times 2\) is the step size and 1.09 is the scale factor of the sliding-window detection

Fig. 6

Sample detections on CAVIAR sequences

Table 1 presents the speed and the average miss-rate (lower is better) of our method and others, measured on CAVIAR sequences. The proposed algorithm offers a good compromise between average miss-rate and speed.

Table 1 Listing of algorithms considered on CAVIAR sequences, sorted by average miss-rate (lower is better)

We have also tested our system on the video frames of a surveillance camera which operates in an outdoor environment. Figure 7 shows some examples from the results.

Fig. 7

Sample detections on outdoor scenes

To demonstrate the discriminative power of our feature, we applied our system to the videos of a thermal surveillance camera. The presented feature extraction method captures mainly gradient and edge information, along with some texture and scale information. Our detector shows high invariance to clothing and illumination, and performs well on thermal images too. Figure 8 shows some detections on thermal images.

Fig. 8

Sample detections on thermal images

Conclusion

In this paper we have proposed a pedestrian detection system and reported on experimental results. We have presented a feature extraction method based on the multi-scale CS-LBP operator for pedestrian detection, and combined pedestrian detection with foreground segmentation to filter out false detections effectively. We have also shown that the system is able to detect pedestrians in real-time in surveillance videos. All modules of the algorithm were described in detail, covering design and implementation. We have compared the system to other algorithms and presented results on both RGB and thermal images, showing that our detector is highly invariant to clothing and illumination.

Further research

There are many directions for further research. It is worth studying how to combine the proposed feature efficiently with other features. Another direction is improving the search strategy of the detection window. To make the system faster, we would like to exploit parallel architectures. Finally, our detector cannot yet handle the articulated deformation of pedestrians, which is the next problem to be tackled.

References

  1. Ahonen T, Hadid A, Pietikäinen M (2004) Face recognition with local binary patterns. In: Computer vision-eccv 2004. Springer, Berlin, Heidelberg, pp 469–481

  2. Benenson R, Omran M, Hosang J, Schiele B (2014) Ten years of pedestrian detection, what have we learned? In: Agapito L, Bronstein MM, Rother C (eds) Computer Vision-ECCV 2014 Workshops. Springer, pp 613–627

  3. Caviar (2007). http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/. Accessed 3 Aug 2015

  4. Chang CC, Lin CJ (2011) Libsvm: a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):27


  5. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on IEEE, vol 1, pp 886–893

  6. Dollár P, Tu Z, Perona P, Belongie S (2009a) Integral channel features. In: BMVC, vol 2, p 5

  7. Dollár P, Wojek C, Schiele B, Perona P (2009b) Pedestrian detection: a benchmark. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, pp 304–311

  8. Dollar P, Wojek C, Schiele B, Perona P (2012) Pedestrian detection: an evaluation of the state of the art. Pattern Anal Mach Intell IEEE Trans 34(4):743–761


  9. Enzweiler M, Gavrila DM (2009) Monocular pedestrian detection: survey and experiments. Pattern Anal Mach Intell IEEE Trans 31(12):2179–2195


  10. Ess A, Leibe B, Schindler K, Gool LV (2008) A mobile vision system for robust multi-person tracking. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, pp 1–8

  11. Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2010) Object detection with discriminatively trained part-based models. Pattern Anal Mach Intell IEEE Trans 32(9):1627–1645


  12. Forsyth DA, Fleck MM (1997) Body plans. In: Computer Vision and Pattern Recognition, 1997. Proceedings, 1997 IEEE Computer Society Conference on IEEE, pp 678–683

  13. Gavrila DM (2007) A bayesian, exemplar-based approach to hierarchical shape matching. Pattern Anal Mach Intell IEEE Trans 29(8):1408–1421


  14. Guan YP (2010) Spatio-temporal motion-based foreground segmentation and shadow suppression. Computer Vision, IET 4(1):50–60


  15. Havasi L, Varga D, Szirányi T (2014) Lhi-tree: an efficient disk-based image search application. In: Computational Intelligence for Multimedia Understanding (IWCIM), 2014 International Workshop on, IEEE, pp 1–5

  16. Heikkilä M, Pietikäinen M, Schmid C (2006) Description of interest regions with center-symmetric local binary patterns. In: Computer vision, graphics and image processing. Springer, Berlin, Heidelberg, pp 58–69

  17. Hwang S, Park J, Kim N, Choi Y, Kweon IS (2013) Multispectral pedestrian detection: benchmark dataset and baseline. Integr Comput Aided Eng 20:347–360


  18. Leibe B, Seemann E, Schiele B (2005) Pedestrian detection in crowded scenes. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on IEEE, vol 1, pp 878–885

  19. Lin Z, Davis LS (2008) A pose-invariant descriptor for human detection and segmentation. In: Computer Vision–ECCV 2008. Springer, Berlin, Heidelberg, pp 423–436

  20. Lin Z, Davis LS (2010) Shape-based human detection and segmentation via hierarchical part-template matching. Pattern Anal Mach Intell IEEE Trans 32(4):604–618


  21. Maji S, Berg AC, Malik J (2008) Classification using intersection kernel support vector machines is efficient. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, pp 1–8

  22. Ojala T, Pietikäinen M, Mäenpää T (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Anal Mach Intell IEEE Trans 24(7):971–987


  23. Papageorgiou C, Poggio T (2000) A trainable system for object detection. Int J Comput Vis 38(1):15–33


  24. Pietikäinen M (2005) Image analysis with local binary patterns. In: Image analysis. Springer, Berlin, Heidelberg, pp 115–118

  25. Seemann E, Leibe B, Schiele B (2006) Multi-aspect detection of articulated objects. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, IEEE, vol 2, pp 1582–1588

  26. Topi M, Timo O, Matti P, Maricor S (2000) Robust texture classification by subsets of local binary patterns. In: Pattern Recognition, 2000. Proceedings. 15th International Conference on, IEEE, vol 3, pp 935–938

  27. Varga D, Szirányi T, Kiss A, Sporás L, Havasi L (2015) A multi-view pedestrian tracking method in an uncalibrated camera network. In: Proceedings of the IEEE international conference on computer vision workshops, pp 37–44

  28. Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154


  29. Wojek C, Schiele B (2008) A performance evaluation of single and multi-feature people detection. In: Pattern recognition. Springer, pp 82–91

  30. Xu R, Guan Y, Huang Y (2015) Multiple human detection and tracking based on head detection for real-time video surveillance. Multimed Tools Appl 74(3):729–742



Acknowledgments

This work has been supported by the EU FP7 Programme (FP7-SEC-2011-1) No. 285320 (PROACTIVE project). The research was also partially supported by the Hungarian Scientific Research Fund (No. OTKA 106374).

Author information

Corresponding author

Correspondence to Domonkos Varga.


Cite this article

Varga, D., Szirányi, T. Robust real-time pedestrian detection in surveillance videos. J Ambient Intell Human Comput 8, 79–85 (2017). https://doi.org/10.1007/s12652-016-0369-0


Keywords

  • Video surveillance
  • Pedestrian detection
  • Feature extraction