1 Introduction

Detecting pedestrians in natural images, which contain a wealth of other objects such as cars, trees, and sky, is a challenging task. Significant advances by both traditional models [4, 26, 28] and deep models [11, 15, 16, 17, 22] have been witnessed in this area in recent years.

Fig. 1. The associated work network contains three parts: (a) the RPN in the low layers, (b) the weighted association CNN, (c) the metric coding net.

The traditional model extracts low-level features (\(e.g.\) HOG [4], Haar [26], HOG-LBP [28]) from images and then selects rich representations to train a classifier (\(e.g.\) SVM [4], boosting classifiers [6]). This strategy is widely used, but its feature extraction and classification stages are hard to optimize jointly to decrease the error rate.
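For concreteness, the following is a minimal sketch of this classical pipeline using scikit-image's HOG extractor and a linear SVM from scikit-learn; the window size and HOG parameters are illustrative choices, not the settings of [4].

```python
# A hedged sketch of the HOG + linear-SVM pipeline; parameters are illustrative.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(windows):
    """Compute HOG descriptors for a list of 128x64 grayscale crops."""
    return np.array([hog(w, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for w in windows])

# pos_windows / neg_windows: pedestrian and background crops (assumed given)
# X = np.vstack([extract_hog(pos_windows), extract_hog(neg_windows)])
# y = np.r_[np.ones(len(pos_windows)), np.zeros(len(neg_windows))]
# clf = LinearSVC().fit(X, y)   # note: the two stages are trained separately
```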

In the deep model, CNNs have played a significant role in pedestrian detection, owing to their capacity to learn representative and discriminative features from the original images. For example, since the number of negatives in a database is significantly larger than the number of positives, Tian et al. [24] transferred scene attribute information from existing background scene segmentation databases to the pedestrian dataset to learn representative features.

However, most previous deep models must crop or warp images to a fixed size (\(e.g.\) \(224 \times 224\) in VGG16 [23]), which degrades performance on inputs of varying sizes [9]. To solve this problem, spatial pyramid pooling in deep convolutional networks (SPP-net) [9] pools feature maps of arbitrary size before the fully connected layers. Further, [21] proposed a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. However, these strategies are not suitable for small inputs (the size of a pedestrian) with a very deep CNN (\(e.g.\) VGG16, VGG19 [23]), and the lack of any hard-negative mining strategy constrains their recognition ability. These problems have attracted increasing attention for accurate yet efficient pedestrian detection. [32] redesigned the RPN for pedestrian-scale objects and combined it with Boosted Forests (BF) to mine hard negatives, but simply attaching a BF to low-layer RPN features loses rich features at large scales, and the two stages are hard to optimize jointly.

Driven by these observations and by the excellent performance of VGG16 and the RPN, we propose an effective baseline for pedestrian detection that applies the RPN to low-layer features to generate proposal windows of arbitrary size. Furthermore, since mining negative samples is significant, we introduce an associated work network fed with labeled multi-class negative and positive samples. It contains two networks: a metric coding net (MC-net) and a weighted association CNN (WA-CNN). MC-net, based on metric learning theory, is devised to reinforce the intra-class distance: the network encodes each feature with a template parameter, and the generated codes can be seen as a comparability determination between the feature and the template parameter. Finally, WA-CNN, designed to strengthen inter-class differences with a deep model, associates the metric codes to accomplish the detection task through a weighted loss function.

This work makes the following main contributions. (1) MC-net is devised to reinforce the intra-class distance. Fed with feature maps extracted from labeled viewpoint-pedestrian and non-pedestrian images, it learns a template parameter. After training, the net encodes metric codes that can be seen as a comparability measurement between the inputs and the template parameter. (2) WA-CNN is proposed to reinforce the inter-class distance with a deep CNN; it associates the metric codes to accomplish the detection task through a weighted loss function.

2 From Supervised Generalized Max Pooling to a Multi-class Template

Our goal is a template learning mechanism that represents multiple classes by vectors. We are motivated by a property of Generalized Max Pooling (GMP) [12]: the dot-product similarity between the max-pooling representation (a vector) and a feature matrix is a constant value:

$$\begin{aligned} \varvec{\psi ' \phi } = {\varvec{\alpha }},\end{aligned}$$
(1)

where \(\varvec{\psi '} \) is the feature matrix, \(\varvec{\phi }\) is the max-pooling representation, and \({\varvec{\alpha } } \) is a vector whose elements are all the same constant; the particular value of the constant has no influence [12].

We generalize the max-pooling representation to a template vector (representing one class). Because of the arbitrariness of \({\varvec{\alpha } } \), we enforce the dot-product similarity between the features and the template vector to equal a generalized max-pooling vector, defined as the mean of all max-pooling vectors belonging to one class. The primal formulation is

$$\begin{aligned} \sum _{i=1}^{\tau }{\varvec{\psi '_i}\varvec{\phi }}=\varvec{\alpha },\end{aligned}$$
(2)

where \(\varvec{\psi '_i} \) is the \(i\)-th labeled feature matrix among the \(\tau \) images belonging to one class. The vector \(\varvec{\alpha }\) can be seen as a supervised way to calculate the max pooling, and equally as a comparability determination between the template vector and the feature matrix.

To learn the template vector, we turn Eq. (2) into a least-squares regression problem

$$\begin{aligned} \varGamma =\frac{1}{2}||\sum _{i=1}^\tau {\varvec{\psi ' _i}\varvec{\phi } -\varvec{\alpha }}||^2.\end{aligned}$$
(3)

To handle multiple classes, we must calculate several template vectors in this system. Thus, we introduce the metric coding net (MC-net), which generalizes the vector \(\varvec{\phi }\) into a template parameter learned by a neural network to represent the multi-class templates.
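As a concrete illustration, the per-class template of Eq. (3) can be obtained in closed form by least squares; the sketch below assumes NumPy, with psi_list holding the \(\tau \) feature matrices of one class and alpha its target vector (both hypothetical names).

```python
# A minimal sketch of solving Eq. (3) for one class; names are illustrative.
import numpy as np

def fit_template(psi_list, alpha):
    """Least-squares solution of sum_i psi_i @ phi = alpha (Eq. (2))."""
    A = np.sum(psi_list, axis=0)              # accumulate the feature matrices
    phi, *_ = np.linalg.lstsq(A, alpha, rcond=None)
    return phi                                # template vector for this class
```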

Fig. 2. Metric coding net with GMM learning and two fully connected layers.

2.1 Formulation of MC-Net

The low layers of a CNN focus on local features and encode more discriminative features that capture intra-class variations [27]. Inspired by this property, MC-net is introduced to reinforce the intra-class distance using the low-layer feature maps.

The net takes the feature maps \(\varvec{\psi } \in {\mathbb {R}^{10 \times 5 \times 256}}\) as input, and the maps are reshaped by vectorization into a matrix \(\varvec{\psi ''}\in \mathbb {R}^{50\times 256}\), i.e., one 50-dimensional vector for each of the 256 feature maps.

To derive a more compact and discriminative representation, we utilize a Gaussian Mixture Model (GMM) to model the generation process of the feature maps. Assume the feature maps follow a parametric distribution \(\mathcal {P}_{\lambda }\left( \varvec{\psi ''} \right) \), which can be written as

$$\begin{aligned} \mathcal {P}_{\lambda }\left( \varvec{ \psi ''} \right) =\sum _{t=1}^T{\omega _tp_t\left( \varvec{ \psi ''} \right) },\end{aligned}$$
(4)

where \(p_t\) is the \(t\)-th component of the GMM, given by

$$\begin{aligned} p_t\left( \varvec{ \psi '' }\right) =\frac{1}{\left( 2\pi \right) ^{d/2}|\varvec{\varSigma } _t|^{1/2}}e^{\left( -\frac{1}{2}\left( \varvec{ \psi ''}-\varvec{\mu } _t \right) ^T{\varvec{\varSigma } _t}^{-1}\left( \varvec{\psi ''}-\varvec{\mu } _t\right) \right) }, \end{aligned}$$
(5)

and \(\lambda = {\{ {\omega _t},{\varvec{\mu }_t},{\varvec{\varSigma } _t}\} _{t = 1, \cdots ,T}}\) (\(T = 25\) in our work) denotes the GMM parameters trained on \(\varvec{\psi ''}\). Because the weight parameters carry little additional information, we use \(\varvec{\psi '}=\{\varvec{\mu }_t,\varvec{\varSigma }_t\}_{1}^{T}\in \mathbb {R}^{50\times 50}\) to represent the feature maps.
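This step can be sketched with scikit-learn as below; the diagonal covariance structure is our assumption (the paper does not state it), chosen so that stacking the \(T = 25\) means and covariance diagonals yields the stated \(50 \times 50\) representation.

```python
# A hedged sketch of the GMM representation; diagonal covariance is assumed.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_representation(psi, T=25):
    """psi: feature maps of shape (10, 5, 256) -> psi' of shape (50, 50)."""
    psi2 = psi.reshape(50, 256).T                    # psi'': 256 samples, 50-d
    gmm = GaussianMixture(n_components=T, covariance_type='diag').fit(psi2)
    # Stack the 25 means and 25 variance diagonals (each 50-d) into 50 x 50.
    return np.vstack([gmm.means_, gmm.covariances_])
```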

The coding framework begins with one fully connected layer (MC-fc\(\varvec{\phi }\), input map 1). The weight \(\varvec{\phi }\) of MC-fc\(\varvec{\phi }\) can be seen as the template vector in Eq. (2); the output

$$\begin{aligned} \varvec{\vartheta }=\varvec{\psi ' \phi }+\varvec{b} \end{aligned}$$
(6)

is calculated with Eq. (3) as the loss function, where \(\varvec{\alpha }=\varvec{\vartheta }-\varvec{b} \).

To strengthen the capacity to represent multiple classes, we add one fully connected layer (MC-fc1) to the framework and initialize the weights of MC-fc1 and MC-fc\(\varvec{\phi } \) randomly; the detailed setting is shown in Fig. 1(c). The forward propagation passes from MC-fc\(\varvec{\phi } \), without an activation function, to MC-fc1 by

$$\begin{aligned} \varvec{\vartheta }=\varvec{W}^{\vartheta \left( L \right) }\left( \varvec{\psi '\phi }+\varvec{b} \right) +\varvec{b}^{\vartheta \left( L \right) },\end{aligned}$$
(7)

where \(\varvec{\vartheta } \), \(\varvec{W}^{\vartheta \left( L \right) }\), and \( \varvec{b}^{\vartheta \left( L \right) } \) denote the top-layer feature vector, weights, and bias respectively, and \(\varvec{\phi } \), \( \varvec{b} \) are the weight and bias of MC-fc\(\varvec{\phi } \). Without an activation function, the two layers collapse into one linear combination, so Eq. (7) can be written as

$$\begin{aligned} \varvec{\vartheta }=\varvec{\psi '}\left( \varvec{W}^{\vartheta \left( L \right) }\oplus \varvec{\phi } \right) +\left( \varvec{b}\oplus \varvec{b}^{\vartheta \left( L \right) } \right) \end{aligned}$$
(8)

where \( \oplus \) denotes the linear combination. This is equivalent to simply increasing the dimension of \(\varvec{\phi } \). We therefore use a non-linear activation function to reconstitute the network:

$$\begin{aligned} \varvec{\vartheta }=\varvec{W}^{\vartheta \left( L \right) }\left( ReLu\left( \varvec{\psi '\phi }+\varvec{b} \right) \right) +\varvec{b}^{\vartheta \left( L \right) }, \end{aligned}$$
(9)

where \(ReLu\) is the rectified linear function [13]. The output

$$\begin{aligned} \varvec{\vartheta }=\varvec{\psi ' }\left( \varvec{W}^{\vartheta \left( L \right) }\bowtie \varvec{\phi } \right) +\left( \varvec{b}\bowtie \varvec{b}^{\vartheta \left( L \right) } \right) \end{aligned}$$
(10)

can be seen as the comparability determination between the input feature \(\varvec{\psi '}\) and the multi-class template parameter, where \(\bowtie \) is a generalized symbol for the non-linear combination through \(ReLu\).
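The two-layer coding net of Eq. (9) can be sketched in PyTorch as follows; the hidden and output widths are assumptions, since the paper fixes only the \(50 \times 50\) input.

```python
# A minimal PyTorch sketch of Eq. (9); layer widths are assumptions.
import torch
import torch.nn as nn

class MCNet(nn.Module):
    def __init__(self, in_dim=50 * 50, hidden=50, out_dim=50):
        super().__init__()
        self.fc_phi = nn.Linear(in_dim, hidden)  # weight ~ phi, bias ~ b
        self.fc1 = nn.Linear(hidden, out_dim)    # W^{theta(L)}, b^{theta(L)}

    def forward(self, psi_prime):
        # psi_prime: flattened psi' of shape (batch, 50*50); returns theta
        return self.fc1(torch.relu(self.fc_phi(psi_prime)))
```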

2.2 The Training of the Net

Let \(\varvec{B} = \{ (\varvec{\psi } ,{\varvec{\alpha }_i})\} _{i = 1}^K\) be the training set, where \(K\) is the number of training images. Corresponding to the max-pooling vector in Eq. (2), \({\varvec{\alpha }_i}\) denotes one of eight labeled mean values: we max-pool \(\varvec{\psi }'\) to \(\varvec{\psi }^{\max }\in \mathbb {R}^{50\times 1}\) and compute the mean value \({\varvec{\alpha }_i}\) within each labeled class.
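A small sketch of constructing these targets, assuming NumPy and hypothetical names psi_list and class_ids:

```python
# A minimal sketch of the alpha_i targets; psi_list / class_ids are assumed.
import numpy as np

def class_mean_targets(psi_list, class_ids, n_classes=8):
    """Max-pool each psi' (50 x 50) to 50-d, then average within each class."""
    pooled = np.array([p.max(axis=1) for p in psi_list])   # psi^max, (K, 50)
    return np.array([pooled[class_ids == c].mean(axis=0)
                     for c in range(n_classes)])           # one alpha per class
```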

Corresponding to Eq. (3),

$$\begin{aligned} E^{\left( MC \right) }=\frac{1}{2}||\varvec{\vartheta }-\varvec{\alpha }||^2,\end{aligned}$$
(11)

is used as the loss function, where \(\varvec{\vartheta }\) is the output of the net and \(\varvec{\alpha }\) is the labeled mean value, as shown in Table 1.

Table 1. The labeled mean values.

During training, we set a maximum of 2000 epochs and terminate when the objective converges to a relatively small value. After training, the parameter \(\left[ \left( \varvec{W}^{\vartheta \left( L \right) }\bowtie \varvec{\phi } \right) ,\left( \varvec{b}^{\vartheta \left( L \right) }\bowtie \varvec{b} \right) \right] \) can be seen as a generalized template. For an arbitrary input image, the output is the metric code \(\varvec{\vartheta }\), which can be used directly as the comparability metric between the image and the template.
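The training loop can be sketched with the MCNet module above; the optimizer, learning rate, and batching below are our assumptions, not the paper's settings.

```python
# A hedged training sketch for Eq. (11); hyperparameters are assumptions.
import torch

net = MCNet()
opt = torch.optim.SGD(net.parameters(), lr=1e-3)
mse = torch.nn.MSELoss(reduction='sum')
loader = [(torch.randn(8, 2500), torch.randn(8, 50))]  # placeholder batches

for epoch in range(2000):                 # the paper's maximum epoch number
    for psi_prime, alpha in loader:
        loss = 0.5 * mse(net(psi_prime), alpha)    # E^{(MC)} of Eq. (11)
        opt.zero_grad()
        loss.backward()
        opt.step()
```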

3 Weighted Association CNN

As analyzed in [27], different layers encode different types of features, and higher layers encode semantic concepts of object categories. Motivated by this property, WA-CNN is proposed to reinforce the inter-class distance with a deep model.

3.1 The Formulation of WA-CNN

Let \(\varvec{D} = \{ (\varvec{\psi } ,{{y'}_i})\} _{i=1}^K\) be the set of training feature maps, where \( y'_i=\left( y_i,\varrho _i^p,\varrho _i^n \right) \) is a three-tuple and \( y_i \) indicates whether a feature map contains a pedestrian. Binary labels \(\varrho _i^p=\{\varrho _i^{pj}\}_{j=1}^{4}\) and \(\varrho _i^n=\{\varrho _i^{nj}\}_{j=1}^{4} \) represent the pedestrian viewpoints and the non-pedestrian classes; the labels are shown in Fig. 1.

As shown in Fig. 1(b), WA-CNN takes the feature maps \( \varvec{\psi } \in \mathbb {R}^{10\times 5\times 256}\) as input and stacks four convolutional layers (WA-conv1 to WA-conv4), one max-pooling layer, and three fully connected layers (WA-fc1 to WA-fc3); the detailed setting is shown in Fig. 1(b). All these layers use the rectified linear function [13] as the activation function.
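A rough PyTorch sketch of this stack is given below; since the paper does not state channel counts, kernel sizes, or fully connected widths, all of those are assumptions made only to show the layer ordering.

```python
# A hedged sketch of the WA-CNN trunk; all widths and kernels are assumptions.
import torch
import torch.nn as nn

wa_trunk = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),  # WA-conv1
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),  # WA-conv2
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),  # WA-conv3
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),  # WA-conv4
    nn.MaxPool2d(2),                       # (256, 10, 5) -> (256, 5, 2)
    nn.Flatten(),
    nn.Linear(256 * 5 * 2, 512), nn.ReLU(),                    # WA-fc1
    nn.Linear(512, 512), nn.ReLU(),        # WA-fc2, yields H^{(L-2)}
)
# WA-fc3 follows the fusion with the metric codes (Eq. (12) in Sect. 3.1).
```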

To strengthen the intra-class discriminant validity, WA-CNN associates the metric codes generated by MC-net:

$$\begin{aligned} \begin{aligned}\varvec{ H}^{\left( L-1 \right) }=&\,ReLu(\varvec{W}^{\left( wa \right) }\varvec{H}^{\left( L-2 \right) }+\varvec{b}^{\left( wa \right) }\\ {}&+\varvec{W}^{ \left( mc \right) }\varvec{\vartheta } +\varvec{b}^{ \left( mc \right) } ), \end{aligned} \end{aligned}$$
(12)

where \(\varvec{ H}^{\left( L \right) } \) is the top-layer feature vector of WA-CNN, and \(\varvec{W}^{\left( wa \right) }\), \(\varvec{b}^{\left( wa \right) }\) and \(\varvec{W}^{\left( mc \right) }\), \(\varvec{b}^{ \left( mc \right) }\) are the parameter matrices of the two networks respectively.
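A compact sketch of this fusion step (all dimensions are assumptions):

```python
# A minimal PyTorch sketch of Eq. (12); widths are illustrative assumptions.
import torch
import torch.nn as nn

class Fusion(nn.Module):
    def __init__(self, h_dim=512, mc_dim=50, out_dim=256):
        super().__init__()
        self.wa = nn.Linear(h_dim, out_dim)   # W^{(wa)}, b^{(wa)}
        self.mc = nn.Linear(mc_dim, out_dim)  # W^{(mc)}, b^{(mc)}

    def forward(self, h_prev, theta):
        # h_prev: H^{(L-2)} from the WA-CNN trunk; theta: MC-net metric code
        return torch.relu(self.wa(h_prev) + self.mc(theta))  # H^{(L-1)}
```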

We use

$$\begin{aligned} \begin{aligned} E^{\left( WA \right) }&=-\sum \limits _{i=1}^K{\log p\left( y_i,\varvec{\varrho }^p,\varvec{\varrho }^n|\varvec{\psi ,\vartheta } \right) }\\&=-y\log p\left( y|\varvec{\psi ,\vartheta } \right) -\sum _{i=1}^4{\varvec{\varrho }^{pi}\log p\left( \varvec{\varrho }^p|\varvec{\psi ,\vartheta } \right) }\\&\quad -\sum _{j=1}^4{\varvec{\varrho }^{nj}\log p\left( \varvec{\varrho }^n|\varvec{\psi ,\vartheta } \right) },\end{aligned}\end{aligned}$$
(13)

as the loss function; it expands into three parts: the main pedestrian task, the viewpoint-pedestrian task, and the non-pedestrian task. The main task is to predict the pedestrian label \(y\); \( \varvec{ \varrho }^{pi} \) and \(\varvec{ \varrho } ^{nj}\) are the \(i\)-th pedestrian and \(j\)-th non-pedestrian estimates. \(p\left( y|\varvec{\psi ,\vartheta } \right) \), \( p\left( \varvec{\varrho }^p|\varvec{\psi ,\vartheta } \right) \), and \( p\left( \varvec{\varrho }^n|\varvec{\psi ,\vartheta } \right) \) are modeled by softmax functions.

3.2 The Training of WA-CNN

Because the main task is to predict the pedestrian label, the other tasks are auxiliary. Thus, in the training phase, we reformulate Eq. (13), using \(\omega \) and \( \varepsilon \) to associate the multiple tasks through the following weighted strategy:

$$\begin{aligned} \begin{aligned} E^{\left( WA \right) }&=-y\log p\left( y|\varvec{\psi ,\vartheta } \right) -\sum _{i=1}^4{\omega _i\varvec{\varrho }^{pi}\log p\left( \varvec{\varrho }^p|\varvec{\psi ,\vartheta } \right) }\\&\quad -\sum _{j=1}^4{\varepsilon _j\varvec{\varrho }^{nj}\log p\left( \varvec{\varrho }^n|\varvec{\psi ,\vartheta } \right) }. \end{aligned}\end{aligned}$$
(14)

In our work, \(\omega \) and \(\varepsilon \) may take values between zero and one; we simply set \( \omega _i=0.1\) for \(i=1,2,\ldots ,4 \) and \(\varepsilon _j=0.1\) for \(j=1,2,\ldots ,4 \).
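The weighted objective can be sketched as below, assuming the three heads produce logits and the attribute targets are class indices; this is an illustration of Eq. (14), not the paper's implementation.

```python
# A hedged sketch of Eq. (14); head shapes and target encoding are assumptions.
import torch.nn.functional as F

def weighted_loss(y_logit, p_logit, n_logit, y, rho_p, rho_n,
                  omega=0.1, eps=0.1):
    main = F.cross_entropy(y_logit, y)              # main pedestrian term
    view = omega * F.cross_entropy(p_logit, rho_p)  # viewpoint (4-way) term
    neg = eps * F.cross_entropy(n_logit, rho_n)     # non-pedestrian (4-way) term
    return main + view + neg
```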

With the training set \(\varvec{D} = \{ (\varvec{\psi } ,{{y'}_i})\} _{i=1}^K\), WA-CNN is trained to further reinforce the inter-class distance.

4 Overview on Our Method

Figure 1 shows our pedestrian detection pipeline, where VGG16 (conv1–conv3) with an RPN extracts candidate regions from images of arbitrary size. The generated feature maps of the candidate regions, \(\varvec{\psi } \in {\mathbb {R}^{10 \times 5 \times 256}}\), are reconstituted with their labels into the sets \(\varvec{B} = \{ (\varvec{\psi } ,{\varvec{\alpha }_i})\} _{i = 1}^K\) and \(\varvec{D} = \{ (\varvec{\psi } ,{{y'}_i})\} _{i=1}^K\), which serve as the training sets for our associated work network.

Our associated work network contains two networks, MC-net and WA-CNN. WA-CNN can be seen as a network that reinforces the inter-class distance; conversely, MC-net plays the role of enhancing the intra-class instance differences.

As shown in Fig. 2, with the training set \(\varvec{B} = \{ (\varvec{\psi } ,{\varvec{\alpha }_i})\} _{i = 1}^K\), which contains the labeled viewpoint-pedestrian and non-pedestrian images, MC-net learns the template parameter \(\left[ \left( \varvec{W}^{\vartheta \left( L \right) }\bowtie \varvec{\phi } \right) ,\left( \varvec{b}^{\vartheta \left( L \right) }\bowtie \varvec{b} \right) \right] \). After training, it encodes the feature maps \(\varvec{\psi }\) with the template parameter, and the output \(\varvec{\vartheta }\) is a comparability determination between the input map and the template parameter.

Finally, the outputs of MC-net and WA-CNN are jointly learned by the two fully connected layers of our network, and the weighted loss function accomplishes the detection task on the joint features.

5 Experiments

We evaluate our detector on the Caltech-Test [7] and INRIA [8] datasets, strictly following the evaluation protocol of [7].

The training data are generated by transferring scene attribute information from existing background scene segmentation databases to seventeen attributes in the pedestrian dataset, as in TA-CNN [24]. We use only eight attributes (shown in Table 1) as training data, reconstituted into two parts: the viewpoint pedestrians (left, right, front, back) and the non-pedestrians (tree, car, road, building). Note that our network does not employ any motion or context information.

For the Caltech-Test reasonable subset, all results of our network are obtained by training on the reconstituted training data and evaluating on Caltech-Test (set06–set10). To evaluate the generalization capacity of our network, we also report overall results on INRIA-Test, likewise obtained by training on the reconstituted training data and evaluating on INRIA-Test.

Hereinafter, the RPN is fine-tuned within our network, while WA-CNN and MC-net, which are brand-new networks, are evaluated for performance and effectiveness (Table 2).

Table 2. The runtime and performance on Caltech.

5.1 Effectiveness of Different Components in WA-CNN

Under the deep neural network framework, we compare our full network (WA-CNN + non-linear MC-net) with WA-CNN + linear MC-net and WA-CNN without MC-net to verify the capacity of MC-net to represent multiple classes.

Fig. 3. Results on the Caltech-Test reasonable subset and the INRIA dataset: (a), (c) overall performance; (b), (d) log-average miss rate reduction.

Caltech-Test reasonable subset: we systematically study the effectiveness of the different components of our network. After training, WA-CNN without MC-net achieves a 23.71% miss rate. On this baseline, we implement WA-CNN + linear MC-net; as shown in Fig. 3(b), it achieves a 16.72% miss rate, an improvement of 6.99%. To verify the capacity of the non-linear MC-net in our network, we retrain MC-net with the non-linear activation function. As Fig. 3(b) shows, the non-linear MC-net achieves a 13.74% miss rate, 2.98% better than the linear MC-net. We also compare against other deep models (JointDeep, SDN, ACF+SDT, SpatialPooling, TA-CNN); our method achieves the lowest miss rate.

INRIA: WA-CNN without MC-net achieves an 11.61% miss rate and WA-CNN + linear MC-net achieves 10.19%. The non-linear MC-net further improves the miss rate to 9.20%, as shown in Fig. 3(d). These results show that our network generalizes well.

5.2 Comparisons with State-of-the-Art Methods

Finally, we evaluate our network against existing best-performing methods, covering both handcrafted features and deep neural networks.

Caltech-Test reasonable subset: we compare our network with existing best-performing methods, including VJ [25], HOG [4], ACF-Caltech [5], MT-DPM [29], MT-DPM+Context [29], JointDeep [16], SDN [11], ACF+SDT [20], InformedHaar [33], ACF-Caltech+ [14], SpatialPooling [19], LDCF [14], Katamari [3], SpatialPooling+ [18], TA-CNN [24], CCF [30], and CCF+CF [30]. Figure 3(a) reports the results; our method achieves the lowest miss rate (13.74%) among all existing methods.

INRIA: we compare our network with existing best-performing methods, including ACF, VeryFast [1], SCCpriors [31], LDCF, Roerei [2], SketchTokens [10], SpatialPooling, and some of the methods mentioned in Sect. 5.1. As shown in Fig. 3(c), our method achieves the lowest miss rate.

Fig. 4. Compared with ACF, our network performs well at hard-negative mining: its scores on hard negatives are more discriminative than those of ACF.

5.3 Evaluation on Hard Negatives Mining

To evaluate our method on hard-negative mining intuitively, we chose twenty hard-negative images from the Caltech dataset that are difficult for other methods. The ACF scores serve as our baseline, and the scores of our network are computed by the softmax functions. For fairness, the ACF scores are normalized to [−1, 1]. Figure 4 shows that our method mines hard negatives well.

6 Conclusions

In this paper, given plentiful negative and positive samples, MC-net and WA-CNN are introduced as an associated work network for mining hard negatives in pedestrian detection. They enforce the intra-class and inter-class differences using the properties of low-level and high-level features in a CNN model. Within this network, the input-size problem is alleviated by a flexible use of the RPN. Extensive experiments demonstrate the effectiveness of the proposed method.