1 Introduction

Face detection is a fundamental task in many face applications. The breakthrough work of Viola and Jones [1] uses the AdaBoost algorithm with Haar-like features to train a cascade of face vs. non-face classifiers. Since then, numerous subsequent works [2,3,4,5,6,7] have been proposed to improve cascade detectors. Later, [8,9,10] introduced deformable part models (DPM) into face detection by modeling the relationships between deformable facial parts. These methods rely mainly on hand-designed features, which are less representative, and are trained in separate steps.

With the breakthrough of convolutional neural networks (CNNs), face detection has progressed considerably in recent years by building on modern CNN-based object detectors, including R-CNN [11,12,13,14], SSD [15], YOLO [16], Focal Loss [17] and their extensions [18]. Benefiting from powerful deep learning approaches and end-to-end optimization, CNN-based face detectors achieve much better performance and provide a new baseline for later methods.

Recent anchor-based detection frameworks aim at detecting hard faces in uncontrolled environments such as WIDER FACE [19]. SSH [20] and \(\hbox {S}^{3}\hbox {FD}\) [21] develop scale-invariant networks that detect faces of different scales from different layers of a single network. Face R-FCN [22] re-weights embedding responses on score maps and eliminates the effect of the non-uniform contribution of each facial part using position-sensitive average pooling. FAN [23] proposes anchor-level attention that highlights features from the face region to detect occluded faces.

Although these works offer effective ways to design anchors and related networks for detecting faces of different scales, the use of contextual information in face detection has not received enough attention, even though it should play a significant role in detecting hard faces. As shown in Fig. 1, faces seldom occur in isolation in the real world; they usually appear with shoulders or bodies, which provide a rich source of contextual associations to exploit, especially when the facial texture is indistinguishable because of low resolution, blur or occlusion. We address this issue by introducing a novel context-assisted network that makes full use of contextual signals through the following steps.

Fig. 1. Hard faces are difficult to locate and classify because they lack visual consistency, whereas larger regions that hint at the position of the face, such as the head and body, are easier to locate and classify.

First, the network should learn features not only for faces but also for contextual parts such as heads and bodies. This requires extra labels and anchors designed to match these parts. In this work, we use a semi-supervised solution to generate approximate labels for the contextual parts related to faces, and we introduce a series of anchors, called PyramidAnchors, that can be easily added to general anchor-based architectures.

Second, high-level contextual features should be adequately combined with low-level ones. The appearances of hard and easy faces can be quite different, which implies that not all high-level semantic features are helpful for smaller targets. We investigate the performance of Feature Pyramid Networks (FPN) [24] and modify it into a Low-level Feature Pyramid Network (LFPN) that joins mutually helpful features.

Third, the prediction branch should make full use of the joint features. We introduce the Context-sensitive Prediction Module (CPM) to incorporate context information around the target face with a wider and deeper network. We also propose a max-in-out layer for the prediction module to further improve the capability of the classification network.

In addition, we propose a training strategy named Data-anchor-sampling to adjust the distribution of the training dataset. To learn more representative features, the diversity of hard-set samples is important and can be gained through data augmentation across samples.

For clarity, the main contributions of this work are fivefold:

  1. We propose an anchor-based context assisted method, called PyramidAnchors, to introduce supervised information on learning contextual features for small, blurred and partially occluded faces.

  2. We design the Low-level Feature Pyramid Networks (LFPN) to merge contextual features and facial features better. Meanwhile, the proposed method can handle faces with different scales well in a single shot.

  3. We introduce a context-sensitive prediction module, consisting of a mixed network structure and max-in-out layer to learn accurate location and classification from the merged features.

  4. We propose the scale aware Data-anchor-sampling strategy to change the distribution of training samples to put emphasis on smaller faces.

  5. We achieve superior performance over state-of-the-art on the common face detection benchmarks FDDB and WIDER FACE.

The rest of the paper is organized as follows. Section 2 provides an overview of the related works. Section 3 introduces the proposed method. Section 4 presents the experiments and Sect. 5 concludes the paper.

2 Related Work

Anchor-Based Face Detectors. Anchors were first proposed in Faster R-CNN [14] and have since been widely used in both two-stage and single-shot object detectors. Anchor-based object detectors [15, 16] have achieved remarkable progress in recent years. Similar to FPN [24], Lin et al. [17] use translation-invariant anchor boxes, and Zhang et al. [21] design anchor scales to ensure that the detector handles faces of various scales well. FaceBoxes [25] introduces anchor densification to ensure that different types of anchors have the same density on the image. \(\hbox {S}^{3}\hbox {FD}\) [21] proposes an anchor matching strategy to improve the recall rate of tiny faces.

Scale-Invariant Face Detectors. To improve the ability of a face detector to handle faces of different scales, many state-of-the-art works [20, 21, 23, 26] construct different structures within the same framework to detect faces of varying size, where high-level features are designed to detect large faces and low-level features to detect small faces. To integrate high-level semantic features into low-level layers with higher resolution, FPN [24] proposed a top-down architecture that uses high-level semantic feature maps at all scales. Recently, FPN-style frameworks have achieved strong performance on both object detection [17] and face detection [23].

Context-Associated Face Detectors. Recently, some works show the importance of contextual information for face detection, especially for finding small, blurred and occluded faces. CMS-RCNN [27] used Faster R-CNN in face detection with body contextual information. Hu et al. [28] trained separate detectors for different scales. SSH [20] modeled the context information by large filters on each prediction module. FAN [23] proposed an anchor-level attention, by highlighting the features from the face region, to detect the occluded faces.

Fig. 2. Architecture of PyramidBox. It consists of Scale-equitable Backbone Layers, Low-level Feature Pyramid Layers (LFPN), Context-sensitive Predict Layers and PyramidBox Loss Layer.

Fig. 3. (a) Feature Pyramid Net. (b) Context-sensitive Prediction Module. (c) PyramidBox Loss.

3 PyramidBox

This section introduces the context-assisted single shot face detector, PyramidBox. We first briefly introduce the network architecture in Sect. 3.1. Then we present a context-sensitive prediction module in Sect. 3.2, and propose a novel anchor method, named PyramidAnchors, in Sect. 3.3. Finally, Sect. 3.4 presents the associated training methodology including data-anchor-sampling and max-in-out.

3.1 Network Architecture

Anchor-based object detection frameworks with sophisticated anchor designs have proved effective for handling faces of variable scales when predictions are made at different levels of the feature maps [14, 15, 20, 21, 23]. Meanwhile, FPN structures have shown strength in merging high-level features with lower-level ones. The architecture of PyramidBox (Fig. 2) uses the same extended VGG16 backbone and anchor scale design as \(\hbox {S}^{3}\hbox {FD}\) [21], which generates feature maps at different levels and anchors with equal-proportion intervals. A low-level FPN is added to this backbone, and a Context-sensitive Predict Module is used as a branch network from each pyramid detection layer to produce the final output. The key is a novel pyramid anchor method that generates a series of anchors for each face at different levels. The details of each component of the architecture are as follows:

Scale-Equitable Backbone Layers. We use the base convolution layers and extra convolutional layers in \(\hbox {S}^{3}\hbox {FD}\) [21] as our backbone layers, which keep layers of VGG16 from \(conv \,1\_1\) to \(pool \,5\), then convert \(fc \,6\) and \(fc \,7\) of VGG16 to \(conv \_fc \) layers, and then add more convolutional layers to make it deeper.

Low-Level Feature Pyramid Layers. Low-level features with high resolution play a key role in improving the ability of a face detector to handle faces of different scales. Hence, many state-of-the-art works [20, 21, 23, 26] construct different structures within the same framework to detect faces of varying size, where high-level features are designed to detect large faces and low-level features to detect small faces. To integrate high-level semantic features into low-level layers with higher resolution, FPN [24] proposed a top-down architecture that uses high-level semantic feature maps at all scales. Recently, FPN-style frameworks have achieved strong performance on both object detection [17] and face detection [23].

To our knowledge, all of these works build the FPN starting from the top layer, yet not all high-level features are necessarily helpful for small faces. First, faces that are small, blurred or occluded have different texture features from large, clear and complete ones, so it is crude to directly use all high-level features to enhance performance on small faces. Second, high-level features are extracted from regions with little face texture and may introduce noise. For example, in the backbone layers of PyramidBox, the receptive fields [21] of the top two layers \(conv \,7\_2\) and \(conv \,6\_2\) are 724 and 468, respectively. Since the input size of a training image is 640, the top two layers contain too many noisy context features, so they may not contribute to detecting medium and small faces.

Instead, we build the Low-level Feature Pyramid Network (LFPN) by starting the top-down structure from a middle layer, whose receptive field should be close to half of the input size, rather than from the top layer. The structure of each LFPN block is the same as in FPN [24]; see Fig. 3(a) for details.

Pyramid Detection Layers. We select \(lfpn \,\_2 \), \(lfpn \,\_1 \), \(lfpn \,\_0 \), \(conv \,\_fc \,7\), \(conv \,6\_2\) and \(conv \,7\_2\) as detection layers with anchor sizes of 16, 32, 64, 128, 256 and 512, respectively. Here \(lfpn \,\_2 \), \(lfpn \,\_1 \) and \(lfpn \,\_0 \) are the output layers of LFPN based on \(conv \,3\_3\), \(conv \,4\_3\) and \(conv \,5\_3\), respectively. Moreover, as in other SSD-style methods, we use L2 normalization [29] to rescale the norm of the LFPN layers.

Predict Layers. Each detection layer is followed by a Context-sensitive Predict Module (CPM); see Sect. 3.2. The outputs of the CPM are used to supervise pyramid anchors (see Sect. 3.3), which approximately cover the face, head and body regions in our experiments. The output size of the l-th CPM is \(w_l\times h_l \times c_l\), where \(w_l = h_l = 640/2^{2+l}\) is the corresponding feature size and the channel size \(c_l\) equals 20 for \(l = 0,1,\ldots ,5\). The channels are used for classification and regression of faces, heads and bodies, respectively. The classification of faces needs 4 \((= cp_l + cn_l)\) channels, where \(cp_l\) and \(cn_l\) are the max-in-out channel counts of the foreground and background labels, respectively, satisfying

$$\begin{aligned} cp_l = \left\{ \begin{array}{ll} 1, &{}\text { if }l = 0,\\ 3, &{}\text { otherwise.} \end{array}\right. \end{aligned}$$

Moreover, the classification of head and body each needs two channels, while face, head and body each have four channels for localization.
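The channel count above can be checked with a short calculation. The following is a minimal sketch, not the authors' code: the function name `cpm_output_shape` is hypothetical, and the split of the 20 channels simply follows the description in the text (4 for face classification, 2 each for head and body classification, and 4 localization channels for each of the three parts).

```python
INPUT_SIZE = 640  # training input size stated in the text


def cpm_output_shape(l):
    """Return (w_l, h_l, c_l) for the l-th prediction module, l = 0..5."""
    side = INPUT_SIZE // 2 ** (2 + l)  # w_l = h_l = 640 / 2^(2+l)
    cp = 1 if l == 0 else 3            # foreground max-in-out channels
    cn = 4 - cp                        # background channels, since cp_l + cn_l = 4
    face_cls = cp + cn                 # 4 channels for face classification
    head_cls = 2                       # head vs. not head
    body_cls = 2                       # body vs. not body
    loc = 4 * 3                        # 4 box coordinates for face, head and body
    return side, side, face_cls + head_cls + body_cls + loc


shapes = [cpm_output_shape(l) for l in range(6)]
```

Under these assumptions the six feature maps have side lengths 160, 80, 40, 20, 10 and 5, and every module outputs 4 + 2 + 2 + 12 = 20 channels.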

PyramidBox Loss Layers. For each target face (see Sect. 3.3), we have a series of pyramid anchors that supervise classification and regression simultaneously. We design a PyramidBox Loss (see Sect. 3.4), in which we use softmax loss for classification and smooth L1 loss for regression.

3.2 Context-Sensitive Predict Module

Predict Module. In the original anchor-based detectors, such as SSD [15] and YOLO [16], the objective functions are applied directly to the selected feature maps. As proposed in MS-CNN [30], enlarging the sub-network of each task can improve accuracy. Recently, SSH [20] increased the receptive field by placing a wider convolutional prediction module on top of layers with different strides, and DSSD [31] added residual blocks to each prediction module. In effect, SSH makes the prediction module wider and DSSD makes it deeper, so that the prediction module obtains better features for classification and localization.

Inspired by Inception-ResNet [32], we can jointly enjoy the gains of wider and deeper networks. We design the Context-sensitive Predict Module (CPM), see Fig. 3(b), in which we replace the convolution layers of the context module in SSH with the residual-free prediction module of DSSD. This allows our CPM to reap the benefits of the DSSD module while retaining the rich contextual information of the SSH context module.

Max-in-Out. The concept of Maxout was first proposed by Goodfellow et al. [33]. Recently, \(\hbox {S}^{3}\hbox {FD}\) [21] applied a max-out background label to reduce the false positive rate of small negatives. In this work, we use this strategy on both positive and negative samples and denote it as max-in-out; see Fig. 3(c). We first predict \(c_p + c_n\) scores for each prediction module and then select the maximum of the \(c_p\) scores as the positive score. Similarly, we choose the maximum of the \(c_n\) scores as the negative score. In our experiments, we set \(c_p = 1\) and \(c_n = 3\) for the first prediction module, since small anchors have more complicated backgrounds [25], and \(c_p = 3\) and \(c_n = 1\) for the other prediction modules to recall more faces.
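To make the max-in-out selection concrete, here is a minimal sketch for a single anchor. It is illustrative only: the function names are hypothetical, the ordering of the scores (positive scores first, then negative scores) is an assumption, and the final two-way softmax is the usual way to turn the two selected scores into a face probability rather than something stated explicitly in the text.

```python
import math


def max_in_out(scores, c_p, c_n):
    """Reduce c_p + c_n raw scores to one (positive, negative) score pair.

    Assumes the first c_p entries are positive scores and the rest negative.
    """
    if len(scores) != c_p + c_n:
        raise ValueError("expected c_p + c_n scores")
    positive = max(scores[:c_p])  # max over the c_p foreground scores
    negative = max(scores[c_p:])  # max over the c_n background scores
    return positive, negative


def face_probability(scores, c_p, c_n):
    """Two-way softmax over the selected pair (an assumed post-processing step)."""
    pos, neg = max_in_out(scores, c_p, c_n)
    return math.exp(pos) / (math.exp(pos) + math.exp(neg))


raw = [0.2, 1.5, -0.3, 0.9]
first_module = max_in_out(raw, c_p=1, c_n=3)   # setting for the first module
other_module = max_in_out(raw, c_p=3, c_n=1)   # setting for the other modules
```

With the same four raw scores, the first-module setting keeps the strongest of three background scores, while the other setting keeps the strongest of three foreground scores, which matches the stated goals of suppressing false positives and recalling more faces.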

Fig. 4. Illustration of PyramidAnchors. For example, the largest purple face, of size 128, has pyramid-anchors at \(P_3\), \(P_4\) and \(P_5\), where \(P_3\) are anchors generated from \(conv \_fc 7\) and labeled by the face itself, \(P_4\) are anchors generated from \(conv 6\_2\) and labeled by the head (of size about 256) of the target face, and \(P_5\) are anchors generated from \(conv 7\_2\) and labeled by the body (of size about 512) of the target face. Similarly, to detect the smallest cyan face, of size 16, one obtains supervised features from pyramid-anchors on \(P_0\) labeled by the original face, pyramid-anchors on \(P_1\) labeled by the corresponding head of size 32, and pyramid-anchors on \(P_2\) labeled by the corresponding body of size 64.

3.3 PyramidAnchors

Recently, anchor-based object detectors [15,16,17, 24] and face detectors [21, 25] have achieved remarkable progress. It has been shown that balanced anchors at each scale are necessary for detecting small faces [21]. However, these methods still ignore the context features at each scale, because their anchors are all designed for face regions. To address this problem, we propose a novel alternative anchor method, named PyramidAnchors.

For each target face, PyramidAnchors generate a series of anchors corresponding to larger regions related to the face that contain more contextual information, such as the head, shoulders and body. We choose the layers on which to set such anchors by matching the region size to the anchor size, which supervises higher-level layers to learn more representative features for faces of lower-level scales. Given extra labels for the head, shoulders or body, we could accurately match the anchors to the ground truth to generate the loss. Since adding extra labels would be unfair, we implement this in a semi-supervised way under the assumption that regions with the same ratio and offset relative to different faces have similar contextual features. That is, we can use a set of uniform boxes to approximate the actual regions of the head, shoulders and body, as long as the features from these boxes are similar across different faces. For a target face located at \(region_{target}\) in the original image, consider \(anchor_{i,j}\), the j-th anchor at the i-th feature layer with stride size \(s_i\). We define the label of the k-th pyramid-anchor by

$$\begin{aligned} label_k(anchor_{i,j}) = \left\{ \begin{array}{ll} 1, &{} \text {if }iou(anchor_{i,j}\cdot s_i/s_{pa}^k, region_{target}) > threshold,\\ 0, &{} otherwise, \end{array}\right. \end{aligned}$$
(1)

for \(k = 0, 1, \ldots , K\), where \(s_{pa}\) is the stride of pyramid anchors, \(anchor_{i,j}\cdot s_i\) denotes the region in the original image corresponding to \(anchor_{i,j}\), and \(anchor_{i,j}\cdot s_i/s_{pa}^k\) represents that region down-sampled by stride \(s_{pa}^k\). The threshold is the same as in other anchor-based detectors. The PyramidBox Loss is described in Sect. 3.4.

In our experiments, we set the hyperparameter \(s_{pa} = 2\), since the strides of adjacent prediction modules differ by a factor of 2. Furthermore, we let \(threshold = 0.35\) and \(K=2\). Then \(label_0\), \(label_1\) and \(label_2\) are the labels of the face, head and body, respectively. A face therefore generates 3 targets in three consecutive prediction modules, which represent the face itself and the head and body corresponding to the face. Figure 4 shows an example.
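The labeling rule of Eq. (1) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: boxes are given as (x1, y1, x2, y2) in image coordinates, the anchor passed in has already been mapped to the original image (the term \(anchor_{i,j}\cdot s_i\)), and "down-sampling by stride \(s_{pa}^k\)" is read as shrinking the box about its center. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def downsample(box, factor):
    """Shrink a box by `factor` about its center (center-preserving is an assumption)."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    half_w = (box[2] - box[0]) / (2 * factor)
    half_h = (box[3] - box[1]) / (2 * factor)
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def pyramid_label(anchor_in_image, face, k, s_pa=2, threshold=0.35):
    """label_k of Eq. (1): 1 if the anchor down-sampled by s_pa^k matches the face."""
    return 1 if iou(downsample(anchor_in_image, s_pa ** k), face) > threshold else 0


face = (100, 100, 116, 116)          # a 16 x 16 face
face_anchor = (100, 100, 116, 116)   # size-16 anchor, supervised as face (k = 0)
head_anchor = (92, 92, 124, 124)     # size-32 anchor around the face (k = 1)
body_anchor = (76, 76, 140, 140)     # size-64 anchor around the face (k = 2)
labels = [pyramid_label(face_anchor, face, 0),
          pyramid_label(head_anchor, face, 1),
          pyramid_label(body_anchor, face, 2)]
```

In this toy setup a single 16-pixel face yields positive labels on anchors of size 16, 32 and 64, mirroring the three consecutive targets described above; the size-32 anchor is not positive for \(k = 0\) because its IoU with the face is only 0.25.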

Benefiting from PyramidBox, our face detector handles small, blurred and partially occluded faces better. Note that the pyramid anchors are generated automatically without any extra labels, and this semi-supervised learning helps PyramidAnchors extract approximate contextual features. At prediction time, we use only the output of the face branch, so no additional computational cost is incurred at runtime compared to standard anchor-based face detectors.

3.4 Training

In this section, we introduce the training dataset, data augmentation, loss function and other implementation details.

Train Dataset. We trained PyramidBox on the 12,880 images of the WIDER FACE training set with color distortion, random cropping and horizontal flipping.

Fig. 5. Data-anchor-sampling changes the distribution of the training data. Dotted lines show the distribution of a certain attribute, while solid lines show the corresponding distribution of that attribute after data-anchor-sampling.

Data-Anchor-Sampling. Data sampling [34] is a classical subject in statistics, machine learning and pattern recognition, and it has developed considerably in recent years. For the task of object detection, Focal Loss [17] addresses class imbalance by reshaping the standard cross-entropy loss.

Here we use a data augmentation sampling method named Data-anchor-sampling. In short, data-anchor-sampling resizes a training image by reshaping a random face in the image to a random smaller anchor size. More specifically, we first randomly select a face of size \(s_{face}\) in a sample. As mentioned in Sect. 3.1, the scales of the anchors in PyramidBox are

$$\begin{aligned} s_{i} = 2^{4+i}, \text { for }i = 0, 1, \ldots , 5, \end{aligned}$$

let

$$\begin{aligned} i_{anchor} = \text {argmin}_i \text {abs}(s_{i} - s_{face}) \end{aligned}$$

be the index of the anchor scale nearest to the selected face. We then choose a random index \(i_{target}\) from the set

$$\begin{aligned} \{ 0, 1, \ldots , \text {min}(5, i_{anchor} + 1)\}, \end{aligned}$$

and finally resize the face of size \(s_{face}\) to the size

$$\begin{aligned} s_{target} = random(s_{i_{target}}/2, s_{i_{target}}*2). \end{aligned}$$

Thus, we obtain the image resize scale

$$\begin{aligned} s^* = s_{target}/s_{face}. \end{aligned}$$

By resizing the original image with the scale \(s^*\) and randomly cropping a standard-size \(640\times 640\) region containing the selected face, we get the anchor-sampled training data. For example, suppose the randomly selected face has size 140. Its nearest anchor size is 128, so we choose a target size from 16, 32, 64, 128 and 256. Suppose we select 32 and, for illustration, take the target size to be exactly 32; then we resize the original image by a scale of \(32/140 \approx 0.2286\). Finally, by cropping a \(640\times 640\) sub-image containing the originally selected face from the resized image, we get the sampled training data.
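The sampling of the resize scale \(s^*\) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, \(random(a, b)\) is assumed to mean a uniform draw from the interval, and the subsequent image resize and \(640\times 640\) crop are omitted.

```python
import random

# Anchor scales s_i = 2^(4+i) for i = 0..5, i.e. 16, 32, ..., 512.
ANCHOR_SCALES = [2 ** (4 + i) for i in range(6)]


def nearest_anchor_index(s_face):
    """i_anchor: index of the anchor scale closest to the face size."""
    return min(range(len(ANCHOR_SCALES)), key=lambda i: abs(ANCHOR_SCALES[i] - s_face))


def candidate_indices(s_face):
    """The set {0, 1, ..., min(5, i_anchor + 1)} from which i_target is drawn."""
    return list(range(min(5, nearest_anchor_index(s_face) + 1) + 1))


def sample_resize_scale(s_face, rng=random):
    """Draw s* = s_target / s_face (uniform sampling is an assumption)."""
    i_target = rng.choice(candidate_indices(s_face))
    s_i = ANCHOR_SCALES[i_target]
    s_target = rng.uniform(s_i / 2, s_i * 2)
    return s_target / s_face


candidates_for_140 = [ANCHOR_SCALES[i] for i in candidate_indices(140)]
scale = sample_resize_scale(140, random.Random(0))
```

For the face of size 140 used in the example, the nearest anchor index is 3 (scale 128) and the candidate target scales are 16, 32, 64, 128 and 256, so under these assumptions the sampled scale always lies between 8/140 and 512/140.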

As shown in Fig. 5, data-anchor-sampling changes the distribution of the training data in two ways: (1) the proportion of small faces becomes larger than that of large ones; (2) smaller face samples are generated from larger ones, which increases the diversity of face samples at smaller scales.

PyramidBox Loss. As a generalization of the multi-box loss in [13], the PyramidBox Loss function for an image is defined as

$$\begin{aligned} L(\{p_{k,i}\}, \{t_{k,i}\}) = \sum _k \lambda _k L_k(\{p_{k,i}\}, \{t_{k,i}\}), \end{aligned}$$
(2)

where the k-th pyramid-anchor loss is given by

$$\begin{aligned} L_k(\{p_{k,i}\}, \{t_{k,i}\}) = \frac{\lambda }{N_{k,cls}}\sum _{i_k}L_{k,cls}(p_{k,i},p_{k,i}^*) + \frac{1}{N_{k,reg}}\sum _{i_k}p_{k,i}^*L_{k,reg}(t_{k,i},t_{k,i}^*). \end{aligned}$$
(3)

Here k is the index of the pyramid-anchors (\(k = 0\), 1 and 2 represent the face, head and body, respectively, in our experiments), i is the index of an anchor, and \(p_{k,i}\) is the predicted probability of anchor i being the k-th object (face, head or body). The ground-truth label is defined by

$$\begin{aligned} p_{k,i}^* = \left\{ \begin{array}{lll} 1, &{} \text {if the anchor down-sampled by stride } s_{pa}^k \text { is positive},\\ 0, &{} otherwise. \end{array}\right. \end{aligned}$$
(4)

For example, when \(k=0\), the ground-truth label is equal to the label in Fast R-CNN [13]; when \(k\ge 1\), the corresponding label is determined by matching the down-sampled anchors with the ground-truth faces. Moreover, \(t_{k,i}\) is a vector representing the 4 parameterized coordinates of the predicted bounding box, and \(t_{k,i}^*\) is that of the ground-truth box associated with a positive anchor, which we define by

$$\begin{aligned} t_{k,i}^* = \Big( &t_x^* + \frac{1-s_{pa}^k}{2}t_w^*s_{w,k}+\varDelta _{x,k},\; t_y^* +\frac{1-s_{pa}^k}{2}t_h^*s_{h,k}+\varDelta _{y,k}, \\ &s_{pa}^kt_w^*s_{w,k}-2\varDelta _{x,k},\; s_{pa}^kt_h^*s_{h,k}-2\varDelta _{y,k}\Big) , \end{aligned}$$
(5)

where \({\varDelta _{x,k}}\) and \({\varDelta _{y,k}}\) denote shift offsets, and \(s_{w,k}\) and \(s_{h,k}\) are scale factors with respect to width and height, respectively. In our experiments, we set \(\varDelta _{x,k}=\varDelta _{y,k}=0,s_{w,k}=s_{h,k}=1\) for \(k<2\) and \({\varDelta _{x,2}}=0,{\varDelta _{y,2}}=t_h^*,s_{w,2}=\frac{7}{8},s_{h,2}=1\) for \(k =2\). The classification loss \(L_{k,cls}\) is the log loss over two classes (face vs. not face) and the regression loss \(L_{k,reg}\) is the smooth \(L_1\) loss defined in [13]. The term \(p_{k,i}^*L_{k,reg}\) means that the regression loss is activated only for positive anchors and disabled otherwise. The two terms are normalized by \(N_{k,cls}\) and \(N_{k,reg}\) and balanced by the weights \(\lambda \) and \(\lambda _k\) for \(k = 0, 1, 2\).
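Equations (2), (3) and (5) can be illustrated together with a small sketch. It rests on several assumptions that the text does not fix: the ground-truth box is read as (x, y, w, h) with (x, y) the top-left corner, \(N_{k,cls}\) is taken as the number of anchors and \(N_{k,reg}\) as the number of positive anchors, and all function names are hypothetical.

```python
import math

S_PA = 2  # stride of pyramid anchors, as set in the text


def pyramid_target(face, k, dx=0.0, dy=0.0, sw=1.0, sh=1.0):
    """Eq. (5) for a box (x, y, w, h); top-left (x, y) is an assumption."""
    x, y, w, h = face
    s = S_PA ** k
    return (x + (1 - s) / 2 * w * sw + dx,
            y + (1 - s) / 2 * h * sh + dy,
            s * w * sw - 2 * dx,
            s * h * sh - 2 * dy)


def smooth_l1(d):
    """Smooth L1 loss of Fast R-CNN for one coordinate difference."""
    d = abs(d)
    return 0.5 * d * d if d < 1 else d - 0.5


def branch_loss(probs, labels, preds, targets, lam=1.0):
    """Eq. (3) for one branch k, with assumed normalizers (see lead-in)."""
    n_cls = len(probs)
    n_reg = max(1, sum(labels))
    cls = sum(-math.log(p if y == 1 else 1 - p) for p, y in zip(probs, labels))
    reg = sum(y * sum(smooth_l1(a - b) for a, b in zip(t, t_star))
              for y, t, t_star in zip(labels, preds, targets))
    return lam * cls / n_cls + reg / n_reg


def pyramidbox_loss(branch_losses, lambdas):
    """Eq. (2): weighted sum of the per-branch losses."""
    return sum(l * b for l, b in zip(lambdas, branch_losses))


face = (100.0, 100.0, 16.0, 16.0)
face_t = pyramid_target(face, 0)
head_t = pyramid_target(face, 1)
body_t = pyramid_target(face, 2, dx=0.0, dy=face[3], sw=7 / 8, sh=1.0)
```

With the settings reported above, a 16 x 16 face at (100, 100) is left unchanged for \(k=0\), becomes a 32 x 32 box around the same center for \(k=1\), and becomes a 56 x 32 box for \(k=2\) under this reading of the coordinates.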

Optimization. For parameter initialization, PyramidBox uses the pre-trained parameters of VGG16 [35]. The parameters of \(conv \_fc \,6\) and \(conv \_fc \,7\) are initialized by sub-sampling the parameters of \(fc \,6\) and \(fc \,7\) of VGG16, and the other additional layers are randomly initialized with the “xavier” method of [36]. We use a learning rate of \(10^{-3}\) for the first 80k iterations, \(10^{-4}\) for the next 20k iterations and \(10^{-5}\) for the last 20k iterations on the WIDER FACE training set with batch size 16. We also use a momentum of 0.9 and a weight decay of 0.0005 [37].
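The step schedule described above can be written as a small lookup. This is only a sketch: the function name is hypothetical, and the zero-based iteration index and the exact boundary handling are assumptions.

```python
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005
BATCH_SIZE = 16
TOTAL_ITERATIONS = 120_000  # 80k + 20k + 20k


def learning_rate(iteration):
    """Piecewise-constant learning rate for a zero-based iteration index."""
    if iteration < 80_000:
        return 1e-3
    if iteration < 100_000:
        return 1e-4
    return 1e-5
```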

4 Experiments

In this section, we firstly analyze the effectiveness of our PyramidBox through a set of experiments, and then evaluate the final model on WIDER FACE and FDDB face detection benchmarks.

4.1 Model Analysis

We analyze our model on the WIDER FACE validation set by contrast experiments.

Baseline. Our PyramidBox shares the same architecture as \(\hbox {S}^{3}\hbox {FD}\), so we directly use it as the baseline.

Contrast Study. To better understand PyramidBox, we conduct contrast experiments to evaluate the contributions of each proposed component, from which we can get the following conclusions.

Low-Level Feature Pyramid Network (LFPN) Is Crucial for Detecting Hard Faces. The results listed in Table 1 show that an LFPN started from a middle layer, \(conv \,\_fc 7\) in our PyramidBox, is more powerful, which implies that features with a large gap in scale may not help each other. The comparison between the first and fourth columns of Table 1 indicates that LFPN increases the mAP by \(1.9\%\) on the hard subset. This significant improvement demonstrates the effectiveness of joining high-level semantic features with low-level ones.

Table 1. Performances of LFPN starting from different layers.

Data-Anchor-Sampling Makes Detector Easier to Train. We apply Data-anchor-sampling on top of the LFPN network, and the results show that it effectively improves performance. The mAP increases by \(0.4\%\), \(0.4\%\) and \(0.6\%\) on the easy, medium and hard subsets, respectively. Data-anchor-sampling therefore works well not only for small hard faces but also for easy and medium faces.

Table 2. The parameters of PyramidAnchors.

PyramidAnchor and PyramidBox Loss Is Promising. Comparing the first and last columns of Table 2, one can see that PyramidAnchor effectively improves performance, i.e., by \(0.7\%\), \(0.6\%\) and \(0.9\%\) on the easy, medium and hard subsets, respectively. This improvement shows that learning contextual information is helpful for detection, especially for hard faces.

Wider and Deeper Context Prediction Module Is Better. Table 3 shows that CPM performs better than both the DSSD module and the SSH context module. Note that the combination of SSH and DSSD gains very little compared to SSH alone, which indicates that a large receptive field is more important for predicting accurate location and classification. In addition, comparing the last two columns of Table 4, one can see that Max-in-out improves the mAP on the WIDER FACE validation set by about \(+0.2\%\) (Easy), \(+0.3\%\) (Medium) and \(+0.1\%\) (Hard), respectively.

Table 3. Context-sensitive Predict Module.

To conclude this section, we summarize our results in Table 4, from which one can see that the mAP increases by \(2.1\%\), \(2.3\%\) and \(\mathbf {4.7\%}\) on the easy, medium and hard subsets, respectively. This sharp increase demonstrates the effectiveness of the proposed PyramidBox, especially for hard faces.

Table 4. Contrast results of the PyramidBox on WIDER FACE validation subset.

4.2 Evaluation on Benchmark

We evaluate our PyramidBox on the most popular face detection benchmarks, including Face Detection Data Set and Benchmark (FDDB) [38] and WIDER FACE [39].

Fig. 6. Evaluation on the FDDB dataset.

FDDB Dataset. It has 5,171 faces in 2,845 images collected from the Yahoo! news website. We evaluate our face detector on FDDB against other state-of-the-art methods [4, 19, 21, 25, 30, 40,41,42,43,44,45,46,47,48,49,50,51,52,53,54]. PyramidBox achieves state-of-the-art performance, and the results are shown in Fig. 6(a) and (b).

Fig. 7. Precision-recall curves on WIDER FACE validation and test sets.

WIDER FACE Dataset. It contains 32,203 images and 393,703 annotated faces with a high degree of variability in scale, pose and occlusion. The database is split into training (\(40\%\)), validation (\(10\%\)) and testing (\(50\%\)) sets, where both the validation and test sets are divided into “easy”, “medium” and “hard” subsets according to the difficulty of detection. Our PyramidBox is trained only on the training set and evaluated on both the validation and testing sets in comparison with state-of-the-art face detectors such as [6, 20,21,22,23, 25,26,27,28, 30, 39, 40, 43, 51, 55, 56]. Figure 7 presents the precision-recall curves and mAP values. Our PyramidBox outperforms the others across all three subsets, i.e., 0.961 (easy), 0.950 (medium) and 0.889 (hard) on the validation set, and 0.956 (easy), 0.946 (medium) and 0.887 (hard) on the testing set.

5 Conclusion

This paper proposed a novel context-assisted single shot face detector, denoted PyramidBox, to handle the unconstrained face detection problem. We designed a novel context anchor, named PyramidAnchor, to supervise the face detector to learn features from contextual parts around faces. We also modified the feature pyramid network into a low-level feature pyramid network that combines high-level features with high-resolution ones, which is effective for finding small faces, and we proposed a wider and deeper prediction module to make full use of the joint features. In addition, we introduced Data-anchor-sampling to augment the training data and increase its diversity for small faces. The experiments demonstrate that these contributions lead PyramidBox to state-of-the-art performance on the common face detection benchmarks, especially for hard faces.