YuNet: A Tiny Millisecond-level Face Detector

Great progress has been made toward accurate face detection in recent years. However, heavy models and expensive computation make it difficult to deploy many detectors on mobile and embedded devices, where model size and latency are highly constrained. In this paper, we present a millisecond-level anchor-free face detector, YuNet, which is specifically designed for edge devices. It makes several key contributions to improving the efficiency-accuracy trade-off. First, we analyse influential state-of-the-art face detectors from recent years and summarize rules for reducing model size. Then, a lightweight face detector, YuNet, is introduced. Our detector contains a tiny and efficient feature extraction backbone and a simplified pyramid feature fusion neck. To the best of our knowledge, YuNet achieves the best trade-off between accuracy and speed: it has only 75 856 parameters, less than 1/5 of the parameter count of other small-size detectors. In addition, a training strategy is presented for the tiny face detector, which effectively trains models with the same sample distribution as the training set. The proposed YuNet achieves 81.1% mAP (single-scale) on the WIDER FACE validation hard track with high inference efficiency (Intel i7-12700K: 1.6 ms per frame at 320 × 320). Because of its unique advantages, the repository for YuNet and its predecessors has been popular on GitHub and has gained more than 11K stars at https://github.com/ShiqiYu/libfacedetection


Introduction
Face detection has been an attractive topic in computer vision for decades. It is heavily depended upon as a prerequisite step for many face-related applications such as face recognition, face beautification, face alignment, face tracking, etc. Given an image, face detection locates the face regions with bounding boxes. Many methods have been proposed to improve face detection performance, from early hand-crafted features such as Haar features in [1] to current CNN-based features. As described in [2], the runtime of two-stage or multi-stage detectors depends on the number of faces. Therefore, single-stage CNN-based detectors have become popular in recent years.
Face detection is less challenging than generic object detection, and accuracy is reaching saturation on the challenging WIDER FACE [3] benchmark. Some people may therefore think face detection is a solved problem. However, it is not. The top-ranked methods [4-11] all use large pre-trained backbone networks, complex feature enhancement modules and heavy test time augmentations (TTAs) for better ranks [2]. For example, one of the best detectors, MogFace [10], achieves state-of-the-art accuracy with 711 M parameters and 808 GFLOPs (for VGA images). The impressive accuracy comes at the cost of considerable storage and computation resources.
However, face detection is widely deployed on edge devices such as cell phones, service robots, surveillance cameras and Internet of things (IoT) devices in real-world applications. These devices have limited storage and computing capability due to their cost. In addition, in many applications only a few noticeable faces need to be detected, and tiny faces in the background are normally not needed. Even when deployed on a central server, a fast and efficient detector can save considerable energy and enable the server to handle more data concurrently. Compared with a huge face detector that improves average precision (AP) only slightly on some benchmarks, we argue that an efficient tiny detector is more urgently needed.
The backbone network of a face detector is essential for performance. Some popular backbone networks, such as VGG-16 from the VGGNet [12] series, ResNet-50/101/152 from the ResNet [13] series and MobileNet [14], were originally designed for image classification on ImageNet [15]. As shown in Fig. 1, face detection is different from image classification, which takes the output of the deepest layer as the feature vector. To handle objects of different scales, feature maps from different layers are employed for detection. Large faces are easier to detect due to the richness of their information, and they are normally detected from deeper feature maps. This gives a strong hint that the backbone should focus on small faces in face detection.
We should also note the distribution of face sizes. In the WIDER FACE dataset, most faces are small ones, smaller than 20 pixels, and it is similar in many face-related applications. Many data augmentation operations, especially random cropping, change the distribution of face sizes. If we train a model with a dataset of a different distribution (distributions A, B and C in Fig. 2), the AP decreases noticeably: the further from the original distribution, the lower the AP.
A tiny millisecond-level face detector, YuNet, is designed and presented in the remainder of this paper. The contributions of the paper are listed as follows.
1) Based on our unique understanding of face detection, we designed a tiny face detector with a very limited number of parameters, very low latency and promising accuracy.
2) We propose a data sampling strategy for model training. It can noticeably improve the accuracy of a deep detector, especially a lightweight one.
3) To the best of our knowledge, the proposed YuNet is the best tiny face detector; it achieves an AP of 81.1% on the WIDER FACE validation hard set and has gained more than 11K stars on GitHub for its effectiveness.

Related works
Face detection is a popular topic in object detection and is also very mature for real applications. In the past decade, deep learning-based face detection has learned to handle face scale, pose, occlusion, expression, makeup, illumination, blur, etc., very well. Some benchmarks, such as WIDER FACE [3], have been widely used to evaluate different methods and have promoted research.
As introduced in [2], the latency of a two-stage face detector varies with the number of faces in an image, so single-stage detectors have become more popular in recent years. Some recent single-stage detectors are as follows. Najibi et al. [16] build three detection modules cooperating with context modules for scale-invariant face detection. RetinaFace [9] employs additional facial landmark annotations to improve hard face detection. Li et al. [7] introduce small-face supervision signals on the backbone, which implicitly boosts the performance of pyramid features. Zhang et al. [17, 18] adopt neural architecture search (NAS) on feature enhancement modules and face-appropriate necks, respectively, for efficient context enhancement and multiscale feature fusion. Zhang et al. [5], Chi et al. [6], Tang et al. [19], and Liu et al. [20] work on different anchor sampling/matching strategies to balance the proportion of positive and negative samples, match outer faces with high-quality anchors and accelerate model convergence. All these methods achieve extremely high accuracy by employing techniques such as complicated feature enhancement modules, sophisticated anchor matching/alignment, and training strategies. The expensive cost and intolerable latency hinder their application in real-world scenarios.
Many efficient face detection methods have been developed to address practical applications. YOLO5Face, proposed by Qi et al. [21] and inherited from the YOLOv5 [22] generic object detector, adds facial key point labels and optimized data augmentation. Guo et al. [23] introduce sample redistribution (SR) to augment training samples and computation redistribution (CR) to reallocate computation across different components (backbone, neck, and head) and a broad range of computing regimes. Feature fusion is a key technology for improving feature representation, and it is widely used in face detection and other tasks. For example, gOctConv, proposed by Gao et al. [24], fuses both in-stage and cross-stage multiscale features. These works design models to meet resource constraints. However, we believe the models can be even smaller and faster.

RetinaFace [9] is a recent detector that achieves excellent accuracy on the WIDER FACE [3] benchmark. The whole network can be divided into three components, i.e., backbone, neck, and head. The backbone consists of ResNet-50 [13] without the adaptive average pooling layer and the fully connected layer, and outputs feature maps of 1/8, 1/16, and 1/32 of the input resolution, respectively. The neck is a standard feature pyramid network (FPN) [25], which consists of a combination of lateral and vertical paths. The head consists of multiple cascaded feature enhancement modules (FEMs) and a convolution for output classification and regression. The RetinaFace model has 27.27 M parameters and 11.07 GFLOPs with an input of size 320 × 320.

Methodology
Before introducing the proposed YuNet, some analysis and design principles are given first. By analysing the relationship among model size, computational cost and speed, we can get some ideas on how to design a good backbone for face detection. In this section, we take RetinaFace [9] as an example to analyse how to design a good detector and then introduce our YuNet. Most CNN-based face detectors follow a similar design to RetinaFace.

Analysis on different layers

Number of parameters of different layers
A tiny face detector has many advantages. In addition to its fast inference speed, it is also easy to deploy on edge devices with limited random access memory (RAM) and storage. More parameters can bring better detection accuracy; however, we have to consider which layers deserve more parameters.
Most parameters in a CNN are in convolutional layers. For a standard convolution, the number of parameters is K × K × Cin × Cout, where K, Cin and Cout represent the convolution kernel size (generally 3 or 5), the number of input channels and the number of output channels, respectively. Obviously, the number of parameters does not depend on the size of the feature map, but is correlated with the size of the convolution kernel and the numbers of channels.
The numbers of parameters of the convolutional layers in RetinaFace are shown in Fig. 3 as blue bars. The number of parameters increases exponentially, and the deepest layer, Layer 4, contributes 63.55% of the parameters. The reason is that the number of channels increases greatly. This also shows that we should reduce the number of channels of some deep layers if we want to reduce the model size.

Computational cost of different layers
The number of floating point operations (FLOPs) is a widely used measure of computational cost. We can also use it to evaluate different layers. Since the convolutional layers contribute more than 90% of the computational cost of most CNN models, we only consider the FLOPs of the convolutional layers.
For a standard convolution, FLOPs = K × K × Cin × Cout × Hout × Wout, where Hout and Wout are the height and width of the output feature map, respectively.
The computational costs of different layers are plotted as orange bars in Fig. 3. We find that Layer 1 contributes 16.34% of the total computational cost, but only 0.92% of the parameters. Layer 4 contributes 19.8% of the computational cost, but 63.66% of the parameters. Fig. 3 shows that the computational cost is not highly correlated with the number of parameters.
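The two counts above can be checked with a short script. The layer shapes below are illustrative, not RetinaFace's exact configuration; they assume 3 × 3 kernels and show how a deep, channel-heavy layer can dominate the parameter count while contributing a share of FLOPs similar to an early, high-resolution layer.

```python
def conv_params(k, c_in, c_out):
    # Parameters of a standard convolution: kernel area times input and output
    # channels (bias terms omitted for simplicity).
    return k * k * c_in * c_out

def conv_flops(k, c_in, c_out, h_out, w_out):
    # Multiply-accumulate count: every output position applies the full kernel.
    return k * k * c_in * c_out * h_out * w_out

# Hypothetical 3x3 layers at two depths of a ResNet-like backbone (640x640 input):
# an early high-resolution layer with few channels, and a deep low-resolution
# layer with many channels.
early = (conv_params(3, 64, 64), conv_flops(3, 64, 64, 160, 160))
deep = (conv_params(3, 512, 512), conv_flops(3, 512, 512, 20, 20))

print(early)  # (36864, 943718400)
print(deep)   # (2359296, 943718400)
```

The deep layer holds 64 times more parameters yet costs exactly the same FLOPs, because the 8× smaller feature map cancels the 8× larger channel counts.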

Contributions of different layers
From the red line and the light blue line in Fig. 3, Layer 2 contributes more than half of the candidates for face detection, but Layer 4 contributes less than 10%, even though it has 63.55% of the parameters and 19.8% of the computational cost. Small faces, which are predicted from Layer 2, should be given more emphasis than large faces. Large faces are easier to detect due to their rich information, so it is not necessary to have too many channels in Layer 4 or deeper layers.

YuNet
According to the previous analysis, we designed a tiny network for face detection. One principle is to focus on difficult small faces and remove computational cost from easy large faces. Another is to use depthwise convolution and pointwise convolution to replace standard convolution. The architecture of the proposed YuNet is shown in Fig. 4; it contains a backbone, a tiny feature pyramid network (TFPN) neck and a head.

Backbone
The backbone is the main part of the network and is used to extract features for detection. It must be efficient and lightweight. To deploy convolutional neural networks on edge devices, efficient units with fewer parameters and faster speed are expected. Depthwise separable convolution, originally from MobileNet [14], is a form of factorized convolution that factorizes a standard convolution into a depthwise convolution and a pointwise convolution. With 3 × 3 kernels, depthwise separable convolution needs only approximately 1/9 to 1/8 of the computational and parametric costs of standard convolution, while the accuracy may decrease slightly. The DWUnit in YuNet is a depthwise separable convolution followed by a batch normalization and an activation layer, and it is the main module in the proposed network. Another module is DWBlock, which contains two DWUnits. Their designs are presented in the top-right corner of Fig. 4.
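The 1/9 to 1/8 cost ratio can be verified with a few lines of arithmetic: for K × K kernels the ratio is 1/Cout + 1/K², which approaches 1/9 as the channel count grows when K = 3. The channel counts below are illustrative.

```python
def standard_conv_params(k, c_in, c_out):
    # Standard convolution: every output channel sees every input channel.
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    # Depthwise (k x k per input channel) plus pointwise (1 x 1 across channels).
    return k * k * c_in + c_in * c_out

for c in (64, 256):
    ratio = dw_separable_params(3, c, c) / standard_conv_params(3, c, c)
    print(f"{c:3d} channels: ratio = {ratio:.4f}")  # 0.1267 and 0.1150
```

The same arithmetic applies to FLOPs, since both layers sweep the same output feature map.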

Neck
The neck fuses multiscale features into a higher level of features. In our YuNet, the TFPN neck takes advantage of FPN and depthwise separable convolution. We can consider the feature map from Stage 2 as low-level features and the one from Stage 4 as high-level features because of their different depths. One of the pioneering works, FPN [25], introduces a top-down pathway and lateral connections to combine multiscale features. The top-down pathway in FPN generates higher-resolution features by upsampling feature maps from higher pyramid levels. The lateral connections combine feature maps of the same spatial size from the backbone and the top-down pathway. Fig. 5(a) shows how FPN fuses multiscale features with three operations: a channel adjustment operation for dimension matching, an upsampling operation for resolution matching, and usually a convolutional operation for feature processing. The 1 × 1 convolution, 2× up and 3 × 3 convolution in Fig. 5(a) correspond to these three operations.
However, the standard convolutional layers in FPN have a heavy computational cost and many parameters to train. The TFPN neck uses depthwise separable convolution to replace the standard convolution. As shown in Fig. 5(b), a DWUnit replaces the Conv 3 × 3 of FPN. DWUnit is the dominant module in both the backbone and the neck. It reduces the number of parameters to about 12% of that of FPN, and the computational cost is also greatly reduced.
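A shape-level sketch of the top-down fusion may help. The numbers below are assumptions for illustration: a 640 × 640 input, strides of 8/16/32 for Stages 2-4, and 64 channels after lateral alignment.

```python
# Shape-level sketch of top-down fusion in an FPN-style neck.
# Each entry is (channels, height, width) for backbone stages 2-4 at
# strides 8/16/32, assuming a 640x640 input and 64 aligned channels.
feats = {2: (64, 80, 80), 3: (64, 40, 40), 4: (64, 20, 20)}

def upsample2x(shape):
    c, h, w = shape
    return (c, h * 2, w * 2)

# Top-down pathway: start from the deepest stage and fuse downward.
fused = {4: feats[4]}
for i in (3, 2):
    up = upsample2x(fused[i + 1])
    assert up == feats[i], "lateral and upsampled shapes must match before addition"
    fused[i] = feats[i]  # elementwise addition keeps the shape

print(fused)  # {4: (64, 20, 20), 3: (64, 40, 40), 2: (64, 80, 80)}
```

In the real network each fused map additionally passes through a DWUnit before reaching the head; the sketch only tracks shapes.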

Head
We adopt an anchor-free mechanism to construct the head of YuNet. Compared to the anchor-based RetinaFace, we reduce the candidates for each location from more than 2 to 1 and make them directly predict the four values of a face location: the two coordinates of the top-left corner, and the height and width of the box. Inspired by Ge et al. [26], we employ simple optimal transport assignment (simOTA) [27] for positive anchor matching. For any matching candidate, we compute the intersection over union (IOU) between the predicted bounding box and the ground truth as the soft label, and then use CrossEntropyLoss to compute the classification loss. The losses of bounding box regression, landmark regression and objectness are extended IoULoss (EIoU) [28], SmoothL1Loss and CrossEntropyLoss, respectively. In the training phase, we minimize the multitask loss:

L = (1/N) (α1 Lcls + α2 Lbox + α3 Llm + Lobj)

where N indicates the total number of positive samples, and Lcls, Lbox, Llm and Lobj are the classification, bounding box regression, landmark regression and objectness losses. The hyper-parameters α1, α2 and α3 are recommended to be 1.0, 5.0 and 0.1, respectively.
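As a sketch, the soft label and the weighted sum can be written as below. Note that the exact grouping of the loss terms under each weight is our reading of the description, not a verbatim reproduction of the released implementation.

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); the IOU with the ground truth is used as the
    # soft classification label for a matched candidate.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def total_loss(l_cls, l_box, l_lm, l_obj, n_pos, a1=1.0, a2=5.0, a3=0.1):
    # Weighted multitask loss normalized by the number of positive samples.
    # Which alpha attaches to which term is an assumption here.
    return (a1 * l_cls + a2 * l_box + a3 * l_lm + l_obj) / n_pos

print(f"{iou((0, 0, 2, 2), (1, 0, 3, 2)):.4f}")  # 0.3333
```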

Due to the extremely large scale variations (from several pixels to thousands of pixels) of faces in real-world scenarios, different scale augmentation strategies are employed to adjust the sample scale distribution in the training phase. The most popular scale augmentation strategies are RandomCrop and its variants. Given an image, a patch is cropped from the original image, whose size is the product of the short edge and a scale randomly selected from a predefined set of scales. Then, the patch is resized to 640 × 640 for tensor alignment of an entire batch. The predefined scale set changes the scale distribution of the training samples. A commonly used value range is [0.3, 1.0]. To generate more positive tiny samples, SCRFD, proposed by Guo et al. [23], enlarges the random size range to [0.3, 2.0]. When the scale is greater than 1.0, the cropped box extends beyond the original image, and the regions out of range are filled with the average RGB values. Moreover, they present a searchable zoom-in and zoom-out space to search for the optimal range set of scales under the evaluation metric of AP on WIDER FACE. With enough rounds of searching, this algorithm can finally obtain the optimal set of scales achieving the best AP on the WIDER FACE validation dataset.
Normally, we do not know the face scale distribution in real scenarios, and it is impossible to provide the optimization criteria for the search algorithm. We explicitly point out that a consistent sample scale distribution between testing and training makes performance optimal. Intuitively, in real-world scenarios, if there are primarily large faces (as in the Easy track of WIDER FACE), we can increase the proportion of small scales in the range set. With such a simple and intuitive strategy, our model is more accessible for deployment in various real-world scenarios and achieves optimal performance. Besides scale augmentation, we only adopt random horizontal flipping with a probability of 0.5.
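The cropping step can be sketched as follows. Sampling the scale continuously from the range and the exact rounding are our assumptions, since the text only specifies the scale range itself.

```python
import random

def sample_crop_size(short_edge, scales=(0.5, 1.5)):
    # Pick a scale uniformly from the predefined range; the square crop size is
    # the short edge times that scale. Scales > 1.0 reach beyond the image, and
    # the out-of-range region is filled with the mean RGB value before the
    # patch is resized to the fixed training resolution.
    s = random.uniform(*scales)
    return round(short_edge * s)

random.seed(0)
print([sample_crop_size(720) for _ in range(3)])
```

Changing the `scales` tuple is all it takes to shift the training sample distribution toward smaller or larger faces.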

Training details
To conduct experiments in a more organized manner, we implement the proposed YuNet with PyTorch and the open-source MMDetection [29]. We adopt the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a weight decay of 0.000 5, and a batch size of 16 × 2 on two NVIDIA 1080Ti (12 GB) GPUs. The learning rate is linearly warmed up from 0.001 to 0.01 within the first 1 500 iterations. We adopt the StepLR scheduler to make the learning rate decay by a factor of 10 at the 400th and 544th epochs. Without any pretraining, the model can be well trained from scratch in 560 epochs.
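The schedule above might look like the following MMDetection-style fragment. Field names follow common MMDetection conventions and are not copied from the released config.

```python
# Hypothetical MMDetection-style snippet mirroring the training schedule
# described above; treat it as a sketch, not the official configuration.
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=5e-4)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=1500,
    warmup_ratio=0.001 / 0.01,  # warmup starts at lr = 0.001
    step=[400, 544])            # decay by 10x at these epochs
runner = dict(type='EpochBasedRunner', max_epochs=560)
```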

Dataset
WIDER FACE [3] is the largest public face detection dataset and has 32 203 images and 393 703 faces. The images are split into three subsets for training, validation and testing. Each subset is divided into three levels of difficulty: Easy, Medium, and Hard. Its large variety of scale, pose, occlusion, expression, illumination and event is close to reality and very challenging. Furthermore, we empirically analyse the annotations and observe that Hard covers Medium and Easy, which indicates that performance on Hard better reflects the effectiveness of different methods. In the following experiments, we pay more attention to performance on Hard.

Evaluation on WIDER FACE

To make a comprehensive evaluation in terms of model complexity, inference latency, and detection accuracy, we collect some recent methods with similar research purposes according to two requirements: 1) the model size is less than 3 MB, and the computation is less than 1 GFLOPs (640 × 640 input resolution); and 2) the source code has been released, and trained models have been provided. The following state-of-the-art face detectors are collected for comparison: SCRFD [23], RetinaFace [9] and YOLO5Face [21]. Some good detectors, such as BlazeFace [30] and FaceBoxes [31], are not included because they have no officially released source code. The reported results of SCRFD, RetinaFace and YOLO5Face were obtained under different test conditions. For a fair comparison, we do not refer to their published results but evaluate all the detectors in our own experiments, as follows.
1) All compared models are converted to ONNX format, and inference is performed on the CPU using ONNXRuntime. We reimplement the preprocessing and postprocessing with NumPy, referring to the official code.
2) Without any test time augmentation (TTA) tricks (e.g., image flip, image pyramid, model ensemble, etc.), we use 320 × 320, 640 × 640 and the original size (approximately 1 000 pixels) as the input resolutions to evaluate the models on the WIDER FACE validation subset. We set the confidence threshold close to zero (0.01) during evaluation to obtain the best mAP, although this causes a large number of false detections. The NMS threshold is fixed at 0.45.
3) We only evaluate inference efficiency on the CPU (Intel i7-12700K) because most mobile and embedded devices have a CPU but no GPU. The evaluation results are obtained by averaging the total inference time spent testing the whole WIDER FACE validation dataset.
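The latency measurement boils down to averaging wall-clock time over repeated calls. The sketch below uses a stand-in workload where an ONNXRuntime `session.run` call on a preprocessed frame would go.

```python
import time

def average_latency(run_once, n_loops=1000):
    # Average wall-clock latency over n_loops invocations, mimicking the
    # protocol of looping over inputs and averaging the total inference time.
    start = time.perf_counter()
    for _ in range(n_loops):
        run_once()
    return (time.perf_counter() - start) / n_loops

# Stand-in workload; replace with the actual model inference call.
ms = average_latency(lambda: sum(range(1000)), n_loops=100) * 1000
print(f"{ms:.3f} ms per call")
```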

The comparisons are listed in Table 1, and the best results are highlighted. The proposed YuNet has the fewest parameters and the lowest computational cost. YuNet has only 75 856 parameters and is almost an order of magnitude smaller than most other small models in Table 1. In terms of latency, YuNet is several times faster than most other models. The accuracy of YuNet, evaluated by AP on the Easy, Medium and Hard sets, is similar to that of the other models. In short, YuNet achieves accuracy similar to most other small models, but with far fewer parameters and much higher speed.
In addition, we run detection on the world's largest selfie, shown in Fig. 6, with the input resolution being the original size and the confidence threshold being 0.5. Our YuNet accurately detects 619 of the approximately 1 000 faces reported. Analysis of the face scales reveals that the undetected faces contain only a few pixels, with features blurred or even unrecognizable to the human eye. In contrast, faces that are obvious and recognizable are accurately located.

Ablation study
To better understand our YuNet, we further conduct experiments to examine how adding or removing some components impacts performance, and present the comparison in Table 2. Some modules are removed from YuNet (the symbol -) and some modules are strengthened (the symbol +) to discover the functions of different modules. The first row shows a noticeable accuracy drop compared to the others. This indicates that the deep feature maps from Stages 3 and 4 are indispensable even though there are few large faces. The second row illustrates that the proposed TFPN neck can boost accuracy by 1-2 percent without adding any parameters. The fourth row shows that the convolution in the FPN serves channel alignment and can be discarded since we have already aligned the channels to 64. We increased the number of channels of Stages 3 and 4 in the fifth row, and the results show that the accuracy is improved by only approximately 1%. In addition, YuNet-s, implemented by halving the number of channels in Stage 1, reduces the computational cost by half; the accuracy drops, but not obviously. YuNet-s can be an excellent face detector with satisfying accuracy for applications with very limited computational resources.

Table 1
Comparison of YuNet with other well-known methods. YOLO5Face [21] does not participate in the comparison at the original size since its exported ONNX model does not support dynamic-size input. YuNet-s, RetinaFace and SCRFD-10g are not involved in the comparison of values.
Another ablation study is on the sample scale distribution. Different data augmentation methods produce different distributions. We list four augmentation methods and their scale distributions in Fig. 7; their corresponding results are listed in Table 3. The best range is [0.5, 1.5], and its scale distribution (the cyan line in Fig. 7) is the closest to the original distribution (the red line). Therefore, we chose [0.5, 1.5], which has a distribution very close to that of the original training samples, in our experiments.

Inference efficiency
As presented in Table 1, YuNet demonstrates superior inference efficiency compared to all other detectors across all resolutions, thanks to its compact backbone, innovative TFPN neck, and straightforward detection head. We also study the speed at other resolutions, and the results are shown in Table 4. The test conditions are consistent with those mentioned in Section 5.2, except that we loop through an image 1 000 times instead of the entire validation subset. Our YuNet and YuNet-s can run at considerable real-time speed at all listed resolutions, even for 1280 × 960 images.

Conclusions
In this paper, an efficient tiny face detector, YuNet, is specifically designed for real-time applications. It achieves millisecond-level speed on CPUs and is suitable for mobile and embedded devices. The design of YuNet is inspired by principles for efficient small models. We studied different components and strategies (backbone, neck, head, training, etc.) of face detection to make a good trade-off between computational cost and accuracy. To the best of our knowledge, YuNet should be the smallest and simplest model for face detection. In the future, we hope to continue to reduce the size of the model and improve its speed while keeping the accuracy unchanged.

Fig. 1
Fig. 1 To handle faces of different sizes, large faces are normally detected from a deeper feature map and small faces from a shallower one, since a pixel on different feature maps has a different receptive field.

Fig. 2
Fig. 2 If we train a face detector with datasets of different distributions (A in red, B in green, and C in blue), the AP tends to decrease as the distribution moves further away from the original distribution.

Fig. 3
Fig. 3 Numbers of parameters and computational costs of different convolutional layers in RetinaFace's backbone (ResNet-50). The red line and the light blue line indicate the predicted candidates of Layers 2-4 on WIDER FACE under two conditions: with original image sizes, and with images resized so that the long edge is 2 048.

The backbone consists of 5 stages. Stage 0, a ConvHead module, contains a standard convolution layer with 3 × 3 kernels and a stride of 2. Stage 0 is followed by Stage 1, which contains a max-pooling layer, a ReLU and two DWBlocks. The first two stages, Stage 0 and Stage 1, reduce the feature maps to 1/4 of the input size and increase the channels from 3 to 64. The remaining three stages, Stages 2-4, use exactly the same network structure and feed their hierarchical output feature maps to the TFPN neck.

Fig. 4
Fig. 4 Architecture of YuNet. It consists of a backbone, a TFPN neck and a head, which are all based on depthwise separable convolution. "ConvHead, 16" indicates a module named ConvHead with 16 output channels. All the head outputs are concatenated together to produce detection results via non-maximum suppression (NMS).

Fig. 7
Fig. 7 Scale distributions of different augmentation methods. The vertical axis indicates the proportion of boxes at each scale, and the horizontal axis indicates the scale value.

Table 4
Inference efficiency of YuNet at various commonly used image sizes. The experiments were carried out with ONNXRuntime on an Intel i7-12700K CPU. The latency is the average over a loop of 1 000 runs.

Table 2
Ablation study of the proposed modifications. In the fourth column, "1" means that the detection head consists of one DWUnit, while "2" means it consists of two DWUnits, i.e., one DWBlock.

Table 3
Ablation study of different sample scale distributions. * denotes the best performance, which is adopted by YuNet and YuNet-s for comparison.