1 Introduction and motivation

There is a need for a model that automatically identifies face mask-wearing conditions and serves as a first step toward unmasking faces for face authentication in mobile devices and in security systems such as ATMs, banks, airport security checkpoints, and facial-biometric attendance systems.

Identifying face mask-wearing conditions is a challenging task: samples from different classes can be highly similar, while samples from the same class may differ considerably. In other words, there is great intra-class variation and small inter-class variation, which makes it difficult to learn discriminative features. Figure 1 depicts some samples from the three classes.

Fig. 1

Sample images from the MaskedFace-Net dataset. a Correctly Face Mask (CFM), b Incorrectly Face Mask (IFM), and c Not Face Mask (NFM) wearing. These samples show the challenge of the task: samples of different classes are highly similar, while those of the same class differ considerably

In this paper, a new method is developed for face mask-wearing identification, using well-known deep convolutional neural networks (CNNs) as feature extractors and a novel large margin piecewise linear (LMPL) classifier [1].

The proposed method consists of four main steps: image preprocessing, deep feature extraction, face mask-wearing classification, and face unmasking. It showed excellent performance in a computationally resource-limited environment, achieving 99.53% and 99.64% accuracy on the two classification tasks, respectively. Moreover, unmasking the masked faces showed promising results. It can be concluded that the proposed EfficientMask-Net method is effective for face mask-wearing identification as well as face unmasking, and it can therefore be used in many security systems for epidemic prevention and face authentication.

2 Related work

2.1 Masked face detection

Prasad et al. [2] proposed a lightweight model called “MaskedFaceNet” for real-time mask detection using a progressive semi-supervised approach. Fasfous et al. [3] presented BinaryCoP (Binary COVID-mask Predictor), a low-power binary neural network (BNN) classifier that performs classification on edge devices such as an embedded FPGA accelerator, to detect correct face mask-wearing and positioning. They used the MaskedFace-Net dataset with four classes (IMFD Nose and Mouth, IMFD Nose, IMFD Chin, and CMFD) and balanced the dataset with data augmentation techniques. An accuracy of up to 98% was obtained for the wearing-positioning problem.

2.2 Identification of face mask-wearing conditions

Cabani et al. [4] introduced the MaskedFace-Net dataset with 137,016 images. This large-scale dataset comprises the Correctly Masked Face Dataset (CMFD) and the Incorrectly Masked Face Dataset (IMFD), in which masked faces are created by applying a deformable model to the Flickr-Faces-HQ (FFHQ) face dataset. Qin et al. [5] proposed an image super-resolution and classification network (SRCNet), in which a super-resolution method was applied to improve performance on low-quality images. They classified face mask-wearing conditions into three classes (correct mask-wearing, incorrect mask-wearing, and no mask-wearing) and achieved an accuracy of 98.70%. Training and evaluation were performed on the public Medical Masks Dataset containing 3835 images.

2.3 Mobile-based face mask detection

Dey et al. [6] proposed a multi-stage, deep learning-based face mask detection method called “Mobile-Net Mask.” They used two datasets with 5200 images to detect masked or non-masked faces in still images and video streams, and Mobile-Net Mask reached an accuracy of 93%. Jiang et al. [7] presented RetinaFaceMask, a detector based on the one-stage RetinaNet, for high-accuracy face mask detection. The model uses ResNet or MobileNet as a backbone, along with a feature pyramid network (FPN) and context attention modules, and achieved a precision of 93.4%, higher than the baseline results.

As presented in this section, although researchers have introduced several approaches for identifying face mask-wearing conditions, a unified system that also supports face authentication is lacking. In this study, we developed a unified, efficient method for face mask-wearing identification together with unmasking of masked faces, which can be useful in authentication systems.

3 Materials and methods

This section describes the overall process of the proposed EfficientMask-Net method. Figure 2 shows a diagram of the proposed mask-wearing identification system.

Fig. 2

A schematic of the proposed EfficientMask-Net

3.1 Image preprocessing

Image preprocessing enhances the visual appearance of the images and results in higher accuracy of the detection system.

3.2 Resizing face images

The input images of EfficientNet were resized to \(224 \times 224 \times 3\) using bicubic interpolation.
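For illustration, a minimal Python sketch of this resizing step with Pillow follows; the paper's pipeline is implemented in MATLAB, so the library choice and the function name here are assumptions.

```python
from PIL import Image

def resize_for_efficientnet(path: str) -> Image.Image:
    """Load a face image and resize it to 224x224x3 with bicubic interpolation."""
    img = Image.open(path).convert("RGB")          # force 3 channels
    return img.resize((224, 224), Image.BICUBIC)   # bicubic, as in the paper
```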

3.3 Image adjustment

Real-world images exhibit considerable variation in contrast and exposure. The images were adjusted by mapping the input intensities to new values such that 1% of the pixel values are saturated at the low and high ends. In addition, the histogram of each image was computed to determine the adjustment limits automatically.
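This adjustment resembles MATLAB's stretchlim/imadjust pair. Below is a minimal NumPy sketch of such a percentile-based stretch, assuming 8-bit images; the function name is hypothetical.

```python
import numpy as np

def adjust_intensity(img: np.ndarray, sat: float = 1.0) -> np.ndarray:
    """Contrast-stretch an 8-bit image, saturating `sat` percent of the
    pixel values at each of the low and high ends of the histogram."""
    lo, hi = np.percentile(img, [sat, 100.0 - sat])   # automatic limits
    out = (img.astype(np.float64) - lo) / max(hi - lo, 1e-12)
    return (np.clip(out, 0.0, 1.0) * 255).astype(np.uint8)
```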

3.4 Deep feature extraction

Deep CNNs can extract high-level, abstract features. This study focused on a small network that is efficient in terms of computational power, and transfer learning was used to prevent overfitting and obtain better generalization.

EfficientNet was introduced by Tan and Le [8] in 2019. It is one of the most efficient CNN models among well-known pre-trained networks, with a small number of FLOPs. Compared with other models achieving similar ImageNet accuracy, EfficientNet is much smaller and faster; its authors showed that it is five times faster for inference on mobile devices [8].
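A minimal sketch of using a pre-trained EfficientNetB0 as a frozen deep feature extractor is shown below, here with torchvision rather than the MATLAB toolbox used in the paper; the 1280-dimensional output corresponds to torchvision's EfficientNet-B0 backbone.

```python
import torch
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

# ImageNet-pre-trained EfficientNetB0 used purely as a fixed feature extractor.
model = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1)
model.eval()

@torch.no_grad()
def extract_features(batch: torch.Tensor) -> torch.Tensor:
    """Map a (N, 3, 224, 224) batch to (N, 1280) deep feature vectors."""
    x = model.features(batch)   # convolutional backbone
    x = model.avgpool(x)        # global average pooling
    return torch.flatten(x, 1)  # features to feed the LMPL classifier
```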

3.5 Large margin piecewise linear (LMPL) classifier

The novel large margin piecewise linear (LMPL) classifier [1] is based on a cellular structure. First, a grid is imposed on the feature space: a set of random hyper-planes partitions the feature space into sub-partitions called cells, and each cell is assigned a class label based on the training instances it covers. The main challenge is tuning these initial hyper-planes. With respect to a given hyper-plane, training samples fall into three groups:

1) Normal: Ordinary samples, which are correctly classified on just one side of the hyper-plane. Their loss function is the hinge loss defined in (1):

$$ l(x)_{\text{Normal}^{(\tilde{y})}} = \max\left(0,\; 1 - \tilde{y}\left(w^{T}x + b\right)\right), \qquad \tilde{y} \in \{-1, +1\} $$
(1)

where \(\tilde{y}\) is the virtual label of sample \(x\), determining on which side of the hyper-plane \(x\) is correctly classified.

2) Negative don’t care: These samples are classified incorrectly on both sides of the hyper-plane. Their loss is defined in (2):

$$ l(x)_{\text{DontCare}^{-}} = \max\left(l(x)_{\text{Normal}^{(+1)}},\; l(x)_{\text{Normal}^{(-1)}}\right) $$
(2)
3) Positive don’t care: This group is the opposite of Negative don’t care: these samples are classified correctly on both sides of the hyper-plane. Their loss function is defined in (3):

$$ l(x)_{\text{DontCare}^{+}} = \min\left(l(x)_{\text{Normal}^{(+1)}},\; l(x)_{\text{Normal}^{(-1)}}\right) $$
(3)

Positive don’t care samples, which are always classified correctly, are ignored in this paper, mainly because the loss function in (3) is not convex. The objective function is therefore defined as in (4):

$$ \min_{w,\,b}\; \frac{1}{2}\left\| w \right\|^{2} + C_{1} \sum_{x \in \text{Normal}} l(x)_{\text{Normal}^{(\tilde{y})}} + C_{2} \sum_{x \in \text{DC}^{-}} l(x)_{\text{DontCare}^{-}} $$
(4)

The scalar values \(C_{1}\) and \(C_{2}\) control the balance between the structural and the empirical error. In this paper, both \(C_{1}\) and \(C_{2}\) were tuned experimentally and set to 1000.

The LMPL classifier optimizes each hyper-plane with a convex optimizer based on the objective function introduced above. After some iterations, the model converges to a set of hyper-planes that classify the samples of the different classes, and extra hyper-planes that do not contribute to the classification are removed. Thus, by removing redundant hyper-planes, the complexity of the model is tuned to the distribution and complexity of the decision boundaries, yielding an efficient large-margin approach.
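The NumPy sketch below illustrates objective (4) for a single hyper-plane together with one subgradient-descent step. It is only an illustrative reading of the loss definitions above, not the authors' implementation, and all function and variable names are hypothetical.

```python
import numpy as np

def lmpl_objective(w, b, Xn, y_virt, Xdc, C1=1000.0, C2=1000.0):
    """Objective (4) for one hyper-plane (w, b).

    Xn, y_virt: Normal samples and their virtual labels in {-1, +1}, Eq. (1).
    Xdc:        Negative don't-care samples, charged the larger of the two
                hinge losses, Eq. (2).
    """
    hinge_normal = np.maximum(0.0, 1.0 - y_virt * (Xn @ w + b))
    s = Xdc @ w + b
    hinge_dc = np.maximum(np.maximum(0.0, 1.0 - s), np.maximum(0.0, 1.0 + s))
    return 0.5 * w @ w + C1 * hinge_normal.sum() + C2 * hinge_dc.sum()

def lmpl_step(w, b, Xn, y_virt, Xdc, lr=1e-4, C1=1000.0, C2=1000.0):
    """One subgradient-descent step on objective (4)."""
    gw, gb = w.copy(), 0.0                      # gradient of the margin term
    # Normal samples with an active hinge contribute -y*x (and -y for b).
    act = 1.0 - y_virt * (Xn @ w + b) > 0.0
    gw -= C1 * (y_virt[act][:, None] * Xn[act]).sum(axis=0)
    gb -= C1 * y_virt[act].sum()
    # Negative don't-care: subgradient of the dominating hinge in Eq. (2),
    # which acts like a virtual label of +1 or -1 for each sample.
    s = Xdc @ w + b
    sign = np.where(np.maximum(0.0, 1.0 - s) >= np.maximum(0.0, 1.0 + s),
                    1.0, -1.0)
    gw -= C2 * (sign[:, None] * Xdc).sum(axis=0)
    gb -= C2 * sign.sum()
    return w - lr * gw, b - lr * gb
```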

3.6 Unmasking the face

Owing to its contactless nature, face-based biometric recognition is preferred, especially in the pandemic era. However, such systems are designed for non-occluded faces [9]. The proposed method was therefore designed to work with existing face authentication methods and to avoid retraining them on masked-face datasets. Most recent works have focused exclusively on the eye area [10] or on retraining existing methods on simulated masked faces [11].

1) Image segmentation

As the first step, faces were segmented into Mask and Non-Mask segments to determine the missing parts of the face. Figure 3a illustrates an example of an input masked face and the resulting segmented face.

Fig. 3

The steps of unmasking the faces. a A masked face and the resulting segmented face. b The 25 images generated by the GAN. c The selected synthetic face and the facial parts extracted from the mask area. d The unmasked face
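The paper does not detail how the Mask/Non-Mask segmentation of step 1 is obtained, so the sketch below is only a stand-in: it assumes a light-blue surgical mask and uses a simple HSV color threshold with OpenCV, and the threshold values are illustrative.

```python
import cv2
import numpy as np

def segment_mask_region(face_bgr: np.ndarray) -> np.ndarray:
    """Illustrative Mask / Non-Mask segmentation via an HSV color threshold.
    Returns a binary map in which nonzero pixels belong to the mask."""
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (90, 40, 40), (130, 255, 255))  # light-blue range
    kernel = np.ones((7, 7), np.uint8)
    # Morphological closing removes small holes and specks in the map.
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```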

2) Generating synthetic faces

A generative adversarial network (GAN) was trained on 15,000 real-world faces without face masks. Then, 25 synthetic faces were generated by the trained GAN to complete the masked faces, as shown in Fig. 3b.

3) Selecting the matched generated face

The distance between the masked face and each generated face was calculated at the pixel level using the normalized root-mean-square error (NRMSE), which ranges from 0 (identical) to 1 (completely different). The synthetic face with the smallest value was selected to complete the masked face; an example is shown in Fig. 3c.
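A minimal sketch of this selection step follows; it uses one common NRMSE convention (RMSE of images normalized to [0, 1]), and the helper names are hypothetical.

```python
import numpy as np

def nrmse(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized RMSE between two 8-bit images: 0 = identical."""
    a = a.astype(np.float64) / 255.0
    b = b.astype(np.float64) / 255.0
    return float(np.sqrt(np.mean((a - b) ** 2)))

def select_matched_face(masked_face, synthetic_faces):
    """Pick the synthetic face closest to the masked face at the pixel level."""
    scores = [nrmse(masked_face, s) for s in synthetic_faces]
    return synthetic_faces[int(np.argmin(scores))]
```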

4) Face completion

Facial parts of the mask area were extracted from the selected synthetic face to fill the missing parts of the masked face. An example of the final output of the proposed method is shown in Fig. 3d.
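A minimal compositing sketch of this final step, reusing the binary mask map from the segmentation step; the names are hypothetical.

```python
import numpy as np

def complete_face(masked_face: np.ndarray, synthetic_face: np.ndarray,
                  mask_region: np.ndarray) -> np.ndarray:
    """Fill the mask area of `masked_face` with the corresponding facial
    parts of the selected `synthetic_face` (aligned, same size)."""
    region = mask_region.astype(bool)       # binary map from segmentation
    out = masked_face.copy()
    out[region] = synthetic_face[region]    # copy facial parts into mask area
    return out
```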

Algorithm 1 shows the whole process of the proposed EfficientMask-Net method.

4 Experimental results

4.1 Experimental setup

All experiments were implemented using the deep learning and image processing toolboxes of MATLAB R2021a. A Core i7 4.00 GHz CPU with 24 GB of RAM was used to implement EfficientMask-Net. The Adam optimizer [12] with \(\beta_{1} = 0.9\), \(\beta_{2} = 0.999\), and \(\epsilon = 10^{-8}\) was used, and a weight decay of \(10^{-4}\) was applied for L2 regularization to avoid overfitting.

The network was trained for five epochs with a mini-batch size of 64. The initial learning rate was set to \(10^{-3}\) and was dropped by a factor of 0.1 every three epochs to speed up convergence. In addition, the training dataset was shuffled every epoch.
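For reference, an equivalent training configuration expressed in PyTorch is sketched below (the paper used MATLAB); the stand-in linear model and the interpretation of the drop factor as acting every three epochs are assumptions.

```python
import torch

model = torch.nn.Linear(1280, 3)   # stand-in for the fine-tuned network head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,          # initial LR
                             betas=(0.9, 0.999), eps=1e-8,
                             weight_decay=1e-4)                     # L2 term
# Drop the learning rate by a factor of 0.1 every three epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

for epoch in range(5):             # five epochs, shuffled mini-batches of 64
    # ... one pass over the shuffled training mini-batches goes here ...
    scheduler.step()
```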

Algorithm 1 The overall process of the proposed EfficientMask-Net method

In this study, two experiments were carried out for two different classification schemes:

I. Experiment 1: Correctly Face Mask (CFM), Incorrectly Face Mask (IFM), and Not Face Mask (NFM) wearing

II. Experiment 2: Uncovered Chin IFM, Uncovered Nose IFM, and Uncovered Nose and Mouth IFM


4.2 MaskedFace dataset

In this study, we combined the novel MaskedFace-Net and the well-known Flickr-Faces-HQ (FFHQ) datasets. FFHQ is an open-access, high-quality dataset of PNG images at \(1024 \times 1024\) resolution. The original FFHQ images were used as the Not Face Mask (NFM) class. The details of the class samples and the related experiments are listed in Table 1. In total, 14,783 and 4992 face images were used in Experiments 1 and 2, respectively. The complete dataset for each experiment can be found in the Zenodo repository (https://zenodo.org/record/4892677).

Table 1 Details of face image dataset

4.3 Experimental results and analysis

1) Performance Analysis

Several lightweight deep networks were compared, both as end-to-end networks and as feature extractors combined with the novel LMPL classifier (denoted CNN+), in terms of different metrics, as shown in Tables 2 and 3. In both experiments, EfficientNetB0 achieved the best results in both schemes, as an end-to-end network and as a feature extractor with the LMPL classifier (EfficientNetB0+).

Table 2 Comparison of deep CNNs as an end-to-end network and as a feature extractor with the proposed LMPL classifier (\({\text{CNN}}^{+}\)) in Experiment 1: Correctly Face Mask (CFM), Incorrectly Face Mask (IFM), and Not Face Mask (NFM) wearing
Table 3 Comparison of deep CNNs as an end-to-end network and as a feature extractor with the proposed LMPL classifier (\({\text{CNN}}^{+}\)) in Experiment 2: Uncovered Chin IFM, Uncovered Nose IFM, and Uncovered Nose and Mouth IFM

The novel LMPL was also compared with well-known classifiers. As illustrated in Tables 4 and 5, LMPL outperformed all the other classifiers on the performance metrics and achieved the best classification accuracy in both experiments.

Table 4 Comparison of well-known classifiers with the proposed LMPL classifier in Experiment 1: Correctly Face Mask (CFM), Incorrectly Face Mask (IFM), and Not Face Mask (NFM) wearing
Table 5 Comparison of well-known classifiers with the proposed LMPL classifier in Experiment 2: Uncovered Chin IFM, Uncovered Nose IFM, and Uncovered Nose and Mouth IFM
2) Statistical Analysis

The Friedman test is a popular statistical analysis for the simple, nonparametric, and safe comparison of three or more related samples. It makes no assumptions about the underlying data distribution. The test ranks the methods for each metric independently; \(R_{j}\) denotes the average rank of the \(j\)th method across the different metrics. Note that in the case of a tie, i.e., identical performance, the same ranks are assigned.
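As an illustration, the Friedman test can be run with SciPy; the per-metric scores below are hypothetical placeholders for three compared methods.

```python
from scipy.stats import friedmanchisquare

# Hypothetical scores of three methods on four metrics (rows must align).
method_a = [99.1, 98.7, 99.0, 98.9]
method_b = [99.3, 99.0, 99.2, 99.1]
method_c = [99.5, 99.4, 99.6, 99.5]

stat, p = friedmanchisquare(method_a, method_b, method_c)
print(f"Friedman chi-square = {stat:.3f}, p-value = {p:.4f}")
# A small p-value indicates a significant difference among the methods.
```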

As can be seen in Tables 2, 3, 4 and 5, the novel LMPL improved the performance metrics significantly and obtained the best average ranks in all cases. These tables reveal a significant difference between the efficiency of the different methods.

3) Visual Analysis

The gradient-weighted class activation mapping (Grad-CAM) technique [15] was used for detailed visual analysis; it provides a visualization of the deep features extracted by the fine-tuned EfficientNetB0, as shown in Fig. 4. Grad-CAM interprets deep CNN predictions and checks whether the CNN is focusing on the right parts of the input image, with prediction regions investigated using heat maps. The spatial parts with the greatest impact on the network score were identified by Grad-CAM heat mapping. The standard jet color map was used, in which red and yellow indicate regions with a high contribution to correct predictions and blue denotes regions with a low contribution. As can be seen, the fine-tuned EfficientNetB0 identified the regions that are effective for the classification predictions well.

Fig. 4

Grad-CAM visualization results for face images of the different classes. a Correctly Face Mask (CFM), b Incorrectly Face Mask (IFM), and c Not Face Mask (NFM) wearing. (Original images are shown in Fig. 1)
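A hook-based Grad-CAM sketch in PyTorch for an EfficientNetB0-style network is given below; it is an illustrative re-implementation of the technique in [15], not the paper's MATLAB code, and the untrained stand-in model and hooked layer are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import efficientnet_b0

model = efficientnet_b0()          # stand-in; the paper fine-tunes this network
model.eval()
feats, grads = {}, {}

# Capture activations and gradients of the last convolutional block.
layer = model.features[-1]
layer.register_forward_hook(lambda m, i, o: feats.update(act=o.detach()))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0].detach()))

def grad_cam(x: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Heat map of the regions that drive the score of `class_idx`."""
    score = model(x)[0, class_idx]
    model.zero_grad()
    score.backward()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # GAP of gradients
    cam = F.relu((weights * feats["act"]).sum(dim=1))     # weighted feature sum
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:],
                        mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()    # normalize to [0, 1]

heat = grad_cam(torch.randn(1, 3, 224, 224), class_idx=0)  # toy input
```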

4) Comparison with State-of-the-Art Studies

According to Table 6, the developed method showed superior performance compared with several recent studies. It can be concluded that the proposed EfficientMask-Net can be useful in face mask-wearing monitoring systems, especially in public places, to control the spread of coronavirus, as well as in face authentication services on lightweight devices such as mobile phones.

Table 6 Comparison of the proposed method with state-of-the-art deep models in face mask detection (CFM = correctly face mask-, IFM = incorrectly face mask-, NFM = not face mask-wearing)

5 Conclusion and future work

The proposed EfficientMask-Net model is lightweight and requires few computational resources. Hence, it can be useful in real-time systems that identify mask-wearing conditions in public places for epidemic prevention. Two experiments were conducted to evaluate the proposed method against various deep CNNs. EfficientNetB0 with the novel LMPL classifier showed the best average accuracy in both experiments, 99.53% and 99.64%, respectively. Face unmasking was also performed on masked faces and showed promising results that can be useful in face authentication systems.

In the future, the proposed method can be extended to real-world masked-face datasets. To improve face unmasking, existing methods for face completion under occlusion can be applied to masked faces. In addition, the impact of unmasking on existing face recognition methods can be investigated.