1 Introduction

Since its outbreak, COVID-19 has spread to every continent and become a global pandemic. Besides causing serious harm to people’s lives and health, COVID-19 has had a major impact on the economy, society and politics [1]. The virus spreads through aerosols or droplets formed when infected people talk, cough, or sneeze. Healthy people in close contact with infected people may be infected through direct contact with these aerosols or droplets. Wearing a mask is therefore an effective way to limit infection and the spread of COVID-19 [2]. The World Health Organization (WHO) calls on people to wear masks in workplaces, schools and shops without adequate ventilation, and areas where COVID-19 is spreading should strictly abide by this guideline [3].

At present, the entrances of many public places are equipped with cameras that detect whether pedestrians are wearing masks. However, current detection systems are not robust: as long as pedestrians cover their faces, they pass the detection and can go through the gate. Many pedestrians lack safety awareness and do not wear masks when they go out. When entering large venues, they pass the detection system by covering their faces with their hands or with scarves. Some frugal people also pass the detection system by wearing cotton masks, which are less protective but can be reused [4]. Such actions to deceive the detection system are called presentation attacks, and the biometrics or related instruments used in presentation attacks are called Presentation Attack Instruments (PAIs). The ability of a detection system to identify PAIs is called Presentation Attack Detection (PAD) [5]. Generally, the stronger the PAD of a system, the more types of PAIs the system can recognize. An ideal PAD system should be able to detect all known PAIs, as well as new unknown PAIs that appear in the future [6]. Likewise, an ideal mask recognition system should be able to detect all obstructions.

Therefore, the detection system at the entrance of public places should be able to distinguish pedestrians wearing masks from those wearing other coverings. In this paper, a cascaded [7] network model for identifying mask types is proposed. The model processes the images captured by the camera to identify whether the face in the image wears a mask and which type of mask is worn. Since the main purpose of the detection system is to let pedestrians wearing masks enter the relevant places, the main purpose of the cascaded network constructed here is to improve the ability of the system to identify masks, not its ability to identify specific types of PAIs [8].

In this paper, a cascaded three-stage mask classification network is constructed by sequentially passing images through a multi-task cascaded convolutional network (MTCNN) [9] and two modified MobilenetV3 networks [10,11,12]. This cascaded network classifies images into faces wearing masks, faces wearing other coverings, and faces not wearing masks. The main contributions of the paper are listed below.

  • By modifying the lightweight network architecture MobilenetV3, updating the activation function of the MobilenetV3 output layer and the structure of the model, two modified MobilenetV3 networks are obtained.

  • To address the large number of parameters, long training time and heavy computation of deep neural networks, transfer learning is applied to the training of MTCNN and of the two modified MobilenetV3 networks, and a deep neural network based on transfer learning is proposed. During training, the lower layers reuse the transferred parameters, which reduces both the training time and the amount of computation.

  • By cascading MTCNN with two modified MobilenetV3 networks, a cascaded deep learning network based on transfer learning is proposed. MTCNN is first used to detect faces, which excludes the influence of the background and the pedestrians’ clothing on the subsequent mask classification. The first modified MobilenetV3 network is then used for mask detection, and the second modified MobilenetV3 network is finally used for mask recognition. After this double screening by the two modified MobilenetV3 networks, the final accuracy of the cascaded network in identifying qualified masks is effectively improved.

  • A mask recognition system based on the cascaded deep learning network is designed. By recognizing face masks, the system finds pedestrians who are not wearing masks or who are wearing other coverings, and ensures that pedestrians entering large venues during the COVID-19 epidemic wear masks.

The rest of the paper is organized as follows. Section 2 introduces the related research on mask detection and recognition. Section 3 describes the construction and training of the cascaded network model. Section 4 uses a cascaded network for mask recognition. Section 5 analyzes the results of the experiments. Section 6 presents the conclusion.

2 Related work

Due to the easy spread of COVID-19, people must wear masks when traveling. It is even more necessary for pedestrians to wear masks when entering large crowded places. At present, research has been carried out on algorithms for pedestrians wearing masks.

Riyadh et al. [13] proposed a method to detect whether a face is wearing a mask. The method first uses OpenCV for face detection and then uses the MobileNetV2 network for mask detection. It is implemented with TensorFlow and OpenCV in the Jupyter Notebook environment. During face detection with OpenCV, a confidence value is used to filter the face boxes. The confidence represents the probability that a box contains a face: boxes with high confidence are retained, and boxes with low confidence are discarded. In this way, faces are detected more accurately. MobileNetV2, a lightweight network that takes less time to run, is used for mask detection. However, this method has two drawbacks. First, it requires the mask to properly cover the mouth and nose area; otherwise the probability of detecting the mask is lower. Second, the authors only detect whether pedestrians wear masks, so if a pedestrian covers the face with another object, the detection system concludes that the pedestrian is wearing a mask.
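The confidence-based filtering of face boxes described above can be illustrated with a minimal sketch. The function name `filter_boxes`, the dictionary layout of a detection, and the threshold of 0.5 are assumptions made for illustration only; they are not taken from [13].

```python
def filter_boxes(detections, min_confidence=0.5):
    """Keep only the detections whose face confidence reaches the threshold.

    `min_confidence` is an assumed value, not one reported in the cited work.
    """
    return [d for d in detections if d["confidence"] >= min_confidence]


# Toy detections: (x1, y1, x2, y2) boxes with a face probability each.
detections = [
    {"box": (10, 20, 110, 140), "confidence": 0.98},
    {"box": (200, 40, 260, 120), "confidence": 0.31},
]
kept = filter_boxes(detections)
```

Raising the threshold discards more false boxes at the cost of possibly missing faces that are partly covered.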

Puja Gupta et al. [14] proposed a ResNet-based multi-pose face presentation attack detection method. The method can detect whether a pedestrian is wearing a mask even when the pedestrian’s whole body is within the range of the camera and the posture varies. It uses the Extended Mask R-CNN algorithm (Ex-Mask R-CNN) to detect individuals wearing masks, and uses ResNet-152 to extract facial features from the input images. Although this method achieves good results on the CASIA-SURF database, it does not solve the difficult problem of unknown attack detection.

Su et al. [15] proposed a new algorithm for mask detection and classification that integrates transfer learning and deep learning. The algorithm combines transfer learning with an Efficient-Yolov3 neural network for mask detection, in which EfficientNet serves as the backbone feature extraction network. The authors use EfficientNet as the feature extractor and MobileNet as the classifier, combining the two networks to exploit the strengths of each, which enhances the generalization ability of the model. The problem with this algorithm is that it does not address the detection of unknown PAIs.

As can be seen from the above discussion, many mask detection methods have been proposed, but none of them addresses the problem of unknown PAIs [16, 17]. The cascaded neural network proposed in this paper focuses detection on qualified masks. As long as all pedestrians wearing qualified masks can be identified, it does not matter which coverings the remaining unqualified pedestrians are wearing, so even if a new unknown PAI appears, it will not affect the judgment of the cascaded system.

3 Cascaded deep learning network based on transfer learning

In this section, we construct and train the MTCNN, MobilenetV3a, and MobilenetV3b networks in the cascaded deep learning network model. Section 3.1 presents the construction process of the cascaded network. Section 3.2 constructs the three sub-networks, obtaining MobilenetV3a and MobilenetV3b by modifying MobilenetV3. Section 3.3 uses transfer learning [18] to train the constructed network models and obtain the trained parameters.

3.1 Cascade network

The cascaded deep learning network consists of three sub-networks: MTCNN, MobilenetV3a and MobilenetV3b. The first step in building the cascaded network is to build the sub-network models. The activation function of the MobilenetV3 output layer and the structure of the model are modified to obtain the MobilenetV3a and MobilenetV3b networks. For mask identification, the face of the person must first be cropped out to remove the influence of interfering factors such as the background and clothes, so MTCNN is used for face detection. MobilenetV3a is used to detect masks, and MobilenetV3b is used to identify masks. After the network models are constructed, transfer learning is used to train MTCNN, MobilenetV3a and MobilenetV3b separately [19]. The lower-layer parameters of the sub-networks use parameters pretrained on ImageNet, which reduces the training time. After the three network models are constructed and trained, they are cascaded to obtain a cascaded deep learning network based on transfer learning [20, 21].

3.2 Construction of cascaded network model

In this section, each sub-network is first analyzed, and the specific functions that each network needs to implement are clarified. The process and details of each sub-network model construction are given in detail.

In the face detection stage, an MTCNN model is constructed. It is a multi-task neural network model for face detection proposed by the Shenzhen Research Institute of the Chinese Academy of Sciences in 2016. MTCNN consists of three cascaded networks, which are the P-Net network for generating face candidate boxes, the R-Net network for filtering and selecting high-precision candidate boxes, and the O-Net network for generating the final bounding box. MTCNN can exclude the influence of lighting, pose, and occlusion conditions, and perform face detection in an unconstrained environment. The MTCNN network is used here for face detection and returns the captured face image. If no face is detected, the program aborts this run.

In the mask detection stage, a MobilenetV3a network modified from MobilenetV3 is constructed. MobilenetV3a divides the input images into two categories: wearing masks and not wearing masks. In the MobilenetV3a network, the features of the input image are extracted by convolutional layers. Each convolutional layer contains multiple convolution kernels, and different kernels extract different features, which makes classification by features convenient. To reduce the risk of overfitting, batch normalization is applied to the data in MobilenetV3a. Since MobilenetV3a needs to divide the input images into two categories, the activation function of the output layer of the MobilenetV3 network is changed to the sigmoid function. The sigmoid function maps the output value of the network to a value between 0 and 1, which enables binary classification. In this way, the MobilenetV3a network is constructed.
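To make the role of the output activation concrete, the following minimal sketch evaluates the sigmoid function, sigma(z) = 1 / (1 + exp(-z)), and thresholds its output. The decision threshold of 0.5 and the convention that outputs near 1 mean "mask" are assumptions for illustration; the paper does not state either.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid: maps any real z into the interval [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe for very negative z, underflows to 0.0
    return e / (1.0 + e)


def predict_mask(logit, threshold=0.5):
    """Turn the raw network output into one of the two labels.

    The threshold and the label assigned to high probabilities are assumed.
    """
    p = sigmoid(logit)
    label = "mask" if p >= threshold else "NO-mask"
    return label, p
```

For example, `predict_mask(2.0)` yields the label "mask" and `predict_mask(-2.0)` yields "NO-mask", since the sigmoid of a positive input exceeds 0.5 and that of a negative input falls below it.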

In the mask recognition stage, a MobilenetV3b network modified from MobilenetV3 is constructed. MobilenetV3b classifies input images into qualified and unqualified images, where qualified images show qualified masks, and unqualified images show either no mask or other coverings (veils, cotton masks, scarves, etc.). MobilenetV3b applies dropout after the pooling layer: neural network units are discarded from the network with a certain probability to prevent the network from overfitting. Since MobilenetV3b performs multi-class classification, the activation function of its output layer is softmax. In this way, MobilenetV3b is also built.
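The softmax output layer can be illustrated in the same spirit. The sketch below converts three raw scores into class probabilities and picks the most probable class. The label names follow those used later in the paper (OK-mask, NG-mask, NO-mask), but their order in the output vector is an assumption.

```python
import math

# Assumed ordering of the three output units; the paper does not specify it.
LABELS = ("OK-mask", "NG-mask", "NO-mask")


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the most probable label together with all class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs
```

Unlike the sigmoid output of MobilenetV3a, the softmax probabilities compete with each other, so exactly one of the three classes is selected for each face image.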

After the three models are constructed, the next step is to train them separately.

3.3 Training of the network model

During training, the training of the three sub-network models is performed independently and without association.

Transfer learning is applied during MTCNN model training, and MTCNN parameters pretrained on the ImageNet dataset are used to reduce the model training time. Transfer learning refers to transferring the parameters of a trained model to a new model to help train the new model. The new model uses the transferred parameters directly and does not retrain them; this is called freezing parameters. Because model training involves a large number of parameters and a large amount of data, it takes a long time. Using transfer learning reduces the number of parameters that must be trained and therefore the time spent on model training [22].
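The effect of freezing on the number of trainable parameters can be sketched without any deep learning framework. In the sketch below, the `Layer` class, the layer count of 100, and the 1000 parameters per layer are hypothetical values chosen for illustration; only the cut-off of 80 frozen layers follows the value the paper reports for MobilenetV3a and MobilenetV3b. In a framework such as Keras, the analogous operation is setting a layer's `trainable` attribute to `False`.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for a network layer (not a real framework class)."""
    name: str
    num_params: int
    trainable: bool = True


def freeze_first(layers, n_frozen):
    """Freeze the first `n_frozen` layers and leave the rest trainable."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= n_frozen
    return layers


def trainable_params(layers):
    """Count the parameters that would still be updated during training."""
    return sum(layer.num_params for layer in layers if layer.trainable)


# Toy model: 100 layers with 1000 parameters each (illustrative numbers only).
model = [Layer(f"layer_{i}", 1000) for i in range(100)]
before = trainable_params(model)
freeze_first(model, 80)
after = trainable_params(model)
```

In this toy setting the trainable parameter count drops to one fifth of the original, which is the mechanism by which freezing shortens training.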

Fig. 1 MobilenetV3a with transfer learning applied. The part inside the dashed box indicates that MobilenetV3a uses the parameters of MobilenetV3

The images in the dataset have already been cropped to the faces. When training the MobilenetV3a and MobilenetV3b models, it is therefore no longer necessary to use MTCNN for preprocessing, which greatly improves the efficiency of model training. MobilenetV3a and MobilenetV3b are trained in the same way. The training process is divided into the following 5 steps:

  1. The data set is divided first. The input images of the dataset are randomly shuffled to enhance the generalization ability of the model [23]. Then 90% of the data are randomly selected as the training set and the remaining 10% as the test set. Randomly dividing the data set in this way helps enhance the robustness of the model [24].

  2. Transfer learning is applied during model training to reduce training time. MobilenetV3a and MobilenetV3b use the parameters of the first 80 layers of MobilenetV3 pretrained on ImageNet, and only the parameters after layer 80 are trained. The process is shown in Figure 1, using MobilenetV3a as an example. In Figure 1, the first layer of MobilenetV3 represents the input layer, the middle layers represent the hidden layers, and the last layer represents the output layer. In MobilenetV3a, the first and last layers likewise represent the input and output layers, and the middle layers are all hidden layers. The part inside the dashed box indicates that MobilenetV3a uses the parameters of the corresponding layers of MobilenetV3.

  3. The conditions for early stopping are set. During model training, if the loss on the training set has reached its minimum, training can be stopped before the specified number of epochs is reached. Whether the training loss has been minimized is determined as follows. The training loss is monitored during training. If it has not changed for 3 epochs, the performance of the model is no longer improving, and a reduction of the learning rate is triggered in order to reduce the training loss further. If, after the learning rate has been reduced, the training loss still has not changed for 10 epochs, the loss is considered to have stopped decreasing and training is stopped early.

  4. If the model reaches the specified number of epochs but the training loss has not reached its minimum, the frozen layers are unfrozen for further training. Training only the parameters after layer 80 is called simple training, and the number of epochs for simple training is set to 50. If training is not stopped early within these 50 epochs, training only the parameters after layer 80 cannot minimize the loss of the model. In that case, the first 80 layers of the model are unfrozen and the whole model is trained further, until the training is completed or the early stopping conditions are met.

  5. During training, the parameters are saved every 2 epochs to the corresponding path.
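The learning-rate reduction and early-stopping rule of step 3 can be sketched as a small simulation over a sequence of training losses. The patience values of 3 and 10 epochs come from the text; the initial learning rate, the reduction factor of 0.5, and the choice to count the 10 epochs from the last improvement of the loss are assumptions, since the paper does not specify them. The function name `run_schedule` is hypothetical.

```python
def run_schedule(losses, lr=1e-3, lr_patience=3, stop_patience=10, factor=0.5):
    """Simulate plateau-based learning-rate reduction and early stopping.

    Returns (stop_epoch, final_lr, history), where stop_epoch is None if
    training ran through all supplied epochs without stopping early.
    """
    best = float("inf")
    since_best = 0  # epochs without improvement, used for early stopping
    since_cut = 0   # epochs without improvement since the last LR reduction
    history = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            since_best = 0
            since_cut = 0
        else:
            since_best += 1
            since_cut += 1
        if since_cut >= lr_patience:
            lr *= factor  # assumed reduction factor
            since_cut = 0
        history.append((epoch, lr))
        if since_best >= stop_patience:
            return epoch, lr, history
    return None, lr, history


# Toy loss curve: improves for three epochs, then stays flat.
losses = [1.0, 0.8, 0.6] + [0.6] * 12
stop_epoch, final_lr, history = run_schedule(losses)
```

With this toy curve the learning rate is reduced several times during the plateau before training stops, which matches the intent of step 3: a lower learning rate is tried first, and training ends only when that no longer helps.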

The MobilenetV3b model initially uses ImageNet weights with the first 80 layers of the network frozen. However, unfrozen training is found to give better results. The final MobilenetV3b model is therefore trained from scratch on three classes of images: with masks, without masks, and with other occlusions. In this way, the parameters of the trained MobilenetV3a and MobilenetV3b models are obtained.

Fig. 2 Cascaded network construction, training and testing process

The construction and training of the three sub-network models are now complete, and the three networks can be cascaded. First, MTCNN detects the faces in the input images and crops them out. The face images are then input into MobilenetV3a for mask detection. If the person in an image is wearing a mask, the image is fed into MobilenetV3b to classify the mask. The construction, training and testing process of the cascaded network is shown in Figure 2.

4 Mask recognition based on cascaded deep learning network

After the construction and training of the sub-network models are completed, the cascaded network can be used for mask recognition. In this section, we use the cascaded deep learning network to detect whether pedestrians wear qualified masks. Section 4.1 introduces the datasets used, including the types, numbers, and sizes of their images, and shows examples of the different image types. Section 4.2 describes the preprocessing of the dataset. Section 4.3 explains the process of mask recognition by the cascaded neural network. Section 4.4 gives the details of mask recognition using the cascaded network.

4.1 Datasets

The first dataset is the face mask dataset. The images in the dataset are 16 × 16 in size and fall into two categories: faces with masks and faces without masks. An image with a mask is labeled mask, and an image without a mask is labeled NO-mask. There are 313 images of faces wearing masks and 443 images of faces without masks, for a total of 756 images. We call this dataset “Face-Mask1” here.

Fig. 3 Example images of all kinds in the dataset

The second dataset is the mask classification dataset. Its images are 224 × 224 in size, and the masks in the images are divided into qualified and unqualified masks. A qualified mask is labeled OK-mask. Qualified masks include N95 masks and disposable surgical masks, and there are 1361 such images. An unqualified mask is labeled NG-mask. Unqualified masks mainly include sponge masks, cloth masks, scarves, etc., and there are 1880 such images. This dataset contains a total of 3241 images for mask classification. We call this dataset “Face-Mask2” here. Pictures of the different categories in the two datasets are shown in Figure 3.

4.2 Preprocessing

Before the images are passed into the MobilenetV3a network for classification, they go through a preprocessing stage that includes face detection, face alignment, and normalization. First, the MTCNN algorithm detects the faces and marks the facial landmarks, which are used to crop the faces out of the images. MTCNN uses an image pyramid to handle faces of different sizes in the images. The coordinates of the eyes are then used to perform an affine transformation that aligns the faces [25]. Finally, the face images are normalized so that the values of the image are adjusted to a similar range, which is convenient for subsequent processing [26].
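The eye-based alignment can be illustrated with a minimal geometric sketch. It assumes that the affine transformation is a pure rotation about the midpoint between the eyes, which is one common choice; the paper does not state which affine transformation is used. The function names are hypothetical.

```python
import math


def roll_angle_degrees(left_eye, right_eye):
    """Angle of the line through both eyes relative to the image x-axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, center, angle_degrees):
    """Rotate `point` about `center` by the given angle (a 2D affine map)."""
    theta = math.radians(angle_degrees)
    x, y = point[0] - center[0], point[1] - center[1]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (center[0] + x * cos_t - y * sin_t,
            center[1] + x * sin_t + y * cos_t)


def align_eyes(left_eye, right_eye):
    """Rotate both eye positions so that the eye line becomes horizontal."""
    angle = roll_angle_degrees(left_eye, right_eye)
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    return (rotate_point(left_eye, center, -angle),
            rotate_point(right_eye, center, -angle))


# Toy eye coordinates of a tilted face (illustrative values only).
aligned_left, aligned_right = align_eyes((30.0, 40.0), (70.0, 60.0))
```

In a full pipeline the same rotation would be applied to every pixel of the face crop rather than only to the two eye points.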

4.3 Mask recognition process

After preprocessing, the data can be passed to the cascaded network for classification. The input images are first passed into MTCNN, which detects the faces and crops them out of the images. The MobilenetV3a network then classifies the face images into two categories: wearing masks and not wearing masks. Images without masks are judged to be unqualified, and images with masks are passed to MobilenetV3b. MobilenetV3b classifies the incoming images in more detail into images with qualified masks, images with other occlusions, and images without masks. Images with qualified masks are judged as qualified, and the other two types of images are judged as unqualified. The process of mask recognition using the cascaded network is shown in Figure 4.
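The decision flow of the cascade can be summarized in a short sketch. The three callables `detect_face`, `classify_a`, and `classify_b` are hypothetical stand-ins for MTCNN, MobilenetV3a, and MobilenetV3b; the returned strings are illustrative, while the label names follow the paper.

```python
def cascade(image, detect_face, classify_a, classify_b):
    """Run the three-stage decision described in the text.

    detect_face -> face crop or None      (stand-in for MTCNN)
    classify_a  -> "mask" or "NO-mask"    (stand-in for MobilenetV3a)
    classify_b  -> "OK-mask", "NG-mask" or "NO-mask" (stand-in for MobilenetV3b)
    """
    face = detect_face(image)
    if face is None:
        return "no-face"  # nothing to classify
    if classify_a(face) == "NO-mask":
        return "unqualified"  # rejected by the first screening
    # Only faces that passed the first stage reach the second classifier.
    return "qualified" if classify_b(face) == "OK-mask" else "unqualified"


# Toy stand-ins that return fixed answers, for illustration only.
result = cascade(
    "image",
    detect_face=lambda img: "face",
    classify_a=lambda face: "mask",
    classify_b=lambda face: "NG-mask",
)
```

Because the second classifier is only consulted for faces that the first stage accepts, a face without any covering is rejected early and cannot be confused with another covering in the second stage.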

Fig. 4 Cascaded networks for mask recognition

4.4 Mask recognition details

After determining the process of identifying masks by the cascade network, the next step is to create models, read parameters, and identify masks.

First, an MTCNN model is created for cropping the faces in the images. The acceptance criteria for the P-Net, R-Net, and O-Net networks in MTCNN are specified, and the weights from the ImageNet dataset are used. Next, a MobilenetV3a model is created for identifying whether the face is wearing a mask. The image labels are NO-mask and mask, so the number of labels is 2 and MobilenetV3a divides the images into 2 categories. The MobilenetV3a weights obtained in the training phase are loaded. Finally, a MobilenetV3b model is created to identify whether the mask is qualified. The image labels are OK-mask, NG-mask and NO-mask, so the number of labels is 3 and MobilenetV3b divides the images into 3 categories. The MobilenetV3b weights obtained in the training phase are loaded.

After the models are created, mask identification begins. First, the images are passed into MTCNN to crop the faces. If MTCNN does not detect a face, the program aborts. Otherwise, MTCNN returns the coordinates of the face boxes and of the facial key points. If an image contains more than one face, MTCNN returns more than one face box. The face boxes returned by MTCNN may be non-square rectangles. Since MobilenetV3a requires square input images, the face boxes are converted into squares without losing any part of the original box. A face box may also extend beyond the image, so it is adjusted so that it does not exceed the image boundaries. In this way, the processed face box coordinates and facial key point coordinates are obtained.
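The two box adjustments described above can be sketched as follows. The sketch assumes boxes in `(x1, y1, x2, y2)` pixel coordinates and assumes that a rectangle is made square by growing its shorter side around the box center; the paper does not describe the exact procedure, and the function names are hypothetical.

```python
def square_box(box):
    """Expand the shorter side so the box becomes a square around its center.

    Growing (rather than shrinking) keeps the whole original box inside.
    """
    x1, y1, x2, y2 = box
    side = max(x2 - x1, y2 - y1)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = side / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def clip_box(box, width, height):
    """Clamp the box so it does not extend beyond the image boundaries."""
    x1, y1, x2, y2 = box
    return (max(0, x1), max(0, y1), min(width, x2), min(height, y2))


# Toy example: a tall face box near the left edge of a 100 x 100 image.
squared = square_box((10, 20, 50, 100))
clipped = clip_box(squared, width=100, height=100)
```

Note that clipping can make the box non-square again near the image border, so a practical pipeline would still resize the crop to the square input size of the classifier.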

Then, the face boxes are cropped from the original image, and the cropped images are the face images. The faces may be tilted, and straightening them helps improve the recognition performance of the model [27]. The facial key points include the eyes, and the coordinates of the eyes are used to straighten the faces with an affine transformation. The straightened face images are normalized so that their values are adjusted to a similar range, which is convenient for later data processing. The normalized images are passed into MobilenetV3a, which classifies them into images with masks and images without masks.

Finally, the images with masks are passed into MobilenetV3b, which classifies them into images with qualified masks, images with other occlusions, and images without masks. Pedestrians whose images show qualified masks can pass through the gates, while the other two categories are judged to be unqualified and are not allowed through.

5 Experimental results

To compare the classification performance of a single neural network model with that of the cascaded neural network model, we conduct three experiments in this section. The first and second experiments test MobilenetV3a and MobilenetV3b, respectively. In Experiment 3, the images processed by MTCNN are passed into the cascaded network for classification. MobilenetV3b in Experiment 2 serves as the single neural network model, and its results are compared with those of the cascaded network in Experiment 3.

Table 1 Experimental details

5.1 Experiment introduction

Experiment 1 trains a modified MobilenetV3 network, called MobilenetV3a, using transfer learning. Freezing different numbers of layers was tested, and freezing the first 80 layers of the network was finally chosen. In this experiment, images are divided into two categories: images with masks and images without masks. Images with masks include both images with qualified masks and images with other occlusions.

Experiment 2 trains a modified MobilenetV3 network from scratch called MobilenetV3b. In this experiment, images were divided into three categories: images with qualified masks, images with other occlusions, and images without masks.

Experiment 3 cascades MobilenetV3a and MobilenetV3b for performance evaluation. In this experiment, images are divided into three categories: images with qualified masks, images with other occlusions, and images without masks.

Table 1 lists the models used, the pooling layers of each model, the activation functions, and the number of frozen layers in Experiments 1, 2, and 3. The three experiments use different neural networks, and the network with the best performance is identified by comparing the experimental results. The experimental results are analyzed in Section 5.2.

5.2 Result analysis

In this section, the results of each experiment are presented and analyzed.

Fig. 5 Confusion matrix for MobilenetV3a

Figure 5 shows the results of Experiment 1: the confusion matrix of the classification results of the fine-tuned MobilenetV3a. The confusion matrix covers the two classes of wearing masks and not wearing masks. A confusion matrix is an analytical summary of the prediction results of a classification problem. In Figure 5, the 1 in the upper left corner represents the proportion of the Mask class that is correctly classified. The 0 in the upper right corner represents the proportion of the Mask class that is misclassified as NO-mask. The 0 in the lower left corner represents the proportion of the NO-mask class that is misclassified as Mask. The 1 in the lower right corner represents the proportion of the NO-mask class that is correctly classified. The color bar on the right of the confusion matrix maps values to colors, with larger values shown in darker colors, so the boxes containing the value 1 are clearly darker than the boxes containing the value 0. Since the difference between wearing a mask and not wearing a mask is very obvious, the accuracy of MobilenetV3a in judging whether a mask is worn is 1 in Experiment 1.
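The proportions shown in such a confusion matrix can be computed as in the following minimal sketch, where each row corresponds to a true class and is normalized to sum to 1. The labels and predictions below are toy values for illustration and are not the experimental data of this paper.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    return counts


def normalize_rows(counts):
    """Convert counts into per-class proportions, as shown in the figures."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


classes = ["Mask", "NO-mask"]
# Toy data: four Mask samples (one misclassified) and two NO-mask samples.
y_true = ["Mask", "Mask", "Mask", "Mask", "NO-mask", "NO-mask"]
y_pred = ["Mask", "Mask", "Mask", "NO-mask", "NO-mask", "NO-mask"]
matrix = normalize_rows(confusion_matrix(y_true, y_pred, classes))
```

The diagonal entries of the normalized matrix are the per-class proportions of correct classifications, which is how the values in the upper left and lower right corners of Figure 5 are read.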

Fig. 6 Confusion matrix for MobilenetV3b

Figure 6 shows the results of Experiment 2: the confusion matrix of the classification results of MobilenetV3b trained from scratch. The confusion matrix covers the three categories of wearing qualified masks, covering the face with other items, and not wearing masks. The 0.85 in the upper left corner means that 85% of OK-masks are correctly classified. The classification results of MobilenetV3b alone are tested here in Experiment 2 and are later compared with the results of the cascaded network in Experiment 3. In this experiment, NG-mask images are added, and because the features of NG-mask and OK-mask are similar, they interfere with the classification, so the accuracy of the model for OK-mask is significantly reduced.

Figure 7 shows the confusion matrix of the classification results of the cascaded neural network in Experiment 3. The confusion matrix covers the classes of wearing qualified masks, wearing other items, and not wearing masks. The 0.92 in the upper left corner means that 92% of OK-masks are correctly classified. Compared with MobilenetV3b alone, the cascaded network improves the accuracy on OK-mask by 7 percentage points (from 0.85 to 0.92). This shows that, compared with a single model, the cascaded network can effectively improve classification accuracy.

Fig. 7 Confusion matrix for cascaded networks

Figure 8 shows the result of mask recognition for pedestrians by MobilenetV3b, and Figure 9 shows the result by the cascaded neural network. Both are applied to the same image, which contains three cases: OK-mask, NO-mask and NG-mask. The pedestrian with the schoolbag in Figure 8 does not wear a mask, but MobilenetV3b judges that the pedestrian wears another covering. In Figure 9, the cascaded neural network correctly identifies that this pedestrian does not wear a mask. This is because the cascaded network includes the MobilenetV3a network, which filters out pedestrians without masks first and thus avoids confusion between pedestrians without masks and those wearing other coverings during the subsequent classification. As the two figures show, the cascaded neural network classifies more effectively than a single model.

Fig. 8 Mask recognition results of MobilenetV3b

Fig. 9 Mask recognition results of cascaded neural networks

6 Conclusion

In this paper, we propose a cascaded convolutional neural network based on MobilenetV3 to identify whether a pedestrian is wearing a mask and whether that mask is qualified. We combined the images from the Face-Mask1 and Face-Mask2 datasets to increase the types and number of PAIs, making the trained model more robust. Finally, we conducted three experiments. A comparison of the experimental results of the single MobilenetV3b neural network and the cascaded neural network shows that the cascaded neural network is very effective for mask classification.

During an epidemic, it is very dangerous to take off the mask for face recognition. In future work, we will further study face recognition while wearing a mask and improve the safety of face recognition in the epidemic environment.