1 Introduction

To secure systems, different types of biometric authentication systems have been employed, such as face, iris, and fingerprint authenticators. Computational methods have advanced to detect presentation attacks on such systems. A presentation attack can be defined as an attempt to breach a system by presenting an illegitimate subject, either through impersonation or by concealing identity. Such attacks are broadly categorized as print attacks [1], where a printed photo is presented to the authenticator; replay attacks, where a video of the subject is shown; and mask attacks, where objects are used to hide the identity. Both static and dynamic methods have been employed for the detection of presentation attacks. Static detection is less computationally intensive, as it considers information per video frame or image. Dynamic detection allows better generalization and aims at detecting liveness in the subject, such as eye movement and pulse, and is therefore computationally intensive. In this research work, static detection of presentation attacks has been used, with the prospect of dynamic detection on live videos in the future.

Traditionally, static detection methods look for handcrafted features such as distortion in the image [2], the Moiré effect [3], color distortion, and shape deformation [4]. As in other fields [5, 6], deep learning methods have shifted from handcrafted features to computationally learned features. Feature-based classifiers have shown promise in cross-dataset evaluations [7], while deep learning-based models show better generalization.

Further, the research space has grown with publicly available datasets such as CASIA-SURF [8] and CRMA [9], and with continual face presentation attack detection challenges organized by conferences. Such a competitive environment provides new directions for research. While the initial datasets focused on handcrafted features such as image quality, color features, and frequency patterns, learned feature representations and ensemble techniques have dominated recent research work [10,11,12]. Also, the research question has evolved from assessing presentation attack detection algorithms to improving their generalization. In our work, we use the Spoofing in the Wild (SiW) dataset developed by Liu et al. [13] to build our presentation attack detection model. To the best of our knowledge, performance metrics of machine learning or deep learning models on the SiW dataset have not been published.

Previous research has focused mainly on detecting the presentation attack itself. In this research work, we have trained different deep learning models to identify the type of presentation attack as well. Using the SiW dataset, our models can determine whether an input is a presentation attack or not; furthermore, they can also classify whether it is a print attack or a replay attack.

In this research work, we have developed models for presentation attack detection (PAD) and presentation attack type classification (PATC) using the SiW dataset. Finally, we implement transfer learning and perform benchmark tests on two datasets: the NUAA photo imposter database [14] and the Replay-attack database [15]. The contributions of this research can be summarized as follows: 1) a comparative analysis using different deep learning models, 2) development of the PAD-CNN model for presentation attack detection and presentation attack type classification, and 3) multiple independent tests, including benchmark tests.

The rest of the paper is organized as follows. Section 2 includes a literature review. Section 3 covers materials and methods, which include a description of the dataset, preprocessing, and the different DL models. Section 4 presents results and analysis, and Sect. 5 presents the discussion and conclusions.

2 Related work

Research in presentation attack detection is supported by consistent, collaborative competitions that summarize the limitations and future directions in the field. The first competition [16] defined the evaluation protocol for the case of detecting 2D face spoofing attacks, while the second competition [17] expanded the scope to display attacks and refined the evaluation protocol. Later competitions [10, 12, 18, 19] took advantage of large-scale datasets and deep learning algorithms to improve performance. A multiclass model [20] was proposed for 3D high-fidelity mask face presentation attack detection at ICCV 2021. As improved technology makes it increasingly easy to defeat face authentication, there is a need for improved algorithms that can detect attacks with precision and speed.

Likewise, the algorithms in the initial competitions focused on handcrafted features such as local binary patterns (LBP) [21] and the use of first-order Haar wavelets to distinguish noise in the images. Further, with the availability of datasets in different modalities (RGB, depth, and NIR), multimodal and unimodal tracks of the algorithms were proposed. Many datasets have been developed [13,14,15, 22] and provided as open source. Some datasets include print attacks [14], and others replay attacks [15]. The availability of different modalities has been exploited using multimodal techniques [10,11,12], and fusion techniques [23, 24] have also been employed to improve model capability. Khade et al. [25] proposed an ensemble-based approach using handcrafted Discrete Cosine Transform and Haar transform features for iris-liveness detection. However, an ensemble-based model is not a reliable technique for real-time use, as it requires more resources to train in data-scarce settings such as mobile devices.

A few models explore feature fusion methods that integrate traditional features with deep learning models. MFNET-Leu [24] utilizes Laplacian embedding to discover discriminative features with a multilevel fusion network to detect face presentation attacks. BioPAD [26] applies a biologically inspired method that fuses spectral information at the feature and score levels for face presentation attack detection.

Convolutional neural networks (CNN) [27] have been a predominant choice in prediction tasks involving videos and images. With the advent of attention mechanisms, many CNN-based models incorporating transformers have emerged [28, 29]. Feature-based methods learn discriminative features of the images, such as the RGB frequency spectrum, to distinguish live images from fake images. Agarwal et al. [22] use both handcrafted and CNN-based features to detect 2D video replay attacks with an SVM classifier, emphasizing the impact of brightness and transformations on the error rate of the detection algorithm.

Fang et al. [30] proposed the Attention-based Pixel-wise Binary Supervision (A-PBS) network to address the generalization issue in iris presentation attack detection. The same group also worked on partial attack supervision [31]. DFCANet [32] applied channel attention for iris-level PAD and reported results on the benchmark databases IIITD-CLI, IIIT-CSD, and NDCLD'13 to measure generalizability across intra-domain and cross-domain scenarios. ViTransPAD [33] applies the attention mechanism of vision transformers to obtain pixel-level discrimination for face PAD.

Other approaches to face PAD involve detecting liveness in the subject through the observation of gaze and other contributing factors. DeepEyeIdentification Live [34], a CNN-based model, trains on the gaze patterns of an eye to detect the liveness of the subject and is shown to perform well against replay attacks. TransRPPG [35] is a pure-transformer framework that extracts features to support liveness detection for mask attack detection. Chou [36] proposes an algorithm that detects subjects' liveness using a multimodal presentation attack detection method with score-level fusion.

Generalizability, and the need to measure the performance of PAD algorithms on images from people of whom not a single image was used for training (hereafter termed unseen subjects), is crucial. Cross-domain prediction has been explored to improve performance on unseen datasets and thus improve generalization. A domain-based generalization technique [37] aims at capturing features by taking into account sample bias and irrelevant domain features, thus improving performance in unseen scenarios. Shao et al. [38] propose a generalized model, FedPAD, that preserves the privacy of face images using domain-invariant and domain-specific features from the image. Federated Test-time Adaptive Face presentation attack detection [39] is another model that maintains the privacy of images and improves generalizability by minimizing entropy on the test data using test-time adaptation. Mohammadi et al. [40] explored domain-guided pruning of CNNs in a mobile setting and tested the generalization capabilities of the model in intra-domain and cross-domain evaluation. This body of work points to the need to determine spoof features that can be used to attain good performance on unseen attacks.

However, the analysis and development of presentation attack detection systems for different types of attacks, as well as for identifying the type of attack itself, are lacking. Furthermore, models based on CNNs have been underutilized, even though the CNN is the preferred algorithm for images and videos. In this research, we develop presentation attack detection (PAD) models based on CNNs and apply them to the Spoofing in the Wild (SiW) dataset developed by Liu et al. [13], which contains both print and replay attacks. SiW provides live and spoof videos from 165 subjects; each subject has 8 live and up to 20 spoof videos, for a total of 4,478 videos. These models are further used to classify the types of attacks as well. Benchmarking on two different datasets is also performed to analyze the performance of the models.

3 Materials and methods

In this section, we present the development of a presentation attack detection system on the Spoofing in the Wild (SiW) dataset. The overall approach is shown in Fig. 1.

Fig. 1 Dataset preprocessing and preparation

3.1 Spoofing in the wild dataset

The Spoofing in the Wild (SiW) dataset developed by Liu et al. [13] was used as the primary dataset for the analysis of the presentation attack detection system. It contains live and spoof videos from 165 subjects with a total of 4478 videos. These are 30 frames per second videos with an average length of 15 s at 1080p resolution. The live videos are collected with variations in distance, pose, illumination, and expression. There are two types of spoof videos: printed paper and replay.

3.1.1 Image extraction

Images were extracted from the videos in the SiW dataset for the development of models. Ten frames per second were extracted from each video to build our training and testing dataset. These images were further classified/labeled in two ways. For presentation attack detection (PAD), binary classes (spoof or not) were created. Three classes (live, printed paper, and replay) were used for multiclass classification: presentation attack type classification (PATC).
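As an illustration, the frame extraction step could be implemented along the following lines using OpenCV. The directory layout, file naming, and fallback frame rate are assumptions made for this sketch and are not part of the SiW distribution.

```python
# Illustrative sketch of frame extraction: sample roughly 10 frames per second
# from each video and save them under a directory named after their label.
# Paths, label names, and the exact sampling strategy are assumptions.
import os
import cv2  # OpenCV

def extract_frames(video_path, out_dir, frames_per_second=10):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30           # SiW videos are 30 fps
    step = max(int(round(native_fps / frames_per_second)), 1)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                               # keep every step-th frame
            name = os.path.splitext(os.path.basename(video_path))[0]
            cv2.imwrite(os.path.join(out_dir, f"{name}_{saved:05d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example: label directories "live", "print", "replay" feed the PATC task;
# for PAD, the "print" and "replay" frames are merged into a single "spoof" class.
# extract_frames("subject001_live_01.mov", "frames/live")
```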

3.1.2 Training and testing dataset

For both binary and multiclass classification, images from 140 subjects were used for initial training and testing. Following the division of this dataset into 80% training and 20% test sets, it was balanced using undersampling. The initial independent test on this data is termed Test 1. Images from the remaining 25 subjects were used for the second independent test, termed Test 2, in which the model has seen none of the images from these subjects. Table 1 shows the total number of training and test images for binary and multiclass classification, PAD and PATC, respectively.
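A minimal sketch of the split and class balancing described above is shown below; the variable names, the random seed, and the choice of plain random undersampling are illustrative assumptions.

```python
# Sketch of the dataset preparation: an 80/20 split of the frames from the first
# 140 subjects, followed by random undersampling so all classes contain the same
# number of images. Frames from the remaining 25 subjects are kept apart as Test 2.
import random
from collections import defaultdict

def balance_by_undersampling(samples, seed=42):
    """samples: list of (image_path, label). Returns a class-balanced list."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))
    smallest = min(len(v) for v in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        balanced.extend(rng.sample(items, smallest))        # drop surplus images
    rng.shuffle(balanced)
    return balanced

def split_train_test(samples, test_fraction=0.2, seed=42):
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    cut = int(len(samples) * (1 - test_fraction))
    return samples[:cut], samples[cut:]                      # Test 1 = held-out 20%
```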

Table 1 Training and test datasets for both PAD and PATC

3.2 Benchmark datasets

In this research work, we have used the NUAA photo imposter database [14] and the Replay-attack database [15] as benchmark datasets. These benchmark datasets were used to analyze the generalizability of the different models trained and tested on the SiW database. The NUAA photo imposter database contains images from 15 subjects, whereas Replay-attack contains videos from 50 subjects.

3.3 DL Models

In this research, we have used the PAD-CNN model based on a CNN [27], Mobilenet [41], and InceptionV3 [42]. InceptionV3 has the highest number of parameters and the longest runtime, whereas PAD-CNN has the fewest parameters and the shortest runtime. These three models were applied as they represent low, medium, and high complexity, respectively. All three models are used on the SiW dataset for both binary and multiclass classification. Furthermore, these models are tested on the benchmark datasets as well.

3.4 PAD-CNN

First, the input image is preprocessed with a resolution of 224 × 224, batch size 32, and RGB color mode. The preprocessed image is fed into the first 2D convolutional layer with a filter size of 3 × 3. The output of the first convolutional layer is fed into a max pooling layer. Similarly, the subsequent output is fed into a set of three convolutional layers and a max pooling layer. Dropout layers are used in between to minimize overfitting and maximize generalization. Next, a flattening layer is used, followed by dense layers. Adam [43] was used as the optimizer for the PAD-CNN architecture; it uses adaptive learning rates to calculate individual learning rates for each parameter. TruncatedNormal is used as the layer weight initializer. The ModelCheckpoint function is used to save the best model across different numbers of epochs. SoftMax is used as the activation function in the output layer; it assigns probabilities to each class that sum to one. The PAD-CNN model architecture is shown in Fig. 2. More information on hyperparameter tuning is provided in Additional file 1.

Fig. 2 Presentation attack detection (PAD) CNN model architecture
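A minimal Keras sketch consistent with the description above is given here. The filter counts, dropout rates, and dense-layer sizes are assumptions; the text specifies the layer types, ordering, optimizer, initializer, and callbacks but not every hyperparameter.

```python
# Hedged sketch of the PAD-CNN architecture described above (not the exact model).
from tensorflow import keras
from tensorflow.keras import layers

def build_pad_cnn(num_classes=2, input_shape=(224, 224, 3)):
    init = keras.initializers.TruncatedNormal()
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        # first convolutional layer with 3x3 filters, followed by max pooling
        layers.Conv2D(32, (3, 3), activation="relu", kernel_initializer=init),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        # a set of three convolutional layers followed by max pooling
        layers.Conv2D(64, (3, 3), activation="relu", kernel_initializer=init),
        layers.Conv2D(64, (3, 3), activation="relu", kernel_initializer=init),
        layers.Conv2D(128, (3, 3), activation="relu", kernel_initializer=init),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu", kernel_initializer=init),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),     # probabilities sum to one
    ])
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training with best-model checkpointing, as mentioned above:
# checkpoint = keras.callbacks.ModelCheckpoint("pad_cnn_best.h5", save_best_only=True)
# model.fit(train_data, validation_data=val_data, epochs=20, callbacks=[checkpoint])
```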

3.4.1 Mobilenet

MobileNet [41] is based on a streamlined architecture that uses depth-wise (dw) separable convolutions to build lightweight deep neural networks. The Mobilenet architecture is shown in Table 2. It introduces two simple global hyperparameters that efficiently trade off between latency and accuracy. The network consists of 28 convolutional layers and one fully connected layer followed by a SoftMax layer. Batch normalization and ReLU are applied after the convolutional layers. MobileNet achieves performance similar to state-of-the-art approaches with a much smaller network, owing to its depth-wise separable convolutions.

Table 2 Mobilenet architecture with convolutional layers (Conv) and depth-wise(dw) [41]
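For reference, MobileNet's basic building block, a depth-wise convolution followed by a 1 × 1 point-wise convolution, each with batch normalization and ReLU, can be sketched as follows; the filter count and stride here are arbitrary, and Table 2 lists the actual stack.

```python
# Illustration of a depth-wise separable convolution block as used by MobileNet.
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters, stride=1):
    # depth-wise (dw) 3x3 convolution applied per input channel
    x = layers.DepthwiseConv2D((3, 3), strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # point-wise 1x1 convolution that mixes channels
    x = layers.Conv2D(pointwise_filters, (1, 1), padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x
```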

3.4.2 Inception v3

Inception v3 [42] has proven to be more computationally efficient, both in terms of the number of parameters generated by the network and in computational cost. It utilizes factorized convolutions that keep network efficiency in check by reducing the number of parameters involved. Smaller convolutions replace the bigger ones used in its prior version, and asymmetric convolutions further reduce the number of parameters. An auxiliary classifier and grid size reduction are applied to improve the model.
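The asymmetric factorization can be illustrated as below: an n × n convolution is replaced by a 1 × n convolution followed by an n × 1 convolution, covering the same receptive field with fewer parameters. The filter count and n are arbitrary choices for this sketch; in this work the standard InceptionV3 architecture is used as a whole.

```python
# Sketch of the asymmetric convolution factorization used in Inception v3.
from tensorflow.keras import layers

def factorized_conv(x, filters, n=7):
    x = layers.Conv2D(filters, (1, n), padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, (n, 1), padding="same", activation="relu")(x)
    return x

# The full network can be instantiated in Keras, e.g.
# keras.applications.InceptionV3(weights=None, input_shape=(224, 224, 3), classes=2)
```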

3.5 Performance metrics

In this research work, the first independent test was performed on the initially held-out 20% of the data. The second independent test was performed on the unseen subjects' images.

The confusion matrix, accuracy, precision, recall, and F1-score were used as performance metrics. For binary classification (PAD), the confusion matrix has dimension 2 × 2, and for multiclass classification (PATC) with three classes, the dimension is 3 × 3. The diagonal of the matrix gives the counts of correctly predicted values. The matrix consists of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The following metrics are used and are reported as percentages.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$
(1)
$$Precision = \frac{TP}{TP + FP} \times 100$$
(2)
$$Recall = \frac{TP}{TP + FN} \times 100$$
(3)
$$F1\ Score = \frac{Precision \times Recall}{Precision + Recall} \times 2$$
(4)
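The sketch below shows how these metrics can be computed from model predictions with scikit-learn; the label vectors are placeholders, and the weighted averaging used for the multiclass case is an assumption of this example.

```python
# Illustrative computation of the reported metrics from predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [0, 0, 1, 1, 1, 2, 2]   # e.g. 0 = live, 1 = printed paper, 2 = replay (PATC)
y_pred = [0, 1, 1, 1, 2, 2, 2]

print(confusion_matrix(y_true, y_pred))                     # 3 x 3 for PATC, 2 x 2 for PAD
print(accuracy_score(y_true, y_pred) * 100)                 # Eq. (1), as a percentage
# For multiclass PATC the per-class scores are averaged (weighted here by support).
print(precision_score(y_true, y_pred, average="weighted") * 100)   # Eq. (2)
print(recall_score(y_true, y_pred, average="weighted") * 100)      # Eq. (3)
print(f1_score(y_true, y_pred, average="weighted") * 100)          # Eq. (4)
```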

4 Results

In this section, we present the results for presentation attack detection (PAD) and presentation attack type classification (PATC) on the SiW dataset by various implemented DL models. Model benchmarking on the NUAA photo imposter database and Replay-attack database with models trained on the SiW dataset is also included.

4.1 Presentation attack detection (PAD)

Different DL models were applied to the SiW dataset for presentation attack detection, which is a binary classification task. The dataset was divided into training and test sets with an 80:20 ratio. First, an independent test was performed on the held-out 20% test set, termed Test 1. The results are shown in Table 3. All three models achieved 100% accuracy, precision, recall, and F1-score in Test 1. PAD-CNN had a shorter training time than the other DL models.

Table 3 Performance metrics for presentation attack detection (PAD)

Furthermore, the second independent test was performed on the 25 unseen subjects, termed Test 2. The results are shown in Table 3 and Fig. 4. The ROC curve is shown in Fig. 3. PAD-CNN achieved the highest accuracy of 91%, with Mobilenet and Inceptionv3 lagging at 86% and 85%, respectively. Similarly, PAD-CNN has a precision of 92%, with Mobilenet and Inceptionv3 at 89% and 88%, respectively. PAD-CNN has 91% recall, whereas Mobilenet and Inceptionv3 achieved 86% and 85%, respectively. Furthermore, PAD-CNN achieved a 91% F1-score, followed by Mobilenet and Inceptionv3 with 86% and 85%, respectively. The lower complexity of the problem could have allowed the lighter PAD-CNN model to perform on par with or better than the more complex Mobilenet and Inceptionv3.

Fig. 3 ROC curve for independent test on unseen subjects for PAD

Fig. 4 Performance metrics for presentation attack detection (PAD) for different models

4.2 Presentation attack type classification (PATC)

The next step in this analysis was the multiclass classification of attack types as well as live video images. In total, there were three classes: printed paper, replay, and live video. As with presentation attack detection, the three DL models were applied to an independent test on the 20% hold-out dataset, termed Test 1. The results are shown in Table 4. All three models achieved 100% on all performance metrics: accuracy, precision, recall, and F1-score. Training times per epoch for PAD-CNN, Mobilenet, and Inceptionv3 were ~ 1 h, ~ 2 h, and ~ 6 h, respectively.

Table 4 Performance metrics for presentation attack type classification (PATC)

Furthermore, the second independent test was performed on the 25 unseen subjects' images, as in the PAD analysis. The results are shown in Table 4 and Fig. 5. Mobilenet achieved the highest accuracy of 92%, with Inceptionv3 and PAD-CNN both achieving 89%. Mobilenet reached 93% precision, followed by Inceptionv3 and PAD-CNN with 91% and 90%, respectively. Similarly, Mobilenet has a recall of 92%, with Inceptionv3 and PAD-CNN both at 89%. Mobilenet has a 92% F1-score, whereas Inceptionv3 and PAD-CNN achieved 88% and 89%, respectively. With the added complexity of this multiclass task, Mobilenet performed well compared to PAD-CNN and Inceptionv3.

Fig. 5 Performance metrics for presentation attack type classification (PATC) for different models

4.3 Model benchmarking

We implemented transfer learning by utilizing the models trained on the SiW dataset to train and test on the benchmark datasets: the NUAA photo imposter database [14] and the Replay-attack database [15]. All three trained deep learning models were used for benchmarking. Independent tests on the NUAA dataset yielded 100% performance metrics (accuracy, precision, recall, and F1-score) for all three models. The results are shown in Table 5.
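A hedged sketch of this transfer-learning step is shown below: a model already trained on SiW is reloaded and then fine-tuned and evaluated on a benchmark dataset. The file name, data generators, learning rate, and epoch count are assumptions for illustration only.

```python
# Sketch of transfer learning from the SiW-trained model to a benchmark dataset.
from tensorflow import keras

# Reload the SiW-trained weights (e.g. the PAD-CNN checkpoint saved earlier).
model = keras.models.load_model("pad_cnn_best.h5")

# Fine-tune on the benchmark training split with a small learning rate...
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(nuaa_train, validation_data=nuaa_val, epochs=5)

# ...then run the independent test on the benchmark test split.
# loss, acc = model.evaluate(nuaa_test)
```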

Table 5 Performance metrics for benchmarking on NUAA and Replay-attack dataset

Mobilenet achieved 100% accuracy, precision, recall, and F1-score on the independent test on the Replay-attack dataset. Both PAD-CNN and Inceptionv3 also performed highly in this test. All DL models performed well in the independent tests on both benchmark datasets, with high performance metrics.

5 Discussion and conclusions

In this research work, we developed presentation attack detection (PAD) and presentation attack type classification (PATC) models for the SiW dataset. We used the PAD-CNN model based on a CNN, Mobilenet, and Inceptionv3 for both binary and multiclass classification.

Initially, for both binary and multiclass classification, images from 140 subjects were used for training and testing, following a division of the dataset into 80% training and 20% test sets. Images from the remaining 25 subjects were used for the second independent test, in which the model had seen none of the images from these subjects.

For PAD, all models achieved the highest performance metrics on the independent test on the 20% hold-out set. PAD-CNN achieved the highest performance metrics for independent tests on images from 25 unseen subjects. Mobilenet and Inceptionv3 were slightly lower in performance than PAD-CNN, whose accuracy was 91%.

With the added complexity of multiclass classification to identify types of attack, Mobilenet performed slightly better than PAD-CNN and Inceptionv3 on independent tests on unseen subjects. Mobilenet was able to achieve 92% accuracy.

Furthermore, transfer learning was implemented by using the models trained on the SiW dataset to train and test on the benchmark datasets (NUAA photo imposter database and Replay-attack database). Independent tests on the NUAA dataset yielded 100% performance metrics (accuracy, precision, recall, and F1-score) for all three models. Mobilenet achieved 100% accuracy, precision, recall, and F1-score on the independent test on the Replay-attack dataset, and both PAD-CNN and Inceptionv3 also performed highly in this test. Training times per epoch for PAD-CNN, Mobilenet, and Inceptionv3 were ~ 1 h, ~ 2 h, and ~ 6 h, respectively. Mobilenet and Inceptionv3 performed slightly better than PAD-CNN, and current state-of-the-art models are expected to perform better as well. However, at high accuracy, a slight performance boost comes with a high computational cost.

For images and videos, models based on CNNs, from low to high complexity, have been widely preferred and used. In this research work we have presented PAD-CNN, a CNN-based model with lower complexity than Mobilenet and Inceptionv3, which are both CNN-based models. PAD-CNN was able to perform on par with, or in some scenarios even better than, these recent models with half the runtime. With the addition of more data over time, this runtime difference will only grow larger. Therefore, in terms of performance cost and tuning capability, PAD-CNN is a viable option for practical, general use in presentation attack detection.

For future work, new datasets with additional attack types need to be developed. With improvements in datasets, further optimization of the models will be required to compensate for the added complexities. Furthermore, to deal with the black-box nature of deep learning models, research on their interpretability is required. With these developments, presentation attacks can be nullified by the dynamic nature of presentation attack detection systems.