1 Introduction

Over the past few decades, facial recognition (FR) technology has been frequently used in numerous real-world applications, such as mobile payments, access control, immigration, education, surveillance, and healthcare (Kim et al., 2022). The accuracy of FR is no longer a major concern, with the error rate dropping to 0.08%, according to the National Institute of Standards and Technology (NIST) (Grother et al., 2019). Despite its great success, a simple FR system might be vulnerable to spoofing, known as a presentation attack. For instance, print attacks, video replays, and 3D masks are the most common attacks reported recently in the face anti-spoofing domain (Muhammad & Oussalah, 2022, 2023; Wu et al., 2021; Jia et al., 2020; Arashloo, 2020). To counter these attacks, several hand-crafted and deep representation methods have been proposed to protect FR systems (Boulkenafet et al., 2016; Liu et al., 2021; Shao et al., 2020; Muhammad et al., 2022a; Wang et al., 2020; Saha et al., 2020; Muhammad & Hadid, 2019; Shao et al., 2019). Many of these models have reported promising performance in the intra-domain testing scenario. However, their performances remain limited in the cross-dataset testing scenario due to the distributional discrepancy between the source domain and the target domain.

1.1 Domain Adaptation and Generalization

In the context of cross-dataset testing scenarios, a key contributing factor to the performance limitations of deep learning models can often be attributed to the restricted size or inadequacy of the training dataset. Another reason is the inherent assumption in many face presentation attack detection methods that the training and testing data come from the same target distribution. This raises several questions. For instance, if a model is trained on cut photo attack images, would it work on mask attack images? What if a model is trained only on replay attack images but tested on warped photo attacks? Is it possible to deploy a model that is trained using different illumination conditions and background scenes under controlled lighting scenarios? Answers to all these questions depend on how a machine learning model can deal with such domain shift problems. Therefore, domain generalization (DG) is essential when the objective is to build a model that can perform well in entirely new and diverse domains that were not seen during its training. As shown in Fig. 1, domain generalization refers to the task of training a model on data from multiple source domains and then deploying it on a new, unseen target domain. To address the domain generalization issue, several face anti-spoofing methods, such as adversarial learning (Liu et al., 2022), meta-pattern learning (Cai et al., 2022), generative domain adaptation (Zhou et al., 2022), hypothesis verification (Liu et al., 2022), and cross-adversarial learning (Huang et al., 2022), have been introduced to improve the generalization ability of the model.

Since generalization is a fundamental challenge in machine learning, researchers have explored various generalization-related topics such as meta-learning (learning to learn), regularization techniques, ensemble learning, and data augmentation (Wang et al., 2020; Saha et al., 2020; Shao et al., 2019; Liu & Liu, 2022). In particular, domain generalization is important in face anti-spoofing because collecting and annotating new datasets in real-world scenarios can be expensive and time-consuming. On the other hand, domain adaptation (DA) approaches focus on adapting a model trained on one or multiple source domains to perform well on a specific target domain that is different from the source domains. The main difference between these two approaches is that domain adaptation assumes that the target domain is known during model training, while domain generalization does not. For instance, the research in Jia et al. (2020) relies on a shared feature space and assumes that it would also be invariant to domain shift. However, this assumption has limitations in face PAD, because as the source domains become more diverse, learning a domain-invariant model becomes more difficult (Zhou et al., 2021). Instead of concentrating on some domain-specific cues, such as paper texture, diverse data can help the model generalize better if more generalized cues are shared by all source domains (Shao et al., 2019). As spoofing attacks are launched physically by malicious hackers outside the control of the biometric system, domain generalization is generally more important than domain adaptation because DG does not require the target domain to be known during training.

Fig. 1

Domain generalization: The source domains are trained with diverse sets of synthetic images, and the meta-learner leverages this diversity to acquire complementary information, enabling robust generalization to unseen target distributions

In most face PAD databases, both real and spoofed classes exhibit spatiotemporal attributes, offering valuable insights into facial movements, texture changes, and dynamic characteristics, which are essential for effectively distinguishing genuine faces from spoofing attacks. For instance, irregular patterns are often observed in spoofing attacks, particularly with textured materials like paper or fabric, and these patterns become more evident in video sequences. Similarly, print attacks can be detected by hand movements or material reflections, while replay attacks reveal artifacts caused by screen sloping. These spatiotemporal effects can help improve the generalization ability of the model. However, the presence of noisy camera movements poses a challenge that can make the detection of spoofing attacks more difficult. This is especially relevant in the context of face PAD, where deep learning models need to effectively analyze various attributes, patterns, and cues to distinguish between genuine presentations and spoofing attacks. Thus, effectively analyzing both spatial and temporal information is essential for improving the robustness of deep learning-based PAD models. This enables more accurate detection and prevention of spoofing attacks, ensuring higher security and reliability in face biometric authentication processes.

1.2 Ensemble Methods and Their Limitations

Recently, ensemble methods have demonstrated increased generalization capacity for unseen attacks. The concept behind ensemble learning involves aggregating the predictions of multiple models and combining them to produce one optimal predictive model. In particular, such a combination of models is expected to produce significantly improved results. For instance, by ensembling diverse face-spoofing attacks as source domains, researchers achieved the top-ranking position in the ChaLearn LAP face anti-spoofing challenge, showcasing the effectiveness of employing an ensemble technique (Parkin & Grinchuk, 2019). This achievement demonstrates the capacity of a diverse ensemble of models to enhance the stability and efficacy of a face anti-spoofing system. Additionally, an ensemble learning approach was used in Vareto and Schwartz (2021) with three different models, while a score fusion was applied to enhance the generalization of face PAD. Similarly, a depth-based ensemble learning approach was introduced in Jiang and Sun (2022), incorporating multiple domain-specific modules, to optimize spoof detection. In Fatemifar et al. (2019), the authors combined predictions from multiple models using a simple weighted average rule where distinct weights were assigned to individual prediction models according to their performance.

However, one of the main challenges in score fusion-like approaches is finding the correct weighting for the predictions of various models to minimize the impact of noise on the final prediction, as ensembles may include models that are sensitive to noise or data outliers (Shahhosseini et al., 2022). Furthermore, the main challenge in depth-based ensemble learning arises from the assumption that one of the training domains closely resembles the target domain where the model will be deployed. However, this assumption may not always hold true. Indeed, the target domain may exhibit differences in data distribution, feature representation, or other characteristics compared to the training domains. This domain shift can make it difficult for the model to generalize to previously unseen data, potentially leading to a decline in performance.

In contrast to score-based fusion or depth-based ensemble learning, Fatemifar et al. (2020) introduced a client-specific anomaly-based stacking ensemble method, where multiple deep Convolutional Neural Network (CNN) models were trained on various facial regions in an unseen attack scenario. However, implementing client-specific stacking poses a significant challenge due to the necessity of training multiple deep CNNs on various facial regions, which requires substantial amounts of labeled data. Specifically, obtaining labeled data for each facial region and for multiple clients can be laborious and may not always be attainable. Another challenge is associated with the availability of data for models that are not adequately trained or optimized for real-world deployment. This is due to the fact that real-world environments are inherently complex and dynamic, with variations that may not have been present or adequately represented in the training data. These unforeseen variations can pose challenges for approaches that rely on ensemble learning.

1.3 Limitations in Global Motion Methods

Approaches to global motion estimation typically fall into two main categories: feature-based methods and direct (featureless) methods. Feature-based methods involve detecting key points, edges, or other significant features in the data, whereas direct methods operate on the raw data by directly minimizing the difference between the intensities or pixel values of the images (Déniz et al., 2011). In feature-based global motion estimation, the objective is to estimate a transformation (e.g., affine or homography) that aligns the source image with the target image. However, this transformation may cause parts of the target frame to remain empty, leading to black borders around the transformed image (Muhammad et al., 2022a). This occurs because each successive transformation, such as rotations, translations, scaling, or affine changes, modifies the spatial characteristics of the image. These cumulative alterations can result in the transformed image exceeding the dimensions of the original bounding box.

The black border artifacts can be removed by first compensating for global motion using a feature-based method and then applying a frame difference approach. This combination is shown to enhance the overall motion estimation in Zhang et al. (2023). Although the dense-based feature extraction approach was employed to address the black framing issue in face PAD, it requires the extraction and processing of every pixel or grid point in the image compared to sparse methods, resulting in significantly higher computational demand (Muhammad et al., 2022b). We argue that existing approaches in face PAD often overlook the need to address this issue, which could result in the loss of critical details crucial for analyzing live and spoofed attacks. Moreover, black borders or black framing can introduce artificial cues into the video data, which might mislead or bias the models during training. Therefore, effectively managing such spatiotemporal variations and addressing black framing is essential before training any models, as it is crucial for improving the performance of face PAD.

1.4 Our Contributions

To address the aforementioned issues, we first introduce a data synthesis approach that integrates temporal and spatial transformations. Our method aims to tackle the challenge of domain generalization by employing a stacking-based ensemble learning framework based on synthetic data. To achieve this, we present a video distillation technique that blends a static image with a stabilized spatiotemporal encoded image. This blending process merges the static image’s details with the blurred features of the spatiotemporal encoded image, aiding models in gaining a deeper understanding of motion blur.

We assume that data in real-world applications often contain varying degrees of blur caused by motion, low frame rates, or camera shake. In contrast, a static image, being a single frame, may not always provide complete information. Therefore, training with temporally blurred data can make the models more resilient and better prepared for such real-world conditions. Specifically, we utilize three RNNs, namely LSTM, BiLSTM, and GRU, to capture different aspects of temporal blur information, potentially enhancing the ensemble’s resilience in real-world scenarios. This diversity helps the meta-learner in the stacking ensemble combine these complementary strengths effectively.

We also address the issue of black framing in face PAD, which is generated by feature-based global motion estimation methods (Muhammad et al., 2022a). We observe that existing works on face PAD do not explicitly focus on compensating for this effect, which can degrade important details to analyze in live and spoofed attacks for video-based PAD detection. This is achieved through alpha composition as a post-processing step, which involves blending the transformed image with the target frame using alpha values to seamlessly merge them. This process reduces the visibility of black framing artifacts and aids the model in learning to recognize actions in the presence of motion blur. By doing so, the proposed video distillation technique not only allows the models to focus on motion attributes, but also enables them to operate on a smaller subset of frames, thereby reducing computational overhead and improving generalization for subsequent analysis. Since our video distillation scheme follows a uniform sampling approach by dividing the original video into video clips of fixed size, it provides the flexibility to easily control the sampling rate by adjusting the segment size of the video. This is important because a higher sampling rate results in a higher temporal resolution but can introduce more noise, whereas a lower sampling rate is associated with less frequent sampling, leading to a smoother representation but potentially lower temporal detail.

Specifically, we extend our previous approach (Muhammad et al., 2022a) in the following ways: (i) We introduce a new data augmentation technique as a post-processing step to seamlessly composite the transformed image back into the target frame, thereby reducing the visibility of black framing artifacts. These artifacts are unwanted black borders or areas that can appear around images; (ii) We address the domain generalization issue by learning from the diversity of the proposed synthetic data and introducing a deep ensemble learning framework; (iii) We use several explainability methods to answer questions such as “why did the model make a particular prediction?” or “what features were most influential in the decision-making process?”; and (iv) We balance the computational cost of global motion estimation against system performance.

In summary, our key contributions are as follows:

  1. We introduce a video-based data augmentation mechanism that considers both the spatial and temporal domains of the video. The proposed approach can assist deep learning models in capturing spatiotemporal information and enhancing their performance in face PAD tasks.

  2. A meta-model is presented that leverages information from different subsets of synthetic samples, leading to improvements in the overall performance and robustness of the ensemble model.

  3. Explainability techniques, including gradient-weighted class activation mapping, occlusion sensitivity maps, and LIME, are employed to explain the decisions made by the model. The analysis reveals that motion cues are the most important factors for distinguishing whether an input image is spoofed or not.

  4. Experiments on seven benchmark datasets show that our proposed method provides very competitive performance in comparison with other state-of-the-art generalization methods.

The rest of this work is organized as follows: Sect. 2 discusses recent developments and related past works, highlighting their advantages and disadvantages. Section 3 details the various steps of our proposed method. Section 4 emphasizes the implementation details, ablation study, and comparison against several public benchmark datasets. Section 5 concludes the entire work and provides suggestions for future research.

Table 1 Summary of advantages and disadvantages of appearance-based methods

2 Literature Review

Over the past few years, face PAD methods have received considerable attention from both academia and industry. In general, these methods can be roughly categorized into appearance-based methods, temporal-based methods, and domain generalization methods.

Appearance-based methods: Conventional appearance-based techniques generally rely on extracting hand-crafted or low-level features developed prior to the emergence of deep learning. These methods involve manual design and engineering of algorithms and features, with a primary focus on analyzing static visual attributes like textures, colors, and shapes within an image or frame to make decisions. For instance, Boulkenafet et al. (2016) claimed that color information is essential for effective face presentation attack detection (PAD). They discovered that using luminance-chrominance color spaces enhances detection performance compared to RGB and grayscale representations. These color spaces allowed the method to more effectively exploit the differences in color information between genuine and spoofed faces, thereby improving the overall effectiveness of the detection system. In a subsequent study (Boulkenafet et al., 2016), the authors proposed deriving a new multiscale space for image representation prior to texture feature extraction. This is achieved through the application of three multiscale filtering methods: Gaussian scale space, Difference of Gaussian scale space, and Multiscale Retinex. A robust face spoof detection algorithm using image distortion analysis (IDA) was introduced in Wen et al. (2015), considering features like specular reflection, blurriness, chromatic moment, and color diversity. An ensemble of SVM classifiers, each trained for a different spoof attack (printed photos and replayed videos), was then employed to distinguish genuine faces from spoofed ones. In Yang et al. (2013), a component-based coding framework was proposed. It consists of four steps: locating face components, encoding low-level features for each component, creating a high-level face representation using Fisher criterion-based pooling, and concatenating histograms from all components to form a classifier for identification. Freitas Pereira et al. (2012) advocated the use of the LBP-TOP descriptor to enhance face PAD. This descriptor integrates spatial and temporal information into a cohesive representation. By extending the analysis into the time domain, notable improvements over previous static frame methods were reported. Additionally, in Patel et al. (2016), image distortions were examined using various intensity channels (R, G, B, and grayscale) and across different image regions (entire image, detected face, and the facial area between the nose and chin). In Li and Feng (2019), traditional handcrafted features were combined with convolutional neural networks (CNNs) to enhance face PAD. More recently, a hybrid technique was presented in Muhammad et al. (2019), combining appearance information from two CNNs, with an SVM classifier employed to distinguish between live and spoofed images. While appearance-focused methods exhibit enhanced performance, especially in intra-database testing, each method has its own advantages and disadvantages. We summarize them in Table 1.

Table 2 Summary of advantages and disadvantages of temporal-based methods

Temporal-based methods: Temporal-based methods focus on the temporal dynamics or changes occurring over a sequence of frames or in a video. These methods often involve techniques such as optical flow analysis, 3D CNNs, or recurrent neural networks (RNNs) to capture temporal dependencies and patterns in the data. For instance, in Yin et al. (2016), a dense optical flow scheme was proposed to estimate the motion between successive frames. The authors reported that real and attack videos exhibit different optical flow motion patterns, enhancing the performance of face anti-spoofing detection. In Bharadwaj et al. (2013), a novel method for spoofing detection in face videos was introduced by utilizing motion magnification. It enhances facial expressions through Eulerian motion magnification and proposes two feature extraction algorithms: an improved LBP configuration for texture analysis and a motion estimation technique using the histogram of oriented optical flow (HOOF) descriptor. To enhance the robustness of face recognition systems against spoof attacks, Tirunagari et al. (2015) employed dynamic mode decomposition (DMD) to capture liveness cues, such as blinking and facial dynamics. To address 3D mask presentation attack detection (PAD), remote photoplethysmography (rPPG) was used as an intrinsic cue, unaffected by mask material or quality. In particular, temporal variations of rPPG signals were extracted using a multi-channel time-frequency analysis scheme to enhance discriminability (Liu et al., 2021). A sample-learning-based recurrent neural network (SLRNN) was introduced to capture both appearance and temporal cues in Muhammad et al. (2019). In Chang et al. (2023), a Geometry-Aware Interaction Network (GAIN) using dense facial landmarks with a spatio-temporal graph convolutional network (ST-GCN) was introduced to improve PAD performance. A generalized deep feature representation incorporating both spatial and temporal information using a tailored 3D convolutional neural network was put forward in Li et al. (2018). The network is initially trained with augmented facial samples using cross-entropy loss and further refined with a custom generalization loss serving as regularization. In Liu et al. (2018), a CNN-RNN model was trained with pixel-wise supervision for estimating face depth and sequence-wise supervision for estimating rPPG signals, which were then fused to distinguish live versus spoof faces. In Muhammad and Oussalah (2023), a video processing scheme using a Gaussian weighting function (GWF) was proposed to model long-range temporal variations, followed by a CNN-RNN for PAD detection. Global motion was estimated to compensate for camera motion in Muhammad et al. (2022a), allowing for more detailed analysis of the video content by capturing subtle variations and movements. Following this, a CNN-RNN model was employed to detect PAD. This combination leverages the strengths of both global motion estimation and deep learning models. To further improve the performance of global motion estimation, a dense sampling approach was applied in Muhammad et al. (2022b). Although temporal-focused methods typically demonstrate their effectiveness, their detection performance remains vulnerable to degradation due to real-world variations, such as user demographics, input cameras, and variations in illumination. In addition, each method possesses its own set of advantages and disadvantages, which we summarize in Table 2.

Deep Domain Generalization methods: Deep Domain Generalization (DDG) methods focus on creating models that generalize well to new unseen domains. In particular, the models aim to perform well across a variety of domains without needing to see target domain data during training. For instance, the Domain-invariant Vision Transformer (DiVT) for FAS enhances generalizability through two loss functions. First, a concentration loss helps learn a domain-invariant representation by aggregating features of real face data. Second, a separation loss unifies various attack types from different domains (Liao et al., 2023). Huang et al. (2022) proposed adaptive vision transformers (ViT) with ensemble adapters and feature-wise transformation layers for robust cross-domain face anti-spoofing. In Liu et al. (2024), a novel Class Free Prompt Learning (CFPL) paradigm that employs two lightweight transformers, Content Q-Former (CQF) and Style Q-Former (SQF), was introduced to learn semantic prompts based on content and style features using learnable query vectors for face PAD. In Zhang et al. (2019, 2020), a multi-modal fusion approach was investigated, aiming to enhance generalization by conducting feature re-weighting, emphasizing the most informative channel features while attenuating the less relevant ones within each modality. To extract discriminative fused features, a Partially Shared Multi-modal Network was proposed to learn the fused information from single-modal and multi-modal branches (Liu et al., 2021). For the same purpose, George et al. suggested that incorporating analysis across multiple channels could offer a solution to this problem; hence, they introduced a CNN approach that operates across multiple channels for PAD (George et al., 2019). In Srivatsan et al. (2023), FAS robustness across different situations was improved by connecting visual representations with natural language: aligning image features with textual descriptions enables better performance even when limited data are available. In Wang et al. (2020), it was found that incorporating spatio-temporal information from multiple frames using a Spatio-Temporal Propagation Module (STPM) can help the model generalize better to variations in depth cues across different environments or conditions. An Instance-Aware Domain Generalization framework was introduced in Zhou et al. (2023), aligning features at the instance level without domain labels and employing a Dynamic Kernel Generator and Categorical Style Assembly to extract instance-specific features. Sun et al. (2023) formulate their FAS strategy of separability and alignment (SA-FAS) as a problem of invariant risk minimization (IRM), encouraging domain separability while ensuring uniform live-to-spoof transitions across domains. Inspired by vision-language models, a method known as VL-FAS was introduced in Fang et al. (2024) by leveraging fine-grained natural language descriptions to guide attention towards the face region, resulting in cleaner and more generalized features.

Table 3 Summary of advantages and disadvantages of domain generalization methods

A domain adaptation method that generates pseudo-labeled samples, named the cyclically disentangled feature translation network (CDFTN), was proposed in Yue et al. (2023). Chuang et al. (2023) proposed to improve generalization based on a one-side triplet loss. In Cai et al. (2022), a two-stream network was utilized to fuse the input RGB image, while meta-pattern learning was proposed to improve generalization. In Huang et al. (2022), a cross-adversarial training scheme was proposed to improve generalization by minimizing the correlation among two sets of features. The work reported in Zhou et al. (2022), called generative domain adaptation (GDA), aims to learn a generalized feature space by restyling the target data into the source-domain style. A hypothesis verification framework was proposed in Liu et al. (2022), where two hypothesis verification modules are utilized for improving generalization. In Wang et al. (2022), a novel shuffled style assembly network (SSAN) was introduced by aligning multiple source domains into a stylized feature space, while domain generalization was improved by a contrastive learning strategy. To select a common feature space, adversarial learning was proposed, and aggregation of live faces was performed to achieve a generalized feature space in Liu et al. (2022). Nonetheless, there is no clear consensus that pre-defined distributions can be considered optimal for the feature space. We argue that different domains have their strengths and weaknesses. By combining them, their collective knowledge and diversity can lead to a more comprehensive understanding of faces. However, implementing and fine-tuning such approaches, as discussed in Table 3, might be complex and require careful consideration of alignment methods, ensemble learning techniques, and evaluation strategies to ensure the desired performance improvements are achieved.

Fig. 2

Schematic diagram of the proposed data augmentation and deep ensemble learning for face anti-spoofing countermeasure

3 Methodology

A key idea underlying our approach is to enhance the PAD model’s generalization ability. This objective is achieved through training multiple sub-models using distinct subsets of synthetic data. Specifically, we embark on data augmentation by blending a static image with spatiotemporal encoded images, effectively replicating the visual effect of motion blur through the manipulation of alpha values applied to the static image. This variation in alpha values governs the extent of each image’s influence on the final composite, thereby yielding an array of synthetic data subsets. As a further stride in our methodology, we introduce a meta-model, which capitalizes on the predictions of the base models. In the following sections, we describe all these steps in detail. An overall view of our proposed methodology is presented in Fig. 2.

3.1 Data Augmentation

The process of data augmentation involves five main steps, as follows: (1) Select a source and a target frame of an input video; (2) Estimate the global motion transformation between the source and target frames; (3) Perform the geometric transformation on the source frame to align it with the target frame; (4) Create a new spatiotemporal encoded image by stabilizing the segments of the video; and (5) Use alpha blending to combine the transformed image with the target (static) frame, considering the transparency of each pixel. Each of these steps is explained in the following sub-sections.
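To make the overall flow concrete, the following sketch (Python/NumPy) outlines how these five steps could be orchestrated over a video. The helper functions estimate_rigid_transform and warp are hypothetical placeholders for the operations detailed in the following sub-sections, and the segment length of 40 frames follows the setting used in Sect. 3.1.1.

```python
import numpy as np

N = 40  # frames per non-overlapping segment (Sect. 3.1.1)

def augment_video(frames, alpha=0.5):
    """Produce one blended synthetic image per N-frame segment.

    `frames` is a sequence of frames as NumPy arrays. The helpers
    `estimate_rigid_transform` (returns a 3x3 inter-frame matrix, Eq. (1))
    and `warp` (applies that matrix to a frame) are hypothetical
    placeholders for the operations detailed in Sects. 3.1.1 and 3.1.2.
    """
    synthetic_images = []
    for start in range(0, len(frames) - N + 1, N):          # (1) segment selection
        segment = frames[start:start + N]
        static = segment[0].astype(np.float32)               # first (still) frame of the segment
        M = np.eye(3, dtype=np.float32)                       # accumulated rigid transformation
        accum = np.zeros_like(static)
        prev = segment[0]
        for frame in segment:
            M = estimate_rigid_transform(prev, frame) @ M     # (2) update the global motion estimate
            accum += warp(frame.astype(np.float32), M)        # (3) align the frame with the target
            prev = frame
        encoded = accum / N                                   # (4) spatiotemporal image, Eq. (2)
        blended = alpha * encoded + (1.0 - alpha) * static    # (5) alpha blending, Eq. (4)
        synthetic_images.append(blended)
    return synthetic_images
```

Because the video is divided into fixed-size, non-overlapping segments, the sampling rate can be controlled simply by changing N, as discussed in Sect. 1.4.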

3.1.1 Global Motion Estimation

Suppose a video V is equally divided into P non-overlapping segments, i.e., \(V = \{S_k\}^P_{k=1}\), where \(S_k\) is the k-th segment. The length of each segment is set to \(N=40\) frames. For each segment, features are extracted from the fixed (first) and moving (second) images of the segment. In particular, the FAST feature detector (Rosten & Drummond, 2005) is utilized to detect interest points, and the FREAK descriptor (Alahi et al., 2012) is then used to describe the detected interest points in both frames. The interest points are matched using the Hamming distance (HD). The transformation between frames is calculated from the first frame onward using a rigid (Euclidean) transformation. The rigid transformation preserves lengths, angles, and shapes, and includes translation, rotation, and reflection. In contrast, the affine transformation used in Muhammad et al. (2022a) includes translation, rotation, scaling, and shearing, and preserves parallel lines and ratios of distances, but can change shapes through scaling and shearing. Thus, the rigid transformation for each subsequent frame in the segment is estimated as:

$$\begin{aligned} \begin{bmatrix} a' \\ b' \\ 1 \end{bmatrix} = \begin{bmatrix} d_{11} & d_{12} & d_{13} \\ d_{21} & d_{22} & d_{23} \\ d_{31} & d_{32} & d_{33} \end{bmatrix}\begin{bmatrix} a \\ b \\ 1 \end{bmatrix} = M\begin{bmatrix} a \\ b \\ 1 \end{bmatrix} \end{aligned}$$
(1)

In this equation, \(\begin{bmatrix} a' \ b' \ 1 \end{bmatrix}\) represents the homogeneous coordinates in the fixed image, and \(\begin{bmatrix} a \ b \ 1 \end{bmatrix}\) represents the homogeneous coordinates in the moving image. The rigid transformation matrix M describes the inter-frame motion. To robustly estimate the geometric transformation between the frames, we use the M-estimator Sample Consensus (MSAC) algorithm (Torr & Zisserman, 1997) to detect outliers and remove falsely matched points. To obtain a warped (spatiotemporal) image, accumulation is performed using the following equation (Muhammad et al., 2022b):

$$\begin{aligned} A = \frac{1}{N} \sum _{i=1}^{N} I_i \cdot M, \end{aligned}$$
(2)

where A is a single image that represents the combined motion in the segment after applying the accumulated transformation M, N is the number of frames considered for accumulation, and \(I_i\) represents the i-th frame in the segment. Each frame \(I_i\) is transformed using the accumulated transformation M before being added to the sum, and \(\frac{1}{N}\) is the scaling factor that ensures the final result is an average. In particular, Eq. (2) shows that we take each frame in the segment, apply the accumulated transformation M to it, sum up these transformed frames, and then divide by N to generate a final composite frame A for that particular segment. To show the importance of this approach, we simply remove the accumulated transformation M and alternatively use the following equation:

$$\begin{aligned} B = \frac{1}{N} \sum _{i=1}^{N} I_i. \end{aligned}$$
(3)

This calculates the average frame B by summing all the individual frames in the segment and then dividing that sum by the total number of frames N. The outcome of both equations is illustrated in Fig. 3. The first row shows the noisy camera motion that can be observed after using Eq. (3). The second row shows that the proposed approach based on Eq. (2) significantly reduces the distortion caused by camera motion. Although the camera motion issue is eliminated, black framing appears near the borders of the images in the second row, which requires further preprocessing. In the following sub-section, we address this issue and explain the motivation for data augmentation.
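As an illustration of the feature-based estimation underlying Eq. (1), the sketch below shows how the inter-frame transformation could be computed with off-the-shelf OpenCV components; it could serve as the estimate_rigid_transform placeholder sketched earlier. It assumes the opencv-contrib-python package (for the FREAK descriptor); since OpenCV does not expose MSAC directly, RANSAC is used here as a stand-in robust estimator, and cv2.estimateAffinePartial2D fits a 4-DOF similarity transform as a close proxy for the rigid model described above.

```python
import cv2
import numpy as np

def estimate_rigid_transform(fixed, moving):
    """Estimate a 3x3 inter-frame transformation (cf. Eq. (1)).

    `fixed` and `moving` are expected to be 8-bit grayscale frames.
    FAST detects interest points, FREAK (opencv-contrib) describes them,
    and brute-force Hamming matching pairs them. RANSAC is used here as a
    stand-in for MSAC, and estimateAffinePartial2D fits a similarity
    transform as a proxy for the rigid model.
    """
    detector = cv2.FastFeatureDetector_create()
    descriptor = cv2.xfeatures2d.FREAK_create()

    kp1 = detector.detect(fixed, None)
    kp2 = detector.detect(moving, None)
    kp1, des1 = descriptor.compute(fixed, kp1)
    kp2, des2 = descriptor.compute(moving, kp2)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    src = np.float32([kp2[m.trainIdx].pt for m in matches])   # moving-image points (a, b)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches])   # fixed-image points (a', b')
    A, _inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)

    return np.vstack([A, [0.0, 0.0, 1.0]])   # promote the 2x3 result to homogeneous 3x3 form
```

Warping each frame of a segment with the accumulated transformation and averaging the results then yields the spatiotemporal image A of Eq. (2).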

Fig. 3

a Temporal averaging is performed based on Eq. (3) to visualize the global motion, which exhibits a significant amount of distortion in the encoded image. b The cumulative transformation is calculated by applying the transformation matrix of each frame’s motion to the previous cumulative transformation, according to Eq. (2). c The result of the cumulative transformation with alpha blending is based on Eq. (4). The images in each column are the results of averaging 40 frames

3.1.2 Alpha Transparency

While Eq. (2) effectively removes the disruptive effects of noisy camera motion, the spatiotemporal-encoded images inherently exhibit a certain level of motion blur. This blur arises from the inherent movement between the camera and the scene during each exposure. In particular, if the subjects move their hands or change the direction of their face in the scene while the camera’s shutter is open, their motion appears as a blur in the spatiotemporal-encoded image. This observation has encouraged us to propose a new basic augmentation technique, called alpha transparency, for face PAD. The concept is to assign appropriate alpha values to the first (static) frame of the segment and then blend it with the spatiotemporal-encoded image based on those alpha values. This approach helps preserve more details and finer features of the scene, as the spatiotemporal-encoded image captures different stages of the motion blur. The alpha blending process is achieved through the following two steps: (1) Obtain a source image (i.e., a spatiotemporal-encoded image based on Eq. (2)); and (2) Choose the first (still) frame of each segment as the target image to blend with the source image. Let us assume, without loss of generality, that we blend the source image \(P_1\) over the target image \(P_2\) as follows:

$$\begin{aligned} P_{blended}(a,b) = \alpha P_{1}(a,b) + (1-\alpha )P_{2}(a,b). \end{aligned}$$
(4)

The equation represents how to combine the information from the two source values \(P_{1}(a,b)\) and \(P_{2}(a,b)\) to create a new value at location \((a,b)\) in the synthetic image. The weight factor \(\alpha \) determines the contribution of \(P_{1}(a,b)\) to the blended result. When \(\alpha \) is closer to 1, \(P_{1}(a,b)\) has a higher influence, and when \(\alpha \) is closer to 0, \(P_{2}(a,b)\) has a higher influence. Thus, blending the source image with the target image helps to eliminate the black framing issue, as shown in the third row of Fig. 3, creating a visually seamless transition. Moreover, Fig. 4 displays images that represent the results of blending with different \(\alpha \) values. This approach can be useful in scenarios where someone wants to emphasize the overall motion of the scene while maintaining a recognizable background or main subject. We hypothesize that the combination of alpha blending and spatiotemporal encoding can be beneficial in simulating motion blur and helping the deep learning model become more robust to real-world scenarios where camera motion is present.
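A per-pixel implementation of Eq. (4) is straightforward; the sketch below assumes both images are arrays of the same size, and the listed alpha values simply mirror the synthetic subsets described in Sect. 4.2.

```python
import numpy as np

def alpha_blend(encoded, static, alpha=0.5):
    """Blend the spatiotemporal-encoded image P1 over the static frame P2 (Eq. (4)).

    `alpha` weights the encoded (motion-blurred) image; the remaining
    (1 - alpha) weight comes from the static frame, whose content also fills
    in the black borders left by the geometric transformation.
    """
    p1 = encoded.astype(np.float32)
    p2 = static.astype(np.float32)
    blended = alpha * p1 + (1.0 - alpha) * p2
    # Equivalent OpenCV one-liner: cv2.addWeighted(p1, alpha, p2, 1.0 - alpha, 0.0)
    return np.clip(blended, 0, 255).astype(np.uint8)

# Synthetic subsets as in Sect. 4.2: the encoded clips use alpha = 0.1, and the
# three synthetic sets use 0.5, 1.0, and 1.5 (values above 1.0 make (1 - alpha)
# negative, further emphasising the motion component over the static frame).
synthetic_alphas = [0.5, 1.0, 1.5]
```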

Fig. 4

2D synthetic samples from CASIA-MFSD are shown. In the left column, we present video segments used in the process of data augmentation. In the right column, we display composite images after blending with spatiotemporal images using alpha values of 0.5 (Synt 1), 1.0 (Synt 2), and 1.5 (Synt 3), respectively. The encoded clip is based on alpha values of 0.1. These synthetic samples can be effectively employed for ensemble stacking, resulting in a significant improvement in face anti-spoofing performance

3.2 Deep Ensemble Learning

Deep learning methods based on 2D Convolutional Neural Networks (CNNs) have shown improved performance compared to classical machine learning approaches (Wang et al., 2020; Liu et al., 2021; Shao et al., 2020). However, mainstream 2D CNN frameworks primarily focus on spatial information, thus lacking the capacity to understand sequential data. Moreover, 2D CNNs do not possess a memory mechanism to capture temporal relations. Motivated by the fact that recurrent neural networks (RNNs) can effectively handle temporal information, we develop a stacking-based deep ensemble learning framework to learn from motion blur by processing sequences of images representing motion over time. Ensemble learning encompasses multiple approaches, such as bagging, boosting, and stacking, which result in better generalization of learning models (Fatemifar et al., 2020). Specifically, stacking is an integration technique that combines the predictions of multiple weak models, with a meta-model used to integrate the outputs of the base models (Ganaie et al., 2021). One common approach in stacked ensemble learning is to develop a set of Tier-1 classifiers denoted as \(S_{1}, S_{2}, S_{3}, \ldots , S_{N}\). These classifiers are developed through cross-validation of the training dataset. The outputs or predictions from these Tier-1 classifiers are subsequently employed as inputs for a higher-level classifier or meta-learner within the ensemble (Polikar, 2012).

Since our primary goal is to learn from the variations of motion blur sequences, we train diverse recurrent neural networks to leverage the benefits of the proposed data augmentation mechanism. The approach involves fine-tuning a CNN model on the labeled datasets in the first stage. Then, we extract the fine-tuned features from the pooling layer and utilize them as input to train different variants of RNNs. In our work, three base models, namely Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997), Bidirectional Long Short-Term Memory (BiLSTM) (Schuster & Paliwal, 1997), and Gated Recurrent Unit (GRU) (Cho et al., 2014), are selected. The first base model (LSTM) captures temporal patterns and dynamics across frames using its special memory cells, allowing it to maintain long-range dependencies in sequential data. When presented with a sequence of blurred frames, the LSTM can learn to recognize patterns one by one and retain relationships between consecutive frames. The second base model (BiLSTM) captures information from both past and future contexts in sequential data. It consists of two sets of LSTM cells that work independently to capture information from the past and future directions. The hidden states from both directions are then concatenated or combined to obtain the final output. The third base model (GRU) incorporates gating mechanisms similar to LSTM, enabling it to selectively learn and update information over time.

In particular, this approach encourages each submodel to focus on different aspects of the data, such as variations in temporal blur. For example, one model might specialize in certain features based on synthetic data, while another model may perform better in other aspects. We then combine the predictions from these weak experts (base models) and use them as input to a meta-model (another RNN). The meta-model learns to integrate these predictions and make the final decision. We call it a meta-model because it leverages the diversity and complementary strengths of the individual base models, leading to improved generalization for face PAD. Table 4 illustrates the configuration of the proposed meta-model.
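The stacking scheme can be sketched as follows in PyTorch; the hidden sizes and sequence lengths are illustrative rather than the exact configuration of Table 4, and the base models are assumed to consume sequences of DenseNet-201 pooling-layer features (1920-dimensional), one vector per encoded frame.

```python
import torch
import torch.nn as nn

class RNNBase(nn.Module):
    """One base learner: an RNN over CNN feature sequences with a live/spoof head."""
    def __init__(self, rnn_type, feat_dim=1920, hidden=128, num_classes=2):
        super().__init__()
        bidirectional = rnn_type == "bilstm"
        rnn_cls = nn.GRU if rnn_type == "gru" else nn.LSTM
        self.rnn = rnn_cls(feat_dim, hidden, batch_first=True,
                           bidirectional=bidirectional)
        self.head = nn.Linear(hidden * (2 if bidirectional else 1), num_classes)

    def forward(self, x):                   # x: (batch, seq_len, feat_dim)
        out, _ = self.rnn(x)
        return self.head(out[:, -1])        # logits from the last time step

class MetaModel(nn.Module):
    """Meta-learner: an RNN over the stacked base-model predictions."""
    def __init__(self, num_bases=3, num_classes=2, hidden=16):
        super().__init__()
        self.rnn = nn.GRU(num_classes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, base_probs):          # base_probs: (batch, num_bases, num_classes)
        out, _ = self.rnn(base_probs)
        return self.head(out[:, -1])

# Tier-1 base models (LSTM, BiLSTM, GRU); once trained, their softmax outputs
# form the training set of the meta-model.
bases = [RNNBase("lstm"), RNNBase("bilstm"), RNNBase("gru")]
meta = MetaModel(num_bases=len(bases))

features = torch.randn(8, 5, 1920)          # e.g. DenseNet-201 pooling features per clip
with torch.no_grad():
    stacked = torch.stack([b(features).softmax(dim=-1) for b in bases], dim=1)
prediction = meta(stacked).argmax(dim=-1)    # final live/spoof decision
```

In practice, the base models are trained first on the synthetic subsets and then frozen; only their predictions are used to build the training set of the meta-model.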

Table 4 Base model architectures and their parameters

4 Experimental Analysis Using Open Datasets

Since spoofing attacks have become more realistic and a bigger security concern for face recognition systems, researchers around the world have paid attention to developing more diverse datasets. These datasets include faces captured in various lighting conditions and utilize different spoofing techniques, such as photos, videos, makeup, masks, etc. These diverse datasets can be single modality, focusing on one type of data (e.g., images), or multimodal, combining multiple data types (e.g., images and depth information). To evaluate our model’s performance in detecting face spoofing attacks, we trained and tested it on several state-of-the-art datasets widely used in the face anti-spoofing domain. Details about these datasets are provided in the following section.

4.1 Datasets

Seven diverse databases, including the Wide Multi-Channel Presentation Attack (WMCA) database (George et al., 2019), the CASIA-SURF database (Zhang et al., 2020), the OULU-NPU database (denoted as O) (Boulkenafet et al., 2017), the CASIA-MFSD database (denoted as C) (Zhang et al., 2012), the Idiap Replay-Attack database (denoted as I) (Chingovska et al., 2012), the MSU-MFSD database (denoted as M) (Wen et al., 2015), and the SiW-Mv2 dataset (Guo et al., 2022) were used to conduct the experiments.

The Wide Multi Channel Presentation Attack (WMCA) database comprises 1941 short video recordings, featuring both genuine and presentation attacks from 72 distinct identities. These recordings utilize various channels, including color, depth, infrared, and thermal.

The Chalearn CASIA-SURF dataset is one of the largest face anti-spoofing datasets, consisting of 1000 subjects and 21,000 video clips across three modalities: RGB, Depth, and IR. Each sample in the dataset comprises one live video clip and six spoof video clips, each representing a different attack method.

The OULU-NPU database contains 4950 videos, encompassing both real and attack videos. Two primary presentation attack types were considered in this database: print and video-replay. The videos were recorded from 55 subjects.

CASIA-MFSD consists of a total of 50 subjects, with each subject having 12 videos captured under varying resolutions and lighting conditions. This dataset is developed to include three distinct types of spoof attacks: replay, warp print, and cut print attacks. In particular, CASIA-MFSD comprises 600 video recordings. Out of these, 240 videos from 20 subjects are allocated for training purposes, while the remaining 360 videos from 30 subjects are designated for testing.

The Idiap Replay-Attack database comprises 1300 video clips that consist of photo and video attacks on 50 clients, all conducted under varying lighting conditions. Data for the attacks was gathered in two distinct lighting conditions: controlled, with office lights on, blinds down, and a uniform background; and adverse, with raised blinds, a complex background, and no office lighting.

The MSU-MFSD dataset was constructed with the participation of 35 individuals, resulting in a total of 280 video recordings. The recordings were captured using two distinct camera types with different resolutions \((720\times 480\) and \(640\times 480)\). For the live recordings, every participant contributed two video clips, one recorded with a laptop camera and the other with an Android device. For the video attack instances, two different camera models, specifically iPhone and Canon cameras, were utilized to capture high-definition videos of each subject. Thus, the presence of diverse lighting conditions and a wide array of attack types introduces significant complexity and difficulty when dealing with these datasets.

The SiW-Mv2 dataset includes 785 live videos from 493 subjects and 915 spoof videos from 600 subjects, ranging from common print and replay attacks to various masks, impersonation makeup, and physical material coverings. It is a large-scale face anti-spoofing (FAS) dataset with 14 types of diverse spoofing attacks. For instance, obfuscation makeup and partial coverings are designed to hide a subject’s identity, while impersonation makeup and masks are used to imitate other identities.

4.2 Implementation Details

All the images are resized to \(224 \times 224\) to match the input specifications of the pretrained DenseNet-201 architecture (Huang et al., 2017). Fine-tuning of the CNN model is carried out using the Stochastic Gradient Descent (SGD) optimizer, with a mini-batch size of 32 and a validation check performed every 30 iterations. The learning rate is set at 0.0001. Rather than training for a fixed number of epochs, we implement an early stopping mechanism (Prechelt, 1998) to automatically halt training and prevent overfitting. During the ensemble learning phase, the CNN model is fine-tuned with spatiotemporal-encoded video clips (alpha value of 0.1) and with each of the three distinct synthetic sets individually. These subsets introduce random expansion to the training images through alpha values of 0.5, 1.0, and 1.5, as defined in Eq. (4). Additional data augmentation techniques are utilized, such as rotation within the range of \(-20\) to \(+20\) degrees, as well as random translations along the x and y axes. Subsequently, the features from the fine-tuned model are extracted and used as inputs to train an LSTM, a BiLSTM, and a GRU.
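A sketch of this fine-tuning stage in PyTorch/torchvision is given below; the early-stopping patience and the translation range are illustrative assumptions, while the input size, optimizer, learning rate, mini-batch size, and 30-iteration validation interval follow the settings above.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Input pipeline: 224x224 inputs plus the extra augmentations described above.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(20),                              # rotation within +/-20 degrees
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # random x/y translation (illustrative range)
    transforms.ToTensor(),
])

model = models.densenet201(weights="IMAGENET1K_V1")             # torchvision >= 0.13
model.classifier = nn.Linear(model.classifier.in_features, 2)   # live vs. spoof head
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

def fine_tune(model, train_loader, val_loader, patience=5, val_every=30):
    """Fine-tune with SGD (mini-batch 32) and stop early when validation loss stalls."""
    criterion = nn.CrossEntropyLoss()
    best_val, stale, step = float("inf"), 0, 0
    while stale < patience:
        model.train()
        for images, labels in train_loader:                     # dataloader with batch_size=32
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            step += 1
            if step % val_every == 0:                           # validation check every 30 iterations
                model.eval()
                with torch.no_grad():
                    val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
                model.train()
                stale = 0 if val_loss < best_val else stale + 1
                best_val = min(best_val, val_loss)
                if stale >= patience:
                    return model
    return model
```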

4.3 Evaluation Metrics

Various standard metrics are utilized in this study for comparison, such as Equal Error Rate (EER), Attack Presentation Classification Error Rate (APCER), Bona Fide Presentation Classification Error Rate (BPCER), and Average Classification Error Rate (ACER). The EER is often used in biometric systems to represent the point where the false acceptance rate (FAR) and the false rejection rate (FRR) are equal. BPCER measures the rate at which genuine presentations are incorrectly classified as attacks, while APCER measures the rate at which attack presentations (fake attempts) are incorrectly classified as genuine. ACER is the average of APCER and BPCER, providing a balanced view of the system’s performance in detecting both attacks and genuine attempts. Additionally, Half Total Error Rate (HTER) is reported, which is the average of FAR and FRR after setting a specific threshold based on the validation set. The Area Under the Curve (AUC) is also reported, measuring the model’s overall ability to distinguish between classes, with the area under the ROC curve representing performance across all thresholds.
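For reference, the sketch below shows how these metrics can be computed from raw classification scores using NumPy and scikit-learn; conventions for choosing the operating threshold vary across papers, so this is one common formulation rather than the exact evaluation code used here.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def pad_metrics(labels, scores, threshold):
    """labels: 1 = bona fide, 0 = attack; scores: higher means more likely bona fide."""
    bona, attack = scores[labels == 1], scores[labels == 0]
    apcer = np.mean(attack >= threshold)      # attacks accepted as bona fide
    bpcer = np.mean(bona < threshold)         # bona fide rejected as attacks
    acer = (apcer + bpcer) / 2                # average of the two error rates
    return apcer, bpcer, acer

def eer_and_auc(labels, scores):
    """EER at the operating point where FAR ~= FRR, plus the threshold-free AUC."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    frr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - frr))
    eer = (fpr[idx] + frr[idx]) / 2
    return eer, roc_auc_score(labels, scores), thresholds[idx]

def hter(labels, scores, threshold):
    """HTER on the target set at a threshold fixed beforehand on the validation set."""
    far = np.mean(scores[labels == 0] >= threshold)
    frr = np.mean(scores[labels == 1] < threshold)
    return (far + frr) / 2
```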

Table 5 Performance comparison on the WMCA dataset during intra-dataset testing (%)
Table 6 Performance comparison on the CASIA-SURF dataset during intra-dataset testing (%)
Table 7 Ablation study using cross-database evaluation

4.4 Comparison Against the State-of-the-Art Methods

To assess the effectiveness of the proposed method, we conducted comparisons with some of the most representative state-of-the-art methods. Tables 5 and 6 present the performance evaluation on WMCA and CASIA-SURF datasets, for intra-dataset testing scenarios. The bold values in the tables indicate the best reported results.

Intra-Testing Results: Table 5 presents a comparison of the performance of our proposed method on the WMCA dataset with other CNN-based approaches (George & Marcel, 2019; Liu et al., 2021; He et al., 2016; Tolstikhin et al., 2021; Wang et al., 2022) and a Transformer-based method (Antil & Dhiman, 2024). Similar to the previous approach described in Antil and Dhiman (2024), our proposed meta-model achieves the best performance with the lowest ACER (0.1%) in the intra-testing scenario. Furthermore, the meta-model has a slightly better BPCER (0.0%) compared to the T-Encoder, which indicates perfect recognition of bona fide presentations. Table 6 provides the performance comparison on the CASIA-SURF dataset. The comparison methods (Liu et al., 2021; Zhang et al., 2019, 2020; Yu et al., 2023) employ multimodal approaches. Among these, the proposed meta-model stands out for achieving the lowest ACER of \(0.5\%\). The experimental findings validate that our performance is consistent with other leading benchmarks. Thus, stacking-based ensemble learning is well-suited for the intra-testing scenario.

Table 8 Performance evaluation using MSU-MFSD (M), Idiap (I), CASIA (C), and OULU-NPU (O) databases
Table 9 The evaluation on SiW-M protocol II: Unknown spoof detection

Cross-Dataset Testing: Since our work focuses on improving the generalization of face PAD, we initially conducted cross-dataset testing experiments between two datasets, namely, the WMCA and CASIA-SURF datasets. In particular, we begin by training the model using the CASIA-SURF dataset and then assess its performance on the WMCA dataset, focusing on HTER. Similarly, we invert the process by training the model on the WMCA dataset and evaluating it on the CASIA-SURF dataset. Table 7 showcases the comparison of performances, highlighting the meta-model’s attainment of state-of-the-art results on the WMCA dataset. In particular, when trained on CASIA-SURF and tested on WMCA, the meta-model shows an error rate of \(17.76\%\), indicating its robustness in cross-database evaluation. When trained on WMCA and tested on CASIA-SURF, ViT+AMA (Yu et al., 2023) achieves the lowest error rate of \(8.60\%\), making it the best performer in this scenario.

We extend our experiments to the commonly employed cross-dataset testing scenario, where the model is trained on three source databases and evaluated on a completely unseen database using the leave-one-out (LOO) strategy. Specifically, four datasets, namely the OULU-NPU database (denoted as O), the CASIA-MFSD database (denoted as C), the Idiap Replay-Attack database (denoted as I), and the MSU-MFSD database (denoted as M), are employed in various combinations: O&C&I to M, O&M&I to C, O&C&M to I, and I&C&M to O. Table 8 presents a performance comparison with recently proposed state-of-the-art methods in terms of HTER and AUC. It can be observed that the meta-model provides the best results for three protocols: O&C&I to M, O&M&I to C, and I&C&M to O. For O&C&I to M, the meta-model has the lowest HTER (1.20). Similarly, the meta-model has the lowest HTER (1.37) for O&M&I to C. In the case of I&C&M to O, the meta-model has the lowest HTER (2.08). However, DiVT-M performs best with the lowest HTER (3.71), outperforming the meta-model in O&C&M to I. Overall, our proposed method demonstrates itself as the most effective method across the majority of the evaluated scenarios, showcasing its robustness and superior performance in cross-database face presentation attack detection.

We also provide a more comprehensive evaluation of the classifier’s performance using the Area Under the ROC Curve (AUC), shown in Table 8. Although the meta-model yields a comparatively higher error rate on one database (i.e., Replay-Attack), it still achieves more than 99% AUC on all the datasets. While EER is calculated on the testing sets of the source databases, HTER is computed at the operating point (threshold) where the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are equal on the validation set. If the AUC is high and the HTER is low, it indicates that the classifier achieves a good balance between FAR and FRR. Since AUC is not threshold-dependent, it is more useful for comparing different classifiers or evaluating the model’s generalization capability.

To quantitatively assess various types of spoofing attacks, we implement 13 leave-one-out testing protocols on the SiW-M dataset. According to Liu et al. (2019), the model is trained using 12 types of spoof attacks along with 80% of the live videos, and evaluated on the remaining attack type and the other 20% of live videos. The results are analyzed in Table 9 with two of the most recent face anti-spoofing methods (Liu et al., 2019, 2020), using (Liu et al., 2018) as the baseline, as it has demonstrated state-of-the-art performance across various benchmarks. In comparison with other methods, the meta-model performs best on several attacks, including Replay Attack, Print Attack, Paper 3D Mask, Paper Glasses Partial Attack, and Partial Fun Eye Glasses Attack. Specifically, it achieves the lowest APCER (0.1), BPCER (7.8), ACER (3.9), and EER (1.1) for Replay Attack. For Print Attack, it consistently has the lowest BPCER (2.5), ACER (1.2), and EER (0.6). In the case of Paper 3D Mask, it has the lowest APCER (0.0) and ACER (3.2), along with the second-lowest BPCER (6.4). Similarly, it records the lowest BPCER (0.0), ACER (0.0), and EER (0.0) for Paper Glasses Partial Attack, and the lowest BPCER (0.0), ACER (14.8), and EER (10.9) for Fun Partial Attack. Thus, the meta-model demonstrates superior performance, particularly excelling in Print and Paper Glasses attacks. It achieves the lowest average BPCER (6.6 ± 5.9), ACER (13.7 ± 10.8), and EER (9.3 ± 7.4), indicating its robustness across various attack types.

Table 10 The results of cross-dataset testing on limited source domains

4.5 Experiment on Limited Source Domains

Our study also investigates the scenario of a limited source domain by training the model on only two source domains, as opposed to the three domains used in Table 8. It can be seen from the results of Table 10 that the model continues to exhibit superior performance across the target domains. In particular, the model achieves the lowest Half Total Error Rate (HTER) in four protocols and the highest Area Under the Curve (AUC) score on all target domains. This outcome emphasizes that even with limited source data, the stacking ensemble approach maintains its robust generalization capability. Overall, this finding is significant as it demonstrates the effectiveness of the stacking ensemble approach in adapting to scenarios where a limited amount of source data is available.

4.6 Ablation Studies

In order to assess the effectiveness of our proposed synthetic data generation and ensemble learning, we conducted separate experiments on four datasets in cross-testing scenarios. The results of these experiments are detailed in Table 11. The results without augmentation represent the performance of the deep DenseNet-201 model (Huang et al., 2017) on stabilized encoded training samples with an alpha value of 0.1. Subsequently, we gradually introduce synthetic training samples and assess the performance at different alpha values. Initially, with Synthetic set1 (alpha value = 0.5), the model demonstrates a slight improvement in performance, evident in a higher AUC and reduced HTER compared to training without augmentation. For instance, in the O&C&I to M scenario, the HTER decreased from 19.02 to 18.11, and the AUC increased from 86.12 to 87.63.

Next, we assess the performance using Synthetic set2 (alpha value = 1.0). One can see that the model continues to show enhanced performance. The HTER improved by approximately 1.3% in the O&M&I to C scenario, reducing from 19.52 to 17.20. We then evaluate the third synthetic subset, Synthetic set3 (alpha value = 1.5). This set shows consistent improvements across most scenarios. For example, in the O&C&I to M scenario, the HTER decreased to 17.21, and the AUC improved to 90.87. When combining all synthetic sets, the model achieves significant performance gains. The combined sets result in an overall reduction in HTER and an increase in AUC across all scenarios. For instance, the HTER in the O&M&I to C scenario drops to 15.20, and the AUC rises to 93.90. We also investigated an alpha value of 2.0, but this led to a decline in the model’s performance. Hence, we choose to report the performance for only three alpha values.

Finally, we assess the performance after introducing LSTM, BiLSTM, and GRU layers without using the meta-model. Each of these architectures further enhances the model's performance: CNN-LSTM achieves an HTER of 8.80 and an AUC of 99.38 in the O&C&I to M scenario, CNN-BiLSTM yields an HTER of 2.33 and an AUC of 99.98 in the O&M&I to C scenario, and CNN-GRU achieves an HTER of 8.04 and an AUC of 94.02 in the O&C&M to I scenario. For the I&C&M to O scenario, CNN-BiLSTM shows an improved HTER of 12.28 and an AUC of 99.52, outperforming the other RNNs.
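The sketch below shows a minimal PyTorch version of such a CNN-recurrent pipeline, assuming DenseNet-201 features are pooled per frame and passed to an LSTM, BiLSTM, or GRU head; hidden sizes, the two-class output, and pretrained-weight handling are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNRecurrentPAD(nn.Module):
    """DenseNet-201 frame encoder followed by a recurrent head (sketch)."""

    def __init__(self, rnn_type="lstm", hidden=256, num_classes=2):
        super().__init__()
        backbone = models.densenet201(weights=None)   # pretrained weights in practice
        self.encoder = nn.Sequential(backbone.features,
                                     nn.ReLU(inplace=True),
                                     nn.AdaptiveAvgPool2d(1))
        feat_dim = backbone.classifier.in_features    # 1920 for DenseNet-201
        bidirectional = rnn_type == "bilstm"
        rnn_cls = nn.GRU if rnn_type == "gru" else nn.LSTM
        self.rnn = rnn_cls(feat_dim, hidden, batch_first=True,
                           bidirectional=bidirectional)
        self.head = nn.Linear(hidden * (2 if bidirectional else 1), num_classes)

    def forward(self, clips):                          # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1))      # (B*T, 1920, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)        # (B, T, 1920)
        out, _ = self.rnn(feats)
        return self.head(out[:, -1])                   # classify from the last step

model = CNNRecurrentPAD(rnn_type="bilstm")
logits = model(torch.randn(2, 8, 3, 224, 224))         # two clips of 8 frames
```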

Table 11 Ablation study using cross-database evaluation

4.7 Discussion

The quantitative results in Table 11 highlight the critical role of synthetic images in training CNN models effectively. When the CNN is trained on all synthetic subsets, there is a notable improvement in performance across various datasets: up to 3% for M, 4% for C, 4% for I, and 6% for O. These gains are particularly significant given the challenging nature of cross-dataset scenarios. Motivated by these initial improvements, the study explores ensemble learning to fully leverage the variations present in synthetic data. Specifically, LSTM is integrated with CNN to predict temporal blur, which further enhances the model’s performance across all datasets. This approach capitalizes on temporal information, allowing the model to better differentiate between genuine and spoofed images. Similarly, employing BiLSTM and GRU models results in substantial performance improvements, particularly notable on the C and O datasets.

However, the generalization ability of the individual RNN models (LSTM, BiLSTM, and GRU) remains somewhat limited. To address this, the study introduces a meta-model designed to combine the strengths and mitigate the weaknesses of these base models. The meta-model is trained on a new training set formed from the predictions of the base models; as another RNN, it learns to optimally weigh and combine those predictions, leading to enhanced overall performance. We argue that ensemble learning guided by the video distillation scheme is highly beneficial for improving cross-domain face PAD performance. In particular, the proposed data augmentation technique, which relies on synthetic images, helps the base models learn from the diversity within the data. Temporal inconsistencies or blurriness, which might otherwise be considered noise, are leveraged as valuable features for distinguishing genuine images from spoofed ones. Thus, this approach ensures that the final test predictions made by the meta-model are robust and reliable.
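A schematic sketch of this stacking step is given below, assuming each base model emits per-time-step liveness scores that are stacked and fed to a small GRU meta-learner; shapes, layer sizes, and names are hypothetical rather than the exact setup of the paper.

```python
import torch
import torch.nn as nn

class MetaModel(nn.Module):
    """GRU meta-learner over stacked base-model predictions (sketch)."""

    def __init__(self, num_base_models=3, hidden=16, num_classes=2):
        super().__init__()
        self.rnn = nn.GRU(num_base_models, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, base_preds):                 # (B, T, num_base_models)
        out, _ = self.rnn(base_preds)
        return self.head(out[:, -1])

# Stage 1: collect per-time-step liveness scores from the frozen base models
# (e.g. CNN-LSTM, CNN-BiLSTM, CNN-GRU), stacked into shape (batch, time, 3).
# Stage 2: train the meta-model on those stacked predictions (dummy data below).
preds = torch.rand(4, 8, 3)                        # placeholder base predictions
labels = torch.randint(0, 2, (4,))
meta = MetaModel()
loss = nn.CrossEntropyLoss()(meta(preds), labels)
loss.backward()
```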

Table 12 Average execution time in seconds

4.8 Comparisons of Execution Times

We compare the execution time of the proposed video distillation technique with previous global motion estimation methods (Muhammad et al., 2022a, b) and optical flow (Horn & Schunck, 1981). Table 12 reports the total number of seconds needed to generate the training samples based on Eq. (4) with an alpha value of 0.1 on two datasets. All comparisons were run in a MATLAB environment on a workstation with a 3.5 GHz Intel Core i7-5930K and 64 GB of RAM. The proposed global motion estimation technique is computationally less expensive than the motion estimation methods reported recently in the literature. This is because the FAST (Rosten & Drummond, 2005) feature detector is designed with a focus on computational efficiency and speed, while FREAK (Alahi et al., 2012) is intended to work in combination with fast detectors such as FAST, providing a matching mechanism that is both fast and robust.
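The sketch below illustrates a generic FAST + FREAK pipeline for estimating global motion between two consecutive frames using OpenCV (FREAK is provided by the opencv-contrib package); it is a Python reconstruction of this kind of pipeline, not our MATLAB implementation.

```python
import cv2
import numpy as np

def estimate_global_motion(prev_gray, curr_gray):
    """Estimate a 4-DoF global motion model between two frames with FAST + FREAK."""
    fast = cv2.FastFeatureDetector_create(threshold=20)
    freak = cv2.xfeatures2d.FREAK_create()          # requires opencv-contrib-python

    kp1 = fast.detect(prev_gray, None)
    kp2 = fast.detect(curr_gray, None)
    kp1, des1 = freak.compute(prev_gray, kp1)
    kp2, des2 = freak.compute(curr_gray, kp2)

    # Binary descriptors are matched with the Hamming distance.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Robustly fit rotation, scale, and translation between the matched points.
    transform, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    return transform
```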

Fig. 5 Displaying feature maps through visualization. The types of images are labelled in the first column. The second column shows the original encoded and synthetic images. The third column illustrates the feature maps from Grad-CAM (Selvaraju et al., 2017), while the fourth column shows the feature maps from occlusion sensitivity maps (Zeiler & Fergus, 2014). Similarly, the fifth and sixth columns visualize the feature maps from the gradient attribution map using guided backpropagation (Springenberg et al., 2014) and Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016), respectively. The last column shows the masked images obtained from LIME predictions

4.9 Interpretation of the Deep Neural Network

Interpretation is essential for observing which patterns in the data the model relies on, yet there is no clear consensus on how interpretability should be defined in the context of machine learning (Molnar et al., 2020). Although explanation methods aim to make neural networks more trustworthy and interpretable, the “black-box” nature of deep neural networks can make it difficult to determine precisely why a particular decision was made. In our setting, synthetic samples provide additional variations of the data and lead to more interpretable behavior than the same model trained without them. We attribute this improvement to the motion cues naturally available in frame sequences, which are “easy to learn” for the model and play an important role in its optimization. Consequently, interpretation methods have become increasingly popular and have led to useful and promising findings.

Fig. 6 The t-SNE visualization of feature distributions in cross-testing scenarios. a shows the feature distribution of the original encoded video clips, b the feature distribution of encoded video clips with a subset of synthetic samples, and c the feature distribution of our meta-model

Fig. 7 The Receiver Operating Characteristic (ROC) curves for a O&C&I to M, b O&M&I to C, c O&C&M to I, and d I&C&M to O, constructed by plotting the true positive rate (TPR) against the false positive rate (FPR)

In our work, we utilize Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017), Occlusion Sensitivity Maps (OCC-SEN) (Zeiler & Fergus, 2014), Gradient Attribution Maps using Guided Backpropagation (Grad-ATT) (Springenberg et al., 2014), and Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016) to understand which patterns in the data are deemed important and contribute to the final decision. These methods allow us to verify the behavior of the developed deep learning model and/or further tune it by observing its interpretations. Specifically, we extract visualization maps from the pretrained DenseNet-201 (Huang et al., 2017) convolutional neural network for all of the above methods. In Fig. 5, we visualize diverse synthetic images from the CASIA-FASD dataset: the first four rows show print attack images, while the next four rows show replay attack images. Each visualization method captures the class-discriminative region, thanks to the proposed synthetic data generation scheme that allows the network to use more subtle cues for correct classification. In particular, the first row shows that the neurons in the deep convolutional layers focus on the paper’s texture and hand movement cues. However, the Grad-ATT (Springenberg et al., 2014) interpretation shows that the model also uses the background as context to make the prediction. Notably, this issue is eliminated by the proposed synthetic data generation scheme: the second, third, and fourth rows show that the model considers only motion cues and surface edges while barely attending to the background context.

In the case of a replay attack, the remaining rows show that the tablet screen and hand movement provide discriminative information for the model’s prediction. While we cannot present this for every image in the dataset, we observe that the mouth region, eye blinking, and head rotation contribute positively to distinguishing live from spoofed images. Thus, the interpretations from the above methods demonstrate that the proposed learning model focuses on the correct features of the input data, and its decisions can be viewed in a human-understandable way. Moreover, the proposed synthetic data generation method provides informative RGB images and helps the model make the features of spoofed faces more dispersed, allowing better class discrimination that generalizes well to the target domain.
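As one example of how such attribution maps can be produced, the sketch below implements a bare-bones Grad-CAM for a DenseNet-201 classifier using forward and backward hooks; it is a generic illustration under these assumptions, not the exact visualization code used in our experiments.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, target_layer, image, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's activations by pooled gradients."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    logits = model(image)                                   # (1, num_classes)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)     # global-average-pooled grads
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    h1.remove()
    h2.remove()
    return (cam / (cam.max() + 1e-8)).squeeze().detach()

model = models.densenet201(weights=None).eval()             # pretrained weights in practice
heatmap = grad_cam(model, model.features.denseblock4, torch.randn(1, 3, 224, 224))
```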

4.10 Visualization and Analysis

To visually illustrate the individual contribution of each model, we employ t-SNE (t-Distributed Stochastic Neighbor Embedding) to analyze the distribution of different features, as depicted in Fig. 6. Initially, the model is trained on the O&C&I source domains without incorporating synthetic samples, resulting in the poorly separated distribution shown in Fig. 6a. In this representation, the boundary between live and spoofed samples is indistinct, and the overlapping regions can lead to misclassifications, thereby degrading overall performance.

When synthetic samples are included, as depicted in Fig. 6b, the feature distribution improves, offering a clearer separation than the model trained without synthetic samples. This enhancement is attributed to the synthetic samples aiding the model in recognizing spatiotemporal artifacts. Furthermore, with the introduction of the meta-model, we observe a well-structured and compact distribution with a clearly defined boundary in Fig. 6c. As a result, our proposed ensemble learning approach exhibits strong generalizability when applied to unseen target data.
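For reference, plots of this kind can be generated from penultimate-layer embeddings with scikit-learn's t-SNE implementation, as in the minimal sketch below; the feature array, labels, and hyperparameters are placeholders rather than our exact settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: (N, D) penultimate-layer embeddings; labels: 1 = live, 0 = spoof.
features = np.random.rand(500, 1920)            # placeholder for real embeddings
labels = np.random.randint(0, 2, 500)

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)

for cls, name in [(1, "live"), (0, "spoof")]:
    pts = embedded[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=5, label=name)
plt.legend()
plt.title("t-SNE of face PAD features")
plt.show()
```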

In Fig. 7, we use ROC curves to visually represent the model’s ability to differentiate between real and attack classes. As demonstrated in Fig. 7, the meta-model consistently achieves an AUC (Area Under the Curve) above 90% across all datasets, showing an impressive level of performance on previously unseen testing sets. Each ROC curve plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis; curves closer to the top-left corner indicate superior performance in distinguishing between the classes, which is where the meta-model (ensemble) consistently lies.
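A short sketch of how such curves and the associated AUC (together with the HTER used elsewhere in our evaluation) can be computed with scikit-learn is given below; the scores and labels are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# scores: higher means 'live'; labels: 1 = live, 0 = attack (placeholders).
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(200, int), np.zeros(200, int)])
scores = np.concatenate([rng.normal(0.75, 0.15, 200), rng.normal(0.35, 0.15, 200)])

fpr, tpr, thresholds = roc_curve(labels, scores)
roc_auc = auc(fpr, tpr)

# HTER at a chosen operating point: average of false accept and false reject rates.
idx = np.argmin(np.abs(thresholds - 0.5))
hter = (fpr[idx] + (1.0 - tpr[idx])) / 2.0

plt.plot(fpr, tpr, label=f"meta-model (AUC = {roc_auc:.3f}, HTER = {hter:.3f})")
plt.plot([0, 1], [0, 1], "k--", linewidth=0.8)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```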

5 Conclusions

In this paper, we addressed the domain generalization issue in face presentation attack detection (PAD) by proposing a novel approach that combines data augmentation and deep ensemble learning. By observing multiple blurred sequences, the base models were able to learn sequential patterns of motion blur and infer how objects move over time. This data augmentation technique improves the robustness of the models and their ability to handle the variations in motion blur present in real-world scenarios. It not only enriches the training data but also addresses the black-framing issue that can arise during feature-based global motion estimation. The experimental results show that LSTM, BiLSTM, and GRU still face limitations in certain scenarios. To improve overall generalization, we introduced a meta-model that leverages the strengths of the different base models; this ensemble approach allows the model to benefit from the diverse representations learned by the individual base models. Across seven benchmark datasets, the meta-model achieves competitive performance on all of them.

Finally, the interpretation of the model shows that motion cues (e.g., temporal information or motion patterns) are helpful in improving the model’s generalization ability. We conclude that the effectiveness of a meta-model depends on the diversity and quality of the base models: if the base models suffer from similar limitations or biases, the meta-model may not provide significant improvements. Although the proposed method improves generalization, we observe two disadvantages. First, since the meta-model requires an additional training phase on the outputs of the base models, it introduces an extra layer of computational complexity and cost. Second, the proposed data augmentation may not encode all fine details, especially when the background is non-static. Therefore, future research should explore new approaches and enhancements to address these limitations and advance video summarization methods for non-static background videos.