1 Introduction

With the increasing influence of smart devices in our daily lives, people are seeking secure and convenient ways to access their personal information. Biometrics, such as face, fingerprint, and iris, are widely utilized for person authentication due to their intrinsic distinctiveness and convenience of use. Face, as one of the most popular modalities, has received increasing attention in academia and industry in recent years (e.g., iPhone X). However, this attention also brings a growing incentive for hackers to design biometric presentation attacks (PA), or spoofs, to be authenticated as the genuine user. Because the human face can be accessed at almost no cost, a spoof face can be as simple as a printed photo (i.e., print attack) or a digital image/video (i.e., replay attack), or as complicated as a 3D mask or facial cosmetic makeup. With proper handling, those spoofs can be visually very close to the genuine user’s live face. This calls for robust face anti-spoofing algorithms.

As the most common spoofs, print attack and replay attack have been well studied from different perspectives. The cue-based methods aim to detect liveness cues [1, 2] (e.g., eye blinking, head motion) to classify live videos, but these methods can be fooled by video replay attacks. The texture-based methods attempt to compare the texture difference between live and spoof faces, using pre-defined features such as LBP [3, 4] and HOG [5, 6]. Similar to texture-based methods, CNN-based methods [2, 7, 8] design a unified process of feature extraction and classification. With a softmax-loss-based binary supervision, they risk overfitting on the training data. Regardless of the perspective, almost all the prior works treat face anti-spoofing as a black-box binary classification problem. In contrast, we propose to open the black box by modeling the process of how a spoof image is generated from its original live image.

Fig. 1. The illustration of face spoofing and anti-spoofing processes. De-spoofing process aims to estimate a spoof noise from a spoof face and reconstruct the live face. The estimated spoof noise should be discriminative for face anti-spoofing.

Our approach is motivated by the classic image de-X problems, such as image de-noising and de-blurring [9,10,11,12]. In image de-noising, the corrupted image is regarded as a degradation from the additive noise, e.g., salt-and-pepper noise and white Gaussian noise. In image de-blurring, the uncorrupted image is degraded by motion, which can be described as a process of convolution. Similarly, in face anti-spoofing, the spoof image can be viewed as a re-rendering of the live image but with some “special” noise from the spoof medium and the environment. Hence, the natural question is, can we recover the underlying live image when given a spoof image, similar to image de-noising?

Yes. This paper shows “how” to do this. We call the process of decomposing a spoof face into a spoof noise pattern and a live face Face De-spoofing, shown in Fig. 1. Similar to the previous de-X works, the degraded image \(\mathbf {x}\in \mathbb {R}^{m}\) can be formulated as a function of the original image \(\mathbf {\hat{x}}\), the degradation matrix \(\mathbf {A}\in \mathbb {R}^{m\times m}\) and an additive noise \(\mathbf {n}\in \mathbb {R}^{m}\).

$$\begin{aligned} \mathbf {x}=\mathbf {A}\mathbf {\hat{x}} + \mathbf {n} = \mathbf {\hat{x}} + (\mathbf {A}-\mathbb {I}) \mathbf {\hat{x}} + \mathbf {n} = \mathbf {\hat{x}} +N(\mathbf {\hat{x}}), \end{aligned}$$
(1)

where \(N(\mathbf {\hat{x}})=(\mathbf {A}-\mathbb {I}) \mathbf {\hat{x}} + \mathbf {n}\) is the image-dependent noise function. Instead of solving for \(\mathbf {A}\) and \(\mathbf {n}\), we estimate \(N(\mathbf {\hat{x}})\) directly since it is more solvable under the deep learning framework [13,14,15,16,17]. Essentially, by estimating \(N(\mathbf {\hat{x}})\) and \(\mathbf {\hat{x}}\), we aim to peel off the spoof noise and reconstruct the original live face. Likewise, if given a live face, the face de-spoofing model should return the face itself plus zero noise. Note that our face de-spoofing is designed to handle print attack, replay attack and possibly make-up attack, but our experiments are limited to the first two PAs. The benefits of face de-spoofing are twofold: (1) it reverses, or undoes, the spoofing generation process, which helps us to model and visualize the spoof noise pattern of different spoof mediums; (2) the spoof noise itself is discriminative between live and spoof images and hence is useful for face anti-spoofing.
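To make the rearrangement in Eq. 1 concrete, the following minimal sketch checks it numerically on a toy flattened image. The matrix `A`, the noise `n`, and the image size are random placeholders chosen only for illustration; they are not values from this work.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 16                                    # number of pixels in the flattened toy image
x_live = rng.random(m)                    # stand-in for the live image x_hat
# Hypothetical degradation matrix close to identity, and small additive noise.
A = np.eye(m) + 0.05 * rng.standard_normal((m, m))
n = 0.01 * rng.standard_normal(m)

x_spoof = A @ x_live + n                  # first form of Eq. 1
noise = (A - np.eye(m)) @ x_live + n      # image-dependent noise N(x_hat)

# Both forms of Eq. 1 agree, and subtracting the noise recovers the live image.
assert np.allclose(x_spoof, x_live + noise)
assert np.allclose(x_spoof - noise, x_live)
```

The check only confirms the algebra; in practice neither `A` nor `n` is known, which is why the noise is estimated directly.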

While face de-spoofing shares the same challenges as other image de-X problems, it has a few distinct difficulties to conquer:

No Ground Truth: Image de-X works often use synthetic data where the original undegraded image could be used as ground truth for supervised learning. In contrast, we have no access to \(\mathbf {\hat{x}}\), which is the corresponding live face of a spoof face image.

No Noise Model: There is no comprehensive study and understanding about the spoof noise. Hence it is not clear how we can constrain the solution space to faithfully estimate the spoof noise pattern.

Diverse Spoof Mediums: Each type of spoof utilizes a different spoof medium for generating spoof images. Each spoof medium represents a specific type of noise pattern.

To address these challenges, we propose several constraints and supervisions based on our prior knowledge and the conclusions from a case study (in Sect. 3.1). Given that a live face has no spoof noise, we impose the constraint that \(N(\mathbf {\hat{x}})\) of a live image is zero. Based on our study, we assume that the spoof noise of a spoof image is ubiquitous, i.e., it exists everywhere in the spatial domain of the image; and is repetitive, i.e., it is the spatial repetition of certain noise in the image. The repetitiveness can be encouraged by maximizing the high-frequency magnitude of the estimated noise in the Fourier domain.

With such constraints and auxiliary supervisions proposed in [18], a novel CNN architecture is presented in this paper. Given an image, one CNN is designed to synthesize the spoof noise pattern and reconstruct the corresponding live image. In order to examine the reconstructed live image, we train another CNN with auxiliary supervision and a GAN-like discriminator in an end-to-end fashion. These two networks are designed to ensure the quality of the reconstructed image regarding its discriminativeness between live and spoof, and the visual plausibility of the synthesized live image.

To summarize, the main contributions of this work include:

  • We offer a new perspective for detecting the spoofing face from print attack and replay attack by inversely decomposing a spoof face image into the live face and the spoofing noise, without having the ground truth of either.

  • A novel CNN architecture is proposed for face de-spoofing, where appropriate constraints and auxiliary supervisions are imposed.

  • We demonstrate the value of face de-spoofing by its contribution to face anti-spoofing and the visualization of the spoof noise patterns.

2 Prior Work

We review the most relevant prior works to ours from two perspectives: texture-based face anti-spoofing and de-X problems.

Texture-Based Face Anti-spoofing. Texture analysis is widely adopted in face anti-spoofing as well as other computer vision tasks [19, 20], where defining an effective feature representation is the key endeavor. Early works apply hand-crafted feature descriptors, such as LBP [3, 4, 21], HOG [5, 6], SIFT [22] and SURF [23], to project the faces to a low-dimensional embedding. However, those hand-crafted features are not specifically designed to capture the subtle differences in the spoofing faces, and thus the embedding may not be discriminative. In addition, those features may not be robust to variations such as illumination and pose. To overcome some of these difficulties, researchers tackle the problem in different domains, such as HSV and YCbCr color space [24, 25], temporal domain [26,27,28,29] and Fourier spectrum [30].

Heading into the deep learning era, researchers aim to build deep models for a higher accuracy. Most of the CNN works treat face anti-spoofing as a binary classification problem and apply the softmax loss function. Compared to hand-crafted features, such models [29] achieve remarkable improvements in the intra-testing (i.e., train and test within the same dataset). However, during the cross-testing (i.e., train and test in different datasets), these CNN models exhibit a poor generalization ability due to the overfitting to training data. Atoum et al. [31] and Liu et al. [18] observe the overfitting issue of the softmax loss, and both propose novel auxiliary-driven loss functions instead of softmax to supervise the CNN. These works bring us the insight that we need to involve the domain knowledge to solve face anti-spoofing.

To the best of our knowledge, all the previous methods are discriminative models. There are only a few papers [2, 22] trying to categorize the types and properties of the spoof noise pattern, such as color distortion and moiré pattern. In this work, we analyze the properties of spoof noise and design a GAN-fashion generative model [32] to estimate the spoof noise pattern and peel it off the spoof image. We believe by decomposing the spoof image, CNN can analyze the spoof noise more directly and effectively, and gain more knowledge in tackling face anti-spoofing.

De-X Problems. De-X problems, such as de-noising, de-blurring, de-mosaicing, super-resolution and inpainting [13,14,15,16,17, 33,34,35,36,37,38], are classic low-level vision problems that remove the degradation effect or artifacts from the image. General de-noising works assume additive Gaussian noise and researchers propose non-local filters [33] or CNNs [13, 34] to exploit the inherent similarity within the images. For de-mosaicing and super-resolution, many models, such as ResNet in [14, 15] and joint models in [16, 17, 35], are learnt from the given pairs of low-quality input and high-quality ground truth. In image inpainting, users mark the area to inpaint in a mask map and apply the filling based on the existing patch texture and the overall view structure in the unmasked region [36, 37, 39].

One advantage of existing de-X problems is that most of the image degradation can be easily synthesized. This brings two benefits: (1) it provides the model training with the input degraded samples and golden ground-truth original images for supervision; (2) it is easy to synthesize a large amount of data for training and evaluation. On the contrary, degradation due to spoofing is versatile, complex, and subtle. It consists of a 2-stage degradation: one from the spoof medium (e.g., paper and digital screen), and the other from the interaction of the spoof medium with the imaging environment. Each stage includes a large number of variations, such as medium type, illumination, non-rigid deformation and sensor types. The combination of these variations makes the overall degradation vary greatly. As a result, it is almost impossible to mimic realistic spoofing by synthesizing a degradation, which is a distinct challenge of face de-spoofing compared to the conventional de-X problems.

Without the ground truth of the degraded image, face de-spoofing becomes a very challenging problem. In this work, we propose an encoder-decoder architecture with novel loss functions and supervisions to solve the de-spoofing problem.

Fig. 2. The illustration of the spoof noise pattern. Left: live face and its local regions. Right: Two registered spoofing faces from print attack and replay attack. For each sample, we show the local region of the face, intensity difference to the live image, magnitude of 2D FFT, and the local peaks in the frequency domain that indicate the spoof noise pattern. Best viewed electronically.

3 Face De-spoofing

In this section, we start with a case study of spoof noise pattern, which demonstrates a few important characteristics of the noise. This study motivates us to design the novel CNN architecture that will be presented in Sect. 3.2.

3.1 A Case Study of Spoof Noise Pattern

The core task of face de-spoofing is to estimate the spoofing-relevant noise pattern in the given face image. Despite the strength of using a CNN model, we are still facing the challenge of learning without the ground truth of the noise pattern. To address this challenge, we would like to first carry out a case study on the noise pattern with the objectives of answering the following questions: (1) is Eq. 1 a good modeling of the spoof noise? (2) what characteristics does the spoof noise hold?

Let us denote a genuine face as \(\mathbf {\hat{I}}\). By using printed paper or video replay on digital devices, the attacker can manufacture a spoof image \(\mathbf {I}\) from \(\mathbf {\hat{I}}\). Assuming no non-rigid deformation between \(\mathbf {I}\) and \(\mathbf {\hat{I}}\), we summarize the degradation from \(\mathbf {\hat{I}}\) to \(\mathbf {I}\) as the following steps:

Fig. 3. The proposed network architecture.

  1. Color distortion: Color distortion is due to a narrower color gamut of the spoof medium (e.g. LCD screen or toner cartridge). It is a projection from the original color space to a smaller color subspace. This noise is dependent on the color intensity of the subject, and hence it may apply as a degradation matrix to the genuine face \(\mathbf {\hat{I}}\) during the degradation.

  2. Display artifacts: Spoof mediums often use several nearby dots/sensors to approximate one pixel’s color, and they may also display the face at a size different from the original. The approximation and down-sampling procedures cause a certain degree of high-frequency information loss, blurring, and pixel perturbation. This noise may also apply as a degradation matrix due to its subject dependence.

  3. Presenting artifacts: When presenting the spoof medium to the camera, the medium interacts with the environment and brings several artifacts, including reflection and transparency of the surface. This noise may apply as an additive noise.

  4. Imaging artifacts: Imaging lattice patterns such as screen pixels on the camera’s sensor array (e.g. CMOS and CCD) would cause interference of light. This effect leads to aliasing and creates a moiré pattern, which appears in replay attacks and in some print attacks with strong lattice artifacts. This noise may apply as an additive noise.

These four steps show that the spoof image \(\mathbf {I}\) can be generated via applying degradation matrices and additive noises to \(\mathbf {\hat{I}}\), which is basically conveyed by Eq. 1. As expressed by Eq. 1, the spoof image is the summation of the live image and image-dependent noise. To further validate this model, we show an example in Fig. 2. Given a high-quality live image, we carefully produce two spoof images via print and replay attack, with minimal non-rigid deformation. After each spoof image is registered with the live image, the live image becomes the ground truth live image if we would perform de-spoofing on the spoof image. This allows us to compute the difference between the live and spoof images, which is the noise pattern \(N(\mathbf {\hat{I}})\). To analyze its frequency properties, we perform FFT on the spoof noise and show the 2D shifted magnitude response.
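The frequency analysis described above (difference of registered images, 2D FFT, shifted magnitude) can be sketched as follows. This is only an illustration under stated assumptions: the images are single-channel, the "live" image is random, and the spoof is simulated by adding a synthetic stripe pattern, so none of the data reflects the actual case study.

```python
import numpy as np

def noise_spectrum(live, spoof):
    """Return the spoof noise (spoof - live) and its centered 2D FFT magnitude."""
    noise = spoof - live
    magnitude = np.abs(np.fft.fftshift(np.fft.fft2(noise)))
    return noise, magnitude

# Toy data: a random "live" image plus a synthetic repetitive stripe pattern.
size = 64
rng = np.random.default_rng(0)
live = rng.random((size, size))
cols = np.arange(size)
pattern = 0.1 * np.sin(2 * np.pi * 16 * cols / size)   # 16 cycles across the width
spoof = live + pattern[np.newaxis, :]

noise, magnitude = noise_spectrum(live, spoof)
# After fftshift the zero frequency sits at index size // 2 on each axis.
peak_row, peak_col = np.unravel_index(np.argmax(magnitude), magnitude.shape)
```

In this toy setting the strongest response lies 16 bins away from the center along the horizontal frequency axis, which mirrors how a repetitive pattern shows up as isolated peaks away from the low-frequency region.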

In both spoof cases, we observe a high response in the low-frequency domain, which is related to color distortion and display artifacts. In print attack, repetitive noise in Step 3 leads to a few “peak” responses in the high-frequency domain. Similarly, in the replay attack, visible moiré pattern reflects as several spurs in the low-frequency domain, and the lattice pattern that causes the moiré pattern is represented as peaks in the high-frequency domain. Moreover, spoof patterns are uniformly distributed in the image domain due to the uniform texture of the spoof mediums. And the high response of the repetitive pattern in the frequency domain exactly demonstrates that it appears widely in the image and thus can be viewed as ubiquitous.

Under this ideal registration, the comparison between live and spoof images provides us a basic understanding of the spoof noise pattern. It is a type of texture that is repetitive and ubiquitous. Based on this modeling and these noise characteristics, we design a network to estimate the noise without access to the precisely registered ground truth live image that this case study has.

Table 1. The network structure of DS Net, DQ Net and VQ Net. Each convolutional layer is followed by an exponential linear unit (ELU) and batch normalization layer. The input image size for DS Net is \(256\times 256\times 6\). All the convolutional filters are \(3\times 3\). 0\1 Map Net is the bottom-left part, i.e., conv1-10, conv1-11, and conv1-12.

3.2 De-Spoof Network

Network Overview: Figure 3 shows the overall network architecture of our proposed method. It consists of three parts: De-Spoof Net (DS Net), Discriminative Quality Net (DQ Net), and Visual Quality Net (VQ Net). DS Net is designed to estimate the spoof noise pattern \(\mathbf {N}\) (i.e. the output of \(N(\mathbf {\hat{I}})\)) from the input image \(\mathbf {I}\). The live face \(\mathbf {\hat{I}}\) then can be reconstructed by subtracting the estimated noise \(\mathbf {N}\) from the input image \(\mathbf {I}\). This reconstructed image \(\mathbf {\hat{I}}\) should be both indeed live and visually appealing, which will be safeguarded by the DQ Net and VQ Net respectively. All networks can be trained in an end-to-end fashion. The details of the network structure are shown in Table 1.

As the core part, DS Net is designed as an encoder-decoder structure with the input \(\mathbf {I} \in \mathbb {R}^{256 \times 256 \times 6}\). Here the 6 channels are RGB \(+\) HSV color space, following the suggestion in [31]. In the encoder part, we first stack 10 convolutional layers with 3 pooling layers. Inspired by the residual network [40], we then add a short-cut connection: concatenating the responses from pool1-1 and pool1-2 with pool1-3, and then sending them to conv1-10. This operation helps us to pass the feature responses from different scales to the later stages and eases the training procedure. After 3 more convolutional layers, the responses \(\mathbf {F} \in \mathbb {R}^{32 \times 32 \times 32}\) from conv1-12 are the feature representation of the spoof noise patterns. The higher the magnitudes of the responses, the more perceptible the spoofing in the input.

Out from the encoder, the feature representation \(\mathbf {F}\) is fed into the decoder to reconstruct the spoof noise pattern. \(\mathbf {F}\) is directly resized to the input spatial size \(256\times 256\). It introduces no extra grid artifacts, which exist in the alternative approach of using a deconvolutional layer. Then, we pass the resized \(\mathbf {F}\) to several convolutional layers to reconstruct the noise pattern \(\mathbf {N}\). According to Eq. 1, the reconstructed live image can be retrieved by: \(\mathbf {\hat{x}} = \mathbf {x} - N(\mathbf {\hat{x}}) = \mathbf {I} - \mathbf {N}\).
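The two decoder-side operations described above, resizing \(\mathbf {F}\) to the input resolution and subtracting the estimated noise, can be sketched as below. The interpolation method is not specified in the text, so nearest-neighbor copying is assumed here; the arrays are placeholders and the convolutional layers between the resize and the noise output are omitted.

```python
import numpy as np

def resize_nearest(features, factor):
    """Upsample an (H, W, C) array by an integer factor using nearest-neighbor copying."""
    return np.repeat(np.repeat(features, factor, axis=0), factor, axis=1)

def reconstruct_live(image, noise):
    """Eq. 1 rearranged: live estimate = input image - estimated noise."""
    return image - noise

F = np.zeros((32, 32, 32), dtype=np.float32)    # placeholder for the encoder output
F_up = resize_nearest(F, 8)                     # 32 * 8 = 256 in each spatial dimension

I = np.ones((256, 256, 6), dtype=np.float32)    # placeholder RGB + HSV input
N = np.full_like(I, 0.25)                       # placeholder for the decoder's noise output
I_hat = reconstruct_live(I, N)
```

A plain resize has no learned stride pattern, which is consistent with the stated motivation of avoiding the grid artifacts of a deconvolutional layer.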

Each convolutional layer in the DS Net is equipped with exponential linear unit (ELU) and batch normalization layers. To supervise the training of DS Net, we design multiple loss functions: losses from DQ Net and VQ Net for the image quality, 0\1 map loss, and noise property losses. We introduce these loss functions in Sects. 3.3 and 3.4.

3.3 DQ Net and VQ Net

While we do not have the ground truth to supervise the estimated spoof noise pattern, it is possible to supervise the reconstructed live image, which implicitly guides the noise estimation. To estimate a good-quality spoof noise, the reconstructed live image should be quantitatively and visually recognized as live. For this purpose, we propose two networks in our architecture: Discriminative Quality Net (DQ Net) and Visual Quality Net (VQ Net). The VQ Net aims to guarantee the reconstructed live face is photorealistic. The DQ Net is proposed to guarantee the reconstructed face would indeed be considered as live, based on the judgment of a pre-trained face anti-spoofing network. The details of our proposed architecture are shown in Table 1.

Discriminative Quality Net: We follow the state-of-the-art network architecture of face anti-spoofing [18] to build our DQ Net. It is a fully convolutional network with three filter blocks and three additional convolutional layers. Each block consists of three convolutional layers and one pooling layer. The feature maps after each pooling layer are resized and stacked to feed into the following convolutional layers. Finally, DQ Net is supervised to estimate the pseudo-depth \(\mathbf {D}\) of an input face, where \(\mathbf {D}\) for the live face is the depth of the face shape and \(\mathbf {D}\) for the spoof face is a zero map as a flat surface. We adopt the 3D face alignment algorithm in [41] to estimate the face shape and render the depth via Z-Buffering.

Similar to the previous work [42], DQ Net is pre-trained to obtain the semantic knowledge of live faces and spoofing faces. And during the training of DS Net, the parameters of DQ Net are fixed. Since the reconstructed images \(\mathbf {\hat{I}}\) are live images, the corresponding pseudo-depth \(\mathbf {D}\) should be the depth of the face shape. The backpropagation of the error from DQ Net guides the DS Net to estimate the spoof noise pattern which should be subtracted from the input image,

$$\begin{aligned} J_{DQ} = \left\| \text {CNN}_{DQ}(\mathbf {\hat{I}} ) - \mathbf D \right\| _1, \end{aligned}$$
(2)

where \(\text {CNN}_{DQ}\) is a fixed network and \(\mathbf{D }\) is the depth of the face shape.

Visual Quality Net: We deploy a GAN to verify the visual quality of the estimated live image \(\mathbf {\hat{I}}\). Given both the real live image \(\mathbf {I_{live}}\) and the synthesized live image \(\mathbf {\hat{I}}\), VQ Net is trained to distinguish between \(\mathbf {I_{live}}\) and \(\mathbf {\hat{I}}\). Meanwhile, DS Net tries to reconstruct photorealistic live images that the VQ Net would classify as non-synthetic (or real) images. The VQ Net consists of 6 convolutional layers and a fully connected layer with a 2D vector as the output, which represents the probability of the input image being real or synthetic. In each training iteration, the VQ Net is evaluated with two batches. In the first one, the DS Net is fixed and we update the VQ Net,

$$\begin{aligned} J_{VQ_{train}} = -\mathbb {E}_{\mathbf {I} \in \mathcal {R}} \ \text {log}(\text {CNN}_{VQ}(\mathbf {I}))-\mathbb {E}_{\mathbf {I} \in \mathcal {S}} \ \text {log}(1-\text {CNN}_{VQ}(\text {CNN}_{DS}(\mathbf {I}))), \end{aligned}$$
(3)

where \(\mathcal {R}\) and \(\mathcal {S}\) are the sets of real and synthetic images respectively. In the second batch, the VQ Net is fixed and the DS Net is updated,

$$\begin{aligned} J_{VQ_{test}} = -\mathbb {E}_{\mathbf {I} \in \mathcal {S}} \ \text {log}(\text {CNN}_{VQ}(\text {CNN}_{DS}(\mathbf {I}))). \end{aligned}$$
(4)
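Equations 3 and 4 can be sketched on plain lists of discriminator outputs as follows. The sketch assumes that `p_real` and `p_synth` hold the VQ Net's probability of "real" for real images and for DS Net reconstructions respectively; the function names and the small `eps` for numerical safety are illustrative additions, not part of the original formulation.

```python
import math

def vq_train_loss(p_real, p_synth, eps=1e-8):
    """Eq. 3: loss for updating VQ Net while DS Net is fixed."""
    real_term = sum(math.log(p + eps) for p in p_real) / len(p_real)
    synth_term = sum(math.log(1.0 - p + eps) for p in p_synth) / len(p_synth)
    return -real_term - synth_term

def vq_test_loss(p_synth, eps=1e-8):
    """Eq. 4: loss for updating DS Net while VQ Net is fixed."""
    return -sum(math.log(p + eps) for p in p_synth) / len(p_synth)

# A discriminator that separates the two sets well has a lower Eq. 3 loss
# than one that outputs 0.5 everywhere.
good_discriminator = vq_train_loss([0.9, 0.8], [0.1, 0.2])
blind_discriminator = vq_train_loss([0.5, 0.5], [0.5, 0.5])
```

Eq. 4 moves in the opposite direction: it decreases as the VQ Net assigns a higher "real" probability to the reconstructions.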

3.4 Loss Functions

The main challenge for spoof modeling is the lack of the ground truth for the spoof noise pattern. Since we have concluded some properties about the spoof noise in Sect. 3.1, we can leverage them to design several novel loss functions to constrain the convergence space. First, we introduce the magnitude loss to enforce the spoof noise of the live image to be zero. Second, the zero\one map loss is used to enforce the ubiquitousness of the spoof noise. Third, we encourage the repetitiveness property of spoof noise via the repetitive loss. We describe the three loss functions as follows:

Magnitude Loss: The spoof noise pattern for the live images is zero. The magnitude loss can be utilized to impose the constraint for the estimated noise. Given the estimated noise \(\mathbf {N}\) and reconstructed live image \(\mathbf {\hat{I}}=\mathbf {I}-\mathbf {N}\) of an original live image \(\mathbf {I}\), we have,

$$\begin{aligned} J_m =\left\| \mathbf {N}\right\| _1. \end{aligned}$$
(5)

Zero\One Map Loss: To learn discriminative features in the encoder layers, we define a sub-task in the DS Net to estimate a zero-map for the live faces and a one-map for the spoof faces. Since this is a per-pixel supervision, it is also a constraint of ubiquitousness on the noise. Moreover, the 0\1 map enables the receptive field of each pixel to cover a local area, which helps to learn generalizable features for this problem. Formally, given the extracted features \(\mathbf {F}\) from an input face image \(\mathbf {I}\) in the encoder, we have,

$$\begin{aligned} J_z = \left\| \text {CNN}_{01map}(\mathbf {F}; \varTheta ) - \mathbf M \right\| _1, \end{aligned}$$
(6)

where \(\mathbf {M} = \mathbf {0}^{32\times 32}\) (live) or \(\mathbf {M} = \mathbf {1}^{32\times 32}\) (spoof) is the zero\one map label.
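The magnitude loss (Eq. 5) and the zero\one map loss (Eq. 6) are both L1 norms and can be sketched together. This is a minimal illustration: the function names are hypothetical, `pred_map` stands for the output of the 0\1 Map Net, and whether the norm is summed or averaged over pixels is not stated in the text, so a plain sum is assumed.

```python
import numpy as np

def magnitude_loss(noise):
    """Eq. 5: L1 norm of the estimated noise, applied to live inputs."""
    return float(np.abs(noise).sum())

def zero_one_map_loss(pred_map, is_spoof):
    """Eq. 6: L1 distance to an all-ones (spoof) or all-zeros (live) label map."""
    label = np.ones_like(pred_map) if is_spoof else np.zeros_like(pred_map)
    return float(np.abs(pred_map - label).sum())

pred = np.ones((32, 32))                      # a map that predicts "spoof" everywhere
loss_if_spoof = zero_one_map_loss(pred, is_spoof=True)
loss_if_live = zero_one_map_loss(pred, is_spoof=False)
```

Because every one of the 32 × 32 positions is supervised, a prediction that is wrong everywhere accumulates the full per-pixel penalty.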

Repetitive Loss: Based on the previous discussion, we assume the spoof noise pattern to be repetitive, because it is generated from the repetitive spoof medium. To encourage the repetitiveness, we convert the estimated noise \(\mathbf {N}\) to the Fourier domain and compute the maximum value in the high-frequency band. The existence of a high peak is indicative of the repetitive pattern. We would like to maximize this peak for spoof images, but minimize it for live images, as expressed by the following loss function:

$$\begin{aligned} J_r = \begin{cases} -\max \left( H\left( \left| \mathcal {F}(\mathbf {N})\right| , k\right) \right) , & \mathbf {I} \text { is spoof}, \\ \max \left( H\left( \left| \mathcal {F}(\mathbf {N})\right| , k\right) \right) , & \mathbf {I} \text { is live}, \end{cases} \end{aligned}$$

where \(\mathcal {F}\) is the Fourier transform operator and H is an operator for masking the low-frequency domain of an image, i.e., setting a \(k\times k\) region in the center of the shifted 2D Fourier response to zero.
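The repetitive loss described above can be sketched with NumPy's FFT routines. The sketch assumes a single-channel noise map and uses the FFT magnitude; how multiple channels are handled is not specified in the text, and the function names and toy inputs are illustrative only.

```python
import numpy as np

def high_freq_peak(noise, k):
    """Largest FFT magnitude outside the central k x k low-frequency block."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(noise)))
    rows, cols = spectrum.shape
    r0, c0 = (rows - k) // 2, (cols - k) // 2
    spectrum[r0:r0 + k, c0:c0 + k] = 0.0       # operator H: zero the low frequencies
    return float(spectrum.max())

def repetitive_loss(noise, is_spoof, k):
    """Negative peak for spoof inputs (maximize it), positive peak for live inputs."""
    peak = high_freq_peak(noise, k)
    return -peak if is_spoof else peak

size = 64
cols = np.arange(size)
# Toy repetitive noise: vertical stripes with 16 cycles across the image width.
stripes = np.tile(np.sin(2 * np.pi * 16 * cols / size), (size, 1))
loss_spoof = repetitive_loss(stripes, is_spoof=True, k=16)
loss_live = repetitive_loss(np.zeros((size, size)), is_spoof=False, k=16)
```

Minimizing the loss therefore pushes spoof inputs toward a strong isolated high-frequency peak and live inputs toward a flat high-frequency spectrum.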

Finally, the total loss function in our training is the weighted summation of the aforementioned loss functions and the supervisions for the image qualities,

$$\begin{aligned} J_T = J_z + \lambda _1 J_m + \lambda _2 J_r + \lambda _3 J_{DQ} + \lambda _4 J_{VQ_{test}} , \end{aligned}$$
(7)

where \(\lambda _1, \lambda _2, \lambda _3, \lambda _4\) are the weights. During the training, we alternate between optimizing Eqs. 7 and 3.
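Eq. 7 is a plain weighted sum, sketched below. The default weights are the values of \(\lambda _1\) to \(\lambda _4\) reported in the parameter setting of Sect. 4.1; the function name and the scalar inputs are illustrative.

```python
def total_loss(j_z, j_m, j_r, j_dq, j_vq_test,
               weights=(3.0, 0.005, 0.1, 0.016)):
    """Eq. 7: J_T = J_z + l1*J_m + l2*J_r + l3*J_DQ + l4*J_VQtest."""
    l1, l2, l3, l4 = weights
    return j_z + l1 * j_m + l2 * j_r + l3 * j_dq + l4 * j_vq_test

# With every individual loss equal to 1, the total is 1 plus the sum of the weights.
example = total_loss(1.0, 1.0, 1.0, 1.0, 1.0)
```

In the alternating scheme described above, this total updates DS Net, while Eq. 3 updates VQ Net in the other step.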

4 Experimental Results

4.1 Experimental Setup

Databases. We evaluate our work on three face anti-spoofing databases, with print and replay attacks: Oulu-NPU [43], CASIA-MFSD [44] and Replay-Attack [45]. Oulu-NPU [43] is a high-resolution database that considers many real-world variations. Oulu-NPU also includes 4 testing protocols: Protocol 1 evaluates the illumination variation, Protocol 2 examines the influence of different spoof mediums, Protocol 3 inspects the effect of different camera devices and Protocol 4 contains all the challenges above, which is close to the scenario of cross testing. CASIA-MFSD [44] contains videos with resolution \(640\times 480\) and \(1280\times 720\). Replay-Attack [45] includes videos of \(320\times 240\). These two databases are often used for cross testing [2].

Parameter Setting. We implement our method in TensorFlow [46]. Models are trained with a batch size of 6 and a learning rate of \(3\mathrm {e}{-5}\). We set \(k=64\) in the repetitive loss and set \(\lambda _1\) to \(\lambda _4\) in Eq. 7 as 3, 0.005, 0.1 and 0.016, respectively. DQ Net is trained separately and remains fixed during the update of DS Net and VQ Net, but all sub-networks are trained with the same data of the respective protocol.

Evaluation Metrics. To compare with previous methods, we use Attack Presentation Classification Error Rate (APCER) [47], Bona Fide Presentation Classification Error Rate (BPCER) [47] and \(\textit{ACER}=(\textit{APCER}+\textit{BPCER})/2\) [47] for the intra testing on Oulu-NPU, and Half Total Error Rate (HTER) [48], which is half of the sum of FAR and FRR, for the cross testing between CASIA-MFSD and Replay-Attack.
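The metrics above can be computed from binary decisions as in the following sketch. It is simplified under stated assumptions: labels and predictions use 1 for attack and 0 for bona fide, all attacks are pooled into a single group when computing APCER, and the function names and example data are illustrative.

```python
def error_rates(labels, predictions):
    """Return (APCER, BPCER, ACER). Convention: 1 = attack, 0 = bona fide."""
    attack_preds = [p for l, p in zip(labels, predictions) if l == 1]
    bona_fide_preds = [p for l, p in zip(labels, predictions) if l == 0]
    # APCER: fraction of attacks accepted as bona fide.
    apcer = sum(1 for p in attack_preds if p == 0) / len(attack_preds)
    # BPCER: fraction of bona fide presentations rejected as attacks.
    bpcer = sum(1 for p in bona_fide_preds if p == 1) / len(bona_fide_preds)
    return apcer, bpcer, (apcer + bpcer) / 2

def hter(far, frr):
    """Half Total Error Rate: half of the sum of FAR and FRR."""
    return (far + frr) / 2

labels      = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 1, 1]
apcer, bpcer, acer = error_rates(labels, predictions)
```

In this toy example one of four attacks is missed and two of four bona fide samples are rejected, so APCER is 0.25, BPCER is 0.5, and ACER is 0.375.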

4.2 Ablation Study

Using Oulu-NPU Protocol 1, we perform three studies on the effect of score fusing, the importance of each loss function, and the influence of image resolution and blurriness.

Table 2. The accuracy of different outputs of the proposed architecture and their fusions.

Different Fusion Methods. In the proposed architecture, three outputs can be utilized for classification: the norms of the 0\1 map, the spoof noise pattern, or the depth map. Because of the discriminativeness enabled by our learning, we can simply use a rudimentary classifier such as the L-1 norm. Note that a more advanced classifier is applicable and would likely lead to higher performance. Table 2 shows the performance of each output and their fusion with maximum and average. It shows that the fusion of spoof noise and depth map achieves the best performance. However, adding the 0\1 map scores does not improve the accuracy since the map contains the same information as the spoof noise. Hence, for the rest of the experiments, we report performance from the average fusion of the spoof noise \(\mathbf {N}\) and the depth map \(\mathbf {\hat{D}}\), i.e., \(score =(\left\| \mathbf {N}\right\| _1+\left\| \mathbf {\hat{D}}\right\| _1)/2\).
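The average and maximum fusion of the two L-1 norm scores can be sketched as below. This follows the score formula literally; any normalization or re-orientation of the individual scores before fusion is not specified in the text and is therefore left out, and the function names are illustrative.

```python
import numpy as np

def l1_score(array):
    """L-1 norm of an output map, used as a rudimentary score."""
    return float(np.abs(array).sum())

def fuse(noise, depth, mode="average"):
    """Fuse the noise and depth scores by average or maximum."""
    scores = (l1_score(noise), l1_score(depth))
    if mode == "average":
        return sum(scores) / 2
    if mode == "maximum":
        return max(scores)
    raise ValueError(f"unknown fusion mode: {mode}")

noise = np.ones((2, 2))          # toy noise map with L-1 norm 4
depth = np.zeros((2, 2))         # toy depth map with L-1 norm 0
avg_score = fuse(noise, depth, "average")
max_score = fuse(noise, depth, "maximum")
```

A decision threshold on the fused score would still have to be chosen on development data, which this sketch does not cover.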

Advantage of Each Loss Function. We have three main loss functions in our proposed architecture. To show the effect of each loss function, we train a network with each loss excluded one by one. By disabling the magnitude loss, the 0\1 map loss and the repetitive loss, we obtain ACERs of 5.24, 2.34 and 1.50, respectively. To further validate the repetitive loss, we perform an experiment on high-resolution images by changing the network input to the cheek region of the original 1080P resolution. The ACER of the network with the repetitive loss is 2.92, whereas the network without it cannot converge.

Table 3. ACER of the proposed method with different image resolutions and blurriness. To create blurry images, we apply Gaussian filters with different kernel sizes to the input images.

Resolution and Blurriness. As shown in the ablation study of repetitive loss, the image quality is critical for achieving a high accuracy. The spoof noise pattern may not be detected in the low-resolution or motion-blurred images. The testing results on different image resolutions and blurriness are shown in Table 3. These results validate that the spoof noise pattern is less discriminative for the lower-resolution or blurry images, as the high-frequency part of the input images contains most of the spoof noise pattern.

4.3 Experimental Comparison

To show the performance of our proposed method, we present our accuracy in the intra testing of Oulu-NPU and the cross testing on CASIA-MFSD and Replay-Attack.

Table 4. The intra testing results on 4 protocols of Oulu-NPU.

Intra Testing. We compare our intra testing performance on all 4 protocols of Oulu-NPU. Table 4 shows the comparison of our method and the best 3 out of 18 previous methods [18, 49]. Our proposed method achieves promising results on all protocols. Specifically, we outperform the previous state of the art by a large margin in Protocol 4, which is the most challenging protocol, and similar to cross testing.

Cross Testing. We perform cross testing between CASIA-MFSD [44] and Replay-Attack [45]. As shown in Table 5, our method achieves competitive performance on the cross testing from CASIA-MFSD to Replay-Attack. However, we achieve a worse HTER compared to the best performing methods from Replay-Attack to CASIA-MFSD. We hypothesize the reason is that images of CASIA-MFSD are of much higher resolution than those of Replay-Attack. This suggests that the model trained with higher-resolution data can generalize well to lower-resolution testing data, but not the other way around. This is one limitation of the proposed method and is worthy of further research.

Table 5. The HTER of different methods for the cross testing between the CASIA-MFSD and the Replay-Attack databases. We mark the top-2 performances in bold.

4.4 Qualitative Experiments

Spoof Medium Classification. The estimated spoof noise pattern of the test images can be used to cluster them into groups, where each group represents one spoof medium. To visualize the results, we use t-SNE [52] for dimension reduction. t-SNE projects the noise \(\mathbf {N} \in \mathbb {R}^{256 \times 256 \times 6}\) to 2 dimensions by minimizing the KL divergence between the pairwise-similarity distributions in the original and the projected spaces. Figure 4 shows the distributions of the testing videos on Oulu-NPU Protocol 1. The left image shows that the noise of live videos is well clustered, whereas the noise of spoof videos is subject dependent, which is consistent with our noise assumption. To obtain a better visualization, we apply a high-pass filter to extract the high-frequency information of the noise pattern before dimension reduction. The right image shows that the high-frequency part carries more subject-independent information about the spoof type and can be utilized for classification of the spoof medium.
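The high-pass step described above can be sketched as follows. The paper does not specify which filter is used, so this example assumes a simple one: subtracting a box-blurred (low-pass) copy of the noise map from itself. The function names `box_blur`, `high_pass`, and `to_feature`, as well as the window size `k`, are hypothetical.

```python
import numpy as np


def box_blur(x, k=5):
    """Mean filter with an odd k x k window over the two spatial axes.

    x has shape (H, W, C); borders are handled by edge padding.
    """
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    h, w = x.shape[:2]
    # Sum the k*k shifted copies of the padded map, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def high_pass(noise, k=5):
    """Keep the high-frequency band by removing a low-pass copy (assumed filter)."""
    noise = np.asarray(noise, dtype=np.float64)
    return noise - box_blur(noise, k)


def to_feature(noise, k=5):
    """Flatten the filtered noise map into one feature vector per sample."""
    return high_pass(noise, k).ravel()


# A constant map has no high-frequency content; a checkerboard does.
flat = np.ones((8, 8, 6))
yy, xx = np.mgrid[0:8, 0:8]
checker = np.repeat(((yy + xx) % 2)[:, :, None], 6, axis=2).astype(np.float64)
```

The resulting feature vectors could then be passed to any t-SNE implementation for the 2D projection; which implementation and which hyperparameters were used for Fig. 4 is not stated in the text.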

To further show the discriminative power of the estimated spoof noise, we divide the testing set of Protocol 1 into training and testing parts and train an SVM classifier for spoof medium classification. We train two models, a three-class classifier (live, print and display) and a five-class classifier (live, print1, print2, display1 and display2), which achieve classification accuracies of \(82.0\%\) and \(54.3\%\) respectively, as shown in Table 6. Most classification errors of the five-class model are within the same spoof medium. This result is noteworthy given that no label of the spoof medium type is provided during the learning of the spoof noise model. Yet the estimated noise carries appreciable information regarding the medium type, which explains the reasonable results of spoof medium classification. This suggests that the estimated noise contains spoof medium information and that we are indeed moving toward estimating the faithful spoof noise residing in each spoof image. In the future, if the performance of spoof medium classification improves, this could bring new impact to applications such as forensics.
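The observation that most five-class errors stay within the same spoof medium can be checked from the predictions with a few lines of code. The sketch below is an illustration on made-up labels, not the paper's data or evaluation code. The mapping `FIVE_TO_THREE` and the helper names are hypothetical, and collapsing five-class predictions is not how the three-class model above was obtained (it was trained separately).

```python
# Hypothetical mapping from the five fine-grained labels to the three mediums.
FIVE_TO_THREE = {
    "live": "live",
    "print1": "print",
    "print2": "print",
    "display1": "display",
    "display2": "display",
}


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    pairs = list(zip(true_labels, predicted_labels))
    return sum(t == p for t, p in pairs) / len(pairs)


def collapse(labels):
    """Map five-class labels to their spoof medium."""
    return [FIVE_TO_THREE[label] for label in labels]


def within_medium_error_fraction(true_labels, predicted_labels):
    """Among misclassified samples, the fraction confused within one medium."""
    errors = [(t, p) for t, p in zip(true_labels, predicted_labels) if t != p]
    if not errors:
        return 0.0
    same = sum(FIVE_TO_THREE[t] == FIVE_TO_THREE[p] for t, p in errors)
    return same / len(errors)


# Toy labels for illustration only.
y_true = ["live", "print1", "print2", "display1", "display2", "print1"]
y_pred = ["live", "print2", "print2", "display2", "display2", "display1"]

five_class_acc = accuracy(y_true, y_pred)
medium_acc = accuracy(collapse(y_true), collapse(y_pred))
within_fraction = within_medium_error_fraction(y_true, y_pred)
```

In this toy example, three of the six predictions are wrong at the five-class level, and two of those three errors confuse devices of the same medium, so the accuracy at the medium level is higher than the five-class accuracy.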

Fig. 4. The 2D visualization of the estimated spoof noise for test videos on Oulu-NPU Protocol 1. Left: the estimated noise. Right: the high-frequency band of the estimated noise. Color code used: black=live, green=printer1, blue=printer2, magenta=display1, red=display2. (Color figure online)

Table 6. The confusion matrices of spoof medium classification based on the spoof noise pattern.

Fig. 5. The visualization of input images, estimated spoof noises and estimated live images for test videos of Protocol 1 of the Oulu-NPU database. The first four columns in the first row are paper attacks and the next four are replay attacks. For a better visualization, we magnify the noise by 5 times and add an offset of 128, so that both positive and negative noise are visible.

Fig. 6. The failure cases for converting the spoof images to the live ones.

Successful and Failure Cases. We show several success and failure cases in Figs. 5 and 6. Figure 5 shows that the estimated spoof noises are similar within each medium but differ across mediums. We suspect that the yellowish color in the first four columns is due to the stronger color distortion in the paper attack. The fifth row shows that the estimated noise for the live images is nearly zero. As for the failure cases, we observe only a few false positive cases. These failures are caused by inaccurate noise estimation, which motivates further research.
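The noise rendering used in Fig. 5 (magnify by 5, then add 128) can be sketched as below. This is a minimal illustration that assumes the noise is expressed on the same 0 to 255 intensity scale as the input image and that out-of-range values are clipped. The function name `noise_to_display` and both assumptions are not taken from the paper.

```python
import numpy as np


def noise_to_display(noise, gain=5.0, offset=128.0):
    """Map signed noise to an 8-bit image: zero noise becomes mid-gray (128).

    Assumes `noise` is on a 0-255 intensity scale; values that leave the
    displayable range after scaling are clipped (an assumed choice).
    """
    scaled = np.asarray(noise, dtype=np.float64) * gain + offset
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


# Negative, zero, positive, and saturating noise values.
example = noise_to_display(np.array([[-10.0, 0.0, 10.0, 100.0]]))
```

With this mapping, a near-zero noise estimate for a live image appears as a uniform gray patch, while positive and negative deviations appear as brighter and darker regions.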

5 Conclusions

This paper introduces a new perspective for solving face anti-spoofing by inversely decomposing a spoof face into the live face and the spoof noise pattern. A novel CNN architecture with multiple appropriate supervisions is proposed. We design loss functions to encourage the noise pattern of the spoof images to be ubiquitous and repetitive, while the noise of the live images should be zero. We visualize the spoof noise pattern, which helps provide a deeper understanding of the noise added by each spoof medium. We evaluate the proposed method on multiple widely-used face anti-spoofing databases.