Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing

Recently, vision transformer (ViT) based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, no prior work explores the fundamental natures (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of vanilla ViT for multimodal FAS. In this paper, we investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth. First, in terms of the ViT inputs, we find that leveraging local feature descriptors benefits the ViT on the IR modality but not the RGB or Depth modalities. Second, observing the inefficiency of directly finetuning the whole or partial ViT, we design an adaptive multimodal adapter (AMA), which can efficiently aggregate local multimodal features while freezing the majority of ViT parameters. Finally, considering the task (FAS vs. generic object classification) and modality (multimodal vs. unimodal) gaps, ImageNet pre-trained models might be sub-optimal for the multimodal FAS task. To bridge these gaps, we propose the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) for multimodal FAS self-supervised pre-training without costly annotated labels. Compared with previous modality-symmetric autoencoders, the proposed M$^{2}$A$^{2}$E is able to learn more intrinsic task-aware representations and is compatible with modality-agnostic (e.g., unimodal, bimodal, and trimodal) downstream settings. Extensive experiments with both unimodal (RGB, Depth, IR) and multimodal (RGB+Depth, RGB+IR, Depth+IR, RGB+Depth+IR) settings conducted on multimodal FAS benchmarks demonstrate the superior performance of the proposed methods. We hope these findings and solutions can facilitate future research on ViT-based multimodal FAS.


Introduction
Face recognition technology has been widely used in many intelligent systems due to its convenience and remarkable accuracy. However, face recognition systems are still vulnerable to presentation attacks (PAs) ranging from print and replay to 3D-mask attacks. Therefore, both academia and industry have recognized the critical role of face anti-spoofing (FAS) in securing face recognition systems.
In the past decade, plenty of handcrafted-feature-based [6,7,23,37] and deep-learning-based [2,12,13,31,38,45] methods have been proposed for unimodal FAS. Despite satisfactory performance on seen attacks and environments, unimodal methods generalize poorly to emerging novel attacks and unseen deployment conditions. Thanks to advanced sensors with various modalities (e.g., RGB, Infrared (IR), Depth, Thermal) [17], multimodal methods facilitate FAS applications in high-security scenarios with low false acceptance errors (e.g., face payment and vault entrance guards).
Recently, due to their strong long-range and cross-modal representation capacity, vision transformer (ViT) [11] based methods [15,26] have been proposed to improve the robustness of FAS systems. However, these methods focus on directly finetuning ViTs [15] or modifying ViTs with complex and powerful modules [26], which cannot provide enough insight into the fundamental natures (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of ViT in multimodal FAS. Despite mature exploration and findings [3,18,44] of ViT in other computer vision communities (e.g., generic object classification [8]), this knowledge might not fully transfer to multimodal FAS due to the task and modality gaps.
Compared with CNNs, ViT usually aggregates coarse intra-patch information at a very early stage and then propagates inter-patch global attentional features. In other words, it neglects the local detailed clues of each modality. According to prior evidence from MM-CDCN [48], local fine-grained features from multiple levels benefit live/spoof clue representation in convolutional neural networks (CNNs) for different modalities. Whether local descriptors/features can improve ViT-based multimodal FAS systems is worth exploring.
Compared with CNNs, ViTs usually have far more parameters to train, and thus easily overfit on the FAS task with limited data amount and diversity. Existing works show that directly finetuning the last classification head [15] or training extra lightweight adapters [20] can achieve better performance than full finetuning. However, all these observations are based on unimodal RGB inputs; it is unclear how different ViT-based transfer learning techniques perform in 1) other unimodal scenarios (IR or Depth modality); and 2) multimodal scenarios (e.g., RGB+IR+Depth). Moreover, more efficient transfer learning modules for ViT-based multimodal FAS should be designed.
Existing multimodal FAS works usually finetune ImageNet pre-trained models, which might be sub-optimal due to the huge task (FAS vs. generic object classification) and modality (multimodal vs. unimodal) gaps. Meanwhile, considering the costly collection of large-scale annotated live/spoof data, self-supervised pre-training without labels [34] is promising for model initialization in multimodal FAS. Although a few self-supervised pre-training methods (e.g., masked image modeling (MIM) [3,9] and contrastive learning [1]) have been developed for multimodal (e.g., vision-language) applications, there are still no self-supervised pre-trained models specifically for multimodal FAS. Investigating the discrimination and generalization capacity of pre-trained models and designing advanced self-supervision strategies are crucial for ViT-based multimodal FAS.
Motivated by the discussions above, in this paper we rethink ViT-based multimodal FAS from three aspects, i.e., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning. Besides the elaborate investigations, we also provide corresponding elegant solutions to 1) establish powerful inputs with local descriptors [5,10] for the IR modality; 2) efficiently finetune multimodal ViTs via adaptive multimodal adapters; and 3) pre-train generalized multimodal models via a modality-asymmetric masked autoencoder. Our contributions include:
• We are the first to investigate three key factors (i.e., inputs, pre-training, and finetuning) for ViT-based multimodal FAS. We find that 1) leveraging local feature descriptors benefits the ViT on the IR modality; 2) partial finetuning or adapters can achieve reasonable performance for ViT-based multimodal FAS but are still far from satisfactory; and 3) masked autoencoder [3,18] pre-training cannot provide better finetuning performance than ImageNet pre-trained models.
• We design the adaptive multimodal adapter (AMA) for ViT-based multimodal FAS, which can efficiently aggregate local multimodal features while freezing the majority of ViT parameters.
• We propose the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) for multimodal FAS self-supervised pre-training. Compared with modality-symmetric autoencoders [3,18], the proposed M$^{2}$A$^{2}$E is able to learn more intrinsic task-aware representations and is compatible with modality-agnostic downstream settings. To the best of our knowledge, this is the first attempt to design an MIM framework for generalized multimodal FAS.
• Our proposed methods achieve state-of-the-art performance under most modality settings in both intra- and cross-dataset testing.

Related Work
Multimodal face anti-spoofing. With multimodal inputs (e.g., RGB, IR, Depth, and Thermal), a few multimodal FAS works consider input-level [14,30,35] and decision-level [54] fusion. Besides, mainstream FAS methods extract complementary multimodal features using feature-level fusion [25,26,28,41,48,56] strategies. As there is redundancy across multimodal features, direct feature concatenation [48] easily results in high-dimensional features and overfitting. To alleviate this issue, Zhang et al. [55,56] propose a feature re-weighting mechanism to select informative channel features and discard redundant ones among the RGB, IR, and Depth modalities. Shen et al. [39] design a Modal Feature Erasing operation that randomly drops partial-modal features to prevent modality-aware overfitting. George and Marcel [16] present a cross-modal focal loss to modulate the loss contribution of each modality, which helps the model learn complementary information among modalities.
Transformer for vision tasks. The transformer was proposed in [40] to model sequential data in the field of NLP. ViT [11] was then proposed by feeding a transformer with sequences of image patches for image classification. Given the data-hungry nature of ViT, directly training ViTs from scratch results in severe overfitting. On the one hand, fast transferring (e.g., adapter [8,19,22] and prompt [57] tuning) while fixing most of the pre-trained model's parameters is usually efficient for downstream tasks. On the other hand, self-supervised masked image modeling (MIM) methods (e.g., BEiT [4] and MAE [3,18]) enable excellent representation learning, which improves finetuning performance on downstream tasks. Meanwhile, a few works introduce vision transformers for FAS [15,26,33,42,43,47]. On the one hand, ViT is adopted in the spatial domain [15,33,42] to explore live/spoof relations among local patches. On the other hand, global temporal abnormity [43] or physiological periodicity [47] features are extracted by applying ViT in the temporal domain. Recently, Liu and Liang [26] develop modality-agnostic transformer blocks to supplement liveness features for multimodal FAS. Despite convincing performance via a modified ViT with complex customized modal-disentangled and cross-modal attention modules [26], there are still no works to explore the fundamental natures (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of vanilla ViT for multimodal FAS.

Methodology
To benefit the exploration of the fundamental natures of ViT for multimodal FAS, here we adopt the simple, elegant, and unified ViT framework as the baseline. As illustrated in the left part (without 'AMA') of Fig. 1, the vanilla ViT consists of a patch tokenizer E_patch via linear projection, N transformer blocks E_trans^i (i = 1, ..., N), and a classification head E_head. The unimodal (X_RGB, X_IR, X_Depth) or multimodal (X_RGB+IR, X_RGB+Depth, X_IR+Depth, X_RGB+IR+Depth) inputs are passed through E_patch to generate the visual tokens T_Vis, which are concatenated with a learnable class token T_Cls and added with position embeddings. All patch tokens T_All = [T_Vis, T_Cls] are then forwarded through E_trans. Finally, T_Cls is sent to E_head for binary live/spoof classification.
We will first briefly introduce different local descriptor based inputs in Sec. 3.1, then introduce efficient ViT finetuning with AMA in Sec. 3.2, and at last present the generalized multimodal pre-training via M$^{2}$A$^{2}$E in Sec. 3.3.

Local Descriptors for Multimodal ViT
Besides the raw multimodal inputs, we consider three local descriptors and their compositions for the multimodal ViT. The motivation is that the vanilla ViT with raw inputs is able to model rich cross-patch semantic contexts but is sensitive to illumination and neglects local fine-grained spoof clues. Explicitly leveraging local descriptors as inputs might help the multimodal ViT mine more discriminative fine-grained spoof clues [46,48,49,51] as well as illumination-robust live/spoof features [25].
Local binary pattern (LBP). LBP [36] computes a binary pattern by thresholding the differences between a central pixel and its neighborhood pixels. Fine-grained textures and illumination invariance make LBP robust for generalized FAS [24]. For a center pixel I_c and neighboring pixels I_i (i = 1, 2, ..., p), LBP can be formalized as

LBP_p = Σ_{i=1}^{p} s(I_i − I_c) · 2^{i−1},  where s(x) = 1 if x ≥ 0, and s(x) = 0 otherwise.

Typical LBP maps are shown in the second column of Fig. 2.
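As a concrete illustration, the thresholding-and-packing step above can be sketched in a few lines of NumPy for the standard 8-neighbor case (the clockwise bit ordering here is an arbitrary choice, not necessarily the one used in [36]):

```python
import numpy as np

def lbp_8neighbor(img):
    """Basic 8-neighbor LBP: threshold each neighbor against the
    center pixel and pack the 8 results into an 8-bit code."""
    img = np.asarray(img, dtype=np.int32)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = img[1:-1, 1:-1]
    # neighbor offsets, clockwise from top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out |= ((neighbor >= center).astype(np.uint8) << bit)
    return out
```

On a uniform region every neighbor satisfies s(I_i − I_c) = 1, so the code is 255; an isolated bright center yields 0.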

Histograms of oriented gradients (HOG). HOG [10] describes the distribution of gradient orientations or edge directions within a local subregion. It is implemented by first computing the magnitude and orientation of the gradient at each pixel; the gradients within each small local subregion are then accumulated into orientation histogram vectors of several bins, voted by gradient magnitudes. Due to its partial invariance to geometric and photometric changes, HOG features might be robust for illumination-sensitive modalities like RGB and IR. Visualization results are shown in the third column of Fig. 2.
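The per-cell voting step can be sketched as follows; this minimal NumPy version covers a single cell with unsigned orientations (block normalization, which full HOG also performs, is omitted for brevity):

```python
import numpy as np

def hog_cell_histogram(cell, n_bins=9):
    """Orientation histogram of one HOG cell: unsigned gradient
    orientations (0-180 degrees) voted into n_bins, weighted by
    gradient magnitude."""
    cell = np.asarray(cell, dtype=np.float64)
    gy, gx = np.gradient(cell)               # gradients along rows/cols
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    bin_width = 180.0 / n_bins
    idx = np.minimum((ang // bin_width).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, idx.ravel(), mag.ravel())      # magnitude-weighted votes
    return hist
```

For a purely horizontal intensity ramp, all gradient energy falls into the 0-degree bin, matching the intuition that HOG summarizes dominant edge directions.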

Pattern of local gravitational force (PLGF). Inspired by the Law of Universal Gravitation, PLGF [5] describes image interest regions via the local gravitational force magnitude, which is useful for reducing the impact of illumination/noise variation while preserving edge-based low-level clues. It can be formulated as

PLGF = arctan( sqrt( ((I * M_x) / I)^2 + ((I * M_y) / I)^2 ) ),
M_x(m, n) = m / (m^2 + n^2)^{3/2},  M_y(m, n) = n / (m^2 + n^2)^{3/2}  (zero at the center),

where I is the raw image, M_x and M_y are the two filter masks for gravitational force calculation, m and n are indexes denoting the relative position to the center, and * is the convolution operation sliding over all pixels. Visualizations of PLGF maps are shown in the fourth column of Fig. 2.
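A minimal NumPy sketch of this computation follows, under the formulation above (correlation is used in place of convolution for simplicity; since the masks are antisymmetric this only flips the sign, which the squaring removes):

```python
import numpy as np

def plgf(img, size=5, eps=1e-8):
    """Sketch of PLGF: filter the image with force masks Mx, My,
    normalize by local intensity, and combine via arctan of magnitude."""
    img = np.asarray(img, dtype=np.float64)
    r = size // 2
    m, n = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1),
                       indexing="ij")
    d3 = (m**2 + n**2) ** 1.5
    d3[r, r] = np.inf                 # no self-force at the center pixel
    Mx, My = m / d3, n / d3
    # naive 'same' filtering with edge padding
    pad = np.pad(img, r, mode="edge")
    fx = np.zeros_like(img)
    fy = np.zeros_like(img)
    for i in range(size):
        for j in range(size):
            patch = pad[i:i + img.shape[0], j:j + img.shape[1]]
            fx += Mx[i, j] * patch
            fy += My[i, j] * patch
    return np.arctan(np.sqrt((fx / (img + eps))**2 + (fy / (img + eps))**2))
```

On a constant image the antisymmetric masks cancel exactly, so the response is zero everywhere, reflecting PLGF's insensitivity to uniform illumination.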
Composition. Considering the complementary characteristics of the raw image and local descriptors, we also study compositions of these features via input-level concatenation. For example, 'GRAY HOG PLGF' denotes three-channel inputs (raw gray-scale channel + HOG + PLGF), which is visualized in the last column of Fig. 2.

Adaptive Multimodal Adapter
Recent studies have verified that introducing adapters [20] with fully connected (FC) layers can improve FAS performance when training data is inadequate. However, the FC-based adapter focuses on intra-token feature refinement but neglects 1) contextual features from neighboring local tokens; and 2) multimodal features from cross-modal tokens. To tackle these issues, we extend the convolutional adapter (ConvAdapter) [22] into a multimodal version for multimodal FAS.
As illustrated in Fig. 1, instead of directly finetuning the transformer blocks E_trans, we fix all the pre-trained parameters of E_patch and E_trans while training only the adaptive multimodal adapters (AMA) and E_head. An AMA module consists of four parts: 1) a 1×1 convolution with GELU, Θ↓, for dimension reduction from the original channel dimension D to a hidden dimension D'; 2) a 3×3 2D convolution Θ_2D mapping D'×K channels to D' for multimodal local feature aggregation, where K is the number of modalities; 3) an adaptive modality weight (w_1, ..., w_K) generator that cascades global average pooling (GAP), a 1×1 convolution Θ_Ada projecting D'×K channels to K, and the sigmoid function σ; and 4) a 1×1 convolution with GELU, Θ↑, for dimension expansion back to D. As features from different modalities are already spatially aligned, we restore the 2D structure for each modality after channel squeezing. Similarly, the 2D structure is flattened back into 1D tokens before channel expansion. The AMA can be formulated as

F_i = Reshape2D(Θ↓(T_i)),  i ∈ {RGB, IR, Depth},
[w_RGB, w_IR, w_Depth] = σ(Θ_Ada(GAP([F_RGB, F_IR, F_Depth]))),
T'_i = Θ↑(Flatten(w_i · Θ_2D([F_RGB, F_IR, F_Depth]))).      (3)

Here we show an example with K=3 (i.e., RGB+IR+Depth) in Eq. (3); AMA is flexible for arbitrary modalities (e.g., RGB+IR). Note that AMA is equivalent to the vanilla ConvAdapter [22] in the unimodal setting with K=1.
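The four-part pipeline above can be sketched as a forward pass in NumPy (standing in for the paper's PyTorch modules). The weight shapes, the bias-free layers, and the residual connection around the adapter are our assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ama_forward(tokens, W_down, W_conv, W_ada, W_up):
    """Sketch of one AMA pass.
    tokens: (K, H, W, D) spatially aligned visual tokens of K modalities.
    W_down: (D, D'), W_conv: (3, 3, D'*K, D'), W_ada: (D'*K, K), W_up: (D', D)."""
    K, H, W, D = tokens.shape
    Dh = W_down.shape[1]
    # 1) per-token channel squeeze D -> D' with GELU
    low = gelu(tokens @ W_down)                       # (K, H, W, D')
    # restore 2D structure, stack modalities along channels: (H, W, D'*K)
    stacked = np.concatenate([low[k] for k in range(K)], axis=-1)
    # 2) 3x3 conv D'*K -> D' for multimodal local aggregation (naive loop)
    pad = np.pad(stacked, ((1, 1), (1, 1), (0, 0)))
    agg = np.zeros((H, W, Dh))
    for i in range(3):
        for j in range(3):
            agg += pad[i:i + H, j:j + W] @ W_conv[i, j]
    # 3) adaptive modality weights: GAP -> 1x1 conv (D'*K -> K) -> sigmoid
    w = sigmoid(stacked.mean(axis=(0, 1)) @ W_ada)    # (K,)
    # 4) re-weight the shared features per modality and expand D' -> D
    out = np.stack([gelu((w[k] * agg) @ W_up) for k in range(K)])
    return tokens + out                               # assumed residual
```

Because only these small adapter weights (plus the head) are trained, the bulk of the ViT stays frozen, which is the efficiency argument of the section.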

Modality-Asymmetric Masked Autoencoder
Existing multimodal FAS works usually finetune ImageNet pre-trained models, which might be sub-optimal due to the huge task and modality gaps. Meanwhile, considering the costly collection of large-scale annotated live/spoof data, self-supervised pre-training without labels [34] is promising for model initialization in multimodal FAS. Here we propose the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) for multimodal FAS self-supervised pre-training.
As shown in Fig. 3, given a multimodal face sample (X_RGB, X_IR, X_Depth), M$^{2}$A$^{2}$E randomly selects a unimodal input X_i (i ∈ {RGB, IR, Depth}) among all modalities. A random sampling strategy [18] is then used to mask out p percent of the visual tokens in X_i. Only the unmasked visible tokens are forwarded through the ViT encoder, while both visible and masked tokens are fed into unshared ViT decoders. In terms of the reconstruction target, given a masked input X_i of the i-th modality, M$^{2}$A$^{2}$E aims to predict the pixel values with a mean squared error (MSE) loss for 1) each masked patch of X_i, and 2) the whole input images of the other modalities X_j (j ≠ i; j ∈ {RGB, IR, Depth}). The motivation behind M$^{2}$A$^{2}$E is that, with the multimodal reconstruction target, the self-supervised pre-trained ViTs are able to model intrinsic task-aware and cross-modal representations.
Relation to modality-symmetric autoencoders [3,18].
Compared with the vanilla MAE [18], M$^{2}$A$^{2}$E adopts the same masking strategy in the unimodal ViT encoder but targets multimodal reconstruction with multiple unshared ViT decoders. Besides, M$^{2}$A$^{2}$E is similar to the multimodal MAE [3] only when partial tokens from a single modality are visible while all tokens from the other modalities are masked.
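The input/target selection described above can be sketched as follows; the dict-of-arrays interface and function name are hypothetical conveniences for illustration, with each modality represented as an array of patch tokens:

```python
import numpy as np

def m2a2e_sample(tokens_by_modality, mask_ratio=0.4, rng=None):
    """Sketch of M2A2E sampling: pick one modality at random, mask
    `mask_ratio` of its tokens for the encoder, and set reconstruction
    targets to (a) the masked patches of that modality and (b) the full
    token sets of all other modalities."""
    rng = rng or np.random.default_rng()
    modalities = list(tokens_by_modality)
    chosen = modalities[rng.integers(len(modalities))]
    tokens = tokens_by_modality[chosen]
    n = tokens.shape[0]
    n_mask = int(round(mask_ratio * n))
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:n_mask], perm[n_mask:]
    encoder_input = tokens[visible_idx]        # only visible tokens encoded
    targets = {chosen: tokens[masked_idx]}     # masked patches of chosen modality
    for m in modalities:
        if m != chosen:
            targets[m] = tokens_by_modality[m] # full images of other modalities
    return chosen, encoder_input, targets
```

Each target would then be reconstructed by its own unshared decoder and penalized with an MSE loss, which is what makes the scheme asymmetric: one masked modality in, all modalities out.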

Datasets and Performance Metrics
Three commonly used multimodal FAS datasets are used in our experiments: WMCA [17], CASIA-SURF (MmFA) [56], and CASIA-SURF CeFA (CeFA) [27]. WMCA contains a wide variety of 2D and 3D PAs in four modalities and introduces two protocols: a 'seen' protocol that emulates the seen-attack scenario, and an 'unseen' protocol that evaluates generalization to unseen attacks. MmFA consists of 1000 subjects with 21000 videos, each sample having 3 modalities, and has an official intra-testing protocol. CeFA is the largest multimodal FAS dataset, covering 3 ethnicities, 3 modalities, 1607 subjects, and 34200 videos. We conduct intra- and cross-dataset testing on the WMCA and MmFA datasets, and leave the large-scale CeFA for self-supervised pre-training.
In terms of evaluation metrics, the Attack Presentation Classification Error Rate (APCER), the Bonafide Presentation Classification Error Rate (BPCER), and their average, the ACER [21], are used. The ACER on the test set is determined by the Equal Error Rate (EER) threshold on the dev set for MmFA, and by the BPCER=1% threshold for WMCA. True Positive Rate (TPR)@False Positive Rate (FPR)=10^{-4} [56] is also reported for MmFA. For cross-testing experiments, the Half Total Error Rate (HTER) is adopted.
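These error rates are straightforward to compute from per-sample scores; a minimal sketch, assuming the common convention that higher scores indicate bonafide presentations:

```python
import numpy as np

def acer_metrics(scores_attack, scores_bonafide, threshold):
    """APCER/BPCER/ACER from liveness scores (higher = more bonafide).
    APCER: fraction of attacks wrongly accepted as bonafide;
    BPCER: fraction of bonafide presentations wrongly rejected;
    ACER:  their mean."""
    apcer = np.mean(np.asarray(scores_attack) >= threshold)
    bpcer = np.mean(np.asarray(scores_bonafide) < threshold)
    return apcer, bpcer, (apcer + bpcer) / 2.0
```

For example, with attack scores [0.1, 0.2, 0.6, 0.9], bonafide scores [0.8, 0.9, 0.4, 0.95], and threshold 0.5, this yields APCER=0.5, BPCER=0.25, ACER=0.375. The protocol-specific part is only how the threshold is chosen (EER on the dev set for MmFA, BPCER=1% for WMCA).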

Implementation Details
We crop face frames using the MTCNN [53] face detector. The local descriptors are extracted from gray-scale images with 1) 3×3 neighbors for LBP [36]; 2) 9 orientations, 8×8 pixels per cell, and 2×2 cells per block for HOG [10]; and 3) a mask size of 5 for PLGF [5]. The composition input 'GRAY HOG PLGF' is adopted for the IR modality in both unimodal and multimodal experiments, while raw inputs are used for the RGB and Depth modalities. ViT-Base [11] supervised by a binary cross-entropy loss is the default architecture. For direct finetuning, only the last transformer block and the classification head are trainable. For AMA and ConvAdapter [22] finetuning, the original and hidden channel dimensions are D=768 and D'=64, respectively. For M$^{2}$A$^{2}$E, a mask ratio of p=40% is used, while the decoder depth and width are 4 and 512, respectively.
The experiments are implemented with PyTorch on one NVIDIA A100 GPU. For self-supervised pre-training on CeFA with RGB+IR+Depth modalities, we use the AdamW [32] optimizer with learning rate (lr) 1.5e-4, weight decay (wd) 0.05, and batch size 64. ImageNet pre-trained weights are used to initialize our encoder. We train M$^{2}$A$^{2}$E for 400 epochs, warming up for the first 40 epochs and then applying cosine decay. For supervised unimodal and multimodal experiments on WMCA and MmFA, we use the Adam optimizer with fixed lr=2e-4, wd=5e-3, and batch size 16. We finetune models for at most 30 epochs from the ImageNet or M$^{2}$A$^{2}$E pre-trained weights.

Intra-dataset Testing
Intra testing on WMCA.
The unimodal and multimodal results on the 'seen' and 'unseen' protocols of WMCA [17] are shown in Table 1. On the one hand, compared with the direct finetuning results of 'ViT', ViT+AMA/ConvAdapter achieves significantly lower ACER in all modality settings on both the 'seen' and 'unseen' protocols. This indicates that the proposed AMA efficiently leverages unimodal/multimodal local inductive cues to boost the original ViT's global contextual features. On the other hand, when replacing the ImageNet pre-trained ViT with the self-supervised M$^{2}$A$^{2}$E pre-trained on CeFA, the generalization for unseen attack detection improves markedly with the 'IR', 'Depth', 'RGB+IR', 'IR+Depth', and 'RGB+IR+Depth' modalities, indicating its excellent transferability to downstream modality-agnostic tasks. It is surprising to find in the last block that the proposed methods with RGB+IR+Depth modalities even outperform 'MC-CNN' [17] with four modalities on both the 'seen' and 'unseen' protocols. Although 'MA-ViT' [26], with its complex and specialized modules, outperforms the proposed methods with RGB+Depth modalities by 2.01% ACER on the 'unseen' protocol, the proposed AMA and M$^{2}$A$^{2}$E might be preferable for their simplicity and efficiency.

Ablation Study
We also provide ablation studies for the inputs with local descriptors and AMA on the 'seen' protocol of WMCA, and studies for M$^{2}$A$^{2}$E on the 'unseen' protocol of WMCA and cross testing from WMCA to MmFA.
Impact of inputs with local descriptors. In the default setting of ViT inputs, the composition input 'GRAY HOG PLGF' is adopted for the IR modality, while raw inputs are used for the RGB and Depth modalities. In this ablation, we consider three local descriptors ('LBP' [36], 'HOG' [10], 'PLGF' [5]) and their compositions ('HOG PLGF', 'LBP HOG PLGF', 'GRAY HOG PLGF'). As can be seen from Fig. 4, the 'LBP' input usually performs worse than the other features for all three modalities. In contrast, the 'PLGF' input achieves reasonable performance (even better than the raw input for the IR modality via direct finetuning). Raw inputs are good enough for all modalities via ConvAdapter. One highlight is that the composition input 'GRAY HOG PLGF' performs best for the IR modality via both direct finetuning and ConvAdapter, indicating the importance of local detailed and illumination-invariant cues in IR feature representation.
Impact of adapter types. Here we discuss five possible adapter types for efficient multimodal learning, including the FC-based 'vanilla adapter' [20], the independent-modal 'ConvAdapter' [22], the 'multimodal ConvAdapter', the 'multimodal ConvAdapter (huge)', and the proposed AMA. As shown in Fig. 5, 'multimodal ConvAdapter' reduces ACER by more than 0.5% in all multimodal settings by aggregating multimodal local features. In contrast, we see no performance improvement from 'multimodal ConvAdapter (huge)'; in other words, directly learning high-dimensional (D'×K) convolutional features for all K modalities results in serious overfitting. Compared with 'multimodal ConvAdapter', AMA enhances the feature diversity of different modalities by adaptively weighting the shared low-dimensional (D') convolutional features, which decreases ACER by 0.52%, 0.31%, 1.07%, and 0.07% for 'RGB+Depth', 'RGB+IR', 'IR+Depth', and 'RGB+IR+Depth', respectively.
Impact of dimension and position of AMA. Here we study the hidden dimension D' of AMA and the impact of the AMA position within transformer blocks. As can be seen from Fig. 6, despite being more lightweight, lower dimensions (16 and 32) cannot achieve satisfactory performance due to weak representation capacity. The best performance is achieved with D'=64 in all multimodal settings. In terms of AMA positions, it is interesting to find from Fig. 7 that plugging AMA along the FFN performs better than along the MHSA in multimodal settings. This might be because the multimodal local features complement the limited point-wise receptive field of the FFN. Besides, it is reasonable that applying AMA on MHSA+FFN performs best.
Impact of mask ratio in M$^{2}$A$^{2}$E. Fig. 8(a) illustrates the generalization of the M$^{2}$A$^{2}$E pre-trained ViT when finetuning on the 'unseen' protocol of WMCA and cross testing from MmFA to WMCA. Different from the conclusions of [3,18], which use very large mask ratios (e.g., 75% and 83%), we find that mask ratios ranging from 30% to 50% are suitable for multimodal FAS, and the best generalization performance on both testing protocols is achieved when the mask ratio equals 40%. In other words, extremely high mask ratios (e.g., 70% to 90%) might force the model to learn overly semantic features while ignoring some useful low-/mid-level live/spoof cues.
Comparison between multimodal MAE [3] and M$^{2}$A$^{2}$E.
We also compare M$^{2}$A$^{2}$E with the symmetric multimodal MAE [3] when finetuning on all downstream modality settings. As can be seen from Fig. 9, with the more challenging reconstruction target (from masked unimodal inputs to multimodal prediction), M$^{2}$A$^{2}$E outperforms the best settings of the multimodal MAE [3] on most modalities ('RGB', 'IR', 'RGB+IR', 'RGB+Depth', 'RGB+IR+Depth'), indicating its excellent downstream modality-agnostic capacity.

Conclusions and Future Work
In this paper, we investigate three key factors (i.e., inputs, pre-training, and finetuning) for ViT-based multimodal FAS. We propose to combine local feature descriptors for IR inputs, and design the modality-asymmetric masked autoencoder and the adaptive multimodal adapter for efficient self-supervised pre-training and supervised finetuning of multimodal FAS. We note that the study of ViT-based multimodal FAS is still at an early stage. Future directions include: 1) besides inputs, integrating local descriptors into transformer blocks [50] or adapters is promising for ViT-based multimodal FAS; 2) besides generalization, the discriminative capacity of M$^{2}$A$^{2}$E pre-trained models should be improved; regularization strategies such as distillation [52] might be explored.

Figure 2. Visualization of the raw inputs and local descriptors (e.g., LBP [36], HOG [10], and PLGF [5]) and their compositions.
Figure 1. Framework of ViT finetuning with adaptive multimodal adapters (AMA). The AMA and classification head are trainable, while the linear projection and vanilla transformer blocks are fixed with the pre-trained parameters. 'MHSA', 'FFN', and 'GAP' are short for multi-head self-attention, feed-forward network, and global average pooling, respectively.

Figure 3. The framework of the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E). Different from the previous multimodal MAE [3], which masks all modalities as inputs, our M$^{2}$A$^{2}$E randomly selects a unimodal masked input for multimodal reconstruction.

Figure 4. Impacts of inputs with local feature descriptors (e.g., LBP, HOG, PLGF) for ViT using direct finetuning and ConvAdapter strategies on (a) RGB, (b) IR, and (c) Depth modalities.More results on multimodal settings can be found in Appendix A.

Figure 5. Ablation of the adapter types in transformer blocks.

Figure 6. Ablation of the hidden dimensions in AMA.
Figure 7. Ablation of the AMA positions in transformer blocks.

Table 2. The results on MmFA. Larger TPR and lower ACER values indicate better performance. Best results are marked in bold.

Table 3. The HTER (%) values of the cross-testing between the WMCA and MmFA datasets. Best results are marked in bold.