A study of sparse representation-based classification for biometric verification based on both handcrafted and deep learning features

Biometric verification is generally considered a one-to-one matching task. In contrast, in this paper, we argue that one-to-many competitive matching via sparse representation-based classification (SRC) can bring enhanced verification security and accuracy. SRC-based verification introduces non-target subjects to construct a dynamic dictionary together with the claimed client and encodes the submitted feature over it. Owing to the sparsity constraint, a client can only be accepted when it defeats almost all non-target classes and wins a convincing sparsity-based matching score. This makes verification more secure than approaches using one-to-one matching. However, intense competition may also lead to extremely inferior genuine scores when data degeneration occurs. Motivated by these latent benefits and concerns, we study SRC-based verification using two sparsity-based matching measures, three biometric modalities (i.e., face, palmprint, and ear) and their multimodal combinations, based on both handcrafted and deep learning features. We thus arrive at a comprehensive study of SRC-based verification, including its methodology, characteristics, merits, challenges, and directions for resolving them. Extensive experimental results demonstrate the superiority of SRC-based verification, especially when using multimodal fusion and advanced deep learning features. Concerns about its efficiency in large-scale user applications can be readily addressed using a simple dictionary shrinkage strategy based on cluster analysis and random selection of non-target subjects.


Introduction
Biometric verification is generally considered a one-to-one matching problem, which is solved by comparing the captured biometric data with the gallery template(s) associated with the claimed identity to produce a matching score [1,2]. The matching score is then compared with the system's operating threshold to decide whether the user can be authenticated or not. The operating threshold is chosen in the training phase to minimize some a posteriori performance criterion, e.g., the equal error rate (EER), based on the genuine and impostor score distributions. However, it is unlikely that one can collect and/or generate a sufficiently rich set of representative templates for each client to cover all possible changes, for example, expression, pose, illumination, aging, and occlusion on the face, so as to accurately model the score distributions [1][2][3]. Data imbalance between the genuine and impostor samples is also a challenge [4]. Moreover, human faces, for example, are all somewhat similar, and some subjects may have very similar face images [5]. In real-world applications, it is also unlikely that the distributions of genuine and impostor scores will always be completely non-overlapping. As a result, there is rarely an ideal operating threshold at which both the false accept rate (FAR) and the false reject rate (FRR) are zero. Furthermore, in the test phase, one-to-one matching verification ignores the correlation of the probe sample with other people. Therefore, it is insufficient and insecure to authenticate a user using only one-to-one matching and an imperfect predetermined operating threshold, especially in safety-critical applications such as the military, security, and finance.
Recently, deep learning (DL)-based approaches have made substantial progress in the computer vision and pattern recognition community [6,7]. Many deep convolutional neural network (CNN)-based face verification systems have achieved near-perfect performance on large-scale unconstrained benchmarks, such as LFW [8] and MegaFace [9]. However, more and more studies have reported that state-of-the-art (SOTA) DL-based face recognition systems, including VGGFace, SphereFace, and ArcFace, are highly vulnerable to presentation attacks [10,11], morphing attacks [12], and adversarial perturbations [13]. These studies validated that deep models with higher face recognition accuracy show higher levels of vulnerability, and that they are more vulnerable than some approaches using handcrafted features [10][11][12]. Note that these SOTA face verification approaches use deep CNNs to extract features from the test image, and then operate in the conventional one-to-one matching verification framework. Evidently, recent efforts in deep learning-based feature extraction have yet to bring about desirable security performance in verification systems. It is time to rethink simple one-to-one matching and shift some research focus to new classification mechanisms and security measures.
Sparse representation-based classification (SRC) has been studied extensively and proven powerful in biometric identification [14][15][16][17]. SRC techniques conduct a one-to-many comparison in a single sparse coding process and are naturally suited to the identification task. The idea and implementation of the original SRC model are very simple and straightforward: represent the test sample as a sparse linear combination of the training samples in an overcomplete dictionary, and then classify the test sample to the class that yields the minimum class-specific reconstruction residual [14]. The dictionary is constructed with the training samples of all the classes. Some variants of SRC also expand the dictionary by adding linear and non-linear variation sub-dictionaries to alleviate the insufficient-training-samples problem [18][19][20], or by directly using a learned dictionary [21]. Recently, several studies reported that when using deep CNN features, many SRC extensions achieve significant improvements in accuracy and robustness [16,[22][23][24][25][26]. Inspired by the great success of SRC in identification, some studies have introduced SRC into the unimodal verification of speaker, face, and finger vein [27][28][29][30][31][32]. Huang et al. [3] reported that multimodal verification using SRC shows very promising verification accuracy and strong resistance to unimodal presentation attacks. SRC-based verification follows similar pre-classification procedures, but it only compares the class-specific sparsity-based matching score associated with the claimed identity with the system's operating threshold, regardless of the remaining scores. From a process point of view, the difference between SRC-based verification and one-to-one matching verification lies in how matching scores are generated and whether non-target subjects participate in the comparison.
A brief overview of the existing literature about SRC and SRC-based verification will be presented in the section "Related work".
Although current research has experimentally verified the effectiveness of SRC-based verification in some biometric fields, it has not been studied in depth in terms of its verification characteristics, merits, shortcomings, and challenges. By incorporating non-target subjects, SRC-based verification provides a competing mechanism that allocates class-specific sparsity-based matching scores to all classes in the dictionary using sparse optimization. Owing to the sparsity constraint, SRC allows only one or very few classes to get a good matching score, while the remaining classes get inferior scores. To be accepted, the genuine class needs to defeat almost all the non-target subjects in the sparse coding competition and obtain an eligible score that is superior to a certain predetermined operating threshold. Therefore, an acceptance response made by SRC-based verification should be more convincing than one based on one-to-one matching. Essentially, SRC-based verification not only examines the matching score obtained by the claimed client but also implicitly compares the correlations of the query data to the client and many non-target subjects, and thereby offers enhanced protection for identity security.
On the other hand, biometric sample quality often fluctuates with illumination, pose, and appearance variations [33][34][35]. Moderate data degeneration is inevitable in practical applications. Under these circumstances, once a genuine client fails to get a top rank in the competition, it is more likely to get an extremely inferior score and thus be rejected falsely. It is also impractical to authenticate a user with an overly relaxed operating threshold, which would lead to an excessively high false accept rate. Therefore, there is an urgent need to study how to mitigate this problem, or under what circumstances SRC-based verification is preferable. Moreover, the heavy computational burden and accuracy degradation in large-scale user applications still haunt SRC-based identification [36,37]. Given the similar classification mechanism, how to free SRC-based verification from this problem is also critical.
In this work, we focus on theoretical analysis and experimental validation of the characteristics, merits, and challenges of verification with the general SRC model [14], rather than designing a new sparse coding model. We also try to explore directions for resolving its challenges. To these ends, we study SRC-based verification using two sparsity-based matching measures on three biometric traits, i.e., face, palmprint, and ear, and their multimodal combinations, based on DCT and ArcFace-based CNN [7] features. The two sparsity-based matching measures are the sparse coding error (SCE) and the sparse contribution rate (SCR) [14]. SCE signals the representation ability of SRC, while SCR reflects the sparseness of the coding coefficients. In SRC-based multimodal verification, we apply the Sum rule to combine the matching scores of all modalities. Well-known multimodal methods based on one-to-one matching and cosine similarity are used as competitors, e.g., the support vector machine (SVM) and the likelihood ratio (LLR) [38].
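To make the fusion step concrete, the sketch below shows Sum-rule fusion of per-modality matching scores in Python. The min-max normalization step is an assumption added for illustration; the paper does not specify its score normalization here.

```python
import numpy as np

def min_max_normalize(scores):
    """Map a set of matching scores to [0, 1] (assumed normalization step)."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def sum_rule_fusion(modality_scores):
    """Sum-rule fusion: add the normalized scores of each modality."""
    return np.sum([min_max_normalize(s) for s in modality_scores], axis=0)

# Example: three probes scored by a face matcher and a palmprint matcher
face_scores = [0.9, 0.2, 0.4]
palm_scores = [0.8, 0.1, 0.3]
fused = sum_rule_fusion([face_scores, palm_scores])
```

With similarity scores, the probe that scores highest in both modalities keeps the highest fused score; distance-type scores such as SCE would be negated or inverted before fusion.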
Our contributions in this paper are summarized as follows.
1. Overall, we present a comprehensive study of SRC-based verification that has never been done in the existing literature, including its methodology, characteristics, merits, challenges, and directions for resolving them.

2. Extensive experiments involving three biometric traits and their multimodal combinations, and both handcrafted and deep learning features, demonstrate that SRC-based verification significantly outperforms many well-known methods based on one-to-one matching in both unimodal and multimodal scenarios.
3. We empirically confirm a strong correlation between verification accuracy and the inter-class separability among classes in the coding dictionary. SRC-based verification is particularly suitable for scenarios using advanced deep learning features and multiple biometrics, avoiding the long-tail effect of the receiver-operating characteristic (ROC) curve.
4. We propose to shrink the coding dictionary to a certain small scale using cluster analysis and a random selection strategy. Dictionary shrinkage can avoid massive computational cost and accuracy degradation in large-scale user applications.
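The dictionary shrinkage idea in contribution 4 can be sketched as follows. This is a hypothetical recipe (keep the claimed client's cluster neighbours plus a few randomly chosen non-target subjects); the paper only names cluster analysis and random selection, so the exact selection logic below is our assumption.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Tiny k-means (Lloyd's algorithm) over per-subject feature centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)              # nearest center per subject
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def shrink_dictionary(subject_feats, client_id, k, n_random, seed=0):
    """Illustrative shrinkage: keep the claimed client, its cluster neighbours,
    and a few randomly selected non-target subjects from other clusters."""
    labels = kmeans(subject_feats, k, seed=seed)
    same_cluster = [i for i in range(len(subject_feats))
                    if labels[i] == labels[client_id] and i != client_id]
    others = [i for i in range(len(subject_feats))
              if labels[i] != labels[client_id]]
    rng = np.random.default_rng(seed)
    n_random = min(n_random, len(others))
    randoms = [int(r) for r in rng.choice(others, size=n_random, replace=False)] if others else []
    return [client_id] + same_cluster + randoms
```

The returned subject indices would then define the columns of the shrunken coding dictionary.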
The rest of the paper is organized as follows. In the section "Related work", we briefly review the SRC techniques and the verification using SRC. In the section "Datasets", we introduce the face, palmprint, and ear datasets, and three chimeric multimodal datasets. In the section "SRC-based biometric verification", we first present the methodology of SRC-based verification and two sparsity-based matching measures. Second, we discuss its features and challenges. Third, we explore the solutions using dictionary shrinkage and multimodal extension. In the section "Experiments", we report our experimental results and analysis. Finally, we draw conclusions and provide some research directions for future work in the section "Conclusions and future directions".

Related work
Sparse representation-based classification

Wright et al. [14], for the first time, put forward the SRC model and showed its significant improvement in face identification. Its success has largely boosted research on biometric recognition based on sparse representation and collaborative representation [5]. A number of variants and extensions of SRC have been proposed in the last decade. Meanwhile, many works have also paid attention to the source of their discriminative ability [5,19,20,25,39].
A major direction is to explore its capacity to handle complex variations like illumination, pose, corruption, and occlusion. Yang et al. [40] proposed a Gabor-based SRC (GSRC) using Gabor features, which was shown to be more robust against illumination changes and pose mismatches. They also proposed a robust sparse coding (RSC) model that regards sparse coding as a sparsity-constrained regression problem [15]. RSC can effectively estimate the corrupted pixels and occluded regions and then exclude them from the sparse representation in an iterative process. Zhou et al. [41] proposed to detect contiguous occluded regions in the test image using Markov random fields. Iliadis et al. [42] proposed a fast low-rank and iterative reweighted nonnegative least-squares algorithm, namely F-LR-IRNNLS, to address the problem of contiguous occlusions. F-LR-IRNNLS assumes that the error image is low-rank relative to the image size and that it follows a distribution which can be described by a tailored potential loss function. Lai et al. [43] proposed a modular weighted global sparse representation (WGSR) method that divides an image into modules, sparsely encodes each module separately, and then dynamically combines their reconstruction errors based on their reliability for the final classification. Lai et al. [44] proposed a collaborative patch framework using class-wise sparse representation (CSR-CP) to tackle the problem of uncontrolled training data. CSR-CP optimizes all patches together to seek a group-wise sparse representation by putting all patches of an image into a group.
Although SRC and its extensions have significantly improved the robustness of biometric identification, they are often criticized for their harsh requirements on the quality and number of training samples per subject, and for their poor efficiency in solving the sparse optimization problem in large-scale scenarios. SRC requires sufficient and well-controlled training samples per user to maintain its superior performance [14]. However, in real-world applications, the training data often contain a large number of identities, but sufficient representative images for every identity cannot be guaranteed. To solve this insufficient-training-samples problem, also known as the under-sampled problem, a lot of effort has been made in the community. Deng et al. proposed several dictionary augmentation methods to enhance the representation ability of the gallery dictionary, including extended SRC (ESRC) [18], superposed SRC (SSRC) [19], and the superposed linear representation classifier (SLRC) [20]. These methods take advantage of intra-class variation, class centroids, and the sample-to-centroid difference to construct the coding dictionary. Gao et al. [45] proposed a semi-supervised SRC (S³RC) that uses a variation dictionary to represent the linear variation of the test sample and a gallery dictionary learned with a Gaussian mixture model (GMM) to represent the non-linear variation. Jiang et al. [46] proposed a sparse- and dense-hybrid representation model based on supervised low-rank dictionary decomposition/learning, aiming to alleviate the under-sampled problem and the uncontrolled training data problem simultaneously. Yang and Wang et al. paid attention to more fine-grained part-based methods [22,23]. The face image is divided into multiple overlapping patches, centered around 5 facial keypoints and 16 regularly sampled facial points.
A joint and collaborative representation is performed on the local dictionaries, each with an intra-class variation sub-dictionary, based on the local convolution or Gabor features for the final classification. The aforementioned methods and strategies have alleviated the problems of under-sampled and uncontrolled training data in small datasets to a certain extent, but their effectiveness in large-scale datasets remains to be tested.
Recall that SRC methods use the training templates of all classes to construct the coding dictionary, and thus the computational cost of sparse optimization increases with the growth of the dictionary scale [14]. Therefore, how to improve efficiency is crucial in large-scale user applications. To alleviate this issue, Xu et al. [47] proposed a two-phase test sample representation method for face recognition. The first phase uses all of the training samples to represent the test sample via the more efficient L2-norm-based collaborative representation and selects a limited number of "nearest neighbors" according to the representation ability of each training sample. Xu et al. [48] further improved the method by using both the original and newly generated 'symmetrical face' samples of a small number of classes that are 'near' to the test sample to represent and classify it. He et al. [49] proposed to filter the database into a small subset based on the nearest-neighbor criterion in a learned robust metric, and then perform nonnegative sparse representation-based classification with a small dictionary. All the above methods use a two-stage strategy that selects a small subset from the entire database in some efficient way and then performs SRC using a dictionary built with the selected data. Although they can substantially reduce the computational cost compared with one-stage SRC methods, data filtering over the whole dataset is still very time-consuming in large-scale user applications.

Verification using SRC
More than two decades ago, Verlinde et al. [50] proposed a one-to-many matching biometric verification method using a k-NN classifier. This is one of the pioneering attempts to consider non-target subjects in the test phase for verification. Cohort-based score normalization also takes advantage of non-target subjects, but serves conventional one-to-one matching verification [51]. Nevertheless, verification using non-target subjects and one-to-many matching did not receive much attention. Recently, inspired by the great success of SRC-based identification, SRC has also been introduced in the fields of unimodal verification of speaker, face, and finger vein [27][28][29][30][31][32], and multimodal verification using face and ear [3].
In Ref. [27], GMM mean supervectors are used as features of an utterance to construct the coding dictionary. The L1-norm value of the coding coefficients associated with the claimed identity is used as the genuine score, while such L1-norm values for the other classes are impostor scores. Although their experiments did not show improved performance for the proposed SRC-based verification alone, its complementary information to the standard UBM-GMM classifier was clearly validated. Li et al. [28] built the coding dictionary using the i-vectors from total variability as atoms and evaluated three sparsity-based measures for speaker verification, including the L1-norm ratio (i.e., SCR), the L2 residual ratio, and a Bnorm L2 residual (a regularized SCE measure). The Bnorm L2 residual measure outperforms the other two measures in their experiments. Their SRC-based verification approaches obtain inferior performance compared to an SVM classifier based on cosine similarity. However, improved verification results are achieved when combining the sparsity-based scores and the SVM results. In Ref. [29], Kua et al. also investigated i-vector-based SRC verification (iSRC) using the L1 constraint, the L2 constraint, and both constraints (elastic net) in the coding optimization. The L1-norm ratio is used as the verification criterion, which was claimed to be superior to the other two measures proposed in Ref. [28]. They also validated that a small-size dictionary chosen based on column vector frequency can improve verification accuracy and efficiency. Hasheminejad and Farsi [30] proposed to learn target, background, and noise dictionaries with orthogonal atoms, and concatenated them to construct an overcomplete dictionary for speaker verification. The derived Bnorm L2 residual scores are transformed to log-likelihood-ratio scores before decision. They reported better verification performance than iSRC.
Xin et al. [31] also utilized SCE as the matching measure in finger vein verification and obtained a very low EER of 0.017% on a dataset with 600 fingers, which is also better than many competitive methods in their experiments. Shin et al. [32] performed sparse representation of the test color face image on each channel of different color configurations and combined the class-specific reconstruction residuals (i.e., SCE) with the Sum fusion rule for verification. The approach surpasses one-to-one matching verification using LBP and Gabor features by a large margin of 12-22% EER on the CMU Multi-PIE and Color FERET face datasets. However, such a system, which relies heavily on color channels, would be sensitive to facial appearance variations, illumination, and sensors in applications. Moreover, the EER results of 1.89% and 2.79% they achieved are still very high for real-world applications.
In Ref. [3], Huang et al. performed sparse representation on the face and ear modalities, respectively. Multimodal verification based on the summation of the SCE scores of the two modalities achieves about 0.2% EER on a multimodal dataset built with the AR face dataset and the USTB III ear dataset. However, the method is sensitive to worst-case partial spoof attacks. Aiming to improve the anti-spoofing performance, they also proposed to use collaborative representation fidelity with non-target subjects to measure the affinity of the query sample to the claimed client. The resulting SCE scores and affinity scores of the two modalities are then combined in a stacked way to train an SVM classifier. The method was reported to have promising anti-spoofing performance while achieving a good trade-off between verification accuracy and anti-spoofing capability.
Most studies of SRC-based verification are in the speaker verification community. Overall, the verification improvement brought by SRC-based verification there is limited, while incurring expensive computational cost. This could be attributed to the large intra-class variations and small inter-class variations of speech signals, which somewhat violate the two preconditions of SRC application. According to the existing literature, it seems that SRC-based verification has not received much attention in mainstream biometric communities like face, iris, and fingerprint. In this paper, we try to uncover the limitations of SRC-based verification and the concerns about it in the community. We hope that our experimental results and findings will rekindle interest in SRC-based verification. Finally, we would also like to emphasize that SRC-based verification is different from the approaches in Refs. [52][53][54]. In those studies, sparse coding is used for feature extraction based on a learned dictionary, and the verification still operates in the conventional one-to-one matching way. This loses some of the benefits of competitive matching between the client and non-target subjects.

Datasets
In this paper, we study SRC-based verification using three modalities, i.e., face, palmprint, and ear, and their combinations. Note that SRC classifiers generally require multiple gallery samples per subject to construct an overcomplete dictionary for sparse coding [14,55], if no dictionary augmentation or optimization techniques like those in Refs. [18][19][20][21][22][23] are used. We thus select the publicly available Georgia Tech (GT) [56] and AR [57] face datasets, the PolyU 2D&3D palmprint dataset [58,59], and the USTB III ear dataset [60]. Their constitutions are shown in Table 1. Figure 1 shows sample images of one user in each dataset. In USTB III, the samples in the red box are used as gallery, and the remainder are used as probes, except for the two images in the blue box with large pose variation.
To evaluate SRC-based multimodal verification, we create three chimeric multimodal datasets by pairing subjects from datasets of different modalities. This is a widely used method in the community for creating multimodal datasets [1,4]. The underlying assumption is that different modalities can generally be considered physiologically independent. The adjacency of the face and the ear in physical location may lead to pose correlation between their samples, depending on the data collection setup and user cooperation. We take this issue into account using a universal pairing protocol to produce more virtual multimodal samples. That is, one sample of a modality is paired with all the probe samples of another modality to form multiple multimodal probe samples. For example, for a virtual subject in MD III, the 7 face images from the GT probe set are paired with the 10 palmprint images from the PolyU probe set. Then, we get 100 × 7 × 10 = 7000 multimodal probe samples for all the 100 subjects, as shown in Table 2. This universal pairing protocol brings more instances for testing, covering the possible multimodal combinations as fully as possible, and meanwhile makes the evaluation more challenging. Note that we use a unique pairing protocol to create multimodal gallery samples. Table 2 summarizes the composition of the three chimeric multimodal datasets we created, namely, MD I, MD II, and MD III. MD I and MD II use face and ear traits, and MD III uses face and palmprint traits. Note that we use the minimum number of users to construct a multimodal dataset when the unimodal datasets used have different user scales. MD I uses the first 50 subjects of USTB III, and MD II uses the first 79 subjects of the AR face dataset. To differentiate these unimodal subsets from their whole datasets, they are hereinafter denoted "USTB III (50)" and "AR (79)". In our experiments using DCT features, all images of the three modalities are resized to 50 × 40 pixels for extracting a 200-D DCT feature.
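The universal pairing count above can be checked with a short script. The identifier scheme below is made up for illustration; only the 100 × 7 × 10 arithmetic comes from the text.

```python
from itertools import product

def universal_pairing(face_probes, palm_probes):
    """Pair every face probe with every palmprint probe of the same
    virtual subject (the universal pairing protocol described above)."""
    return {subj: list(product(face_probes[subj], palm_probes[subj]))
            for subj in face_probes}

# MD III-style setup: 100 virtual subjects, 7 face and 10 palmprint probes each
face = {s: [f"face_{s}_{i}" for i in range(7)] for s in range(100)}
palm = {s: [f"palm_{s}_{i}" for i in range(10)] for s in range(100)}
pairs = universal_pairing(face, palm)
total = sum(len(v) for v in pairs.values())  # 100 * 7 * 10 = 7000
```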
In the CNN experiments, all images are resized to 112 × 112 pixels before being fed into the ArcFace-based CNN networks. If not specified otherwise, the experimental results used for illustration in the section "SRC-based biometric verification" are obtained using DCT features.

SRC model
Assume that there are sufficient well-controlled training samples for each class in a dataset with K classes. For simplicity, suppose all classes have n training samples, and let A = [A_1, A_2, ..., A_K] ∈ R^(M×N) (with N = Kn) be the dictionary composed of all training samples. Given a query sample y ∈ R^M, it can be represented by y ≈ Aα, where α ∈ R^N is the coding coefficient vector. If M << N, a sparsest solution can generally be sought by solving the L0-norm optimization problem

α̂ = arg min_α ||α||_0  s.t.  ||y − Aα||_2 ≤ ε,  (1)

where ||·||_0 denotes the L0-norm and ε > 0 is a constant. Solving the L0 optimization problem in (1) is NP-hard and extremely time-consuming. However, if the solution of the L0 optimization problem in (1) is sparse enough, it is equal to the solution of the following L1-norm optimization problem [14]:

α̂ = arg min_α ||α||_1  s.t.  ||y − Aα||_2 ≤ ε,  (2)

where ||·||_1 denotes the L1-norm. This problem can be solved in polynomial time by standard linear programming algorithms [61].
In our experiments, we use the l1_ls optimization method [30] to solve the sparse coding problem. Once α̂ is obtained, the class-specific sparse coding error (SCE) of the ith class can be calculated using the coefficients associated with the ith class as follows:

e_i(y) = ||y − A δ_i(α̂)||_2,

where δ_i: R^N → R^N is the characteristic function that selects the coefficients associated with the ith class. SRC and most of its extensions identify a query sample by sorting all the resulting SCE scores and assigning it to the class with the least SCE score.
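To illustrate, the sketch below solves an L1-regularized least-squares problem with ISTA (iterative soft-thresholding) as a self-contained stand-in for the l1_ls solver, then computes the class-specific SCE scores. Solver choice, λ, and iteration count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def ista_lasso(A, y, lam=0.01, n_iter=500):
    """Solve min_a 0.5*||y - A a||_2^2 + lam*||a||_1 with ISTA
    (a simple stand-in for the l1 solver used in the paper)."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ a - y)             # gradient of the quadratic term
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

def sce_scores(A, y, labels, alpha):
    """Class-specific sparse coding error e_i = ||y - A * delta_i(alpha)||_2,
    where delta_i zeroes every coefficient outside class i."""
    return {c: float(np.linalg.norm(y - A @ np.where(labels == c, alpha, 0.0)))
            for c in np.unique(labels)}
```

Identification then simply assigns the query to `min(scores, key=scores.get)`.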

SRC-based verification
Rather than examining the class-specific matching scores of all classes in the dictionary, SRC-based verification only calculates the matching score of the class associated with the claimed identity and then compares it with a predetermined operating threshold to output an acceptance or rejection. Figure 2 shows the flowchart of SRC-based unimodal verification. A client claims an identity, and the corresponding sub-dictionary is used to build an overcomplete dictionary together with the sub-dictionary of the non-target subjects.
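Given a coefficient vector α̂ from any sparse solver over such a dictionary, the accept/reject decision just described takes only a few lines. The threshold value here is purely illustrative.

```python
import numpy as np

def verify_sce(A, y, alpha, client_mask, theta_sce):
    """Accept (return 1) iff the claimed client's sparse coding error
    e_c = ||y - A * delta_c(alpha)||_2 is within the operating threshold."""
    coeff_c = np.where(client_mask, alpha, 0.0)   # delta_c: keep only client coefficients
    e_c = float(np.linalg.norm(y - A @ coeff_c))
    return 1 if e_c <= theta_sce else 0
```

Here `client_mask` flags the columns of A that belong to the claimed identity; for a genuine, well-encoded probe the residual e_c is small and the claim is accepted.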
Suppose A_c = [a_{c,1}, a_{c,2}, ..., a_{c,n}] is the sub-dictionary of the claimed client and A_b = [A_1, A_2, ..., A_{K−1}] is the sub-dictionary composed of the gallery samples of the involved non-target subjects, also called the background dictionary; the overall dictionary can then be rewritten as A = [A_c, A_b]. In our experiments, unless otherwise specified, the coding dictionary A is composed of the gallery samples of all subjects in each dataset. The SCE score associated with the claimed identity can be calculated as follows:

e_c(y) = ||y − A δ_c(α̂)||_2.

The superior identification performance of SRC and its extensions has validated that SCE, as a distance measure, is a good candidate for measuring the correlation between a query sample and a specific class. Thus, it is reasonable to use SCE for verification. Considering the binary classification in the verification decision, the output is either acceptance or rejection, denoted by 1 and 0, respectively. Given an operating threshold θ_sce, the verification rule with SCE can be written as

accept (1) if e_c(y) ≤ θ_sce; reject (0) otherwise.

The sparsity concentration index (SCI) is a measure proposed along with SRC and the SCE measure [14]. SCI measures how well localized the coding coefficient vector itself is. SCI is close to 1 when the query sample is encoded using only the dictionary atoms of a single class, while it is close to 0 if the coefficients spread evenly over all the classes. We refer the reader to Ref. [14] for its detailed formulation. SCI is often used to validate whether a query sample is a valid sample from the subjects in the coding dictionary, as a measure complementary to SCE.
Essentially, SCI depends on the class that contributes the most in sparse coding, whose value is the largest sparsity contribution rate (SCR) among those of all classes in the dictionary. SCR reflects the participation degree of a specific class in representing the query sample. A larger SCR value indicates a higher possibility that the query sample belongs to that class. Therefore, SCR can also be used as a similarity measure for verification. The SCR score associated with the claimed class can be calculated as follows:

ρ_c(α̂) = ||δ_c(α̂)||_1 / ||α̂||_1.

Clearly, ρ_c(α̂) ∈ [0, 1]. The verification rule with a given threshold θ_scr can be written as

accept (1) if ρ_c(α̂) ≥ θ_scr; reject (0) otherwise.

Figure 3 demonstrates the distributions of the SCE and SCR scores obtained on the AR face dataset. As for SCE, most genuine scores distribute in [0, 0.5], while the impostor scores concentrate around 1.0. On the contrary, almost all impostor scores of SCR are close to 0, while its genuine scores spread over a wide range. Compared with SCE, the overlap between the genuine and impostor distributions of SCR is rather evident. This implies that verification based on SCE should be better, which will be demonstrated in the section "Experiments". The disadvantage of SCR possibly originates from the fact that (2) is solved by choosing α to minimize the overall coding error rather than the SCR [14]. Note that some variants of SCE and SCR are used for speaker verification in Refs. [28,29]; however, Kua et al. [29] found SCR to be the best measure in their speaker verification experiments. SCE is also selected in Refs. [28,31] for face and finger vein verification. Moreover, SCE and SCR have already been investigated in a variety of biometric identification applications [40,55,62]. SCE signals the representation ability of SRC, while SCR reflects the sparseness of the coding vector. They are more general and representative, and thus more suitable for exploring the features of SRC-based verification in this paper.
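The SCR measure and its decision rule translate directly into code; as with the SCE rule, the threshold is illustrative.

```python
import numpy as np

def scr(alpha, client_mask):
    """Sparsity contribution rate of the claimed class:
    rho_c = ||delta_c(alpha)||_1 / ||alpha||_1, a value in [0, 1]."""
    total = float(np.sum(np.abs(alpha)))
    if total == 0.0:                 # degenerate coding: no class contributes
        return 0.0
    return float(np.sum(np.abs(alpha[client_mask]))) / total

def verify_scr(alpha, client_mask, theta_scr):
    """Accept (return 1) iff the claimed class's contribution rate
    reaches the operating threshold."""
    return 1 if scr(alpha, client_mask) >= theta_scr else 0
```

Unlike SCE, SCR is a similarity measure, so acceptance requires the score to be at or above the threshold rather than below it.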

Characteristics and merits
Fig. 4 Two instances of SCE scores obtained in the face subset of MD II

In contrast to conventional one-to-one matching verification, SRC-based verification conducts one-to-many matching between the query sample and the templates of the client and non-target subjects. SRC-based verification provides a competing mechanism, i.e., sparse coding optimization, which allocates class-specific sparsity-based matching scores to all classes in the dictionary according to their correlations with the submitted data. Moreover, with the sparsity constraint, SRC generally allows only one or very few classes to get a convincing sparsity-based matching score, while the remaining classes get very inferior scores, as shown in the upper plot of Fig. 4. In other words, to be accepted, the genuine class needs to defeat almost all the non-target classes in the competing coding process. Therefore, an acceptance response made by SRC-based verification should be more convincing than one based on one-to-one matching. Overall, SRC-based verification not only examines the matching score obtained by the client, but also implicitly compares the correlations of the query data to the client and many non-target subjects, thereby offering enhanced protection for identity security. Figure 5 shows the SCE and SCR score distributions obtained on the GT face dataset. We divide the genuine scores into two groups according to whether they rank in the top 5. It is quite clear that in both the SCE and SCR cases, the top 5 genuine score distribution has a very trivial overlap with the impostor score distribution. This means that a top rank usually comes along with a favorable genuine score, and vice versa. On the contrary, the genuine scores outside the top 5 are so inferior that they are all close to the impostor distribution center. These results show that if the genuine class can defeat most of the non-target classes, the SRC-based verification system will generally accept the verification request.
Hence, the rank information among the client and non-target subjects is implicitly employed by SRC-based verification, embedded in the sparsity-based matching score. This implies that the sparsity-based matching scores obtained from competitive matching are more discriminative than the scores from one-to-one comparison, e.g., Euclidean distance and cosine similarity.
Generally, compared with non-target subjects, a genuine query sample (or feature) with good biometric quality is more similar to the client, and can hence win an eligible score in the competition, superior to the predetermined operating threshold. If the submitted biometric sample is unreliable, for example, captured from an intruder or of poor biometric quality, it is likely that none of the classes can dominate the sparse coding competition. Consequently, the coding coefficients will be spread evenly over all classes [14]. The client claimed can only get an inferior score and is rejected. This is often the case for the fake biometric traits used in spoof attacks, unless sophisticated biometric fabrication is involved. Besides the security improvement, SRC-based verification can also achieve a significant advantage in verification accuracy in both unimodal and multimodal scenarios, compared with conventional one-to-one matching verification. The experimental results will be presented in the section "Experiments".

Challenges
However, biometric sample quality often fluctuates with illumination, pose, and appearance variations [33][34][35]. Moderate data degeneration is inevitable in real-world applications. The submitted sample may be quite different from the gallery samples of the claimed client, or even somewhat similar to those of non-target subjects. Under these circumstances, the genuine client may fail to achieve a top rank in the competing sparse coding. As a result, the client will get an extremely inferior sparsity-based genuine score, as shown in the lower plot of Fig. 4. We can see in Fig. 5 that although the distribution overlap between the genuine and impostor scores is not evident, a certain number of genuine scores spread near the impostor score distribution center.
Moreover, these genuine scores are so inferior that in real-world applications, it is impossible to accept such verification requests by tuning the operating threshold without incurring a high false accept rate. Figure 6 shows the ROC curves of SRC-based unimodal verification using SCE and SCR on all unimodal datasets. In the SCE case, the methods either get very high false reject rates over a wide range of FAR variation, or their ROC curves barely drop at all after 10% FAR. In the SCR case, although the ROC curve on the AR face dataset finally converges to 0% FRR, the corresponding 44% FAR is unacceptable in real applications. We call the phenomenon that the ROC curve cannot converge to 0% FRR, or has a very long and flat tail along the FAR axis, the FRR bottleneck problem. An evident FRR bottleneck problem will degrade user experience, and may even make SRC-based verification unacceptable.
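The FAR/FRR trade-off and EER quantities discussed here can be computed directly from genuine and impostor score arrays; a minimal sketch (assuming similarity-type scores where larger means more genuine — for SCE-type distance scores the comparisons would flip; variable names are illustrative):

```python
import numpy as np

def far_frr(genuine, impostor, threshold):
    """FAR: fraction of impostors accepted; FRR: fraction of genuine users
    rejected. Scores are similarities: accept iff score >= threshold."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    far = np.mean(impostor >= threshold)
    frr = np.mean(genuine < threshold)
    return far, frr

def eer(genuine, impostor):
    """Scan candidate thresholds and return the operating point where
    FAR and FRR are closest (the equal error rate)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = min(thresholds,
               key=lambda t: abs(np.subtract(*far_frr(genuine, impostor, t))))
    f, r = far_frr(genuine, impostor, best)
    return (f + r) / 2.0

genuine = np.array([0.9, 0.8, 0.85, 0.3])   # one very poor genuine score
impostor = np.array([0.1, 0.2, 0.15, 0.4])
print(eer(genuine, impostor))  # 0.25
```

The one very poor genuine score in the toy arrays mimics the long-tail behavior behind the FRR bottleneck problem: no threshold can accept it without also accepting impostors.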
Due to the involvement of non-target subjects, SRC-based verification meets another challenge, namely computational cost. Suppose the atom dimensionality M in A ∈ R^(M×N) is fixed; the complexity of sparse coding optimization then depends on the number of atoms N. For example, the empirical complexity of the commonly used l1_ls optimization method for solving (2) is O(N^v) with v ≈ 1.5 [15,61]. In applications with large-scale user populations, if the gallery samples of all enrolled clients are used to build the coding dictionary, the computational cost would be prohibitively expensive.

Fig. 6 The ROC curves of SRC-based unimodal verification with SCE and SCR when using DCT features
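Under the empirical O(N^v) cost model above (v ≈ 1.5), shrinking a dictionary from N atoms to n atoms gives a rough speedup of (N/n)^v. A quick back-of-the-envelope check (the subject counts and samples-per-subject below are illustrative, not from the paper):

```python
# Rough sparse-coding speedup from shrinking the dictionary,
# using the empirical O(N^v) complexity with v ~ 1.5.
def speedup(n_full, n_small, v=1.5):
    return (n_full / n_small) ** v

# e.g. keeping 50 of 500 enrolled subjects, 4 gallery samples per subject
print(round(speedup(500 * 4, 50 * 4), 1))  # 31.6 -> ~31.6x fewer operations
```

This superlinear growth is why the later "small random dictionary" strategy pays off more than the raw 10x reduction in atoms would suggest.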
Furthermore, it is well acknowledged that increasing the dictionary scale usually leads to accuracy degradation for SRC-based identification [36,37,39]. Likewise, SRC-based verification may confront the same challenge in applications with a large user scale. The more non-target subjects involved in sparse coding, the higher the possibility of increasing the distribution overlap in the feature subspace between the sub-dictionaries of the genuine class and the non-target classes, i.e., A_c and A_b.
As illustrated in Fig. 7a, the convex hull spanned by biometric samples is only an extremely tiny portion of the unit ball [63]. In the overall convex hull, the distribution interval among classes is very small, and many classes may overlap their neighbors to some degree. In reality, appearance variations, pose, and alignment errors will further aggravate the distribution overlaps. Accordingly, putting more classes in the dictionary may introduce more distribution overlaps. The inter-class separability between the genuine class and the overall non-target classes plays a critical role in recognition [20]. In the section "Experiments", we empirically confirm the evident correlation between inter-class separability and SRC-based verification accuracy.
Overall, SRC-based verification has to contend with the following challenges: (1) once the genuine class fails to get a top rank in encoding the query data, an extremely inferior genuine score is very likely to result; consequently, SRC-based verification may suffer from an evident FRR bottleneck problem. If this problem cannot be resolved or avoided properly, SRC-based verification may not be suitable for some biometrics and application scenarios that require both very low FRR and FAR. (2) Owing to the involvement of non-target subjects, the larger the scale of the coding dictionary used, the more likely it is to degrade verification efficiency and accuracy.
One may also notice that SRC techniques generally require multiple training samples per class for dictionary construction [14,28]. This requirement is rather harsh and even impractical for some identification tasks. However, SRC-based verification is designed for positive recognition scenarios where user cooperation is generally available [1]. A representative set of biometric samples per user can be captured in the registration phase. If the gallery samples collected are not sufficient, there are still many ways to generate simulated biometric samples for users based on their enrolled data [64][65][66]. The 3D face models [64] or generative adversarial network (GAN) models [65,66] learned from gallery samples can be used to generate samples with variants like pose and illumination. Dictionary augmentation skills like supplementing an intra-class variation dictionary can also help alleviate this problem [18,20]. Besides, for the non-target subjects, SRC-based verification does not require gallery samples of equivalent number and representativeness.

Small random dictionary
In application scenarios with a large user scale, using the gallery samples of all enrolled subjects to build the coding dictionary seems bound to degrade verification accuracy while significantly increasing computational cost. In this subsection, we first clarify that a large number of non-target subjects is unnecessary for dictionary construction. Then, we propose a straightforward but effective strategy to shrink the dictionary via cluster analysis and random selection.
The non-target subjects used in the coding dictionary play a critical role in SRC-based verification. Security improvement is achieved through the one-to-many competitive matching among the client and non-target subjects. The more non-target subjects that engage in the competition, the more reliable an "acceptance" decision becomes. However, involving more non-target subjects also increases the intensity of competition, thereby increasing the likelihood of falsely rejecting genuine clients, while inevitably leading to a higher computational burden. From these viewpoints, an excess number of non-target subjects is not only unnecessary, but can also cause negative effects.
To explicitly illustrate the risk of using excess non-target subjects for dictionary construction, we plot two toy examples in Fig. 7. Suppose there are 6 subjects, 4 training samples per class, and the feature dimension is 2; the convex hull is depicted in Fig. 7b. Consider a K = 6 sparse L0-norm sparse coding problem, where the percentage of the K nonzero entries obtained by a class reflects the probability of the query sample belonging to that class, similar to the SCR measure in L1-norm optimization. Given a query sample of Class 1 on its distribution boundary, if we randomly select half of the classes and use all their training samples as dictionary atoms, 3 training samples of Class 1 could be used to represent the query sample, as illustrated in Fig. 7c. In this case, the genuine score is 3/K = 0.5. On the other hand, if the training samples of all the classes are used for coding, the genuine score may become smaller, e.g., 2/K ≈ 0.33, as shown in Fig. 7b. This score is much lower than the 0.5 achieved in Fig. 7c. This might not be a problem in closed-set identification, since Class 1 still gets the top rank. In verification, however, such a low score tends to cause a false rejection when compared with a predefined operating threshold.
Considering the above observations and analyses, it is inadvisable to use a large number of biometric samples of the enrolled subjects to build a coding dictionary for verification. Here, we consider a simple dictionary shrinkage strategy via cluster analysis and random selection. The basic idea is to first conduct cluster analysis on the training samples of all enrolled subjects, and then randomly select a few subjects from each cluster and use their training samples to construct an overcomplete dictionary with a limited number of classes. Cluster analysis is applied to avoid the worst case where all the selected subjects are concentrated in a tiny region of the feature space and their distributions overlap heavily. To differentiate it from the full dictionary with all the enrolled subjects, we call such a dictionary a small random dictionary.
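The shrinkage strategy above can be sketched as follows. This is a minimal illustration assuming each subject is represented by a mean feature vector clustered with a plain k-means; the helper names, cluster count, and per-cluster quota are our assumptions, not the paper's:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means on the rows of X; returns a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):              # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def small_random_dictionary(subject_means, n_clusters, n_per_cluster, seed=0):
    """Cluster subjects by their mean feature, then randomly pick a few
    subjects per cluster; returns the indices of the selected subjects."""
    rng = np.random.default_rng(seed)
    labels = kmeans(subject_means, n_clusters, seed=seed)
    selected = []
    for j in range(n_clusters):
        members = np.flatnonzero(labels == j)
        take = min(n_per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False))
    return sorted(int(i) for i in selected)

# Toy example: 100 subjects with 8-D mean features -> keep ~50 non-target subjects
means = np.random.default_rng(1).normal(size=(100, 8))
subset = small_random_dictionary(means, n_clusters=10, n_per_cluster=5)
print(len(subset))  # at most 50; fewer if some cluster has < 5 members
```

Per-cluster sampling enforces spread across the feature space, which is exactly the failure mode the cluster-analysis step is meant to avoid.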
In our experiments, all the query samples are used for testing, including those of the subjects selected for dictionary construction. The coding dictionaries for the selected and unselected classes are not the same but equivalent, that is, they share the same A_b. For example, 50 subjects are selected in the PolyU dataset. In the testing phase, the query samples of these classes are encoded based on the dictionary built with them. When the query samples are of the remaining classes, one of the 50 subjects is replaced by the class claimed to construct a dynamic coding dictionary. Hence, the resulting sparsity-based matching scores should be compatible. By conducting experiments with different small random dictionaries of the same scale on unimodal and multimodal datasets, we find that they consume much less time and bring better or comparable verification accuracy compared with the full dictionary. This evidence supports that shrinking the dictionary by selecting non-target classes and training samples is a feasible way to avoid the heavy computational burden and recognition accuracy degradation on large-scale datasets.
The benefits of the small random dictionary are summarized as follows. First and foremost, a smaller dictionary scale means lower sparse coding complexity and thus makes SRC-based verification more efficient. Second, compared with a large-scale dictionary, a smaller-scale dictionary is less likely to introduce distribution overlaps among classes. Besides, the strategy of using a small random dictionary is simple and requires no complicated training or preprocessing.

Multimodal verification
According to the analyses above, SRC-based verification is rather suitable for application scenarios where good inter-class separability is available. It is well acknowledged that classes are better separated in a multimodal feature space than in a unimodal feature space [1,3,4,33,34,67]. Besides, it is much more difficult to spoof multiple biometric modalities than a single modality [4]. In this paper, we study SRC-based multimodal verification with the combinations of face and ear, and face and palmprint, using the Sum fusion rule.
The ear is located near the face, and can be captured along with the face using the same type of sensor, or by a single sensor at two times. Face detection can also help speed up ear detection by offering an ear region of interest. Most popular face feature extraction and classification techniques are applicable to the ear and palmprint. Recognition systems using the ear and palmprint are also contactless. Besides, the ear has several appealing merits over the face: it has a stable structure with rich information, nearly unaffected by aging and expressions [68,69]. Although there is a common impression that human ears are usually occluded by hair, this can be avoided via user cooperation in the verification scenario. The studies in Refs. [33,34] have already validated that multimodal identification with face and ear can significantly improve recognition accuracy and robustness. Compared with the face and ear, it is much harder to steal a person's palmprint. Suppose A^f = [A^f_c, A^f_b] and A^e = [A^e_c, A^e_b] are, respectively, the face and ear coding dictionaries. The SCE and SCR matching scores of the face and the ear can be calculated using (4) and (6), respectively. As shown in Fig. 8, the proposed SRC-based multimodal verification system first performs two independent sparse coding procedures, and then integrates the derived sparsity-based matching scores. Since the SCE or SCR scores of the face and the ear have similar distributions, we directly combine them without score normalization, which empirically brings no improvement in our experiments.
For convenience, let s_f and s_e be the sparsity-based matching scores of the two modalities. The multimodal matching score with Sum fusion is calculated as s = s_f + s_e. The multimodal verification system makes its decision with a rule similar to (5) or (7), according to the measure used. Figure 9 plots the distributions of the multimodal SCE and SCR scores obtained on the MD I dataset. The overlap between the genuine and impostor distributions is trivial in both cases. In particular, both categories of the multimodal SCE scores show evident distribution centers that are far apart from each other. This implies good robustness of the multimodal verification system. The detailed experimental results will be given in the section "Experiments".
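A minimal sketch of the Sum-rule decision above (assuming SCR-type similarity scores where larger is better; for SCE-type distance scores the comparison would flip):

```python
def sum_fusion_verify(s_f, s_e, theta):
    """Sum-rule fusion of two modality scores, then a threshold decision.
    Scores are assumed directly comparable, so no normalization is applied,
    mirroring the no-normalization choice in the text."""
    s = s_f + s_e
    return s >= theta

print(sum_fusion_verify(0.6, 0.7, theta=1.0))  # True: fused score ~1.3
print(sum_fusion_verify(0.3, 0.2, theta=1.0))  # False: both modalities weak
```

Note that a single strong modality cannot outvote two weak ones here; both sparsity-based scores contribute equally to the decision.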

Settings
For convenience, we denote the SRC-based verification methods with SCE and SCR as SRC_sce and SRC_scr, respectively. The l1_ls optimization [61] is used to solve the sparse coding problem. The unimodal and multimodal (Sum fusion [70]) verification methods based on one-to-one matching and cosine similarity are used as the unimodal and multimodal baselines. The multimodal SRC_sce and SRC_scr are also compared with the well-known multimodal methods LLR [38] and SVM [71]. The multimodal SVM method fuses the matching scores of all the modalities in a stacked way and uses the RBF kernel with a sigma of 0.25. These methods also use cosine similarity scores and are evaluated with tenfold cross-validation.
In the experiments using ArcFace-based CNN features, the publicly available pretrained ResNet 50 model (https://github.com/luckycallor/InsightFace-tensorflow) is finetuned for each modality. This model was trained on the MS1M-Arcface dataset (https://github.com/deepinsight/insightface/wiki/Dataset-Zoo) with the ArcFace loss. Note that we revise the network output to be a 200-D feature embedding; hence, we use a third-party dataset and some gallery samples of our datasets to separately finetune the networks. We use 2 gallery samples/subject of AR, 3 gallery samples/subject of GT, and a small subset of CASIA-WebFace to finetune the face model. The ear network is finetuned with 4 gallery samples/subject of USTB III, plus 3352 ear samples of 300 subjects collected from college students. The palmprint network is finetuned with 3000 gallery samples of the remaining 300 subjects of the PolyU 2D&3D palmprint dataset. In finetuning, the batch size is 32, the initial learning rate is 0.001, the weight decay is 0.0005, and the momentum is 0.9. We train each model on one NVIDIA RTX 2080ti GPU card.

Fig. 9 Distributions of multimodal scores of SCE and SCR when using DCT features on MD I
For each type of sparsity-based matching measure, according to (4) and (6), given K classes in the coding dictionary, we obtain one genuine score and K − 1 impostor scores per probe sample. For the SRC-based multimodal verification, we get 4400 genuine scores and 4400 × (50 − 1) = 215,600 impostor scores on MD I, 6083 genuine scores and 6083 × (79 − 1) = 474,474 impostor scores on MD II, and 7000 genuine scores and 7000 × (100 − 1) = 693,000 impostor scores on MD III. As for the competing methods, we empirically select the best matching score from the comparisons of a probe sample with all training samples of a class; hence, the same numbers of genuine and impostor scores are available. All the experiments except CNN feature extraction are conducted on the Matlab platform on a desktop with a 3.3 GHz CPU and 64 GB RAM.
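The genuine/impostor score counts above follow directly from "one genuine and K − 1 impostor scores per probe"; a quick check (probe and class counts taken from the text):

```python
def score_counts(n_probes, n_classes):
    """Each probe yields 1 genuine score and (n_classes - 1) impostor scores."""
    return n_probes, n_probes * (n_classes - 1)

for name, probes, classes in [("MD I", 4400, 50),
                              ("MD II", 6083, 79),
                              ("MD III", 7000, 100)]:
    g, i = score_counts(probes, classes)
    print(name, g, i)
# MD I 4400 215600
# MD II 6083 474474
# MD III 7000 693000
```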

Unimodal verification
Tables 3 and 4 report all the unimodal and multimodal verification results in terms of EER. In the unimodal experiments using DCT features, compared with the baseline, both SRC_sce and SRC_scr obtain about a 14% EER decrease on all three face datasets, while they reduce the EER by roughly 10% and 5% in the ear and palmprint cases, respectively. The unimodal SRC_sce performs much better than SRC_scr on all the datasets except PolyU. Impressively, both the unimodal SRC_sce and SRC_scr evidently outperform all the multimodal methods using cosine similarity.
When using CNN features for verification, as shown in Table 4, all the methods get significant improvements. On the AR dataset, the unimodal baseline gets an EER less than 1/11 of what it gets when using DCT features. The advantage of SRC-based verification over the baseline method is also significant: the EER rates of the baseline method are about 4.4 to 8.9 times those of SRC_sce on the same datasets. Moreover, the unimodal SRC_sce consistently outperforms the unimodal SRC_scr.
Recall that all the unimodal SRC-based verification methods confront an evident FRR bottleneck problem when using DCT features, as shown in Fig. 6. On the contrary, as the ROC curves in Fig. 10 show, the unimodal SRC_sce using CNN features does not meet this problem on any dataset. Although the FRR bottleneck problem still haunts the unimodal SRC_scr on the GT dataset, it becomes rather trivial compared with that in the DCT experiments. This result implies that more discriminative features used for SRC-based verification can alleviate and even avoid the FRR bottleneck problem.

Multimodal verification
In the right columns of Tables 3 and 4, the multimodal SRC_sce and SRC_scr are compared with their multimodal competitors. In the experiments with DCT features, the multimodal SRC_sce gets the best EER results of 0.545%, 0.195%, and 0.125% on MD I, MD II, and MD III, respectively, while the best results obtained by the conventional multimodal methods are only 6.034%, 6.44%, and 3.12%. The multimodal SRC_scr also performs significantly better than the conventional methods on all datasets, though it does not match the multimodal SRC_sce. The comparison of ROC curves in Fig. 11 visually demonstrates their significant superiority to LLR and SVM. Note that we plot the ROC curves of the SRC-based methods and of the cosine-similarity methods in separate subplots to show more details. We would also like to mention that the inferior performance of LLR and SVM reflects the challenges posed by expression, illumination, and pose variations in the face, ear, and palmprint samples.
When using CNN features, we can see from Table 4 that SRC-based multimodal verification achieves extraordinary improvements. Both SRC_sce and SRC_scr obtain promising EER results ranging from 4.33e−4% to 1.48e−3% on MD II and MD III. Their ROC curves overlap almost completely, as shown in Fig. 12. Note that the best EER results of LLR and SVM are as high as 0.136% and 0.063% on these two datasets. We do not see the FRR bottleneck phenomenon in the ROC curves of SRC_sce and SRC_scr in Fig. 12.
Overall, compared with the unimodal methods, all multimodal methods get significant improvements in the experiments with both DCT and ArcFace-based CNN features. Therefore, it is validated that the proposed SRC-based multimodal methods significantly outperform their unimodal counterparts and the well-known conventional methods.

Correlation with inter-class separability
Wright et al. attributed the success of SRC to its ability to better exploit the actual (possibly multimodal and non-linear) distributions of the training samples of each class, making it likely to be more discriminative among multiple classes [14]. Note that the biometric quality of query samples and inter-class separability are the two major factors that affect biometric recognition performance. As shown in Fig. 1, the biometric quality of the face, ear, and palmprint probe samples used is roughly comparable. The performance of SRC-based identification on each dataset can reflect, to some extent, the inter-class separability of the samples in the coding dictionary [20]. Recall that SRC-based verification and identification share the same comparison mechanism. Therefore, we use the commonly used rank-1 recognition rate and the overall Cumulative Match Characteristic (CMC) curve of SRC-based identification as inter-class separability indicators. Note that the SRC-based multimodal identification method evaluated here uses the SCE measure and Sum fusion, the same as the proposed multimodal SRC_sce verification.

Tables 5 and 6 report the EER results and the corresponding rank-1 recognition rates when using DCT or CNN features. We sort the datasets by identification accuracy in ascending order, so the datasets on the right side should have better inter-class separability in terms of rank-1 accuracy. In the DCT experiments, SRC_sce always obtains lower EER results on the datasets to the right, except on USTB III (50), as shown in Table 5. However, looking at the ROC curves in Fig. 6, we can see that the overall verification performance of SRC_sce on USTB III (50) is better than that on USTB III. Therefore, when using DCT features, SRC_sce achieves better verification performance on the datasets where SRC gets better identification results.
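The rank-1 rate and CMC curve used here as separability indicators can be computed from a probe-versus-class score matrix; a minimal sketch (assuming rows are probes, columns are gallery classes, and a higher score means more similar; names and toy values are illustrative):

```python
import numpy as np

def cmc_curve(scores, true_labels):
    """scores: (n_probes, n_classes) similarity matrix.
    Returns the CMC curve: fraction of probes whose true class appears in
    the top-r ranks, for r = 1..n_classes. cmc[0] is the rank-1 rate."""
    order = np.argsort(-scores, axis=1)   # classes sorted best-first per probe
    ranks = np.argmax(order == np.asarray(true_labels)[:, None], axis=1)
    n_classes = scores.shape[1]
    return np.array([(ranks < r).mean() for r in range(1, n_classes + 1)])

scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.5, 0.4],
                   [0.2, 0.6, 0.1]])
true = [0, 1, 0]              # third probe's true class only ranks 2nd
cmc = cmc_curve(scores, true)
print(cmc)  # rank-1 = 2/3; all probes identified by rank 2
```

How fast this curve climbs to 100% is what the text uses to compare the inter-class separability of the datasets.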
We get the same result when using CNN features, as shown in Table 6. One may notice that SRC gets better rank-1 accuracy on the GT dataset, but SRC_sce obtains a slightly worse EER result than that obtained on the AR dataset. However, looking at the CMC curves in Fig. 13, one can see that the curve on the AR dataset converges to 100% accuracy faster than the curve on the GT dataset. In other words, the overall identification performance of SRC on the AR dataset is better than that on the GT dataset. It should also be noted that although SRC_sce achieves a slightly worse EER (0.364%) on USTB III (50) than the 0.361% EER on USTB III, the ROC curves in Fig. 10 show that it gets better verification performance on the former subset.
As for the SRC-based multimodal identification, when using DCT features all the CMC curves ascend rapidly to 100% within the first few ranks, while in the experiments with CNN features, SRC achieves 100% rank-1 accuracy on all multimodal datasets. This implies favorable inter-class separability on each multimodal dataset. Correspondingly, we can see that the ROC curves of the multimodal SRC_sce converge to 0% FRR rapidly. As for the comparison between the DCT and CNN experiments, we can see from Fig. 13 that when using CNN features, all the unimodal and multimodal samples are identified correctly by rank 6 on all datasets, while with DCT features there are still many unimodal probe samples that cannot be identified even after rank 6. This indicates again that the CNN feature space has much better inter-class separability, which is in line with the existing literature. Correspondingly, the ROC curves obtained by SRC-based unimodal verification with DCT features decline slowly and cannot converge to 0% FRR, showing an evident FRR bottleneck problem, whereas SRC-based verification using CNN features achieves significant superiority in both unimodal and multimodal scenarios and on all the datasets.
The above experimental results demonstrate that SRC-based verification achieves better performance in scenarios where SRC achieves better identification performance. Apparently, there is a positive correlation between the performance of SRC-based verification and the inter-class separability of the samples in the coding dictionary. This characteristic can serve as a guideline for SRC-based verification applications: SRC-based verification may not be suitable for biometrics or scenarios with inferior inter-class separability. In our study, compared with face, palmprint, and ear unimodal verification, their multimodal combinations are more recommendable for SRC-based verification.

Small random dictionary
We evaluate the effectiveness and efficiency of the small random dictionary on MD II and MD III using the unimodal and multimodal SRC_sce with both DCT and CNN features. On each dataset, we test 10 small random dictionaries with 50 subjects. Figure 14 uses boxplots to illustrate the EER distributions of SRC_sce with small random dictionaries; a red diamond marker denotes the EER result of SRC_sce using the full dictionary with all subjects on the same dataset. We can see from Fig. 14 that in each case, all the small random dictionaries achieve similar EER results with a small variance. In both the unimodal and multimodal experiments with DCT features, most small random dictionaries are better than or comparable to the full dictionary, and the remainder are slightly worse. When using CNN features, the superiority of small random dictionaries over the full dictionary is much more evident. Table 7 reports the average running time per sample of SRC_sce unimodal verification on the AR and PolyU datasets. We can see that verification using a small random dictionary is much more efficient than using the full dictionary. Note that the time consumption of our multimodal verification using Sum fusion is about twice that of unimodal verification.
In this series of experiments, we can see that compared with the full dictionary, SRC-based verification using small random dictionaries generally achieves better or comparable verification results. Considering the large user scale in real-world applications, SRC-based verification using a dictionary with all the enrolled subjects will inevitably be subject to accuracy degradation and a heavy or even unaffordable computational cost. Therefore, shrinking the big dictionary to a certain scale is indispensable in large-scale user applications.

Conclusions and future directions
In this paper, we have first given an insight into SRC-based biometric verification by studying two sparsity-based matching measures on three biometric traits and their multimodal combinations, using both handcrafted and deep CNN features. The sparse coding in SRC-based verification can be seen as a one-to-many competitive matching process in which the client has to compete with non-target subjects for a convincing sparsity-based matching score. Essentially, SRC-based verification not only examines the matching score obtained by the client but also implicitly compares the correlations of the query data with a limited number of non-target subjects, and thereby offers enhanced protection for identity security. Extensive experimental results demonstrate that in both unimodal and multimodal scenarios, SRC-based verification achieves overwhelming superiority over many well-known methods based on one-to-one matching and cosine similarity, especially when using multimodal fusion and CNN features.
The foremost concern about SRC-based verification is that if the genuine class fails to get a top rank in encoding the query data when data degeneration occurs, an extremely inferior genuine score is very likely to result, and consequently the ROC curve exhibits a long-tail effect. We call this effect the FRR bottleneck problem. If it cannot be resolved or avoided properly, SRC-based verification may not be suitable for some biometrics and application scenarios requiring a very low FRR and high user acceptability. This problem is particularly prominent in unimodal verification using DCT features. In contrast, we did not observe this effect on any of the three multimodal datasets when using advanced deep CNN features and multimodal combinations.
We also found a strong correlation between the performance of SRC-based verification and the inter-class separability among the classes in the coding dictionary. Hence, whether the coding dictionary has a well-separated feature distribution is critical. SRC-based verification is well suited to application scenarios where favorable inter-class separability is available, such as multimodal biometrics and discriminative deep learning features. On the other hand, SRC-based verification may not be suitable for all biometric applications; one can assess its feasibility according to existing relevant SRC-based identification studies.
Another major challenge lies in that, owing to the utilization of non-target subjects, a large-scale coding dictionary will definitely bring a huge computational burden, and is also likely to degrade verification accuracy. In large-scale user applications, we suggest selecting a suitably small subset of non-target subjects with a well-separated feature distribution. In our experiments, a simple dictionary shrinkage strategy based on cluster analysis and random selection of non-target subjects can generally improve verification accuracy while maintaining high efficiency.
The introduction of non-target subjects may also raise a concern about increasing the number of vulnerabilities that can be exploited by intruders. One may worry that intruders could deceive an SRC-based verification system using biometric traits stolen from non-target subjects. We would like to emphasize again that the class-specific sparsity-based matching score used for verification comparison is the one associated with the identity claimed. According to the characteristics of SRC, the corresponding class is very likely to be assigned an inferior score and thus be correctly rejected.
On the other hand, if the non-target subjects never exist, for example, when using virtual individuals created by GAN models, there will be no chance of stealing except by breaking into the system database.
In the future, more efforts can be made on the following aspects:

• Sparsity-based matching measures. The existing sparsity-based matching measures use either coding coefficients or reconstruction residues. However, in practice, the L1-norm optimization often needs to make a compromise between reconstruction fidelity and sparseness, especially when the input data has low biometric quality. A promising direction is to integrate the discriminative cues in both coding coefficients and reconstruction residues to design more robust sparsity-based matching measures.

• Multimodal verification. Recent studies have shown that even the SOTA deep learning-based face verification approaches are highly vulnerable to some low-level print and replay presentation attacks [10,11]. Many more unpredictable advanced attacks will emerge in the near future. In view of the fact that people's face images can be easily obtained or stolen, it seems very difficult to eliminate the vulnerability of unimodal face recognition to presentation attacks. On the other hand, it will be much more difficult to fool a multimodal system using the face and other biometric traits that are harder to steal and can also be recognized contactlessly. Our study in this paper suggests that SRC-based multimodal verification using deep learning features can achieve high accuracy while avoiding some of these shortcomings. We believe that the combination of deep learning features, multiple biometrics, and SRC classification techniques can bring about a good trade-off between verification accuracy and security.

• Large-scale datasets. One major challenge at the moment is that there are no suitable large-scale datasets available for SRC-based verification research. Note that SRC requires sufficient well-controlled training samples per user if their samples or features are directly used to build the overcomplete dictionary. However, most of the publicly available large-scale datasets are collected in unconstrained environments. It is also cumbersome and expensive to collect a large-scale dataset with sufficiently well-controlled samples per user without industry support. A more efficient way is to select suitable data from existing large-scale datasets. Another alternative is to use data generation techniques like GANs and 3D face models to produce multiple simulated samples with variants like pose and illumination based on users' enrolled data. Dictionary augmentation skills like supplementing an intra-class variation dictionary [18][19][20][21][22][23] can also help alleviate the under-sampled problem, and thus reduce the barriers to using SRC-based verification on large-scale datasets.

Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.