Abstract
Deep feature learning has become crucial in large-scale face recognition, and margin-based loss functions have demonstrated impressive success in this field. These methods aim to enhance the discriminative power of the softmax loss by increasing the feature margin between different classes. They assume class balance, where a fixed margin is sufficient to squeeze intra-class variation equally. However, real-world face datasets often exhibit imbalanced classes, where a fixed margin is suboptimal, limiting the discriminative power and generalizability of the face recognition model. Furthermore, margin-based approaches typically focus on enhancing discrimination in either the angle or the cosine space, emphasizing one boundary while disregarding the other. To overcome these limitations, we propose a joint adaptive margins loss function (JAMsFace) that learns class-related margins for both angular and cosine spaces, so that the margin penalties adapt to each class. We explain and analyze the proposed JAMsFace geometrically and present comprehensive experiments on multiple face recognition benchmarks. The results show that JAMsFace outperforms existing face recognition losses in mainstream face recognition tasks. Specifically, JAMsFace advances the state-of-the-art face recognition performance on LFW, CPLFW, and CFP-FP and achieves comparable results on CALFW and AgeDB-30. Furthermore, for the challenging IJB-B and IJB-C benchmarks, JAMsFace achieves true acceptance rates (TARs) of 89.09% and 91.81% at a false acceptance rate (FAR) of 1e-4, respectively.
1 Introduction
Deep learning techniques, specifically deep convolutional neural networks (DCNNs), have greatly improved performance in various fields, such as natural language processing, voice recognition, and computer vision. In particular, these techniques have greatly improved the performance of face-related tasks, including face recognition [28]. Recent advancements in face recognition have focused on leveraging deep learning methods to extract high-dimensional feature vectors for improved performance. These methods enable automatic learning of discriminative features for face recognition tasks, resulting in the use of facial recognition systems in various applications such as security, surveillance, human–robot interaction, and intelligent productive teaming [12, 14, 15, 17]. Additionally, thermal face recognition is a crucial application of face recognition [29, 31], operating effectively in low-light conditions and capable of identifying individuals even if they are wearing masks or other facial coverings [30].
Loss functions play a pivotal role in the learning optimization of DCNNs. One of the most widely used loss functions is the traditional softmax loss, which fails to produce highly discriminative feature vectors [5, 22, 35]. To overcome this issue, various loss functions [1, 5, 16, 20, 22, 35, 37] have been proposed to promote the feature vectors’ intra-class compactness and inter-class separability to enhance face recognition accuracy and generalization.
Current state-of-the-art (SOTA) face recognition methods mostly rely on margin-based softmax loss, which improves feature discrimination by adding a margin to each identity. These methods, such as SphereFace [22], CosFace [35], and ArcFace [5, 6], add the multiplicative angular margin, additive cosine margin, and additive angular margin, respectively, to increase class separation. These methods assume that all classes have enough samples to describe their distributions accurately, and thus using a constant margin for all classes would be sufficient. However, this assumption does not hold for public face datasets, which are often highly unbalanced. As shown in Fig. 1, the space spanned by existing training samples can accurately represent the real distribution for classes with many samples (rich classes). However, for classes with few samples (poor classes), the space spanned by existing training samples may only be a small part of the real distribution. Therefore, it can be concluded that a fixed margin may not be appropriate for classes with varying sample distributions, and an adaptive margin would be more suitable. This is because the adaptive margin can adjust the margin value according to the sample distribution of each class, and better constrain the intra-class variation of the class with few samples [20]. Furthermore, these margin-based approaches only enhance discrimination in the angle or cosine space, emphasizing one boundary while disregarding the other, which can lead to suboptimal performance. In this paper, we introduce a new loss function called joint adaptive margins loss (JAMsFace), which dynamically sets additive margins based on the class distribution and improves discrimination in both angle and cosine spaces.
Our major contributions are as follows:
- We present a joint adaptive angular and cosine margins loss (JAMsFace) that dynamically adjusts the decision boundary to learn a more compact and precise face feature representation.
- JAMsFace allows for more effective differentiation between features by considering both angular and cosine spaces, which improves the overall performance of face recognition.
- We conduct comprehensive experiments on multiple face recognition benchmark datasets. The experimental results show that the proposed JAMsFace advances the state-of-the-art face recognition performance on five out of seven mainstream benchmarks.
2 Related work
Loss functions are central to learning discriminative features when training face recognition models, and choosing an appropriate loss function can significantly enhance recognition performance. Loss functions can be divided into two categories: metric-based and margin-based [36]. Metric-based techniques learn an embedding in which the distance between two faces reflects their similarity, while margin-based techniques enlarge the decision margin between classes.
2.1 Metric-based methods
In early works, using metric-based approaches was common practice. These methods aim to learn a similarity measure for a set of images using a deep metric learning network [7]. They strive to make visually similar images closer together in an embedding manifold, while pushing visually dissimilar images further apart. Two types of metric-based losses have been developed for different setups: the contrastive loss and the triplet loss. The contrastive loss [3] trains a network to predict whether pairs of samples belong to the same class or not by minimizing the embedding distance for positive pairs and maximizing the distance for negative pairs. This approach is also known as distance metric learning.
In contrast to the contrastive loss, the triplet loss uses triplet samples (anchor, positive, and negative). It was first used for training in FaceNet [26]. The goal is to minimize the distance between the anchor and positive pairs and to maximize the distance between the anchor and negative pairs.
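As a rough illustration of the two metric-based losses described above, the following sketch evaluates a contrastive loss for one pair and a triplet loss for one (anchor, positive, negative) triple on plain Python lists. The squared-hinge form, the use of squared distances in the triplet loss, and the default margin values are assumptions chosen for illustration; they are not settings taken from [3] or [26].

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, same_identity, margin=1.0):
    """Pull positive pairs together; push negative pairs at least `margin` apart."""
    d = euclidean(u, v)
    if same_identity:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(a, p)^2 - d(a, n)^2 + margin (squared distances are an assumption)."""
    d_ap = euclidean(anchor, positive) ** 2
    d_an = euclidean(anchor, negative) ** 2
    return max(0.0, d_ap - d_an + margin)
```

Both losses are zero once the pair or triple already satisfies its margin, which is one reason why most randomly drawn samples contribute no gradient late in training.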
However, using contrastive and triplet losses for deep representation learning can result in slow convergence and instability, as the embeddings are optimized only against one negative class during each update [4, 33, 37]. To address these issues, Sohn et al. [33] proposed \((N+1)\)-tuplet loss, which improves training convergence by selecting a positive sample from among \(N-1\) negative samples and considering the distance between the anchor and negative samples. However, the number of samples in each batch increases in a quadratic fashion, leading to an explosion in the sample space. Wen et al. [37] attempted to solve this problem by introducing center loss, which simultaneously learns the center for each class and penalizes the distances between the features and their corresponding class centers. Despite these efforts, center loss and similar methods are ineffective in tackling the open set problem in face recognition [4].
2.2 Margin-based methods
Several margin-based softmax loss functions have recently been proposed [1, 5, 11, 13, 20, 21, 22, 25, 35]. These methods create a decision boundary in the angular space by incorporating additional constraints to the target logit, making the learned features more discriminative than those generated by deep metric learning methods.
Liu et al. [21] proposed a large-margin softmax loss (L-softmax) that uses a piecewise function to ensure the monotonicity of the cosine function and explicitly promotes intra-class compactness and inter-class separability between the learned features. Liu et al. [22] extended L-softmax with SphereFace, which normalizes the weight matrix of the last fully connected layer and discriminatively spans the learned features on a hypersphere manifold. However, SphereFace is challenging to train as it employs a multiplicative angular penalty margin between the deep features and their corresponding weights.
To tackle this issue, CosFace [35] converts face classification to a measure based on the cosine distance by introducing an additive cosine margin on the cosine angle between the deep features and their corresponding weights. However, finding a suitable additive cosine margin that achieves both intra-class compactness and inter-class separability can be challenging. Deng et al. [5] proposed ArcFace, which improves the geometric interpretation of the margin and achieves better performance by introducing an additive angular margin, while He et al. [9] point out that the intra- and inter-class objectives in softmax are entangled; therefore, a well-optimized inter-class objective leads to relaxation of the intra-class objective and vice versa.
AdaptiveFace [20] and Dyn-arcFace [13] address the issue of unbalanced data by learning an adaptive margin for each class, while CurricularFace [11] adaptively adjusts the relative importance of easy and hard samples during different training stages by using curriculum learning with the loss function. On the other hand, KappaFace [25] adaptively modulates the positive margins based on class imbalance and difficulty. Among the various methods, AdaFace [18] considers the image quality during training and introduces a quality adaptive margin by estimating the image quality through feature norms.
In conclusion, margin-based softmax loss functions have significantly improved the performance of deep face recognition. As a result, they have become more commonly used in face recognition tasks because of their ability to effectively reduce intra-class variability and inter-class similarity, leading to better accuracy and robustness of face recognition models.
In this paper, a new loss function called JAMsFace for face recognition is proposed and validated on several current face recognition datasets [10, 23, 24, 32, 39, 44, 45]. The experimental results show that the proposed loss function improves the state-of-the-art performance of face recognition on five out of seven mainstream benchmarks.
3 Proposed approach
3.1 Preliminary
As shown in Fig. 2, a face recognition training system typically has three main components: a dataset for training, validation, and testing, a feature extraction network (such as a backbone), and a loss function. The loss function measures the similarity between samples, with similar samples being closer and different samples being farther apart. The system uses prototypes, represented by the weight vectors of the final fully connected layer, to represent each identity in the training set. The predicted scores, calculated through the final fully connected layer, represent the similarity between the feature vector and each prototype. The loss function optimizes the feature extraction network and final fully connected layer through backpropagation. The most commonly used loss function for classification is the softmax loss, which separates features from different classes by maximizing the probability of the correct class. The softmax loss is represented as:
\[ L_{1} = -\frac{1}{N}\sum_{i=1}^{N}\log P_i = -\frac{1}{N}\sum_{i=1}^{N}\log \frac{e^{W_{y_i}^{T}x_{i}+b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T}x_{i}+b_{j}}} \tag{1} \]
where \(P_i\) denotes the predicted probability that the embedded feature \(x_i\) belongs to its correct class. \(x_i \in \mathbb {R}^{d}\) is the embedded feature of the ith training sample, which corresponds to class \(y_i\). \(W_{j} \in \mathbb {R}^{d}\) denotes the jth column of the weight matrix \(W\in \mathbb {R}^{d\times n}\), \(b_j\) is the bias, and N is the batch size. The number of classes in the training dataset and the embedding feature size are n and d, respectively. In practice, the bias is usually set to \(b_j= 0\) as in [5], so that \(W_{j}^{T}x_{i}+b_{j}\) becomes \(W_{j}^{T}x_{i}= \Vert W_j \Vert \Vert x_i \Vert \cos \theta _j\), where \(\theta _j\) is the angle between the weight vector \(W_j\) and the feature vector \(x_i\). To optimize feature learning, each weight is fixed to \(\Vert W_j \Vert = 1\) by \(l_2\) normalization [5, 22, 34, 35]. To further improve the classification result, the deep feature \(x_i\) is also \(l_2\)-normalized and re-scaled so that \(\Vert x_i \Vert = s\) for a fixed value s. Thus, the original softmax can be modified as in Eq. 2 and is called normalized softmax loss (NSL):
\[ L_{2} = -\frac{1}{N}\sum_{i=1}^{N}\log \frac{e^{s\cos \theta _{y_i}}}{e^{s\cos \theta _{y_i}} + \sum_{j=1,\, j\ne y_i}^{n} e^{s\cos \theta _{j}}} \tag{2} \]
However, the normalized softmax loss has a limited ability to differentiate features for practical face recognition tasks. To overcome this limitation, various margin-based variants have been proposed [5, 21, 22, 35]. These variants introduce a margin between the target score and non-target scores, which allows for more accurate differentiation between features and improves the overall performance of the face recognition system. They can be formulated in a general form as follows:
\[ L_{3} = -\frac{1}{N}\sum_{i=1}^{N}\log \frac{e^{s\,g(m,\,\theta _{y_i})}}{e^{s\,g(m,\,\theta _{y_i})} + \sum_{j=1,\, j\ne y_i}^{n} e^{s\cos \theta _{j}}} \tag{3} \]
where \(g(m,\,\theta _{y_i})\) is the margin function applied to the target logit. For instance, SphereFace [22] uses \(g(m_1,\,\theta _{y_i}) = \cos(m_1\theta _{y_i})\), where \(m_1 \ge 1\) is an integer multiplicative angular margin. CosFace [35] uses \(g(m_2,\,\theta _{y_i}) = \cos(\theta _{y_i}) - m_2\), where \(m_2 \ge 0\) is an additive cosine margin, whereas ArcFace [5] uses \(g(m_3,\,\theta _{y_i}) = \cos(\theta _{y_i} + m_3)\), where \(m_3 \ge 0\) is an additive angular margin. Applying these margin penalties within the softmax loss yields more discriminative features than the original softmax loss. Finally, these margin-based variants can be integrated in a combined form as \(g(m,\,\theta _{y_i}) = \cos(m_1\theta _{y_i} + m_3) - m_2\).
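The combined margin function can be made concrete with a small sketch that evaluates the target logit and the resulting loss for a single sample. The function names are hypothetical, the default scale `s=64.0` is an assumed value rather than one stated in this section, and the piecewise extension that SphereFace uses to keep \(\cos(m_1\theta)\) monotonic is deliberately omitted.

```python
import math

def target_logit(theta, m1=1.0, m2=0.0, m3=0.0):
    """Combined margin g = cos(m1 * theta + m3) - m2 for the target class."""
    return math.cos(m1 * theta + m3) - m2

def margin_softmax_loss(thetas, target, s=64.0, m1=1.0, m2=0.0, m3=0.0):
    """Loss for one sample; `thetas` holds the angles to every class prototype."""
    logits = [s * math.cos(t) for t in thetas]
    # Only the target-class logit receives the margin penalty.
    logits[target] = s * target_logit(thetas[target], m1, m2, m3)
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

With the defaults the function reduces to the normalized softmax loss; setting only `m2` gives a CosFace-style penalty and setting only `m3` gives an ArcFace-style penalty.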
However, the previous methods introduce a fixed margin for all classes, which can be problematic when the training datasets have unbalanced class distributions. The fixed margin approach neglects the variations in class distribution among the training data, which can lead to mediocre performance [19, 20]. Moreover, these margin approaches enhance discrimination either in the angle or cosine space, emphasizing one boundary while ignoring the other. To tackle these issues, we propose an approach that considers variations in class distribution among the training data by introducing an additive dynamic penalty for both angular and cosine margins. This approach allows for more effective differentiation between features by considering both angular and cosine spaces and improves the overall performance of the face recognition system, particularly when the dataset has unbalanced class distribution.
3.2 Proposed loss function (JAMsFace)
The proposed loss function aims to effectively optimize the discrimination objective in both angle and cosine spaces while addressing the class imbalance problem. To achieve this, we introduce an additive dynamic penalty for both angular and cosine margins. Rather than introducing a completely new loss function, the JAMsFace loss builds upon existing angular and cosine margin-based losses by incorporating two dynamic margin penalties. This approach enhances the model’s adaptability to the data distribution and further improves its discriminative power. By combining these dynamic penalties, our approach effectively balances the trade-off between intra-class compactness and inter-class separability, resulting in improved face recognition performance.
In a binary classification scenario, let \(\theta _i\) denote the angle between the learned feature vector \(x\) and the weight vector \(W_i\) of class \(C_i\) (\(i=1, 2\)). The normalized softmax loss requires \(\cos(\theta _1) > \cos(\theta _2)\) to classify \(x\) as \(C_1\) and, symmetrically, \(\cos(\theta _2) > \cos(\theta _1)\) to classify \(x\) as \(C_2\). However, this decision boundary has limited discrimination power for practical face recognition tasks (Fig. 4a). To address this, CosFace [35] proposes a large-margin classifier that requires \(\cos(\theta _1)-m > \cos(\theta _2)\) to classify \(x\) as \(C_1\) and \(\cos(\theta _2)-m > \cos(\theta _1)\) to classify \(x\) as \(C_2\). Alternatively, ArcFace [5] enhances the discriminative power with an additive angular margin: the target logit becomes \(\cos(\theta _i + m)\), so the classification boundary for class \(C_1\) is \(s\left( \cos\left( \theta _1+m\right) - \cos\theta _2\right) = 0\) and, similarly, for \(C_2\) it is \(s\left( \cos\left( \theta _2+m\right) - \cos\theta _1\right) = 0\). These margin restrictions improve separability and classification performance by making feature vectors of the same class more compact and those of different classes farther apart.
These margin approaches typically focus on enhancing discrimination either in the angle or cosine space, emphasizing one boundary while ignoring the other. For instance, approaches such as CosFace and ArcFace employ a manually defined margin m that remains constant throughout the training process. This limitation of fixed and single-margin approaches restricts their ability to effectively capture the inherent complexity of face data and adapt to varying intra-class variations. By relying on a fixed margin, these methods may not fully exploit the discriminative potential of the feature space.
To address this, our proposed approach introduces joint adaptive margins that dynamically adjust for both the angle and cosine spaces. Instead of relying on a fixed margin, our method allows the margins to adapt to the data distribution, leading to improved discriminative ability and enhanced face recognition performance. Formally, joint adaptive margins softmax loss (JAMsFace) is defined as:
\[ L_{4} = -\frac{1}{N}\sum_{i=1}^{N}\log \frac{e^{s\left( \cos\left( \theta _{y_i}+m_{a_{y_i}}\right) - m_{c_{y_i}}\right) }}{e^{s\left( \cos\left( \theta _{y_i}+m_{a_{y_i}}\right) - m_{c_{y_i}}\right) } + \sum_{j=1,\, j\ne y_i}^{n} e^{s\cos \theta _{j}}} \tag{4} \]
where \(m_{a_{y_i}}\) is the angular margin corresponding to the target class \(y_i\), which denotes the extent of the angle increase, while \(m_{c_{y_i}}\) is the cosine margin corresponding to the target class \(y_i\) and denotes the degree of increment of the cosine. The intuition behind joint adaptive margins can be illustrated using the previous simple binary classification example (with two classes C1 and C2). In order to correctly classify a sample \(x\), we need to ensure that \(\cos(\theta _1) > \cos(\theta _2)\). However, using joint adaptive margins, we instead require that \((\cos(\theta _1+m_a)-m_c) > \cos(\theta _2)\) where \(m_a, m_c > 0\). This results in a more stringent decision, since \(\cos(\theta +m_a)\) is lower than \(\cos(\theta )\) and \(\cos(\theta )-m_c\) is also lower than \(\cos(\theta )\). This approach essentially makes the decision boundary between the two classes more flexible and class-specific, leading to more effective discrimination power in the trained model.
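To make the role of the class-specific margins more tangible, the sketch below builds the scaled logits for one sample with a per-class angular margin and a per-class cosine margin applied to the target class, then evaluates the cross-entropy. It is only an illustration under stated assumptions: the margins are passed in as fixed lists with made-up values, the function names are hypothetical, and the way the margins are actually adapted during training is not reproduced here.

```python
import math

def jams_logits(thetas, target, angular_margins, cosine_margins, s=64.0):
    """Scaled logits with class-specific margins applied to the target class only."""
    m_a = angular_margins[target]  # angular margin of the target class
    m_c = cosine_margins[target]   # cosine margin of the target class
    logits = [s * math.cos(t) for t in thetas]
    logits[target] = s * (math.cos(thetas[target] + m_a) - m_c)
    return logits

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

# Hypothetical values: class 0 is a poor class and receives larger margins.
thetas = [0.6, 1.3, 1.4]
angular_margins = [0.5, 0.3, 0.3]
cosine_margins = [0.2, 0.1, 0.1]
loss = cross_entropy(jams_logits(thetas, 0, angular_margins, cosine_margins), 0)
```

Because both penalties lower the target logit, the same sample incurs a larger loss than under the normalized softmax, which pushes its feature closer to the class prototype.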
In Fig. 3, the proposed JAMsFace and other margin-based softmax losses are illustrated geometrically. The figure shows that CosFace, ArcFace, and JAMsFace can be understood geometrically as projecting face features onto a hypersphere. This interpretation involves representing each face feature as a point in a high-dimensional space. The face recognition task is to find a decision boundary that separates these points into their respective classes. The decision boundary is a hyperplane that partitions the hypersphere into regions, where each region corresponds to a different face class. These methods can enhance face recognition discrimination power by finding an optimal hyperplane that maximizes inter-class distance and minimizes intra-class distance. However, JAMsFace stands out among the other margin-based softmax losses by allocating a larger joint margin to further compact the poor class (orange arc), which implicitly optimizes the underlying real space.
3.3 Comparison with other loss functions
To provide a deeper insight into the differences between our method and other margin-based softmax losses, we analyze the distinctions in the decision boundaries of various margin-based softmax loss functions. These methods mainly differ in terms of margin placement and whether the margin is uniform or class-related (adaptive margin).
Decision Boundaries. The decision boundaries under binary classification are provided in Table 1 and Fig. 4. Figure 4 intuitively illustrates that the decision boundary of the original softmax loss is a dividing line. However, a sample located close to the boundary can cause misclassification due to insufficient discrimination. In contrast, both CosFace and ArcFace utilize a uniform margin between the classes, which does not consider the sample distribution of each class and leads to poor generalization. To address these limitations, JAMsFace employs class-specific margins in both cosine and angular spaces. For poor classes, such as class 1, a larger margin is used, resulting in more compact feature extraction and pushing the actual boundary of class 1 away from class 2. This approach effectively enhances the discrimination power of the trained model.
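The binary decision rules discussed above can be compared on a single sample with a short sketch. The angles and margin values are hypothetical and chosen only so that the sample lies close to the plain softmax boundary; the helper name is an assumption as well.

```python
import math

def accepts_class1(theta1, theta2, m_a=0.0, m_c=0.0):
    """True if x is assigned to class 1 under cos(theta1 + m_a) - m_c > cos(theta2)."""
    return math.cos(theta1 + m_a) - m_c > math.cos(theta2)

theta1, theta2 = 0.70, 0.90  # hypothetical angles (radians) to the two prototypes

softmax_ok = accepts_class1(theta1, theta2)                 # no margin
cosface_ok = accepts_class1(theta1, theta2, m_c=0.35)       # cosine margin only
arcface_ok = accepts_class1(theta1, theta2, m_a=0.5)        # angular margin only
jams_ok = accepts_class1(theta1, theta2, m_a=0.3, m_c=0.2)  # both margins
```

A sample that barely satisfies the plain softmax condition fails the margin-based conditions, so during training it keeps receiving a gradient that pulls it toward its class prototype.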
Toy Example. To better illustrate the robustness and impact of the proposed JAMsFace, we conducted a toy experiment comparing the feature distributions obtained with different loss functions. Specifically, we created a toy dataset by selecting face images from eight identities in the MS1MV2 dataset [5] and trained several 10-layer ResNet models that output three-dimensional features. The dataset consists of eight classes, where class 0 (represented in red) has the most samples (over 500), and classes 1 and 2 (represented in yellow and blue) are also rich classes (about 250 samples each). The remaining classes (3 to 7) are poor classes (about 60 samples each). This distribution roughly simulates the sample number distribution of MS-Celeb-1M. To visualize the features, we normalized the three-dimensional features and plotted them on a sphere.
Figure 5 shows that the softmax loss function favors the rich classes (classes 0, 1, and 2) and allocates a larger space to them, which leads to noticeable ambiguity in decision boundaries. In contrast, CosFace and ArcFace reduce the intra-class variations and assign equal space to each class, regardless of the sample distribution. For instance, the pink and yellow points occupy almost the same space. Our proposed JAMsFace, however, focuses on optimizing both the poor and rich classes to be more compact. Comparing CosFace, ArcFace, and JAMsFace, the rich class 0 (red points) occupies almost the same area under all three losses, while for the poor classes (pink, green, orange, and purple), our approach produces more compact and separable features.
4 Experiments and analysis
To analyze the performance of our proposed JAMsFace model, we first introduce our implementation details, then perform extensive experiments on the most widely used benchmarks for face recognition, and finally compare our method with recent state-of-the-art deep face recognition models.
4.1 Implementation details
For data preprocessing, we follow the commonly used approach as in recent works [1, 5, 46]: Each face image is cropped to a size of 112 \(\times \) 112, using a similarity transformation based on the five face landmarks (two eyes, a nose, and two mouth corners) detected by MTCNN [43]. Finally, the RGB pixel values are normalized from [0, 255] to \([-1, 1]\).
Learning Strategy. We use slightly modified versions of well-known CNN architectures such as ResNet50 and ResNet100 for the embedding network, following [1, 5, 46]. The models are trained on an NVIDIA Quadro RTX 8000 with a batch size of 512 using the stochastic gradient descent (SGD) algorithm with a momentum of 0.9 and a weight decay of \(5\textrm{e}{-4}\). The learning rate for CASIA-WebFace [42] and VGGFace2 [2] starts at 0.001 and is reduced by a factor of ten at the 20th, 28th, and 32nd epoch, for a total of 34 epochs. The learning rate for the larger datasets is divided at the 10th, 18th, and 22nd epochs, stopping after 24 epochs. Regarding the memory buffer, we set the momentum \(\alpha = 0.3\), and \(\gamma \) was set to 0.5 and 0.7 for the smaller and larger datasets, respectively. All experiments were implemented using PyTorch [27].
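The step schedule described above can be written down as a small helper. This is a sketch under assumptions: the function name is hypothetical, a drop is assumed to take effect from the epoch whose number equals the milestone, and the starting rate for the larger datasets is assumed to equal the stated 0.001 because the text does not give a separate value.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(20, 28, 32), factor=0.1):
    """Step schedule: multiply the rate by `factor` once per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

# Schedule for CASIA-WebFace / VGGFace2 (34 epochs in total).
small_schedule = [learning_rate(e) for e in range(1, 35)]

# Schedule for the larger datasets (24 epochs, milestones at 10, 18, 22).
large_schedule = [learning_rate(e, milestones=(10, 18, 22)) for e in range(1, 25)]
```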
Train data. We follow the trend in recent works and separately utilize CASIA-WebFace [42], VGGFace2 [2] and the refined MS1MV2 [5] as our training sets. This enables a direct and fair comparison with state-of-the-art methods. The MS1MV2 is a refined version of the MS-Celeb-1M [8] containing 5.8 M images of 85K identities. These datasets are publicly available.
Test Settings. In the test phase, we input the cropped and aligned 112 \(\times \) 112 images into the trained model to generate a 512-dimensional feature vector and normalize it into a unit-length vector. In the classification stage, cosine similarity is used to calculate the distance between the output features and each category, thereby realizing face recognition.
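The test-time matching step can be sketched as follows: embeddings are normalized to unit length and compared by cosine similarity, and the probe is assigned to the most similar gallery entry. The helper names and the dictionary-based gallery are assumptions made for this illustration, and short vectors stand in for the 512-dimensional features.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine_similarity(u, v):
    """Dot product of the two unit-length versions of u and v."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))

def identify(probe, gallery):
    """Return the gallery label whose embedding is most similar to the probe."""
    return max(gallery, key=lambda label: cosine_similarity(probe, gallery[label]))
```

For example, `identify([1.0, 0.1], {"a": [1.0, 0.0], "b": [0.0, 1.0]})` picks `"a"`, since the probe points almost in the same direction as that gallery vector.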
4.2 Evaluation metrics and benchmarks
Metrics. Face recognition performance is typically assessed through two primary tasks: verification and identification, and each has its respective evaluation metrics. Two sets of samples, a gallery and a probe, are required to evaluate these tasks. The gallery is a set of known identities registered in the face recognition system, while the probe set consists of faces that must be recognized for verification or identification. The face recognition system decides whether to accept the matching of a probe face and a gallery face by comparing their similarity, calculated through some measurement between their features, with a given threshold. Specifically, when a probe face and a gallery face have the same identity, a true acceptance (TA) occurs if their similarity is above the threshold, and a false rejection (FR) occurs if their similarity is below the threshold. Conversely, when they are different identities, a true rejection (TR) occurs if their similarity is below the threshold, and a false acceptance (FA) occurs if their similarity is above the threshold. These basic concepts form the foundation of the evaluation metrics.
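The four outcomes defined above lead directly to the two verification rates used later, TAR = TA / (TA + FR) and FAR = FA / (FA + TR). The following sketch counts the outcomes for a list of pair scores; the function name is hypothetical, and treating a score exactly equal to the threshold as a rejection is an assumption, since the text only covers scores above or below it.

```python
def verification_rates(scores, same_identity, threshold):
    """Return (TAR, FAR) for similarity scores and ground-truth pair labels."""
    ta = fr = tr = fa = 0
    for score, same in zip(scores, same_identity):
        accepted = score > threshold  # ties count as rejections (assumption)
        if same and accepted:
            ta += 1  # true acceptance
        elif same:
            fr += 1  # false rejection
        elif accepted:
            fa += 1  # false acceptance
        else:
            tr += 1  # true rejection
    tar = ta / (ta + fr) if (ta + fr) else 0.0
    far = fa / (fa + tr) if (fa + tr) else 0.0
    return tar, far
```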
Benchmarks. Face recognition benchmarks are critical for evaluating the performance of different face recognition models. Labeled Faces in the Wild (LFW) [10] is the most commonly used benchmark for face recognition in unconstrained environments. The original LFW protocol includes 3,000 genuine and 3,000 impostor face pairs and evaluates the mean verification accuracy on these 6,000 pairs. In addition to LFW, several other benchmarks are also used for face recognition evaluations. These include CFP-FP [32], CPLFW [44], CALFW [45], and AgeDB [24]. IJB-B [39] and IJB-C [23] are used for even more rigorous evaluations. These two benchmarks are particularly challenging, as they include significant pose, illumination, expression, and occlusion variations. All of these benchmarks are publicly available.
4.3 Evaluation results
Results on LFW, CALFW, CPLFW, CFP-FP and AgeDB-30.
We train our JAMsFace on MS1MV2 using ResNet100 as a backbone and provide a comprehensive comparison with other SOTA methods. To maintain consistency with previous approaches, we follow the unrestricted with labeled outside data protocol to report the results, as outlined in [5, 11]. Additionally, as in [5], we use 7000 test pairs for CFP-FP and 6000 test pairs for each of the other datasets to ensure an objective and reliable evaluation of our approach. The results reported in Table 2 demonstrate that our proposed method outperforms recent methods. Notably, JAMsFace surpasses existing methods in terms of verification accuracy on LFW, CPLFW, and CFP-FP and achieves comparable results on the other two datasets. Compared to AdaptiveFace [20], which employs an adaptive margin in only one of the angular and cosine spaces, JAMsFace improves verification accuracy by 0.24%, 1.13%, 4.6%, 3.77%, and 0.58% on the LFW, CALFW, CPLFW, CFP-FP, and AgeDB-30 datasets, respectively.
Results on IJB-B and IJB-C. To exhaustively evaluate the performance of our proposed loss function JAMsFace, we use two of the most challenging face recognition benchmarks, namely IJB-B and IJB-C. To ensure a fair comparison, we reimplement other state-of-the-art methods, including SphereFace, CosFace, ArcFace, and Circle loss. To further ensure consistency, we train all the implemented models on the widely used VGGFace2 dataset using the same CNN architectures as in [22, 35, 38]. VGGFace2 comprises 3.1 M images from 8.6K different identities. This setup allows an unbiased comparison of JAMsFace against other recent approaches.
IJB-B [39] comprises 1845 subjects with 21.8K still images (11.8K faces and 10K non-faces) and 55K frames from 7K videos. The standard 1:1 verification and 1:N identification protocols are used for the experiments. The protocol defines 12,115 templates and a list of 8,010,270 comparisons. Specifically, in the 1:1 verification protocol, 10,270 genuine matches and 8 M impostor matches are constructed, and in the 1:N identification protocol, 10,270 probes and 1875 galleries are constructed. IJB-C [23] is an extension of IJB-B and uses similar evaluation protocols. It adds 1661 new subjects to IJB-B, for a total of 31.3K still images (21.3K faces and 10K non-faces) and 117.5K frames from 11.8K videos.
The evaluations are presented in Tables 3 and 4, with results expressed as true acceptance rates (TARs) at varying false acceptance rates (FARs) for verification and true positive identification rates (TPIRs) at varying false positive identification rates (FPIRs) for identification. The proposed JAMsFace loss achieves the best performance on both identification and verification tasks compared to the baseline methods SphereFace, CosFace, ArcFace, and Circle loss. On the IJB-B benchmark (Table 3), JAMsFace reaches a TAR of 89.09% at FAR = 1e-4, outperforming CosFace [35] and ArcFace [5], which reach 86.75% and 88.79%, respectively. Similarly, on the IJB-C benchmark (Table 4), JAMsFace achieves a TAR of 91.81% at FAR = 1e-4, surpassing CosFace [35] and ArcFace [5], which reach 89.55% and 91.47%, respectively. Furthermore, the full ROC curves of JAMsFace on IJB-B and IJB-C are shown in Figs. 6 and 7, respectively. The results demonstrate that JAMsFace achieves strong performance even at the challenging FAR = 1e-5 setting.
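A figure such as "TAR at FAR = 1e-4" is obtained by first fixing the threshold from the impostor scores and then measuring how many genuine comparisons pass it. The sketch below shows one simple way to do this on plain score lists. It is an assumption-laden illustration: the function name is hypothetical, scores are accepted only when strictly above the threshold, and the template construction used by the IJB protocols is not modeled.

```python
def tar_at_far(genuine_scores, impostor_scores, target_far):
    """TAR at the lowest threshold whose FAR does not exceed `target_far`."""
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(target_far * len(ranked))  # impostor pairs we may accept
    # Accept only scores strictly above the (allowed + 1)-th highest impostor score.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores)
```

With 8 M impostor comparisons, as in IJB-B, a target FAR of 1e-4 would allow roughly 800 impostor pairs above the threshold, which is why low-FAR operating points are sensitive to a small number of hard impostor pairs.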
4.4 Ablation study
We report the performances on LFW, AgeDB-30, CALFW, and CPLFW, as well as on the combined dataset from [38] in our ablation study. The combined dataset is created by combining these four validation datasets.
We demonstrate the effectiveness of the joint adaptive margins in our proposed loss function, JAMsFace, by comparing it with other alternatives that employ different fixed/adaptive margin settings. Equation 3 highlights three types of margins: multiplicative angular margin (MA), additive angular margin (AA), and additive cosine margin (AC). Our experimental analysis primarily focuses on assessing the impact of joint adaptive margins.
In Table 5, we initially evaluate the performance of fixed joint alternatives to investigate the impact of utilizing a joint margin. Subsequently, we introduce adaptive margins alongside fixed margins. Finally, we replace the fixed margins with adaptive margins to dynamically adapt to the data distribution.
Table 5 shows that a mixed approach, which combines adaptive and fixed margins, improves performance across all datasets compared to the fixed-margin alternatives. Notably, the mixed alternatives with adaptive additive angular or cosine margins outperform those with an adaptive multiplicative margin. Making both the angular and cosine margins adaptive improves performance further and outperforms all other alternatives.
These results provide direct evidence that the adaptive versions outperform the fixed ones, highlighting the effectiveness of adaptive margins in enhancing the discriminative ability of our approach.
To conclude, our approach achieves higher verification performance across multiple datasets and outperforms other methods employing either fixed joint margins or adaptive margins. These results demonstrate that our joint adaptive margin approach effectively balances the intra-class compactness and inter-class separability, leading to superior performance on face recognition tasks.
5 Discussion
In this paper, we introduced a new loss function that offers significant flexibility in setting margins based on class distribution. Unlike existing methods such as CosFace [35], ArcFace [5], and AdaptiveFace [20], JAMsFace reduces intra-class variance and increases inter-class variance by utilizing joint adaptive margins in both the angle and cosine spaces. This approach encourages the model to learn more discriminative features and thereby improves face recognition accuracy.
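The idea of class-related margins in both spaces can be illustrated with a small Python sketch. It is not the JAMsFace implementation: it assumes that each class y has its own additive angular margin and additive cosine margin, applied to the target logit as cos(θ + m_a[y]) − m_c[y] before a scaled softmax cross-entropy. The function name `joint_margin_loss`, the scale of 64, and all margin values are hypothetical. In training, the per-class margins would be learnable parameters updated together with the network weights, whereas here they are plain lists.

```python
import math


def joint_margin_loss(cosines, label, ang_margins, cos_margins, scale=64.0):
    """Cross-entropy for one sample with per-class angular and cosine margins.

    cosines:     cosine similarity between the feature and every class centre
    label:       index of the ground-truth class
    ang_margins: one additive angular margin per class (radians)
    cos_margins: one additive cosine margin per class
    """
    logits = []
    for j, c in enumerate(cosines):
        if j == label:
            # Only the target logit is penalised, using the margins of its own class.
            theta = math.acos(max(-1.0, min(1.0, c)))
            c = math.cos(theta + ang_margins[label]) - cos_margins[label]
        logits.append(scale * c)
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]


cosines = [0.8, 0.1, -0.2]
no_margin = joint_margin_loss(cosines, 0, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
with_margin = joint_margin_loss(cosines, 0, [0.5, 0.3, 0.3], [0.35, 0.2, 0.2])
print(no_margin, with_margin)
```

Because the margins are indexed by the label, two classes can receive different penalties for the same angle, which a single fixed margin cannot express.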
To further validate the effectiveness of the proposed loss function, we implemented a fixed-margin version of JAMsFace, which we call ArcPlusCos loss, along with re-implementations of three other state-of-the-art loss functions: CosFace, ArcFace, and AdaptiveFace. To ensure fair comparisons, all the implemented losses and JAMsFace were trained on VGGFace2 [2] using the 64-layer CNN architecture from [22, 35]. The angular and cosine margins were set to the best values reported in previous works [5, 35], respectively. Table 6 reports the verification performance, measured in terms of accuracy, on several popular benchmark datasets. The results demonstrate that our proposed method outperforms the fixed-margin variants and the single adaptive margin variant on all evaluation datasets, highlighting the improvement achieved by introducing joint adaptive margins.
In addition to its performance benefits, JAMsFace offers practical advantages by effectively addressing the class imbalance problem in face recognition and by leveraging both cosine and angular margins to tackle the challenges posed by unconstrained environments. It is robust to class imbalance and unconstrained environments, generalizes to unseen classes, and can be integrated into existing face recognition frameworks. However, it is important to carefully evaluate the computational complexity and consider the specific requirements of the application when assessing the practicability of JAMsFace.
Nevertheless, it should be noted that our proposed method still requires a substantial amount of training data to achieve optimal performance. The joint penalty imposed by the cosine and angular margins in JAMsFace also adds computational complexity to the training process. To address these limitations, future research can explore techniques to improve the computational efficiency of margin-based softmax losses, especially for deployment in real-world scenarios involving mobile devices or cloud-based systems. Moreover, while JAMsFace has shown the best results on the cross-pose CPLFW dataset, there is still room for improvement in this area. Further research can explore alternative loss functions that better handle cross-pose face recognition and pose variations.
6 Conclusions
In this paper, we propose a novel loss function named JAMsFace, which utilizes an adaptive cosine margin with angular margin penalties to avoid using a single constant penalty margin. Our motivation is that real training data exhibits considerable variation within and between classes, so the fixed margin used in many margin-based softmax losses may not be the optimal way to learn the distance between and within different classes. Additionally, margin-based approaches typically improve discrimination in either the angle or the cosine space, focusing on one boundary while ignoring the other. Therefore, we replace this fixed margin with a class-related margin in the cosine and angular spaces. This allows the model to learn a specific margin for each class, adapting to its intra-class variations while maintaining inter-class discriminability. The experimental results on several widely used benchmarks demonstrate that JAMsFace outperforms some current state-of-the-art face recognition methods. Specifically, JAMsFace advances the state-of-the-art face recognition performance on LFW, CPLFW, and CFP-FP and achieves comparable results on CALFW and AgeDB-30. Moreover, JAMsFace achieves a TAR of 89.09% and 91.81% at FAR = 1e-4 on the IJB-B and IJB-C benchmarks, respectively.
References
Boutros F, Damer N, Kirchbuchner F, et al (2022) Elasticface: elastic margin loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p 1578–1587
Cao Q, Shen L, Xie W, et al (2018) Vggface2: a dataset for recognising faces across pose and age. In: 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), IEEE, p 67–74
Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), IEEE, p 539–546
Deng J, Zhou Y, Zafeiriou S (2017) Marginal loss for deep face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, p 60–68
Deng J, Guo J, Xue N, et al (2019) Arcface: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p 4690–4699
Deng J, Guo J, Liu T, et al (2020) Sub-center arcface: boosting face recognition by large-scale noisy web faces. In: European conference on computer vision, Springer, p 741–757
Guillaumin M, Verbeek J, Schmid C (2009) Is that you? metric learning approaches for face identification. In: IEEE 12th international conference on computer vision, IEEE, p 498–505
Guo Y, Zhang L, Hu Y, et al (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In: European conference on computer vision, Springer, p 87–102
He L, Wang Z, Li Y, et al (2020) Softmax dissection: towards understanding intra-and inter-class objective for embedding learning. In: Proceedings of the AAAI conference on artificial intelligence, p 10957–10964
Huang GB, Mattar M, Berg T, et al (2008) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In: Workshop on faces in 'Real-Life' Images: detection, alignment, and recognition
Huang Y, Wang Y, Tai Y, et al (2020) Curricularface: adaptive curriculum learning loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p 5901–5910
Irjanto NS, Surantha N (2020) Home security system with face recognition based on convolutional neural network. Int J Adv Comput Sci Appl. https://doi.org/10.14569/IJACSA.2020.0111152
Jiao J, Liu W, Mo Y et al (2021) Dyn-arcface: dynamic additive angular margin loss for deep face recognition. Multimed Tools Appl 80(17):25741–25756
Johnson M, Bradshaw JM (2021) How interdependence explains the world of teamwork. A systems engineering approach to realizing synergistic capabilities, engineering artificially intelligent systems. Springer, Cham, pp 122–146
Kavalionak H, Gennaro C, Amato G et al (2019) Distributed video surveillance using smart cameras. J Grid Comput 17:59–77
Khalifa A, Al-Hamadi A (2021) A survey on loss functions for deep face recognition network. In: 2021 IEEE 2nd International conference on human-machine systems (ICHMS), IEEE, p 1–7
Khalifa A, Abdelrahman AA, Strazdas D et al (2022) Face recognition and tracking framework for human-robot interaction. Appl Sci 12(11):5568
Kim M, Jain AK, Liu X (2022) Adaface: quality adaptive margin for face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p 18750–18759
Liu B, Deng W, Zhong Y, et al (2019a) Fair loss: margin-aware reinforcement learning for deep face recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, p 10052–10061
Liu H, Zhu X, Lei Z, et al (2019b) Adaptiveface: adaptive margin and sampling for face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p 11947–11956
Liu W, Wen Y, Yu Z, et al (2016) Large-margin softmax loss for convolutional neural networks. In: Proceedings of the 33rd international conference on machine learning, vol 48, p 507–516
Liu W, Wen Y, Yu Z, et al (2017) Sphereface: deep hypersphere embedding for face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, p 212–220
Maze B, Adams J, Duncan JA, et al (2018) Iarpa janus benchmark-c: face dataset and protocol. In: 2018 International conference on biometrics (ICB), IEEE, p 158–165
Moschoglou S, Papaioannou A, Sagonas C, et al (2017) Agedb: the first manually collected, in-the-wild age database. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, p 51–59
Oinar C, Le BM, Woo SS (2022) Kappaface: adaptive additive angular margin loss for deep face recognition. arXiv preprint arXiv:2201.07394
Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition
Paszke A, Gross S, Massa F, et al (2019) Pytorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems, vol 32
Schroff F, Kalenichenko D, Philbin J (2015) Facenet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, p 815–823
Seal A, Bhattacharjee D, Nasipuri M, et al (2012) Minutiae from bit-plane sliced thermal images for human face recognition. In: Proceedings of the international conference on soft computing for problem solving (SocProS 2011), Springer, p 113–124
Seal A, Bhattacharjee D, Nasipuri M, et al (2013a) Thermal human face recognition based on gappypca. In: 2013 IEEE 2nd international conference on image information processing (ICIIP-2013), IEEE, p 597–600
Seal A, Ganguly S, Bhattacharjee D, et al (2013b) Thermal human face recognition based on haar wavelet transform and series matching technique. In: Multimedia processing, communication and computing applications: proceedings of the 1st international conference, ICMCCA, Springer, p 155–167
Sengupta S, Chen JC, Castillo C, et al (2016) Frontal to profile face verification in the wild. In: 2016 IEEE winter conference on applications of computer vision (WACV), IEEE, p 1–9
Sohn K (2016) Improved deep metric learning with multi-class n-pair loss objective. In: Advances in neural information processing systems, vol 29
Wang F, Xiang X, Cheng J, et al (2017) Normface: L2 hypersphere embedding for face verification. In: Proceedings of the 25th ACM international conference on multimedia, p 1041–1049
Wang H, Wang Y, Zhou Z, et al (2018) Cosface: large margin cosine loss for deep face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, p 5265–5274
Wang M, Deng W (2021) Deep face recognition: a survey. Neurocomputing 429:215–244
Wen Y, Zhang K, Li Z, et al (2016) A discriminative feature learning approach for deep face recognition. In: European conference on computer vision, Springer, p 499–515
Wen Y, Liu W, Weller A, et al (2022) Sphereface2: binary classification is all you need for deep face recognition. In: International conference on learning representations
Whitelam C, Taborsky E, Blanton A, et al (2017) Iarpa janus benchmark-b face dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, p 90–98
Xie W, Zisserman A (2018) Multicolumn networks for face recognition. arXiv preprint arXiv:1807.09192
Xie W, Shen L, Zisserman A (2018) Comparator networks. In: Proceedings of the European conference on computer vision (ECCV), p 782–797
Yi D, Lei Z, Liao S, et al (2014) Learning face representation from scratch. arXiv preprint arXiv:1411.7923
Zhang K, Zhang Z, Li Z et al (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process Lett 23(10):1499–1503
Zheng T, Deng W (2018) Cross-pose lfw: a database for studying cross-pose face recognition in unconstrained environments. Beijing Univ Posts Telecommun Tech Rep 5:7
Zheng T, Deng W, Hu J (2017) Cross-age lfw: a database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197
Zhong Y, Deng W, Hu J et al (2021) Sface: sigmoid-constrained hypersphere loss for robust face recognition. IEEE Trans Image Process 30:2587–2598
Funding
Open Access funding enabled and organized by Projekt DEAL. This research was funded by the Federal Ministry of Education and Research of Germany (BMBF) project AutoKoWaT, no. 13N16336, and by the German Research Foundation (DFG) projects AL 638/15-1, AL 638/14-1, and AL 638/13-1.
Author information
Contributions
AK was involved in conceptualization, methodology, software, and writing—original draft preparation. AA-H was involved in project heading, funding acquisition, and supervision. AK and AA-H were involved in formal analysis, investigation, and writing—review and editing.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Khalifa, A., Al-Hamadi, A. JAMsFace: joint adaptive margins loss for deep face recognition. Neural Comput & Applic 35, 19025–19037 (2023). https://doi.org/10.1007/s00521-023-08732-5
DOI: https://doi.org/10.1007/s00521-023-08732-5