1 Introduction

Deep learning techniques, specifically deep convolutional neural networks (DCNNs), have greatly improved performance in various fields, such as natural language processing, speech recognition, and computer vision. In particular, they have boosted the performance of face-related tasks, including face recognition [28]. Recent advancements in face recognition have focused on leveraging deep learning methods to extract high-dimensional feature vectors for improved performance. These methods enable automatic learning of discriminative features for face recognition tasks, resulting in the use of facial recognition systems in various applications such as security, surveillance, human–robot interaction, and intelligent productive teaming [12, 14, 15, 17]. Additionally, thermal face recognition is a crucial application of face recognition [29, 31]: it operates effectively in low-light conditions and can identify individuals even when they wear masks or other facial coverings [30].

Fig. 1 Effects of class imbalance. a The model tends to make mistakes on new test samples of a poor class with the fixed additive margin. b The adaptive margins are more appropriate to solve these mistakes, in which a rich class needs a relatively smaller margin, while a poor class needs a relatively larger margin

Loss functions play a pivotal role in the learning optimization of DCNNs. One of the most widely used loss functions is the traditional softmax loss, which fails to produce highly discriminative feature vectors [5, 22, 35]. To overcome this issue, various loss functions [1, 5, 16, 20, 22, 35, 37] have been proposed to promote the feature vectors’ intra-class compactness and inter-class separability to enhance face recognition accuracy and generalization.

Current state-of-the-art (SOTA) face recognition methods mostly rely on margin-based softmax loss, which improves feature discrimination by adding a margin to each identity. These methods, such as SphereFace [22], CosFace [35], and ArcFace [5, 6], add the multiplicative angular margin, additive cosine margin, and additive angular margin, respectively, to increase class separation. These methods assume that all classes have enough samples to describe their distributions accurately, and thus using a constant margin for all classes would be sufficient. However, this assumption does not hold for public face datasets, which are often highly unbalanced. As shown in Fig. 1, the space spanned by existing training samples can accurately represent the real distribution for classes with many samples (rich classes). However, for classes with few samples (poor classes), the space spanned by existing training samples may only be a small part of the real distribution. Therefore, it can be concluded that a fixed margin may not be appropriate for classes with varying sample distributions, and an adaptive margin would be more suitable. This is because the adaptive margin can adjust the margin value according to the sample distribution of each class, and better constrain the intra-class variation of the class with few samples [20]. Furthermore, these margin-based approaches only enhance discrimination in the angle or cosine space, emphasizing one boundary while disregarding the other, which can lead to suboptimal performance. In this paper, we introduce a new loss function called joint adaptive margins loss (JAMsFace), which dynamically sets additive margins based on the class distribution and improves discrimination in both angle and cosine spaces.

Our major contributions are as follows:

  • We present a joint adaptive angular and cosine margins loss (JAMsFace) that dynamically adjusts the decision boundary to learn a more compact and precise face feature representation.

  • JAMsFace allows for more effective differentiation between features by considering both angular and cosine spaces, which improves the overall performance of face recognition.

  • We conduct comprehensive experiments on multiple face recognition benchmark datasets. The experimental results show that the proposed JAMsFace advances the state-of-the-art face recognition performance on five out of seven mainstream benchmarks.

2 Related work

The importance of loss functions in learning discriminative features during the training of face recognition methods cannot be overstated. Choosing an appropriate loss function can significantly enhance face recognition performance. Loss functions can be divided into two categories: metric-based and margin-based [36]. Metric-based techniques learn a distance measure between two faces, while margin-based techniques aim to maximize the separation between classes.

2.1 Metric-based methods

In early works, using metric-based approaches was common practice. These methods aim to learn a similarity measure for a set of images using a deep metric learning network [7]. They strive to make visually similar images closer together in an embedding manifold, while pushing visually dissimilar images further apart. Two types of metric-based losses have been developed for different setups: the contrastive loss and the triplet loss. The contrastive loss [3] trains a network to predict whether pairs of samples belong to the same class or not by minimizing the embedding distance for positive pairs and maximizing the distance for negative pairs. This approach is also known as distance metric learning.

In contrast to the contrastive loss, the triplet loss uses triplet samples (anchor, positive, and negative). It was first used for training in FaceNet [26]. The goal is to minimize the distance between the anchor and positive pairs and to maximize the distance between the anchor and negative pairs.
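For illustration, both losses can be written compactly in PyTorch; the sketch below is a minimal version in which the margin values are illustrative rather than the settings used in the cited works.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x1, x2, same, margin=1.0):
    """Contrastive loss [3]: pull positive pairs together and push negative
    pairs at least `margin` apart (margin value illustrative)."""
    d = F.pairwise_distance(x1, x2)               # Euclidean distance per pair
    pos = same * d.pow(2)                         # penalize distant positive pairs
    neg = (1 - same) * F.relu(margin - d).pow(2)  # penalize close negative pairs
    return (pos + neg).mean()

# Triplet loss as used in FaceNet [26]; PyTorch provides an implementation.
triplet = torch.nn.TripletMarginLoss(margin=0.2)  # margin value illustrative
# loss = triplet(anchor_emb, positive_emb, negative_emb)
```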

However, using contrastive and triplet losses for deep representation learning can result in slow convergence and instability, as the embeddings are optimized only against one negative class during each update [4, 33, 37]. To address these issues, Sohn et al. [33] proposed \((N+1)\)-tuplet loss, which improves training convergence by selecting a positive sample from among \(N-1\) negative samples and considering the distance between the anchor and negative samples. However, the number of samples in each batch increases in a quadratic fashion, leading to an explosion in the sample space. Wen et al. [37] attempted to solve this problem by introducing center loss, which simultaneously learns the center for each class and penalizes the distances between the features and their corresponding class centers. Despite these efforts, center loss and similar methods are ineffective in tackling the open set problem in face recognition [4].

2.2 Margin-based methods

Several margin-based softmax loss functions have recently been proposed [1, 5, 11, 13, 20, 21, 22, 25, 35]. These methods create a decision boundary in the angular space by incorporating additional constraints on the target logit, making the learned features more discriminative than those produced by deep metric learning methods.

Liu et al. [21] proposed a large-margin softmax loss (L-softmax) that uses a piecewise function to ensure the monotonicity of the cosine function and explicitly promotes intra-class compactness and inter-class separability of the learned features. Liu et al. [22] extended L-softmax with SphereFace, which normalizes the weight matrix of the last fully connected layer and discriminatively distributes the learned features on a hypersphere manifold. However, SphereFace is challenging to train because it employs a multiplicative angular penalty margin between the deep features and their corresponding weights.

Fig. 2 The training pipeline of general softmax-based face recognition

To tackle this issue, CosFace [35] reformulates face classification as a cosine-distance measure by introducing an additive cosine margin on the cosine of the angle between the deep features and their corresponding weights. However, finding a suitable additive cosine margin that achieves both intra-class compactness and inter-class separability can be challenging. Deng et al. [5] proposed ArcFace, which improves the geometric interpretation of the margin and achieves better performance by introducing an additive angular margin, while He et al. [9] point out that the intra- and inter-class objectives in softmax are entangled; therefore, a well-optimized inter-class objective leads to relaxation of the intra-class objective and vice versa.

AdaptiveFace [20] and Dyn-arcFace [13] address the issue of unbalanced data by learning an adaptive margin for each class, while CurricularFace [11] adaptively adjusts the relative importance of easy and hard samples during different training stages by using curriculum learning with the loss function. On the other hand, KappaFace [25] adaptively modulates the positive margins based on class imbalance and difficulty. Among the various methods, AdaFace [18] considers the image quality during training and introduces a quality adaptive margin by estimating the image quality through feature norms.

In conclusion, margin-based softmax loss functions have significantly improved the performance of deep face recognition. As a result, they have become more commonly used in face recognition tasks because of their ability to effectively reduce intra-class variability and inter-class similarity, leading to better accuracy and robustness of face recognition models.

In this paper, a new loss function called JAMsFace for face recognition is proposed and validated on several current face recognition datasets [10, 23, 24, 32, 39, 44, 45]. The experimental results show that the proposed loss function improves the state-of-the-art performance of face recognition on five out of seven mainstream benchmarks.

3 Proposed approach

3.1 Preliminary

As shown in Fig. 2, a face recognition training system typically has three main components: a dataset for training, validation, and testing; a feature extraction network (the backbone); and a loss function. The loss function encourages similar samples to lie close together and dissimilar samples to lie far apart in the embedding space. The system uses prototypes, represented by the weight vectors of the final fully connected layer, to represent each identity in the training set. The predicted scores, calculated through the final fully connected layer, represent the similarity between the feature vector and each prototype. The loss function optimizes the feature extraction network and the final fully connected layer through backpropagation. The most commonly used loss function for classification is the softmax loss, which separates features from different classes by maximizing the probability of the correct class. The softmax loss is represented as:

$$\begin{aligned} \mathcal {L}_{S}= -\frac{1}{N}\sum _{i=1}^{N}\log P_i = -\frac{1}{N}\sum _{i=1}^{N}\log \frac{e^{W_{y_{i}}^{T}x_{i}+b_{y_{i}}}}{\sum _{j=1}^{n}e^{W_{j}^{T}x_{i}+b_{j}}}, \end{aligned}$$
(1)

where \(P_i\) denotes the predicted probability of the embedded feature \(x_i\) belonging to the correct class. \(x_i \in \mathbb {R}^{d}\) is the embedded feature of the \(i\)th training sample, which corresponds to a class \(y_i\). \(W_{j} \in \mathbb {R}^{d}\) denotes the \(j\)th column of the weight \(W\in \mathbb {R}^{d\times n}\). \(b_j\) is the bias, while N is the batch size. The number of classes in the training dataset and the embedding feature size are n and d, respectively. In practice, the bias is usually set to \(b_j= 0\) as in [5], and then \(W_{j}^{T}x_{i}+b_{j}\) reduces to \(W_{j}^{T}x_{i}= \Vert W_j \Vert \Vert x_i \Vert \cos \theta _j\), where \(\theta _j\) is the angle between the weight vector \(W_j\) and the feature vector \(x_i\). To optimize feature learning, each weight is set to \(\Vert W_j \Vert = 1\) by \(l_2\) normalization [5, 22, 34, 35]. To better optimize the classification result, the deep feature \(x_i\) is also \(l_2\)-normalized and re-scaled to a fixed value s. Thus, the original softmax can be modified as in Eq. 2 and is called the normalized softmax loss (NSL).

Fig. 3 Geometrical interpretation of different losses. The orange area represents the feature space of the poor class 1, and the green area represents the feature space of the rich class 2. a Modified softmax loss. b, c CosFace and ArcFace allocate an identical margin m for both classes, so the poor class cannot be compacted well. d JAMsFace allocates a larger joint margin to further compact the poor class, which implicitly optimizes the underlying real space. Note that \(m_{a_1}\) is the angular margin for class 1, \(m_{c_1}\) is the cosine margin for class 1, and \(\theta _3 = \theta _1+m_{a_1}\)

$$\begin{aligned} \mathcal {L}_{N}= -\frac{1}{N}\sum _{i=1}^{N}\log \frac{e^{s\cos \theta _{y_{i}}}}{e^{s\cos \theta _{y_{i}}}+\sum _{j=1,j\ne y_{i}}^{n}e^{s\cos \theta _{j}}}. \end{aligned}$$
(2)
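For concreteness, the NSL head of Eq. 2 can be sketched in PyTorch as follows; the scale s = 64 and the dimensions are illustrative defaults rather than values prescribed here.

```python
import torch
import torch.nn.functional as F

class NormalizedSoftmax(torch.nn.Module):
    """Minimal sketch of Eq. 2: l2-normalize the class weights W_j and the
    feature x_i, then scale the cosine logits by a fixed s."""
    def __init__(self, embed_dim=512, num_classes=1000, s=64.0):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(num_classes, embed_dim) * 0.01)
        self.s = s

    def forward(self, x, labels):
        cos = F.linear(F.normalize(x), F.normalize(self.W))  # cos(theta_j), shape (N, n)
        return F.cross_entropy(self.s * cos, labels)         # Eq. 2 via cross-entropy
```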

However, the normalized softmax loss has a limited ability to differentiate features for practical face recognition tasks. To overcome this limitation, various margin-based variants have been proposed [5, 21, 22, 35]. These variants introduce a margin between the target score and non-target scores, which allows for more accurate differentiation between features and improves the overall performance of the face recognition system. They can be formulated in a general form as follows:

$$\begin{aligned} \mathcal {L}_{G}= -\frac{1}{N}\sum _{i=1}^{N}\log \frac{e^{s\,g(m,\theta _{y_{i}})}}{e^{s\,g(m,\theta _{y_{i}})}+\sum _{j=1,j\ne y_{i}}^{n}e^{s\cos \theta _{j}}}, \end{aligned}$$
(3)

where \(g(m,\theta _{y_i})\) is the introduced margin function. For instance, SphereFace [22] introduces \(g(m_1,\theta _{y}) = \cos (m_1\theta _{y})\), where \(m_1\) is a multiplicative angular margin with \(m_1 \ge 1\) an integer. CosFace [35] uses \(g(m_2,\theta _{y}) = \cos (\theta _y) - m_2\) with \(m_2 \ge 0\), where \(m_2\) is an additive cosine margin, whereas ArcFace [5] introduces \(g(m_3,\theta _{y}) = \cos (\theta _y + m_3)\) with \(m_3 \ge 0\), where \(m_3\) is an additive angular margin. Utilizing these margin penalties within the softmax loss yields more discriminative features than the original softmax loss. Finally, these margin-based variants can be integrated in a combined form as \(g(m,\theta _{y}) = \cos (m_1\theta _y + m_3) - m_2\).
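The combined form can be implemented directly; a minimal sketch follows (it omits SphereFace's piecewise extension that keeps the multiplicative term monotonic, and the clamp is a common numerical safeguard rather than part of the formula).

```python
import torch

def combined_margin(cos_theta_y, m1=1.0, m2=0.0, m3=0.0):
    """g(m, theta_y) = cos(m1 * theta_y + m3) - m2.
    SphereFace: m1 > 1; CosFace: m2 > 0; ArcFace: m3 > 0."""
    theta_y = torch.acos(cos_theta_y.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
    return torch.cos(m1 * theta_y + m3) - m2
```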

However, the previous methods introduce a fixed margin for all classes, which can be problematic when the training datasets have unbalanced class distributions. The fixed margin approach neglects the variations in class distribution among the training data, which can lead to mediocre performance [19, 20]. Moreover, these margin approaches enhance discrimination either in the angle or cosine space, emphasizing one boundary while ignoring the other. To tackle these issues, we propose an approach that considers variations in class distribution among the training data by introducing an additive dynamic penalty for both angular and cosine margins. This approach allows for more effective differentiation between features by considering both angular and cosine spaces and improves the overall performance of the face recognition system, particularly when the dataset has unbalanced class distribution.

Fig. 4 Decision margins of different loss functions under binary classification. The dashed line represents the decision boundary, and the white areas are the decision margins. Class 1 is a poor class, and class 2 is a rich class. JAMsFace allocates a larger joint margin to further compact the poor class, which implicitly optimizes the underlying real space. Note that \(m_{a_1}\) is the angular margin for class 1, \(m_{c_1}\) is the cosine margin for class 1, and \(\theta _3 = \theta _1+m_{a_1}\)

3.2 Proposed loss function (JAMsFace)

The proposed loss function aims to effectively optimize the discrimination objective in both angle and cosine spaces while addressing the class imbalance problem. To achieve this, we introduce an additive dynamic penalty for both angular and cosine margins. Rather than introducing a completely new loss function, the JAMsFace loss builds upon existing angular and cosine margin-based losses by incorporating two dynamic margin penalties. This approach enhances the model’s adaptability to the data distribution and further improves its discriminative power. By combining these dynamic penalties, our approach effectively balances the trade-off between intra-class compactness and inter-class separability, resulting in improved face recognition performance.

In a binary classification scenario, the angle between the learned feature vector \(x\) and the ground truth weight vector \(W_i\) of class \(C_i\) (\(i=1, 2\)) is denoted by \(\theta _i\). The normalized softmax loss requires \(\cos (\theta _1) > \cos (\theta _2)\) to correctly classify \(x\) as \(C_1\), and analogously \(\cos (\theta _2) > \cos (\theta _1)\) to correctly classify \(x\) as \(C_2\). However, this decision boundary has limited discrimination power for practical face recognition tasks (Fig. 4a). To address this, CosFace [35] proposes a large-margin classifier that requires \(\cos (\theta _1)-m > \cos (\theta _2)\) for correct classification of \(x\) as \(C_1\) and, similarly, \(\cos (\theta _2)-m > \cos (\theta _1)\) for \(C_2\). Alternatively, ArcFace [5] enhances the discriminative power by adding an additive angular margin. The target logit becomes \(\cos (\theta _i + m)\), and the classification boundary for class \(C_1\) is defined as \(s\left( \cos \left( \theta _1+m\right) - \cos \theta _2\right) = 0\), and similarly for \(C_2\), \(s\left( \cos \left( \theta _2+m\right) - \cos \theta _1\right) = 0\). These angular margin restrictions improve separability and classification performance by making feature vectors of the same class more compact and those of different classes farther apart.

These margin approaches typically focus on enhancing discrimination either in the angle or cosine space, emphasizing one boundary while ignoring the other. For instance, approaches such as CosFace and ArcFace employ a manually defined margin m that remains constant throughout the training process. This limitation of fixed and single-margin approaches restricts their ability to effectively capture the inherent complexity of face data and adapt to varying intra-class variations. By relying on a fixed margin, these methods may not fully exploit the discriminative potential of the feature space.

To address this, our proposed approach introduces joint adaptive margins that dynamically adjust for both the angle and cosine spaces. Instead of relying on a fixed margin, our method allows the margins to adapt to the data distribution, leading to improved discriminative ability and enhanced face recognition performance. Formally, joint adaptive margins softmax loss (JAMsFace) is defined as:

$$\begin{aligned} \mathcal {L}_{\text {JAMs}}= -\frac{1}{N}\sum _{i=1}^{N}\log \frac{e^{s(\cos (\theta _{y_{i}}+m_{a_{y_{i}}})-m_{c_{y_{i}}})}}{e^{s(\cos (\theta _{y_{i}}+m_{a_{y_{i}}})-m_{c_{y_{i}}})}+\sum _{j=1,j\ne y_{i}}^{n}e^{s\cos \theta _{j}}}, \end{aligned}$$
(4)

where \(m_{a_{y_i}}\) is the angular margin corresponding to the target class \(y_i\), denoting how much the angle is increased, while \(m_{c_{y_i}}\) is the cosine margin corresponding to the target class \(y_i\), denoting how much is subtracted in the cosine space. The intuition behind joint adaptive margins can be illustrated using the previous simple binary classification example (with two classes \(C_1\) and \(C_2\)). To correctly classify a sample \(x\) as \(C_1\), the normalized softmax only requires \(\cos (\theta _1) > \cos (\theta _2)\). With joint adaptive margins, we instead require \(\cos (\theta _1+m_a)-m_c > \cos (\theta _2)\), where \(m_a, m_c > 0\). This is a more stringent decision, since \(\cos (\theta +m_a)\) is lower than \(\cos (\theta )\) for \(\theta \in [0, \pi - m_a]\), and \(\cos (\theta )-m_c\) is also lower than \(\cos (\theta )\). This approach makes the decision boundary between the two classes flexible and class-specific, leading to more effective discrimination power in the trained model.
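The text does not fix how the per-class margins are adapted; one plausible reading, in the spirit of AdaptiveFace [20], treats \(m_{a}\) and \(m_{c}\) as trainable per-class parameters. The sketch below implements Eq. 4 under that assumption and omits any regularizer that would keep the margins positive and non-trivial.

```python
import torch
import torch.nn.functional as F

class JAMsHead(torch.nn.Module):
    """Sketch of Eq. 4 with per-class angular (m_a) and cosine (m_c) margins.
    Modeling the margins as trainable parameters is an assumption, not
    necessarily the authors' exact adaptation rule."""
    def __init__(self, embed_dim, num_classes, s=64.0, m_init=0.3):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(num_classes, embed_dim) * 0.01)
        self.m_a = torch.nn.Parameter(torch.full((num_classes,), m_init))
        self.m_c = torch.nn.Parameter(torch.full((num_classes,), m_init))
        self.s = s

    def forward(self, x, y):
        cos = F.linear(F.normalize(x), F.normalize(self.W))  # cos(theta_j), (N, n)
        idx = torch.arange(len(y))
        theta_y = torch.acos(cos[idx, y].clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        target = torch.cos(theta_y + self.m_a[y]) - self.m_c[y]  # joint margins
        logits = cos.clone()
        logits[idx, y] = target
        return F.cross_entropy(self.s * logits, y)
```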

In Fig. 3, the proposed JAMsFace and other margin-based softmax losses are illustrated geometrically. CosFace, ArcFace, and JAMsFace can all be understood geometrically as projecting face features onto a hypersphere, where each face feature is a point in a high-dimensional space. The face recognition task is then to find a decision boundary that separates these points into their respective classes: a hyperplane that partitions the hypersphere into regions, where each region corresponds to a different face class. These methods enhance discrimination power by finding an optimal boundary that maximizes inter-class distance and minimizes intra-class distance. However, JAMsFace stands out among the margin-based softmax losses by allocating a larger joint margin to further compact the poor class (orange arc), which implicitly optimizes the underlying real space.

Table 1 Decision boundaries for class 1 under binary classification. Note that \(\theta _i\), \(i=1,2\) is the angle between \(W_i\) and x. s is the scale factor, and m is the constant margin. \(m_{a_1}\) and \(m_{c_1}\) are the angular and cosine margins of class 1, respectively
Fig. 5 Feature distribution visualization of several loss functions

Table 2 Verification comparison with state-of-the-art methods on five benchmarks, reported in terms of accuracy (%). On LFW, CPLFW, and CFP-FP, JAMsFace consistently extends state-of-the-art performance. JAMsFace achieves results comparable to the state-of-the-art on AgeDB-30 and CALFW

3.3 Comparison with other loss functions

To provide a deeper insight into the differences between our method and other margin-based softmax losses, we analyze the distinctions in the decision boundaries of various margin-based softmax loss functions. These methods mainly differ in terms of margin placement and whether the margin is uniform or class-related (adaptive margin).

Decision Boundaries. The decision boundaries under binary classification are provided in Table 1 and Fig. 4. Figure 4 intuitively illustrates that the decision boundary of the original softmax loss is a dividing line. However, a sample located close to the boundary can cause misclassification due to insufficient discrimination. In contrast, both CosFace and ArcFace utilize a uniform margin between the classes, which does not consider the sample distribution of each class and leads to poor generalization. To address these limitations, JAMsFace employs class-specific margins in both cosine and angular spaces. For poor classes, such as class 1, a larger margin is used, resulting in more compact feature extraction and pushing the actual boundary of class 1 away from class 2. This approach effectively enhances the discrimination power of the trained model.
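For reference, the class 1 decision boundaries compared in Table 1 can be written out from Sects. 3.1 and 3.2 (\(\theta _i\) is the angle between \(W_i\) and x):

$$\begin{aligned}&\text {Softmax:}\quad \cos \theta _1> \cos \theta _2\\ &\text {CosFace:}\quad s(\cos \theta _1 - m)> s\cos \theta _2\\ &\text {ArcFace:}\quad s\cos (\theta _1 + m)> s\cos \theta _2\\ &\text {JAMsFace:}\quad s(\cos (\theta _1 + m_{a_1}) - m_{c_1}) > s\cos \theta _2 \end{aligned}$$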

Toy Example. To better illustrate the robustness and impact of our proposed JAMsFace, we conducted a toy experiment comparing the feature distributions obtained by different loss functions. Specifically, we created a toy dataset by selecting face images from eight identities in the MS1MV2 dataset [5] and trained several 10-layer ResNet models that output three-dimensional features. The dataset consists of eight classes, where class 0 (red) has the most samples (over 500), and classes 1 and 2 (yellow and blue) are rich classes with about 250 samples each. The remaining classes (3 to 7) are poor classes with about 60 samples each. This distribution roughly simulates the sample number distribution of MS-Celeb-1M. To visualize the features, we normalized the three-dimensional features and plotted them on a sphere, as sketched below.
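The visualization step can be sketched as follows; the feature and label arrays (and their file names) are hypothetical stand-ins for the outputs of a forward pass over the toy set.

```python
import numpy as np
import matplotlib.pyplot as plt

feats = np.load("toy_feats.npy")    # (num_samples, 3) embeddings; hypothetical dump
labels = np.load("toy_labels.npy")  # (num_samples,) class ids 0-7; hypothetical dump

# Project the 3-D features onto the unit sphere before plotting.
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(feats[:, 0], feats[:, 1], feats[:, 2], c=labels, cmap="tab10", s=4)
ax.set_box_aspect((1, 1, 1))  # keep the sphere undistorted
plt.show()
```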

Figure 5 depicts that the softmax loss function prefers the rich classes (classes 0, 1, and 2) and allocates a larger space for them, which leads to noticeable ambiguity in decision boundaries. In contrast, CosFace and ArcFace reduce the intra-class variations and assign equal space to each class, regardless of the sample distribution. For instance, the pink and yellow points occupy almost the same space. Our proposed JAMsFace, however, focuses on optimizing both the poor and rich classes to be more compact. Comparing CosFace, ArcFace, and JAMsFace, it is evident that the rich class 0 (red points) occupies almost the same area, while for the poor classes (pink, green, orange, and purple), our approach produces more compact and separable features.

4 Experiments and analysis

To analyze the performance of our proposed JAMsFace model, we first introduce our implementation details, then perform extensive experiments on the most widely used benchmarks for face recognition, and finally compare our method with recent state-of-the-art deep face recognition models.

4.1 Implementation details

For data preprocessing, we follow the commonly used approach as in recent works [1, 5, 46]: Each face image is cropped to a size of 112 \(\times \) 112, using a similarity transformation based on the five face landmarks (two eyes, a nose, and two mouth corners) detected by MTCNN [43]. Finally, the RGB pixel values are normalized from [0, 255] to \([-1, 1]\).

Learning Strategy. We use slightly modified versions of well-known CNN architectures such as ResNet50 and ResNet100 for the embedding network, following [1, 5, 46]. The models are trained on an NVIDIA Quadro RTX 8000 with a batch size of 512 using the stochastic gradient descent (SGD) algorithm with a momentum of 0.9 and a weight decay of \(5\textrm{e}{-4}\). The learning rate for CASIA-WebFace [42] and VGGFace2 [2] starts at 0.001 and is reduced by a factor of ten at the 20th, 28th, and 32nd epochs, for a total of 34 epochs. The learning rate for the larger datasets is divided by ten at the 10th, 18th, and 22nd epochs, stopping after 24 epochs. Regarding the memory buffer, we set the momentum \(\alpha = 0.3\), and \(\gamma \) was set to 0.5 and 0.7 for the smaller and larger datasets, respectively. All experiments were implemented using PyTorch [27].
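A minimal sketch of this schedule in PyTorch is given below; `model` and `train_one_epoch` are hypothetical placeholders for the backbone-plus-head and the training loop.

```python
import torch

# SGD with the momentum/weight-decay settings above; milestones follow the
# CASIA-WebFace/VGGFace2 schedule (drop at epochs 20, 28, 32; 34 epochs total).
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[20, 28, 32], gamma=0.1)

for epoch in range(34):
    train_one_epoch(model, opt)  # hypothetical helper running one epoch
    sched.step()
```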

Train data. We follow the trend in recent works and separately utilize CASIA-WebFace [42], VGGFace2 [2] and the refined MS1MV2 [5] as our training sets. This enables a direct and fair comparison with state-of-the-art methods. The MS1MV2 is a refined version of the MS-Celeb-1M [8] containing 5.8 M images of 85K identities. These datasets are publicly available.

Test Settings. In the test phase, we input the cropped and aligned 112 \(\times \) 112 images into the trained model to generate a 512-dimensional feature vector and normalize it into a unit-length vector. In the classification stage, cosine similarity is used to calculate the distance between the output features and each category, thereby realizing face recognition.
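A minimal sketch of this matching step, assuming `model` maps aligned 112 x 112 face crops to 512-dimensional embeddings and using an illustrative decision threshold:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def same_identity(model, img_a, img_b, threshold=0.3):  # threshold illustrative
    """1:1 verification: embed both faces, l2-normalize, compare cosine similarity."""
    ea = F.normalize(model(img_a), dim=1)  # (1, 512) unit-length embedding
    eb = F.normalize(model(img_b), dim=1)
    sim = (ea * eb).sum(dim=1)             # cosine similarity of unit vectors
    return bool(sim.item() > threshold)
```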

4.2 Evaluation metrics and benchmarks

Table 3 Results on IJB-B. We cite the results from the original papers for [2, 40, 41]. For the reimplemented methods, we use the hyperparameters that lead to the best results on the validation set. Results are in % and higher values are better

Metrics. Face recognition performance is typically assessed through two primary tasks: verification and identification, and each has its respective evaluation metrics. Two sets of samples, a gallery and a probe, are required to evaluate these tasks. The gallery is a set of known identities registered in the face recognition system, while the probe set consists of faces that must be recognized for verification or identification. The face recognition system decides whether to accept the matching of a probe face and a gallery face by comparing their similarity, calculated through some measurement between their features, with a given threshold. Specifically, when a probe face and a gallery face have the same identity, a true acceptance (TA) occurs if their similarity is above the threshold, and a false rejection (FR) occurs if their similarity is below the threshold. Conversely, when they are different identities, a true rejection (TR) occurs if their similarity is below the threshold, and a false acceptance (FA) occurs if their similarity is above the threshold. These basic concepts form the foundation of the evaluation metrics.
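These quantities translate directly into the TAR@FAR metric reported later; a minimal sketch, assuming the genuine and impostor similarity scores have already been collected into arrays:

```python
import numpy as np

def tar_at_far(genuine, impostor, far=1e-4):
    """Choose the threshold at which the false accept rate on impostor scores
    equals `far`, then report the true accept rate on genuine scores."""
    thr = np.quantile(np.asarray(impostor), 1.0 - far)  # fraction above thr ~ far
    return float((np.asarray(genuine) >= thr).mean())

# e.g. tar_at_far(genuine_scores, impostor_scores, far=1e-4) -> TAR at FAR = 1e-4
```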

Benchmarks. Face recognition benchmarks are critical for evaluating the performance of different face recognition models. Labelled Faces in the Wild (LFW) [10] is the commonly used benchmark for face recognition in unconstrained environments. The original LFW protocol includes 3,000 genuine and 3,000 impostor face pairs and evaluates the mean verification accuracy on these 6,000 pairs. In addition to LFW, several other benchmarks are also used for face recognition evaluations. These include CFP-FP [32], CPLFW [44], CALFW [45], and AgeDB [24]. The IJB-B [39] and IJB-C [23] are also used for even more rigorous evaluations. These benchmarks are incredibly challenging, as they include significant pose, illumination, expression, and occlusion variations. These benchmarks are publicly available.

Fig. 6 The ROC curves of JAMsFace and other state-of-the-art methods on the IJB-B dataset

4.3 Evaluation results

Results on LFW, CALFW, CPLFW, CFP-FP and AgeDB-30.

We train our JAMsFace on MS1MV2 using ResNet100 as the backbone and provide a comprehensive comparison with other SOTA methods. To maintain consistency with previous approaches, we follow the unrestricted with labeled outside data protocol to report the results, as outlined in [5, 11]. Additionally, we use 7000 test pairs for CFP-FP and 6000 pairs for the remaining datasets to ensure an objective and reliable evaluation of our approach, as in [5]. The results in Table 2 demonstrate that our proposed method outperforms recent methods. Notably, our JAMsFace surpasses existing methods in terms of verification accuracy on LFW, CPLFW, and CFP-FP and achieves comparable results on the other two datasets. Compared to AdaptiveFace [20], which employs an adaptive margin for either the angular or cosine space, our JAMsFace delivers an impressive enhancement in verification accuracy, with gains of 0.24%, 1.13%, 4.6%, 3.77%, and 0.58% on the LFW, CALFW, CPLFW, CFP-FP, and AgeDB-30 datasets, respectively.

Table 4 Results on IJB-C. We cite the results from the original papers for [2, 40, 41]. For the reimplemented methods, we use the hyperparameters that lead to the best results on the validation set. Results are in % and higher values are better

Results on IJB-B and IJB-C. To exhaustively evaluate the performance of our proposed loss function JAMsFace, we use two of the most challenging face recognition benchmarks, namely IJB-B and IJB-C. To ensure a fair comparison, we reimplement other state-of-the-art methods, including SphereFace, CosFace, ArcFace, and Circle loss. To further ensure consistency, we train all the implemented models on the widely used VGGFace2 dataset using the same CNN architectures as in [22, 35, 38]. VGGFace2 comprises 3.1 M images from 8.6K different identities. By employing this testing methodology, we can unbiasedly compare the efficacy of JAMsFace against other recent approaches.

IJB-B [39] comprises 1845 subjects with 21.8K still images (11.8K faces and 10K non-faces) and 55K frames from 7K videos. The standard 1:1 verification and 1:N identification protocols are used for the experiments. The protocol defines 12,115 templates and a list of 8,010,270 comparisons. Specifically, in the 1:1 verification protocol, 10,270 genuine matches and 8 M impostor matches are constructed, and in the 1:N identification protocol, 10,270 probes and 1875 galleries are constructed. IJB-C [23] is an extension of IJB-B that uses similar evaluation protocols. It adds 1661 new subjects to IJB-B, for a total of 31.3K still images (21.3K faces and 10K non-faces) and 117.5K frames from 11.8K videos.

The evaluations are presented in Tables 3 and 4, with results expressed in terms of true accept rates (TARs) at varying false accept rates (FARs) for verification and true positive identification rates (TPIRs) at varying false positive identification rates (FPIRs) for identification. Remarkably, the proposed JAMsFace loss achieves the best performance on both identification and verification tasks compared to the baseline methods SphereFace, CosFace, ArcFace, and Circle loss. On the IJB-B benchmark (Table 3), JAMsFace scored a TAR at FAR = 1e-4 of 89.09%, outperforming CosFace [35] and ArcFace [5] with 86.75% and 88.79%, respectively. Similarly, on the IJB-C benchmark (Table 4), JAMsFace achieved a TAR at FAR = 1e-4 of 91.81%, surpassing the performance of CosFace [35] and ArcFace [5] with 89.55% and 91.47%, respectively. Furthermore, the full ROC curves of JAMsFace on IJB-B and IJB-C are shown in Figs. 6 and 7, respectively. The results demonstrate that JAMsFace achieves strong performance even at the challenging FAR = 1e-5 setting.

Fig. 7 The ROC curves of JAMsFace and other state-of-the-art methods on the IJB-C dataset

4.4 Ablation study

Table 5 Effects of different joint margins. All models are trained on VGGFace2 using the 64-layer CNN architecture. MA, AA, and AC are abbreviations for the multiplicative angular margin, additive angular margin, and additive cosine margin, respectively. F and A represent fixed and adaptive settings, respectively

We report performances on LFW, AgeDB-30, CALFW, and CPLFW, as well as on the combined dataset from [38], in our ablation study. The combined dataset merges these four validation datasets.

We demonstrate the effectiveness of the joint adaptive margins in our proposed loss function, JAMsFace, by comparing it with other alternatives that employ different fixed/adaptive margin settings. Equation 3 highlights three types of margins: multiplicative angular margin (MA), additive angular margin (AA), and additive cosine margin (AC). Our experimental analysis primarily focuses on assessing the impact of joint adaptive margins.

In Table 5, we initially evaluate the performance of fixed joint alternatives to investigate the impact of utilizing a joint margin. Subsequently, we introduce adaptive margins alongside fixed margins. Finally, we replace the fixed margins with adaptive margins to dynamically adapt to the data distribution.

Based on the analysis of Table 5, it is clear that the introduction of a mixed approach, which combines adaptive and fixed margins, results in improved performance across all datasets compared to the fixed margin alternatives. Notably, the mixed alternatives with adaptive additive angular or cosine margins demonstrate superior performance compared to those with an adaptive multiplicative margin. Additionally, the utilization of adaptive versions for both angular and cosine margins significantly enhances the performance and outperforms all other alternative methods.

These results provide direct evidence that the adaptive versions outperform the fixed ones, highlighting the effectiveness of adaptive margins in enhancing the discriminative ability of our approach.

To conclude, our approach achieves higher verification performance across multiple datasets and outperforms other methods employing either fixed joint margins or adaptive margins. These results demonstrate that our joint adaptive margin approach effectively balances the intra-class compactness and inter-class separability, leading to superior performance on face recognition tasks.

5 Discussion

Table 6 Verification performance results reported in terms of accuracy (%)

In this paper, we introduced a new loss function that offers significant flexibility in setting margins based on class distribution. Unlike existing methods such as CosFace [35], ArcFace [5], and AdaptiveFace [20], JAMsFace reduces intra-class variance and increases inter-class variance by utilizing joint adaptive margins in both the angle and cosine spaces. This approach encourages the model to learn more discriminative features and improves face recognition accuracy. Additionally, the utilization of joint adaptive margins not only enhances face representation but also improves overall performance.

To further validate the effectiveness of the proposed loss function, we implemented a fixed-margin version of our JAMsFace, which we call ArcPlusCos loss, along with re-implementations of three other state-of-the-art loss functions: CosFace, ArcFace, and AdaptiveFace. To ensure fair comparisons, all the implemented losses and JAMsFace were trained on VGGFace2 [2] using the 64-layer CNN architecture from [22, 35]. The angular and cosine margins were set to the best values reported in previous works [5, 35]. Table 6 depicts the verification performance results, measured in terms of accuracy, on several popular benchmark datasets. The results demonstrate that our proposed method outperforms the fixed-margin variants and the single adaptive-margin variant on all evaluation datasets, highlighting the significant improvement achieved by introducing joint adaptive margins.

In addition to its performance benefits, JAMsFace offers practical advantages by effectively addressing the class imbalance problem in face recognition and by leveraging both cosine and angular margins to tackle the challenges posed by unconstrained environments. It provides robustness to class imbalance, unconstrained environments, generalization to unseen classes, and can be integrated into existing face recognition frameworks. However, it is important to carefully evaluate the computational complexity and consider the specific requirements of the application when assessing the practicability of JAMsFace.

Nevertheless, it should be noted that our proposed method still requires a substantial amount of training data to achieve optimal performance. The joint penalty imposed by the cosine and angular margins in JAMsFace also adds computational complexity to the training process. To address these limitations, future research can explore techniques to improve the computational efficiency of margin-based softmax losses, especially for deployment in real-world scenarios involving mobile devices or cloud-based systems. Moreover, while JAMsFace has shown the best results on the cross-pose CPLFW dataset, there is still room for improvement in this area. Further research can explore alternative loss functions that better handle cross-pose face recognition and enhance the model's ability to handle pose variations.

6 Conclusions

In this paper, we propose a novel loss function named JAMsFace, which combines adaptive cosine and angular margin penalties to avoid using a single constant penalty margin. Our motivation is that real training data exhibit large variations within and between classes, so the fixed margin used in many margin-based softmax losses may not be optimal for learning the distances within and between different classes. Additionally, margin-based approaches typically improve discrimination only in the angle or cosine space, focusing on one boundary while ignoring the other. Therefore, we replace the fixed margin with a class-related margin in both the cosine and angular spaces. This allows the model to learn a specific margin for each class, adapting to its intra-class variations while maintaining inter-class separability. The experimental results on several widely used benchmarks demonstrate that JAMsFace outperforms current state-of-the-art face recognition methods. Specifically, JAMsFace advances the state-of-the-art face recognition performance on LFW, CPLFW, and CFP-FP and achieves comparable results on CALFW and AgeDB-30. Moreover, JAMsFace achieves a TAR at FAR = 1e-4 of 89.09% and 91.81% on the IJB-B and IJB-C benchmarks, respectively.