1 Introduction

Remote authentication has attracted great interest both within and beyond academic circles, given the increasing reliance on online applications as well as the onset of conditions such as the COVID-19 pandemic. Such circumstances call for an easy-to-use, accurate, and efficient authentication system. Along this thread, Automatic Speaker Verification (ASV) systems and other biometric systems, such as face recognition, electronic signatures, iris-based methods, and hybrid methods, have been proposed as a means of satisfying user needs [11, 24]. Nevertheless, virtually all of these systems are vulnerable to spoof attacks. A system based on face recognition, for example, may be spoofed by simply displaying a person’s image (photograph) to the system [10]. Likewise, a fingerprint system can be spoofed by copying a fingerprint. ASV systems, in particular, are vulnerable to four types of attacks: recording and replaying the voice of the authorized person (replay attack), text-to-speech systems trained on the voice of the targeted person, voice conversion systems, and speaker imitation [7]. The threats that spoof attacks pose to ASV systems are potentially severe and may carry serious implications [26]. In consequence, ASVspoof challenges have been held since 2015 for research communities worldwide to make ASV systems more robust against spoofing attacks.

A total of four ASVspoof challenges have been held thus far. The first, in 2015, covered only speech synthesis and voice conversion attacks (also called the logical access scenario) [32]. A variety of methods and systems were used by the challenge organizers to produce spoofed samples, exciting the interest of many researchers intrigued by both the challenge and the dataset provided therein. The second ASVspoof challenge, held in 2017, focused on the replay attack (also called the physical access scenario) [6, 14]. In order to test the performance of countermeasure systems under real conditions, the organizers produced the dataset in different environmental conditions and with different devices. More comprehensive conditions were investigated in 2019 to account for all three attacks considered in previous challenges [30], ushering in an extensive dataset built with state-of-the-art voice conversion and speech synthesis systems. Spoofed samples in this challenge were more realistic and more challenging, owing to the improvements made to the spoofing systems in the intervening years. Replay attack samples, in particular, were produced with greater degrees of control, and a tandem detection cost function (t-DCF) metric was used as the primary metric to assess the efficiency of integrating countermeasures with ASV systems. The 2021 edition was considerably more challenging than its predecessors, with data that move ASVspoof closer to practical application scenarios and a new deepfake task; participants were required to develop their models using only the training and development sets of the 2019 edition [34].

Results on the ASVspoof 2019 dataset suggest two primary drawbacks of the proposed methods. The first is a lack of generalization and a high error rate against unseen attacks, clearly observed in the difference between the errors obtained on the training, development, and evaluation sets. To address this, numerous studies have tried to improve generalization by fusing several models (ensemble models) [15]. Such fusion and ensemble methods, built on deep neural networks, have led to considerable increases in model parameters as well as in the required floating-point operations (FLOPS). Under such circumstances, using the proposed models in certain applications becomes infeasible. This motivates the integration of simple yet efficient countermeasure techniques with ASV systems to make them more robust. Moreover, the existing body of research fails to provide a detailed understanding of how models detect spoof attacks or handle the generalization issue. This ambiguity stems from the incapacity of humans, or rather of human-oriented decision-making, to differentiate between the spoofed and bonafide samples detected by the final system. A detailed examination of this issue can provide further insight into the development of better systems.

The primary purpose of this work is to provide a model for detecting spoofing attacks on ASV systems. An interpretable attention mask in a new modular architecture is used for this purpose, via the introduction of perception and attention branches in the model. Furthermore, for the first time in this domain, the EfficientNet-A0 [25] architecture is employed to achieve a system with a low number of parameters and FLOPS. The proposed architecture, together with the newly combined loss function and masks that provide a more human-oriented perspective, was used to obtain comparable and, in some cases, top-performing results against these spoofing attacks.

The following section provides a brief review of relevant studies conducted in recent years. The proposed countermeasures and the loss function are introduced in Sect. 3. Section 4 describes the general configuration used for the experiments, and Sect. 5 gives the analysis results along with a summary of the work. The study is concluded in Sect. 6.

2 Related Work

This section reviews some of the research carried out on spoof attack detection, taking a look at the best-performing methods as per the results obtained on the ASVspoof 2019 dataset. Similar models and tasks are also covered, including new architectures, applications of attention mechanisms, and approaches to the loss function.

Models proposed to assess the ASVspoof 2019 dataset can be categorized into two main classes: methods based on the extraction and engineering of features and methods based on the classifier architecture. Methods of the first category incorporate features such as Mel-filter Frequency Cepstral Coefficients (MFCC), Inverted Mel-filter Frequency Cepstral Coefficients (IMFFC), Constant Q Cepstral Coefficients (CQCC), Group Delay (GD) gram, Instantaneous Amplitude (IA), Instantaneous Frequency (IF), x-vectors, and features from deep learning models [1, 3, 13, 19, 23, 27, 29, 33]. Some methods also extract features from the raw signal using approaches such as SincNet [35] or the Variational Auto-Encoder (VAE) [5]. The second category deals with a variety of classifiers, including neural network-based methods such as VGG [35], Squeeze-Excitation (SE) residual networks, Siamese networks [17], and recurrent networks [12], as well as traditional GMM-based methods. Certain methods have also used end-to-end structures for this purpose.

Z. Wu et al. [31] propose a feature-genuinization-based light convolutional neural network (LCNN) system for the detection of synthetic speech attacks. They transform input features so that their distribution moves closer to that of genuine speech, and feed the transformed features to the proposed LCNN to detect synthetic speech attacks. In another work, X. Cheng et al. [4] proposed a replay detection system based on a novel CQT-based modified group delay (MGD) feature that exploits the phase of the CQT. An 18-layer ResNeWt model is used to detect the replay attacks. Their models were evaluated on the ASVspoof 2019 physical access dataset and showed, compared with the CQCC-GMM baseline system, a significant improvement in the ability to detect both the distortion introduced by the playback device and the reverberation introduced by far-field recording.

Cheng-I Lai et al. proposed a deep model to obtain discriminative features in both the time and frequency domains [16]. The proposed design places an attentive filtering network ahead of a ResNet classifier: the filtering network, built around a filter-based attention mechanism with dilated convolutions rather than fully connected layers, masks the input features to enhance or suppress commonly extracted patterns, and the residual network then classifies the attended input maps. The obtained results suggest relatively high model performance, owing to the combination of an attention mechanism that produces attention masks with an appropriate classifier.

X. Li et al. employed the Res2Net architecture, which has achieved significant results in various computer vision tasks [20]. They proposed a new architecture by revising ResNet blocks to allow for multi-scale features. In a Res2Net block, the input feature maps are divided into several groups of channels within a residual structure similar to the original ResNet. Because each group of channels covers a different effective receptive field, the covered area increases across groups, yielding features at different scales. This modification improves system performance and the model’s generalization against unseen attacks. In addition, the architecture reduces the number of model parameters relative to the original ResNet structure while improving performance. The obtained results show that the Res2Net50 model outperforms the ResNet34 and ResNet50 models in both the physical and logical access scenarios. The authors also showed that integrating the block with Squeeze-and-Excitation (SE), producing SE-Res2Net blocks, leads to better performance. Figure 1 illustrates the structure of these blocks (a minimal sketch of the channel split follows the figure). Significant results were also obtained in both scenarios for the proposed SE-Res2Net50 network based on SE-Res2Net blocks and the Constant-Q Transform (CQT) feature. The network proposed in that work has nearly 0.9 million parameters, which is relatively small compared to other architectures. However, its main drawback is the high number of FLOPS, which increases runtime in the inference phase due to the multiplicity of blocks and the structure of SE-Res2Net.

Fig. 1
figure 1

ResNet, Res2Net, and SE-Res2Net blocks [20]
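To make the multi-scale mechanism concrete, the following is a minimal PyTorch sketch of the hierarchical channel split at the core of a Res2Net block; the group count, layer sizes, and omission of the surrounding 1\(\times \)1 convolutions are simplifications of ours, not the exact design of [20].

```python
import torch
import torch.nn as nn

class Res2NetSplit(nn.Module):
    """Sketch of the Res2Net multi-scale split: input channels are divided
    into `scale` groups; each group (except the first) is convolved after
    receiving the previous group's output, so later groups see increasingly
    large receptive fields."""

    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        # One 3x3 convolution per group, except the identity-passed first group.
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, kernel_size=3, padding=1)
             for _ in range(scale - 1)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.chunk(x, self.scale, dim=1)
        out = [groups[0]]                     # first group: identity
        prev = None
        for i, conv in enumerate(self.convs):
            inp = groups[i + 1] if prev is None else groups[i + 1] + prev
            prev = conv(inp)                  # hierarchical residual connection
            out.append(prev)
        return torch.cat(out, dim=1)
```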

Zhang et al. focused on logical access attacks in their work [36], attributing the lack of model generalization against unseen attacks to the formulation of spoof detection as a binary classification problem. The difficulty with a binary classifier is that the training and test distributions of spoof samples are not the same: test samples generated by systems or conditions absent from the training data shift the spoof distribution, whereas this is not the case for bonafide samples. The problem was therefore redefined as one-class classification, where the distribution of the target class should be the same in the training and test datasets, irrespective of whether the other classes have similar distributions. In this setting, the primary objective is to learn the bonafide distribution and define a rigid decision boundary around it so that unseen samples from other classes cannot cross it. To this aim, a one-class softmax loss function was incorporated to learn a feature space that maps bonafide samples into a dense region while maintaining a good margin from spoofed samples. Finally, using the ResNet-18 network and Linear-filter Frequency Cepstral Coefficient (LFCC) features, the authors attained top-performing results for logical access attacks.

3 Proposed Model

3.1 Network Architecture

The overall architecture of the proposed network was designed with three main objectives in mind: (a) the architecture should be small, with an appropriately low number of parameters; (b) it should maintain an acceptable runtime so as to be usable in most ASV applications; and (c) the designed architecture should be interpretable by humans. Put differently, the architecture was required to somehow express what discriminates bonafide speech from speech produced in a spoof attack, as a means to improve systems in the future. Lastly, the model was required to deliver performance comparable to the relevant classifiers used for this purpose.

To achieve all of these goals, the Efficient Attention Branch Network (EABN) is proposed in this study. The framework adopts the well-performing Attention Branch Network [8] from computer vision as the main idea behind the EABN architecture. As shown in Fig. 2, the network consists of two branches: attention and perception. The attention branch seeks to improve the performance of the perception branch by producing an attention mask, which is applied to make the discriminative parts of the input feature map more prominent. Beyond improving the performance of the perception branch, the masks produced by the attention branch are also interpretable from a human point of view. The primary workload is performed in the perception branch, which produces the probability output for each class.

Fig. 2
figure 2

Proposed Efficient Attention Branch Network architecture

3.2 Attention Branch

The attention branch comprises two main parts, as shown in Fig. 3. The input feature map is initially fed into four consecutive basic blocks, which extract the appropriate features and convert the input into 16 feature maps. Each block consists of two convolution layers with 3\(\times \)3 kernels, each followed by a batch normalization layer. In addition to feature extraction, the first convolution layer also increases the number of feature maps, while the second handles feature extraction exclusively. The resulting 16 feature maps are then reduced to a single map using a convolution layer with a 1\(\times \)1 kernel, which passes through a softmax layer to yield the final attention mask.

The other part makes the attention mask human-interpretable. A convolution operation transforms the 16 feature maps into as many maps as there are classes in the problem, which in this study are the bonafide and spoof classes. A global average pooling layer then converts these two feature maps into a 2\(\times \)1 tensor. Finally, applying softmax yields the probability of the input belonging to each class. These probabilities are later used in optimizing the proposed combined loss function. Through this optimization, the feature maps are generated so that, in addition to helping the perception branch, they can be used for classification and remain interpretable from a human perspective.
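A minimal PyTorch sketch consistent with this description is given below; the channel progression inside the four basic blocks, the ReLU activations, and applying the mask softmax over all time-frequency positions are our assumptions, since the text does not pin them down.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions, each followed by batch normalization; the
    first convolution changes the number of feature maps."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        )

    def forward(self, x):
        return self.body(x)

class AttentionBranch(nn.Module):
    """Sketch of the attention branch: four basic blocks ending in 16 maps,
    a 1x1 convolution producing the attention mask g(x), and a class head
    (1x1 conv -> global average pooling -> logits)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Channel progression up to 16 maps is an illustrative assumption.
        self.blocks = nn.Sequential(
            BasicBlock(1, 4), BasicBlock(4, 8), BasicBlock(8, 16), BasicBlock(16, 16)
        )
        self.mask_conv = nn.Conv2d(16, 1, kernel_size=1)
        self.cls_conv = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        feats = self.blocks(x)                          # (B, 16, F, T)
        mask = self.mask_conv(feats)                    # (B, 1, F, T)
        b, _, f, t = mask.shape
        # Softmax over all time-frequency positions (one plausible reading).
        mask = torch.softmax(mask.view(b, -1), dim=1).view(b, 1, f, t)
        logits = self.cls_conv(feats).mean(dim=(2, 3))  # GAP -> (B, num_classes)
        return mask, logits
```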

Fig. 3
figure 3

Proposed architecture for Attention branch

3.3 Perception Branch

The perception branch can be constructed from almost any classifier. However, as the primary objectives of this study call for a low number of parameters, low runtime, and good performance, the EfficientNet architecture is employed. EfficientNet has served as a high-performing model in image classification tasks [28] and in speech processing tasks such as speech recognition [21] and keyword spotting [25]. The base architecture of the EfficientNet family, EfficientNet-B0, has about 4 million parameters, which is unsuitable for the target applications of this study. Instead, the approach introduced in the EfficientNet-Absolute Zero (EfficientNet-A0) work [25], which applies the compound scaling method in reverse, was used. The scaling method (S) shrinks a base model (M) by simultaneously decreasing its depth (\(\alpha \)), width (\(\beta \)), and input image resolution (\(\gamma \)). A formulation of this method is given below as an optimization problem, in which the goal is to satisfy the intended constraints while maximizing the performance of the final model.

$$\begin{aligned} \begin{aligned}&\max _{\alpha ,\ \beta ,\ \gamma }\ \textrm{Accuracy}(S(M,\ \alpha ,\ \beta ,\ \gamma )) \\&\qquad \text{ s.t. } \\&\qquad \qquad \frac{1}{20} \le \alpha \cdot \beta ^{2} \cdot \gamma ^{2} \le \frac{1}{16},\\&\qquad \qquad 0.2 \le \alpha ,\ \beta \le 0.6, \ \gamma =2 \end{aligned} \end{aligned}$$
(1)
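To preview the search space this problem implies, the following is a small sketch, assuming \(\gamma \) fixed at 2 as in the constraint, that enumerates the feasible \((\alpha , \beta )\) pairs; the step size follows the grid search described next.

```python
# Sketch of the (alpha, beta) search implied by Eq. 1, with gamma fixed at 2.
# The interval and step size follow the grid search described in the text;
# the variable names are ours, not from the paper.
import numpy as np

GAMMA = 2.0
grid = np.arange(0.2, 0.6 + 1e-9, 0.005)

feasible = [
    (round(a, 3), round(b, 3))
    for a in grid
    for b in grid
    if 1 / 20 <= a * b**2 * GAMMA**2 <= 1 / 16
]
print(len(feasible))  # candidate models to train on a small data subset
# The reported optimum, alpha = 0.2 and beta = 0.25, lies exactly on the
# lower bound: 0.2 * 0.25**2 * 4 = 0.05 = 1/20.
```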

The two parameters \(\alpha \) and \(\beta \) are set by applying a grid search on the interval [0.2, 0.6] with steps of 0.005. Nineteen candidate models were then evaluated on a small subset of samples, and \(\alpha \) and \(\beta \) were ultimately set to 0.2 and 0.25, respectively. \(\gamma \) was set to \(\approx \)2, given the input image size (513\(\times \)400) and the EfficientNet-B0 input size of 256\(\times \)256. Figure 4 illustrates the final model obtained for the perception branch, which has 95,000 parameters. The input to this branch is \(m(x_i)\), where \(x_i\) is the input image for the \(i\)-th sample; \(m(x_i)\) is calculated from the following equation:

$$\begin{aligned} m\left( x_{i}\right) =\left( 1+g\left( x_{i}\right) \right) \times x_{i} \end{aligned}$$
(2)

where \(g(x_i)\) is the attention mask produced for the \(i\)-th sample by the attention branch (a sketch of this forward pass is given below). The output of this network is a vector of length 256 representing the embedding of the input image, which is used in two ways. The first uses the vector with a fully connected layer and a softmax layer to yield class probabilities for each sample. The second uses the vector as input to the loss function, so that samples are embedded in a 256-dimensional space as discriminatively as possible.
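Putting the two branches together, a minimal PyTorch sketch of the EABN forward pass under Eq. 2 could look as follows; both sub-modules are placeholders, and only the 256-dimensional embedding and two-class output follow the text.

```python
import torch
import torch.nn as nn

class EABN(nn.Module):
    """Sketch of the overall EABN forward pass (Eq. 2).

    `attention_branch` returns the mask g(x) and its own class logits;
    `perception_branch` maps the masked feature to a 256-d embedding,
    from which a fully connected layer produces class logits."""

    def __init__(self, attention_branch: nn.Module, perception_branch: nn.Module,
                 embed_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.attention_branch = attention_branch
        self.perception_branch = perception_branch
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor):
        mask, ab_logits = self.attention_branch(x)
        m_x = (1.0 + mask) * x                    # Eq. 2: residual attention
        embedding = self.perception_branch(m_x)   # (B, 256)
        pb_logits = self.classifier(embedding)
        return embedding, pb_logits, ab_logits
```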

Fig. 4
figure 4

Proposed architecture for perception branch operating via the reverse compound scaling method

3.4 Loss Function

To train the model parameters, a combined loss function (Eq. 3) was used to account for all of the study objectives.

$$\begin{aligned} L_{\text{ total }}=L_{\textrm{PB}}+\lambda _{\textrm{AB}} L_{\mathrm {\textrm{AB}}} \end{aligned}$$
(3)

To train an attention branch capable of producing interpretable masks, the output of the attention branch is fed to a weighted cross-entropy (CE) loss function (\(L_\textrm{AB}\) in Eq. 3), whose contribution to the total loss is weighted by the coefficient \(\lambda _\textrm{AB}\). The Triplet Center Loss (TCL) (\(L_\textrm{tc}\)) is then used to train the embedding vectors. TCL works in the same way as the triplet loss, except that it no longer needs to mine triplets for training, which makes the training process faster and more stable. This loss function maintains a center point for each class, initially assigned random values, and pushes the samples of each class closer to their own center and away from the centers of the other classes. The two centers used in this study, representing the spoof and bonafide samples, are \(C_\textrm{spoof}\) and \(C_\textrm{bonafide}\), respectively. The goal is to ensure that bonafide samples lie close to the center of their target class, \(C_\textrm{bonafide}\), and away from \(C_\textrm{spoof}\). As a result, samples of a given class come to occupy a dense region of the embedding space. \(L_\textrm{tc}\) is obtained for sample \(x_i\) as follows:

$$\begin{aligned} L_{\textrm{tc}}\left( x_{i}\right) =\left\{ \begin{array}{ll} \max \left( D\left( f_{i}, C_{\textrm{spoof}}\right) +m-D\left( f_{i}, C_{\textrm{bonafide}}\right) ,\ 0\right) \times w_{\textrm{spoof}} &{} \text{ if } x_{i} \in \{\text{spoof samples}\} \\ \max \left( D\left( f_{i}, C_{\textrm{bonafide}}\right) +m-D\left( f_{i}, C_{\textrm{spoof}}\right) ,\ 0\right) \times w_{\textrm{bonafide}} &{} \text{ if } x_{i} \in \{\text{bonafide samples}\} \end{array}\right. \end{aligned}$$
(4)

where \(f_i\) is the feature (embedding) vector obtained from input \(x_i\), and \(D(\cdot ,\cdot )\) denotes the distance function. \(w\) is the per-class weight accounting for the unbalanced number of instances in each class, and \(m\) is the margin, which forces the distance of a sample to its own class center to be at least \(m\) smaller than its distance to the opposite center. The loss function is further augmented with a cross-entropy term to improve the final results and maintain the stability of the optimization. Given that spoof samples vary in difficulty, the focal loss in the equation below is used instead of plain cross-entropy.

$$\begin{aligned} L_{\text{focal}}\left( p_{t}\right) =-\alpha _{t}\left( 1-p_{t}\right) ^{0.005} \log \left( p_{t}\right) \end{aligned}$$
(5)

Finally, the loss function of the perception branch is calculated from Eq. 6.

$$\begin{aligned} L_{\textrm{PB}}\left( x_{i}\right) =L_{\textrm{tc}}\left( x_{i}\right) +\lambda _{\text{focal}} L_{\text{focal}}\left( x_{i}\right) \end{aligned}$$
(6)
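A minimal PyTorch sketch of the full objective (Eqs. 3 to 6) is given below; the Euclidean distance for \(D\), the class indexing (spoof = 0, bonafide = 1), and the default hyperparameter values (taken from Sect. 4.4) are assumptions of this sketch rather than details fixed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletCenterLoss(nn.Module):
    """Sketch of Eq. 4 with two learnable centers; Euclidean distance is
    assumed for D, and `weights` handles the class imbalance."""

    def __init__(self, embed_dim: int = 256, margin: float = 32.0,
                 weights=(1.0, 1.0)):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(2, embed_dim))  # random init
        self.margin = margin
        self.register_buffer("weights", torch.tensor(weights))

    def forward(self, f: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        d = torch.cdist(f, self.centers)                  # (B, 2) distances
        d_own = d.gather(1, y.unsqueeze(1)).squeeze(1)    # D(f_i, C_own)
        d_other = d.gather(1, (1 - y).unsqueeze(1)).squeeze(1)
        hinge = torch.clamp(d_own + self.margin - d_other, min=0.0)
        return (hinge * self.weights[y]).mean()

def focal_loss(logits, y, alpha, gamma: float = 0.005):
    """Sketch of Eq. 5; the exponent follows the value printed in the paper."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, y.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-alpha[y] * (1.0 - pt) ** gamma * log_pt).mean()

def total_loss(embedding, pb_logits, ab_logits, y, tcl, class_weights,
               lambda_focal: float = 0.005, lambda_ab: float = 0.1):
    """Eqs. 3 and 6: perception branch loss (TCL + weighted focal term)
    plus the weighted attention branch cross-entropy."""
    l_pb = tcl(embedding, y) + lambda_focal * focal_loss(pb_logits, y, class_weights)
    l_ab = F.cross_entropy(ab_logits, y, weight=class_weights)
    return l_pb + lambda_ab * l_ab
```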
Table 1 Summary of the ASVspoof 2019 dataset
Table 2 t-DCF hyperparameter values

4 Experimental Configuration

4.1 Dataset and Evaluation Metrics

The proposed method was evaluated using the ASVspoof 2019 and 2021 datasets, each of which includes two scenarios: physical access (PA) and logical access (LA). Details of the dataset are shown in Table 1. Furthermore, since one objective of this research is the joint use of a countermeasure with an ASV system, the tandem detection cost function (t-DCF) and equal error rate (EER) metrics are used. The t-DCF was introduced as the primary evaluation metric of the 2019 challenge and is calculated as:

$$\begin{aligned} \text{t-DCF}(s)=C_{1} P_{\textrm{miss}}^{\textrm{cm}}(s)+C_{2} P_{\textrm{fa}}^{\textrm{cm}}(s) \end{aligned}$$
(7)

where \(P_{\textrm{fa}}^{\textrm{cm}}(s)\) and \(P_{\textrm{miss}}^{\textrm{cm}}(s)\) are the false acceptance and false rejection error rates of the countermeasure, respectively. Given a threshold s, the two error rates are obtained as follows:

$$\begin{aligned} P_{\textrm{miss}}^{\textrm{cm}}(s) = \frac{\#\{\text{bona fide trials with CM score} \le s\}}{\#\{\text{total bona fide trials}\}} \end{aligned}$$
(8)
$$\begin{aligned} P_{\textrm{fa}}^{\textrm{cm}}(s) = \frac{\#\{\text{spoof trials with CM score} > s\}}{\#\{\text{total spoof trials}\}} \end{aligned}$$
(9)

The two constants \(C_1\) and \(C_2\) represent the predefined costs of the errors, determined from prior probabilities as shown below:

$$\begin{aligned} \left\{ \begin{array}{l} C_{1}=\pi _{{\text {tar}}}\left( C_{\textrm{miss}}^{\textrm{cm}}-C_{\textrm{miss}}^{\textrm{asv}} P_{\textrm{miss}}^{\textrm{asv}}\right) -\pi _{\textrm{non}} C_{\textrm{fa}}^{\textrm{asv}} P_{\textrm{fa}}^{\textrm{asv}} \\ C_{2}=C_{\textrm{fa}}^{\textrm{cm}} \pi _{\textrm{spoof}}\left( 1-P_{\textrm{miss}, \textrm{spoof}}^{\textrm{asv}}\right) \end{array}\right. \end{aligned}$$
(10)

Here, \(C_{\textrm{miss}}^{\textrm{asv}}\) represents the cost of the ASV system falsely rejecting the genuine person, and \(C_{\textrm{fa}}^{\textrm{asv}}\) the cost of falsely accepting the wrong person. Each countermeasure error likewise carries a cost: \(C_{\textrm{miss}}^{\textrm{cm}}\) for recognizing a bonafide sample as spoof, and \(C_{\textrm{fa}}^{\textrm{cm}}\) for accepting a sample produced by a spoofing system as bonafide. In addition, the prior probabilities of the genuine (\(\pi _{\textrm{tar}}\)), non-target or imposter (\(\pi _{\textrm{non}}\)), and spoof attack (\(\pi _{\textrm{spoof}}\)) classes are considered, subject to \(\pi _{\textrm{tar}}+\pi _{\textrm{non}}+\pi _{\textrm{spoof}}=1\). The cost and probability values used are listed in Table 2.
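For reference, a small NumPy sketch of evaluating Eqs. 7 to 9 over all candidate thresholds might look as follows; \(C_1\) and \(C_2\) are treated as constants precomputed from Eq. 10 and Table 2.

```python
import numpy as np

def t_dcf_curve(cm_scores: np.ndarray, is_bonafide: np.ndarray,
                C1: float, C2: float) -> np.ndarray:
    """Sketch of Eqs. 7-9: evaluate t-DCF(s) at every candidate threshold s.
    `cm_scores` are countermeasure scores; `is_bonafide` is a boolean mask."""
    thresholds = np.sort(cm_scores)
    bona = cm_scores[is_bonafide]
    spoof = cm_scores[~is_bonafide]
    # Eq. 8: fraction of bona fide trials scored at or below s.
    p_miss = np.array([(bona <= s).mean() for s in thresholds])
    # Eq. 9: fraction of spoof trials scored above s.
    p_fa = np.array([(spoof > s).mean() for s in thresholds])
    return C1 * p_miss + C2 * p_fa  # Eq. 7; its minimum gives min t-DCF
```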

4.2 Feature Extraction and Engineering

Based on past research, a single acoustic feature is considered for each attack. For the PA scenario, we use the logarithm of the power spectrum (logPowSpec) with 25 ms frames, a 10 ms step size, a 1024-point FFT (with zero padding applied if needed), and a Hamming window. All samples are first transformed into 4 s voice segments: samples shorter than 4 s are repeated until a 4 s segment is reached, and longer samples are divided into non-overlapping 4 s segments, each treated as an individual utterance. The final input is a logPowSpec of dimensions 513\(\times \)400 (512 log-magnitude spectrum bins plus the DC component). For the LA scenario, the LFCC feature is extracted following the procedure of the baseline model of the ASVspoof 2019 challenge: 20 ms frames with a 512-point Fast Fourier Transform, using 20 LFCC features plus their first and second derivatives. The result is a two-dimensional tensor of dimensions 60\(\times \)400.
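A minimal NumPy sketch of this PA front-end is shown below; it keeps only the first 4 s segment for brevity and assumes the 16 kHz sampling rate typical of ASVspoof audio.

```python
import numpy as np

def log_pow_spec(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Sketch of the PA front-end: repeat/cut to 4 s, then a log power
    spectrogram with 25 ms Hamming windows, 10 ms hop, and a 1024-point
    FFT (513 bins including DC)."""
    target = 4 * sr
    if len(wav) < target:                      # repeat short utterances
        wav = np.tile(wav, int(np.ceil(target / len(wav))))
    wav = wav[:target]                         # first 4 s segment, for brevity
    frame, hop, n_fft = int(0.025 * sr), int(0.010 * sr), 1024
    window = np.hamming(frame)
    frames = np.stack([wav[i:i + frame] * window
                       for i in range(0, len(wav) - frame + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2   # zero-padded to 1024
    return np.log(spec + 1e-10).T              # (513, ~400)
```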

As a further step, the SpecAugment technique [22] is applied for better training and generalization. This method works well for most speech processing tasks, such as speaker verification, speech recognition, and keyword spotting [25]. It is implemented by applying zero masks on the time and frequency axes of each training sample with a probability of 0.25. For logPowSpec inputs, the mask (band) size is selected randomly between 20 and 80 frames on the time axis and between 25 and 100 bins on the frequency axis; for the LFCC coefficients (with their derivatives), the zero mask spans between 20 and 80 frames on the horizontal axis and between 5 and 20 on the vertical axis.
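The masking step admits a short sketch; since the description leaves it ambiguous whether the 0.25 probability applies per axis or per sample, the per-axis reading is assumed here, and the default band sizes are those quoted for logPowSpec.

```python
import numpy as np

def spec_augment(feat: np.ndarray, p: float = 0.25,
                 time_band=(20, 80), freq_band=(25, 100)) -> np.ndarray:
    """Zero out one randomly placed band per axis, each with probability p.
    For LFCC inputs the frequency band would be (5, 20) instead.
    `feat` has shape (freq, time)."""
    feat = feat.copy()
    n_freq, n_time = feat.shape
    if np.random.rand() < p:                       # time mask
        w = np.random.randint(*time_band)
        t0 = np.random.randint(0, max(1, n_time - w))
        feat[:, t0:t0 + w] = 0.0
    if np.random.rand() < p:                       # frequency mask
        w = np.random.randint(*freq_band)
        f0 = np.random.randint(0, max(1, n_freq - w))
        feat[f0:f0 + w, :] = 0.0
    return feat
```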

4.3 Perception Branch Models

In addition to the proposed EfficientNet-A0 model used in the perception branch, an SE-Res2Net50 model, which has achieved significant results, was also used. The two models were then compared in terms of both efficiency and performance, and the EABN modularity idea was evaluated accordingly.

4.4 Training Procedure

The final results obtained from experiments on small subsets of the ASVspoof 2019 dataset yielded values of 0.1, 0.005, and 32 for \(\lambda _{AB}\), \(\lambda _{focal}\), and m, respectively. To optimize the loss function with these values, the configuration of the SE-Res2Net50 architecture was adopted. For the Adam optimizer, \(\beta _1\), \(\beta _2\), and \(\epsilon \) were set to 0.9, 0.98, and \(10^{-9}\), respectively. The learning rate is warmed up linearly over the first 1000 steps and then decays in proportion to the inverse of the square root of the step number. All models were trained for 40 epochs, and the model with the lowest EER on the development set of the dataset was selected as the optimal choice. Batch sizes were set to 64 and 128 when using EfficientNet-A0 as the perception branch module with the LFCC and logPowSpec features, respectively. Due to the relatively greater number of parameters of the SE-Res2Net50 model compared to EfficientNet-A0, a batch size of 8 was used for both the LFCC and logPowSpec features. The models were trained on a GTX-1080Ti GPU under Linux. The source code of our implementation, based on Python and PyTorch, is publicly available.
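Under this warm-up reading of the schedule, a minimal sketch is given below; the peak learning rate is an assumption, as only the shape of the schedule is described.

```python
def noam_lr(step: int, warmup: int = 1000, peak: float = 1e-3) -> float:
    """Linear warm-up for the first `warmup` steps, then decay proportional
    to the inverse square root of the step number. `peak` is assumed.
    Usable e.g. via torch.optim.lr_scheduler.LambdaLR with base lr 1.0."""
    step = max(step, 1)
    return peak * min(step / warmup, (warmup / step) ** 0.5)
```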

5 Results

5.1 Evaluation of Perception Branch’s Models

This section evaluates the overall EABN architecture and the EfficientNet-A0 network as a classifier for spoof detection. To investigate EABN performance, the EfficientNet-A0 and SE-Res2Net50 architectures were used for the perception branch, which, to the best of our knowledge, have the lowest EER among single models. The results for both attacks are shown in Table 3. In the PA scenario, EfficientNet-A0 performs better than SE-Res2Net50 while having nearly ten times fewer parameters and seven times fewer FLOPS. This can be explained by the strength of the EfficientNet-A0 model in extracting features from the logPowSpec. On the other hand, the SE-Res2Net50 model performs better when LFCC features are used in the LA scenario.

Table 3 Results of the models used in the perception branch, with their input features, on the ASVspoof 2019 evaluation dataset for the PA and LA scenarios. K, M, and G represent kilo, mega, and giga, respectively
Fig. 5
figure 5

Feature embedding visualization of our proposed loss function for evaluation (a) and training (b) sets of the ASVspoof 2019 LA attack. Features were reduced to 2-D space using PCA

5.2 Loss Function

The proposed combined loss function is used here for the first time in this domain to achieve a discriminative vector space that separates spoof samples from bonafide samples. More precisely, the triplet center loss maps input samples into a discriminative space. As shown in Fig. 5, the space learned from the training samples is well suited to the classification problem. Examining the test samples, which include unseen attacks, shows that the resulting space remains reasonably discriminative; the model therefore generalizes well to unseen attacks. The best margin value, 32, was obtained by testing the three values 16, 32, and 64.
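For reference, a projection in the style of Fig. 5 can be reproduced with a few lines of scikit-learn and matplotlib; the embeddings and labels are assumed to come from a forward pass of the trained model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_embeddings(embeddings: np.ndarray, labels: np.ndarray, title: str):
    """Project 256-d embeddings to 2-D with PCA and scatter by class."""
    points = PCA(n_components=2).fit_transform(embeddings)
    for cls, name in enumerate(["spoof", "bonafide"]):
        sel = labels == cls
        plt.scatter(points[sel, 0], points[sel, 1], s=4, label=name)
    plt.legend()
    plt.title(title)
    plt.show()
```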

Fig. 6
figure 6

Average of produced LFCC attention masks for some spoof attacks in ASVspoof 2019 evaluation

5.3 Attention Masks

One of the main goals of the proposed architecture is to obtain attention masks that can be interpreted from a human point of view. This was investigated for the LFCC feature, with the masks averaged over all samples in the evaluation set shown in Fig. 6. Examining the LFCC masks obtained for the different attack systems reveals that the information corresponding to the second derivatives of the LFCC coefficients is very effective in detecting spoof patterns.

Fig. 7
figure 7

Input feature (Inp.), produced attention mask (Att.), and final input feature for the perception branch (Perc.) of some samples in the evaluation set for the logPowSpec feature. B denotes the bonafide class. Red boxes mark parts of the input features that the attention branch emphasizes and that are interpretable from a human point of view

For the logPowSpec mask, a few samples from the evaluation set of the PA attack are shown in Fig. 7, together with the raw input features and the result of applying the mask to them, which forms the input to the perception branch. The masks attenuate or emphasize values at different frequencies: mask points with lower values reduce the impact of logPowSpec values at frequencies with lower capacity to discriminate spoof attacks from bonafide samples, and vice versa. Examining the masks produced for physical access attacks shows a strong emphasis on silent parts, since the effects of recording and playback are easier and clearer to recognize where there is no speech. Accordingly, attending to silence intervals and to feature values in particular frequency bands may lead to better detection of physical access attacks.

5.4 Comparison with Other Single Models

The proposed models are compared with single models and the baseline models according to the stated objectives. Some of the top-performing models used for this purpose are compared with the proposed model in Table 4. For the LA attack, the LFCC+SEResABNet+CombLoss model achieves an EER of 1.89% and a t-DCF of 0.507, outperforming the baseline LFCC-GMM model. The proposed model also outperforms its corresponding base model (LFCC+SEResNet50+CE) by approximately 0.98%. Comparison with other works shows that this model outperforms LFCC+ResNet18+OCS, which, to the best of our knowledge, had shown state-of-the-art performance. For physical access attacks, the LogPowSpec+EABN+CombLoss model achieves an EER of 0.86% and a t-DCF of 0.0239, significantly better than the base models. Compared to the results reported in the 2019 challenge, the proposed model also outperforms 90% of the methods, which use fusion models. These results, together with other favorable properties such as fewer parameters and shorter runtime relative to other models, demonstrate the efficiency of the proposed EABN model. Finally, we evaluated the best proposed models obtained on the ASVspoof 2019 dataset on the 2021 version, as shown in Table 5. The results show that these models perform better than all the baseline models and indicate good performance on LA attacks.

Table 4 Performance comparison of the proposed systems with known single systems tested on the ASVspoof 2019 PA and LA evaluation sets. Models are named based on their input feature, classification model, and loss function
Table 5 Performance comparison of the proposed systems with ASVspoof 2021 baseline systems tested on the ASVspoof 2021 PA and LA evaluation set

6 Conclusion

Spoof detection is a major security concern in authentication systems, particularly ASV systems, demonstrating a clear need for solutions to combat spoof attacks. There are generally two approaches to detecting spoofing attacks on ASV systems: the first is to develop an appropriate classifier targeted specifically at detecting attacks, while the second operates as a preliminary step that extracts discriminative features. In the former case, most classifiers fail to consider optimality in terms of the number of parameters and runtime. Moreover, most proposed models are not interpretable from a human point of view, and features chosen according to expert knowledge tend to lack generalization to unseen attacks. In contrast, a modular architecture based on attention and perception branches gives the system the ability to easily utilize any classifier or method, to produce an interpretable attention mask, and to improve the classification task. To this end, the proposed combined loss function, particularly the triplet center loss, yields a discriminative feature space that helps achieve a model that generalizes better to unseen attacks.

The proposed model and loss function were evaluated on the ASVspoof 2019 data. Using the logPowSpec and LFCC features, along with the first-time use of the EfficientNet-A0 architecture and the efficient SE-Res2Net50, this study provides a novel method for detecting spoofs. The findings show that the LFCC+SEResABNet+CombLoss model achieves an EER of 1.89% and a t-DCF of 0.507 in the logical access scenario, which, to the best of our knowledge, outperforms the state-of-the-art methods. The EABN+CombLoss model obtained an EER of 0.86% and a t-DCF of 0.0239 in the physical access scenario, better than 90% of the models presented in the ASVspoof 2019 challenge. It is worth noting that EfficientNet-A0 consists of only 95,000 parameters. The findings also shed light on certain patterns observed in the produced attention masks. For example, LFCC features outperformed MFCCs in detecting logical access attacks. Likewise, for detecting replay attacks, focusing more on silent segments and on certain frequency ranges within the human speech band can improve performance.

In this research, we achieved the goals defined for a suitable countermeasure system. The first was to provide a system that generalizes to unseen attacks; to this end, we proposed the modular EABN architecture along with the combined loss function. Another main goal was a system with a suitably small number of parameters and FLOPS; we optimized EfficientNet-A0 and used it in the perception branch, obtaining a model with few parameters and FLOPS that achieves comparable results. As future work, given that the proposed method has a modular architecture, other methods and models can be used in the branches and their performance investigated. Branches can also be fused in a multi-branch network where each branch uses a specific architecture or a specific input feature. Finally, the effect of using several perception branches, trained together or separately, can be examined.