1 Introduction

With the emerging threat of deepfake spoofing attacks, a new generation of deepfake researchers must pay serious attention to audio. Malicious users have exploited this technology for illegal and deceptive purposes, and several approaches have been suggested to mitigate the risk of spoofed speech attacks. Driven by the automatic speaker verification (ASV) and countermeasures challenges, spoofed audio detection has gained significant attention. Two major types of spoofing detection systems exist: the first relies on neural network classifiers [1], while the second builds on acoustic features derived through signal processing [2]. Although these algorithms have shown excellent results, research demonstrates that their efficiency decreases when unseen spoofing attacks occur. The ASVspoof 2019 challenge introduced physical access (PA) and logical access (LA) tasks for detecting replay attacks and synthesized speech, respectively, to aid in developing anti-spoofing techniques. Research bodies and individuals have proposed numerous strategies to combat audio spoofing attacks [3]. Moreover, advanced machine learning (ML)-based approaches are employed to differentiate audio using knowledge-based and data-driven countermeasures [4].

In contrast, conventional ML approaches require time-consuming manual feature engineering, which may cause them to overlook the deeper information underlying audio spectrograms. Dinkel et al. suggested a convolutional neural network (CNN)-based end-to-end model that takes raw waveforms as input [5]. Chintha et al. proposed a recurrent convolutional neural network (RCNN) structure to detect fake audio [6]. To recognize spoofing attacks, [7] employed a lightweight CNN model (LCNN) with the softmax loss function. Combining multiple classifiers enhances classification performance; hence, utilizing various feature representations along with ResNet [8] or additional classifiers was examined in [9] for better efficiency.

Nevertheless, a generalization problem remains for unseen attacks; it is therefore necessary to develop a reliable and effective system that can detect fraudulent audio from any source. Our primary goal is to develop a novel and efficient framework to address this issue. This paper presents an effective framework for identifying fake audio based on a VGGish network combined with a Convolutional Block Attention Module (CBAM). Owing to its simple layered architecture, our model is well suited for audio spoofing detection. It captures complex relationships within the feature maps by attending to both spatial locations and specific channels of the mel-spectrograms, and CBAM provides an adaptive mechanism to adjust this attention based on the content of the input data.

Our framework consists of three distinct stages. In the first stage, mel-spectrograms, which are visual representations of auditory features, are generated. Next, a deep multilayer network is trained on the ASVspoof 2019 [2] dataset to distinguish between real and spoofed audio. Finally, the network classifies mel-spectrograms into two classes: real and spoofed. More precisely, our model is trained on the ASVspoof 2019 dataset and cross-validated on the ASVspoof 2021 dataset, which comprises an additional deepfake speech set. According to the experiments, the classification accuracy achieved on the ASVspoof 2019 dataset is 99.2%, with a 0.8% error. Furthermore, in the cross-validation phase, the proposed detector efficiently distinguishes fake audio.

The following are the main contributions of our proposed model:

  • To propose an effective method for audio classification, analogous to image classification techniques, based on an enhanced deep learning model combined with a CBAM module.

  • CBAM is an effective and powerful attention block to select the most relevant representations from mel-spectrograms.

  • The proposed model is evaluated on the Physical Access (PA) and Logical Access (LA) sets of ASVspoof 2019 and cross-validated on the ASVspoof 2021 dataset. It produces remarkable results, demonstrating that the proposed technique is highly effective for fake audio classification.

  • The attention block makes our technique more robust to variations in the input audio’s mel-spectrograms, such as changes in background noise, scale, and orientation.

The remainder of the paper is organized as follows: Section 2 outlines related work; Section 3 describes the methodology; Section 4 explains the experimental framework; Section 5 presents the performance evaluation; Section 6 discusses the findings; and finally, Section 7 concludes the work.

2 Related work

The use of fabricated or modified audio has risen with the arrival of advanced artificial intelligence technologies, especially deep neural networks. The ease of producing audio deepfakes raises multiple privacy and security threats; therefore, deepfake audio detection techniques are inevitably needed. Considering the misuse of such techniques and the possible harm they may bring, numerous researchers have developed deepfake detection algorithms. Monteiro et al. [1] introduced a dual-adversarial domain adaptation paradigm as well as a multi-model ensemble strategy. However, these solutions required equal amounts of authentic and fake data. Mahum et al. [3] proposed an ensemble-based learning method for the detection of text-to-speech synthesized speech and attained significant results. However, they focused only on text-to-speech synthesis and did not consider other types of synthesized speech.

A support vector machine (SVM)-based classifier combined with a Gaussian mixture model (GMM) was used for speaker verification [4], achieving equal error rates of 4.92% and 7.78% in the 2006 NIST speaker recognition evaluation. The researchers suggested the GMM together with relative phase shift features and an SVM to minimize the limitations of the speaker authentication system.

Furthermore, for the identification of fake speech, a comprehensive comparison of the hidden Markov model (HMM) and deep neural network (DNN) was conducted [5]. The approach in [6] used spectrograms in the form of images as input to a convolutional neural network (CNN), establishing a foundation for audio processing using visual representations. Multiple feature descriptors, such as Mel-frequency cepstral coefficients (MFCCs) and spectrograms, were employed in [7], and the influence of GMM-UBM on accuracy was investigated in terms of EER; the results showed that combining diverse feature descriptors produces superior outcomes. Chao et al. [8] used two key methods to verify speakers, namely kernel Fisher discriminant (KFD) analysis and SVM, and achieved better results than their earlier GMM-UBM-based work. Furthermore, by replacing the dot product between two utterances with that of two i-vectors, the computational cost of the polynomial-kernel SVM was reduced. The authors used a feature selection strategy to achieve a 64% reduction in feature dimensionality with an EER of 1.7% [9].

Loughran et al. [10] used a genetic algorithm (GA) with an altered cost function to overcome the problem of imbalanced data (where one class contains more samples than the other). Malik et al. [11] investigated audio integrity and created a method for audio fraud detection based on acoustic impressions of the surroundings; however, these models were unable to address generated audio content precisely. In [12], a DNN-based classifier was proposed that uses human log-likelihoods (HLL) as a scoring measure, which was shown to be superior to traditional log-likelihood ratios (LLR); a variety of cepstral coefficients were also used to train the classifier. Moreover, a convolutional neural network (CNN) was employed for audio classification [13, 14]. In [15], an exhaustive comparison of deep learning approaches for fake audio detection was made, indicating that CNN- and RNN-based models outperform all other strategies. The authors of [16] showed that spectral features such as MFCCs are superior to other spectral features as the model’s input for detecting synthetic speech.

Furthermore, [17] discussed the problems and weaknesses of fake-audio detection methods. A bispectral approach for analyzing and detecting synthetic sound was developed in [18]; it examined bispectral characteristics, i.e., unusual spectral artifacts introduced by DNN-based speech synthesis, and relied on higher-order polyspectral features to distinguish fake audio. In [19], a capsule-network-based technique was proposed; the authors improved the system’s generalization and thoroughly evaluated the artifacts to improve overall performance, and they also used their network to explore replay attacks. In [20], the authors proposed a model for fake audio detection named DeepSonar, analyzing network layers and activation patterns for various input audios to examine the difference between fake and real speech. They employed three English and Chinese datasets and attained an average accuracy of 98.1%.

In [21], Mahum et al. proposed a model named DeepDet for the detection of TTS synthesis using Yet Another Mobile Network (YAMNet). The authors attached an attention block, i.e., the Bottleneck Attention Module, to improve detection performance. The model attained significant performance; however, it was not intended to identify several types of fake speech. Furthermore, for specific tasks, temporal convolutional networks (TCNs) [22] have outperformed classical algorithms such as RNNs and LSTMs. The authors of [23] used original speech recordings to clone voices utilizing the most recent deep-learning techniques for text-to-speech synthesis; it took a couple of minutes to capture a real voice and only seconds to generate fake audio. Although the approaches have advanced since [24], the difficulty of achieving naturalness persists. Furthermore, signal processing strategies rather than machine learning algorithms are used to generate fake utterances in the VoCo and double-voice models [25, 26]. Based on the learned mapping, the synthesized voice copies the original speech’s tone, tempo, rhythm, style, and textual content, and the amount of fake speech that can be generated depends on the length of the real speech. Because real and synthetic voices sound so similar, it is quite easy to deceive the listener, and fake speech could even be presented as evidence in court.

Although various models based on machine learning algorithms are employed to identify fake audio, machine learning-based systems require numerous steps, such as pre-processing of audio data, custom feature extraction, feature selection, and classification, which may increase computational cost and human effort. On the other hand, a few researchers have used CNNs for feature extraction before applying a conventional classifier [27].

A small number of datasets dedicated to fake audio detection systems have recently been generated. Reimao et al. [28] produced a synthetic speech identification dataset. The collection contained fake speech generated by open-source tools that employ contemporary speech synthesis technology. Wang et al. [29] created a fake English and Mandarin dataset using an open-source audio converter and voice synthesis engine. However, the ASVspoof databases are significant since they contain a variety of ways to generate spoofed speech to conduct tests for the development of fake audio detectors. Several existing techniques for spoofing detection are reported in Table 1.

Table 1 Currently available spoofing detectors

3 Methodology

The proposed system comprises three phases: feature extraction, training, and classification. First, the audio data from the ASVspoof 2019 dataset is gathered and resampled to 16,000 Hz with a single channel. Then, a feature extraction layer is employed to generate mel-spectrograms. Our proposed model extracts the most representative features from these mel-spectrograms, since the mel-spectrograms of fake audio differ from those of real audio due to the presence of breathing sounds in the latter [43]. A CBAM module is employed to extract the most representative feature maps. Further, according to our observations, the proposed detector is a novel and robust model for audio spoofing detection and performs well for all sorts of spoofed speech. Figure 1 illustrates the system’s flow diagram, Fig. 2 shows the architecture of the deep learning network, and Fig. 3 exhibits the structure of the CNN.
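
To make the feature-extraction stage concrete, a minimal Python sketch is given below. It assumes the librosa library is available; the 25 ms window, 10 ms hop, and mel frequency range are assumptions chosen to produce the 96 × 64 log-mel patches used later in the paper, not values stated by the authors.

```python
# A minimal sketch of the feature-extraction stage, assuming librosa is available.
# Window/hop/frequency-range settings are assumptions; the paper only states
# 16 kHz mono input and 96 x 64 mel-spectrograms.
import numpy as np
import librosa

def audio_to_melspectrogram(path, sr=16000, n_mels=64, n_frames=96):
    """Load an utterance, resample to 16 kHz mono, and return a 96 x 64 log-mel patch."""
    y, _ = librosa.load(path, sr=sr, mono=True)            # resample to 16 kHz, single channel
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=400, hop_length=160, win_length=400,          # 25 ms window, 10 ms hop (assumed)
        n_mels=n_mels, fmin=125, fmax=7500)
    log_mel = np.log(mel + 1e-6).T                          # shape: (frames, n_mels)
    # Pad or crop to a fixed 96-frame patch so every sample has the same input size.
    if log_mel.shape[0] < n_frames:
        log_mel = np.pad(log_mel, ((0, n_frames - log_mel.shape[0]), (0, 0)))
    return log_mel[:n_frames]                               # shape: (96, 64)
```

Each returned patch can then be fed to the network as an image-like input.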

Fig. 1
figure 1

Flow diagram of the proposed method

Fig. 2
figure 2

General architecture of deep learning model

Fig. 3
figure 3

General architecture of CNN

3.1 Convolutional Block Attention Module (CBAM)

The Convolutional Block Attention Module sequentially infers a one-dimensional channel attention map \({N}_c\in {\mathbb{R}}^{c\times 1\times 1}\) and a two-dimensional spatial attention map \({N}_s\in {\mathbb{R}}^{1\times h\times w}\) from an intermediate input feature map \(G\in {\mathbb{R}}^{c\times h\times w}\) [44]. The overall attention process is as follows:

$${G}^{\prime }={N}_c(G)\otimes G,$$
(1)
$${G}^{\prime \prime }={N}_s\left({G}^{\prime}\right)\otimes {G}^{\prime },$$
(2)

During multiplication, the channel attention values are broadcast along the spatial dimension and vice versa. Element-wise multiplication is denoted by ⊗, and G′′ represents the final refined output. The processing behind each attention map is shown in Fig. 4.

Fig. 4
figure 4

Overview of CBAM

The specifics of each attention module are listed below:

Channel attention module: A channel attention map is constructed by leveraging the inter-channel correlation of features. Since each feature map is regarded as a feature detector, channel attention concentrates on “what” is significant in the input image [45]. The spatial dimensions of the input feature map are squeezed to compute the channel attention efficiently. According to Zhou et al. [46], average pooling has been widely used to aggregate spatial information. Figure 5 gives an illustration of each attention sub-module.

Fig. 5
figure 5

Channel attention module

Hu et al. [47] adopted average pooling in their attention module to compute spatial statistics, while [46] recommended employing it to learn the extent of the target object efficiently. In contrast to earlier findings, max pooling captures another crucial clue about distinctive object features, allowing better channel-wise attention to be inferred. Therefore, max-pooled and average-pooled features are used simultaneously; empirical results confirm that integrating both features rather than using either one alone significantly enhances the network’s representation power. The detailed process is described below. First, the spatial information of a feature map is aggregated using both max and average pooling to construct two distinct spatial context descriptors, \({G}_{\textrm{max}}^c\) and \({G}_{\textrm{avg}}^c\), representing the max-pooled and average-pooled features, respectively. Both descriptors are then forwarded to a shared network to produce the channel attention map \({N}_c\in {\mathbb{R}}^{c\times 1\times 1}\). The shared network is a multilayer perceptron (MLP) with a single hidden layer; to minimize the parameter overhead, the hidden layer’s activation size is set to c/r × 1 × 1, where r is the reduction ratio. After each descriptor passes through the shared MLP, the resulting feature vectors are merged by element-wise summation. The channel attention is computed as:

$${N}_c\left(G\right)=\sigma \left(\textrm{MLP}\left(\textrm{MaxPool}\left(G\right)\right)+\textrm{MLP}\left(\textrm{AvgPool}\left(G\right)\right)\right)=\sigma \left({W}_1\left({W}_0\left({G}_{\textrm{max}}^c\right)\right)+{W}_1\left({W}_0\left({G}_{\textrm{avg}}^c\right)\right)\right),$$
(3)

where σ denotes the sigmoid function, and \({W}_0\in {\mathbb{R}}^{c/r\times c}\) and \({W}_1\in {\mathbb{R}}^{c\times c/r}\) are the MLP weights, which are shared by both inputs; the ReLU nonlinearity is applied after W0.
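
A minimal PyTorch sketch of the channel attention in Eq. (3) is given below; the class name ChannelAttention and the default reduction ratio r = 16 are illustrative assumptions, not values taken from the paper.

```python
# A minimal PyTorch sketch of the channel attention in Eq. (3).
# The reduction ratio r = 16 is an assumption.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP with one hidden layer of size c/r, applied to both pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # W0
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))   # W1

    def forward(self, g):                                 # g: (batch, c, h, w)
        avg = self.mlp(g.mean(dim=(2, 3)))                # average-pooled descriptor G^c_avg
        mx = self.mlp(g.amax(dim=(2, 3)))                 # max-pooled descriptor G^c_max
        n_c = torch.sigmoid(avg + mx).view(g.size(0), -1, 1, 1)
        return n_c                                        # channel attention map (batch, c, 1, 1)
```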

Spatial attention module: A spatial attention map is generated by leveraging the inter-spatial relationships of features. In contrast to channel attention, spatial attention emphasizes “where” the informative parts are, complementing the channel attention. To compute the spatial attention, max pooling and average pooling operations are first applied along the channel axis and combined to produce an effective feature descriptor; applying pooling along the channel axis has been shown to effectively highlight informative regions [48]. A convolution is then applied to this feature descriptor to produce a spatial attention map \({N}_s(G)\in {\mathbb{R}}^{h\times w}\) that specifies where to concentrate. The spatial attention module is shown in Fig. 6.

Fig. 6
figure 6

Spatial attention module

A detailed explanation of the procedure follows. First, max and average pooling operations are used to aggregate the channel information of a feature map, creating two two-dimensional maps, \({G}_{\textrm{max}}^s\) and \({G}_{\textrm{avg}}^s\in {\mathbb{R}}^{1\times h\times w}\), representing the max-pooled and average-pooled features across channels. The two-dimensional spatial attention map is then created by concatenating these maps and convolving them with a standard convolutional layer. The computation of the spatial attention is given below, where \({t}^{7\times 7}\) denotes a convolution with a 7 × 7 filter and σ is the sigmoid function:

$${N}_s\left(G\right)=\sigma \left({t}^{7\times 7}\left(\left[\textrm{MaxPool}\left(G\right);\textrm{AvgPool}\left(G\right)\right]\right)\right)=\sigma \left({t}^{7\times 7}\left(\left[{G}_{\textrm{max}}^s;{G}_{\textrm{avg}}^s\right]\right)\right),$$
(4)
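
A matching PyTorch sketch of the spatial attention in Eq. (4) is shown below; the padding of 3 is an assumption that keeps the spatial size unchanged for the 7 × 7 convolution.

```python
# A minimal PyTorch sketch of the spatial attention in Eq. (4).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # Two input channels: the channel-wise max and average maps.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, g):                                  # g: (batch, c, h, w)
        avg = g.mean(dim=1, keepdim=True)                  # channel-wise average pooling
        mx, _ = g.max(dim=1, keepdim=True)                 # channel-wise max pooling
        n_s = torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))
        return n_s                                         # spatial attention map (batch, 1, h, w)
```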

Organization of the attention modules: Given an input image, the channel and spatial attention modules compute complementary attention, emphasizing the “what” and the “where” of the image. The two modules can therefore be placed either in parallel or in sequence; the sequential structure (channel attention followed by spatial attention, as in Eqs. (1) and (2)) is observed to yield better outcomes than the parallel one.
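
The sequential channel-then-spatial arrangement described above can be sketched as follows, reusing the ChannelAttention and SpatialAttention sketches; this mirrors Eqs. (1) and (2).

```python
# Sequential CBAM refinement, reusing the ChannelAttention and SpatialAttention sketches above.
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel_attention = ChannelAttention(channels, reduction)
        self.spatial_attention = SpatialAttention(kernel_size)

    def forward(self, g):
        g = self.channel_attention(g) * g   # G'  = Nc(G)  (x) G   -- Eq. (1)
        g = self.spatial_attention(g) * g   # G'' = Ns(G') (x) G'  -- Eq. (2)
        return g
```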

3.2 Our proposed VGGish

VGGish-CBAM is an enhanced convolutional neural network whose architecture is influenced by the VGG networks widely used for image classification. The network consists of sequences of convolution and activation layers, each followed by a max pooling layer. It transforms the audio input into a semantically meaningful 128-dimensional embedding that can be fed to a downstream classifier, and this embedding is more semantically compact than raw audio features. Our proposed version contains twenty-nine layers altogether, including four batch normalization layers along with the max pooling and ReLU layers. A CBAM module is introduced before the flattening layer. The improved model is shown in Fig. 7, and the layer-wise details of the network are presented in Table 2.

Fig. 7
figure 7

The modified VGGish (VGGish-CBAM) architecture

Table 2 Layer-wise detail of our proposed model

The first convolutional layer uses 128 filters with a 3 × 3 kernel, stride 1, and same padding. Next, a ReLU layer with a scale of 0.1 is used to process the features coming from the convolutional layer and introduce non-linearity into the training. A max pooling layer with stride 2 then reduces the spatial dimensions. This is followed by another convolutional layer with 128 filters of size 3 × 3 and stride 1, whose output is again activated with the ReLU function. Next, two convolutional layers with 256 filters, 3 × 3 kernels, stride 1, and same padding are applied, followed by ReLU with a 0.1 scale and a 2 × 2 max pooling layer with stride 2. After that, a convolutional layer with 512 filters of size 3 × 3 and stride 1 is placed, and its output is activated with ReLU. Lastly, a CBAM module is placed before the fully connected layer.
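
A rough PyTorch sketch of this stack is given below. The authoritative layer order is given in Table 2, which is not reproduced here; the placement of the four batch-normalization layers, the interpretation of the 0.1-scaled ReLU as a leaky ReLU, and the classifier head producing the 128-D embedding are all assumptions. The CBAM class refers to the sketch in Section 3.1.

```python
# A rough sketch of the VGGish-CBAM stack; layer ordering details are assumptions
# where the text and Table 2 are not reproduced here.
import torch
import torch.nn as nn

class VGGishCBAM(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        def block(c_in, c_out):
            # conv 3x3, stride 1, same padding, batch norm, and a leaky ReLU (assumed for "ReLU, scale 0.1")
            return [nn.Conv2d(c_in, c_out, 3, stride=1, padding=1),
                    nn.BatchNorm2d(c_out),
                    nn.LeakyReLU(0.1, inplace=True)]
        self.features = nn.Sequential(
            *block(1, 128), nn.MaxPool2d(2, stride=2),               # 96 x 64 -> 48 x 32
            *block(128, 128),
            *block(128, 256), *block(256, 256), nn.MaxPool2d(2, stride=2),  # -> 24 x 16
            *block(256, 512),
            CBAM(512))                                               # attention before flattening
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, 128),                                     # 128-D embedding (assumed head)
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes))                             # real vs. spoofed

    def forward(self, x):                                            # x: (batch, 1, 96, 64) log-mel patch
        return self.classifier(self.features(x))
```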

Figure 8 presents the difference between the mel-spectrograms of real and fake speech. It is clearly visible that when a pause occurs in real speech, a breathing sound is present, whereas in fake audio there is a sharp blue gap; this is the main difference that our proposed model learns. This is why the model generalizes well: a similar difference exists between real speech and all types of synthesized audio.

Fig. 8
figure 8

The mel-spectrogram of real and fake speech [3]

4 Experimental framework

This section explains the experiments conducted to assess the model’s performance, along with the dataset used. The tandem detection cost function (t-DCF), accuracy, precision, equal error rate (EER), and recall are utilized as assessment measures on the ASVspoof 2019 dataset. The implementation was done in MATLAB 2021 on a system with a 7th-generation Core i5 processor and 12 GB of RAM.

4.1 Dataset

The ASVspoof challenge series began with the ASVspoof 2015 dataset [41], created to assess the detection of TTS and voice conversion (VC) attacks. The ASVspoof 2017 dataset [17] was released to assess replay attack (RA) detection systems. For our experiments, ASVspoof 2019 [2] is used: a diverse and extensive public dataset covering all attack types, i.e., TTS, VC, and RA, separated into two collections: Logical Access (LA) and Physical Access (PA). Both the LA and PA corpora are subdivided into three subsets: training, development, and evaluation. Tables 3 and 4 present the statistics of the PA and LA sets, respectively.

Table 3 Physical access corpus of Asvspoof 2019
Table 4 Logical access corpus of Asvspoof 2019

The LA set comprises spoof and real speech data produced by 17 different VC and TTS systems. The details of the LA set are presented in Table 4.

To improve ASV reliability in reverberant conditions [49, 50], ASVspoof 2019 [2] comprises simulated replay recordings [51,52,53] in varied acoustic environments, as opposed to the ASVspoof 2017 dataset [17], which contained real replay recordings. For the collection of the PA samples, physical characteristics were considered, for example, the size of the room in which the audio was simulated, divided into three categories: small, medium, and large. The talker-to-ASV distance (Ds) was likewise classified into three categories: short, medium, and large. Other properties of the physical space, including the floor, walls, ceiling, and positions within the room, were also considered. The reverberation level, described by the T60 reverberation time, was divided into three categories: short, medium, and high. Three different zones (A, B, and C) were used for recording, each corresponding to a particular recording distance (Da); recordings from zone A were perceived to be of higher quality than those from zones B and C. Table 5 contains the details of the cloning algorithms.

Table 5 Logical access Corpus of ASVspoof 2019

4.2 Experimental protocols

This section contains information on the experimental protocols utilized to assess the model. To evaluate our model on the LA set, a training set of 25,380 samples (2580 genuine and 22,800 spoofed) was used. The proposed technique is evaluated on both the evaluation set and the development set. The development set comprises 24,844 samples (2548 genuine and 22,296 spoofed), whereas the evaluation set contains 71,237 samples (7355 genuine and 63,882 spoofed).

To evaluate the model on the PA set, a training set of 54,000 samples is employed (5400 genuine and 48,600 spoofed). The proposed model is assessed on both sets: the evaluation set and the development set. The development set comprises 29,700 samples (5400 genuine and 24,300 spoofed), whereas the evaluation set contains 134,730 samples (18,090 genuine and 116,640 spoofed samples).

4.3 Evaluation metrics

To assess the performance of our proposed VGGish model, various metrics are employed, i.e., equal error rate (EER), tandem detection cost function (t-DCF), precision, recall, and accuracy. These rely on the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts. TP denotes the fake audio samples that our classifier accurately identified, and FP denotes the number of real audio samples mistakenly categorized as fake. Moreover, audio samples correctly identified as negative, i.e., real, are counted as true negatives (TN), whereas fake audio samples incorrectly identified as negative are counted as false negatives (FN).

Precision is the proportion of true positives (TP) over all audio samples labeled as positive, computed as:

$$\textrm{Precision}=\frac{\textrm{True}\ \textrm{positive}}{\textrm{True}\ \textrm{positive}+\textrm{False}\ \textrm{positive}},$$
(5)

Accuracy indicates the proportion of audio samples that the proposed system categorized correctly. It is computed as follows:

$$\textrm{Accuracy}=\frac{\textrm{True}\ \textrm{Positive}+\textrm{True}\ \textrm{Negative}}{\textrm{True}\ \textrm{Positive}+\textrm{True}\ \textrm{Negative}+\textrm{False}\ \textrm{Positive}+\textrm{False}\ \textrm{Negative}},$$
(6)

Recall is the proportion of actual positives (TP) that the system accurately identifies. The closer the recall value is to 1, the better the model. The recall equation is:

$$\textrm{Recall}=\frac{\textrm{True}\ \textrm{positive}}{\textrm{True}\ \textrm{positive}+\textrm{False}\ \textrm{negative}},$$
(7)

Moreover, EER and t-DCF are standard metrics that are used to analyze the performance of the proposed spoofing detector.

Regarding the equal error rate, speaker verification systems (SVS) are often assessed using two types of errors: the false acceptance rate (FAR) and the false rejection rate (FRR). A false rejection occurs when a genuine speaker is rejected by the system, and a false acceptance occurs when a fake speaker is accepted [54]. The operating point at which the false acceptance rate equals the false rejection rate gives the equal error rate (EER); lower EER values imply higher system accuracy.
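
As an illustration, a minimal sketch of the EER computation is given below, assuming a vector of detector scores (higher means more likely spoofed) and binary ground-truth labels; it uses scikit-learn's ROC utilities rather than the official ASVspoof scoring tools.

```python
# A minimal sketch of the EER computation from detector scores and binary labels.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Return the EER, i.e. the operating point where FAR (false acceptance) equals FRR (false rejection)."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # threshold where the two error rates cross
    return (fpr[idx] + fnr[idx]) / 2
```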

5 Performance evaluation and discussion

5.1 Performance evaluation on PA attacks

The primary objective of this experiment is to evaluate how well our proposed audio spoofing detector works on PA attacks. To achieve this, the audio samples from the PA set are presented as mel-spectrograms to the proposed VGGish model, which classifies genuine and replay samples. As shown in Table 6, EERs of 0.52% and 3.0% are attained on the evaluation and development sets, with min t-DCF values of 0.05 and 0.07, respectively. Based on these outcomes, we conclude that our proposed spoofing detector performs very well, particularly on the evaluation set.

Table 6 Performance evaluation on PA dataset

Our proposed spoofing detection system achieves excellent classification performance: 99.2% accuracy on the evaluation set and 99.5% on the development set. These experimental results demonstrate that our spoofing detector is effective at identifying physical access attacks on the large and diverse ASVspoof 2019 dataset. From this experiment, we conclude that the features learned by our proposed spoofing detector efficiently capture the microphone anomalies and fingerprint information present in the replay samples. The performance plot over the PA set is shown in Fig. 9, where the y-axis presents the value of each metric.

Fig. 9
figure 9

Performance plot over PA corpus

5.2 Performance comparison of voice cloning algorithm

This experiment aims to identify the type of algorithm used to synthesize the samples of the LA set of the ASVspoof 2019 dataset. The Logical Access (LA) set includes both voice conversion (VC) and synthesized (TTS) samples, and its training partition employs six distinct cloning algorithms for speech synthesis. In this experiment, the model is trained on 22,800 spoofed samples of the LA training set and evaluated on 22,296 spoofed samples of the LA development set. The EERs for algorithms A01 to A06 are 0.7%, 2.2%, 1.09%, 1.02%, 1.01%, and 1.05%, respectively. Table 7 summarizes the results.

Table 7 Result of Asvspoof 2019’s cloning algorithms

The results show that our proposed technique achieved the best performance on A02 and the least accurate performance on the A04 and A06 algorithms. The proposed spoofing detector better detects the cloning artifacts of the WORLD waveform generator, since the A02 system performs voice synthesis with the WORLD waveform generator. The A04 system uses waveform concatenation, whereas A06 employs spectral filtering along with an OLA waveform generator. Consequently, our method is marginally less effective at capturing the cloning artifacts of the waveform-concatenation and spectral filtering + OLA waveform generators. Overall, remarkable results have been achieved for cloning algorithm detection. A performance plot of our proposed model on the cloning algorithms of ASVspoof 2019 is shown in Fig. 10.

Fig. 10
figure 10

A comparison plot of ASVspoof 2019’s cloning algorithms

5.3 Performance assessment over TTS and VC

The primary objective of this test is to evaluate the efficacy of our proposed detector on voice conversion (VC) and TTS samples separately. We used 96 × 64 mel-spectrograms and the related features to train our model to differentiate between genuine and falsified TTS and VC samples. The spoofed samples of the LA training set are generated by four text-to-speech spoofing algorithms (A01–A04) and two voice conversion spoofing algorithms (A05 and A06). Thirteen spoofing algorithms are utilized to generate the fake samples of the LA evaluation set, including three voice conversion (A17, A18, and A19), seven text-to-speech (A07, A08, A09, A10, A11, A12, and A16), and three hybrid TTS-VC (A13, A14, and A15) spoofing algorithms.

A multi-stage experiment is used to assess the efficacy of our proposed approach for detecting voice conversion (VC) and text-to-speech (TTS) spoofing individually. In the first step, the model is trained using genuine and fake (TTS) samples from the training set of the LA corpus and evaluated using genuine and fake (TTS) samples from the evaluation set of the LA corpus; an EER of 0.54% and a minimum t-DCF of 0.04 were achieved. In the second step, the model is trained using the genuine and fake (VC) samples from the LA training set and evaluated using the genuine and spoofed (VC) samples from the LA evaluation set; the proposed model attained a minimum t-DCF of 0.40 and an EER of 1.3%. Table 8 presents the comprehensive results of TTS and VC spoofing detection. The performance plot over the TTS and VC sets is shown in Fig. 11.

Table 8 Results on TTS And VC
Fig. 11
figure 11

Performance plot over TTS and VC

The above findings demonstrate that, compared with voice conversion (VC) detection, our proposed method is more effective at text-to-speech (TTS) spoofing detection. The suggested approach captures the artifacts produced by Griffin-Lim, vocoder-based TTS, and neural waveform generators more effectively. This may be because text-to-speech (TTS) samples lack the speakers’ periodic characteristics, whereas voice conversion spoofing uses real voices as a source and therefore retains those characteristics. Overall, the suggested system works efficiently on the LA corpus, with an EER of 0.07%, demonstrating the efficacy of our approach.

5.4 Performance assessment for detecting unseen attacks

This experiment aims to assess how well the suggested system performs against unseen Logical Access attacks, i.e., A07–A19. The evaluation set of the Logical Access corpus contains 63,895 instances of unknown attacks synthesized with these spoofing algorithms. First, the model is trained using genuine and spoofed data from the Logical Access training set. Then, the model is evaluated using bonafide and spoofed samples of the unseen attacks from the evaluation set. The results are summarized in Table 9, and the corresponding plot is shown in Fig. 12. Our approach performed admirably for A07, A10, A11, and A13.

Table 9 Results on unseen LA attacks
Fig. 12
figure 12

Performance plot on unseen LA

In contrast, our method performed poorly for A17, A18, and A19. This experiment indicates that voice cloning-based artificial speech is harder to detect than text-to-speech-based artificial speech. A comparison of waveform-generation techniques demonstrates that attacks using waveform filtering-based approaches (A17, A18, and A19 employ a VC vocoder, VC waveform filtering, and spectral filtering) are among the most challenging to identify. Even though our approach had difficulty capturing the artifacts generated by voice-cloning attacks, it performed well against the Logical Access attacks overall.

5.5 Performance evaluation against existing DL models

In this experiment, our proposed model is compared to various traditional deep learning models, including InceptionNetV4, DenseNet201, MobileNetV2, and EfficientNet. Our suggested model is an enhanced VGGish classifier for accurately detecting fake speech. A comparative analysis of the above-mentioned models is performed using the EER, min-tDCF, accuracy, precision, and recall metrics, as displayed in Table 10. A flowchart describing the algorithms utilized in this experiment is shown in Fig. 13.

Table 10 Performance evaluation against existing DL models
Fig. 13
figure 13

Flowchart for deepfake audio detection models used for comparative analysis

To detect whether an audio sample is fake or authentic, the audio samples are transformed into mel-spectrograms and fed into these deep learning models. In total, 800 audio samples from the PA corpus of the ASVspoof 2019 dataset have been utilized. Comparing the outcomes of these methods, it is evident that our algorithm outperforms the other deep learning algorithms by a substantial margin, with 99.78% accuracy and 98.86% precision.

Our suggested method surpasses all others with an EER of 0.07% and a min-tDCF of 0.030. This comparative study allows us to conclude that our suggested model outperforms conventional DL models and can detect fake audio accurately. The plot is shown in Fig. 14.

Fig. 14
figure 14

Performance plot on evaluation set against existing models

5.6 Cross-validation

The ASVspoof 2021 [55] dataset is used to evaluate the robustness of the proposed model. ASVspoof 2021 is designed to be more challenging than earlier versions: while the training and development partitions remain the same as in the ASVspoof 2019 LA dataset, the assessment set differs, and an additional Deepfake evaluation task is incorporated in the ASVspoof 2021 database. The Deepfake (DF) set of ASVspoof 2021 is used, which consists of audio produced by fusing real and synthetic speech generated through voice conversion (VC) and text-to-speech (TTS) synthesis techniques. The hyperparameters are identical to those of the first experiment. The suggested approach shows outstanding generalization ability and delivers promising results; Table 11 presents the results obtained. These results confirm that our suggested detector is robust and performs well.

Table 11 Observations from the cross-dataset analysis

5.7 Comparison with existing fake audio detectors

In this experiment, the proposed model is compared with existing deepfake audio detectors. Several deep learning-based models have been proposed for fake audio detection [56]. Most existing models take raw audio as input, whereas some take extracted features. The same dataset, i.e., ASVspoof 2019, is used for this experiment. As the results reported in Table 12 show, our proposed VGGish-CBAM-based model achieves better results than the other existing detectors. The better performance is due to the added attention module, which extracts spatial and channel feature maps within the VGGish network. Moreover, the audio is transformed into mel-spectrograms to exploit visual representations of the raw waveform, which makes it easier for our proposed technique to differentiate fake audio from real audio. The mel-spectrograms of fake audio show flat patterns during pauses, whereas in real audio the pauses show varying patterns.

Table 12 The comparison with existing fake audio detectors

Therefore, our proposed technique effectively identifies fake audio. The EER comparison plot is shown in Fig. 15.

Fig. 15
figure 15

EER Comparison plot of several fake audio detectors

6 Discussion

Our approach surpasses state-of-the-art techniques in terms of EER. Our model systematically outperformed existing research [57,58,59,60,61,62] due to its simple and effective architecture and exhibited a remarkable ability to identify Logical Access (LA) attacks. The best-performing existing solution claimed an EER of 1.02% for LA attacks using a TO-RawNet Orth-AASIST architecture [62], which is substantially higher than our model’s score (0.07%). A remarkable 0.07% EER for Logical Access attacks is attained, proving the resilience of our model in recognizing complex attacks. Our Physical Access (PA) results, i.e., an EER of 0.52%, are likewise competitive with other approaches across various replay scenarios and microphone configurations. The advantage of our model is that it can detect both PA and LA attacks with similar effectiveness, unlike systems such as [60], which obtained low EERs specifically for PA attacks. Moreover, cross-validation on the ASVspoof 2021 dataset demonstrates its generalizability, highlighting its applicability to more recent datasets.

In summary, our VGGish-CBAM model is a meaningful advancement in voice spoofing detection. With its strong performance in detecting both Physical Access (PA) and Logical Access (LA) attacks and its versatility in handling diverse attack scenarios, it contributes to the security of voice-based systems.

Along with these significant outcomes, some limitations were encountered. For example, transforming the audio signal into large mel-spectrograms, such as 220 × 190, required high computational resources. We assessed the performance of the detector using mel-spectrograms of varying sizes, i.e., 220 × 190, 180 × 120, and 96 × 64. Training and computation time decreased when smaller mel-spectrograms were used; we therefore chose the 96 × 64 dimensions as network input, which also attained the maximum performance. The reason smaller sizes perform well is that the mel bins are stacked over time and already provide sufficient audio information. Thus, together with the attached attention block, the proposed model reduces computational overhead while providing better performance on small mel-spectrograms.

7 Conclusion

This work developed an audio spoofing detection system that can identify several types of attacks, such as text-to-speech, voice conversion, and replay attacks, using an enhanced deep learning model. A customized VGGish network with a CBAM attention block is used to extract features from the mel-spectrograms of real and fake audio and to categorize them. Our model successfully captures sample dynamics, environmental artifacts, and the different microphone settings of replay attacks. Furthermore, its simple layered architecture makes it a practical technique for audio spoofing detection: the attention module captures complex relationships within the feature maps across both spatial locations and specific channels. The ASVspoof 2019 corpus is employed to evaluate the proposed technique, and the results show that our system is effective at identifying various spoofing attacks, achieving EERs of 0.52% for PA attacks and 0.07% for LA attacks. The proposed approach also provides considerable results in cross-validation on the ASVspoof 2021 dataset.

We also observed that when noisy audio is fed to the model, its performance degrades due to similar loudness patterns in the mel-spectrograms. Therefore, in the future, we aim to fine-tune our model to make it lightweight and more efficient.