Static–dynamic features and hybrid deep learning models based spoof detection system for ASV

Spoof detection is essential for improving the performance of current Automatic Speaker Verification (ASV) systems. Strengthening both the frontend and backend parts can build robust ASV systems. First, this paper discusses a performance comparison of static and static–dynamic Constant Q Cepstral Coefficients (CQCC) frontend features, using a Long Short Term Memory (LSTM) with Time Distributed Wrappers model at the backend. Second, it performs a comparative analysis of ASV systems built using three deep learning models, LSTM with Time Distributed Wrappers, LSTM and Convolutional Neural Network (CNN), at the backend and static–dynamic CQCC features at the frontend. Third, it discusses the implementation of two spoof detection systems for ASV that use the same static–dynamic CQCC features at the frontend and different combinations of deep learning models at the backend. Of these two, the first is a voting protocol based two-level spoof detection system that uses the CNN and LSTM models at the first level and the LSTM with Time Distributed Wrappers model at the second level. The second is a two-level spoof detection system with a user identification and verification protocol, which uses an LSTM model for user identification at the first level and LSTM with Time Distributed Wrappers for verification at the second level. For implementing the proposed work, a variation of the ASVspoof 2019 dataset has been used to introduce all types of spoofing attacks, namely Speech Synthesis (SS), Voice Conversion (VC) and replay, in a single dataset. The results show that, at the frontend, static–dynamic CQCC features outperform static CQCC features, and at the backend, a hybrid combination of deep learning models increases the accuracy of spoof detection systems.


Introduction
Building a robust spoof detection system for Automatic Speaker Verification (ASV) is now an essential task, as attention to and demand for voice protected authentication systems is increasing among users of smart devices. According to a survey, users are eagerly looking forward to using speech driven authentication systems [1]. An ASV system verifies whether the input speech signal is actually spoken by the authentic user or generated through tricks by an imposter to gain access to the legitimate user's account. With the availability of low cost voice sensors and advances in speech technology, ASV systems are exposed to spoofing attacks such as Speech Synthesis (SS) and Voice Conversion (VC), together with replay attacks, which are known as physical access (PA) attacks. The performance of ASV systems is greatly affected in the presence of these spoofing attacks [10]. Various speech corpora enriched with different kinds of spoofing attacks have been proposed. For instance, the ASVspoof 2015 dataset includes SS and VC attacks [11], the ASVspoof 2017 dataset includes only the replay attack [12], the Yoho dataset includes mimicry attacks [13], etc. The recently proposed ASVspoof 2019 dataset includes SS, VC and replay attacks, however, in two separate sets. This paper presents an initiative of putting all kinds of attacks into a single dataset.
Along with considering attacks, robust designs of the frontend and backend of an ASV system can become a preventive shield against spoofing attacks. The frontend of an ASV system uses a speech feature extraction technique to extract useful information from the recorded speech signal. Cepstrum domain features, such as Mel Frequency Cepstrum Coefficients (MFCC), Inverse Mel Frequency Cepstrum Coefficients (IMFCC) [14], Linear Frequency Cepstrum Coefficients (LFCC), Constant Q Cepstrum Coefficients (CQCC), etc., have performed remarkably well for spoof detection tasks, and for speech and speaker recognition tasks as well. These techniques can model the human vocal tract and human auditory system very well [15][16][17]. The human ear is known to be insensitive to the phase of sound; however, this factor can still be utilized for the frontend development of speech driven devices [18,19] by using the All Pole Group Delay Function (APGDF), Modified Group Delay Function (MODGDF), etc. Both static and dynamic coefficients of speech features deliver context information and speaker specific information; these coefficients are passed to the backend spoof detection model. CQCC features are specially designed for spoof detection tasks, proposed in the ASV systems of [20,21], where it is claimed that these features perform better than Instantaneous Frequency Cosine Coefficients (IFCC), MFCC and Epoch Features (EF). The proposed work in this paper also exploits a hybrid of static and dynamic CQCC features for developing the frontend, and presents a performance comparison of static and static-dynamic CQCC features using the Long Short Term Memory (LSTM) with Time Distributed Wrappers model at the backend.
Various machine learning techniques, Gaussian Mixture Model (GMM), Hidden Markov Model (HMM) [22][23][24], Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), etc., are playing a crucial role in classification tasks, including speech based systems [25,26]. In the case of an ASV system, the backend classification model takes the speech features as input and classifies the signal as spoofed or bonafide after analyzing the speaker specific information in them. In the initial research, GMM was used effectively as the backend model [27]. As deep learning algorithms improve day by day, the ASV community has started to use CNN and LSTM models [28][29][30]. In various speech and speaker recognition tasks, LSTM based deep learning models are performing better than the other models; however, CNN models also give satisfactory results [31][32][33]. Also, different arrangements of frontend and backend models can bring smoothness and accuracy to the spoof detection task.
The rest of the paper is organized as follows: the second section discusses the related work; the third section presents the proposed method; the experimental setup details and results are presented in the fourth section; the fifth section explains the performance analysis of the proposed models and systems; the sixth section compares the proposed systems with existing systems; and the seventh section concludes the proposal while dropping some light on future directions.

Related works
This section discusses the related works in this area. The literature is enriched with experiments on various audio feature extraction techniques at the frontend and different classification models at the backend. Research done by Valenti et al. [34] discusses an approach in which the end-to-end speech signal is passed to an evolving Recurrent Neural Network (RNN). The system used in their work is designed with an RNN and neuroevolution of augmenting topologies. Their work considers the replay attack in particular.
The review done by Kamble et al. [35] presents a wide analysis of many existing ASV spoof systems from the perspective of the ASVspoof challenge. Lai et al. [36] proposed an Attentive Filtering Network and ResNet classifier based system to detect replay attacks; the proposed attention-based filtering approach is used to improve feature representations. Their work used the ASVspoof 2017 Version 2.0 dataset to attain a very low Equal Error Rate (EER), and the authors claimed an improvement of about 30% over the existing ASVspoof 2017 enhanced baseline system.
The ASVspoof 2019 challenge puts the three different types of attacks in one dataset and presents baseline models with LFCC and CQCC features at the frontend and GMM at the backend [27]. Chettri et al. [10] trained various deep learning backend models and tested them with different feature extraction approaches at the frontend. These backend models were further combined to get three ensemble models, and all the systems were tested for physical access and logical access attacks.
Recently, Dua et al. [30] also proposed an ensemble approach using LSTM based deep learning models at the backend, and three different feature extraction techniques, Constant Q Cepstral Coefficients (CQCC), Inverse Mel Frequency Cepstral Coefficients (IMFCC) and MFCC, at the frontend. The authors claimed that their proposed ensemble model with CQCC features outperforms several existing ASV systems.
Motivated by these works, the proposed work in this paper compares the performances of different deep learning models at the backend by using them with static-dynamic CQCC features at the frontend. The implemented work of this paper also uses a combination of LSTM and CNN models for the development of the backend. Further, two two-level spoof detection systems for ASV that use static-dynamic features at the frontend are implemented. The first system follows a voting protocol based implementation with the CNN and LSTM models at the first level and the LSTM with Time Distributed Wrappers model at the second level. The second system uses an LSTM model for user identification at the first level and LSTM with Time Distributed Wrappers for verification at the second level. These systems can bring new insights into the development of spoof detection methods for ASV.

Proposed method
This section of the paper discusses the architecture of the proposed ASV system. Figure 1a shows the frontend and backend arrangement that has been used for comparison of static CQCC and static-dynamic CQCC features in the implemented ASV system. Speech signals taken from the dataset are applied to the frontend, where static CQCC features are extracted with the general extraction process and static-dynamic hybrid features are extracted with the proposed methodology. These features, along with the labels from the dataset, are then applied to the backend model that runs the classification. These classification results are used for feature comparison. Figure 1b shows the frontend and backend arrangement that has been used for comparison of various deep learning models while keeping static-dynamic CQCC features at the frontend. The frontend used in this arrangement is the best performing feature extraction technique from the feature comparison. Speech signals and labels come from the same dataset in the whole arrangement. The backend here hosts all the proposed models for spoof detection and a single model for the speaker identification task. At the backend, all chosen models are trained and their performances are analyzed. The systems of Fig. 2 are arrangements of the models from Fig. 1. Figure 2a shows the block diagram for the voting protocol based two-level spoof detection system. This system classifies the speech signal according to the voting protocol implemented across level 1 and level 2: level 1 analyzes the input, which is further analyzed at level 2 as per the protocol to declare the decision. Figure 2b gives the block diagram for the two-level user identification and verification system. This two-stage arrangement uses a speaker identification model at stage 1, the result of which is passed to stage 2.
Stage 2 uses the user identification and verification protocol along with the chosen backend model to declare the classification result. The pointwise contributions of the proposed work follow, and the subsequent subsections discuss each component in detail.
• This paper promotes the development of a single countermeasure that is robust to every kind of spoofing attack. Therefore, an initiative to modify the used dataset is taken. The AllSpoofsASV dataset (Fig. 1) is a generated variation of the standard dataset.

AllSpoofsASV dataset
A generated variant of the ASVspoof 2019 dataset is used for building the proposed ASV systems. The ASVspoof 2019 dataset is provided by the ASVspoof challenge community [37]. The design of this dataset is intended to tackle SS, VC and replay attacks in ASV systems. The LA set of the dataset includes SS and VC spoofed utterances, and the PA set includes replay attacked utterances [27]. All the audios are recorded in the English language and are 2-8 s in length; however, the length of most audios lies between 4 and 6 s in both sets. The proposed system makes use of both the LA and PA sets by mixing them into a single set, the AllSpoofsASV dataset. Mixing the sets makes it possible to develop spoof detection systems in one run for all spoofing attack types included in the used dataset. Table 1 gives the details of the AllSpoofsASV dataset.

Feature extraction using CQCC features
Constant Q Cepstral Coefficients (CQCC) feature extraction is used for extracting useful information from the recorded speech signal during both the training and testing phases of an ASV system. In recent years, this technique has proved to be among the most promising for the development of robust and accurate ASV systems [20,21]. The mathematical representation of the CQCC feature extraction approach is described as follows: Eq. (1) computes the Constant Q Transform (CQT) of the input speech signal p(n), denoted C_CQF(e), and Eq. (2) computes the j-th CQCC coefficient C_CQCC(j), where E is the number of linearly spaced bins and e indexes those bins. The process of CQCC feature extraction applies the CQT and then takes the log of the powered spectrum [38]. Also, before calculating the Discrete Cosine Transformation (DCT), it applies resampling [39,40]. It sets the number of feature coefficients and returns the CQCC features.
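The equations referenced above appear to have been lost in typesetting; the following is a hedged reconstruction based on the standard CQCC formulation, using the symbols named above (the exact form of the CQT basis function a_e(n) is an assumption):

```latex
% Eq. (1): Constant Q Transform of the input speech signal p(n),
% with a_e(n) the basis function of bin e and * denoting conjugation
C_{\mathrm{CQF}}(e) = \sum_{n} p(n)\, a_e^{*}(n)

% Eq. (2): DCT of the log power spectrum over the E linearly spaced bins
C_{\mathrm{CQCC}}(j) = \sum_{e=1}^{E} \log\left|C_{\mathrm{CQF}}(e)\right|^{2}
\cos\!\left[\frac{j\left(e - \tfrac{1}{2}\right)\pi}{E}\right]
```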
The proposed system uses the find_CQCC_features () function for implementing CQCC feature extraction. This function applies the actual CQCC feature extraction process to the speech signal. It takes an audio file as input and returns a matrix of 90 × m_frames with the 30 static, 30 delta (D) and 30 delta-delta (DD) features for m_frames frames, where m_frames denotes the number of audio frames extracted, depending on the length of the input audio. Firstly, it sets the initial values for the number of bins per octave b, maximum frequency N_max, minimum frequency N_min, number of desired coefficients of any type n_coeff and type of feature f_type. Here, the feature type f_type can be static (S), delta (D) or delta-delta (DD). Secondly, it calls the find_cqcc () function, which takes all these initialized values as input and outputs the static, delta or delta-delta features. The algorithm in this function starts with the calculation of the gamma value, one of the parameters of the CQT application process. Then, it calculates the log power spectrum of the CQT output, which is resampled before calculating the DCT. The functions performing these operations are discussed further in this section, covering the inputs taken, the operations applied and the nature of the outputs. The algorithm then retains only the desired number of features, and returns the static, delta or delta-delta coefficients as per the value of f_type. Finally, find_CQCC_features () combines all types of coefficients into one matrix and finds the number of frames. This function ensures a minimum of 400 frames in the output: if the number of frames is less than 400, zero padding is applied, and the final matrix is the desired CQCC feature matrix.
This whole process uses some inbuilt functions from different libraries of Python and MATLAB [41,42]. In the proposed work, these functions are named according to their functionality and are described further in this section. Function 1, given in the Appendix, gives the pseudo code for find_CQCC_features (), which calls find_cqcc () to compute the CQCC features.

• audioread (): This function takes an audio file (audio_file) as input, and returns its time series y and sampling rate N_s. The number of values in the time series y depends on the length of the audio file, which in turn contributes to the number of frames.
• zscore (): This function calculates the row-wise z-score for each value of the input matrix. As the values coming out of the find_cqcc () function reside in a continuous range from small to large values, applying this function normalizes them. The general formula for the z-score is given by Eq. (3):

z = (x − μ) / σ (3)

Here, x is the element value to be normalized, μ is the mean of the values of the entire row and σ stands for the standard deviation of those values.
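A minimal NumPy sketch of the row-wise normalization described above (the function name mirrors the paper's zscore (); the layout of one coefficient per row is an assumption):

```python
import numpy as np

def zscore(matrix):
    # Row-wise z-score: subtract each row's mean and divide by its
    # standard deviation, as in Eq. (3).
    mu = matrix.mean(axis=1, keepdims=True)
    sigma = matrix.std(axis=1, keepdims=True)
    return (matrix - mu) / sigma
```

After this step every row of the feature matrix has zero mean and unit standard deviation.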

• length (): This function takes a matrix as input and outputs the number of columns in it.

Backend classification using deep learning models
This section briefly details the deep learning models that are used at the backend of the different architectures proposed in this paper.

Long short term memory (LSTM) with time distributed wrappers (M 1 )
The proposed Long Short Term Memory (LSTM) network, shown in Fig. 3, comprises three time distributed dense layers, each with a ReLU activation function. Time distributed wrapped layers are especially suitable for time varying data frames such as audio, video, etc. The proposed LSTM model (M 1) has 32, 16 and 10 units in its time distributed dense layers, in this order. The numbers of units inside the layers are given to provide readers with finer grained knowledge of the model's structure; the motivation for choosing these numbers of neurons is taken from the related work [30,31]. After that, a 15% dropout is applied to drop the contribution of some randomly selected neurons; the dropout layer prevents the model from overfitting. In the M 1 model, this operation is followed by three LSTM layers having 10, 20 and 30 units, in this order. These layers are followed by a 10% dropout, and the result of the dropout is passed to a dense layer with a sigmoid activation function.
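Assuming a Keras implementation and an input of shape (frames, coefficients), the layer stack of M 1 described above could be sketched as follows; the input dimensions and the compile settings are assumptions taken from elsewhere in the paper, not confirmed details of the authors' code:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input, LSTM, TimeDistributed

def build_m1(n_frames=400, n_feats=90):
    # Sketch of model M1: three time distributed dense layers, dropout,
    # three LSTM layers, dropout, and a sigmoid output for binary labels.
    model = Sequential([
        Input(shape=(n_frames, n_feats)),
        TimeDistributed(Dense(32, activation='relu')),
        TimeDistributed(Dense(16, activation='relu')),
        TimeDistributed(Dense(10, activation='relu')),
        Dropout(0.15),
        LSTM(10, return_sequences=True),
        LSTM(20, return_sequences=True),
        LSTM(30),
        Dropout(0.10),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```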

Long short term memory (LSTM) (M 2 & M 4 )
The proposed Long Short Term Memory (LSTM) network, shown by Fig. 4, takes input on the first LSTM layer, which is followed by two more LSTM layers. These layers have 10, 20 and 30 LSTM units, in this order, chosen as per the results shown in [30,31]. The output of these layers is passed to a dense layer of 24 units after applying a 10% dropout. The output of this dense layer is, again after applying a 10% dropout, passed to the last layer, a dense layer with a sigmoid activation function. An LSTM model (M 4) with a similar architecture has 20, 30 and 400 units, in this order, in its first three LSTM layers (Fig. 4); however, all the dropout and dense layers have the same specifications.

Two-dimensional convolutional neural network (2D CNN) (M 3 )
As shown in Fig. 5, the first Two-Dimensional Convolutional layer (Conv2D) of the proposed Two-Dimensional Convolutional Neural Network (2D CNN) (M 3) comprises 24 filters of 3 × 3 kernel size along with the ReLU activation function. After that, a batch normalization layer is added, which is itself followed by three blocks of Conv2D and 2-Dimensional (2D) max pooling layers. The Conv2D layers of these blocks have 16 filters of 5 × 5 kernel size, and the 2D max pooling layers have a 2 × 2 pool size. These blocks are followed by a flatten layer and then a dense layer of 10 units. After that, a 10% dropout is applied to avoid overfitting of the model. The last layer of this 2D CNN model is a dense layer with a sigmoid activation function.
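Under the same Keras assumption, the M 3 stack described above could be sketched as follows; treating the 90 × 400 feature matrix as a single-channel image is an assumption, as is the compile configuration:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, Input, MaxPooling2D)

def build_m3(n_feats=90, n_frames=400):
    # Sketch of the 2D CNN model M3: one 3x3 Conv2D, batch normalization,
    # three Conv2D + max pooling blocks, then flatten, dense, dropout
    # and a sigmoid output.
    model = Sequential([
        Input(shape=(n_feats, n_frames, 1)),
        Conv2D(24, (3, 3), activation='relu'),
        BatchNormalization(),
    ])
    for _ in range(3):
        model.add(Conv2D(16, (5, 5), activation='relu'))
        model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(10))
    model.add(Dropout(0.10))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```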

Spoof detection systems
This section discusses the two spoof detection systems (System_1 and System_2) that are developed for the implementation of the proposed ASV system. Both System_1 and System_2 use the static-dynamic hybrid combination of CQCC features at the frontend and different arrangements of the M 1, M 2, M 3 and M 4 models at the backend.

Voting protocol based two-level ASV system (System_1)
The two-level ASV system with voting protocol, i.e., System_1, focuses on the spoof detection task. It accepts the input speech signal if it is bonafide, and rejects it if it is spoofed by any of the SS, VC and replay attacks. Models M 1, M 2 and M 3 provide the corresponding labels, bonafide or spoofed, as output. Figure 6 shows the proposed System_1, which has models M 2 and M 3 at the first level while M 1 resides at the second level, where F is treated as a global variable.
The purpose of putting models M 2 and M 3 at level one is that both of these models are equally good when evaluated for Equal Error Rate (EER); this adds fairness to the classification result of this level. M 1 is the most powerful model and is hence put at the second level. Firstly, each input audio file is applied to the models M 2 and M 3. Then, the voting protocol is applied to their decisions. A find_binary () function maps these decisions to Boolean values, i.e., FALSE for a spoofed decision (due to any of the SS, VC and replay attacks) and TRUE for a bonafide decision made by the model. The voting protocol compares the outputs of the find_binary () function for both first level models. If the outputs from both models are the same, that output is returned as the final classification result of the system. Otherwise, the audio file is tested on the model M 1 at the second level, and its classification result, after passing through the find_binary () function, is returned. In the end, the proposed system returns TRUE or FALSE for the input speech being bonafide or spoofed, respectively. Function 2, added in the Appendix, gives the pseudo code for the implemented voting protocol that uses the find_binary () function.
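The voting protocol described above can be sketched in plain Python; the models are represented here as hypothetical callables returning a 'bonafide' or 'spoofed' label, which is an assumption about their interface:

```python
def find_binary(decision):
    # Maps a model decision to a Boolean: TRUE for bonafide, FALSE for spoofed.
    return decision == 'bonafide'

def voting_protocol(audio, m1, m2, m3):
    """Sketch of System_1's two-level voting protocol.

    m1, m2, m3 are assumed callables returning 'bonafide' or 'spoofed'.
    """
    v2 = find_binary(m2(audio))
    v3 = find_binary(m3(audio))
    if v2 == v3:
        # level-1 agreement is the final classification result
        return v2
    # otherwise defer to the most powerful model M1 at level 2
    return find_binary(m1(audio))
```

When the level-1 models disagree, M 1's decision alone determines the output, as the protocol specifies.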

Two-level ASV system with user identification and verification (System_2)
The System_2, as shown by Fig. 7, also executes its process in two stages/levels. In the first stage, it identifies the user id for the applied speech signal. Then, the user's voice signal is verified, as bonafide or spoofed, in the second stage of the system. The system uses the User Identification and Verification Protocol to accomplish this task, where F and I are treated as global variables.
As a result, the system identifies the validity of the claimer along with the genuineness of the applied speech signal. Firstly, the input audio signal is applied to the model M 4 of the first stage. Model M 4 predicts the identity of the user (U i) out of the n already registered users. This predicted identity is supplied to stage 2, where the user identification and verification protocol is applied. At this stage, n instances {(M 1 U 1), (M 1 U 2), ……, (M 1 U n)} of model M 1 reside, trained for the n users {U 1, U 2, …., U n}. Model M 1 checks whether the speech signal is bonafide or spoofed at this stage, and the decision is mapped to an integer value in variable A. The set_ternary () function maps to the integer value THREE if U i and I are not the same, to the integer value ONE if the decision is bonafide while U i and I are the same, and to the integer value TWO if the decision is spoofed. At the output, if A is ONE then the user is valid and the speech is bonafide, if A is TWO then the speech is spoofed, and if A is THREE then the user is invalid. Function 3, appended in the Appendix, gives the pseudo code for the implementation of System_2.
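Under the same assumption of models as callables, the user identification and verification protocol could be sketched as follows; the argument order of set_ternary () and the representation of identities as strings are assumptions for illustration:

```python
def set_ternary(predicted_id, claimed_id, decision):
    # THREE: predicted identity differs from the claimed identity I
    if predicted_id != claimed_id:
        return 3
    # ONE: identities match and the speech is bonafide; TWO: spoofed
    return 1 if decision == 'bonafide' else 2

def identify_and_verify(audio, claimed_id, m4, m1_instances):
    """Sketch of System_2: M4 identifies the user at stage 1, then the
    per-user M1 instance verifies the speech at stage 2.

    m4 is assumed to return a user id; m1_instances maps each user id
    to a trained M1 callable returning 'bonafide' or 'spoofed'.
    """
    predicted = m4(audio)
    decision = m1_instances[predicted](audio)
    return set_ternary(predicted, claimed_id, decision)
```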

Experimental setups
This section of the paper deals with the experimental details for the implementation of the proposed ASV system. The frontend feature extraction is implemented using Octave on the Linux operating system. The training, development and evaluation of the backend models are done with the Anaconda platform on the Windows operating system. All the used audios and labels are taken from the training, development and evaluation sets of the AllSpoofsASV dataset. During training of the deep learning models, Python's inbuilt facilities are used for weight updates via the backpropagation algorithm, together with loss functions. For the two-class classification problems, binary cross entropy is used as the loss function; it produces a probability or score between zero and one for an utterance. The categorical cross entropy loss function is used for the multi-class classification of user identities (specifically in the training of M 4).
A learning rate is required for the iterative update of weights during the training process. In the proposed work, the ADAM (Adaptive Moment Estimation) optimizer is used to achieve an adaptive learning rate [43,44]. It combines the advantages of the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp). AdaGrad defines a learning rate for each parameter to improve the performance of the model on sparse gradients, whereas RMSProp uses an average of the latest values of the gradients of the weights. The ADAM algorithm passes both the gradient and the squared gradient to an exponential moving average function. For heavy models and large datasets, it can solve practical problems efficiently [43][44][45]. The system arrangements for the different comparisons and analyses are discussed later in this section.
The performance of the proposed architectures and systems is evaluated with the help of two evaluation measures: Equal Error Rate (EER) and percentage accuracy. The spoof detection systems are evaluated using EER, and the user identification system is evaluated by percentage accuracy. EER is the value at which the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are equal [27,28], where FAR is the ratio of the number of spoofed utterances having a score greater than or equal to the threshold θ to the total number of spoofed utterances, and FRR is the ratio of the number of bonafide utterances having a score less than the threshold θ to the total number of bonafide utterances. The mathematical representation of FAR and FRR is given by Eqs. (5) and (6), respectively. EER is obtained by computing FAR and FRR with the help of the threshold θ; the value at which these parameters are equal is declared as the EER of the system.

FAR = (total count of spoofed utterances with score ≥ θ) / (total count of spoofed utterances) (5)

FRR = (total count of bonafide utterances with score < θ) / (total count of bonafide utterances) (6)

Percentage accuracy is calculated with the help of the correct predictions and the total number of input samples to be checked. The mathematical formula of percentage accuracy is given by Eq. (7).

Percentage Accuracy = (Count(correct predictions) / Count(input samples)) × 100 (7)

In this case, the total number of correctly predicted user samples divided by the total number of user input samples is multiplied by 100.
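The EER computation implied by Eqs. (5) and (6) can be sketched by sweeping the threshold θ over the observed scores and taking the point where FAR and FRR meet; this is a simple illustrative implementation, not the paper's evaluation code:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    # Sweep the threshold over all observed scores; at each threshold,
    # FAR = fraction of spoofed scores >= threshold (Eq. 5) and
    # FRR = fraction of bonafide scores < threshold (Eq. 6).
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    fars = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frrs = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # Report the rate at the threshold where FAR and FRR are closest.
    i = int(np.argmin(np.abs(fars - frrs)))
    return (fars[i] + frrs[i]) / 2.0
```

With perfectly separated score distributions the EER is zero; fully overlapping distributions push it towards 0.5.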

Frontend features extraction
For the spoof detection task, firstly, model M 1 is trained with only the 30 static CQCC features, calculated by making some modifications to the find_CQCC_features () function: the mean over the m_frames frames of each of the 30 coefficients is used, so a vector of 1 × 30 dimensions is extracted in the static case. Model M 1 is trained for up to five epochs with a batch size of 512. Secondly, model M 1 is trained with the static-dynamic hybrid CQCC features calculated by the find_CQCC_features () function. All 30 static, 30 delta and 30 delta-delta CQCC features for all m_frames frames (without taking the mean) are used in this arrangement, so a matrix of 90 × m_frames dimensions is extracted for each audio. To balance the comparison criteria, this arrangement has also been trained for up to five epochs with a batch size of 512. The Equal Error Rate (EER) for both arrangements is computed to compare the performances of the feature sets. The comparative analysis on the evaluation data with both feature sets is shown in Table 2.

Backend deep learning models with System_1
The proposed work compares the performance of all the backend deep learning models M 1, M 2 and M 3, implemented individually, with the voting protocol based System_1, using static-dynamic CQCC features at the frontend and the AllSpoofsASV dataset. Model M 1 is trained with a batch size of 512 for up to five epochs, model M 2 is trained with a batch size of 512 for 20 epochs, and model M 3 is trained with a batch size of 500 for 15 epochs. For the training of all three models, a patience of two is used as the early stopping criterion, the binary cross entropy loss function is used to measure the loss, and the ADAM optimizer is used for optimization in both systems [43,44].
As described earlier, the trained models M 2 and M 3 are used at level 1 and M 1 is used at level 2 for the development of the voting protocol based spoof detection system System_1. The performance analysis of M 1, M 2, M 3 and System_1 is done using the EER parameter. Table 3 shows the comparative values of EER on the evaluation dataset for all three backend models and System_1.

Model M 4
The user identification model M 4 is trained individually for eight users (n) with a batch size of 512 for up to 80 epochs using the categorical cross entropy loss function. Model M 4 is tested using the percentage accuracy parameter. The percentage accuracy of the model is calculated for the evaluation set, as shown by Table 4.

System_1 and System_2
System_2 uses the trained model M 4 for the user identification task at stage 1, and n instances of model M 1 at stage 2. However, the training of model M 1 in System_2 differs from that in System_1: in System_2, it is trained eight times separately, once for each of the eight existing users, using the bonafide and spoofed utterances of that specific user. Firstly, user identification is done for the eight users at stage 1, and then the user identification and verification protocol is invoked for verification at stage 2. The performances of System_1 and System_2 for the spoof detection task are evaluated using the EER parameter for the evaluation sets, as shown in Table 5.

Results
This section presents the performance and comparison results of all systems discussed in the third section. For obtaining the results, the proposed work uses the procedure adopted by the state-of-the-art works of [10,15,26]. As described earlier in the "AllSpoofsASV dataset" section, the dataset used by the proposed system is already divided into training, development and evaluation sets; therefore, it is not required to partition the dataset into ratios for training, development and evaluation samples. For evaluation of ASV systems, EER is the evaluation protocol applied to the classification results of the model for the spoof detection task [10,15,26]. The models in this work have been trained five times with the training set, and the development set is applied to each trained model. The network parameters have been tuned for all the systems to obtain stable parameters. The EER evaluation protocol is applied to the development results, and the accuracy of the model is verified. The mean of all five development set test results is reported in the presented tables. The evaluation set is applied to the model when it becomes stable after all training passes, and the EER is calculated for the classification result. The protocols of systems one and two are applied to the evaluation set performances of the models. For the speaker identification task, percentage accuracy is calculated as the evaluation measure on the development set results using a fivefold validation approach; it is also calculated on the evaluation set to check performance.

Comparison of CQCC features
The models set up for feature comparison are trained five times, and the average, i.e., mean ± standard deviation (SD), of the results is taken to conclude the EER. It can be observed in Table 2 that the combination of static and dynamic CQCC features performs better than the static CQCC features alone. Hence, this combination is used in the development of the further proposed spoof detection systems.

Comparison of used deep learning models with System_1
These models are trained five times, and the EER evaluation measure is calculated on the development set for each training of each model. Table 3 presents the EER values for the five training and development passes (presented by the sequence of "D i" in Table 3) along with the average value of the results, followed by the performance on the evaluation set and of System_1. The results presented in Table 3 show that M 1 outperforms the other two backend models for spoof detection when implemented individually. However, the voting protocol based System_1 outperforms all three backend models. The voting protocol is applied once the average performances of all the deep learning models are concluded.

Performance of model M 4
The average percentage accuracy of the model M 4 is calculated for the evaluation set by averaging five runs, as shown by Table 4. The percentage accuracy, as described earlier, is calculated by Eq. (7) using the correct predictions and the total number of input samples to be checked. It can be observed from Table 4 that M 4 performs satisfactorily.

Comparative analysis for System_1 and System_2
The performances of System_1 and System_2 for the spoof detection task are evaluated using the EER parameter for both the development and evaluation sets, as shown in Table 5. It can easily be observed in Table 5 that System_2 performs better than System_1. However, System_2 is limited to the private or local domain because it uses a limited number of users. An increase in the number of users adds more complexity to the development of an ASV system, as a separately trained model M 1 is required for each user, which is not practically feasible. Hence, System_1 performs satisfactorily, as it is applicable to the public domain.
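The identification-then-verification control flow of System_2 can be sketched as below. The callables `identify` (stage-1 M 4 speaker identification) and the per-user entries of `verify_models` (stage-2 M 1 verifiers) are hypothetical stand-ins for the trained models; the per-user dictionary also illustrates why storage grows with the number of enrolled users.

```python
def system2_decision(utterance, identify, verify_models):
    """Sketch of System_2 (assumed control flow): stage 1 identifies the
    speaker; stage 2 runs that user's verifier for spoof detection."""
    user = identify(utterance)                   # stage 1: speaker identification (M4)
    if user not in verify_models:
        return "reject"                          # unknown speaker
    is_genuine = verify_models[user](utterance)  # stage 2: per-user verifier (M1)
    return "accept" if is_genuine else "reject"

# Toy stand-ins for the trained models
models = {"alice": lambda x: x == "alice_real"}
print(system2_decision("alice_real", lambda x: "alice", models))  # prints accept
print(system2_decision("alice_fake", lambda x: "alice", models))  # prints reject
```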

Comparison of proposed system with existing systems
This section compares the performances of the proposed systems, System_1 and System_2, with some of the existing systems from the literature, such as Chettri et al. [10]. Table 6 shows the comparison of these systems with the proposed systems of this paper. Although some systems from the literature seem to be good at detecting a particular attack type, the proposed systems also perform well at detecting all three kinds of spoofing attacks in a single run.

Conclusion
Undoubtedly, ASV systems are highly exposed to spoofing attacks. However, their performance is good enough that industry is attracted to using them in practical applications. The initiative to design a single dataset can provide new insights into the spoof detection task; the AllSpoofsASV dataset, a variation of the ASVspoof 2019 dataset, is a small step towards this. Combining different feature coefficients with hybrid deep learning models can help in the development of robust ASVs. This paper shows that a combination of static and dynamic CQCC features performs better with LSTM models than static features alone. Also, the comparison of results shows that the LSTM with Time Distributed Wrappers model (M 1) outperforms the LSTM (M 2) and CNN (M 3) models when evaluated by Equal Error Rate (EER). However, the two-level voting protocol based spoof detection system System_1, which uses M 2 and M 3 at level 1 and M 1 at level 2, performs best of them all. As the LSTM model M 4 provides satisfactory performance, it can be used particularly for speaker identification with spoof detection. Also, the two-level spoof detection system with user identification and verification, System_2, which uses M 4 at stage 1 and M 1 at stage 2, performs better than System_1. However, it is restricted to a limited number of users; using it for the public domain or for an organization with a larger and variable number of speakers will increase the complexity and storage requirements of the system. For future work, more attacks like twins and mimicry can be added into the dataset, and more possible hybrid combinations of features and deep learning models can be exploited. Considering the importance of spoof detection in ASV, more efficient and complex structures like the VGG family of deep learning models can also be used as a future extension of the proposed work.

Fig. 1 a Proposed ASV system for features' comparison. b Proposed system for deep learning models' comparison

Fig. 2 a Voting protocol based two-level spoof detection system. b Two-level spoof detection system with user identification and verification

• Selection of suitable features for the frontend is essential. This work tests whether static CQCC or a combination of static and dynamic CQCC speech features performs better at the frontend, where both feature sets have the LSTM with Time Distributed Wrappers model at the backend.
• Different deep learning models, LSTM, LSTM with Time Distributed Wrappers and CNN based systems, are implemented with static–dynamic CQCC features to measure their performances individually.
• One voting protocol based implementation is done by using the CNN and LSTM models at the first level and the LSTM with Time Distributed Wrappers model at the second level.
• Another implementation is performed using the LSTM model for user identification at the first stage and the LSTM with Time Distributed Wrappers model for verification at the second stage.
• zero_padding (): This function pads a given matrix with extra columns or rows of zero values up to the desired size.
• cut (): This function cuts a matrix down to the desired number of rows.
• cqt (): This function applies the Constant Q Transform (CQT) to the representative values of a speech signal. CQT transforms the signal from the time domain into the frequency domain while maintaining a constant Q factor across the signal.
• log (): This function applies the logarithm operation on the input values. The logarithm is calculated for the squared spectrum that is the output of the cqt () function.
• resample (): This function converts the geometrically spaced bins provided by CQT into linearly spaced bins. The bins are converted into linear space to make the signal compatible with the Discrete Cosine Transform (DCT).
• dct (): This function applies the DCT internally. Application of the DCT is helpful for signal compression and converts the log spectrum into cepstral coefficients.
• delta (): This function calculates the derivative of the applied values.
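The matrix-shaping helpers described above can be sketched in NumPy as follows. The shapes and the row-wise padding direction are assumptions based on the descriptions, not the paper's exact implementation.

```python
import numpy as np

def zero_padding(mat, n_rows):
    """Pad a matrix with extra rows of zeros up to the desired number
    of rows (sketch of zero_padding(); padding direction assumed)."""
    pad = n_rows - mat.shape[0]
    if pad <= 0:
        return mat
    return np.vstack([mat, np.zeros((pad, mat.shape[1]))])

def cut(mat, n_rows):
    """Trim a matrix to the desired number of rows (sketch of cut())."""
    return mat[:n_rows, :]

def delta(feat):
    """Derivative of the applied values along the frame axis
    (sketch of delta())."""
    return np.gradient(feat, axis=0)

x = np.ones((3, 4))
print(zero_padding(x, 5).shape)          # (5, 4)
print(cut(zero_padding(x, 5), 2).shape)  # (2, 4)
```

Together, zero_padding () and cut () normalize every utterance's feature matrix to a fixed number of frames so that the deep learning models receive fixed-size inputs.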

Table 2 Comparative analysis of static and static–dynamic CQCC features

Table 3
Comparison of backend spoof detection models

Table 5 Performance of proposed systems

The ASVspoof 2019 challenge has provided a GMM model trained with LFCC and CQCC features at the frontend for SS, VC and replay attacks [27]. Jung et al. [46] have trained a deep neural network model with spectrograms, i-vectors and raw waveforms only for replay attack detection.

Table 6 Comparison of proposed system with existing systems. ✔ indicates that a particular attack is addressed and ✖ indicates that a particular attack is not addressed