1 Introduction

Biometrics is the process of identifying and differentiating between individuals based on differences in biological and behavioral characteristics. According to the National Science & Technology Council’s (NSTC) Subcommittee on Biometrics, “biometric” is a common term used to describe either a characteristic or a process [11]. When used to describe a characteristic, it refers to the measurable biological and behavioral characteristics that can be used for automated recognition. When used to describe a process, it refers to the methods of automatically recognizing a biometric subject based on observable biological and behavioral properties.

Biometrics can be categorized into two main types, namely physiological and behavioral biometrics [100]. Physiological biometrics refers to distinct characteristics related to an individual’s physical body, such as DNA, eyes (iris and retina), fingerprints, and face [12]. Behavioral biometrics, on the other hand, refers to unique characteristics related to an individual’s behavioral patterns, such as typing rhythm, voice, and human motion. Examples of widely deployed biometric technologies are fingerprint-based immigration control, virtual assistants using speech recognition, and smartphone login using face recognition.

Voice biometrics can be applied in many ways. For example, it can be used in healthcare for voice disorder detection and to assist in voice disorder assessment and treatment [81]. One application of voice biometrics, namely voice recognition or speaker recognition, refers to the process of recognizing the person who is speaking. Voice recognition can be further classified into two categories, namely speaker identification [133] and speaker verification [24]. Speaker identification refers to the process of determining who is speaking, whereas speaker verification refers to the process of verifying the claimed identity of the speaker, as presented in Fig. 1. Voice recognition uses both physiological and behavioral components to identify and verify the identity of the speaker. Some applications of voice recognition are access control, forensic criminal investigation, surveillance of phone conversations, and banking transactions [68].

Fig. 1 Illustration of speaker identification versus speaker verification

Despite the benefits brought by voice recognition technology, spoofing attacks by adversaries are inevitable. A spoofing attack refers to a malicious party impersonating an authorized individual in order to bypass the voice recognition system and gain access to it. Because biometric data can be obtained easily via social media such as Facebook, Instagram, and WhatsApp [71], countermeasures against spoofing attacks are needed to enhance the security of biometric systems. These countermeasures are known as voice Presentation Attack Detection (PAD). However, PAD for speaker recognition has not received as much attention as PAD for other biometrics such as fingerprint and face recognition [28]. Some of the reasons for the slower progress of anti-spoofing measures in the speaker recognition field are the later invention and deployment, and the limited past applications, of voice recognition technology compared to biometrics like fingerprint and face recognition [73]. Nevertheless, the hands-free property of voice recognition has made it widely accepted and applied in various fields, including but not limited to smartphone login, voice-verified bank transactions, and access control verification [69]. Hence, an effective ready-to-use voice PAD system for voice recognition is required, as no finished voice PAD products were publicly available [114].

Numerous recent works on voice PAD can be found in the literature. However, to the best of our knowledge, only four articles [49, 88, 103, 136] that present a survey are indexed in Scopus. The most similar one [103], published in 2019, reviews and summarizes some voice PAD approaches for speaker recognition systems and covers all four types of presentation attacks. Article [88] focuses only on replay attacks, while two articles [49, 143] focus only on the voice PAD systems presented in the ASVspoof Challenges. However, most of these papers do not provide a descriptive taxonomy of recent voice PAD systems. A taxonomy categorizes previous work based on identified attributes, which helps readers understand the topic better. Hence, the survey presented in this paper aims to expand the domain of knowledge by categorizing the related work and building a taxonomy from the most recent work on voice PAD, including the systems presented in the ASVspoof 2019 Challenge. This paper also contributes trends and analyses of voice PAD, which are lacking in the other survey articles. The issues and future directions of voice PAD are also described in this paper.

The contributions of this paper are:

  • To produce a taxonomy on recent voice PAD systems.

  • To visualize trends of work on PAD.

  • To identify the issues faced by current voice PAD and describe the corresponding future directions of PAD.

The remainder of this survey paper is organized as follows. Section 2 describes the methodology used to conduct this survey. Recent speaker verification systems published in the last five years are presented in Section 3. The findings of the survey, which include voice spoofing attacks and PAD, an analysis of the trend of recent voice PAD, research gaps, and future directions of voice PAD systems, are presented in Section 4. The survey paper is concluded in Section 5.

2 Methodology

In this section, the methodology used to survey recent speaker verification systems, voice spoofing, and voice PAD is described.

First, to identify recent work on speaker verification systems, we referred to a variety of sources, including online resources such as news and forums, and scientific materials such as journal and conference articles. Online resources are used to retrieve up-to-date information about the applications of speaker recognition, as well as speaker recognition security risks, issues, and incidents that have been identified or have occurred. Meanwhile, scientific materials such as journal and conference articles are included in this survey to assess state-of-the-art speaker recognition systems and the corresponding types of voice PAD used to secure them.

To ensure that this survey covers only state-of-the-art speaker verification systems and voice PAD, only related scientific materials published in recent years (2015–2021) are considered. Nonetheless, several older but significant articles are included as well. A total of 172 Scopus-indexed articles are considered in this survey. Recent articles presented at the Interspeech and ICASSP conferences are also included. To find as many voice PAD articles indexed in Scopus as possible, common keywords such as “voice” and “anti-spoofing” are used. Technical terms such as “presentation attack detection” and “PAD” are not selected as keywords because they may miss some relevant articles. The most recent survey paper on voice PAD was published in 2019 [103].

3 Speaker Verification

Voice recognition, commonly known as speaker recognition, refers to recognizing the speaking person, whereas speech recognition refers to recognizing the words from speech. Voice recognition is grouped into two categories, namely speaker identification and speaker verification, as illustrated in Fig. 1. Voice recognition uses both physiological and behavioral components in identifying and verifying the identity of the speaker.

Speaker verification is a process in which the claimed identity of the owner of the voice, the target voice, is verified by comparing the target voice with the registered voice in the database [47]. Hence, speaker verification is a 1:1 matching process between the target voice and the voice registered in the database for the claimed identity [4]. The main application of speaker verification is in authentication, such as verification of identity in phone banking transactions and voice-authenticated access control for door locks [69, 109].

There are two phases in Automatic Speaker Verification (ASV) systems, namely the speaker enrollment phase and the speaker verification phase [117]. In the speaker enrollment phase, the aim is to generate the speaker models. First, features are extracted from the voice captured by the voice recognition system. Second, the features are used to generate the speaker model. Last, the generated speaker model is enrolled in the database of the voice recognition system. This process is shown in Fig. 2. The verification of the speaker is conducted in the speaker verification phase. First, features are extracted from the voice by the voice recognition system. At the same time, the claimed or targeted speaker model is retrieved from the database. Second, the patterns of the extracted features are matched with the retrieved speaker model. If the matching yields a score equal to or greater than the threshold set in the voice recognition system, the speaker’s claim of identity is accepted; otherwise, the claim is rejected. Fig. 3 shows the process of speaker verification.
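The two phases described above can be sketched in a few lines of Python. This is only a minimal illustration, not an implementation from any surveyed work: the toy feature extractor, the averaged-vector speaker model, the cosine score, and the threshold value of 0.9 are all assumptions made for the example; a real system would use features such as MFCCs and a trained speaker model.

```python
import math

def extract_features(samples):
    """Toy stand-in for a real front end: returns a 3-dimensional vector."""
    n = len(samples)
    mean_abs = sum(abs(x) for x in samples) / n
    energy = sum(x * x for x in samples) / n
    zero_cross = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0) / (n - 1)
    return [mean_abs, energy, zero_cross]

def enroll(database, speaker_id, utterances):
    """Enrollment phase: extract features, build the model, store it."""
    vectors = [extract_features(u) for u in utterances]
    # Hypothetical speaker model: the average feature vector.
    model = [sum(col) / len(vectors) for col in zip(*vectors)]
    database[speaker_id] = model
    return model

def cosine(a, b):
    """Cosine similarity used here as the pattern-matching score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def verify(database, claimed_id, samples, threshold=0.9):
    """Verification phase: accept the claim if score >= threshold."""
    model = database[claimed_id]          # retrieve the claimed speaker model
    score = cosine(extract_features(samples), model)
    return score >= threshold, score

database = {}
enrolled_voice = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
enroll(database, "alice", [enrolled_voice])
accepted, _ = verify(database, "alice", enrolled_voice)
rejected, _ = verify(database, "alice", [0.9, 0.9, 0.9, 0.9])
```

In this sketch the decision rule mirrors the text: the claim is accepted only when the matching score reaches the configured threshold.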

Fig. 2 Speaker enrollment phase

Fig. 3 Speaker verification phase

Based on Figs. 2 and 3, the three key components of speaker verification systems are feature extraction, speaker modeling, and pattern matching. During pattern matching, the speaker verification system either accepts or rejects the claim of identity based on the pattern matching score [94]. The Equal Error Rate (EER), a metric commonly used to assess biometric system performance, is often the performance measure used for speaker verification [36, 87, 95, 113, 143]. The EER is the error value at which the two error rates, namely the False Acceptance Rate (FAR) and the False Rejection Rate (FRR), are equal. The FAR is the probability that a biometric system incorrectly accepts an unauthorized access attempt. In contrast, the FRR is the probability that a biometric system wrongly rejects an authorized access attempt. The lower the EER, the better the performance of the speaker verification system [95].
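As a rough illustration of how the EER relates to FAR and FRR, the following Python sketch sweeps candidate thresholds over two small lists of scores and reports the point where the two error rates are closest. The score values are made up for the example, and the simple threshold sweep (without interpolation) is an assumption; evaluation toolkits typically interpolate between operating points.

```python
def far_frr(genuine, impostor, threshold):
    """FAR: share of impostor scores accepted; FRR: share of genuine scores rejected."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Return (EER estimate, threshold) where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Hypothetical scores: higher means "more likely the claimed speaker".
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, eer_threshold = equal_error_rate(genuine_scores, impostor_scores)
```

For these toy scores, the FAR and FRR are both 25% at a threshold of 0.6, so the estimated EER is 25%.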

From the literature, the Gaussian Mixture Model (GMM) [95,96,97] is the method that produced more robust and better-performing speaker verification systems than other speaker modeling approaches. Therefore, it has been used extensively for feature extraction in recent speaker verification works [1, 107, 115]. A GMM is a probabilistic model that describes normally distributed subpopulations within an overall population. To verify the claimed speaker of an utterance, the GMM approach compares the captured voice with a general, person-independent speaker model. The Universal Background Model (UBM) [46, 67, 76] is often used as this general model. In speaker verification, the UBM represents general feature characteristics against which the specific person being verified can be compared. Researchers have also successfully applied other speaker modeling techniques such as i-vectors [21, 36] and x-vectors [116] for front-end feature extraction. The i-vector was introduced as a simple model for speaker recognition in which feature extraction is conducted using simple factor analysis. x-vectors are fixed-dimensional embeddings extracted with a DNN for speaker recognition, and the x-vector-based system was found to outperform the standard i-vector-based system.
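The comparison between a speaker model and a UBM is commonly expressed as a log-likelihood ratio. The sketch below illustrates that idea with diagonal-covariance GMMs in plain Python. The model parameters are hand-picked, hypothetical values for one-dimensional features; in practice both models would be trained on speech data, and the speaker model is often adapted from the UBM.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_avg_loglik(frames, gmm):
    """Average per-frame log-likelihood of the frames under a GMM."""
    total = 0.0
    for x in frames:
        comps = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"])
        ]
        peak = max(comps)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total / len(frames)

def llr_score(frames, speaker_gmm, ubm):
    """Positive values favor the claimed speaker over the background model."""
    return gmm_avg_loglik(frames, speaker_gmm) - gmm_avg_loglik(frames, ubm)

# Hypothetical, hand-set models over 1-dimensional features.
SPEAKER = {"weights": [0.5, 0.5], "means": [[2.0], [3.0]], "vars": [[1.0], [1.0]]}
UBM = {"weights": [0.5, 0.5], "means": [[-2.0], [2.0]], "vars": [[4.0], [4.0]]}

score_match = llr_score([[2.4], [2.6], [3.1]], SPEAKER, UBM)
score_mismatch = llr_score([[-2.0], [-1.5]], SPEAKER, UBM)
```

Frames that lie close to the speaker model yield a positive ratio, while frames that the background model explains better yield a negative one; the resulting score would then be compared against the system threshold.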

On the other hand, back-end classifiers such as Deep Neural Networks (DNN) [2, 24, 115] and Probabilistic Linear Discriminant Analysis (PLDA) [121] were shown to discriminate between spoof and genuine speech signals with low EER using features like i-vectors and x-vectors. Recently, a number of end-to-end approaches have been proposed for speaker verification [40, 64]. An end-to-end approach, in the context of speaker verification, is a model or classifier that is trained together with the feature learning. In contrast, another approach has emerged that focuses on learning speaker features while leaving the classifier as a separate component. The premise of this approach is that if the feature learning is strong enough, the limitations of the classifier become negligible. The feature learning approach was found to outperform the end-to-end approach consistently on a dataset consisting of utterances from 5,000 speakers [129].

To further improve the performance of ASV systems, some works use score normalization. In [8], score normalization was shown to lead not only to improved performance but also to better calibration and a more reliable threshold for ASV systems. For example, the work in [70] reported an improvement of 30% from adaptive symmetric score normalization (s-norm) on the NIST SRE 2016 dataset. Another way to further improve the performance of a speaker verification system is fusion [10, 108]. Two fusion techniques are frequently used for speaker verification, namely score fusion [55] and feature fusion [5]. Score fusion makes a final decision by combining the scores output by more than one biometric model, for example using logistic regression. Due to its implementation simplicity, score fusion is the most commonly used fusion method in multibiometric systems [98]. Moreover, recent works [32, 55] have shown that score fusion outperforms feature fusion, as a fused feature is more complex and may lose the significant traits that can be used to distinguish accurately between different speakers [55].
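The two ideas in this paragraph can be sketched as follows. The first function shows symmetric score normalization in its commonly used form, where a raw trial score is normalized against cohort scores for both the enrollment side and the test side; the optional `top_n` argument mimics the adaptive variant by keeping only the highest cohort scores. The second function shows logistic-regression-style score fusion. The cohort scores, weights, and bias are hypothetical values, and this is not necessarily the exact formulation used in [70] or [55].

```python
import math
import statistics

def s_norm(raw, enroll_cohort_scores, test_cohort_scores, top_n=None):
    """Symmetric score normalization of one raw trial score."""
    def cohort_stats(scores):
        if top_n is not None:
            # Adaptive variant: keep only the top-N most similar cohort entries.
            scores = sorted(scores, reverse=True)[:top_n]
        mean = statistics.mean(scores)
        std = max(statistics.pstdev(scores), 1e-12)  # guard against zero spread
        return mean, std

    mu_e, sd_e = cohort_stats(enroll_cohort_scores)
    mu_t, sd_t = cohort_stats(test_cohort_scores)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)

def fuse_scores(scores, weights, bias=0.0):
    """Logistic-regression-style fusion: weighted sum passed through a sigmoid."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))

normalized = s_norm(2.0, [0.0, 1.0, 2.0], [1.0, 1.0, 3.0])
fused_neutral = fuse_scores([0.0, 0.0], [1.0, 1.0])
fused_positive = fuse_scores([2.0, 1.0], [1.0, 1.0])
```

In a real system the fusion weights and bias would be learned on a development set rather than set by hand.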

Although state-of-the-art ASV systems are capable of verifying the claimed identity of speakers with a low error rate and high accuracy, they are prone to presentation attacks due to the ease of obtaining biometric data. Voice PAD was introduced to mitigate the vulnerability of ASV systems to presentation attacks. The next section presents recent works on voice PAD systems.

4 Voice Presentation Attack Detection, Taxonomy, Research Gap, and Future Direction

This section contains four sub-sections that present voice PAD. Section 4.1 presents voice presentation attacks and PAD. The taxonomy of recent voice PAD is described in Section 4.2. Section 4.3 presents the analysis of the trend of recent voice PAD. Section 4.4 presents the research gaps and future directions of voice PAD.

4.1 Voice Presentation Attack Detection (PAD)

Voice spoofing attacks can be grouped into two categories: sensor level attacks and transmission level attacks [136]. Transmission level attacks can be avoided and deflected by using a secure transmission protocol with assistance from security software. By comparison, sensor level attacks are much more difficult to defend against and require considerable attention due to the ease of obtaining biometric data, as described in Section 1 [71]. This sub-section therefore presents an overview of recent works on sensor level attacks. A sensor level attack is commonly referred to as a presentation attack, in which an adversary tries to bypass the voice recognition system through spoofed voice input. There are four main categories of voice presentation attacks, namely impersonation, replay, voice conversion, and speech synthesis attacks [136].

The first voice presentation attack, impersonation or the zero-effort imposter attack, is a spoofing technique that requires no assistance from electronic devices. Impersonation is carried out by mimicking a specific person’s way of speaking. Impersonation is not an effective ASV spoofing method [55]. However, in one case a reporter from the British Broadcasting Corporation (BBC News), a non-identical twin, successfully spoofed the voice recognition system of HSBC and accessed his twin brother’s bank account by mimicking his brother’s voice [111]. Hence, the threat that impersonation poses to ASV systems must not be underestimated. Recent work indicates that imitating the speech patterns of a speaker, such as the fundamental frequency and some key formants, is possible. However, mimicry that replicates all characteristics of a targeted voice seems to be physically impossible [113] due to the uniqueness of the human vocal tract. Moreover, recent studies [74] concluded that mimicry is unable to duplicate natural speech regardless of mimicry training and impersonation skill, as it is not a natural act. Another work shows that stable spectral peak features, which represent invariant vocal tract characteristics of a speaker, could be effective in differentiating genuine and imposter voices [113]. Although impersonation is not effective for spoofing most ASV systems [55], the other three types of presentation attacks are major threats [136] because the voiceprints used to spoof ASV systems originate from the genuine speaker.

The second presentation attack, the replay attack, is the most popular type of spoofing attack as it is the simplest to conduct. As biometric data can be obtained easily through social media, replay attacks can be conducted by anyone using recording devices such as smartphones. The replay attack is more straightforward than speech synthesis and voice conversion attacks, and it is more likely to be performed by non-professional adversaries, as replaying pre-recorded audio requires little knowledge of audio signal processing. Several replay attack detectors have been developed for ASV systems. For example, the baseline system of the ASVspoof 2017 Challenge, which is based on Constant Q Transform Cepstral Coefficients (CQCC) features with a 2-class Gaussian Mixture Model (GMM) classifier, recorded an EER of 30.60% on the evaluation dataset [22]. Using the speech frame selection approach [58], the performance of the CQCC-GMM countermeasure improved to 21.60% EER. A replay attack detector using Recurrent Neural Networks (RNN) with Filter Bank (Fbank) features achieved 9.81% EER [16], whereas one using Deep Neural Network (DNN) and Support Vector Machine (SVM) classifiers with CQCC and High-Frequency Cepstral Coefficients (HFCC) features achieved 11.5% EER [82]. The best replay attack detector in the ASVspoof 2017 Challenge, which reached an EER of 6.73% on the evaluation dataset, was a fusion system that adopted several classifiers and features [60].

The third and fourth presentation attacks are speech synthesis and voice conversion. Unlike the replay attack, spoofing a speaker verification system using speech synthesis or voice conversion requires knowledge of signal processing [55], so these attacks are mostly conducted by professional adversaries. The speech synthesis attack is one of the effective presentation attacks against ASV systems, in which Text-To-Speech (TTS) technology is applied by concatenating available pieces of speech data [105]. Recently, several Synthetic Speech Detectors (SSDs) were introduced to protect speaker verification systems from speech synthesis attacks [38, 102]. Since most synthetic speech is generated using parametric vocoders, SSDs that use phase information [89] have been shown to be effective [23]. As a result, phase-based SSDs have become the state of the art for detecting synthetic speech [85, 106]. Nonetheless, most of the introduced SSD systems are only effective against parametric vocoders that use minimum-phase filters for speech synthesis. Thus, phase-based SSDs are vulnerable to speech synthesis attacks from vocoders that use mixed-phase filters [23].

A voice conversion attack is performed by converting the voice of an attacker into the voice of the target speaker to deceive the ASV system. Indicators of converted voice, such as the absence of natural speech phase information, can be extracted as features to distinguish converted speech from genuine speech. For instance, features such as cosine normalization and frequency derivative of phase spectrum information can be extracted to detect converted speech with EERs of 6.0% and 2.4%, respectively [27]. Other features like the Local Binary Pattern (LBP), extracted from images generated from speech signals such as spectrograms [27], can also be used to detect artificial signals such as synthesized and voice-converted speech.

Several efforts have been made to foster the development of countermeasures against spoofing of ASV systems, which are often known as Presentation Attack Detection (PAD). One example is the building of more public datasets, such as the ReMASC dataset, which consists of genuine and replayed speech collected in realistic usage scenarios of voice-controlled systems [33]. ReMASC contains recordings from 50 speakers of both genders and of different ages and accents. The recordings are composed of 132 voice commands collected in four different environment settings with different levels of noise. The four environments are two indoor settings with quiet and noisy backgrounds, one outdoor setting, and one moving vehicle scenario. Four different microphones were used in the data collection. As the ReMASC corpus is made up of recordings from a variety of microphones instead of a single microphone, it is well suited for multi-channel voice PAD research such as [34]. Another major effort from the community of spoofing and anti-spoofing for ASV is the ASVspoof Challenge series. There have been three ASVspoof Challenges to date, namely ASVspoof 2015, ASVspoof 2017, and ASVspoof 2019. In general, the ASVspoof Challenge series aims to promote the development of generalized voice spoofing countermeasures that detect varying and unforeseen spoofing attacks using standardized datasets, protocols, and evaluation metrics.

The first edition of the ASVspoof challenges, ASVspoof 2015, was held as a special session at Interspeech 2015. The ASVspoof 2015 dataset consists of genuine, synthesized, and voice-converted utterances collected from 106 speakers (45 male and 61 female). Spoofed utterances for training and development were generated using three voice conversion and two speech synthesis algorithms. All five algorithms used to generate the spoofed utterances of the training and development sets were also used to generate spoofed utterances in the evaluation set. In addition, five further algorithms were used to generate more spoofed utterances in the evaluation set, referred to as unknown attacks. EER was used as the primary metric to evaluate the performance of submitted countermeasures. In ASVspoof 2015, a total of 16 primary submissions were used for ranking in the challenge. In general, most of the submissions achieved a low EER of less than 1% for known attacks. The best system submitted to ASVspoof 2015, named System A, used two features, namely Mel-Frequency Cepstral Coefficients (MFCC) and Cochlear Filter Cepstral Coefficients Plus Instantaneous Frequency (CFCCIF), and GMM classifiers with score fusion to detect spoofed speech with an average EER of 1.211% over known and unknown attacks. In particular, System A achieved an average EER of 0.408% for known attacks and 2.013% for unknown attacks. A similar trend of higher EER for unknown attacks than for known attacks can be seen in all 16 submissions. This trend can be seen as potential overfitting in the proposed countermeasures. One of the identified reasons for the higher EER on unknown attacks was the unreliability of the countermeasures in detecting S10 attacks, the only attack that was generated using the waveform concatenation approach. Details of all 16 submissions can be found in [139]. More details regarding the ASVspoof 2015 Challenge can be found in [137, 138].

Due to this shortcoming of ASVspoof 2015, ASVspoof 2017 was organized to highlight replay attacks, which were excluded from ASVspoof 2015. It was held as a special session at Interspeech 2017. The main objective of ASVspoof 2017 was to assess spoofing attack detection accuracy under ‘out in the wild’ conditions, and to detect replay attacks in particular. The genuine utterances of the ASVspoof 2017 dataset were based on the text-dependent RedDots corpus [61], a speech corpus made up of utterances from 49 male and 13 female speakers, whereas the spoofed utterances were based on the replayed version of the RedDots corpus [52]. Similar to ASVspoof 2015, EER was used as the primary metric to evaluate the performance of submitted countermeasures in ASVspoof 2017. A total of 49 submissions were received for the challenge. The best performing system, System S01, achieved an EER of 6.73%. There were six conditions (C1-C6), ranging from replayed recordings with background noise, which were comparatively easy to detect, to high-quality replayed recordings, which were difficult to detect. The performance of the countermeasures for condition C6 was consistently the worst. This indicates that the state-of-the-art countermeasures at that time were vulnerable to high-quality replay recordings used to spoof ASV systems. The comparison of the spoof detection rates for ASVspoof 2015 and ASVspoof 2017 suggests that the detection of replay attacks is more difficult than the detection of speech synthesis and voice conversion attacks. Hence, the work [51] highlighted that the generalization of countermeasures remains an open problem. Readers are referred to [22, 50] for more details on the ASVspoof 2017 Challenge.

Similar to the previous editions, the most recent ASVspoof 2019 was held as a special session at Interspeech 2019. The ASVspoof 2019 Challenge extended the previous ASVspoof challenges in several aspects [7]. First, ASVspoof 2019 covered all three main types of spoofing attacks, namely speech synthesis or text-to-speech (TTS), voice conversion, and replay attacks [131]. Second, it added the latest speech synthesis and voice conversion systems relative to ASVspoof 2015. Third, a more well-controlled evaluation setup was used for the assessment of replay countermeasures in ASVspoof 2019 compared to ASVspoof 2017. Lastly, ASVspoof 2019 aligned the countermeasures with the ASV system more closely than ASVspoof 2015 and ASVspoof 2017, which were focused on standalone countermeasures. Although the ASVspoof 2019 Challenge was still a standalone spoofing detection task, the adoption of the tandem Decision Cost Function (t-DCF) metric as the primary performance evaluation measure ensures that the results reflect the impact of countermeasures on the reliability of ASV systems. The use of EER as the only evaluation metric in previous ASVspoof challenges may not reflect the reliability of the countermeasures [124]. In the ASVspoof 2019 Challenge, 48 and 50 submissions were received for the Logical Access (LA) and Physical Access (PA) scenarios, respectively [124]. Both EER and t-DCF metrics were used in the evaluation. The best performing system for the LA scenario, System T05, achieved an EER of 0.22% and a t-DCF of 0.0069. The best performing system for the PA scenario, System T28, achieved an EER of 0.39% and a t-DCF of 0.0096. Nonetheless, the majority of countermeasures could not produce an EER of less than 5% in both the LA and PA scenarios.

The t-DCF is an evaluation metric proposed to address two shortcomings of EER. First, EER may not be a reliable performance measure when ASV and spoof countermeasures are combined. Second, EER may be biased against user authentication applications that have a high target user prior but a low spoofing attack prior, such as telephone banking. Hence, the work [53] proposed to migrate the performance evaluation from a countermeasure-centric view to an ASV-centric view with the aid of t-DCF. t-DCF is a generalized DCF metric that enables the evaluation of combined ASV and spoof countermeasure systems. It extends the conventional DCF used in ASV research to scenarios involving spoofing attacks. As there are two detection systems, namely the ASV system and the spoof countermeasure, each with two possible types of error, four costs were identified: the cost of the ASV system rejecting a target trial, the cost of the ASV system accepting a non-target trial, the cost of the countermeasure rejecting a human trial, and the cost of the countermeasure accepting a spoof trial. These four costs are used in the calculation of the t-DCF metric. In addition, the work presented in [53] analyzed the top-performing countermeasures in ASVspoof 2015 and 2017 with t-DCF while varying the spoofing attack prior. The results show that EER and t-DCF differ for higher priors, so some changes in ranking can be observed.
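To make the role of the four costs more concrete, the following Python sketch computes an unnormalized expected cost for a countermeasure placed in front of an ASV system. It is a simplified illustration under stated assumptions, not the exact definition from [53]: the two systems are assumed to make independent decisions, the parameter names and the dictionary layout are hypothetical, and no normalization is applied. Readers should consult [53] for the authoritative formulation.

```python
def tandem_cost(p_miss_cm, p_fa_cm, p_miss_asv, p_fa_asv, p_miss_spoof_asv,
                priors, costs):
    """Unnormalized expected cost of a countermeasure (CM) + ASV tandem.

    p_miss_cm:        CM rejects a genuine (human) trial
    p_fa_cm:          CM accepts a spoof trial
    p_miss_asv:       ASV rejects a target trial
    p_fa_asv:         ASV accepts a non-target trial
    p_miss_spoof_asv: ASV rejects a spoof trial
    """
    # Target passes the CM but is rejected by the ASV system.
    p_a = (1 - p_miss_cm) * p_miss_asv
    # Non-target passes the CM and is accepted by the ASV system.
    p_b = (1 - p_miss_cm) * p_fa_asv
    # Spoof passes the CM and is accepted by the ASV system.
    p_c = p_fa_cm * (1 - p_miss_spoof_asv)
    # Genuine target is rejected by the CM.
    p_d = p_miss_cm
    return (costs["miss_asv"] * priors["target"] * p_a
            + costs["fa_asv"] * priors["nontarget"] * p_b
            + costs["fa_cm"] * priors["spoof"] * p_c
            + costs["miss_cm"] * priors["target"] * p_d)

# Hypothetical priors and costs, chosen only for illustration.
PRIORS = {"target": 0.9, "nontarget": 0.05, "spoof": 0.05}
COSTS = {"miss_asv": 1.0, "fa_asv": 10.0, "miss_cm": 1.0, "fa_cm": 10.0}

perfect = tandem_cost(0.0, 0.0, 0.0, 0.0, 0.0, PRIORS, COSTS)
reject_all = tandem_cost(1.0, 0.0, 0.0, 0.0, 0.0, PRIORS, COSTS)
```

The sketch shows why the priors matter: a countermeasure that rejects every genuine trial incurs a cost proportional to the target prior, which is large in applications such as telephone banking.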

Meanwhile, some interesting research has been carried out on biases in model performance resulting from dataset artifacts [19]. In a first example, researchers investigated six features extracted from speech for replay attack detection using GMMs and then determined the factors that influence the predictions of the GMM models. As a result, the researchers uncovered a cue that the models were exploiting: initial silence frames of zeros that are present in genuine signals but absent in spoofed signals. The cue was found to cause the GMM-based spoof detection system to misclassify [17]. The researchers further investigated whether the bias caused by the cue can be resolved by eliminating the initial frames of zeros from the test files. The experiments showed that this approach helped reduce the error rate of the spoof detection systems. Although the vulnerability of ASV systems to spoofing attacks has initiated the development of countermeasures, there had been no research on what countermeasures actually learn in order to discriminate between genuine and spoofed speech. Recently, researchers investigated the local behaviour of a CNN-based replay detection system submitted to the ASVspoof 2017 Challenge using the SLIME algorithm [18]. They found that, for most of the spoofing instances, the investigated model was using the first 400 milliseconds of audio to make a prediction. This raises the issue of the trustworthiness of detection systems that exploit cues from the database which are unrelated to the problem being predicted.
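The mitigation described above, removing the initial frames of zeros before scoring, can be illustrated with a short Python sketch. The frame length, the tolerance, and the function name are assumptions made for this example; the cited works may have used a different procedure.

```python
def strip_leading_silence(samples, frame_len=160, tol=0.0):
    """Drop leading frames whose samples are all (near) zero.

    frame_len and tol are hypothetical defaults: 160 samples corresponds
    to 10 ms at a 16 kHz sampling rate, and tol=0.0 removes only exact zeros.
    """
    start = 0
    while (start + frame_len <= len(samples)
           and all(abs(x) <= tol for x in samples[start:start + frame_len])):
        start += frame_len
    return samples[start:]

# Two all-zero frames of length 4, then a frame that contains speech.
signal = [0] * 8 + [0, 0, 3, 1] + [2, 2, 2, 2]
trimmed = strip_leading_silence(signal, frame_len=4)
```

Only whole frames are removed, so a frame that mixes zeros and non-zero samples is kept intact.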

4.2 Voice PAD: The Taxonomy

In this section, a taxonomy of the recent work on PAD systems is presented. The taxonomy is built to summarize and provide a clearer picture of the focus and similarities of work on PAD. Seven attributes were selected for inclusion in the taxonomy. These attributes were chosen as they are prominent and can be found in all the articles surveyed. The seven attributes included in the taxonomy are types of presentation attack, features, classifiers, fusion, methodology, datasets, and evaluation criteria. Each attribute was categorized into groups and sub-attributes. For example, the ‘types of presentation attack’ attribute can be grouped into ‘device assisted attacks’ and ‘attacks that require no device (zero-effort)’. Works that focused on ‘device assisted attacks’ addressed either ‘replay’, ‘speech synthesis and voice conversion’, or ‘multiple types of attack’. Sections 4.2.1–4.2.7 describe each of these attributes in detail. Figure 4 summarizes the state-of-the-art voice PAD taxonomy.

Fig. 4 Summary of voice PAD taxonomy

4.2.1 Types of presentation attack

The first attribute considered in the taxonomy is the types of presentation attacks. The attacks can be grouped into two categories: (i) presentation attacks using electronic devices and (ii) presentation attacks without an electronic device. Most of the works found focused on detecting presentation attacks using electronic devices, which were conducted using replay, voice conversion, speech synthesis, or a combination of these.

In the presented taxonomy, speech synthesis and voice conversion were grouped as one subcategory because the two attacks are similar. They often require the use of an audio processor called a vocoder to produce an artificial voice. As these attacks require knowledge of signal processing, assistance from professionals may be needed. Some recent works on speech synthesis and voice conversion detection are [26, 41, 134].

Replay attack has been one of the main focuses of recent works, as it is the most straightforward presentation attack that an attacker can carry out with the aid of electronic devices. It can be launched quickly by covertly recording someone’s speech and playing the recording back to spoof the ASV system. Additionally, unlike speech synthesis and voice conversion attacks, a replay attack requires no skills or knowledge in signal processing. However, generating replayed speech in the manner of professional attackers, in order to yield large databases, does require laborious and time-consuming procedures. Owing to the limited availability of replay data, replay detection was not well generalized against unseen conditions, especially channel mismatch conditions [110]. As replay attacks do not require specialized knowledge, the threat of a replay attack can be considered more significant than that of voice conversion and speech synthesis attacks. Some recent works on replay attack detection are [3, 57, 92].

Some works employ a combination of all the attacks described above. In these cases, researchers proposed voice PAD systems that counter multiple types of device-assisted presentation attacks. In an actual situation, when an attacker launches a spoofing attack, the PAD system has no prior knowledge of the type of attack being used. For example, a PAD system developed to counter speech synthesis and voice conversion attacks may be ineffective against replay attacks, and vice versa. Recent work showed that a system that is effective against speech synthesis and voice conversion experienced a drastic performance decline when used to differentiate between genuine and replayed speech [142]. Hence, PAD systems that detect spoofing attacks regardless of attack type are needed. Some recent works on PAD capable of detecting multiple types of spoofing attacks are [59, 99, 119, 144].

On the other hand, impersonation attacks, or zero-effort imposters, were shown to be unable to penetrate most state-of-the-art ASV systems [56]. Owing to the unavailability of a public dataset for impersonation attacks, barely any research has been conducted on detecting impersonation attacks on ASV systems. One work [68], described in [56], experimented on the efficacy of impersonation and found that professional imitators were unable to pass ASV system authentication. However, other research showed the opposite. A recent work proposed a method for detecting speech impersonation [83]. As no public impersonation dataset was available for the task, high-quality impersonation speech data was collected. The impersonation corpus was made up of two speakers, with 40 genuine and 28 spoof samples. The work used MFCC as the feature and a CNN as the classifier for impersonation detection and recorded an EER of 35.85%. This result indicates a need to develop a robust countermeasure against impersonation, as professional impersonators may succeed in spoofing the ASV system.

4.2.2 Features

The second attribute presented in the taxonomy is features. Some works on PAD used a single feature, while others used more than one. In general, among the surveyed articles, more researchers used multiple features than a single feature.

Most single-feature PAD systems were based on, but not limited to, MFCC, CQCC, and Linear Predictive Coding (LPC). One of the popular features, MFCC, comprises the coefficients that collectively make up a Mel-Frequency Cepstrum (MFC) [13]. MFCC was found to be useful for speech synthesis and voice conversion detection [93], though it performed poorly in replay detection [132]. Another popular feature used for voice PAD is LPC, which is often used as an audio feature in speech recognition and speaker recognition. In particular, LPC is a technique used frequently in signal processing, in which linear predictive model information is used to represent the spectral envelope of the compressed speech signal [126]. The other popular feature used in PAD, CQCC, is a set of coefficients extracted from the Constant Q Transform (CQT). Recent work has shown that the application of CQCC in PAD outperformed that of MFCC [123]. It has also been shown that CQCC was one of the best performing features for voice spoofing detection [122].

As for multiple feature-based PAD systems, popular features include, but are not limited to, MFCC, CQCC, and Inverted Mel Frequency Cepstral Coefficients (IMFCC). IMFCC is a feature that contains complementary information to MFCC. As the name suggests, MFCC is based on the mel scale whereas IMFCC is based on the inverted mel scale. Recent works showed better PAD performance when multiple features were used, as different features contain complementary information that better discriminates genuine from spoof voice [35, 48, 142]. Moreover, using different features and models in classification through fusion yields a significant improvement in the performance of voice PAD [15].
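The contrast between the mel scale and the inverted mel scale can be sketched in terms of filterbank centre frequencies. The sketch below uses the common formula mel(f) = 2595 log10(1 + f/700) and assumes, for illustration only, that the inverted layout is obtained by mirroring the mel-spaced centres within the analysis band; the surveyed works may define the inverted scale differently, and all function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Common mel-scale formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centers(n_filters, f_min, f_max):
    """Filter centre frequencies spaced uniformly on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + step * (i + 1)) for i in range(n_filters)]


def inverted_mel_centers(n_filters, f_min, f_max):
    """Assumed inverted layout: mirror the mel centres inside [f_min, f_max]."""
    mirrored = [f_min + f_max - c for c in mel_centers(n_filters, f_min, f_max)]
    return sorted(mirrored)


mel = mel_centers(10, 0.0, 8000.0)
imel = inverted_mel_centers(10, 0.0, 8000.0)
print([round(c) for c in mel])
print([round(c) for c in imel])
```

Under this assumption the mel centres are dense at low frequencies and sparse at high frequencies, while the mirrored centres are dense at high frequencies, which is one way to see why the two features can carry complementary information.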

4.2.3 Classifiers

The third attribute of the taxonomy is classifiers. From the articles surveyed, classifiers used in voice PAD can be categorized into three groups, namely conventional, deep learning, and multiple classifiers. The most often used classifiers are conventional classifiers, followed by multiple classifiers and deep learning.

One of the most widely used conventional classifiers in recent works on PAD tasks was GMM, as it is an effective probabilistic model for speaker verification tasks [65]. Unlike in speaker verification, UBM adaptation is not required for spoof speech detection. GMM is used to classify genuine and spoof voices through a process similar to that of GMM-based speaker verification.
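A two-class GMM back end of this kind is commonly scored with a log-likelihood ratio between a genuine model and a spoof model. The following is a minimal sketch under stated assumptions: the GMMs have diagonal covariances, their parameters are hand-picked toy values rather than models fitted with EM on real features, and the function names are illustrative.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_loglik(x, gmm):
    """Log-likelihood of one frame; gmm is a list of (weight, mean, var)."""
    terms = [math.log(w) + diag_gaussian_logpdf(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(frames, genuine_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; a score > 0 favours genuine."""
    total = sum(gmm_loglik(f, genuine_gmm) - gmm_loglik(f, spoof_gmm) for f in frames)
    return total / len(frames)


# Toy two-dimensional models (assumed values, for illustration only).
genuine_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
spoof_gmm = [(1.0, [4.0, 4.0], [1.0, 1.0])]

print(llr_score([[0.1, -0.2], [0.9, 1.1]], genuine_gmm, spoof_gmm) > 0)  # True
print(llr_score([[4.0, 4.1], [3.8, 4.2]], genuine_gmm, spoof_gmm) > 0)   # False
```

The decision is obtained by comparing the averaged score with a threshold, so no speaker-specific adaptation step is involved.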

Besides, another conventional classifier, SVM, was also extensively used in recent works due to its excellent performance in classification tasks. In a recent work [66], SVM with a Radial Basis Function (RBF) kernel was found to outperform classifiers such as Decision Tree, Naive Bayes, and K-Nearest Neighbour (KNN), with an EER of 1% on the evaluation set of the ASVspoof 2019 PA dataset. Several other SVM kernels were also tested by the researchers, but none performed better than the RBF kernel. An interesting observation is that SVM with polynomial and RBF kernels produced superior detection results compared to SVM with a linear kernel, owing to non-linearities present in the 1st and 2nd order replay samples of the dataset.

Deep learning methods were also frequently applied in PAD tasks. Unlike conventional classifiers, deep learning is a branch of machine learning built on multi-layer networks that learn representations directly from data [39, 104]. Recent works have found that deep learning classifiers such as DNN [120], RNN [30], and CNN [88] are capable of automatic feature abstraction, through which more informative features can be identified. The informative features extracted from voices lead to better performance in voice PAD systems [88].

The application of multiple classifiers in PAD systems can also be found in the literature. In many domains involving machine learning and classification, it has been shown that applying ensemble classifiers may improve the performance of a system [42, 79]. Nevertheless, in the field of voice recognition, it has been shown that applying multiple classifiers with the same feature hardly improves the performance of a voice PAD system [15].

4.2.4 Fusion

The fourth attribute of the taxonomy is fusion. In the literature, a work may or may not apply fusion. Among the works that apply fusion, two main methods were employed, namely score fusion and feature fusion. Other fusion methods exist, but they are used in only a small number of works.

Score fusion is undertaken such that several scores generated by voice PAD models are considered in the classification decision [91, 128]. These scores are combined using sum, max, min, mean, standard deviation, weighted or normalized sum, etc. From the literature, score fusion is the most frequently used fusion approach in the voice PAD systems and has shown effectiveness in improving the detection rate.
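The combination rules listed above can be sketched in a few lines. The sketch assumes that each PAD model outputs one score per trial on a comparable scale; the function name `fuse_scores`, the rule names, and the example weights are illustrative.

```python
from statistics import mean


def fuse_scores(scores, rule="mean", weights=None):
    """Combine the scores of several PAD models into one fused score."""
    if rule == "sum":
        return sum(scores)
    if rule == "max":
        return max(scores)
    if rule == "min":
        return min(scores)
    if rule == "mean":
        return mean(scores)
    if rule == "weighted":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weighted fusion needs one weight per score")
        return sum(w * s for w, s in zip(weights, scores))
    raise ValueError(f"unknown fusion rule: {rule}")


# Scores from three hypothetical PAD models for one trial.
trial_scores = [0.2, 0.8, 0.5]
for rule in ("sum", "max", "min", "mean"):
    print(rule, fuse_scores(trial_scores, rule))
print("weighted", fuse_scores(trial_scores, "weighted", weights=[0.5, 0.25, 0.25]))
```

In practice the scores are usually normalized before fusion, and the weights are tuned on a development set rather than fixed by hand.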

Feature fusion is performed either via serial feature fusion or parallel feature fusion [118] to boost the recognition rate. Serial fusion is a method of fusing features by serially combining multiple feature vector sets into a single feature vector, called a serial fused feature. Unlike serial fusion, which is based on the union-vector, parallel feature fusion is based on a complex vector, a vector that has components of complex numbers. Between these two fusion strategies, parallel feature fusion has outperformed serial feature fusion for attack detection [141].
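The difference between the two strategies can be shown on two small feature vectors. This is a minimal sketch: serial fusion is plain concatenation, and parallel fusion is assumed here to place one feature set in the real part and the other in the imaginary part of a complex vector, with the shorter vector zero-padded. The zero-padding convention and the function names are assumptions made for illustration.

```python
def serial_fusion(a, b):
    """Union-vector: concatenate the two feature vectors."""
    return list(a) + list(b)


def parallel_fusion(a, b):
    """Complex vector a + i*b; the shorter vector is zero-padded (assumed)."""
    n = max(len(a), len(b))
    a_pad = list(a) + [0.0] * (n - len(a))
    b_pad = list(b) + [0.0] * (n - len(b))
    return [complex(x, y) for x, y in zip(a_pad, b_pad)]


feature_a = [1.0, 2.0, 3.0]  # e.g. three coefficients of one feature type
feature_b = [0.5, 0.25]      # e.g. two coefficients of another feature type

print(serial_fusion(feature_a, feature_b))    # [1.0, 2.0, 3.0, 0.5, 0.25]
print(parallel_fusion(feature_a, feature_b))  # [(1+0.5j), (2+0.25j), (3+0j)]
```

Serial fusion grows the dimensionality to the sum of the two lengths, whereas the parallel form keeps it at the longer of the two.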

Other fusion methods, such as the ensemble approach, were found to generalize better against spoofing of ASV systems. A recent work [78] introduced an end-to-end ensemble approach to jointly train two models that perform well on LA and PA attacks. A third model then learned from the outputs of the two models and yielded a single score as the detection output. Experimental results showed that the ensemble approach produced EERs of 9.87% and 1.75% on the evaluation sets of the LA and PA partitions of the ASVspoof 2019 dataset respectively. Though the performance on the LA task was poorer than on the PA task, the ensemble result on the LA task still improved on the EERs of 13-16% produced by each individual model.

4.2.5 Methodology

The fifth attribute of the taxonomy is methodology. The methodology attribute can be grouped into three categories, namely classic machine learning, end-to-end learning, and hybrid approach as shown in Fig. 5. Classic machine learning is the most common method used in the surveyed work. In classic machine learning, pre-determined features that are usually manually crafted were extracted from data samples and fed into the pre-determined classifiers to predict the class label of the data samples [31]. The feature extraction and classification are two separate modules in classic machine learning. For example, the official ASVspoof baseline system is a voice PAD system based on CQCC features front-end and a GMM classifier backend [22].

Fig. 5 Categories of the methodology used in recent voice PAD

In end-to-end learning, features are identified and learned from the data samples automatically by deep learning, jointly with the determination of the class label. Feature learning and classification therefore form a single module. Unlike classic machine learning, end-to-end learning handles the entire learning process from input data to output prediction. For example, in a recent work [25], a raw waveform-based deep learning spoof detection model acts jointly as feature extractor and classifier, with no pre- or post-processing of the input data needed.

As the name suggests, the hybrid approach used both classic machine learning and end-to-end learning in the architecture of the voice PAD system. The hybrid methodology takes advantage of the manually crafted and automatically extracted features. The features used are varied and may provide a better representation of the data. Though the hybrid methodology is rare in the context of the PAD system, the hybrid approach has been shown to attain better performance in spoof speech detection [14].

4.2.6 Datasets

The sixth attribute of the taxonomy is datasets. The datasets attribute can be grouped into public and private datasets. There are two commonly used datasets for voice PAD research, namely ASVspoof and AVspoof. Three Automatic Speaker Verification Spoofing and Countermeasures Challenges (ASVspoof) were organized previously, namely ASVspoof 2015, ASVspoof 2017, and ASVspoof 2019, for which the datasets were made publicly available for download. The ASVspoof 2015 dataset consists of speech synthesis and voice conversion attacks, whereas the ASVspoof 2017 dataset consists of replay attacks. The ASVspoof 2019 dataset contains speech synthesis, voice conversion, and replay attacks. The AVspoof dataset was also made publicly available, and it contains replay, speech synthesis, and voice conversion attacks. Details regarding the ASVspoof and AVspoof datasets can be found in [6] and [45] respectively.

ASVspoof was the most frequently used dataset among the surveyed articles. About 62% of the considered articles used ASVspoof datasets for experimentation. Some researchers applied multiple datasets for cross-database evaluation [86], but the number is limited. Furthermore, cross-dataset experiments in the surveyed articles have shown that state-of-the-art voice PAD systems were not well generalized, as performance degrades significantly when unseen spoofing attacks are encountered.

4.2.7 Evaluation criteria

The last attribute of the taxonomy is the performance evaluation criteria. From the surveyed articles, EER is the main criterion used for performance evaluation of voice PAD systems, as over three-quarters of the works evaluated their proposed PAD using EER. A limited number of the surveyed articles chose accuracy as the single performance evaluation criterion, for example, [44]. Some researchers evaluated their work using multiple performance evaluation criteria, such as EER with min-tDCF, EER with Half Total Error Rate (HTER), and False Match Rate (FMR) with False Non-match Rate (FNMR). In 2021, the proportion of recent works that used multiple evaluation criteria was recorded at 28.57%. Since not all studies were evaluated using the same criterion, comparing the works is problematic. Hence, some recent works such as [62, 140, 145] provided more than one evaluation criterion for performance comparison. A fair comparison of voice PAD in terms of performance may be made through the standardization of evaluation criteria.
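For reference, EER and HTER can both be computed from the same two sets of detection scores. The sketch below assumes that a higher score means "more likely genuine", approximates the EER by the candidate threshold at which the false acceptance rate (FAR) and false rejection rate (FRR) are closest, and takes HTER as the mean of FAR and FRR at a given threshold (typically one fixed beforehand on development data). The function names and toy scores are illustrative.

```python
def error_rates(genuine, spoof, threshold):
    """FAR: spoof trials accepted; FRR: genuine trials rejected."""
    far = sum(s >= threshold for s in spoof) / len(spoof)
    frr = sum(g < threshold for g in genuine) / len(genuine)
    return far, frr


def hter(genuine, spoof, threshold):
    """Half Total Error Rate at a fixed threshold."""
    far, frr = error_rates(genuine, spoof, threshold)
    return 0.5 * (far + frr)


def eer(genuine, spoof):
    """Approximate EER: operating point where |FAR - FRR| is smallest."""
    best_gap, best_rate = None, None
    for t in sorted(set(genuine) | set(spoof)):
        far, frr = error_rates(genuine, spoof, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, 0.5 * (far + frr)
    return best_rate


genuine_scores = [0.9, 0.8, 0.7, 0.4]
spoof_scores = [0.6, 0.3, 0.2, 0.1]
print(eer(genuine_scores, spoof_scores))        # 0.25
print(hter(genuine_scores, spoof_scores, 0.5))  # 0.25
```

Because the EER threshold is chosen on the evaluated data itself while the HTER threshold is fixed in advance, the two criteria are not directly interchangeable, which is part of why comparisons across works are difficult.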

4.3 An Analysis of the Trend of Voice PAD in Recent Years

This sub-section presents an analysis of voice PAD works in recent years. Statistical analyses of the trends in work on PAD were conducted to discover their limitations and subsequently project potential future work on voice PAD. Visualizations [75] were used to show the trends based on the attributes listed in the taxonomy of voice PAD described in the preceding section. Figures 6 to 20 visualize the trends; each is explained in detail.

Fig. 6 The trend of the detection of types of presentation attack by voice PAD in recent years

Figures 6 and 7 visualize the analyses of the type of presentation attack attribute. Based on Fig. 6, the proportion of work on PAD targeting speech synthesis and voice conversion decreased steadily from 2015 to 2019, whereas the proportions of work targeting replay and multiple attack types both increased from 2015 to 2018. From 2019 onwards, the proportion of work targeting replay attacks decreased, reaching 14.29% in 2021, while the proportions of work targeting speech synthesis and voice conversion and targeting multiple types of attacks each increased steadily to 42.86% in 2021. As a PAD system has no prior knowledge of the type of presentation attack before detection takes place, countermeasures that can detect multiple types of attacks have become the favorite. The decline in the proportion of speech synthesis and voice conversion work from 2015 to 2019 may be due to a shift in researchers’ attention to replay and multiple types of attacks during that period. Nonetheless, the overall trend shifted towards speech synthesis and voice conversion as well as multiple types of attacks in 2021. The increase in the proportion of work on multiple types of attack supports the significance of detecting presentation attacks regardless of type. Figure 7 shows that most of the past work focused on developing countermeasures for device-assisted attacks (98.84%). Only two works (1.16%) presented an approach to detect impersonation attacks.

Fig. 7 The proportion of device assisted and zero-effort imposter attacks as target detection in recent voice PAD

Figures 8 and 9 visualize the analyses of the feature attribute. As shown in Fig. 8, the majority of recent works on voice PAD used multiple features from 2015 to 2021. This is because multiple features contain complementary information that can be used to better discriminate genuine from spoof voice [35, 48, 142]. However, the share of works using multiple features fell to 57.14% in 2021. The two most frequently used individual features, CQCC and MFCC, became less preferred as the most recent work showed the superiority of multiple features in detecting presentation attacks [20, 112, 127]. In 2021, no work applied MFCC as a single feature for voice PAD. Overall, MFCC and CQCC were used as a single feature in 4.65% and 3.49% of the recent works respectively. Other features were used in 18.02% of the total considered work.

Fig. 8 The trend of features used in voice PAD in recent years

Fig. 9 The proportion of features used in recent voice PAD

Figures 10 and 11 visualize the analyses of the classifier attribute. As shown in Fig. 10, GMM was the most preferred classifier until 2018. The emergence of deep learning has influenced the selection of classifiers for voice PAD, as the number of works using deep learning increased steadily from 2016 to 2019. The use of multiple classifiers in the form of ensemble classifiers has also gained attention recently [37, 101]. This shift of interest in classifier selection may be due to the better detection results, compared with GMM, produced by deep learning, which is capable of feature abstraction, and by ensemble classifiers, which are capable of gathering complementary information [88]. Although the usage of GMM as an individual classifier has fluctuated, GMM is still the most frequently used classifier (43.02%), followed by multiple classifiers (27.33%) and deep learning (23.84%), as shown in Fig. 11. The application of other single classifiers was very limited in recent years, with SVM, HMM, and the remaining classifiers accounting for 2.33%, 0.58%, and 2.91% respectively.

Fig. 10 The trend of classifiers used in voice PAD in recent years

Fig. 11 The proportion of classifiers used in recent voice PAD

Figures 12, 13, and 14 visualize the analyses of the fusion attribute. Figure 12 shows the trend of fusion application in recent voice PAD. Score fusion was the most preferred fusion method from 2015 until the trend began to fluctuate in 2020. One interesting observation is that feature fusion received a mixed reception in the works considered. Since feature fusion can produce good detection results [20, 77, 130], it is conjectured that feature fusion could still be employed in the future. Most works did not use fusion from 2020 onwards. Nonetheless, of the works considered in this paper, 105 (61.05%) applied fusion, whereas 67 (38.95%) did not, as shown in Fig. 13. Figure 14 shows that, among the various fusion approaches, score fusion was used in 38.95% of the considered works. Feature fusion, multiple fusion, and other fusion methods together only slightly exceeded 20% in usage.
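The reported shares follow directly from the counts of works with and without fusion. As a small worked check, the following sketch recomputes them; the helper name `share` is illustrative.

```python
def share(count, total):
    """Percentage share rounded to two decimals, as reported in the text."""
    return round(100.0 * count / total, 2)


with_fusion, without_fusion = 105, 67
total_works = with_fusion + without_fusion
print(total_works, share(with_fusion, total_works), share(without_fusion, total_works))
# 172 61.05 38.95
```

The total of 172 matches the number of articles considered in this survey.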

Fig. 12 The trend of the application of fusion in voice PAD in recent years

Fig. 13 The proportion of recent PAD using fusion and no fusion

Fig. 14 The proportion of fusion used in recent voice PAD

Figures 15 and 16 visualize the analyses of the methodology attribute. Figure 15 shows the trend of the methodology used in recent voice PAD systems. The classic machine learning approach was the most preferred methodology in every recent year except 2020, when end-to-end learning was the most preferred approach. Overall, 83.14% of recent works on voice PAD used the classic machine learning approach. This is because classic machine learning was found to outperform the end-to-end approach consistently in recent work [129]. One interesting observation is that end-to-end learning received a mixed reception in recent works, such that no clear trend can be observed. Nonetheless, the end-to-end learning approach still accounted for more than 15% of the recent works considered. Only two recent works (1.16%) applied the hybrid approach in voice PAD.

Fig. 15 The trend of the methodology used in voice PAD in recent years

Fig. 16 The proportion of methodology used in voice PAD in recent years

Figures 17 and 18 visualize the analyses of the datasets attribute. ASVspoof datasets were the most commonly used voice PAD datasets in recent years, recording 73.84% of usage. Multiple datasets were also used in training and evaluating recent voice PAD, with a usage proportion of 13.95%. In other words, while only 13.95% of the recent works used multiple datasets, 86.05% used a single dataset for training and evaluation. This indicates that few cross-datasets evaluations were carried out in recent works on voice PAD, which may cause the proposed voice PAD systems to be less generalizable.

Fig. 17 The trend of datasets used in voice PAD in recent years

Fig. 18 The proportion of datasets used in recent voice PAD

Figures 19 and 20 visualize the analyses of the evaluation criteria attribute. EER is the most used criterion for evaluating the performance of voice PAD, with 72.67% usage across the years, as shown in Fig. 20. Although some works used multiple evaluation criteria (23.84%), only 3.49% of the recent works used accuracy as the single evaluation criterion. Nonetheless, there was a dramatic decline in the use of EER, as shown in Fig. 19. Meanwhile, the usage of multiple criteria increased steadily in recent years, except for a slight drop in 2018; it rose significantly to 56.25% in 2020 but dropped to 28.57% in 2021. The increase in the cumulative number of recent works that used different evaluation criteria indicates that a single evaluation metric may not be sufficient to show how well a PAD system performs.

Fig. 19 The trend of evaluation criteria used in voice PAD in recent years

Fig. 20 The proportion of evaluation criteria used in recent voice PAD

4.4 Research Gap and Future Direction of Voice PAD

This sub-section presents the research gap and corresponding future direction of voice PAD. The state-of-the-art voice PAD field suffers from several issues that can be found in the articles considered. Some of the proposed future works are designed to deal with these issues. Figure 21 summarizes the research gap and future direction of voice PAD. Details of the issues and potential future works are presented in Sections 4.4.1 and 4.4.2 respectively.

Fig. 21 Summary of research gap and future direction

4.4.1 Issues of Voice PAD

This section presents the issues of voice PAD found in the literature. Five main issues were identified: (i) spoof-type dependent PAD, (ii) difficulty in the generalization of the PAD systems, (iii) limited available datasets, (iv) limitation of conventional classifiers, and (v) lack of cross-datasets performance evaluation.

Spoof-type dependent PAD:

Most state-of-the-art voice spoofing countermeasures are specific to particular types of spoofing [28]. A PAD system that addresses speech synthesis and voice conversion may not be effective against a replay attack and vice versa. For example, a PAD system trained on a dataset of genuine speech, speech synthesis, and voice conversion attacks but tested on a dataset consisting only of genuine speech and replay attacks could not detect the replay attacks effectively [90].

Integrating several PAD systems [80] into an ASV system may be a possible solution, although at the cost of a response-time trade-off. Another concern is that, as most state-of-the-art PAD systems were proposed to detect specific types of spoofing attacks, adversaries may exploit this flaw to bypass PAD systems by applying multiple spoofing types in one spoofing attempt. For example, voice conversion can be applied on top of impersonation to boost spoofing effectiveness [136]. Therefore, a spoof-type dependent PAD system that is designed to detect only a single type of attack may not be able to detect such a combined attack.

Difficulty in generalization:

State-of-the-art PAD systems are not well generalized. Several cross-database evaluations have shown that the performance of voice PAD systems declines when they are trained on one dataset and tested on another dataset that contains different types of spoofing attacks [32, 43, 55]. The current state of PAD is thus still dataset-dependent in achieving low error rates (FAR, FRR, HTER, and EER). Different entities prepare datasets in different ways, with differing recording environments, recording devices, spoofing algorithms, and noise levels. As a result, audio quality differs across datasets. Hence, if a PAD system is not well generalized, its performance can be inconsistent when evaluated on different datasets.

Limited available datasets:

The limited availability of datasets has caused most PAD systems to be modeled using the text-independent method for voice surveillance applications [136]. Nevertheless, many ASV systems have been developed using text-dependent modeling techniques for authentication purposes [135]. In order to evaluate the performance of speaker verification with PAD, datasets that can be used for both speaker verification and PAD are needed, meaning that both the speaker identity label and the spoof-genuine label must be available in the datasets. The limited availability of such datasets has restricted the validation of the effectiveness of PAD systems in actual situations.

Limitation of conventional classifiers:

Conventional classifiers like GMM-UBM for speaker identification and verification are vulnerable to voice conversion attacks [84]. Since most current speaker verification systems are GMM-based, efforts to incorporate additional steps in GMM-based speaker verification systems to capture artificial signals are necessary [27]. This is because conventional classifiers such as GMM-UBM and SVM lack the capability of feature abstraction found in deep learning classifiers such as DNN, RNN, and CNN [88]. Compared to using conventional classifiers alone, a fusion of classifiers of different natures may provide complementary information that better distinguishes spoofed from genuine voices [88].

Lack of cross-datasets performance evaluation:

In a recent study, it has been shown that current voice PAD systems were not well generalized, as the performance of voice PAD degrades when evaluated using different datasets [54]. To determine whether a voice PAD system is robust enough against unseen spoofing attacks, a different dataset can be used in model evaluation. This evaluation process is known as cross-datasets evaluation. However, most of the current voice PAD systems (86.05%) were evaluated on a single dataset, as shown in Fig. 18. Therefore, cross-datasets evaluations are needed to ensure the proposed voice PAD system is robust enough against unseen spoofing attacks.
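The cross-datasets protocol amounts to training on one corpus and testing on every corpus, including those not seen in training. The following toy sketch illustrates the bookkeeping only: the "model" is a single threshold on a one-dimensional feature, the two corpora are made-up values with a shifted channel, and all names are hypothetical.

```python
def fit_threshold(samples):
    """Toy 'training': midpoint between the class means (label 1 = genuine)."""
    genuine = [x for x, label in samples if label == 1]
    spoof = [x for x, label in samples if label == 0]
    return 0.5 * (sum(genuine) / len(genuine) + sum(spoof) / len(spoof))


def error_rate(threshold, samples):
    """Fraction of samples whose predicted class differs from the label."""
    wrong = sum((x >= threshold) != (label == 1) for x, label in samples)
    return wrong / len(samples)


def cross_dataset_matrix(datasets):
    """datasets: name -> (train, test). Returns {(train_name, test_name): error}."""
    results = {}
    for train_name, (train, _) in datasets.items():
        threshold = fit_threshold(train)
        for test_name, (_, test) in datasets.items():
            results[(train_name, test_name)] = error_rate(threshold, test)
    return results


datasets = {
    "A": ([(1.0, 1), (0.9, 1), (0.0, 0), (0.1, 0)], [(0.8, 1), (0.2, 0)]),
    "B": ([(0.4, 1), (0.5, 1), (-0.6, 0), (-0.5, 0)],
          [(0.45, 1), (0.3, 1), (-0.55, 0), (-0.4, 0)]),
}
for pair, err in sorted(cross_dataset_matrix(datasets).items()):
    print(pair, err)
```

In this toy case the within-corpus error is zero while both cross-corpus errors are 0.5, which mirrors the kind of degradation that a single-dataset evaluation cannot reveal.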

4.4.2 Potential future works

This section presents the possible future works that should be considered to improve the performance of voice PAD and speaker verification.

The priority of replay attack detection:

The threat level posed by the replay attack to ASV is significant. From the results of the ASVspoof 2019 Challenge, replay attacks of higher quality were difficult for state-of-the-art PAD systems to detect [124]. In addition, most voice PAD systems were evaluated using a corpus made up of first-order replayed recordings (replayed once), and the detection of multi-order replay attacks (replayed multiple times) has not been addressed [9]. Hence, more work should be directed at replay attack detection, while spoof detection should cover both replay and artificial speech (speech synthesis and voice conversion). As described in Section 1, anyone can initiate a replay attack simply by using an electronic voice recorder or smartphone. No specific skills in signal processing are required to launch replay attacks. In the future, researchers may place replay attack detection as the first layer of security in ASV against presentation attacks when designing voice PAD [136].

Noise resilient PAD:

Most performance evaluations of proposed PAD systems consider only identical conditions for both training and testing. In reality, this is not the case, as the conditions under which voice input is captured from users may vary. Variable capture conditions, such as background noise, degrade the quality of the voice captured by ASV systems. Similarly, in noisy conditions, the performance of PAD systems degrades significantly [29]. Hence, future work may include a variety of background noise conditions in the performance evaluation of PAD systems to produce PAD systems that are resilient to acoustic mismatch conditions [136].
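One common way to create such noisy evaluation conditions is to add background noise to clean recordings at controlled signal-to-noise ratios (SNRs). The sketch below assumes signals are plain lists of samples and uses a synthetic tone and seeded Gaussian noise in place of real speech and real background recordings; the function names are illustrative.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling it to the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals: a 200 Hz tone sampled at 8 kHz and seeded Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2.0 * math.pi * 200.0 * t / 8000.0) for t in range(800)]
noise = [rng.gauss(0.0, 1.0) for _ in range(800)]

# One noisy copy per test condition, from mild (20 dB) to severe (0 dB).
noisy_by_snr = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 0)}
```

Evaluating the same PAD system on each noisy copy gives a performance curve over SNR, which makes the degradation under acoustic mismatch explicit.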

Cross-datasets performance evaluation:

As mentioned previously, current PAD systems are not well generalized, as the performance of PAD in cross-datasets evaluation tends to degrade [54]. To verify whether the proposed PAD systems are well generalized, cross-datasets evaluation can be performed. If a voice PAD system achieves consistent performance across different datasets, then the robustness of the system is justifiable [63]. However, few recent works applied cross-datasets evaluation to show the performance of the proposed PAD systems against unseen attacks. As technology keeps evolving, the methods to spoof ASV systems will increase. Hence, it is crucial to introduce a robust PAD system that is well generalized.

Robust PAD for ASV in Smart Home:

As the concept of Smart Home promotes hands-free operation and automation [72], biometric technologies, including voice recognition, are the best methods to be used for access control and personalization [125]. However, current speaker recognition systems can be vulnerable to presentation attacks, as the existing PAD systems are highly specific in the presentation attacks they detect [28]. Hence, most of these PAD systems are inapplicable in actual situations, where the types of presentation attacks are unknown [28]. When applied to Smart Home, the threat of presentation attacks would become significant. Therefore, there is an urgent need to develop a robust PAD system to secure ASV systems so that the adoption of ASV in Smart Home applications can be accelerated.

Fusion for robust voice PAD.

Last but not least, fusion can serve as an effective choice to improve the performance of voice PAD. Although a variety of fusion methods exist, such as score fusion and feature fusion, only 63.76% of the recent works applied fusion to enhance the performance of voice PAD. The additional computation required for fusion increases the computation time of voice PAD, which may be the reason that the remaining 36.24% of the recent works did not apply fusion. Future work should therefore review whether the improvement in detection rate is worth the additional computation time induced by fusion.
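Score fusion, one of the fusion methods mentioned above, can be illustrated with a weighted average of the scores produced by several PAD subsystems for the same trials. This is only a minimal sketch: the function name `fuse_scores` is hypothetical, the subsystem scores are assumed to be on comparable scales already, and the weights are assumed to be chosen by the designer rather than learned.

```python
def fuse_scores(system_scores, weights=None):
    """Weighted-average score fusion.

    system_scores: one list of scores per subsystem, all of the same length,
    where position i in every list refers to the same trial.
    weights: optional per-subsystem weights; equal weights if omitted.
    """
    n_systems = len(system_scores)
    if n_systems == 0:
        raise ValueError("at least one subsystem is required")
    n_trials = len(system_scores[0])
    if any(len(scores) != n_trials for scores in system_scores):
        raise ValueError("all subsystems must score the same trials")
    if weights is None:
        weights = [1.0] * n_systems
    if len(weights) != n_systems:
        raise ValueError("one weight per subsystem is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Normalize by the weight sum so fused scores stay on the input scale.
    return [
        sum(w * scores[i] for w, scores in zip(weights, system_scores)) / total
        for i in range(n_trials)
    ]


if __name__ == "__main__":
    subsystem_a = [0.2, 0.8]  # toy scores for two trials
    subsystem_b = [0.4, 0.6]
    print(fuse_scores([subsystem_a, subsystem_b]))
    print(fuse_scores([subsystem_a, subsystem_b], weights=[3, 1]))
```

Each additional subsystem adds its own scoring cost per trial, which is the computation-time side of the trade-off discussed above.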

5 Conclusion

As time progresses, more works on voice PAD are being published. However, few systematic surveys are available on the current state of research and application. To the best of our knowledge, most existing papers do not provide a detailed taxonomy of recent voice PAD systems. This paper therefore offers an extensive survey of speaker verification systems, spoofing attacks, and voice PAD methods for securing speaker verification systems, covering works published from 2015 to 2019. A total of 172 Scopus-indexed articles on voice PAD were considered in producing this survey.

To understand the trends in work on voice PAD systems, a taxonomy of state-of-the-art voice PAD systems was built from the surveyed works. Analyses of the trends in recent works on PAD, based on the attributes identified in the taxonomy, were also presented. The analyses show that researchers' interest in developing models to detect multiple types of attacks is increasing. Furthermore, the use of deep learning classifiers for voice PAD has also increased since 2016, although GMM is still the most frequently used classifier.

The research gap and future direction of voice PAD were subsequently established and described in this paper. There were five existing issues of voice PAD identified, namely spoof-type dependent PAD, difficulty in generalization, limited available datasets, limitation of conventional classifiers, and lack of cross-datasets performance evaluation. Five potential works for voice PAD were suggested to resolve the identified issues, namely priority of replay attack detection, noise resilient PAD, cross-datasets performance evaluation, robust PAD for ASV in Smart Home, and fusion for robust voice PAD.

To conclude, investigating how voice PAD has been employed in ASV systems is important for ensuring that future research concentrates on the right dimensions of voice PAD, thereby improving the performance of voice PAD systems. The presented taxonomy could be used by other researchers to plan their research contributions and activities. The potential future directions identified could further enhance the efficiency and increase the number of applications of voice PAD systems.