Introduction

Automatic speaker recognition identifies a person from the characteristics of his or her voice and has been applied in voice interaction systems such as smartphones [1], front-ends of voice wake-up devices [2], and on-site access control for secured rooms [3]. However, recent works have shown that speaker recognition is vulnerable to malicious attacks, e.g., spoofing and adversarial attacks. In a spoofing attack, the crafted audio sounds like the target victim and can gain access to speaker recognition systems [4, 5]; typical spoofing methods include impersonation, replay, speech synthesis, and voice conversion [6, 7]. In contrast, the audio used in an adversarial attack does not have to sound like the target victim yet can still gain access to the speaker recognition system, e.g., the white-box attacks in [8,9,10,11,12,13] and the black-box attacks in [8, 9, 14, 15]. Numerous methods have been proposed for conducting adversarial attacks on image classification, including box-constrained L-BFGS [16], the Fast Gradient Sign Method (FGSM) [17], the Basic Iterative Method (BIM) [18], the Jacobian-based Saliency Map Attack (JSMA) [19], the One Pixel Attack [20], the Carlini and Wagner attacks (C&W) [21], DeepFool [22], and Universal Adversarial Perturbations [23]. Unlike the extensive studies on image classification, adversarial attacks on speaker recognition have only emerged recently, as presented in a recent work [24].

A straightforward way to protect real-world voice-based biometric authentication systems is to deploy fraud detection systems [25]. However, most previous fraud detection works have not considered small perturbations such as adversarial examples [26], and such systems may be ineffective at detecting adversarial examples because these examples differ only slightly from their original counterparts.

In this paper, we consider the attacking scenario depicted in Fig. 1. The attack proceeds as follows. An attacker first has the source voice. An adversarial learning algorithm then generates a perturbation signal based on the source voice and adds the two together to obtain an adversarial voice. Human listeners cannot tell the difference between the source voice and the adversarial one, yet a speaker recognition system can be fooled into recognizing the adversarial voice as coming from speaker A. Given this property, adversarial examples can be utilized to protect the privacy of users of voice interfaces by preventing their voices from being identified by a speaker recognition system. Therefore, our work could reduce the chance of malicious use of one's voice biometrics.

Fig. 1

Attacking scenario of this paper. The blue dotted line represents the generation process of adversarial perturbations, and the black solid line represents the conducting process of an adversarial attack. An attacker first has the source voice. An adversarial learning algorithm subsequently generates a perturbation signal based on the source voice and adds them together to obtain an adversarial voice. Human listeners cannot tell the difference between the source voice and the adversarial one. However, the speaker recognition model will be fooled to recognize the adversarial voice as from speaker A in miscellaneous tasks

A feasible method for crafting an adversarial example is to add a tiny amount of well-tuned additive perturbation to the source voice to fool speaker recognition. There are generally two kinds of attacks, depending on how the recognizer is fooled. One is the untargeted attack, in which speaker recognition fails to identify the correct identity of the modified voice. The other is the targeted attack, in which speaker recognition recognizes the identity of an adversarial sample as a specific speaker. Based on experience with adversarial attacks on image classification [27] and speech recognition [28], the targeted attack is much more challenging than the untargeted one. In this paper, we investigate targeted attacks on automatic speaker recognition.

A key property of a successful adversarial attack is that the difference between the adversarial sample and the source one should be imperceptible to human perception. Unfortunately, some studies have not paid enough attention to this property, and their additive perturbations can be too large to remain imperceptible to human listening, as illustrated in Fig. 1; e.g., significant background noise was introduced to conduct a successful attack but also deteriorated the quality of the source voice [14]. A feasible solution is to consider the psychoacoustic properties of sounds, as studied in [12] and [29]. In this paper, inspired by auditory masking, we improve the imperceptibility of adversarial samples by constraining both the number and the amplitudes of the adversarial perturbations.

The less prior knowledge an attack requires, the easier it is to conduct in practice. Assuming that an attacker does not have any knowledge of the inner configuration of the recognition system, we focus on the black-box adversarial attack in this paper, where an attacker can at most access the decision results or the prediction scores, following the definitions in [30]. However, black-box attacks are much more difficult than white-box attacks, as reported in [14]. An algorithm assisted by differential evolution was proposed in [20] to perform black-box attacks on image classification by modifying only a few pixels to create an adversarial image. In this paper, we craft adversarial audio samples by modifying only a subset of the sample points of an utterance. Since excessively large amplitudes in audio samples would produce harsh noises, constraints on the amplitudes of adversarial perturbations are also imposed in our methods.

With the widespread use of time–frequency features as the inputs of speaker recognition, Mel-spectrum, Mel-frequency cepstral coefficients (MFCCs) and log power magnitude spectrums (LPMSs) were utilized to generate adversarial features in [8] and [9], where the attacks were performed in the feature space rather than on time-domain signals. In this paper, we generate adversarial perturbations directly at the waveform level (not on the spectrogram) to yield high-quality samples for attacking, which is more realistic in real-life scenarios as pointed out in [13].

There are three typical tasks in automatic speaker recognition, namely open-set identification (OSI) [31], close-set identification (CSI) [32] and automatic speaker verification (ASV) [33]. Existing works have investigated attacks on CSI [10,11,12,13], ASV [8, 9] and OSI [14]. In this paper, we comprehensively study targeted adversarial attacks on all three tasks within the proposed framework. Comparison and analysis with respect to the existing works will also be presented.

In summary, our method has the following properties: imperceptibility, black-box operation, waveform-level perturbation, and applicability to multiple tasks. An overview of the workflow of the proposed method is shown in Fig. 1. The rest of the paper is organized as follows. In the second section, we first revisit state-of-the-art speaker recognition systems and three typical tasks and then describe the configurations of adversarial attacks. The proposed algorithms are presented in the third section. Experimental settings are described in the fourth section, and the results and analysis are reported in the fifth section. The conclusion is given in the sixth section.

Speaker recognition and adversarial attacks

In this section, state-of-the-art speaker recognition systems and three typical tasks involved in speaker recognition are introduced. The configurations of adversarial attacks are also presented.

Speaker recognition: tasks and models

Tasks and their decision functions

We consider three typical tasks in speaker recognition, OSI, CSI and ASV.

An OSI system enrolls multiple speakers in the enrollment stage, namely a group of speakers G with IDs {1, 2, …, n}. In the subsequent testing stage, for an arbitrary testing voice x, the system decides whether x is from one of the enrolled speakers or none of them, according to the similarity of x with respect to the enrolled utterances of the speakers in G. A predefined threshold θ is used to make the binary decision; the decision function D(x) in OSI is hereby

$$ D({\mathbf{x}}) = \left\{ {\begin{array}{*{20}ll} {\mathop {\arg \max }\limits_{i \in G} f_{i} ({\mathbf{x}})} & {{\text{if}}\;\mathop {\max }\limits_{i \in G} f_{i} ({\mathbf{x}}) \ge \theta ,} \\ {0, } & {\text{otherwise,}} \\ \end{array} } \right. $$
(1)

where \(f_{i} ({\mathbf{x}})\) denotes the similarity score of x on the utterances of the enrolled speaker i. Intuitively, the system classifies the input utterance x as from speaker i if and only if the score \(f_{i} ({\mathbf{x}})\) is the highest one and not lower than the threshold θ. Therefore, a successful targeted attack on OSI should satisfy two conditions: (1) the score on the target speaker is the highest among G and (2) the score is not lower than θ.

CSI is a task to identify the identity of a speaker in a closed set, i.e., an input utterance will always be classified as from one of the enrolled speakers. The decision function is thus

$$ D({\mathbf{x}}) = \mathop {\arg \max }\limits_{i \in G} f_{i} ({\mathbf{x}}). $$
(2)

A successful targeted attack on CSI only needs to make the score of the target speaker the highest one among G.

ASV is a task to verify whether an utterance is spoken by the claimed speaker. ASV has exactly one enrolled speaker, and its decision function is thus

$$ D({\mathbf{x}}) = \left\{ {\begin{array}{*{20}ll} {1,} & {{\text{if}}\; \, f({\mathbf{x}}) \ge \theta ,} \\ {0, } & {\text{otherwise,}} \\ \end{array} } \right. $$
(3)

where f(x) is the score of the testing utterance x on the enrolled speaker. A successful attack on ASV hence needs to make the score large enough to exceed the threshold of the system. As seen above, ASV is a special case of OSI, but it is studied separately in this paper given its importance and widespread use in biometric authentication.
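For concreteness, the three decision rules in Eqs. (1)–(3) can be summarized in a short sketch. This is a minimal illustration rather than any system's actual implementation; the array `scores` (one similarity score per enrolled speaker) and the threshold `theta` are assumed to come from whichever back-end (GMM–UBM, i-vector or x-vector) is in use.

```python
import numpy as np

def decide_osi(scores, theta):
    """Eq. (1): return the best-scoring enrolled speaker (1-based ID)
    if its score reaches the threshold, otherwise 0 (reject)."""
    i = int(np.argmax(scores))
    return i + 1 if scores[i] >= theta else 0

def decide_csi(scores):
    """Eq. (2): always return the best-scoring enrolled speaker."""
    return int(np.argmax(scores)) + 1

def decide_asv(score, theta):
    """Eq. (3): accept (1) if the single enrolled speaker's score
    reaches the threshold, otherwise reject (0)."""
    return 1 if score >= theta else 0
```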

State-of-the-art models

In this paper, state-of-the-art speaker recognition systems, the Gaussian Mixture Model–Universal Background Model (GMM–UBM), i-vector and x-vector, are considered in our experiments.

GMM–UBM is a traditional statistics-based method in speaker recognition, which has been utilized in both academia and industry [33]. Our study on the vulnerability of GMM–UBM helps to understand the weaknesses of statistics-based speaker recognition.

An i-vector system consists of two basic components: GMM–UBM and a total variability matrix [34]. i-vector had been a benchmark method for implementing speaker recognition systems before the emergence of deep learning-based methods. The extraction of an i-vector is essentially a factor analysis process, which tends to yield high accuracy on long input utterances. Our study on the vulnerability of i-vector helps to understand the vulnerability of models based on factor analysis.

In an x-vector system, a Time Delay Neural Network (TDNN) is used to produce speaker-discriminative embeddings [35]. Network architectures have recently been studied extensively towards better representation of speaker characteristics, among which x-vector is an open-source benchmark. Our study on adversarial examples against the x-vector system helps to understand the weaknesses of speaker recognition systems based on neural networks.

General formulation of adversarial attacks

Given an audio sample x, the key step in crafting an adversarial sample is to obtain the perturbation e(x) by solving the following optimization problem:

$$ \begin{array}{*{20}l} {\mathop {\min }\limits_{{e({\mathbf{x}})}} \quad Q(D({\mathbf{x}} + e({\mathbf{x}})), y_{{{\text{tar}}}} )} \\ \text{s.t.} \quad \left\| {e({\mathbf{x}})} \right\|_{p} < \varepsilon , \\ \end{array} $$
(4)

The goal is to fool the classifier into producing an erroneous output for \({\hat{\mathbf{x}}} = {\mathbf{x}} + e({\mathbf{x}})\), where s.t. means “subject to” and ||⋅||p denotes the p-norm. In other words, if the label of the target victim speaker is \(y_{{{\text{tar}}}}\), a successful attack fools the classifier (e.g., GMM–UBM, i-vector or x-vector) into producing ytar for the perturbed sample \({\hat{\mathbf{x}}}\). In this paper, the two norms \(l_{\infty }\) and \(l_{{0}}\) are taken as ||⋅||p to constrain the maximum amplitude and the length of the perturbation, respectively.

The proposed methods

In this section, the procedures of generating imperceptible, black-box, waveform-level adversarial examples are formulated into an optimization problem. Specifically, objective functions are configured for the tasks of OSI, CSI, and ASV, respectively.

Given our motivation of performing waveform-level adversarial attacks, an input utterance is represented by a vector in which each entry is a sample point. Let D be the target speaker recognition model, and let \({\mathbf{x}} = (x_{1} , \ldots ,x_{n} )\) be the source utterance with \(x_{i}\) denoting the i-th sample point. The adversarial attack deceives D in Eqs. (1), (2) and (3) by modifying x with an additive adversarial perturbation vector \(e({\mathbf{x}}) = (e_{1} , \ldots ,e_{n} )\), i.e., \({\hat{\mathbf{x}}} = {\mathbf{x}} + e({\mathbf{x}})\). As explained in the Introduction, the difference between \({\hat{\mathbf{x}}}\) and x should be imperceptible, and D should classify \({\hat{\mathbf{x}}}\) as the target speaker predefined by the attacker. Therefore, \(l_{{0}}\) is introduced to measure the number of non-zero sample points in \(e({\mathbf{x}})\), with d as its upper bound, and \(l_{\infty }\) is introduced to constrain the amplitude of \(e({\mathbf{x}})\) to be less than a given small tolerance ζ. Solving for the perturbation e(x) thus boils down to the following constrained optimization problem,

$$ \begin{array}{*{20}ll} {\mathop {\min }\limits_{{e({\mathbf{x}})}} \quad Q_{0} (D({\mathbf{x}} + e({\mathbf{x}})),y_{{{\text{tar}}}} ) + \lambda Q_{1} ({\mathbf{x}},e({\mathbf{x}}))} \\ {\text{s.t.} \quad \left\| {e({\mathbf{x}})} \right\|_{0} \le d\;{\text{and}}\;\left\| {e({\mathbf{x}})} \right\|_{\infty } \le \zeta ,} \\ \end{array} $$
(5)

where Q0 is a loss function that evaluates whether the modified signal \({\mathbf{x}} + e({\mathbf{x}})\) has been successfully classified as the target speaker ytar. The configuration of Q0 for OSI, CSI and ASV will be presented in the “Attacks on OSI”, “Attacks on CSI” and “Attacks on ASV” sections. Q1 is a regularization term proposed in this paper to further improve imperceptibility by considering the energy distribution over x; it is inspired by auditory masking and will be presented in the “Regularization to promote imperceptibility” section.
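As a hedged illustration of how a candidate perturbation satisfying the constraints of (5) might be represented, the sketch below builds e(x) from a set of at most d sample indices and per-index amplitudes clipped to [−ζ, ζ]. The function name and the index/amplitude encoding are our own assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def build_perturbation(n, indices, amplitudes, zeta):
    """Construct e(x) of length n that is non-zero on at most
    len(indices) sample points (l0 constraint) and whose amplitude
    is clipped to [-zeta, zeta] (l_inf constraint)."""
    e = np.zeros(n)
    e[np.asarray(indices, dtype=int)] = np.clip(amplitudes, -zeta, zeta)
    return e

# Example: an utterance of n = 16000 points, at most d = n // 3 of
# which may be modified, with maximum amplitude zeta = 0.003.
n, zeta = 16000, 0.003
d = n // 3
rng = np.random.default_rng(0)
indices = rng.choice(n, size=d, replace=False)
amplitudes = rng.uniform(-zeta, zeta, size=d)
e = build_perturbation(n, indices, amplitudes, zeta)
assert np.count_nonzero(e) <= d and np.abs(e).max() <= zeta
```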

Attacks on OSI

The goal of the targeted attack on OSI is to find an optimized perturbation \(e^{ * } ({\mathbf{x}})\) such that \({\hat{\mathbf{x}}} = {\mathbf{x}}{ + }e^{ * } ({\mathbf{x}})\) can be classified as some target speaker with \(tar \in G\). To conduct a successful attack on OSI, the following two conditions should be satisfied simultaneously: the score \(f_{tar} ({\hat{\mathbf{x}}})\) of the target speaker tar should be the highest one among all the enrolled speakers in G, and not lower than the preset threshold θ. Given how OSI makes decisions in (1), the loss function Q0 to be minimized is hereby

$$ Q_{0} ({\mathbf{x}}) = \max (\theta ,\mathop {\max }\limits_{j \in G,j \ne tar} f_{j} ({\hat{\mathbf{x}}})) - f_{tar} ({\hat{\mathbf{x}}}), $$
(6)

where \(f_{tar} ({\hat{\mathbf{x}}})\) and \(\max_{j \in G,j \ne tar} f_{j} ({\hat{\mathbf{x}}})\) refer to the score on the target speaker tar and the highest score among the remaining speakers, respectively. The objective function tries to increase the gap between \(f_{tar} ({\hat{\mathbf{x}}})\) and \(\max (\theta ,\max_{j \in G,j \ne tar} f_{j} ({\hat{\mathbf{x}}}))\). During optimization, if \(\max (\theta ,\max_{j \in G,j \ne tar} f_{j} ({\hat{\mathbf{x}}})) = \theta\), all the remaining enrolled speakers already score below the threshold, so one only needs to ensure that \(f_{tar} ({\hat{\mathbf{x}}})\) becomes larger than θ. If \(\max (\theta ,\max_{j \in G,j \ne tar} f_{j} ({\hat{\mathbf{x}}})) = \max_{j \in G,j \ne tar} f_{j} ({\hat{\mathbf{x}}})\), one has to increase the gap between the target speaker and the best-scoring one among the remaining speakers, i.e., the gap between \(f_{tar} ({\hat{\mathbf{x}}})\) and \(\max_{j \in G,j \ne tar} f_{j} ({\hat{\mathbf{x}}})\).

Attacks on CSI

Given the closed set involved in CSI in (2), the aim is to find a small perturbation e(x) such that the score on the target speaker is as large as possible, while the highest score among the other speakers is as small as possible. The loss function to obtain \({\hat{\mathbf{x}}}\) for a successful targeted attack on CSI is thus

$$ Q_{0} ({\mathbf{x}}) = \mathop {\max }\limits_{j \in G,j \ne tar} f_{j} ({\hat{\mathbf{x}}}) - f_{tar} ({\hat{\mathbf{x}}}). $$
(7)

Attacks on ASV

As an important special case of OSI, ASV has exactly one enrolled speaker and verifies whether the input voice is from that enrolled speaker. Given how ASV makes decisions in (3), the loss function for crafting adversarial samples in ASV is

$$ Q_{0} ({\mathbf{x}}) = \theta - f_{tar} ({\hat{\mathbf{x}}}). $$
(8)

The objective is to make the score on the target speaker \(f_{tar} ({\hat{\mathbf{x}}})\) large enough to exceed the threshold θ.
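The three task-specific losses in Eqs. (6)–(8) differ only in how the target score is compared against the competition. The following minimal sketch (assuming `scores` holds the scores fj(x̂) of all enrolled speakers, indexed from 0, and `tar` is the index of the target speaker) is given for illustration only.

```python
import numpy as np

def q0_osi(scores, tar, theta):
    """Eq. (6): gap between max(theta, best non-target score) and the
    target score; minimizing it pushes the target score above both the
    threshold and every other enrolled speaker."""
    others = np.delete(np.asarray(scores, dtype=float), tar)
    return max(theta, others.max()) - scores[tar]

def q0_csi(scores, tar):
    """Eq. (7): gap between the best non-target score and the target score."""
    others = np.delete(np.asarray(scores, dtype=float), tar)
    return others.max() - scores[tar]

def q0_asv(score, theta):
    """Eq. (8): gap between the threshold and the (single) target score."""
    return theta - score
```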

Regularization to promote imperceptibility

Beyond the constraints on e(x) in (5), the imperceptibility of the adversarial perturbation can be further enhanced in an adaptive way by considering the energy distribution of x. This relies on the auditory masking effect. In psychoacoustics, auditory masking occurs when the perception of one sound (the reference sound) is affected by the presence of another sound (the masking sound) [36]. Inspired by auditory masking, a regularization term is proposed to guide the design of \(e({\mathbf{x}})\), that is

$$ Q_{1} ({\mathbf{x}},e({\mathbf{x}})) = \sum\limits_{i = 1}^{N} {\frac{{\left| {e_{i} ({\mathbf{x}})} \right|}}{{\left| {x_{i} } \right| + \varepsilon }}} , $$
(9)

where ε is a very small constant to avoid numerical errors. The regularization term reflects the sample-level amplitude ratio of the perturbation to the original waveform. As part of the loss function (5), it is minimized together with the rest of the objective: a large ei(x) is allowed where |xi| is large, but penalized where |xi| is small. This tends to produce perturbations whose shape follows the input signal, as shown in panel (6) of Fig. 2. The intuitive interpretation of (9) is that more perturbation is placed on the segments of x with high energy, which reduces the audibility of the reference sound e(x) when it is played together with the masking sound x.
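The masking-inspired regularizer in Eq. (9) can be computed directly on the waveform. The sketch below is a minimal rendering under the assumption that the absolute perturbation amplitude is penalized, with a small eps guarding against division by zero.

```python
import numpy as np

def q1_masking(x, e, eps=1e-8):
    """Eq. (9): sum of sample-wise ratios of perturbation amplitude to
    signal amplitude. Large |e_i| is cheap where |x_i| is large (masked
    by the signal) and expensive where |x_i| is small."""
    x = np.asarray(x, dtype=float)
    e = np.asarray(e, dtype=float)
    return float(np.sum(np.abs(e) / (np.abs(x) + eps)))
```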

Fig. 2

Comparison of the proposed method with the baselines on adversarial attacks against CSI. In each subgraph, the top panel is an overview of the adversarial waveform (the blue line represents the perturbed waveform and the red line represents the perturbations), while the bottom panel is a zoomed-in view of the crafted perturbations. It can be seen that our proposed method places perturbations according to the energy distribution of the waveform to improve imperceptibility

Differential evolution for black-box optimization

Differential evolution (DE) is an evolutionary algorithm for solving complex multi-modal optimization problems [37]. DE does not require any internal information about the system being optimized and is thus suitable for black-box optimization in adversarial attacks. Moreover, the objective function involved does not have to be differentiable or have an analytical form [38]. It therefore matches the goal of solving (5) in this paper.

The key step of DE lies in the population selection that keeps diversity [39]. Specifically, during each iteration another set of candidate solutions (children) is generated from the current population (parents). The children are then compared with their corresponding parents and survive only if they are fitter (i.e., achieve a better value of the objective function (5) in this paper) than their parents. In this way, by comparing only the fitness of each parent and its child, the goals of keeping diversity and improving the optimization are achieved simultaneously [40].
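A minimal DE loop for minimizing the black-box objective (5) might look like the sketch below. It encodes each candidate as the amplitude vector over a fixed set of perturbed sample indices, uses the standard DE/rand/1/bin mutation and crossover, and keeps a child only if its loss does not exceed its parent's. The population size, mutation factor F, crossover rate CR and iteration count are illustrative choices, not the settings used in the paper.

```python
import numpy as np

def de_attack(loss, n_dims, zeta, pop_size=50, F=0.5, CR=0.7,
              iters=200, rng=None):
    """Minimize loss(candidate) over candidates in [-zeta, zeta]^n_dims
    with a basic DE/rand/1/bin scheme; only loss queries are needed."""
    rng = rng or np.random.default_rng()
    pop = rng.uniform(-zeta, zeta, size=(pop_size, n_dims))
    fitness = np.array([loss(p) for p in pop])
    for _ in range(iters):
        for i in range(pop_size):
            # Pick three distinct individuals other than i.
            a, b, c = rng.choice([j for j in range(pop_size) if j != i],
                                 size=3, replace=False)
            mutant = np.clip(pop[a] + F * (pop[b] - pop[c]), -zeta, zeta)
            # Binomial crossover with at least one mutated dimension.
            mask = rng.random(n_dims) < CR
            mask[rng.integers(n_dims)] = True
            child = np.where(mask, mutant, pop[i])
            f_child = loss(child)
            # Greedy selection: the child replaces its parent only if fitter.
            if f_child <= fitness[i]:
                pop[i], fitness[i] = child, f_child
    best = int(np.argmin(fitness))
    return pop[best], fitness[best]
```

In an attack, `loss` would wrap the score queries to the victim (or substitute) system into Q0 + λQ1 for a fixed set of perturbed sample indices, so that no gradient or internal information of the system is required.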

Experimental settings

Data sets and speaker recognition models

Experimental evaluation of the proposed methods is conducted on data sets commonly used for speaker recognition, namely VoxCeleb1 [41], VoxCeleb2 [42] and LibriSpeech [43]. VoxCeleb1 is a large-scale text-independent speaker identification data set collected in the wild, extracted from videos uploaded to YouTube, and consists of 153,516 utterances from 1251 speakers. VoxCeleb2 contains over 1 million utterances from over 6000 celebrities, also extracted from YouTube videos. VoxCeleb1 and VoxCeleb2 provide a large diversity of speaker characteristics. LibriSpeech is derived from audiobooks that are part of the LibriVox project and contains 1000 h of speech sampled at 16 kHz.

In this paper, the data from VoxCeleb1 (153,516 utterances from 1251 speakers) and VoxCeleb2 (1,128,246 utterances from 6112 speakers) are utilized to train the GMM-UBM, i-vector and x-vector models for automatic speaker recognition, following the recipes provided in Kaldi [44]. This data set is referred to as the Train-1 Set (1,281,762 utterances from 7363 speakers in total).

The Imposter Set consists of 100 utterances of 20 speakers from the test-clean subset of LibriSpeech, with 5 utterances per speaker. The Test Set is configured in the same way but with 20 different speakers, each with 5 utterances, also from the test-clean subset of LibriSpeech. In both the Imposter Set and the Test Set, the gender ratio is 1:1, i.e., 10 males and 10 females. The speaker IDs involved are shown in Table 1, where “M” and “F” denote male and female, respectively.

Table 1 The distribution of Imposter and Test sets

The roles of the sets of Train-1, Imposter and Test will be presented in Table 2 and “Experimental design” section.

Table 2 Configuration of speakers and adversarial examples for the three tasks (intra-gender + inter-gender in the last column)

Experimental design

When conducting adversarial attacks, besides the three typical tasks presented in “The proposed methods” section, genders are also considered, where both intra-gender and inter-gender attacks are configured and evaluated. Therefore, there are six combinations in total.

To comprehensively evaluate the performance of adversarial attacks including our proposed one, three groups of experiments are designed for the following purposes.

(1) Matched attacks. In this group of experiments, we assume the attacker uses the same speaker recognition model as the victim. That is, the model f used to obtain an adversarial sample in (6)–(8) matches the model to be attacked. In the experiments, algorithms recently proposed in [13, 14] are taken as baselines for comparison. To increase the representativeness of the experiments, the source speakers for CSI and for ASV/OSI are chosen to be different: source speakers in CSI come from the Test Set and those in ASV/OSI come from the Imposter Set. As shown in Table 2, for CSI the source and target speakers are both from the Test Set; therefore, 1900 adversarial examples are crafted and evaluated in total for CSI, where \(P_{10}^{2} \times 5 \times 2 = 900\) examples are intra-gender and \(10 \times 10 \times 5 \times 2 = 1000\) are inter-gender, with \(P_{10}^{2}\) denoting the number of permutations. For OSI and ASV, the source and target speakers are taken from the Imposter Set and the Test Set, respectively. In either task, there are 2000 adversarial tests in total, half of them intra-gender and the rest inter-gender, as summarized in Table 2.

(2) Transferability. In this group of experiments, we assume the attacker does not know which speaker recognition model the victim is using, which means the model f in (6)–(8) is different from the model to be attacked. To evaluate the transferability of the adversarial examples crafted by our algorithm, we treat GMM–UBM, i-vector and x-vector trained on Train-1 (systems A, B and C, respectively) as the source systems f. Systems with different architectures, different training sets, or both are taken as the targeted victim systems. A portion of LibriSpeech (the train-clean-100 subset) is utilized to train the target victim GMM–UBM, i-vector and x-vector models; it is referred to as the Train-2 Set (28,539 utterances from 251 speakers). The detailed configurations are shown in Table 3, where I, II, VI, VII, XI and XII are attacks across different models, III, IX and XV are across training data, and IV, V, VIII, X, XIII and XIV are across both model and data.

Table 3 Experimental configuration for evaluating transferability

(3) Robustness. As shown in (6)–(8), there are three key hyperparameters in our proposed method: the maximum number of sample points to be modified d, the maximum amplitude of modification ζ, and the regularization parameter λ. The parameter d controls the amount of perturbation, ζ controls its maximum amplitude, and λ weights the regularization on the perturbation amplitude. It is therefore necessary to analyze the influence of these parameters and to provide suggestions on how to choose them.

Evaluation metrics

To give a comprehensive quantitative evaluation on adversarial audio samples, the following three metrics are introduced to measure efficacy, quality and imperceptibility, respectively.

  • Targeted attack success rate (TASR): TASR reflects the efficacy of adversarial examples in attacking speaker recognition systems. It is defined as the ratio between the number of successful attacks and the total number of attempts.

  • Signal-to-noise ratio (SNR): SNR is proportional to the log ratio between the power of the source voice x and the power of the perturbation e(x). A high SNR reflects a high quality of the adversarial sample, which is consistent with the imperceptibility motivated in this paper.

  • Perceptual evaluation of speech quality (PESQ): PESQ is a commonly used metric for evaluating speech enhancement. In this paper, given the assumption that x is a clean signal, \({\hat{\mathbf{x}}}\) should also have a high perceptual quality; otherwise, the difference between \({\hat{\mathbf{x}}}\) and \({\mathbf{x}}\) can be noticed and the adversarial attack would fail by simply listening to the sound.
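For the signal-level metrics, the sketch below shows one way the SNR could be computed from the source and adversarial waveforms; PESQ is only indicated via the third-party `pesq` package, which we assume is available and configured for 16 kHz audio (not necessarily the exact tooling used in the paper).

```python
import numpy as np

def snr_db(x, x_adv):
    """SNR in dB: 10 * log10(power of source / power of perturbation)."""
    x = np.asarray(x, dtype=float)
    e = np.asarray(x_adv, dtype=float) - x
    return 10.0 * np.log10(np.sum(x ** 2) / (np.sum(e ** 2) + 1e-12))

# PESQ (optional), via the open-source `pesq` package at 16 kHz:
# from pesq import pesq
# score = pesq(16000, x, x_adv, 'wb')  # wide-band mode
```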

Results and analysis

In this section, we report and analyze the results of the experiments designed in “Experimental design” section.

Evaluation of matched attacks

Parameter settings

The threshold \(\theta\) in ASV and OSI can be estimated by the approach proposed in [14]. The thresholds remain unchanged across tasks for the same system. As a trade-off between effectiveness and imperceptibility, we choose d = L/3, λ = 0.005 and ζ = 0.003 in the experiments unless otherwise specified, where L is the length of the utterance. The choices of λ and d are discussed in the “Evaluation of parameter sensitivity” section.

Performance on adversarial attacks

The results for GMM-UBM, i-vector and x-vector are shown in Tables 4, 5 and 6, respectively. A no-effort attack uses the unmodified source voice to attack a speaker recognition system. It can be seen that our proposed method is effective at deceiving speaker recognition systems, yielding success rates of over 69% on all attempts even for the most difficult task, OSI. In terms of SNR and PESQ, the average SNR is around 40 dB and the average PESQ is around 4.00, indicating high quality of the adversarial samples and thus high imperceptibility. Inter-gender attacks are slightly more difficult than intra-gender ones due to the difference between male and female voices, as expected.

Table 4 Performance of the proposed method for matched attacks on GMM-UBM with d = L/3, λ = 0.005 and ζ = 0.003
Table 5 Performance of the proposed method for matched attacks on i-vector with d = L/3, λ = 0.005 and ζ = 0.003
Table 6 Performance of the proposed method for matched attacks on x-vector with d = L/3, λ = 0.005 and ζ = 0.003

By comparing the values reported in Tables 4, 5 and 6, it is observed that x-vector is the most difficult system to attack, followed by i-vector and GMM-UBM. By comparing the rows of Tables 4, 5 and 6, attacking OSI is the most difficult task compared with attacking CSI and ASV. Nevertheless, the high average SNR and PESQ scores show that the adversarial samples crafted by our method are not easily perceived by human listening while maintaining high attack success rates (TASRs). These results demonstrate the effectiveness of our method.

We have also compared the proposed method with recently proposed baselines on the CSI task with x-vector as the victim speaker recognition model, i.e., the Fast Gradient Sign Method (FGSM) in [13], FakeBob in [14], and PGD-100, C&W \(l_{\infty }\) and C&W \(l_{2}\) in [13]. Unlike the white-box configuration of FGSM, PGD-100, C&W \(l_{\infty }\) and C&W \(l_{2}\) in the referred works, an additional x-vector model is trained as a substitute for the victim model to fulfill the black-box attack configuration in this paper. The experiments are thus black-box attacks with matched speaker recognition systems between attackers and victims. The results on TASR, SNR and PESQ are shown in Table 7. As seen from the table, our method greatly improves the quality of adversarial samples in terms of SNR and PESQ with a tiny sacrifice in TASR, which demonstrates the advantage of our method in imperceptibility.

Table 7 Our method w.r.t. recently proposed baselines on matched CSI using x-vector with d = L/3, λ = 0.005 and ζ = 0.003

Visualization

Adversarial samples for a successful attack and the corresponding perturbations are shown in Fig. 2. It is seen that the baseline methods tend to add uniform perturbations along the temporal axis, which could be noticed by listening carefully to the non-speech segments. In contrast, our method adds more perturbation to the data points with high energy and less to those with low energy, which reduces the chance of being noticed by human perception. For quantitative evaluation, the background noise in the adversarial examples generated by our method is suppressed, as measured by SNR and PESQ in Table 7.

Towards an intuitive interpretation of the adversarial process, t-SNE is used as the dimension reduction tool to visualize the i-vectors (Fig. 3) and x-vectors (Fig. 4) involved in attacking CSI. t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique that visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map [45]. In Figs. 3 and 4, source voices are denoted by circles in five different colors, with each color corresponding to one speaker. Adversarial examples are denoted by triangles in five different colors, where the color indicates the source speaker of the adversarial example. It can be seen from the figures that the targeted adversarial attack pulls a source voice into the cluster of the target speaker. In addition, the x-vector clusters in Fig. 4 have larger margins than the i-vector clusters in Fig. 3, which explains why the TASRs drop when moving from attacking i-vector (Table 5) to attacking x-vector (Table 6).
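The embedding visualization can be reproduced with scikit-learn's t-SNE. A minimal sketch follows, assuming `embeddings` is an array of i-vectors or x-vectors and `labels` gives the (integer) source-speaker index of each point; the plotting details are our own choices, not those used for Figs. 3 and 4.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, labels, is_adversarial):
    """Project speaker embeddings to 2-D and plot source utterances as
    circles and adversarial examples as triangles, colored by speaker."""
    points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    labels = np.asarray(labels)
    is_adversarial = np.asarray(is_adversarial)
    for adv, marker in [(False, "o"), (True, "^")]:
        sel = is_adversarial == adv
        plt.scatter(points[sel, 0], points[sel, 1],
                    c=labels[sel], marker=marker, cmap="tab10")
    plt.show()
```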

Fig. 3

A t-SNE figure to visualize i-vectors of source utterances and adversarial examples. Source voices are denoted by circles with five different colors, with each color for one speaker. Adversarial examples are denoted by triangles with five different colors, where the color indicates the source speaker of an adversarial example (better viewed in color)

Fig. 4

A t-SNE figure to visualize x-vectors of source utterances and adversarial examples. Source voices are denoted by circles with five different colors, with each color for one speaker. Adversarial examples are denoted by triangles with five different colors, where the color indicates the source speaker of an adversarial example (better viewed in color)

Evaluation of transferability

Transferability cross model or data

The transferability of our method is evaluated in Table 8. Although degradation in performance is observed in all the cross-model (I, II, VI, VII, XI and XII in Table 8), cross-data (III, IX and XV in Table 8) and cross-model-and-data (IV, V, VIII, X, XIII and XIV in Table 8) attacks, a TASR of over 59% can still be obtained with relatively high SNRs and PESQs.

Table 8 Evaluation of transferability for targeted attacks on CSI task with d = L/3, λ = 0.005 and ζ = 0.003

The degradation is consistent with the results reported in other domains of adversarial attacks, where the difference between the substitute model and the target model has a negative impact on black-box attacks, even when the target model uses the same classifier as the source one but is trained on different data. As seen from Table 8, the adversarial samples crafted on the x-vector system have stronger transferability than those crafted on GMM-UBM and i-vector, which suggests choosing a substitute model with high speaker recognition performance when generating adversarial examples. It is also observed that matching the model is much more important than matching the training data for conducting successful attacks. From an attacker's view, one should learn as much knowledge as possible about the target victim model. From a defender's view, any information about the speaker recognition model should be protected carefully, including the choice of model, its configuration, parameters, and training data.

Transfer attack to the commercial system Microsoft Azure

Microsoft Azure is a cloud computing platform operated by Microsoft. Its main goal is to provide a platform for developers to build applications that run on cloud servers, data centers, the Web, and PCs [46]. It supports both the ASV and OSI tasks via an online API (Application Programming Interface). Since the ASV task on Microsoft Azure is text-dependent, we only test attacks on OSI, where 20 speakers from the Test Set are enrolled in the OSI system of Microsoft Azure. The baseline performance of this OSI system is tested with 50 original voices from the Imposter Set, and the TASR is 0%. We then attack Microsoft Azure using 50 adversarial voices crafted with the GMM-UBM1, i-vector1 and x-vector1 systems in Table 3 as the source systems; the TASRs are 42%, 24% and 30%, respectively. The highest TASR is obtained from GMM-UBM, which may indicate that the speaker recognition API of Microsoft Azure shares many similarities with GMM-UBM.

Evaluation of parameter sensitivity

In this section, we discuss the selection of the parameters that control the strength and length of adversarial perturbations and verify the validity of the DE algorithm.

Results of changing parameters λ, d and ζ

Figures 5, 6 and 7 show the parameter sensitivity with respect to λ, d and ζ on CSI with x-vector as the speaker recognition model. The default values d = L/3, λ = 0.005 and ζ = 0.003 are taken for these experiments, and one variable is changed while the other two are fixed.

Fig. 5

Parameter sensitivity with respect to λ on CSI with x-vector as the speaker recognition model. The vertical line indicates the performance with λ = 0.005. d and ζ are fixed to L/3 and 0.003, respectively

Fig. 6

Parameter sensitivity with respect to d on CSI with x-vector as the speaker recognition model. The vertical line indicates the performance with d = L/3. λ and ζ are fixed to 0.005 and 0.003, respectively

Fig. 7

Parameter sensitivity with respect to ζ on CSI with x-vector as the speaker recognition model. The vertical line indicates the performance with ζ = 0.003. λ and d are fixed to 0.005 and L/3, respectively

A higher λ and a lower d or ζ tend to yield adversarial samples of higher quality (i.e., higher SNR or PESQ) at some sacrifice of attack efficacy (i.e., lower TASR). We find that λ = 0.005, d = L/3 and ζ = 0.003 strike a relatively good balance between efficacy and quality.

The role of the regularization term

Table 9 shows the results with and without the regularization term in (9). The regularization term optimizes the locations at which perturbations are added, which is important for reducing the audibility of the perturbation noise and hence improving the quality of adversarial samples. When this term is removed, the quality of the generated adversarial examples drops significantly in SNR and PESQ, with a slight increase in TASR, as shown in Table 9. These results demonstrate the effectiveness of the regularization term in generating imperceptible adversarial samples, as motivated in the Introduction.

Table 9 The roles of regularization when conducting matched attacks on x-vector (with/without regularization)

Replacing black-box attacks by white-box ones

Table 10 shows the performance when the differential evolution algorithm is replaced by the white-box versions of FGSM, PGD and C&W. Both the attacking model and the victim model are x-vector, where the TDNN is used to compute gradients. Comparing the numbers in Table 10 with those in Table 7, the TASR values of the white-box attacks are greatly improved with respect to those of the black-box attacks. However, in real-world settings, attackers can seldom know the configuration of the victim system, i.e., the attacks are in practice black-box ones.

Table 10 Results of white-box attacks
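As a point of reference for this white-box comparison, a single FGSM step on a differentiable scoring model could be sketched as follows. This is a generic PyTorch illustration under the assumption that `model` maps a waveform tensor to per-speaker scores; it is not the exact setup used to produce Table 10.

```python
import torch

def fgsm_step(model, x, tar, epsilon):
    """One FGSM step: move the waveform in the direction that increases
    the target speaker's score, with step size epsilon."""
    x = x.clone().detach().requires_grad_(True)
    scores = model(x)            # shape: (num_enrolled_speakers,)
    loss = -scores[tar]          # minimizing this maximizes the target score
    loss.backward()
    return (x - epsilon * x.grad.sign()).detach()
```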

Replacing waveform-level perturbations by feature-level ones

Since the interfaces of most ASV systems accept speech waveforms, a substitute for waveform-level perturbation is to compute feature-level perturbations first and then transform the features back into waveforms. The results reported in Table 11 are obtained by computing perturbations on MFCC [8] and LPMS [9] and transforming the perturbed MFCCs and LPMSs back to waveforms (by copying the phase information of the original voice). The victim model is x-vector. Deteriorated performance is observed compared to our proposed waveform-level method. It is worth noting that the regularization term in Eq. (9) is defined at the waveform level, so it is omitted in the feature-level experiments. One may argue that additional regularization terms could be devised for these features, but that is outside the scope of this paper.

Table 11 The performance of the attacks using different speaker features on x-vector
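The LPMS round trip described above (perturb the log power magnitude spectrum, then resynthesize a waveform by reusing the original phase) could be sketched as below. This is our own illustration with librosa; the actual MFCC/LPMS attack pipelines of [8, 9] are not reproduced here, and `delta_lpms` is assumed to have the same shape as the spectrogram.

```python
import numpy as np
import librosa

def lpms_perturbation_to_waveform(y, delta_lpms, n_fft=512, hop=128):
    """Apply an additive perturbation to the log power magnitude spectrum
    and resynthesize a waveform using the phase of the original voice."""
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    magnitude, phase = np.abs(spec), np.angle(spec)
    lpms = np.log(magnitude ** 2 + 1e-10)
    perturbed_mag = np.sqrt(np.exp(lpms + delta_lpms))
    spec_adv = perturbed_mag * np.exp(1j * phase)
    return librosa.istft(spec_adv, hop_length=hop, length=len(y))
```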

The extraction of MFCC and LPMS loses some information in the transforms, so feature-level perturbations can be destroyed when passed through the transform, as also analyzed in [47]. The abbreviations used in this paper are listed in Table 12.

Table 12 Abbreviations in the paper

Conclusions

This paper proposed a new method to conduct imperceptible, black-box, waveform-level targeted adversarial attacks against speaker recognition systems. Inspired by auditory masking, imperceptibility was promoted by modifying only a selected subset of sample points under a constraint on the modification amplitudes. Extensive experiments were conducted on three typical tasks and state-of-the-art speaker recognition models, and the results showed the effectiveness of the proposed method. The transferability and robustness of the method were also evaluated and analyzed, which sheds light for both attackers and defenders on designing secure speaker recognition systems.

Future work may explore adaptive attack methods to improve the efficiency of the adversarial samples [48]. In addition, the adversarial attacks currently take several minutes to conduct, which is infeasible for real-time implementations. Therefore, future attacks will need to be computationally efficient while retaining their robustness [49].