1 Introduction

Deep neural networks (DNNs) have found widespread applications across various domains of artificial intelligence, including image classification (Du and Pun 2020), object tracking (Zhou and Pun 2020), and automatic speech recognition (ASR) (Khan et al. 2023). Despite their remarkable achievements, DNNs exhibit a surprising vulnerability to adversarial attacks. These attacks have proven to be highly effective in deceiving DNN models by introducing imperceptible adversarial noise (Aldahdooh et al. 2022; Bécue et al. 2021).

In recent years, significant advancements in adversarial attacks have been observed in the field of computer vision, spanning image recognition (Zhang et al. 2022), object detection (Zhang and Wang 2019), and video recognition (Lo and Patel 2020). Numerous approaches have been proposed and applied in practical settings. However, adversarial audio attacks, particularly robust adaptive attacks, have received comparatively little attention; research in this domain remains underexplored, and the corresponding defensive strategies are even scarcer. Figure 1 illustrates a typical adversarial audio attack: the innocuous phrase on the left, “Life was like a box of chocolate,” is effortlessly manipulated to convey a malicious message, “This audio has been attacked.” This scenario vividly underscores the vulnerability of DNN models to malicious manipulation, which in turn leads to erratic behavior in ASR systems. Consequently, there is a pressing need for an effective and resilient defense against adversarial audio attacks. This paper is dedicated to creating a robust and adaptive detection framework designed to counter state-of-the-art adversarial audio attacks.

Fig. 1 Typical adversarial audio example generation framework for fooling ASR systems

Traditional methods for countering adversarial audio attacks have primarily focused on minimizing adversarial noise in input audio, using techniques such as signal quantization, down-sampling, and local smoothing (Yang et al. 2018). While these offer some degree of adversarial example detection, their effectiveness has been unsatisfactory, because crafted adversarial examples behave as features of the model rather than as isolated vulnerabilities (Ilyas et al. 2019). At the same time, adversarial attacks have long grappled with an overfitting problem, which exposes intrinsic weaknesses of adversarial examples that a defense can exploit.

For adversarial defense, Yang et al. (2018) introduced a simple yet effective strategy by incorporating a randomization layer into classifiers before processing images, thus mitigating adversarial examples. Kwon et al. (2019) discovered that adversarial noise is susceptible to additional distortion: even a small amount can reduce the success rate of adversarial attacks from 100% to 6.21%. This finding underscores the overfitting issue associated with adversarial audio attacks. Rajaratnam and Kalita (2018) devised a defense based on audio pre-processing, introducing stochastic perturbations that flood specific frequency bands of an input signal, effectively countering adversarial audio examples. This strategy has been extended into a robust defense method by combining audio compression, band-pass filtering, audio panning, and speech coding to address more complex scenarios. Yang et al. (2019) argued that adversarial audio examples are sensitive to temporal dependencies: when an input audio signal is split into two fragments and transcribed separately, the defense system often produces significantly different transcription results for the divided inputs if the input is adversarial. Wu et al. (2023) present a defense method named AudioPure, which protects acoustic systems from adversarial attacks using diffusion models. This method harnesses the generative capabilities of diffusion models to add noise to and then purify adversarial audio, thus restoring clean audio. Experiments show that this method surpasses other baseline methods in certified robustness.

Additionally, recent work has highlighted the potential of multi-model detection in mitigating attacks with weak transferability. Conventional adversarial attacks are generally tailored to specific classifiers, resulting in weak transferability and reduced overall robustness. In the IJCAI-19 Alibaba Adversarial AI Challenge, the organizers notably enhanced the defense efficacy through the implementation of a multi-model detection framework. Experimental findings showcased the robustness of this defense approach against attack methods with limited transferability. Notably, ASRs vary widely in structure, such as DeepSpeech (Hannun et al. 2014), which is an end-to-end ASR system with four RNN layers, and Google Cloud Speech (Sak et al. 2014), which is based on an LSTM-based RNN architecture. This diversity in ASR architectures exacerbates the transferability challenge in adversarial audio attacks. As of now, no transferable audio attack has been successfully designed.

Unfortunately, despite their success in countering certain types of audio attacks, current defense mechanisms still struggle with intricate adaptive adversarial attacks. Designers must carefully fine-tune input signal preprocessing to ensure low false-positive rates (FPR) and low false-negative rates (FNR). Nonetheless, these fine-tuning-based tactics are not well suited to robust attacks such as black-box attacks (Taori et al. 2019) or physical attacks (Yakura and Sakuma 2019). Furthermore, existing simple pre-processing defenses fail to account for the threat posed by adaptive attacks: they cannot effectively counter carefully crafted adversarial examples when the attacker possesses knowledge of the detection details. As a result, two crucial questions arise.

  • Q1. Is it viable to develop a defense mechanism capable of significantly altering input audio while maintaining high recognition accuracy?

  • Q2. How can we leverage the vulnerabilities of adversarial attacks to devise an effective approach for addressing the formidable challenge posed by adaptive attacks?

We should note that a shorter conference version of this paper appeared in Du et al. (2020). In our prior research, we introduced a unified defense framework that incorporated noise padding into adaptive sound reverberation to mitigate the risk posed by adversarial audio samples. The experimental results demonstrated the efficacy of our adaptive detection framework against representative and even adaptive adversarial attacks. However, we have identified a critical limitation in our previous defense mechanisms; they can fail entirely when confronted with robust adaptive attacks, significantly compromising our defense performance. Consequently, the primary objective of this paper is to enhance our defense framework to address this issue more effectively.

In our prior work, we discovered two key insights:

  • Room impulse response (RIR) convolution can effectively disrupt adversarial noise and dismantle its transferability while preserving benign samples. In contrast to traditional fine-tuning approaches, our defense framework utilizes an adaptive acoustic room simulator to generate a wide range of synthetic utterances. This adaptive approach allows our framework to dynamically adjust its defense strength, thwarting attacks while maintaining a low false-positive rate (FPR).

  • Leveraging temporal dependency (Yang et al. 2019) to disrupt the continuity of adversarial perturbations is a simple yet effective operation when countering audio attacks. We introduced a multi-fragment noise padding mechanism to break the continuity of adversarial noise. Instead of roughly dividing the input audio into segments, our core concept involves using a Voice Activity Detector (VAD) to pinpoint the audio signal’s cut points, ensuring both low FPR and low FNR.

In this paper, we combine and develop these two prior designs to construct a unified pre-processing mechanism, forming the first layer of our improved defense framework. For the second layer, we employ transferability as a criterion. All pre-processed audio passes through an adaptive ASR transcribing layer to determine if an attack occurred. Furthermore, we develop an optimal unified detection architecture capable of handling the most challenging adaptive attacks by incorporating the aforementioned three defense mechanisms. This mechanism boasts high adaptability and randomness, compelling attackers to consider additional defense details during their design process, thus increasing their computational cost.

Our defense framework conducts the following operations: Initially, for a given input audio, we employ an adaptive artificial utterance generation strategy to mitigate the overfitting problem associated with adversarial examples. To disrupt the temporal dependencies of adversarial perturbations, we use multiple RIR-convolution operators to craft various reverberation audios. Simultaneously, we adaptively calculate the complexity based on the input audio’s Sound Pressure Level (SPL). Next, we introduce a multi-fragment-based noise padding mechanism to break the continuity of adversarial attacks. To achieve this, we incorporate Gaussian noise into each silent segment of the artificial utterance generated during the RIR-convolution step. Consequently, our defense mechanism adapts to disrupt the continuity of adversarial attacks, all the while ensuring that the padding noise doesn’t compromise detection performance. After these initial pre-processing steps, the input’s continuity is disrupted if it’s adversarial. Subsequently, all pre-processed audio undergoes stochastic processing by multiple ASRs to yield transcription results. We then calculate a similarity score between these results to determine the input’s adversarial nature. Figure 2 outlines the flowchart of our adaptive defense framework. In summary, this paper’s primary contributions are as follows:

  • We introduce a unified pre-processing framework designed to address robust and intricate audio adversarial attacks. As far as we know, existing adversarial detection methods are insufficient in effectively countering robust adaptive attacks in ASR. Our framework significantly bolsters the defense complexity, incorporating a plethora of defense details to amplify the computational burden faced by adaptive attacks.

  • Building upon our previous framework, we have integrated an adaptive ASRs transcribing mechanism to fortify the resilience of our defense, particularly in the face of adaptive attacks.

  • We propose a novel SPL-based audio intensity detection formula, which significantly increases the defense complexity when dealing with robust adaptive attacks. To evaluate the efficacy of our defense framework against these attacks, we test it against a range of challenging adversarial audio attacks, including single-word attacks, the Multi-Sample Ensemble Method (MSEM), and Weighted Objective-based Ensemble Model (WO-EM) attacks.

2 Background

2.1 Representative adversarial attacks

Let (x, y) represent a pair comprising an input audio sample x and its corresponding correct transcription y. In the context of adversarial attacks, these attacks can be categorized into two main types based on their objectives: targeted and non-targeted attacks. In targeted attacks, the goal is to find a small perturbation vector \(\delta\) that can be added to the input signal in order to mislead the ASR system. Specifically, the objective is to make the ASR system output a specified target transcription \(y^*\), where \(y \ne y^*\). This type of attack is particularly challenging in the audio domain due to the inherent complexity of audio signals and the transformations they undergo before being processed by ASR systems.

Audio attacks can be broadly classified into two categories:

White-box attacks: These attacks have access to all the details of the target ASR system, including its internal architecture and parameters. Carlini and Wagner (2018), for example, proposed an end-to-end targeted audio adversarial attack on ASR models, achieving a 100% success rate with slight but effective adversarial noise. In white-box attacks, the adversarial perturbation \(\delta\) is calculated based on the input audio x and the target transcription \(y^*\). The objective function includes a trade-off parameter \(\lambda\) and the Connectionist Temporal Classification (CTC) loss:

$$\begin{aligned} \underset{\delta }{\arg \min }\ \quad \lambda \Vert \delta \Vert _{p}+c \cdot {\text {CTC}}\left( f_{\theta }(x+\delta ), y^{*}\right) \end{aligned}$$
(1)

where c serves as a balancing factor, determining the trade-off between achieving adversarial characteristics and maintaining proximity to the original audio. Meanwhile, \(f_\theta (\cdot )\) denotes the recognition results produced by the ASR model, and the term \(\Vert \delta \Vert _{p}\) denotes the imposed \(L_p\) norm constraint on the perturbation vector. Carlini & Wagner also use a max-norm constraint to speed up adversarial audio generation, adding adversarial noise uniformly throughout the input signal.
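
For concreteness, the sketch below shows one optimization step on this objective. It is a minimal illustration, not the authors' implementation: `asr_logits`, `ctc_loss`, and `ctc_grad` are hypothetical stand-ins for the victim ASR's forward pass, its CTC loss, and the gradient of that loss with respect to the input (obtained via automatic differentiation in practice), and the constants are illustrative. The \(L_\infty\) norm is used here to mirror the max-norm constraint mentioned above.

```python
import numpy as np

def cw_attack_step(x, delta, target, asr_logits, ctc_loss, ctc_grad,
                   lam=0.05, c=1.0, lr=1e-3, max_abs=0.05):
    """One gradient step on  lam*||delta||_inf + c*CTC(f(x+delta), y*)  (Eq. 1 sketch)."""
    adv = x + delta
    loss = lam * np.max(np.abs(delta)) + c * ctc_loss(asr_logits(adv), target)
    # Descend on the CTC term; the norm term is enforced by clipping below,
    # mirroring the max-norm constraint used to speed up generation.
    delta = delta - lr * c * ctc_grad(adv, target)
    delta = np.clip(delta, -max_abs, max_abs)
    return delta, loss
```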

Black-box attacks: In contrast, black-box attacks do not require access to the internal details of the target model but focus solely on obtaining real-time output results. Taori et al. (2019) introduced a Momentum-based Mutation black-box audio attack that adaptively adjusts mutation probabilities during optimization. The objective function in this case aims to optimize a small yet effective black-box perturbation to the input audio, combining genetic algorithms and gradient estimation:

$$\begin{aligned} p_{new}=\chi \times p_{old}+\frac{\phi }{| \text {currScore}-\text {prevScore} |} \end{aligned}$$
(2)

where the scores denote the CTC loss at the current and previous iterations, and \(\chi\) and \(\phi\) are used to dynamically update the mutation probability from \(p_{old}\) to \(p_{new}\) based on the progress of optimization. These attack strategies highlight the challenges and complexities involved in crafting adversarial audio examples, whether with or without knowledge of the target ASR system’s internal details.
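
The update in Eq. (2) is simple arithmetic; a minimal sketch follows. The values of `chi` and `phi` are illustrative (the original attack tunes them empirically), and the small `eps` is our own guard against division by zero when the CTC score stalls between iterations.

```python
def update_mutation_probability(p_old, curr_score, prev_score,
                                chi=0.9, phi=0.01, eps=1e-8):
    """Momentum-style mutation probability update from Eq. (2)."""
    return chi * p_old + phi / (abs(curr_score - prev_score) + eps)
```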

Fig. 2 Flowchart illustrating the adaptive defense framework proposed for addressing adversarial audio examples

3 Proposed method

This paper presents an efficient detector designed to mitigate various adversarial attacks, encompassing white-box and black-box attacks as well as intricate robust adaptive attacks. To the best of our knowledge, there is presently no robust defense model available for effectively handling robust adaptive adversarial attacks (Qin et al. 2019; Yakura and Sakuma 2019). We have identified the weaknesses of existing adversarial attacks and devised corresponding defense mechanisms:

  • Overfitting: Existing adversarial attacks are often plagued by overfitting issues. To mitigate this, we have developed an adaptive acoustic room simulator that generates diverse artificial utterances. This adaptive defense framework can dynamically adjust its defense strength to maintain a low False Positive Rate (FPR).

  • Continuity: Adversarial attacks typically require the crafted adversarial noise to be consistently applied throughout the entire input audio. To disrupt this continuity, we have designed a Multi-Fragment Noise Padding method, ensuring that adversarial examples are interrupted while maintaining stable performance on benign audio.

  • Transferability: Achieving cross-model attacks in audio adversarial attacks remains a challenge. Therefore, we propose an adaptive ASRs transcribing strategy. We argue that using multiple ASRs can help recover the final transcript results to the greatest extent possible, even if the transferability of the pre-processed audio has been destroyed.

  • Computation Cost: Attackers must consider all defense strategies when designing adaptive attacks. However, as more details are taken into account, the computational burden of the adaptive design increases exponentially. Our framework, a unified mechanism that combines three defense schemes, is inherently complex and adaptive. This complexity poses a significant challenge to adaptive attacks, often causing them to fail to converge.

Figure 2 presents an overview of our proposed framework, highlighting the adaptive detection mechanism. Our design comprises two layers: The first layer incorporates a unified pre-processing mechanism, which includes adaptive artificial utterance generation and multi-noise padding operations. The second layer features an adaptive ASRs transcribing strategy. More details on these layers will be elaborated upon in the following subsections.
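
Before detailing each layer, the following sketch summarizes the control flow of Fig. 2. It is an illustration only: every helper named here (defense_complexity, add_reverberation, multi_fragment_noise_padding, vad, ASR objects with a .transcribe() method, and the similarity metric) is a hypothetical stand-in for a component described in the subsections below.

```python
import random

def detect_adversarial_audio(x, rir_bank, target_asr, aux_asrs, threshold,
                             defense_complexity, add_reverberation,
                             multi_fragment_noise_padding, vad, similarity):
    # Layer 1: unified pre-processing (adaptive RIR convolution + noise padding).
    n_r = defense_complexity(x)                           # SPL-based complexity
    rirs = random.sample(rir_bank, min(n_r, len(rir_bank)))  # adaptive, randomized RIR choice
    processed = []
    for r in rirs:
        y = add_reverberation(x, r)                       # artificial utterance generation
        y = multi_fragment_noise_padding(y, vad(y))       # break perturbation continuity
        processed.append(y)

    # Layer 2: adaptive ASRs transcribing and similarity scoring.
    reference = target_asr.transcribe(x)                  # target ASR sees the raw input
    scores = [similarity(reference, aux_asrs[i % len(aux_asrs)].transcribe(y))
              for i, y in enumerate(processed)]
    return sum(scores) / len(scores) < threshold          # adversarial if average < T
```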

3.1 Unified pre-processing mechanism

SPL-based Complexity Analysis: In our earlier research, we introduced the VAD for two key objectives: (1) to assess the perturbation level, assisting our framework in automatic defense complexity adjustment, and (2) to pinpoint optimal locations for the multi-noise padding operation. However, VAD may not detect silent segments in the input audio when addressing adaptive attacks, particularly single-word attacks, which could potentially limit the effectiveness of certain aspects of our defense mechanism. Therefore, in this paper, we confine the use of VAD exclusively to the multi-noise padding aspect of our design.

Fig. 3 The left part, from top to bottom: the benign input signal; the audio processed using the random noise defense mechanism (Rajaratnam and Kalita 2018); the audio processed using our RIRs-convolution mechanism. The right part shows their corresponding spectra

To maintain a consistent complexity level, we assess voice intensity using the real-world metric known as Sound Pressure Level (SPL). SPL quantifies sound pressure and is measured in decibels (dB). It can be calculated as follows:

$$\begin{aligned} \textrm{SPL}(\textbf{x})=20 \log _{10} P(\textbf{x}) \end{aligned}$$
(3)

where P(x) represents the power of the signal of length N, computed as:

$$\begin{aligned} P(\textbf{x})=\sqrt{\frac{1}{N} \sum _{n=1}^{N} x_{n}^{2}} \end{aligned}$$
(4)

where \(x_n\) represents the n-th component of the array x. The defense complexity of our model can be calculated as:

$$\begin{aligned} \text{ complexity }=N_R =\log _{2}\left( \frac{\alpha +S P L(x)}{\beta }\right) \end{aligned}$$
(5)

In this equation, parameters \(\alpha\) and \(\beta\) are introduced to balance the importance of defense complexity and efficiency. For this paper, we set \(\alpha =50\) and \(\beta =2\). \(N_R\) represents the number of convoluted RIRs. A higher number of RIRs (greater \(N_R\)) contributes to more stable defense performance, but it also incurs higher computational costs. In Sect. 5, we will delve into the relationship between the number of selected RIRs and the defense rate in an ablation study. In our design, we set \(complexity \in [3,\infty )\) to ensure that a minimum of 3 RIRs participate in the defense process, aligning with the number of auxiliary ASR systems used.
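
A small sketch of these two quantities is given below, assuming the input x is a mono waveform stored as a NumPy array. The epsilon inside the logarithm and the guard for near-silent inputs are our own safeguards, not part of Eqs. (3)–(5).

```python
import numpy as np

def sound_pressure_level(x):
    """SPL in dB from Eqs. (3)-(4): 20*log10 of the RMS of the waveform x."""
    p = np.sqrt(np.mean(np.square(np.asarray(x, dtype=np.float64))))
    return 20.0 * np.log10(p + 1e-12)          # epsilon guards against log(0)

def defense_complexity(x, alpha=50.0, beta=2.0, min_rirs=3):
    """Number of RIRs N_R from Eq. (5), floored at min_rirs as in our design."""
    ratio = max((alpha + sound_pressure_level(x)) / beta, 1.0)  # guard for near-silent inputs
    return max(min_rirs, int(round(np.log2(ratio))))
```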

Adaptive Artificial Utterances Generation: After determining the complexity level, \(N_R\) RIRs are randomly selected from a database (Nakamura et al. 2000; Kinoshita et al. 2013; Jeub et al. 2009). RIR-convolution is widely employed to replicate reverberation in varying environments. For the attacked recognition model, data augmentation (Perez and Wang 2017) is an effective operation to enhance model accuracy and robustness. In the RIR-convolution mechanism, each selected RIR r is applied by an acoustic room simulator to create an artificial utterance under a different room configuration. Given a benign sample x, we obtain a speech signal with reverberation as follows:

$$\begin{aligned} \mathscr {F}\{x * r\}=\mathscr {F}\{x\} \cdot \mathscr {F}\{r\} \end{aligned}$$
(6)

where \(*\) represents the convolution operation.
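
A minimal sketch of this operation, assuming x and rir are one-dimensional NumPy arrays at the same sampling rate. fftconvolve computes the time-domain convolution via the FFT, i.e. the frequency-domain product in Eq. (6); the truncation to len(x) and the peak renormalization are our own additions, used to keep the reverberated signal aligned with, and in the dynamic range of, the original utterance.

```python
import numpy as np
from scipy.signal import fftconvolve

def add_reverberation(x, rir):
    """Create an artificial utterance by convolving x with a room impulse response (Eq. 6)."""
    y = fftconvolve(x, rir, mode="full")[: len(x)]
    peak = np.max(np.abs(y)) + 1e-12
    return y / peak * np.max(np.abs(x))
```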

Figure 3 illustrates three types of audio waveforms and their corresponding spectra: the benign input signal, audio processed by a random noise defense mechanism (Rajaratnam and Kalita 2018), and audio processed by our RIRs-convolution mechanism. Our RIR-convolution approach exhibits a more significant modification of the input audio compared to other noise padding mechanisms. Additionally, due to the victim classifier’s data augmentation training, the recognition results of pre-processed benign samples remain unaltered.

Multi-fragment noise padding: This mechanism is designed to disrupt the continuity of the perturbation noise. Unlike other defenses (Yang et al. 2019) that directly transcribe split speech into text, our approach involves VAD to identify the most suitable location for inserting noise, as in our previous design:

$$\begin{aligned} \begin{array}{l}H_{0}: \quad x=n \\ H_{1}: \quad x=n+s\end{array} \end{aligned}$$
(7)

where \(H_0\) and \(H_1\) represent silence and speech in the detected audio signal, and n and s denote the noise and speech signals, respectively. In this study, we assess six prominent VAD frameworks, namely Short-Time Energy (STE), Short-Time Zero-Crossings Rate (STZCR), STE-ZCR (Haigh and Mason 1993), End-to-End-VAD (Ariav and Cohen 2019), Bowon Lee’s VAD (Lee and Hasegawa-Johnson 2007), and WebRTC VAD. Following experimental analysis, we will select the most suitable VAD for our framework.
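
As a toy illustration of the H0/H1 decision in Eq. (7), the sketch below classifies frames by short-time energy and returns the silent segments used in the next step. The frame length, hop size, and energy threshold are illustrative values, not the STE-ZCR configuration evaluated in Table 2; the function assumes a NumPy waveform longer than one frame.

```python
import numpy as np

def ste_vad(x, frame_len=400, hop=160, energy_ratio=0.05):
    """Toy short-time-energy VAD: returns (start, end) sample ranges of silence (H0)."""
    x = np.asarray(x, dtype=np.float64)
    energies = np.array([np.sum(x[i:i + frame_len] ** 2)
                         for i in range(0, len(x) - frame_len, hop)])
    silent = energies < energy_ratio * np.max(energies)   # frame-level H0 decision
    segments, start = [], None
    for i, is_silent in enumerate(silent):
        if is_silent and start is None:
            start = i * hop
        elif not is_silent and start is not None:
            segments.append((start, i * hop + frame_len))
            start = None
    if start is not None:
        segments.append((start, len(x)))
    return segments
```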

For each of the transformed \(N_R\) audio samples, VAD is applied to detect non-active (silent) segments \(H_0 = \left\{ H_{01}, H_{02},...,H_{0\,s}\right\}\) and determine suitable splitting points for inserting noise. In our design, each silent segment becomes a cutting center. The transformed \(N_R\) audio is then partitioned into multiple speech segments. Short, small Gaussian noise is introduced between each split to break the continuity of the adversarial noise. Consequently, we acquire \(N_R\) speeches that have been subjected to noise insertion. It is noteworthy that this procedure exclusively introduces minor random noises into the silent segments of the speech, mitigating the risk of adversarial examples while preserving benign sample performance.
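
A minimal sketch of the padding step follows. It assumes `silent_segments` is the list of (start, end) sample indices returned by the chosen VAD (STE-ZCR in our design); `noise_len` and `noise_std` are illustrative values for the length and magnitude of the padded Gaussian noise.

```python
import numpy as np

def multi_fragment_noise_padding(x, silent_segments, noise_len=800,
                                 noise_std=0.002, seed=None):
    """Insert short Gaussian noise bursts at the centres of VAD-detected silent segments."""
    x = np.asarray(x, dtype=np.float64)
    rng = np.random.default_rng(seed)
    pieces, prev = [], 0
    for start, end in silent_segments:
        centre = (start + end) // 2                         # each silent segment is a cutting centre
        pieces.append(x[prev:centre])
        pieces.append(rng.normal(0.0, noise_std, noise_len))  # short noise burst between fragments
        prev = centre
    pieces.append(x[prev:])
    return np.concatenate(pieces)
```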

3.2 Audios allotting and adaptive ASRs transcribing

In image attacks, most designs are limited in their ability to maintain their adversarial effects on specific recognition models due to the lack of transferability. An interesting countermeasure emerged in the IJCAI-19 Alibaba Adversarial AI Challenge, where organizers employed multiple classifiers to simultaneously recognize all attacked adversarial examples. This approach effectively mitigated image attacks with insufficient transferability. Drawing inspiration from this, we propose a simple yet effective adaptive ASRs transcribing mechanism. This approach employs diverse ASRs to transcribe the input audio and assesses the similarity between their transcriptions for adversarial example detection. Furthermore, our integrated framework encompasses a sophisticated pre-processing architecture meticulously engineered to disrupt the continuity and transferability of input adversarial examples before undergoing adaptive ASRs transcription.

Our adaptive ASR systems consist of two components: the first part is the target ASR system, responsible for transcribing the input audio, while the second part is the auxiliary ASR system, tasked with transcribing the pre-processed audio. This implies that both the original audio and the pre-processed \(N_R\) audios mentioned in Sect. 3.1 are provided to both the target ASR system and the auxiliary ASR systems, resulting in a total of \(N_R+1\) transcribed text sentences. To establish an adaptive transcription mechanism, we have developed a complexity function for adjusting the number of allocated audios as follows:

$$\begin{aligned} N_{D}=m \cdot \text{ complexity } = m \cdot N_R \end{aligned}$$
(8)

where \(m \in [1,k]\) is a coefficient that adjusts the number of audios allocated to the k auxiliary ASR systems (in our design, \(k=3\)). For example, when \(m=1\) and \(complexity=12\), each auxiliary ASR system is responsible for transcribing four pre-processed audios. Conversely, when \(m=k\), all pre-processed audios are assigned to every auxiliary ASR system.
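
The sketch below shows one possible allotment policy consistent with Eq. (8): each pre-processed audio is sent to m distinct auxiliary ASRs in round-robin fashion, so m·N_R transcriptions are produced in total. The round-robin rule and the integer-valued m are our own illustrative choices; Eq. (8) only fixes the total N_D.

```python
def allot_audios(processed_audios, m=1, k=3):
    """Distribute pre-processed audios over k auxiliary ASRs (Eq. 8 sketch)."""
    batches = [[] for _ in range(k)]
    for i, audio in enumerate(processed_audios):
        for j in range(m):                      # m distinct auxiliary ASRs per audio
            batches[(i + j) % k].append(audio)
    return batches
```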

Similarity score analysis: After the adaptive ASRs transcribing process, we employ similarity scores to evaluate the transcription similarity between the \(N_D\) pre-processed audios and the input audio. In particular, we utilize three similarity metrics: the word error rate (WER) (Levenshtein 1966), the Cosine similarity score, and the Jaro-Winkler similarity score (Gomaa et al. 2013).

As a result, we obtain a set of final calculated similarity scores denoted as \(S=\{S_1, S_2, \ldots , S_{N_D}\}\). These similarity scores are crucial for determining whether the input audio is benign or subjected to an attack. In the case of an attacked input audio, the average similarity scores, denoted as \(Avg_S = (S_1 + S_2 + \ldots + S_{N_D})/N_D\), must be lower than a specified threshold score T. This threshold score T is set to ensure that the false positive rate (FPR) remains below 5%. Conversely, in the case of a benign sample, a higher similarity score is expected.
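
The decision rule can be sketched as follows. The word-level Levenshtein WER is computed per transcript pair, and using 1 − WER as a similarity score is one illustrative choice (the paper also considers Cosine and Jaro-Winkler similarity); the threshold T is assumed to have been calibrated so that the FPR stays below 5%.

```python
import numpy as np

def word_error_rate(ref, hyp):
    """WER between two transcripts via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(r), len(h)] / max(len(r), 1)

def is_adversarial(target_transcript, aux_transcripts, threshold):
    """Flag the input when the average similarity over the N_D transcripts falls below T."""
    scores = [1.0 - word_error_rate(target_transcript, t) for t in aux_transcripts]
    return float(np.mean(scores)) < threshold
```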

4 Adaptive adversarial attacks

To verify the effectiveness of our proposed defense against adaptive attacks, which possess full knowledge of the defense mechanism, we introduce three adaptive attack scenarios.

1. Single-word attacks: In this type of attack, the target label corresponds to a single word rather than a complete sentence, and the attack employs the same formula as Eqn (1). Single-word attacks are generally considered less severe threats to many defense frameworks. However, since single-word audio contains no non-active segments, it can effectively evade our VAD detection. This attack therefore tests whether our defense framework remains robust when the VAD provides no useful information.

2. Multi-Sample Ensemble Method (MSEM): The use of ensemble methods in deep neural network (DNN) training has experienced a resurgence in recent years, demonstrating state-of-the-art performance and robustness in classification tasks (Hansen and Salamon 1990; Krogh and Vedelsby 1995; Caruana et al. 2004). This approach has also been applied to craft robust attacks (Athalye et al. 2018; Eykholt et al. 2018; Li et al. 2019; Komkov and Petiushko 2019). Similar to the MSEM method introduced in Du and Pun (2020), Du et al. approximate the solution to the following objective function:

$$\begin{aligned} \underset{\delta }{\arg \min }\ \quad \lambda \Vert \delta \Vert _{p}+\frac{1}{|S|} \sum _{s \in S} c \cdot {\text {CTC}}\left( f_{\theta }\left( s(x + \delta )\right) , y^{*}\right) \end{aligned}$$
(9)

where x and \(y^*\) denote the input audio and the target transcription, respectively. \(f_{\theta }(\cdot )\) represents the ASR model, and CTC stands for the Connectionist Temporal Classification loss function used for training the neural network. The set S represents a distribution space of reverberation transforms s, each producing a reverberated speech utterance \(s(x+\delta )\) according to Eqn (6). The parameter c serves as a balancing factor, as in Eqn (1), and the term \(\Vert \delta \Vert _{p}\) denotes the imposed \(L_p\) norm constraint on the perturbation vector. This algorithm enables attackers to execute robust audio attacks capable of withstanding various over-the-air scenarios.
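
For illustration, the MSEM objective can be sketched as an expectation-over-transformation loss: the CTC loss of the perturbed audio is averaged over a set of reverberated renderings, one per sampled RIR. `asr_logits`, `ctc_loss`, and `add_reverberation` are the hypothetical helpers sketched earlier; in practice the attacker minimizes this loss by gradient descent through the ASR model.

```python
import numpy as np

def msem_loss(x, delta, target, rirs, asr_logits, ctc_loss,
              add_reverberation, lam=0.05, c=1.0):
    """Ensemble objective of Eq. (9): average CTC loss over reverberated copies."""
    adv = x + delta
    ensemble = np.mean([ctc_loss(asr_logits(add_reverberation(adv, r)), target)
                        for r in rirs])
    return lam * np.max(np.abs(delta)) + c * ensemble
```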

3. Weighted Objective-based Ensemble Model (WO-EM): Currently, one of the most significant challenges in transferable attacks arises from the diverse architectures of individual speech recognition systems. In the context of crafting adversarial examples, ensemble model attacks have proven effective in deceiving multiple classifiers, particularly within the field of computer vision (Du and Pun 2020). In this paper, we present the WO-EM attack, designed to improve the transferability of adversarial perturbations. In formulating the objective function, we extend Eqn (9) to:

$$\begin{aligned} \underset{\delta }{\arg \min }\ \quad \lambda \Vert \delta \Vert _{p}+\frac{1}{|M|} \sum _{m \in M} c \cdot {\text {CTC}}\left( f_{m}\left( x + \delta \right) , y^{*}\right) \end{aligned}$$
(10)

where \(m \in M\) ranges over the various automatic speech recognition systems in the ensemble, and \(f_{m}(\cdot )\) denotes the recognition output of the m-th system.

5 Experimental results

5.1 Setup

Our experiments were conducted within the TensorFlow environment on a computational system equipped with two NVIDIA GeForce RTX 3090 GPUs, each with 24 GB of memory. We employed Mozilla’s DeepSpeech implementation, specifically DeepSpeech_v0.1.0 (Hannun et al. 2014), as our target ASR system. This choice aligns with the settings used in other research on both attack and defense. Additionally, our multi-ASR framework incorporated DeepSpeech_v0.1.1, Google Cloud Speech, and Amazon Transcribe.

Our audio classification task encompassed two categories of attacks. Firstly, we considered representative attacks, namely the white-box and black-box attacks, which are widely acknowledged standards for evaluating most defense methodologies. Secondly, we introduced adaptive attacks, including the single-word attack, MSEM, and WO-EM, specifically designed to challenge the robustness of our defense framework. Furthermore, we included five state-of-the-art defense methods for comparison purposes, namely, the random noise defense method (Kwon et al. 2019), limited random noise defense method (Rajaratnam and Kalita 2018), audio splitting method (Yang et al. 2019), diffusion-based defense (Wu et al. 2023) and our previous work (Du et al. 2020).

Here, we prepare three datasets:

  • Mozilla Common Voice dataset: Common Voice is a corpus of 16 kHz English speech read by users on the Common Voice website, derived from many public-domain sources such as user-submitted blog posts, old books, movies, and other public speech corpora.

  • Speech Commands dataset: The Speech Commands dataset (Warden 2018) contains 65,000 audio files covering ten words: “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, and “Go”. Each audio clip is a single command lasting one second. We use 900 samples from this dataset to mount the single-word attack.

  • TIMIT: The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a widely recognized and foundational dataset for evaluating automatic speech recognition systems and various phonetic studies. It includes broadband recordings of 630 speakers of eight major American English dialects, each reading ten phonetically rich sentences.

Furthermore, we formulated the detection accuracy function as follows:

$$\begin{aligned} accuracy = 1 - FPR - FNR \end{aligned}$$
(11)

where FPR represents the false positive rate, indicating the proportion of benign samples misclassified as adversarial, and FNR represents the false negative rate, denoting the proportion of adversarial inputs incorrectly classified as benign.

5.2 Results on different adversarial audio attacks

In this section, we evaluate the effectiveness of the proposed defense mechanism against various attack tasks, measuring the FPR, FNR, and overall accuracy to assess its performance. We first assess our defense using representative attacks and subsequently employ the three introduced adaptive attacks to test the resilience of our framework. Additionally, we reproduce and compare our work against the state-of-the-art defense methods introduced earlier.

Table 1 presents the three benchmark audio datasets utilized in our evaluation. Specifically, this study utilizes 250 benign samples from the Mozilla Common Voice dataset and 250 benign samples from the TIMIT dataset to generate 500 adversarial examples employing representative attacks. Due to computational constraints, we have incorporated only three to five target words into each audio sample. It’s worth noting that generating each adaptive adversarial example demands approximately 30 min, whereas single-word attacks can be generated in just two minutes. Consequently, we have generated 9,000 single-word adversarial examples for evaluation, employing the Speech Commands dataset.

Table 1 Datasets used in our evaluation

Representative attacks: In this section, we consider four representative attack methods. The first two are the C&W attack (Carlini and Wagner 2018) and the Taori black-box attack (Taori et al. 2019), introduced in Sect. 2.1. We further include two markedly more challenging techniques for comparative analysis: Qin’s attack (Qin et al. 2019) and the SMACK black-box attack (Yu et al. 2023), thereby enriching our exploration of adversarial methodologies within the domain.

We initially assess the detection error rate of the introduced Voice Activity Detection (VAD) framework, defining the error rate as \(Error = (SdN + NdS)/(T)\) (Gilg et al. 2020), where SdN represents “Speech frames detected as Noise,” NdS stands for “Noise frames detected as Speech,” and T corresponds to the total number of frames.

Table 2 demonstrates that the Short-Time Energy with Zero-Crossing Rate (STE-ZCR) achieves the lowest error rate at 10.7% in voice activity detection. Additionally, the STE-ZCR method maintains superior accuracy, with only a 12.6% error rate when subjected to adversarial attacks. Hence, we have chosen STE-ZCR as the VAD method for our design.

Table 2 Error rates of different VADs

Table 3 presents the detection results when six defense mechanisms are applied to handle representative attacks. The terms “splitting_2” and “splitting_3” indicate that the input speeches are divided into 2 and 3 fragments, respectively. When dealing with the C&W attack, which is the weakest among the representative attacks, most of the defense mechanisms perform well, as the C&W attack does not prioritize robustness. In contrast, our adaptive framework effectively circumvents these issues and achieves a 100% detection rate against the C&W attack.

An intriguing observation from the experimental results is as follows:

  1. While the splitting-based mechanism excels in interrupting the continuity of adversarial noise, it also negatively affects the accuracy of benign samples, resulting in a higher FPR.

  2. While “splitting_2” maintains a low FPR, its high FNR suggests that this approach is ineffective against audio attacks (refer to Table 5).

Table 3 Accuracy rates obtained by various defense frameworks, including random noise padding (Kwon et al. 2019), limited noise padding (Rajaratnam and Kalita 2018), the audio splitting method (Yang et al. 2019), diffusion-based defense (Wu et al. 2023), our previous defense (Du et al. 2020), and the proposed defense

Furthermore, the experimental results demonstrate that the random noise padding mechanism and the limited noise padding operation exhibit a low detection rate when confronted with black-box attacks. This implies that making minor alterations to the input audio is ineffective in countering black-box or robust attacks. In contrast to white-box attacks, which can swiftly generate adversarial examples with minimal perturbations, black-box attacks lack complete information from the ASR model. Consequently, substantial perturbations are required for such attacks. However, these increased perturbations also enhance the robustness of black-box attacks, enabling the generated adversarial examples to evade most fine-tuning-based pre-processing defenses. Importantly, the experimental results reveal that Taori black-box attacks are susceptible to temporal dependencies, and both our adaptive framework and the audio splitting mechanism effectively reduce the FNR when countering black-box attacks. Compared to the Taori black-box attack, the SMACK method enhances robustness in black-box adversarial settings by reconstructing audio signals. Notably, this approach demonstrates efficacy in executing over-the-air black-box attacks, a significant advancement in adversarial machine learning. Experimental results indicate that the SMACK black-box attack exhibits remarkable robustness across various adversarial strategies. This resilience extends even to defenses designed against robust attacks based on Expectation over Transformation (EoT), where SMACK performs commendably. Only our adaptive defense framework and the diffusion-based defense show promising results in mitigating these attacks. Specifically, the diffusion-based method disrupts the integrity of adversarial audio inputs by reconstructing them using a diffusion model, thereby impairing the effectiveness of SMACK attacks. Our proposed method, on the other hand, exploits the limited transferability inherent to existing attack techniques; by dynamically identifying malicious inputs using an ensemble of models, it achieves superior defensive performance. This analysis reveals a critical insight into the landscape of adversarial attacks: current methodologies, whether black-box or white-box, struggle to achieve effective transferability.

To gain deeper insights into the effectiveness of our proposed framework, we conducted ablative evaluations to examine the contributions of different parameters employed in our defense framework. Figure 4 provides detailed experimental results from these ablation studies. The results in Fig. 4 demonstrate that our settings in Eqn (5) are reasonable. Furthermore, Table 4 displays our detection accuracy for various RIR and allotted audio settings. It is evident that higher complexity can significantly improve the system’s accuracy rate. However, higher complexity also leads to increased computational overhead. This is precisely why we designed Eqn (5).

Fig. 4 Comparisons of detection accuracy curves for various experimental settings of \(\alpha\) and \(\beta\) (Eqn 5) in our ablation studies

Table 4 Detection rates for various similarity calculation methods against the WO-EM attack. “Ours_x” indicates the number of RIRs employed in our defense framework. m represents the coefficient from Eqn 8
Table 5 Accuracy rates of various defense frameworks (random noise padding (Kwon et al. 2019), limited noise padding (Rajaratnam and Kalita 2018), audio splitting method (Yang et al. 2019), diffusion-based defense (Wu et al. 2023), our previous defense (Du et al. 2020), and the proposed method) when handling our designed adaptive attacks (single-word attack, MSEM, and WO-EM)

Adaptive attacks: For the single-word attack, our objective is to assess the robustness of our defense framework even when the noise padding mechanism fails. As illustrated in Fig. 5, our defense maintains a high defense rate (99.7%) even when the noise-padding mechanism proves ineffective. In our previous design, defense complexity primarily relied on the adversarial noise detected by the VAD system. However, since single-word adversarial examples lack silent segments, adjusting complexity based on detected noise becomes impractical. In contrast, our SPL-based measurement effectively addresses this issue. Additionally, we observed that the adaptive ASRs transcribing mechanism significantly improves our detection accuracy compared to our previous design, as demonstrated in Table 5.

Fig. 5 A heat map illustrating the success rates of single-word attacks (a) and the success rates of single-word attacks after transformation (b). The diagonal of zeros represents trivial source-target pairs for which no adversarial examples were generated

It is essential to emphasize that for sentence attacks, although RIR-convolution may result in incorrect transcriptions for a few words, its overall influence on the similarity score remains minimal. However, in the context of single-word attacks, the impact of RIR-convolution on the similarity score is magnified, leading to a reduction in detection accuracy. This phenomenon is particularly noticeable in the DeepSpeech model but occurs less frequently in other models like Google Cloud Speech and Amazon Transcribe. We propose that this phenomenon may be attributed to overfitting in the DeepSpeech model.

In the case of MSEM, we assume that the attacker possesses full knowledge of our adaptive RIR-convolution mechanism but doesn’t have prior information about which specific RIRs our defense framework will use due to its adaptive nature. Using Eqn (9), the attacker aims to generate robust adversarial examples against our adaptive RIRs-convolution operation. In our implementation, we select 16 RIRs to create robust adversarial examples due to computational limitations. The final test results are presented in Table 5. These results reveal that simple audio pre-processing operations struggle to handle robust attacks, with their accuracy dropping significantly to 0%. Conversely, although the ensemble attack gains complete knowledge from our defense framework, it faces challenges maintaining robustness when certain RIRs are used. Furthermore, after RIR-convolution, the transferability of adversarial examples is compromised to some extent. This sets the stage for our adaptive ASR transcribing mechanism. Notably, our proposed framework effectively handles robust attacks, achieving an accuracy of 98.2%. This demonstrates that even if our RIR-convolution defense fails, our adaptive ASRs transcribing mechanism remains resilient against adversarial examples, highlighting the high robustness of our unified framework. Compared to our previous work, applying the adaptive ASRs transcribing mechanism enhances the performance of our detection framework from 89.2% to 98.2%. An intriguing observation from our study is the diminishing efficacy of robustness attacks predicated on ensemble strategies in the face of continuously evolving defense mechanisms. Notably, the diffusion-based defense mechanism has demonstrated commendable performance in thwarting MSEM attacks, even surpassing the outcomes of our previous endeavors. This leads us to posit that the robustness of over-the-air attacks warrants further investigation and enhancement to meet future application challenges effectively.

Fig. 6 Similarity scores, including Word Error Rate (WER), Cosine similarity, and Jaro-Winkler similarity, for 40 benign samples (red column) and their corresponding adversarial examples (green column), computed using both our previous defense and our proposed method when facing the MSEM and WO-EM attacks

Fig. 7 Three similarity scores obtained using our defense methods with varying padding noise lengths when processing (a) the original samples and (b) the adversarial examples

Fig. 8 Three similarity scores produced by our defense methods with varying padding noise magnitudes when processing (a) the original samples and (b) the adversarial examples

Figure 6 displays the average similarity scores, specifically the Word Error Rate (WER), the Jaro-Winkler similarity score, and the Cosine similarity score, between the transcriptions of the input speeches and the transformed speeches when different defense strategies are employed (our previous defense and the proposed mechanism). The experimental results reveal that the WER similarity score performs best at distinguishing adversarial audio from benign samples. Although the Jaro-Winkler similarity score yields higher scores for benign samples, it performs poorly against adaptive robust adversarial examples. The Cosine similarity score also discriminates well, but its unstable detection results can lead to higher False Positive Rates (FPR) and False Negative Rates (FNR). When the adaptive ASRs transcribing mechanism is applied, as shown in Table 4, WER still achieves the highest accuracy across different settings involving various Room Impulse Responses (RIRs). Furthermore, as depicted in Figs. 7 and 8, the experimental results indicate that WER maintains consistent performance even with different padding noise settings. Consequently, we select WER as our similarity calculation method. Regarding the WO-EM attack, it is important to note that the version proposed in this paper is not fully realized: since Google Cloud Speech and Amazon Transcribe are black-box systems for the attacker, the attack cannot perform ensemble calculations using their internal architectures and parameters. Nevertheless, the experiment remains significant. Tables 6 and 7 illustrate that although the WO-EM attack is designed to counter our adaptive ASRs transcribing mechanism, our pre-processing operation can disrupt the continuity and transferability of its adversarial examples, achieving an accuracy of 98.5%. Based on this, we have reason to believe that even if highly transferable adversarial attacks emerge in the future, our defense mechanism can still maintain its effectiveness.

We attempted to combine MSEM and WO-EM to create more resilient attacks against our defense. However, the computational demands of the ensemble attack approach for adaptive attacks became prohibitive when considering additional RIRs and classifiers. Furthermore, increasing the number of training samples resulted in lower SNR for the generated adversarial examples. This led to perturbation noise that was more pronounced than the original input speech. Consequently, while robust attacks might succeed against our defense, the excessive noise introduced would make them easily detectable by humans.

Table 6 Transcribed results of the representative adversarial attack as processed by the proposed defense. The benign transcription is “Life was like a box of chocolate”, while the targeted text is “This audio has been attacked”
Table 7 Transcribed results of the WO-EM adversarial attack as processed by the proposed defense. The benign transcription is “Life was like a box of chocolate”, while the targeted text is “This audio has been attacked”

6 Conclusion

This paper introduces a novel and effective adversarial defense framework designed for addressing adversarial audio attacks against ASR systems. Our defense framework comprises two key layers. The first layer is an adaptive unified pre-processing mechanism, strategically crafted to disrupt the continuity of adversarial noise and effectively counter a range of representative attacks. The second layer involves an adaptive ASR transcribing method, further enhancing the robustness and efficacy of our framework. Notably, our experimental results illustrate that this framework can be seamlessly integrated into a wide array of ASR platforms without incurring additional computational overhead.

We conducted extensive experiments to validate the effectiveness of our approach. This included evaluating different similarity scores and assessing its resilience against various attacks in comparison to state-of-the-art defense methods. In summary, our evaluation results demonstrated the following: 1) Fine-tuning-based pre-processing methods, such as the random noise padding method and limited noise padding method, excel at mitigating simple audio attacks like the C &W attack. However, they face challenges when defending against more complex and robust attacks, such as black-box or adaptive attacks. 2) Solely relying on an audio splitting strategy to counter audio attacks presents difficulties in balancing the trade-off between FPR and FNR. 3) Currently, continuity and transferability are identified as weaknesses in most audio attacks. 4) Our proposed adaptive framework effectively combats both representative and adaptive attacks through adaptive learning. It surpasses other representative defense frameworks in terms of detection accuracy and robustness.