1 Introduction

The deployment of deep learning (DL) in various sectors, especially in image classification, marks a significant milestone in the evolution of artificial intelligence (AI) (Rana and Bhushan 2023; Kaur et al. 2023; Musa et al. 2023). Yet, the susceptibility of DL systems to adversarial examples, a form of sophisticated evasion attack in adversarial machine learning (AML) (Szegedy et al. 2013), raises critical concerns about their reliability and safety, particularly in applications where the stakes are high (Chakraborty et al. 2023). These adversarial examples, carefully crafted inputs designed to deceive DL models, not only undermine the accuracy of these systems but also pose substantial risks to their operational integrity and reliability in essential applications. Accordingly, addressing these concerns requires the development of advanced DL strategies that not only emphasize accuracy and fairness but also prioritize robust and trustworthy architectures to counteract the sophisticated threats posed by adversarial examples (Li et al. 2023; Grzybowski et al. 2024).

Current efforts to enhance DL model trustworthiness encompass various strategies, each with specific advantages and limitations. One prominent approach is adversarial training (Kim et al. 2023), where models are trained with a range of adversarial examples. This exposure aims to prepare the model to recognize and neutralize such attacks in real-world situations. However, this method can be resource-intensive and might lead to a trade-off between robustness and accuracy, where the model’s effectiveness on regular data is compromised (Tsipras et al. 2018; Zhang et al. 2019). Moreover, it may not guarantee protection against all types of adversarial attacks, particularly those not included in the training phase.

Strategies like regularization techniques and modifications to network architectures (Liu et al. 2024; Xia et al. 2024) are also employed to improve model generalization and minimize overfitting, thereby increasing reliability. However, striking the perfect balance in regularization can be challenging, and network architecture modifications can add design complexities and unpredictability in model behavior. Additionally, due to the fast-paced advancements in offensive methods, present defense mechanisms might be insufficient to combat the recently developed attacks in this area. Consequently, an in-depth analysis is necessary to investigate the characteristics of adversarial attacks and their corresponding countermeasures. In the scholarly domain, a plethora of investigative works have delineated various facets pertinent to this field. Notably, certain surveys have accentuated the distinct attributes of affected applications. Akhtar and Mian (2018) focused their examination on adversarial attacks targeting DL within the realm of computer vision, exploring attacks peculiar to diverse components like autoencoders, generative models, recurrent neural networks, deep reinforcement learning, semantic segmentation, object detection, and facial attributes.

Simultaneously, other scholarly inquiries have endeavored to thoroughly decipher the complexities inherent in AML. The taxonomic framework for classifying attacks and threats, as proposed by Pitropakis et al. (2019), encapsulates various stages of the ML continuum, aiming to fortify the formulation of robust counterstrategies. In a comprehensive review by Miller et al. (2020), the focus is cast on adversarial learning attacks and defenses, especially in the context of deep neural network classifiers. This extensive survey encompasses a spectrum of attack modalities such as test-time evasion, data poisoning, backdoor attacks, and reverse engineering, alongside proposing apt defensive methodologies. Machado et al. (2021) conducted an inclusive examination of defense mechanisms specifically tailored for image classification.

Furthermore, Xu et al. (2020) undertook a systematic and exhaustive investigation of adversarial examples and their countermeasures for deep neural networks. This survey traverses multiple data typologies including images, graphs, and textual content, endeavoring to illuminate adversarial tactics and defense mechanisms across a variety of data domains. Liu et al. (2018) rendered a comprehensive discourse on the security perils and corresponding defensive techniques in the milieu of ML, underscoring the susceptibilities of ML algorithms and training data across domains like image processing, natural language processing, pattern recognition, and cybersecurity. Additionally, Wang et al. (2019b) delved into an array of adversarial attack methodologies, discussing corresponding defensive stratagems from both scholarly and industrial perspectives. Li et al. (2018) investigated attacks capitalizing on shared model vulnerabilities and presented counteractive measures to safeguard data confidentiality within DL applications. In parallel, Shafee and Awaad (2021) systematically evaluated adversarial attack and defense technologies in DL, enhancing the understanding of techniques to augment the resilience of AI systems against adversarial threats. Moreover, Qiu et al. (2019) proffered a comprehensive technical retrospect of the evolution of AML research over the preceding decade, offering a valuable retrospective overview of the field's research progression.

However, the existing literature in the field of adversarial machine learning (AML) reveals several research gaps for further exploration. Firstly, there is a noticeable absence of thorough analysis regarding the unique vulnerabilities and patterns of adversarial attacks across different DL application areas. This suggests a gap in understanding the reliability and effectiveness of defense strategies for various DL models beyond image classification. Additionally, there is a shortfall in studies that evaluate the practical application and real-world impact of these adversarial attacks and their countermeasures, particularly focusing on their implementation and efficacy in actual operational settings. Lastly, there is an ongoing need for contemporary and continuous research that keeps pace with the newest developments, challenges, and emerging trends in this rapidly changing field of AML.

Our research aims to bridge the gaps identified in the current literature as follows:

  • To conduct an in-depth analysis of various types of adversarial attacks, mainly focusing on evasion attacks such as white-box, black-box, and Generative Adversarial Network (GAN)-based attacks. This objective aims to elucidate the methodologies and motivations behind these attacks, contributing to a clearer understanding of adversarial techniques in the context of image processing.

  • To evaluate the effectiveness of AML techniques across diverse datasets and evaluation methods. This objective seeks to provide a broader, real-world perspective on the performance and applicability of AML in various scenarios and conditions.

  • To identify potential new research directions based on the insights gained from the analysis of adversarial attacks. This includes proposing innovative approaches or improvements to existing methodologies to enhance the trustworthiness of DL models.

  • To investigate the actual consequences of adversarial attacks on DL models, aligning with principles of AI trustworthiness. This involves understanding the practical implications of adversarial threats on model reliability and trust in real-world applications.

  • To develop a framework that focuses on enhancing the robustness of DL models by targeting specific vulnerabilities identified in the research including addressing issues such as limited data, high complexity, noise-induced errors, and non-robust characteristics of models.

The structure of the SLR is as follows: Sect. 2 outlines the research methodology employed, while Sect. 3 provides a background on adversarial examples in image classification. Sections 4 and 5 present the results and discussion of the research questions addressed in the SLR. In Sect. 6, a proposed conceptual framework for mitigating adversarial attacks is discussed. Section 7 presents the recommendations and future work and finally, the conclusion is presented in Sect. 8.

2 Research methodology

This section describes the SLR methodology used to address two research questions and categorize AML approaches. The methodology of this SLR is inspired by the guidelines provided by Kitchenham et al. (2007), Weidt and Silva (2016). These guidelines are widely used in the literature for conducting SLRs (Agbo et al. 2019; Lo et al. 2019).

2.1 Research questions

This SLR covers the answers to the following research questions:

  • RQ1 What are the capabilities and vulnerabilities that adversaries take advantage of to impact the trustworthiness of DL?

  • RQ2 What qualities do adversarial attacks share that could be considered when designing a trustworthy DL-based system?

2.2 Search process

The first step in this SLR protocol was to construct a search string by combining the identified keywords in different combinations. The search strategy therefore involved four steps: keyword identification, search string formation, source selection, and search strategy documentation.

2.2.1 Defining keywords

Keywords were established to retrieve papers relevant to the research questions. Table 1 shows all of the keywords identified for the search.

Table 1 Derived keywords

2.2.2 Forming search string

A search string was established based on the keywords for the individual research questions. It was reviewed and refined to return the most relevant results from the search sources. The search string was set through the following steps:

  1. Deriving keywords from the research questions and topic.

  2. Identifying alternative spellings or synonyms for significant terms.

  3. Identifying keywords.

  4. Using the Boolean operator OR to combine synonyms.

  5. Using the Boolean operator AND to link significant terms.

The following search string was generated as a result of the above procedure. (Security threat OR Security attack) AND (Machine learning OR Deep learning) AND (Adversarial example OR Adversarial machine learning OR Adversarial image). Our search string comprises two parts; the first focuses on machine learning security, and the second describes the adversarial examples.

2.2.3 Selection of sources

Five online databases, IEEE Xplore, Scopus, ScienceDirect, Springer, and Wiley, were consulted as literature sources. These databases were chosen for their scholarly rigor and their coverage of the many aspects of our discussion area. In addition, the references of the retrieved results were examined, and general search engines such as Google Scholar were used to ensure that no relevant research was missed.

2.2.4 Documenting search strategy

During this phase, a comprehensive overview of the search strategy was prepared to encompass all relevant information. Table 2 provides details such as the search date, search string, online library name, and the total number of records retrieved based on the search string. Furthermore, it outlines the criteria used during the identification and screening stages of the search. Moreover, Table 3 displays the number of records obtained during the identification phase.

Table 2 Search strategy documentation
Table 3 Identification stage report

2.3 Inclusion–exclusion criteria

The following inclusion–exclusion criteria have been implemented while selecting the final studies for analysis:

  • Include studies that are accessible online and have been published between 2013 and 2023.

  • Include studies related to AML approaches in the computer science domain.

  • Include studies that follow a scientific guideline in AML approaches.

  • Exclude review and survey articles from the selection process.

  • Exclude studies that are not written in English.

  • Exclude studies that have not undergone a peer-review process.

  • Exclude studies that are not published as open access.

  • Exclude studies that are not in the final publication stage.

2.4 Quality assessment (QA) questions

A QA checklist was established to avoid bias and to select the most relevant studies. Each paper is evaluated against a predefined set of questions, and scores are assigned based on its responses. The checklist includes the following questions:

  • Are the research objectives adequately stated?

  • Does the study discuss attacks perpetrated using adversarial examples?

  • Is there an explanation of the adversary’s capabilities?

  • Do the countermeasure techniques related to defeating adversarial examples contribute to secure DL-based systems?

This checklist consisted of several questions, and a score was assigned based on the extent to which a paper adequately addressed each question. A score of “1” was given if a paper thoroughly answered a question, a score of “0.5” indicated a partial answer, and papers that did not address a question received a score of “0”. The cumulative score, termed the aggregate value (AV), was calculated by summing the scores obtained for all questions in the checklist. To determine which papers would undergo further analysis, a threshold of 2 was established: papers with an AV above the threshold were considered for inclusion in the analysis, while those with an AV below the threshold were excluded.
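The scoring procedure above can be summarized in a few lines of code. The sketch below is purely illustrative: the paper identifiers and per-question scores are hypothetical placeholders, and only the AV computation and the threshold rule come from the text.

```python
# Minimal sketch of the QA scoring described above: each of the four checklist
# questions receives 1 (full answer), 0.5 (partial), or 0 (not addressed).
# Paper names and score values below are hypothetical placeholders.
QA_THRESHOLD = 2  # papers with AV above this value are retained

candidate_papers = {
    "paper_A": [1, 1, 0.5, 1],
    "paper_B": [0.5, 0.5, 0, 0.5],
}

def aggregate_value(scores):
    """Aggregate value (AV): the sum of the per-question QA scores."""
    return sum(scores)

selected = [name for name, scores in candidate_papers.items()
            if aggregate_value(scores) > QA_THRESHOLD]
print(selected)  # ['paper_A']
```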

Table 4 presents the number of records retrieved during the screening stage.

Table 4 Scanning stage report

In order to deliver a comprehensive response to the identified research questions, the next section provides a comprehensive examination of image classification, along with an in-depth exploration of the phenomenon of adversarial examples in the context of image classification tasks.

3 Background

In an image classification task, a classifier C is trained on a large set of input–output pairs \(\left( X,Y\right)\) to learn the values of the weights that hold the knowledge gained throughout the training phase of a DL model f and enable accurate prediction decisions. The prediction produced by the classifier C is based on the probability distribution, represented as \(C(x)={\text {argmax}}f(x)\). The weight values are updated to reduce the loss value J, a function used to evaluate the prediction error between the predicted value, which is the output of f(x), and the actual output y. To measure the magnitude of change required to update the weights, the gradient of the loss computed in backpropagation is used by the gradient descent algorithm, which is used to train DL models. The learning process continues iterating until the algorithm discovers model parameters that converge to the lowest possible loss (where the gradient has a minimum value). The iterative learning is controlled by a learning rate hyperparameter, which defines the step size for the next update of the model parameters. During the test phase, the DL model is deployed using the learned weight parameter values to make predictions on inputs that were not observed during training. Since the weight parameters represent the model knowledge acquired during training, an ideal model should generalize and make correct predictions for inputs outside the training domain. However, adversaries manipulate DL model inputs with adversarial instances, demonstrating that this is not the case.
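As a concrete illustration of the training loop described above, the following PyTorch sketch shows a single gradient-descent step: the loss J is computed between the prediction f(x) and the true label y, backpropagation yields the weight gradients, and the learning rate controls the size of the update. The architecture and hyperparameters are illustrative assumptions, not taken from any surveyed paper.

```python
# A minimal PyTorch sketch of supervised training by gradient descent.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128),
                      nn.ReLU(), nn.Linear(128, 10))      # DL model f
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # learning rate = step size

def train_step(x, y):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)  # loss J between f(x) and the true label y
    loss.backward()                      # gradients of J w.r.t. the weights
    optimizer.step()                     # gradient-descent update of the weights
    return loss.item()

def predict(x):
    return model(x).argmax(dim=1)        # C(x) = argmax f(x)
```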

3.1 Adversarial perturbation search process

Many existing attack methods, such as gradient-based methods, adopt optimization algorithms to obtain approximate adversarial perturbations. An adversary exploits the loss computed through backpropagation, but calculates gradients with respect to the input rather than the network parameters (weights), and updates the perturbation accordingly. This results in misclassification by maximizing the loss function. To achieve this goal, an attacker formulates the search for perturbations as an objective optimization problem that maximizes the loss function \(J(x,y,\theta )\) and solves it within a defined threat model.

A threat model specifies an attacker’s capabilities and the goal of an adversarial attack. The attacker’s goals are modeled in terms of compromising the targeted model’s security, privacy, integrity, or availability. The adversary’s capabilities can be modeled using existing knowledge so that the adversary can leverage their knowledge of the feature set, the distribution of the training dataset, or the learning algorithm. This is referred to as a white-box attack. In comparison, if adversaries have limited knowledge about the targeted model, they can conduct their attack using an exploration procedure in which they pose as a regular user and, through a series of queries to the targeted model, explore the distribution of the training data while also utilizing either the feedback from sample labels, label score, or the probability values returned by the model. This is referred to as a black-box attack. As illustrated in Fig. 1, these two attack accessibility situations are leveraged to pursue poisoning and evasion attacks.

The primary focus of this paper revolves around evasion attacks, which directly impact the integrity and reliability of DL systems. These attacks aim to manipulate or deceive the system during the inference phase by modifying input data, leading to misclassification or erroneous results. This poses a substantial risk to the system’s overall performance and trustworthiness. Furthermore, evasion attacks have practical implications across diverse domains, such as computer vision, natural language processing, and fraud detection systems. Within these domains, the successful execution of evasion attacks can have severe consequences. For example, in computer vision, misclassification of crucial objects can result in erroneous decision-making and compromised security. In natural language processing, manipulating sensitive information can lead to privacy breaches and the propagation of misinformation. In fraud detection systems, evasion attacks can facilitate unauthorized access and circumvent detection mechanisms, enabling undetected fraudulent activities. Therefore, by examining the techniques and vulnerabilities associated with evasion attacks, this paper aims to enhance the understanding of these attacks and contribute to developing robust countermeasures.

Fig. 1
figure 1

Adversarial examples generation using white-box and black-box threat models

3.2 Adversarial example in an image classification task

Crafting an untargeted adversarial example \(x^{\prime }\) involves solving the following optimization problem:

$$\begin{aligned} \min _{x'} \Vert x' - x\Vert _p, \quad \text {s.t. } C(x') \ne C(x), \quad x' \in [0,1]^n. \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _p\) is the distance metric D (an \(L_p\) norm) used to measure the distance between the original image x and its adversarial counterpart \(x^{\prime } = x + \delta\), and \(\delta\) is the perturbation added to the original image x to generate the adversarial example. The constraint \(C(x^{\prime }) \ne C(x)\) ensures that the classifier misclassifies the adversarial example. Similarly, crafting a targeted adversarial example \(x^{\prime }\) involves solving the following optimization problem:

$$\begin{aligned} \min _{x'} \Vert x' - x\Vert _p, \quad \text {s.t. } C(x') = t, \quad x' \in [0,1]^n. \end{aligned}$$
(2)

where t is the target class label. The constraint \(C(x^{\prime }) = t\) ensures that the adversarial example is classified as the target class t. When searching for adversarial perturbations \(\mathrm {\delta }\) that satisfy \(C(x+\mathrm {\delta })\ne C(x)\), the smaller the magnitude of \(\mathrm {\delta }\), the better the adversarial attack approach. These perturbations can be generated for each clean input or for all clean inputs. Since finding a minimal perturbation is a fundamental premise of adversarial examples, measuring the magnitude of the perturbation is important. The commonly used measure is the \(l_{p}\)-norm distance; \(l_{0}\), \(l_{2}\), and \(l_{\infty }\) are three widely used \(l_{p}\) metrics. The \(l_{0}\) distance represents the number of image features that have been modified, \(l_{2}\) quantifies the standard Euclidean distance between the perturbed image \(x^{\prime }\) and the unperturbed sample x, whereas the \(l_{\infty }\) distance represents the maximum change to any single feature. To identify perturbations that can be injected into the original sample to deceive the DL model, various search approaches are adopted for white- and black-box threat models to identify perturbations of the appropriate magnitude.
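For reference, these three perturbation measures can be computed directly from the difference between the clean and perturbed images; the NumPy sketch below assumes both images are arrays scaled to [0, 1].

```python
# Sketch of the l_0, l_2, and l_inf perturbation measures discussed above.
import numpy as np

def perturbation_norms(x, x_adv):
    delta = (x_adv - x).ravel()
    l0 = np.count_nonzero(delta)       # number of modified features (pixels)
    l2 = np.linalg.norm(delta, ord=2)  # Euclidean distance between x and x'
    linf = np.abs(delta).max()         # maximum change to any single feature
    return l0, l2, linf
```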

4 Discussion

This section offers a thorough examination of AML’s classification, the datasets used for research, and the metrics used to evaluate different attack strategies. Our goal is to enhance the development of more durable and dependable deep learning models capable of resisting adversarial attacks in real-life situations, by gaining a deep understanding of these elements. Additionally, this includes acquiring knowledge about the robustness, accuracy, and generalization abilities of deep learning models when faced with adversarial conditions. Following a detailed technical review of the AML methods discussed in this study, a complete categorization of AML techniques within the domain is depicted in Fig. 2. Essentially, this SLR is centered on the exploration of adversarial attack methods to answer specific research inquiries.

Fig. 2
figure 2

Schematic flowchart depicting the direction of contributions in the context of AML for the SLR

4.1 Dataset

The standard DL datasets commonly employed in computer vision tasks are described as follows (a brief loading sketch is given after the list):

  • Modified National Institute of Standards and Technology (MNIST) (LeCun 1998). It is a widely used dataset for the development and evaluation of image classification and machine learning models. It consists of a training set of 60,000 images and a test set of 10,000 images of handwritten digits, each of size 28 \(\times\) 28 pixels. The MNIST dataset is often used as a benchmark for testing and comparing the performance of different ML algorithms, as well as for teaching and learning purposes.

  • CIFAR-10 (Krizhevsky 2009) is a dataset consisting of 50,000 training images and 10,000 test images, each of size 32 \(\times\) 32 pixels, labeled over 10 categories. The dataset is widely used for benchmarking ML algorithms and is extensively utilized in computer vision research.

  • ImageNet (Fei-Fei et al. 2010). It is a large-scale image database created by researchers at Stanford University. It was the first image dataset to contain over one million labeled images, and it is widely used for training and evaluating image classification and object detection models. The images in the dataset are organized into more than 20,000 categories, with a total of over 14 million images. The ImageNet dataset has played a significant role in the development of DL models for image recognition and has contributed to the rapid progress in the field of computer vision.

  • Street view house numbers (SVHN). The SVHN dataset (Netzer et al. 2011) is a real-world image dataset used for developing ML and object recognition algorithms. It comprises images of house numbers taken from Google Street View and has over 600,000 labeled images. The images are of size 32 \(\times\) 32 pixels and have centered, well-lit digits. The dataset has 10 classes, corresponding to the digits from 0 to 9. The SVHN dataset is often used as a substitute for the MNIST dataset, which consists of images of handwritten digits and is commonly used as a benchmark in ML. One advantage of the SVHN dataset over MNIST is that the images are taken from real-world scenes.

  • MegaFace (Kemelmacher-Shlizerman et al. 2016). It is a publicly available dataset used for evaluating the performance of face recognition algorithms with up to a million distractors (i.e., up to a million people who are not in the test set). MegaFace contains one million images of 690,000 individuals with unconstrained pose, expression, lighting, and exposure.

  • STL-10 (Coates et al. 2011). This dataset is derived from ImageNet and is popularly used to evaluate algorithms for unsupervised feature learning or self-taught learning. Besides 100,000 unlabeled images, it contains 13,000 labeled images from 10 object classes (such as birds, cats, and trucks), of which 5000 are partitioned for training while the remaining 8000 are for testing. All images are color images of 96 \(\times\) 96 pixels.
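Several of these benchmarks ship with common DL toolkits; the sketch below loads three of them with torchvision as an illustration (the root path is a placeholder, and larger datasets such as ImageNet and MegaFace require separate downloads).

```python
# Loading three of the benchmarks above with torchvision (illustrative only).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # scales pixel values into [0, 1]

mnist_train = datasets.MNIST(root="data", train=True, download=True,
                             transform=to_tensor)    # 60,000 28x28 digit images
cifar_test = datasets.CIFAR10(root="data", train=False, download=True,
                              transform=to_tensor)   # 10,000 32x32 test images
svhn_train = datasets.SVHN(root="data", split="train", download=True,
                           transform=to_tensor)      # street-view house numbers
```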

4.2 Evaluation metrics

Based on the review of the papers analyzed, the metrics used to evaluate the proposed attack approaches are as follows (a short computation sketch follows the list):

  • Success rate (Goodfellow et al. 2014b; Moosavi-Dezfooli et al. 2016; Madry et al. 2017). An indicator that quantifies the percentage of adversarial samples that the DNN classifies as the adversarial target class.

  • Distortion rate (Szegedy et al. 2013; Kurakin et al. 2016). An indicator that expresses the percentage of pixels adjusted in the original sample (input features) to achieve adversarial examples within the white-box threat model. In contrast, in the black-box threat model, the metric measures the distortion percentage under different volumes of queries and the number of queries under contrasting distortion rates.

  • Hardness metric (Papernot et al. 2016b). This metric determines the most accessible pairs of inputs to exploit.

  • Human perception (Papernot et al. 2016b). This metric is used to evaluate the rate of the adversarial examples that are still visually classified as an original sample by humans.

  • Structural similarity (SSIM) (Wang et al. 2004). This metric quantifies image degradation in terms of perceivable changes in luminance, contrast, and structure.

  • Stability metric (Tabacof and Valle 2016). This metric measures the stability of a classifier by identifying the percentage of images that keep or switch labels when distortion is added to the original image in the pixel space.

  • Perceptual Adversarial Similarity Score (PASS) (Kurakin et al. 2016). This metric is proposed to quantify adversarial images more consistently than is done by widely used \(l_{p}\) norm measurements.
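Two of these metrics can be stated directly as short computations; the sketch below, assuming NumPy arrays, shows the untargeted attack success rate and the distortion rate (fraction of modified pixels).

```python
# Illustrative computation of two metrics from the list above.
import numpy as np

def success_rate(pred_labels, true_labels):
    # Untargeted setting: fraction of adversarial samples no longer predicted
    # as their true label (for targeted attacks, compare against the target).
    return float(np.mean(pred_labels != true_labels))

def distortion_rate(x, x_adv, tol=1e-8):
    # Fraction of pixels adjusted to obtain the adversarial sample.
    return float(np.mean(np.abs(x_adv - x) > tol))
```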

5 Results

In today’s world, numerous sectors depend on the capabilities of AI to process large volumes of data and make informed decisions. However, the reliability of AI is not always guaranteed, and it can lead to undesirable outcomes. For example, there was an incident where a self-driving car caused a deadly accident due to the AI system’s inability to detect a pedestrian (Levin and Wong 2018). This incident underscores the risks associated with AI if its development is not rigorously supervised. Thus, it is vital to ensure AI systems are safe for human interaction. Additionally, it is important to develop and adhere to regulations and policies that prevent AI from causing harm, whether intentional or unintentional, to society and its users (Ryan 2020; Kaur et al. 2022).

To maintain trustworthiness in AI, two frameworks of principles have been proposed in the AI literature: one by the Organization for Economic Co-operation and Development (OECD) (Yeung 2020), and another by the European Commission’s AI High-Level Expert Group (HLEG) (Holzinger et al. 2022).

Notably, the guidelines from OECD and HLEG focus on key aspects of AI such as robustness, safety, security, explainability, and fairness, all of which are essential for creating dependable AI systems. Despite ongoing efforts to enhance these systems’ reliability, the existence of adversarial examples remains a challenge to trust, especially in sensitive areas like healthcare (Asan et al. 2020). As shown in Fig. 3, ensuring the trustworthiness of ML is in alignment with the OECD and HLEG principles and involves two main components. The first involves the creation of secure datasets that emphasize privacy, integrity, and confidentiality in accordance with these guidelines. The second focuses on the importance of robust ML models, whose performance is crucial for explainability, leading to the development of sustainable models known for their robustness and dependability.

Therefore, understanding the common characteristics of attacks allows us to identify potential weaknesses and strengthen defense strategies accordingly. This knowledge about vulnerabilities enables us to improve the resilience of ML systems by implementing measures that counteract these vulnerabilities, aligning with the OECD and HLEG guidelines. Following this research analysis, the capabilities of adversarial attacks that affect the trustworthiness of DL systems have been thoroughly investigated for both black-box and white-box threat models.

Fig. 3
figure 3

Properties of trustworthy AI principles (Holzinger et al. 2022)

5.1 The capabilities and vulnerabilities that adversaries take advantage of to impact the trustworthiness of DL (RQ.1)

5.1.1 White-box vs. black-box threat model

This section introduces attack approaches from the perspective of two dimensions: the level of knowledge possessed by an attacker and the access method, whether it is through a white-box or black-box approach. There is particular emphasis on the knowledge an attacker has, such as output labels, the probability of prediction results, or the transferability attribute used to estimate a gradient. Additionally, we focus on the evaluation criteria used to determine the success of an attack in relation to the standard dataset utilized for the attack.

5.1.1.1 White-box threat model

Adversarial examples were first generated by Szegedy et al. (2013). The attacker aims to find the minimal perturbation, measured by \(D(x,x^{\prime })\), that changes the prediction to the target label \(y^{\prime }\). Since neural networks are generally non-convex, the box-constrained L-BFGS method is adopted to approximate the solution by performing a line search to find the minimum \(c>0\) for which the perturbation \(\mathrm {\delta }\) satisfies \(f(x+\mathrm {\delta })=y^{\prime }\). The objective problem formulated in Eq. 2 is solved as follows:

$$\begin{aligned} \min _{\delta } \left( c\,\Vert \delta \Vert + J(x', y') \right) , \quad \text {s.t. } x' = x + \delta \in [0,1]^n. \end{aligned}$$
(3)

According to experiments conducted on several non-convolutional models, box-constrained L-BFGS can reliably identify adversarial samples, as indicated by the authors. However, it is impractical and time-consuming for large datasets. Another proposed approach, which uses gradients to maximize the loss function J(x,y) to find perturbations, is the Fast Gradient Sign Method (FGSM) (Goodfellow et al. 2014b). This approach is more straightforward than L-BFGS. Objective problem 1 is solved by using a linear approximation to J:

$$\begin{aligned} \max _{\delta } J(x + \delta , y), \quad \text {s.t. } \Vert \delta \Vert _{\infty } \le \epsilon . \end{aligned}$$
(4)

The computed loss is then used to generate the adversarial example:

$$\begin{aligned} x' = x + \epsilon \cdot {\text {sign}}(\nabla _x J(x, y)). \end{aligned}$$
(5)

This method is computationally more efficient than complex methods such as L-BFGS, as indicated by the results.
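The single-step update of Eq. (5) translates almost directly into code. The following PyTorch sketch follows the common formulation of FGSM; details such as clipping back to the valid pixel range are standard practice rather than taken from the original paper.

```python
# Minimal FGSM sketch: one step of size epsilon along the sign of the
# gradient of the loss with respect to the input (Eq. 5).
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)           # J(x, y)
    grad = torch.autograd.grad(loss, x)[0]        # gradient w.r.t. the input
    x_adv = x + epsilon * grad.sign()             # Eq. (5)
    return torch.clamp(x_adv, 0.0, 1.0).detach()  # keep x' in [0, 1]^n
```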

Instead of using the sign of the gradient as in FGSM, Kurakin et al. (2016) proposed the Fast Gradient Value Method (FGVM) to generate multiple possible adversarial examples for a single image. This method imposes no restrictions on each pixel, resulting in images with significant differences. Rather than using \(l_p\) norms to quantify adversarial images, a new metric called PASS is introduced to measure perceptual similarity to humans; the proposed approach disregards unnoticeable pixel-level differences. PASS involves two stages: (1) aligning the modified image with the original image and (2) measuring the similarity between the aligned image and the original one. Therefore, the adversarial problem 2 has to minimize \(D(x,x^{\prime })\) such that the distance metric is estimated through the combination of alignment and similarity measurement.

The least-likely class method (LLC) is a variant of FGSM proposed by Kurakin et al. (2018) to find perturbations that change the prediction to the target label \(y^{\prime }\). Instead of maximizing the loss function as FGSM does, LLC minimizes the loss with respect to the target label. To solve the targeted counterpart of problem 4, the linear approximation to J is used:

$$\begin{aligned} \min _{\delta } J(x + \delta , y'), \quad \text {s.t. } \Vert \delta \Vert _{\infty } \le \epsilon . \end{aligned}$$
(6)

The adversarial examples are then computed using the following equation:

$$\begin{aligned} x' = x - \epsilon \cdot {\text {sign}}(\nabla _x J(x, y')). \end{aligned}$$
(7)

Kurakin et al. (2018) proposed a step-by-step version of FGSM, called I-FGSM, which is also known as the basic iterative method (BIM). The aim of I-FGSM is to find a stronger adversarial perturbation than that generated by FGSM, while limiting the amount of change to the generated adversarial image at each iteration. Instead of using \(\epsilon\) as the size of a single step in the direction of the gradient sign, it splits it into several equal small step sizes \(\alpha\) and runs FGSM iteratively at each step. I-FGSM has been found to produce superior results to FGSM, but its primary disadvantage is its fixed, non-adjustable step size. Additionally, the uniform distribution of step sizes hinders rapid loss growth at high gradient values and leads to fine-tuning at low gradient values, as pointed out by Shi et al. (2020).
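A sketch of I-FGSM/BIM is given below: the single step epsilon is replaced by several smaller steps alpha, and after each step the accumulated perturbation is clipped back into the epsilon-ball around the clean image. This is a hedged illustration of the common formulation, not a reproduction of the original implementation.

```python
# I-FGSM / BIM sketch: iterative signed-gradient steps of size alpha,
# clipped to an l_inf ball of radius epsilon around the clean image.
import torch
import torch.nn.functional as F

def bim(model, x, y, epsilon, alpha, steps):
    x = x.clone().detach()
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)  # l_inf clip
        x_adv = torch.clamp(x_adv, 0.0, 1.0)                           # valid pixels
    return x_adv
```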

The projected gradient descent (PGD) method was proposed by Madry et al. (2017) as a variant of the BIM algorithm. In PGD, random noise is added before running BIM to enhance its effectiveness; the adversary then seeks the adversarial perturbation by maximizing the loss function, as in other methods. Shi et al. (2020) proposed the Ada-FGSM approach, which adaptively adjusts the step size for adding noise under the guidance of gradient information, ensuring that no more noise than the specified range is added to the original image. This is achieved by recording the gradient value at each step and comparing it with the current gradient value to allocate the corresponding step size to each noise element. The objective is to solve optimization problem 1 for finding \(y^{\prime }\) by controlling \(\epsilon\) for both the \(l_\infty\) and \(l_2\) norms. The adaptive step size of Ada-FGSM, as proposed by Shi et al. (2020), is claimed to effectively exploit the correlation between gradient information and decision boundaries. In contrast to I-FGSM, which uses the gradient only to alter direction, Ada-FGSM allows for greater control over both the direction of noise addition and the step size, enabled by adjusting the noise-adding step size under the guidance of gradient information. The authors demonstrate that Ada-FGSM is more likely to increase the loss function than I-FGSM.
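The sketch below illustrates the PGD variant just described: relative to the BIM loop above, the only changes are the random initialization inside the epsilon-ball and the projection back onto that ball around the clean image. Ada-FGSM's adaptive step-size schedule is not shown.

```python
# PGD sketch: random start inside the epsilon-ball, then BIM-style steps
# projected back onto the ball around the clean image x.
import torch
import torch.nn.functional as F

def pgd(model, x, y, epsilon, alpha, steps):
    x = x.clone().detach()
    x_adv = torch.clamp(x + torch.empty_like(x).uniform_(-epsilon, epsilon), 0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - epsilon), x + epsilon), 0.0, 1.0)
    return x_adv
```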

The success rate metric under the \(l_\infty\) norm is used to measure the success of an attack after adding a perturbation that does not exceed the specified range to the original image. The accuracy reduction rate metric is used to quantify the attacker’s ability to attack the target model. For evaluation under the \(l_2\) norm, the median and size of the adversarial perturbations are utilized.

To smooth the direction of noise addition, the MI-FGSM method was introduced by Dong et al. (2018). A momentum term is added to I-FGSM to update the current direction of noise addition using the previous gradient values. Specifically, the gradients of the first t iterations are accumulated up to the t-th iteration, and the adversarial example is then perturbed in the direction of the sign of the accumulated gradient with a step size of \(\alpha\). MI-FGSM generates adversarial examples using the \(l_\infty\) and \(l_2\) metrics. Results show that the proposed approach significantly improves the transferability of adversarial examples and outperforms both FGSM and I-FGSM in black-box attacks.
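The momentum accumulation at the heart of MI-FGSM can be sketched as follows; the L1 normalization of the gradient and the decay factor follow the commonly described formulation, and per-example normalization is simplified here to a single image.

```python
# MI-FGSM sketch: gradients are normalized, accumulated into a momentum term,
# and the sign of the accumulated momentum drives each step of size alpha.
import torch
import torch.nn.functional as F

def mi_fgsm(model, x, y, epsilon, alpha, steps, decay=1.0):
    x = x.clone().detach()
    x_adv, momentum = x.clone(), torch.zeros_like(x)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        momentum = decay * momentum + grad / (grad.abs().sum() + 1e-12)  # accumulate
        x_adv = x_adv.detach() + alpha * momentum.sign()
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - epsilon), x + epsilon), 0.0, 1.0)
    return x_adv
```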

To improve generalization ability and avoid overfitting, which is a limitation of I-FGSM, Yu et al. (2020) propose two mini-batch approaches: the Mini-Batch Iterative Fast Gradient Sign Method (Mb-IFGSM) and the Mini-Batch Momentum Iterative Fast Gradient Sign Method (Mb-MI-FGSM). The proposed approach uses the gradient information of a subset of samples from a training set, selected to form a mini-batch, to update the current samples. Mini-batch images (\(x_1, x_2,\) ..., \(x_m\)) are generated by passing an input image through m parallel randomization layers. The input image is then updated iteratively by Mb-IFGSM, depending on the total gradient of all branches of the randomization layers. In the conducted experiments, the mini-batch gradient approach prevents overfitting by using the average gradient of the mini-batch samples, rather than a single sample, to better represent the optimal global direction.

The proposed method, known as Kryptonite, focuses on the targeted extraction and manipulation of the Region of Interest (RoI) in images to introduce imperceptible adversarial perturbations (Kulkarni and Bhambani 2021). Its main objective is to further enhance momentum-based adversarial attacks by monitoring and evaluating the changes occurring within the RoI. Kryptonite employs a region-of-interest extractor to track these relevant features and assess the attack’s progress. By optimizing the momentum applied to the I-FGSM method based on the observed changes, Kryptonite aims to highlight the increased vulnerability of images, particularly within the region of interest.

The approach proposed by Xiao and Pun (2021) aims to overcome the limitations of current gradient-based adversarial attack methods, which often result in noticeable pixel modifications and easily detectable changes in the generated adversarial examples. The authors present a novel method called the Constricted Iterative Fast Gradient Sign Method (CI-FGSM) to address these issues. CI-FGSM tackles the problem by reducing the accumulation of redundant perturbations, achieved through minimizing the influence of previous gradient-based entities during the crafting process. As a result, CI-FGSM generates adversarial examples with fewer pixel changes while maintaining their effectiveness.

The input-modification approach proposed by Papernot et al. (2016b) perturbs a pair of input features at a time, leading to a significant change in the network outputs. A matrix, defined as the Jacobian of the function learned by the DNN, is used to construct adversarial saliency maps that identify the notable features to be included in the perturbation. This method is called the Jacobian Saliency Map Attack (JSMA). The algorithm introduced for generating adversarial examples assumes that knowledge of the DL architecture and weight parameters is accessible. Once the adversarial saliency map has identified an input feature, it is perturbed to achieve the adversary’s goal. Success rate, distortion rate, the hardness metric, and human perception are used for evaluation. Due to the growth and update of the saliency map in each iteration, the proposed approach incurs a substantial computational cost to reduce distortion, as demonstrated by the findings.
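The saliency-map construction can be sketched for a single flattened input; the snippet below is a simplified illustration in which `model` is assumed to map a flat tensor of shape [n] to class scores of shape [num_classes], and the strict sign conditions of the original formulation are relaxed to non-strict ones.

```python
# Simplified JSMA-style saliency map: a feature is salient for target class t
# when increasing it raises the target score while lowering all other scores.
import torch

def saliency_map(model, x, target):
    # Jacobian of the class scores w.r.t. the (flat) input: shape [num_classes, n].
    jac = torch.autograd.functional.jacobian(model, x)
    d_target = jac[target]                   # dF_t / dx_i
    d_others = jac.sum(dim=0) - d_target     # sum over classes j != t
    admissible = (d_target >= 0) & (d_others <= 0)
    return torch.where(admissible, d_target * d_others.abs(),
                       torch.zeros_like(d_target))

# The feature with the largest saliency value would be perturbed first:
# i_star = saliency_map(model, x, target).argmax()
```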

To reduce the computational cost of JSMA while maintaining its effectiveness, Qiu and Zhou (2020) proposed measuring the sensitivity of each input element to the target classification. This identifies the significant elements that affect the final classification based on locally evaluated gradients (partial derivatives) of the logits. The most sensitive parts are then used to generate an adversarial example by perturbing the intensity of the chosen pixel, searching for the feature with the highest sensitivity score. The evaluation metrics used are the attack success rate (Dong et al. 2018) and SSIM.

To obtain more transferable targeted adversarial examples, Li et al. (2020) proposed computing the gradient with respect to a triplet loss function adapted to the targeted attack setting: the loss between the adversarial example and its target label is minimized, while the loss between the adversarial example and the actual label is maximized. To achieve this goal, a triplet consisting of the logits of the original image, the actual label, and the target label is used. The adversarial examples generated by the I-FGSM method are then updated with the gradient of the triplet loss.

The AB-FGSM approach was developed to overcome the common issue of low transferability in existing methods for generating adversarial examples (Wang et al. 2022). The authors were motivated by the remarkable performance of the AdaBelief optimizer in terms of convergence and generalization, and proposed integrating this optimizer into the I-FGSM method to enhance the efficiency and effectiveness of generating adversarial examples. They aimed to investigate whether the iterative AB-FGSM method could expedite the generation of adversarial examples in a white-box setting while improving the transferability of such examples in a black-box setting.

The DeepFool method introduced by Moosavi-Dezfooli et al. (2016) aims to change the prediction of an input by finding the closest distance from the original input to the decision boundary. The smallest adversarial perturbation lies in the direction of the nearest decision boundary, which is computed using a linear approximation to the decision boundaries. Adversarial examples are generated by computing the distance from an input x to a decision boundary between y and \(y^{\prime }\).

Carlini and Wagner (2017) assumed that an adversary has complete access to a neural network, including its architecture and all parameters. The objective function is formulated with a constraint on \(\delta\) such that \(0 \le x + \delta \le 1\) for all pixels. To optimize over this constraint, a new variable w is introduced. Attacks are constructed with the three \(l_p\) distance metrics. The proposed attack outperforms the FGSM, DeepFool, and JSMA attack approaches in finding better adversarial examples, since it looks for an adversarial example that is strongly misclassified with high confidence. Moreover, it is effective in breaking the robustness of models secured by defensive distillation, with a high transferability rate.
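The change of variable mentioned above can be illustrated as follows: optimizing an unconstrained variable w with x' = 0.5 (tanh(w) + 1) guarantees that the adversarial image stays in [0, 1]^n, and a constant c trades off distortion against the misclassification objective. The sketch shows a simplified untargeted L2 variant; the constant search and confidence parameter of the original attack are omitted.

```python
# Simplified Carlini-Wagner L2 sketch using the tanh change of variable.
import torch
import torch.nn.functional as F

def cw_l2(model, x, y, c=1.0, steps=100, lr=0.01):
    w = torch.atanh((2 * x - 1).clamp(-0.999, 0.999)).detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)                      # always in [0, 1]^n
        logits = model(x_adv)
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        other_logit = logits.masked_fill(
            F.one_hot(y, logits.size(1)).bool(), float("-inf")).max(dim=1).values
        distortion = ((x_adv - x) ** 2).flatten(1).sum(dim=1)  # squared l_2 distance
        objective = (true_logit - other_logit).clamp(min=0)    # margin-style loss
        loss = (distortion + c * objective).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()
```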

The approach proposed by Kwon et al. (2018), a random untargeted attack (RUA), generates an adversarial example \(x^{\prime }\) with a random untargeted label r. To identify the random untargeted \(x^{\prime }\), optimization problem 1 is formulated for the pretrained model f, original sample x, and true class label y:

$$\begin{aligned} \arg \min _{x'} J(x, x') \quad \text {s.t. } f(x') = r. \end{aligned}$$
(8)

For a given x, y, and r, a transformer is utilized to generate \(x^{\prime }\), which is then input to the model f. The procedure is repeated for t iterations until a random untargeted \(x^{\prime }\) is generated while minimizing the loss function. According to the authors, this method can generate a random untargeted adversarial example without pattern vulnerability, in which defending models could otherwise determine the original class by analyzing the output classes due to the similarity between the original class and specific classes.

The approach proposed by Chu et al. (2020) suggests using \(L_2\) regularization, originally used in neural network training to avoid overfitting, to restrict the gradient and reduce the magnitude of adversarial noise, thereby improving the quality of adversarial examples created by FGSM and BIM. The adversarial perturbation is estimated with respect to the added regularization. Two algorithms for adversarial generation are proposed by adopting the improved calculation of the adversarial perturbation: the Fast Gradient Norm Method (FGNM), based on FGSM, and the Peak Iteration Norm Method (PINM), based on the iterative gradient method BIM. Moreover, the proposed approach adopts two metrics, Peak Signal-to-Noise Ratio (PSNR) and SSIM, to identify similarities and differences between the original image and its corresponding perturbed image; for more details on these metrics, readers can refer to Wang et al. (2004). The proposed algorithms are evaluated by comparing them to FGSM and BIM using the PSNR and SSIM metrics. The results show that the PSNR of the adversarial images generated using FGNM and PINM is significantly higher than that of the images obtained using FGSM and BIM. Furthermore, FGNM and PINM can produce examples of higher quality; however, FGNM is unable to launch a successful attack.

In addition to gradient-based attacks, recent studies have demonstrated the effectiveness of feature-level methods in deceiving black-box models (Wang et al. 2021). These methods focus on distorting the intermediate representations of a specific network rather than perturbing the final classification distribution, and have proven highly successful in generating adversarial examples. To address the issue of domain overfitting in adversarial attacks, Huang et al. (2022) propose a novel method called DEFEAT (Decoupled Feature Attack). The authors acknowledge the limitations of current one-stage methods, which tend to overfit by estimating gradients and updating perturbations in a round-robin manner.

To overcome this drawback, DEFEAT introduces a two-stage approach that decouples perturbation generation from the optimization process. In the first stage, the learning stage, DEFEAT incorporates specific optimization strategies to create a lower-dimensional adversarial distribution consisting of diverse perturbations with higher loss values. In the second stage, the generation stage, DEFEAT samples noise from the learned distribution to construct adversarial perturbations. This iterative process aims to reduce the gap between different domains of feature-perturbed images.

5.1.1.2 Black-box threat model

The primary objective in creating adversarial examples in a black-box setting is to optimize objective functions in order to minimize the perturbations added to the original image, while also maintaining a moderate number of queries to estimate the gradient. Adversarial examples based on black-box approaches can be categorized according to the type of knowledge exploited for a query. These categories include transfer-based attacks, score-based attacks, and decision-based attacks.

In transfer-based attacks, the transferability of adversarial examples is exploited: the adversary trains a model to simulate the black-box model’s behavior. Decision-based attacks can be organized into boundary-based and evolutionary-based categories. In a boundary-based attack, an attacker generates adversarial examples by randomly choosing a descent direction to acquire a mutation that leads to misclassification. In contrast, an evolutionary-based attack utilizes an evolutionary algorithm as an optimization method to maximize the target class probability. However, evolutionary-based attacks are outside the scope of this research, since they adopt different procedures to generate adversarial examples; therefore, only references to the corresponding attack approaches are provided.

One of the transfer-based attacks is that proposed by Papernot et al. (2017). This attack involves training a local substitute DNN, close to the decision boundary of the target model, on a synthetic dataset in which the adversary creates the inputs and the outputs are the labels assigned by the target model. Adversarial examples are then constructed using the known substitute parameters.

As a boundary-based attack, the approach proposed by Song et al. (2019) depends on computing the boundary of a classification result by utilizing a combination of linear fine-grained and Fibonacci searches. A zeroth-order algorithm by Chen et al. (2017) is utilized to estimate the gradient, and the generated adversarial examples are perturbed in the gradient direction nearest to the boundary found by the fine-grained search algorithm. The distortion rate metric is examined under different numbers of queries and contrasting distortion rates.

In a score-based attack, an adversary can acquire the probabilities of the prediction results and use the approximate gradient to construct adversarial examples. Following this type of attack, Tabacof and Valle (2016) utilized the vector of probabilities \([p_1,\ldots , p_n]\) output by the classifier for an input image x. They formulated the optimization objective to minimize the perturbation \(\delta\) added to x such that its probability satisfies \(p=f(x+ \delta )\), while the constraint on \(\delta\) is that it lies within a specified pixel scale with lower limit L and upper limit U. Cross-entropy is adopted as the loss function between the probability p assigned by the classifier and the targeted perturbed probability \(p^{\prime }\). Based on the analysis by Tabacof and Valle (2016), a robust DL model is more susceptible to adversarial images than a weak, shallow classifier. As they stated, the nature of the noise affects the resilience of both adversarial and original images. The attack approach is evaluated by measuring the stability of the classifier.

The attack proposed by Narodytska and Kasiviswanathan (2017) utilizes access to the probability scores of the top-k class labels. K-misclassification is a new notion of misclassification in which the objective is to modify the input such that the network fails to return the valid label even among its top-k predictions. Formally, the goal is to misclassify an image x with true label y such that y is not included in the top-k class labels returned by some function F over the network output f(x). Consequently, the objective for the probability p of class y, assigned by the classifier, is formulated as \(\min (f_y (x^{\prime })=p_y)\). A greedy local search is then utilized, in which the current image is refined using its local neighborhood in each round. Compared to FGSM, the proposed approach modifies a tiny fraction of the pixels and adds less noise per image, whereas FGSM alters all pixels. In terms of the time needed to generate adversarial examples, the results demonstrate that FGSM takes a short time and yields high confidence scores for adversarial images.

Bhagoji et al. (2018) stated that an adversary requires query access to confidence scores to carry out gradient estimation attacks. However, since a gradient can be approximated directly from access to only the function values, Bhagoji et al. (2018) use the finite difference (FD) of logits to estimate the difference between the logit values for the true label y and the second most likely class \(y^{\prime }\). Since the number of queries needed per adversarial sample can be too large for high-dimensional inputs with FD, and inspired by the relation between gradients and directional derivatives for differentiable functions, the gradient is estimated for randomly selected groups of features instead of assessing a single feature at a time, which reduces the number of queries needed. The attack success rate, the average distortion (defined by the distance metric used between benign and adversarial samples), and the number of queries are used as evaluation metrics.
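Gradient estimation with finite differences over random feature groups can be sketched as follows; `score_fn` is an assumed callable that queries the black-box model and returns a scalar (for example, the logit of the true class minus that of the runner-up class).

```python
# Black-box gradient estimation via finite differences over random groups of
# coordinates: only model queries are used, never internal gradients.
import torch

def estimate_gradient(score_fn, x, num_groups=100, delta=1e-3):
    grad = torch.zeros_like(x)
    flat = grad.view(-1)
    perm = torch.randperm(flat.numel())
    for group in perm.chunk(num_groups):
        direction = torch.zeros_like(flat)
        direction[group] = 1.0
        d = direction.view_as(x)
        # Two queries per group estimate the directional derivative.
        flat[group] = (score_fn(x + delta * d) - score_fn(x - delta * d)) / (2 * delta)
    return grad
```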

An NP-Attack is introduced by Bai et al. (2023): a distribution-based black-box attack method that utilizes image structure information to characterize adversarial distributions and reduce query dependency. NP-Attack offers various optimization options by leveraging variables within the Neural Process (NP) framework. The deterministic variable optimization prioritizes local information, while the latent variable optimization focuses on global information. These optimization variants enhance the adaptability of NP-Attack and lead to distinct impacts on the placement of adversarial perturbations.

The Perceptual Quality-Preserving (PQP) black-box attack method is proposed by Gragnaniello et al. (2021) to preserve the quality and effectiveness of adversarial perturbations. The proposed approach incorporates innovative features, including the targeted injection of adversarial noise in “safe” regions that minimally impact image quality but significantly affect decision-making. Safe regions are identified by analyzing the local gradient of perceptual quality measures such as SSIM (Wang et al. 2004) or VIF (Sheikh and Bovik 2006). Additionally, to ensure compatibility with systems accepting only popular integer formats, perturbations and queries are constrained to 8-bit integer values.

The brute-force attack method (BFAM) is proposed for generating adversarial examples against ML-based systems in cybersecurity (Zhang et al. 2020). The generation process depends on the confidence scores output by the target classifier to determine which inputs should be modified. Then, driven by these confidence scores, key features are identified to determine which modifications contribute to generating adversarial examples that can fool the target classifier.

HopSkipJumpAttack is a method developed by Chen et al. (2020) that generates adversarial examples by observing only the output labels returned by the targeted model. To achieve this goal, the method estimates the gradient direction using binary information at the decision boundary.

The proposed Adam Iterative Fast Gradient Method (AI-FGM) involves an iterative process in which the gradient of the loss function with respect to the input image is computed. This gradient is then used to perturb the input image in the direction that maximizes the loss function. The AI-FGM algorithm incorporates additional components, such as a second momentum term and a decaying step size, commonly used in neural network training to improve convergence and performance on test sets (Yin et al. 2021). The primary objective of the AI-FGM algorithm is to enhance the transferability of adversarial examples, allowing it to generate adversarial instances capable of fooling multiple models even when the attacker has access to only one.

Evolutionary algorithm-based approaches are also used in the generation process. These algorithms are based on the iterative generation of potential solutions to a problem, resulting in the selection of the best solution (Eiben and Smith 2015). Several proposed attacks, such as those developed by Wang et al. (2018), Zhou et al. (2021), Su et al. (2019), Chen et al. (2019), Alzantot et al. (2019), and Lin et al. (2020), exploit this feature to obtain the best generation of adversarial perturbations.

5.1.1.3 GAN-based attack approaches

The previously discussed adversarial attack approaches are limited by the use of original images in the generation process: they seek to introduce an imperceptible perturbation \(\delta\) to an input x to generate the adversarial example \(x+\delta\) that can fool the targeted classifier. However, state-of-the-art generation approaches use generative adversarial networks (GANs) to create adversarial yet realistic examples. These approaches are not limited to an original sample and can generate new instances similar to the training data. This concept is inspired by generative modeling, which automatically detects and learns regularities or patterns in the input data so that the model can generate new examples resembling those drawn from the original dataset (Goodfellow et al. 2020). GANs achieve a sense of realism by combining two neural networks: a generator that learns to map noise z with distribution \(p_z (Z)\) to synthetic data that is as similar as possible to the training data X, and a discriminator that learns to distinguish real data from the generator’s output. Additionally, different GAN architectures are proposed depending on the input used by the generator to produce the fake samples; the samples can be generated by leveraging noise or an original sample, as shown in Fig. 4.

Fig. 4
figure 4

An illustration of the inputs that generator G may utilize to create sample \(x^\prime\). The samples can be generated either from noise z, from the neighborhood of z in latent space, or from the original sample x with loss \(J_G\)

The generator and discriminator are trained via the min–max value function V(G, D) (Goodfellow et al. 2014a), which is represented as follows:

$$\begin{aligned} \min _G \max _D V(G, D) = {{\mathbb {E}}}_{x \sim p_{\text {data}}(x)}[\log D(x)] + {{\mathbb {E}}}_{z \sim p_z(z)}[\log (1 - D(G(z)))] \end{aligned}$$
(9)

Here, the discriminator output D(x) estimates the probability that a real instance is real, while D(G(z)) estimates the probability that a fake instance is real. \({{\mathbb {E}}}_x\) denotes the expected value over all real data instances, while \({{\mathbb {E}}}_z\) denotes the expected value over all generated fake instances G(z). The gradients obtained by backpropagating through both the discriminator and the generator are used to update the generator's weights.
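A minimal PyTorch training step implementing the alternating optimization of Eq. (9) is sketched below. The binary cross-entropy formulation and the non-saturating generator objective are standard practical choices rather than details taken from any specific surveyed paper, and D is assumed to output a probability per sample.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real_x, z_dim=100):
    """One alternating update of the min-max game in Eq. (9), as a sketch."""
    batch = real_x.size(0)
    device = real_x.device
    ones = torch.ones(batch, 1, device=device)    # label "real"
    zeros = torch.zeros(batch, 1, device=device)  # label "fake"

    # Discriminator step: ascend log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch, z_dim, device=device)
    fake_x = G(z).detach()                        # do not backpropagate into G here
    d_loss = F.binary_cross_entropy(D(real_x), ones) + \
             F.binary_cross_entropy(D(fake_x), zeros)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: fool D, i.e. make D(G(z)) close to 1 (non-saturating form)
    z = torch.randn(batch, z_dim, device=device)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```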

One of the proposed GAN-based approaches that is not limited to the benign sample x for adversarial example generation is the auxiliary classifier GAN (AC-GAN) (Song et al. 2018). The AC-GAN approach uses gradient descent to search for a suitable noise vector \(z^*\) in the neighborhood of the original noise vector z, which is then combined with the class label y to generate an adversarial example \(G(z^*,y)\) for the target model. The approach incorporates an additional classifier C to assist in generating more diverse adversarial examples.
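A simplified sketch of this latent-space search is given below: gradient descent moves a copy of z toward a point whose generated image the target model assigns to an attacker-chosen class, while a projection keeps it in the neighborhood of the original z. The conditional generator signature, the projection radius, and the omission of the auxiliary-classifier constraint are assumptions made for brevity.

```python
import torch
import torch.nn.functional as F

def search_adversarial_z(G, f, z, y_source, y_target,
                         steps=200, lr=0.01, radius=0.5):
    """Gradient-descent search for z* in the neighbourhood of z (sketch only).
    G(z, y) is assumed to be a class-conditional generator and f the target
    classifier; the auxiliary-classifier constraint of Song et al. (2018) is
    omitted here for brevity."""
    z_star = z.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([z_star], lr=lr)
    for _ in range(steps):
        x_gen = G(z_star, y_source)                 # image conditioned on the source class
        loss = F.cross_entropy(f(x_gen), y_target)  # push f toward the chosen target class
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                       # stay within the neighbourhood of z
            z_star.copy_(z + (z_star - z).clamp(-radius, radius))
    return z_star.detach()
```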

The approach proposed by Khoshpasand and Ghorbani (2020) trains separate generators for each class of inputs to generate unrestricted class-conditional adversarial examples. The generator for a source class y, \(G_y\), is trained on samples with label y to generate samples that are not assigned to class y by the targeted model, i.e., \(f(G_y(z))\ne y\). To find a value of z that satisfies this target, the generated samples are made adversarial while preserving the similarity between the adversarial instances and actual instances of the source class. The effectiveness of the method was evaluated using human labeling to demonstrate its success in targeted attacks: three human labelers were asked to assign a ground-truth label to the generated adversarial examples. The results showed that the approach can generate adversarial examples with greater transferability than AC-GAN.

To generate transferable adversarial examples from random noise, the Adversarial Transferring on Generative Adversarial Nets (AT-GAN) approach was proposed by Wang et al. (2019a). The approach uses a GAN trained similarly to AC-GAN to learn the distribution of adversarial examples, and the generator is then transferred to attack the target classifier. Training consists of two stages: the first stage obtains the original generator G, which is then transferred to learn the distribution of adversarial examples and used to attack the targeted model.

The generator G introduced by Zhao et al. (2017) is trained to generate adversarial examples from a dense vector z, the latent representation of the input data x. A matching inverter \(I_y\) is trained separately to map input data samples to their corresponding dense representations. During training, the generator G is updated to generate adversarial examples that are misclassified by the target classifier, while the inverter \(I_y\) is updated to ensure that the generated adversarial examples remain close to their original data samples in terms of the dense representation. This approach improves the efficiency of adversarial example generation because it avoids computing the gradient of the loss function with respect to the input data.

The Adv-GAN approach proposed by Xiao et al. (2018) aims to generate adversarial samples from the true distribution of samples. In this approach, the generator G takes the original sample x as its input and produces a perturbed sample x+G(x). The generated sample is then sent to the discriminator D and the target model f. The discriminator tries to distinguish the perturbed data x+G(x) from the original data x and outputs the loss \(J_G\), while the model f takes x+G(x) as input and outputs its loss \(J_f\), which represents the distance between the prediction and the target class \(y^{\prime }\). The authors claim that because Adv-GAN generates adversarial instances from the underlying data distribution, it can produce more image-realistic adversarial perturbations than other attack strategies.
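The interplay of these losses can be sketched as follows, assuming the generator outputs the perturbation G(x) directly; the hinge bound and the equal weighting of the terms are illustrative assumptions rather than the exact formulation of Xiao et al. (2018).

```python
import torch
import torch.nn.functional as F

def advgan_generator_losses(G, D, f, x, target_class, bound=0.3):
    """Illustrative sketch of the loss terms combined in an Adv-GAN-style
    generator update; the bound and weighting are assumptions."""
    perturbation = G(x)                               # generator outputs the perturbation
    x_adv = torch.clamp(x + perturbation, 0.0, 1.0)   # perturbed sample x + G(x)
    d_out = D(x_adv)
    # GAN loss J_G: the perturbed sample should be judged "real" by D
    loss_gan = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    # Adversarial loss J_f: push the target model f toward the target class y'
    loss_adv = F.cross_entropy(f(x_adv), target_class)
    # Hinge term keeping the perturbation within a perceptibility bound
    loss_hinge = torch.clamp(
        perturbation.flatten(1).norm(p=2, dim=1) - bound, min=0.0).mean()
    return loss_gan + loss_adv + loss_hinge
```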

Instead of teaching the generator to produce samples close to the actual samples, Burnel et al. (2020) adopt a reweighting method to learn the generator's map between the latent space and adversarial samples from the true distribution. The reweighting approach assigns misclassified samples larger weights than correctly classified samples, and the weights are computed according to the predictions of a pre-trained classifier.

An approach proposed by Zhang (2019) constructs one-shot adversarial examples directly from original images without creating noise z, adopting an image-to-image translation architecture. After training the GAN model, the generator is used to minimize the loss function. The generated samples are fed to a targeted classifier, a discriminator, and an auxiliary classifier, where a copy of the discriminator's flattened-layer output is used as the classifier. This classifier predicts class labels for real and adversarial images, with its loss expressed as cross-entropy. As a result, the overall objective function is formulated in terms of the GAN loss, the classifier f loss, the auxiliary classifier loss, and a loss term that measures the distance between generated and real samples.

The method proposed by Mutlu and Alpaydın (2020) improves the generator of a standard GAN trained with the Wasserstein loss by incorporating an encoder component E. The encoder is implemented as a DNN with an additional loss term and learns the mapping from a sample x to its corresponding latent representation z (the inverse of the generator). The generated sample, the encoder output E(x), and z are then fed into the discriminator. By defining different loss criteria as hints, the encoder allows a reconstruction error to be calculated, leading to improved generation quality. The resulting bidirectional GAN directs the learning process toward a better generator.

Two conditional generative adversarial net-based methods are proposed to train a generative model by Yu et al. (2018). The first method, called conditional generative adversarial network for fake examples (CGAN-F), generates attack samples directly from Gaussian noise without source images. The generator of CGAN-F is fed by a conditional input, noise z combined with true label y, to generate \(x^{\prime }\). In contrast, the discriminator takes as input the original sample x, \(x^{\prime }\), and y to distinguish fake samples from the real ones. The second method, called conditional generative network for adversarial examples (CGAN-Adv), uses an encoder-decoder-decoder architecture. The first decoder generates the artificial image \(x^{\prime }\), and the encoder output is fed into another decoder to obtain the corresponding perturbation r. The objective function is expressed in terms of two losses: one loss is added to make the output similar to the source image by considering the average magnitude of the perturbation, and another loss guarantees that the result is identical to the source label.

Zhong et al. (2020) proposed GANs with Decoder–Encoder Output Noises (DE-GANs) to develop a better distribution of noises by teaching the model to acquire information about the image manifold from real images. The DE-GAN architecture uses a decoder–encoder structure to convert uninformative Gaussian noise into an informative noise vector, which is then fed into the GAN generator. The decoder is trained to map a random noise vector to a noise vector whose distribution is similar to that of the noise vectors obtained from the generator's output; this informative noise vector is then used as input to the generator, which outputs a generated image. The encoder maps the generated image back to its corresponding informative noise vector. By using the decoder and encoder, the authors aim to learn the mapping between informative noise vectors and generated images. This approach is shown to be effective in improving the quality of generated images and reducing mode collapse.

To address the limitation of some attack algorithms that cannot generate adversarial examples in batches, Deng et al. (2021) propose integrating the DCGAN architecture (Li et al. 2019) for batch generation. By leveraging the DCGAN architecture, the approach learns the distribution of adversarial examples generated by the FGSM algorithm. Training the DCGAN on adversarial examples produced by FGSM enables the generator to capture the distinctive features and distribution patterns of these adversarial instances, facilitating efficient batch generation.

To tackle the issue of easy detection in adversarial attacks, particularly in the black-box scenario where a large number of queries can raise suspicion, the Attack Without a Target Model (AWTM) was proposed by Yang et al. (2021). Unlike traditional approaches, the algorithm does not require querying the target model, since it does not aim at a particular model. Inspired by the structure of advGAN, AWTM modifies the generator G: instead of taking random noise as input, G takes a dense vector. The generator G performs two main tasks. First, it reconstructs a set of features v in a lower-dimensional space, guided by the discriminator D. Second, during the reconstruction process, G manipulates the output to cross the decision boundary of a random classifier f, assisted by the adversarial loss provided by f. To facilitate the training of G, an autoencoder structure formed by a mapper and a decoder is used: the mapper maps a normal sample x to a dense vector v within a high-dimensional feature space, and the generator G processes v to produce the adversarial example.

Table 5 classifies the articles discussed above and lists the key evaluation criteria for attack approaches in DL-based systems. The most fundamental context for adversarial attack approaches is the adopted threat model, which determines the amount of knowledge available for generating malicious attacks. In addition, the studies above are compared using common assessment aspects of adversarial attacks: the metrics used to evaluate attack success, the dataset used, success in the presence of a security policy, and attack success against real-world applications.

Table 5 Classification of recent studies and current evaluation parameters for adversarial attack approaches

Table 6 compares the reviewed studies of GAN-based attacks by employing the assessment aspects in GAN-based adversarial attacks. The key foundational aspects in GAN attack methodologies are influenced by the chosen architecture, which involves utilizing either the original sample or noise for generating adversarial examples. The parameters include the extent to which the generation process depends on the original samples and the incorporation of auxiliary architectures with the generator or discriminator.

Table 6 GAN-based adversarial examples architectures

In GAN-based adversarial attacks, it has been observed that GANs generating adversarial samples without relying on the original samples require additional architecture to enhance the generation process (Song et al. 2018; Khoshpasand and Ghorbani 2020; Wang et al. 2019a; Zhong et al. 2020). Since the attacker’s goal is to minimize multiple loss functions, these additional architectures increase the model’s complexity. On the other hand, when generating malicious samples with original sample dependency (Xiao et al. 2018; Burnel et al. 2020; Zhang 2019; Mutlu and Alpaydın 2020), there is a greater likelihood of producing image-realistic adversarial perturbations, resulting in adversarial examples that are resistant to various defenses.

On the other hand, methods that do not rely on the original sample for generating adversarial examples, such as the approach proposed by Song et al. (2018), are still limited to using noise. However, finding suitable noise for generating effective attacks can be a time-consuming process that requires hundreds of iterations, which in turn leads to slow generation speeds.

Based on the SLR conducted on AML approaches, the next section discusses a list of common characteristics of adversarial attacks. Understanding these common characteristics can help identify vulnerabilities and ultimately make a system more trustworthy.

5.2 What qualities do adversarial attacks share that could be considered when designing a trustworthy DL-based system? (RQ.2)

By interpreting the intuition behind these attacks, one can also use the resulting insights to develop an end-to-end defense framework that helps identify the extent to which a DL model can be trusted under adversarial settings. For example, Wickramanayake et al. (2021) used interpretations to generate efficient augmented data samples for training, improving both interpretability and model performance. The following is a discussion of the commonly reported characteristics of adversarial attacks and the limitations that might hinder the construction of an interpretable and trustworthy DL model.

  • Adversarial examples differ from normal examples in that they lie on a lower-dimensional manifold and produce different patterns of activation in later stages of the model.

  • The existence of adversarial examples highlights the gap between the real distribution of features and the learned distribution during training. Thus, these examples are often found in low-probability regions of the training distribution and lead to significantly different classification probabilities compared to untampered examples.

  • The adversarial examples also tend to be farther from the boundary of the task manifold compared to normal samples.

  • Adversarial attack samples are of lower quality than normal samples obtained in typical operating conditions.

  • The vulnerability of DL models to adversarial examples may be due to the decision boundary being too close to the normal data submanifold, allowing small perturbations to lead normal data across the boundary and trick the classifier.

  • Non-robust features in DL can also contribute to this vulnerability. Even simple linear models can be susceptible to adversarial examples if their input data has sufficient complexity.

  • The occurrence of adversarial examples may also be an inevitable result of the simplification techniques used to train DL models.

  • Analyzing the attacks may yield an interpretable model that explains or reveals the ways in which deep models make decisions, such as indicating the discriminative features used for model decisions (Ribeiro et al. 2016) or the importance of each training sample as a contribution to inference (Koh and Liang 2017). However, despite a plethora of proposed defense approaches, there are still limitations observed in the analyzed attacks that may hinder the construction of a secure and robust DL-based model.

  • Defensive approaches are evaluated under non-standard attacks in terms of the strength of the attacks and the magnitude of the added perturbations used to formulate them. Thus, such evaluations cannot indicate to what extent the model's accuracy will be degraded under adversarial settings. Moreover, as shown in this research, the majority of designed attacks were not evaluated against any security policy.

  • Most research has assessed the proposed approaches by calculating the ASR, while some studies have considered additional evaluation metrics such as generation speed (Kwon et al. 2018). However, these metrics are not modeled in a standardized way: generation speed, for example, is defined in some works as the average time required to generate one adversarial example at inference time, and in other studies, such as Kwon et al. (2018), as the time measured when generating 1000 malicious instances (a minimal sketch of two such metrics follows this list). As a consequence, defensive approaches are unlikely to be effective against unknown attacks, and a mechanism is needed for evaluating a model's trustworthiness so that accuracy can be maintained when the model is deployed in real-world applications.

  • On the other hand, some of the proposed attack approaches lack standard evaluation metrics to assess their effectiveness, making it challenging to analyze the vulnerabilities accurately. It is consequently difficult to compare the effectiveness of the proposed methods, particularly in the presence of security policies, and therefore to find blind spots in the adversarial examples. A standard evaluation metric should thus be adopted when the model is deployed in real-life applications. The absence of other performance indicators, such as model build time, misclassification rate, and precision, should be perceived as an essential constraint on evaluating classifier performance (Provost and Fawcett 2001).

  • Moreover, few studies have evaluated their attacks against security policies such as the adversarial training (AT) approach. This approach has shown its effectiveness against several types of attacks and is therefore used as a benchmark for evaluating the effectiveness of these attacks.

  • Most of the studied attacks have not been evaluated for their performance in real-world applications, which hinders the interpretability of the model's behavior in such scenarios. Quantifying a model's accuracy in the absence or presence of attack mechanisms can significantly contribute to defining the level of generalization of both attack and defense approaches. Moreover, the lack of studies quantifying the amount of perturbation each attack needs to achieve a high success rate is concerning. Paying attention to this issue may help establish a lower bound on the perturbation required for a given attack success rate while evaluating the effectiveness of defense models under a standard perturbation threshold.

  • Additionally, assessing a model’s resistance to perturbations of varying magnitudes determines the extent to which a successful prediction of unperturbed images can be guaranteed, especially for approaches that incorporate an auxiliary detector during the training phase.

  • A model's robustness should be measured on both small- and large-scale datasets, such as MNIST and ImageNet, to yield meaningful results for attack approaches. As observed in the literature, some proposed approaches show impressive results on small datasets but fail on large-scale ones, and some evaluations are conducted only on small datasets. Evaluating on both kinds of datasets can help create a generalized model that is not biased toward specific samples.
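As a concrete illustration of the standardization issue raised above, the sketch below computes the ASR together with the average per-example generation time under one fixed convention; the `model_predict` and `attack_fn` interfaces are assumptions, not taken from any surveyed paper.

```python
import time
import numpy as np

def attack_success_rate(model_predict, attack_fn, xs, ys):
    """Minimal sketch of two commonly reported metrics: attack success rate (ASR)
    and average per-example generation time. `model_predict(batch)` returns class
    scores and `attack_fn(x, y)` returns one adversarial example (assumed APIs)."""
    successes, times = 0, []
    for x, y in zip(xs, ys):
        start = time.perf_counter()
        x_adv = attack_fn(x, y)                       # craft one adversarial example
        times.append(time.perf_counter() - start)
        if model_predict(x_adv[None]).argmax() != y:  # success = predicted label flipped
            successes += 1
    asr = successes / len(xs)
    return asr, float(np.mean(times))                 # ASR and mean seconds per example
```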

The following section introduces a conceptual framework that considers the attributes of the examined attacks. By integrating the knowledge acquired from analyzing these attack attributes, the proposed framework seeks to bolster the system’s resilience and deliver predictions that are easily understandable and explicable.

6 Proposed conceptual model

Considering the characteristics of adversarial examples discussed earlier, we propose the Transferable Pre-trained Adversarial Deep Learning framework (TPre-ADL) to counter the impact of adversarial attacks. The proposed framework, illustrated in Fig. 5, incorporates four principal attributes of adversarial attacks: limited data availability, high dimensionality, test error sensitivity to noise, and the presence of non-robust features.

Fig. 5 The proposed Transferable Pre-trained Adversarial Deep Learning model (TPre-ADL)

As shown in Fig. 5, the proposed framework utilizes a GAN and a transfer learning mechanism. The GAN consists of two main components: a generator network and a discriminator network. The generator network takes a noise vector as input and uses convolutional layers to transform it into an image that matches the target distribution. The discriminator network, in turn, takes an input image and outputs the probability that the image is real (as opposed to generated by the generator); it typically uses convolutional layers to extract features from the image, followed by a series of fully connected layers that compute the final probability. The generator and discriminator networks are trained with an adversarial loss function that encourages the generator to produce images that the discriminator cannot distinguish from real images. Once this pre-model is trained, the proposed framework uses transfer learning to reuse the pre-trained generator and discriminator as a feature extractor and an anomaly detector, respectively, for the targeted classifier. Transfer learning is a technique in ML where knowledge gained from solving one problem is leveraged to improve performance on a different but related problem. The rationale behind using transfer learning in this framework is to transfer the learned representations and knowledge captured by the pre-trained models to the targeted classifier, and to train the targeted classifier on a low-dimensional representation of the data to mitigate the effect of adversarial examples. In this way, the targeted classifier benefits from the pre-trained generator's feature extraction capabilities and the pre-trained discriminator's anomaly detection capabilities. A minimal sketch of this arrangement is given below.
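The following is a hypothetical sketch of how the transferred components could be wired together, assuming the pre-trained generator has already been modified to accept images (as described in the transfer process below) and that the discriminator returns a realness score per sample; the layer sizes, threshold, and class names are illustrative only and are not specified by the framework.

```python
import torch
import torch.nn as nn

class TPreADLClassifier(nn.Module):
    """Hypothetical wiring of the TPre-ADL components: a frozen, image-accepting
    pre-trained generator acts as the feature extractor, the frozen pre-trained
    discriminator supplies an anomaly (realness) score, and only the classifier
    head is fine-tuned on the target data."""

    def __init__(self, pretrained_G, pretrained_D, feature_dim, num_classes):
        super().__init__()
        self.feature_extractor = pretrained_G    # assumed to map images to feature maps
        self.anomaly_detector = pretrained_D     # assumed to map images to a realness score
        for module in (self.feature_extractor, self.anomaly_detector):
            for p in module.parameters():
                p.requires_grad = False          # transferred knowledge is reused, not retrained
        self.classifier = nn.Sequential(         # the only part that is fine-tuned
            nn.Linear(feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x, anomaly_threshold=0.5):
        realness = self.anomaly_detector(x).squeeze(-1)  # low realness -> likely adversarial
        features = self.feature_extractor(x).flatten(1)
        logits = self.classifier(features)
        flagged = realness < anomaly_threshold           # flagged inputs can be rejected upstream
        return logits, flagged
```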

The transfer process includes modifying the pre-trained generator model and adding it as a layer at the beginning of the targeted classifier architecture. The targeted classifier is then fine-tuned on the targeted data, using the features extracted by the generator as its input features. The pre-trained discriminator is also combined with the classifier as an anomaly detection system to identify potential adversarial examples, preventing them from interfering with the classification process and improving the overall accuracy and robustness of the system. In summary, the proposed framework counters the effects of adversarial attacks by addressing the following characteristics:

  • Insufficient data: Adversarial attacks can exploit weaknesses in training data, especially when the dataset is small and lacks diversity. The framework utilizes a GAN-based model to overcome this limitation. The GAN is trained on a large, high-dimensional dataset, which helps improve the model’s robustness. Additionally, the GAN generates additional samples to augment the training data, enhancing diversity and reducing the risk of adversarial attacks.

  • High dimensionality: Adversarial attacks can be more effective on models trained on high-dimensional data, as small perturbations in input features can cause significant impacts. To make the model more resilient to additive noise, the proposed framework reduces the dimensionality of the data through a transfer learning technique that transfers the learned parameters from a pre-trained adversarial model to the target model. By reducing dimensionality, the model becomes less susceptible to minor variations in input features, thereby increasing its robustness against adversarial attacks.

  • Test error on noise: Adversarial attacks often involve adding small amounts of noise to the input data, causing the model to misclassify the sample. To mitigate this effect, the proposed framework trains the model to be more robust to noise. By incorporating noise during training, the model becomes more resilient to perturbations and performs better on noisy data, making it harder for adversarial attacks to succeed.

  • Non-robust features: Adversarial attacks can exploit non-robust features in the input data, making them challenging to identify and mitigate. In this framework, a pre-trained discriminator is used as an anomaly detector. The discriminator can identify non-robust features and flag them as potential vulnerabilities. Furthermore, the pre-trained generator from the GAN is utilized as a feature extractor to identify and remove these non-robust features. By eliminating such features, the overall robustness of the model is improved, making it more resistant to adversarial attacks.

The proposed framework offers several notable benefits in comparison to other defense approaches. Firstly, it does not depend on learning specific attack types to provide protection against them. Instead, it utilizes the GAN architecture to train on a diverse range of features extracted from a large dataset. This approach enhances the framework’s resilience and adaptability, making it capable of effectively countering various potential attacks. Additionally, our framework incorporates transfer learning techniques, eliminating the need for time-consuming and computationally expensive retraining from scratch. By leveraging transfer learning, we can train and fine-tune the model on a smaller dataset, enabling faster adaptation to new attacks and reducing the extent of retraining required. Moreover, our framework specifically addresses the common characteristics that adversaries exploit in conducting attacks. By mitigating these vulnerabilities and incorporating robust features during training, we significantly enhance the overall trustworthiness of the DL model, contributing to its reliability and security.

However, while the proposed framework addresses the limitations of insufficient data, it still relies on the quality and diversity of the training data. If the training data is biased, incomplete, or lacks diversity, the framework’s ability to improve robustness may be limited. Various hyperparameters, such as the architecture of the GAN, the selection of transfer learning techniques, and the choice of noise levels during training, can influence the performance and effectiveness of the framework. Finding the optimal configuration of these hyperparameters can be challenging and may require careful experimentation.

7 Recommendations and future work

For future work and recommendations, there are several areas that warrant exploration to enhance the proposed framework. Firstly, it is crucial to address limitations associated with dataset quality and diversity. It is recommended to investigate methods that can mitigate biases, incompleteness, or insufficient diversity in the training data. Secondly, optimization of the various hyperparameters within the framework is of utmost importance. A comprehensive analysis of the GAN architecture, transfer learning techniques, and noise levels during training should be conducted to enhance performance. Advanced optimization methods should be utilized to determine the optimal configuration of these hyperparameters. Thirdly, the establishment of standardized evaluation metrics and benchmarks specific to adversarial defense frameworks is highly recommended. This would facilitate fair comparisons among different approaches and provide a benchmark for future research.

Additionally, real-world deployments should be taken into consideration to evaluate the framework’s performance under diverse deployment conditions and resource constraints. By exploring these areas, the proposed framework can be further enhanced, contributing to the advancement of adversarial defense in the academic realm.

Moreover, to gain insight into the inner workings of AI models and effectively identify anomalous or adversarial behavior, incorporating explainable AI (XAI) techniques into the detection and mitigation of adversarial attacks plays a paramount role in ensuring the robustness and reliability of AI systems. By understanding the underlying mechanisms of adversarial attacks, XAI empowers us to design more secure and trustworthy DL models, effectively safeguarding against potential threats and vulnerabilities. In this context, XAI techniques serve critical roles in addressing adversarial attacks:

  • Feature attribution methods, such as LIME (Ribeiro et al. 2018) and SHAP (Lundberg and Lee 2017), help identify the most influential features or pixels in an input image contributing to the model’s decision. Analyzing the importance of these features provides insights into how adversarial perturbations manipulate them to deceive the model.

  • Rule extraction methods extract interpretable rules or decision trees from complex models targeted by adversarial attacks (Angelov and Soares 2020). This reveals the decision boundaries and rules exploited by adversarial perturbations.

  • Visualizing the decision-making process (Apley and Zhu 2020) of a model through techniques like saliency maps, class activation maps, or occlusion analysis reveals the influential regions in an input image, including those targeted by adversarial perturbations (a minimal gradient-saliency sketch follows this list).

  • Exploring counterfactual examples (Mothilal et al. 2020) involving slight modifications in the input sheds light on how these changes can lead to different model predictions. This highlights the vulnerability of the model to adversarial perturbations.
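As a minimal example of the visualization techniques listed above, the sketch below computes a plain gradient-based saliency map for an assumed differentiable `model`; Grad-CAM and occlusion analysis follow the same spirit but aggregate evidence differently.

```python
import torch

def saliency_map(model, x, target_class):
    """Gradient-based saliency: per-pixel influence on the target-class score.
    `model` is assumed to map a (1, C, H, W) image batch to class logits."""
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target_class]   # scalar score of the class of interest
    score.backward()
    # Importance per pixel: maximum absolute gradient across colour channels
    return x.grad.abs().max(dim=1)[0]
```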

By employing these XAI techniques, we can gain a deeper understanding of adversarial attacks and enhance the trustworthiness of DL models by mitigating their impact. There are numerous case studies demonstrating how improved explainability and transparency have directly bolstered the trustworthiness of AI systems. A notable instance is the application of XAI methods such as Grad-CAM in medical imaging, as detailed in the studies by Suara et al. (2023) and Raghavan et al. (2023). This method exemplifies how enhanced explainability can cultivate greater trust and reliability in AI systems, particularly in critical areas like healthcare diagnostics. Such techniques have played a pivotal role in augmenting the interpretability and dependability of AI-supported cancer detection. For example, the integration of prior knowledge into Grad-CAM has been shown to increase the accuracy of cancer detection in research involving biopsy images, thereby reducing misdiagnosis risks and significantly boosting result interpretability. The progress in Grad-CAM technology broadens its application across various medical imaging types, including MRI and PET scans, proving its efficacy as a dependable tool in cancer detection and diagnosis.

Furthermore, a specific research methodology includes the analysis of histopathology images to detect metastases in lymph nodes, crucial for cancer diagnosis and staging. This method involves training a model on an extensive dataset of histopathology images, using techniques like data augmentation and regularization to enhance performance. In this scenario, the application of Grad-CAM provides vital visual cues, highlighting the image areas that most influence the model's predictions. This visual interpretability is essential for pathologists and medical professionals, as it assists in deciphering the AI's logic and adds a crucial layer of verification to the AI's decision-making process, as discussed by Srinivasu et al. (2022).

8 Conclusion

In conclusion, this paper has bridged a notable gap in existing research on the trustworthiness of DL systems, particularly in the context of adversarial attacks. While current literature extensively explores defensive strategies against such attacks, our study has pinpointed a crucial deficiency in examining the root causes of DL models’ vulnerability to these adversarial manipulations. This narrow focus has been a significant barrier to the development of holistic and effective defense mechanisms. Our systematic analysis shed light on the inherent vulnerabilities within DL models that are exploited by adversarial examples. By unraveling these weaknesses, our research contributes to a more profound understanding of the success factors behind adversarial attacks and the strategies to counter them more efficiently. The introduction of the Transferable Pretrained Adversarial Deep Learning framework (TPre-ADL) represents a notable leap forward in this area. TPre-ADL not only remedies the shortcomings of existing defense tactics but also introduces a robust and ethical approach to augmenting the resilience of DL models against adversarial threats. This research highlights the significance of a balanced approach in adversarial machine learning, underscoring the necessity to comprehend both the mechanisms of attacks and the intrinsic vulnerabilities of models. This balanced perspective is vital for the continued development of DL systems that are technically adept, ethically responsible, and genuinely trustworthy. As the field of artificial intelligence progresses, it is crucial that research and development endeavors adhere to these principles, guaranteeing the creation of AI systems that are secure, dependable, and advantageous for society. Looking ahead, we advocate for ongoing research into the complexities of adversarial attacks and the vulnerabilities of DL models. It is through such detailed and comprehensive research that we can foresee and neutralize emerging threats, thereby charting a course toward a future where AI systems are as resilient as they are sophisticated.