1 Introduction

Object detection (OD) models are commonly used in computer vision across various industries, including autonomous driving (Khatab et al., 2021), security surveillance (Kalli et al., 2021), and retail (Melek et al., 2017; Fuchs et al., 2019; Cai et al., 2021). However, these models are susceptible to adversarial attacks, a significant threat that can undermine their performance and reliability (Hu et al., 2021). This vulnerability affects both existing and newly developed OD-based solutions, such as OD-based smart shopping carts. Since existing automated checkout solutions in retail often necessitate extensive changes to a store's facilities (as in Amazon's Just Walk-Out; Green, 2021), a simpler solution based on a removable plugin placed on the shopping cart and an OD model was proposed, both in the literature (Oh & Chun, 2020; Santra & Mukherjee, 2019) and by industry (see Footnote 1). This computer-vision-based approach is already replacing traditional bar-code systems in various supermarkets worldwide, signaling a shift toward an enhanced shopping experience. Nonetheless, the vulnerability of OD models to adversarial attacks (Hu et al., 2021) poses a potential risk to the integrity of such smart shopping carts and of similar OD solutions in other domains. For example, an adversarial customer could place an adversarial patch on a high-cost product (such as an expensive bottle of wine), causing the OD model to misclassify it as a cheaper one (such as a carton of milk). Given the rise in retail theft (Federation, 2022; Forbes, 2022), such attacks, if left unaddressed, would result in revenue losses for retail chains and compromise the solution's reliability.

Fig. 1 Adversarial theft detection when using X-Detect

The detection of such physical patch attacks, which are designed to deceive OD models into hallucinating a different object, presents a significant challenge in retail and similar OD-based settings, since: (1) the defense mechanism (adversarial detector) should raise an alert within an actionable time frame to prevent the theft from taking place; (2) the adversarial detector should provide explanations for the adversarial alerts raised, to prevent a retail chain from falsely accusing a customer of being a thief (Amazon, 2022); and (3) the adversarial detector should be capable of detecting unfamiliar threats (i.e., adversarial patches of different shapes, colors, and textures; Zhang et al., 2019). Several methods for detecting adversarial patches targeting OD models have been proposed (Chiang et al., 2021; Ji et al., 2021; Liu et al., 2022); however, none of them provide an adequate solution that addresses all of the above requirements.

We present X-Detect (see Footnote 2), a novel adversarial detector for OD models that is suitable for real-life settings. Given a new scene, X-Detect identifies whether the scene contains an adversarial patch, i.e., whether an alert needs to be raised (an illustration of its use is presented in Fig. 1). X-Detect consists of two base detectors: the object extraction detector (OED) and the scene processing detector (SPD), each of which alters the attacker's assumed attack environment in order to detect the presence of an adversarial patch. The OED changes the attacker's assumed machine learning (ML) task from object detection to image classification by combining an object extraction model and a customized k-nearest neighbors (KNN) classifier. The SPD changes the attacker's assumed OD pipeline by adding a scene preprocessing step to limit the effect of the adversarial patch. The two base detectors can be used as an ensemble or on their own.

We empirically evaluated X-Detect in both the digital and physical spaces using five different attack scenarios (including adaptive attacks). The evaluation was performed using a variety of OD algorithms, including Faster R-CNN (Ren et al., 2015), YOLOv8 (https://github.com/ultralytics/ultralytics), Cascade R-CNN (Cai & Vasconcelos, 2019), Cascade RPN (Vu et al., 2019), and Grid R-CNN (Lu et al., 2019). In the digital evaluation, we digitally placed adversarial patches on objects in the Common Objects in Context (COCO) dataset. In the physical evaluation, we chose to demonstrate X-Detect's capabilities in the retail domain and physically placed 17 adversarial patches on objects (products found in retail stores). Using those patches, we created more than 1700 adversarial videos that were recorded in a real smart shopping cart setup. For this physical evaluation, we created the Superstore dataset, an OD dataset tailored to the retail domain. Our evaluation results show that X-Detect can successfully identify digital and physical adversarial patches, outperforming state-of-the-art methods [Segment & Complete (Liu et al., 2022) and Ad-YOLO (Ji et al., 2021)] without interfering with detection in benign scenes, and provide explanations for the raised adversarial alerts, all without being pre-exposed to adversarial examples. The main contributions of this paper are as follows:

  • To the best of our knowledge, X-Detect is the first explainable adversarial detector for OD models; moreover, X-Detect can be employed in any user-oriented domain where explanations are needed.

  • X-Detect is a model-agnostic solution. By requiring only black-box access to the target model, it can be used for adversarial detection for any OD algorithm.

  • X-Detect supports the addition of new classes without any additional training, which is essential in retail where new items are added to inventory on a daily basis.

  • The resources created in this research can be used by the research community to further investigate adversarial attacks in the retail domain, i.e., the code implementation, Superstore dataset and the corresponding adversarial videos are available at the following link: https://github.com/omerHofBGU/X-Detect.

The remainder of the paper is structured as follows: Sect. 2 provides relevant background on adversarial patch attacks and the computer vision techniques used; Sect. 3 reviews related work; Sect. 4 outlines X-Detect's components and pipeline; Sect. 5 presents the evaluation and experimental settings; Sect. 6 presents the evaluation results and relevant insights; and finally, Sects. 7 and 8 present the discussion and the conclusions and future work, respectively.

2 Background

Adversarial samples are real data samples that have been perturbed by an attacker to influence an ML model's prediction (Chen et al., 2020; Chakraborty et al., 2021). Numerous digital and physical adversarial attacks have been proposed (Carlini & Wagner, 2017; Brown et al., 2017), and recent studies have shown that such attacks, in the form of adversarial patches, can also target OD models (Hu et al., 2021; Shapira et al., 2022). Since OD models are used for real-world tasks, those patches can deceive the model even in environments with high uncertainty (Lee & Kolter, 2019; Zolfi et al., 2021). An adversary that crafts an adversarial patch against an OD model may have one of three goals: (1) to prevent the OD model from detecting the presence of an object in a scene, i.e., perform a disappearance (hidden) attack (Song et al., 2018; Thys et al., 2019; Hu et al., 2021); (2) to allow the OD model to successfully identify the object in a scene (correct bounding box) but cause it to be classified as a different object, i.e., perform a creation attack (Zhu et al., 2021); or (3) to cause the OD model to detect a non-existent object in a scene, i.e., perform an illusion attack (Liu et al., 2018; Lee & Kolter, 2019). Examples of adversarial patch attacks that target OD models (and were used in X-Detect's evaluation) include DPatch (Liu et al., 2018) and the illusion attack of Lee & Kolter (2019), both of which craft a targeted adversarial patch with minimal changes to the bounding box.

To successfully detect adversarial patches, X-Detect utilizes four computer vision related techniques: (1) Object extraction (Kirillov et al., 2020)—the task of detecting and delineating the objects in a given scene, i.e., ‘cropping’ the object presented in the scene and erasing its background; (2) Arbitrary style transfer (Jing et al., 2019)—an image manipulation technique that extracts the style “characteristics” from a given style image and blends them into a given input image; (3) Scale invariant feature transform (SIFT) (Lowe, 2004)—an explainable image matching technique that extracts a set of key points that represent the “essence” of an image, allowing the comparison of different images. The key points are selected by examining their surrounding pixels’ gradients after applying varied levels of blurring; and (4) The use of class prototypes (Roscher et al., 2020)—an explainability technique that identifies data samples that best represent a class, i.e., the samples that are most related to a given class (Molnar, 2020).
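As an illustration of the SIFT-based matching described in (3), the following Python sketch (our own, based on the OpenCV library; it is not code from X-Detect) counts the key-point matches between two images that pass Lowe's ratio test; the file paths and the ratio threshold are placeholder choices.

import cv2

def count_sift_matches(path_a: str, path_b: str, ratio: float = 0.75) -> int:
    """Count SIFT key-point matches between two images that pass Lowe's ratio test."""
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    _, desc_a = sift.detectAndCompute(img_a, None)
    _, desc_b = sift.detectAndCompute(img_b, None)
    if desc_a is None or desc_b is None:
        return 0
    # For each key point, keep the match only if it is clearly better than the runner-up.
    pairs = cv2.BFMatcher().knnMatch(desc_a, desc_b, k=2)
    return sum(1 for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance)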

3 Related work

While adversarial detection in image classification has been extensively researched (Aldahdooh et al., 2022; Chou et al., 2020; Yang et al., 2020), only a few studies have been performed in the OD field. Moreover, it has been shown that adversarial patch attacks and detection techniques suited for image classification cannot be successfully transferred to the OD domain (Liu et al., 2018; Lu et al., 2017a, b). Adversarial detection for OD can be divided into two categories: patch detection based on adversarial training and patch detection based on inconsistency comparison. The former is performed by adding adversarial samples to the OD model's training process so that the model becomes familiar with the "adversarial" class, which improves the detection rate (Chiang et al., 2021; Xu et al., 2022). Adversarial training detectors can be applied either externally (the detector operates separately from the OD model) or internally (the detector is incorporated into the OD model's architecture). One example of an externally applied adversarial training detector is Segment & Complete (SAC) (Liu et al., 2022); SAC detects adversarial patches by training a separate segmentation model, which is used to detect and erase adversarial patches from the scene. In contrast, Ad-YOLO (Ji et al., 2021) is an internal adversarial training detector, which detects adversarial patches by adding them to the model's training set as an additional "adversarial" class. The main limitation of adversarial training detectors is that the detector's effectiveness is correlated with the attributes of the patches present in the training set (Zhang et al., 2019), i.e., the detector will have a lower detection rate for new patches. In addition, none of the existing detectors sufficiently explain the alerts they raise.

The inconsistency comparison detection approach examines the similarity between the target model's prediction and the prediction of another ML model, referred to as the predictor. Any inconsistency between the two models' predictions triggers an adversarial alert (Xiang & Mittal, 2021). DetectorGuard (Xiang & Mittal, 2021) is an example of a method employing inconsistency comparison; in this case, an instance segmentation model serves as the predictor. DetectorGuard assumes that a patch cannot mislead an OD model and an instance segmentation model simultaneously. However, by relying on the output differences of those models, DetectorGuard is only effective against disappearance attacks and is not suitable for creation or illusion attacks that do not significantly alter the object's shape. Given this limitation, DetectorGuard is not a valid solution for detecting such attacks, which are likely to be used against smart shopping cart systems.

4 The method

X-Detect's design is based on the assumption that the attacker crafts the adversarial patch for a specific attack environment (the target task is OD, and the input is preprocessed in a specific way), i.e., any change in the attack environment will harm the patch's capabilities. X-Detect starts by locating and classifying the main object in a given scene (the object most likely to be attacked) using two explainable-by-design base detectors that change the attack environment. If there is a disagreement between X-Detect's classification and that of the target model, X-Detect raises an adversarial alert. In this section, we introduce X-Detect's components and structure (illustrated in Fig. 2). X-Detect consists of two base detectors, the object extraction detector (OED) and the scene processing detector (SPD), each of which utilizes different scene manipulation techniques to neutralize the adversarial patch's effect on an object's classification. These components can be used separately (by comparing the selected base detector's classification to the target model's classification) or as an ensemble that benefits from the advantages of both base detectors (by aggregating the outputs of both base detectors and comparing the result to the target model's classification).

Fig. 2 X-Detect structure and components

The following notation is used: Let F be an OD model and s be an input scene. Let \(F(s)=\{O_b,O_p,O_c\}\) be the output of F for the main object in scene s, where \(O_b\) is the object’s bounding box, \(O_c\in C\) is the classification of the object originating from the class set C, and \(O_p\in [0,1]\) is the confidence of F in the classification \(O_c\).

4.1 Object extraction detector

The OED receives an input scene s and outputs its classification of the main object in s. First, the OED uses an object extraction model to eliminate the background noise from the main object in s. As opposed to OD models, object extraction models use segmentation techniques that focus on the object's shape rather than on other properties. Patch attacks on OD models change the object's classification without changing the object's outline (Liu et al., 2022); therefore, the patch will not affect the object extraction model's output. Additionally, by using object extraction, the OED changes the assumed object surroundings by eliminating the scene's background, which may affect the final classification. Then, the output of the object extraction model is classified by the prototype-KNN classifier, a customized KNN model. KNN is an explainable-by-design algorithm, which, for a given sample, returns the k closest samples according to a predefined proximity metric and uses majority voting to classify it. Specifically, the prototype-KNN chooses the k closest neighbors from a predefined set of prototype samples P drawn from every class. By changing the ML task to classification, the assumed attack environment is changed. In addition, using prototypes as the neighbors guarantees that the set of neighbors properly represents the different classes. We chose to use KNN rather than other classification algorithms since it is relatively straightforward for humans to understand why a prediction was made. The prototype-KNN proximity metric is based on the number of identical visual features shared by the two objects examined. The visual features are extracted using SIFT (Lowe, 2004), and the object's unique characteristics are represented by a set of key points. The class of the prototype that has the highest number of matching key points with the examined object is selected. We chose to use SIFT since its matching points simulate the human thinking process when comparing two images. The OED's functionality is presented in Eq. (1):

$$\begin{aligned} OED(s) = P_{KNN}\left(\max _{p_i\in P}\Big |SIFT\left(OE(s),p_i\right)\Big |\right) \end{aligned}$$
(1)

where \(P_{KNN}\) is the prototype-KNN, \(p_i \in P\) is a prototype sample, and OE is the object extraction model. The OED is considered explainable-by-design since: (1) SIFT produces an explainable output that visually connects each matching point in the two scenes; and (2) the k neighbors used for the classification can explain why the prototype-KNN outputs the predicted class.
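The following Python sketch illustrates the prototype-KNN logic of Eq. (1) under the assumption that the proximity metric is the SIFT match count; it is a simplified illustration (the prototype format, k, and the ratio threshold are example values), not the paper's implementation.

from collections import Counter
from typing import List, Tuple
import cv2
import numpy as np

def sift_matches(desc_a, desc_b, ratio: float = 0.75) -> int:
    """Number of SIFT descriptor matches that pass Lowe's ratio test."""
    pairs = cv2.BFMatcher().knnMatch(desc_a, desc_b, k=2)
    return sum(1 for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance)

def prototype_knn(extracted_object: np.ndarray,
                  prototypes: List[Tuple[str, np.ndarray]],  # (class name, prototype image)
                  k: int = 7) -> str:
    """Classify the extracted object by majority vote over the k prototypes sharing the most key points."""
    sift = cv2.SIFT_create(nfeatures=1000)
    _, obj_desc = sift.detectAndCompute(extracted_object, None)
    scored = []
    for cls, proto in prototypes:
        _, proto_desc = sift.detectAndCompute(proto, None)
        n = 0 if obj_desc is None or proto_desc is None else sift_matches(obj_desc, proto_desc)
        scored.append((n, cls))
    neighbors = sorted(scored, key=lambda t: t[0], reverse=True)[:k]   # the k "closest" prototypes
    return Counter(cls for _, cls in neighbors).most_common(1)[0][0]   # majority vote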

4.2 Scene processing detector

Like the OED, the SPD receives an input scene s and outputs its classification of the main object in s. First, the SPD applies multiple image processing techniques to s. The image processing techniques are applied to change the assumed OD pipeline by adding a preprocessing step, which limits the patch's effect on the target model (Thys et al., 2019; Hu et al., 2021). The effect of the applied image processing techniques on the target model needs to be considered, i.e., a technique that harms the target model's performance on benign scenes (scenes without an adversarial patch) would be ineffective. After the image processing techniques have been applied, the SPD feeds the processed scenes to the target model and receives the updated classifications. The classifications of the processed scenes are aggregated by selecting the class with the highest probability sum. The SPD's functionality is presented in Eq. (2):

$$\begin{aligned} SPD(s) = Arg_{m \in SM} \left(F\left(m(s)\right)\right) \end{aligned}$$
(2)

where \(m\in SM\) represents an image processing technique, and Arg aggregates the main object's classification probabilities from the outputs for the processed scenes. The techniques \(m\in SM\) and the aggregation function Arg are selected empirically, based on their effect on benign scenes (Sect. 5.4). The SPD is considered explainable-by-design since it provides explanations for its alerts: every alert raised is accompanied by the processed scenes, which serve as explanations-by-example, i.e., samples that visually explain why X-Detect's prediction changed.
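A minimal sketch of the SPD aggregation in Eq. (2) is given below; it assumes the target model is wrapped as a callable that returns per-class probabilities for the main object, and the listed manipulations and their parameters are illustrative placeholders rather than the values used in our experiments.

from typing import Callable, Dict, List
import cv2
import numpy as np

def spd_classify(scene: np.ndarray,
                 target_model: Callable[[np.ndarray], Dict[str, float]],
                 manipulations: List[Callable[[np.ndarray], np.ndarray]]) -> str:
    """Query the target model on each processed scene and sum the class probabilities."""
    totals: Dict[str, float] = {}
    for manipulate in manipulations:
        for cls, p in target_model(manipulate(scene)).items():
            totals[cls] = totals.get(cls, 0.0) + p
    return max(totals, key=totals.get)   # class with the highest probability sum

# Example scene manipulations (blur, sharpen, additive noise) with placeholder parameters.
example_manipulations = [
    lambda s: cv2.GaussianBlur(s, (7, 7), 0),
    lambda s: cv2.filter2D(s, -1, np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)),
    lambda s: np.clip(s.astype(np.float32) + np.random.normal(0, 25, s.shape), 0, 255).astype(np.uint8),
]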

5 Evaluation

5.1 Datasets

The following datasets were used in the evaluation: Common Objects in Context (COCO) 2017 (Lin et al., 2014)—an OD benchmark containing 80 object classes and over 120K labeled images.

PASCAL 2012 (Everingham et al., 2012)—an OD benchmark containing 20 object classes and approximately 10K labeled images. A subset of this dataset was used to further analyze X-Detect's low FPR.

Superstore—a dataset that we created, which is customized for the retail domain and the smart shopping cart use case. The Superstore dataset contains 2200 images (1600 for training and 600 for testing), which are evenly distributed across 20 superstore products (classes). Each image is annotated with a bounding box, the product’s classification, and additional visual annotations (more information can be found in the supplementary material). The Superstore dataset’s main advantages are that all of the images were captured by cameras in a real smart cart setup (described in Sect. 5.2) and it is highly diverse, i.e., the dataset can serve as a high-quality training set for related tasks.

5.2 Evaluation space

X-Detect was evaluated in two attack spaces: digital and physical. In the digital space, X-Detect was evaluated under a digital attack with the COCO dataset. In this use case, we used open-source pretrained OD models from the MMDetection framework’s model zoo (Chen et al., 2019). We used 100 samples related to classes that are relevant to the smart shopping cart use case in the COCO dataset to craft two adversarial patches using the DPatch attack (Liu et al., 2018), each of which corresponds to a different target class—"Banana" or "Apple." To create the adversarial samples, we placed the patches on 100 additional benign samples from four classes ("Banana," "Apple," "Orange," and "Pizza"). The test set used to evaluate X-Detect consisted of the 100 benign samples and their 100 adversarial samples. We note that the adversarial patches were not placed on the corresponding benign class scenes, i.e., an "apple" patch was not placed on "apple" scenes and a "banana" patch was not placed on "banana" scenes.
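For illustration, the sketch below shows one simple way an adversarial patch can be digitally pasted onto a benign scene at the main object's bounding box; the helper, its coordinate convention, and the centering choice are our own simplification, not the exact placement procedure used to construct the evaluation set.

import numpy as np

def place_patch(scene: np.ndarray, patch: np.ndarray, bbox: tuple) -> np.ndarray:
    """Overlay `patch` at the centre of `bbox` = (x1, y1, x2, y2); both arrays are HxWxC."""
    x1, y1, x2, y2 = bbox
    ph, pw = patch.shape[:2]
    top = max((y1 + y2) // 2 - ph // 2, 0)
    left = max((x1 + x2) // 2 - pw // 2, 0)
    out = scene.copy()
    region = out[top:top + ph, left:left + pw]            # may be clipped at the image border
    region[...] = patch[:region.shape[0], :region.shape[1]]
    return out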

Fig. 3 Examples of the various patches used in the physical space evaluation. The patches were printed in various sizes, and the smallest patch that effectively deceived the target model was chosen for our evaluation

In the physical space, X-Detect was evaluated in a real-world setup in which physical attacks were performed on real products from the Superstore dataset. For this evaluation, we designed a smart shopping cart using a shopping cart, two identical web cameras, and a personal computer (illustrated in Fig. 4a). In this setup, the frames captured by the cameras were passed to a remote GPU server hosting the target OD model. To craft the adversarial patches, we used samples of cheap products from the Superstore test set and divided them into two equally distributed data folds (15 samples from each class). Each adversarial patch was crafted using one of the two folds. The patches were crafted using the DPatch (Liu et al., 2018) and Lee & Kolter (2019) attacks. In total, we crafted 17 adversarial patches; examples of these patches are presented in Fig. 3. The patches were then printed in various sizes, ranging from \(200\times 200\) pixels to \(400\times 400\) pixels, and each patch was used to attack our smart shopping cart setup. Notably, we used the smallest patch that effectively deceived the target model when placed on the expensive item. Using those patches, we recorded 1700 adversarial videos in which expensive products bearing the adversarial patch were placed in the smart shopping cart. The test set consisted of the adversarial videos along with an equal number of benign videos (available at the link in Footnote 3).

5.3 Attack scenarios

Figure 4b presents the five attack scenarios evaluated and their corresponding threat models. The attack scenarios can be categorized into two groups: non-adaptive attacks and adaptive attacks. The former are attacks in which the attacker's threat model does not include knowledge about the defense approach, i.e., knowledge about the detection method used by the defender. X-Detect was evaluated on four non-adaptive attack scenarios that differ with regard to the attacker's knowledge: white-box (complete knowledge of the target model), gray-box (no knowledge of the target model's parameters), model-specific (knowledge of the ML algorithm used), and model-agnostic (no knowledge of the target model). The patches used in these scenarios were crafted using the white-box scenario's target model; they were then used to simulate the attacker's knowledge in the other scenarios (each scenario targets different models). For example, to simulate the model-specific threat model, we evaluated target models that use the same algorithm as the white-box target model but have different structures and weights.

We also carried out adaptive attacks (Carlini et al., 2019). In the context of adversarial ML, an adaptive attack is a sophisticated attack that is designed both to deceive the target model (in our case, mislead the object detector) and to evade a specific defense method that is assumed to be known to the attacker (in our case, X-Detect). This type of attack takes the defense parameters into account during the optimization process, with the aim of making the resulting patch more robust to the defense. It is important to note that an adaptive attack is considered more difficult to perform, since the attacker adds another component to the optimization process, i.e., a component that should ensure that the attack will not be detected by the known defense mechanism. We designed three adaptive attacks based on the Lee & Kolter (2019) attack (\(LK_{patch}\)), which is presented in Eq. (3):

$$\begin{aligned} LK_{patch} = Clip\left(P - \epsilon *sign\left(\bigtriangledown _p L(P,s,t,O_c,O_b)\right)\right) \end{aligned}$$
(3)

where P is the adversarial patch, L is the target model's loss function, and t are the transformations applied during training. Each adaptive attack is designed according to one of X-Detect's base detector settings: (1) using just the OED, (2) using just the SPD, and (3) using an ensemble of the two. To adjust the \(LK_{patch}\) attack to consider the first setting (1), we added a component to the attack's loss function that incorporates the core component of the OED, the prototype-KNN. The new loss component incorporates the SIFT algorithm's matching-point outputs for the extracted object (with the patch) and the target class prototype. The new loss component is presented in Eq. (4):

$$\begin{aligned} OE_{SIFT} = -norm\left(SIFT\left(OE(s,P),p_{target}\right)\right) \end{aligned}$$
(4)

where \(p_{target}\) is the target class prototype and norm is a normalization function. In this setting, we do not attempt to attack the object extraction model itself, since causing it to extract the wrong part of the image would omit the adversarial patch from the SIFT input, resulting in a low number of matching SIFT points with the target class and an unsuccessful attack.

Fig. 4 Experimental evaluation settings: smart shopping cart setup (a), attack scenarios (b)

To adjust the \(LK_{patch}\) attack to incorporate the second setting (2), we added the image processing techniques used by the SPD to the transformations in the expectation-over-transformation functionality t. The updated loss function is presented in Eq. (5):

$$\begin{aligned} P_{ASP} = Clip\left(P - \epsilon *sign\left(\bigtriangledown _p L(P,s,t,sp,O_c,O_b)\right)\right) \end{aligned}$$
(5)

where sp are the image processing techniques used. To adjust the \(LK_{patch}\) attack to incorporate the last setting (3), we combined the two adaptive attacks described above by adding \(OE_{SIFT}\) to the \(P_{ASP}\) loss.
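For concreteness, the following PyTorch sketch shows a single sign-gradient patch update of the form used in Eqs. (3) and (5); the detector_loss callable stands in for the target model's loss (including any adaptive terms such as \(OE_{SIFT}\)) and is an assumed interface rather than the paper's actual implementation.

import torch

def patch_update_step(patch: torch.Tensor, scene: torch.Tensor,
                      detector_loss, epsilon: float = 0.01) -> torch.Tensor:
    """One LKpatch-style iteration: step against the sign of the loss gradient, then clip."""
    patch = patch.clone().detach().requires_grad_(True)
    loss = detector_loss(patch, scene)      # loss of the target model on the patched scene
    loss.backward()
    with torch.no_grad():
        updated = patch - epsilon * patch.grad.sign()
        return updated.clamp_(0.0, 1.0)     # keep the patch within the valid pixel range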

5.4 Experimental settings

All of the experiments were performed on the CentOS Linux 7 (Core) operating system with an NVIDIA GeForce RTX 2080 Ti graphics card with 24GB of memory. The code used in the experiments was written in Python 3.8.2 using the PyTorch 1.10.1 and NumPy 1.21.4 packages. We used different target models, depending on the attack space and scenario in question. In the digital space, only the white-box, model-specific, and model-agnostic attack scenarios were evaluated, since they are the most informative for this evaluation. In the white-box scenario, a Faster R-CNN model with a ResNet-50-FPN backbone (Ren et al., 2015; He et al., 2016) in the PyTorch implementation was used (weights were taken from the torchvision library). In the model-specific scenario, three Faster R-CNN models were used: two with a ResNet-50-FPN backbone in the Caffe implementation, each with a different regression loss (IoU), and a third with a ResNet-101-FPN backbone (weights were taken from the torchvision library). In the model-agnostic scenario, a Cascade R-CNN model (Cai & Vasconcelos, 2019) and a Grid R-CNN model (Lu et al., 2019) were used, each with a ResNet-50-FPN backbone (weights were taken from the MMDetection library), along with a YOLOv8L model (https://github.com/ultralytics/ultralytics; weights were taken from the Ultralytics library). In the physical space evaluation, all five attack scenarios were evaluated. In the adaptive and white-box scenarios, a Faster R-CNN model with a ResNet-50-FPN backbone was trained for 40 epochs with the seed 42; the initial weights were taken from the torchvision library. In the gray-box scenario, three Faster R-CNN models with a ResNet-50-FPN backbone were trained (with the seeds 38, 40, and 44); the initial weights were taken from the torchvision library. In the model-agnostic scenario, a Cascade R-CNN model, a Cascade RPN model (Vu et al., 2019), and a YOLOv3 model (Redmon & Farhadi, 2018) were trained with the seed 42 and initial weights taken from the MMDetection library, and a YOLOv8L model (https://github.com/ultralytics/ultralytics) was trained with the seed 42 and initial weights taken from the Ultralytics library. The trained models are available at the link in Footnote 4. In the attacks performed, the learning rate was reduced automatically (on plateau), the batch size was set to one, and the patch was placed on the main object. The patch size was set to \(120\times 120\) pixels in the digital space and \(100\times 100\) pixels in the physical space. The transformations used in the \(LK_{patch}\) attack were brightness (in the range [0.8, 1.6]) and random rotations. Our code is available at the link in Footnote 5. All of the adversarial samples (i.e., scenes with an adversarial patch) were designed to meet the following requirements: (1) the target OD model still identifies an object in the scene (there is a bounding box); (2) the attack does not change the output's bounding box drastically; and (3) the attack changes the object's classification to a predefined target class or group of classes (e.g., 'cheap' products).
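The three requirements above can be checked automatically; the sketch below is a hedged illustration of such a check, where the IoU threshold used to decide whether the bounding box changed "drastically" is an example value and not one taken from the paper.

def iou(box_a, box_b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union else 0.0

def is_successful_adversarial_sample(benign_pred, adversarial_pred, target_classes,
                                     min_iou: float = 0.5) -> bool:
    """Each prediction is a (bbox, class) tuple for the main object; None means no detection."""
    if adversarial_pred is None:                                  # (1) an object is still detected
        return False
    (benign_box, _), (adv_box, adv_cls) = benign_pred, adversarial_pred
    return iou(benign_box, adv_box) >= min_iou and adv_cls in target_classes   # (2) and (3)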

In the experiments, X-Detect's components were set as follows. The OED was initialized with 10 prototypes from each class. To ensure that the prototypes were representative, we extracted the object associated with the prototype class. The SIFT implementation was taken from the OpenCV library and used with the default parameters, with the exception of a maximum of 1000 key points and four pyramid levels. The number of neighbors used by the prototype-KNN was set to seven. The scene processing detector used the following image processing techniques: blur (six in the digital space and 12 in the physical space), sharpen, random noise (0.35 in the physical space), darkness (0.1), and arbitrary style transfer. In the arbitrary style transfer technique, the evaluated scene served as both the input and the style image, i.e., the input scene was only slightly changed. The base detectors were evaluated separately and in the form of an ensemble. Two types of ensembles were evaluated: (1) a majority voting (MV) ensemble, which sums the probabilities for each class and returns the class with the highest sum; and (2) a 2-tier ensemble, which first applies the SPD, and if an alert is raised, passes the scene to the OED.
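The two ensemble modes can be summarized by the following sketch, in which spd, oed, and the per-class probability functions are assumed callables over a scene; it illustrates the decision logic only and omits implementation details.

from collections import defaultdict

def two_tier_alert(scene, target_model_class: str, spd, oed) -> bool:
    """2-tier ensemble: run the SPD first and escalate to the OED only if the SPD disagrees."""
    if spd(scene) == target_model_class:
        return False                              # SPD agrees with the target model: no alert
    return oed(scene) != target_model_class       # OED confirms or dismisses the alert

def majority_vote_alert(scene, target_model_class: str, spd_probs, oed_probs) -> bool:
    """MV ensemble: sum the per-class probabilities of both base detectors."""
    totals = defaultdict(float)
    for probs in (spd_probs(scene), oed_probs(scene)):
        for cls, p in probs.items():
            totals[cls] += p
    return max(totals, key=totals.get) != target_model_class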

To properly evaluate X-Detect, we compared it to two state-of-the-art adversarial detectors, Ad-YOLO (Ji et al., 2021) and SAC (Liu et al., 2022) (Sect. 3). Both detectors were implemented according to the details in their papers and their available code. To train the Ad-YOLO target models, we extended the datasets to include samples of the adversarial patch class, which consist of scenes with objects on which adversarial patches were placed. The patches used were crafted for this purpose alone and were not part of the original evaluation sets. The scenes were manually annotated with a bounding box and a classification and were added to the original training sets. Then, the models were trained on the extended training set using the algorithms and architectures of the original target models. The models were tested on scenes with objects containing similar patches and achieved 98% accuracy for the adversarial patch class. In SAC, an additional instance segmentation model (UNet) was trained to detect and remove the adversarial patches. Although SAC was originally designed to improve robustness against digital disappearance attacks, it can be adjusted to detect illusion attacks in the digital and physical spaces. We implemented SAC using the code provided with the paper and performed the required adjustment by comparing the target model's prediction before and after applying the SAC model, which is supposed to detect and remove the patch. For the digital evaluation, we used the original patches from the paper, since the digital evaluation was performed on the same dataset (COCO). For the physical evaluation, we fine-tuned the original SAC UNet model using additional patches that were crafted with the DPatch and Lee & Kolter attacks on our Superstore dataset.

Fig. 5 Digital space evaluation results

Table 1 Physical space evaluation results
Table 2 Adaptive attack detection
Fig. 6 X-Detect's explainable output. (A) OED output: matching points between the product and its most similar prototype in the benign and adversarial scenarios. (B) SPD output: the classifications for a given scene under various image manipulations

6 Experimental results

The attacks used in the digital space (see Sect. 5.2) reduced the mean average precision (mAP) in all attack scenarios by \(60\%\), i.e., the OD models' performance was substantially degraded. In X-Detect's evaluation, only the successful adversarial samples were used; \(95\%\), \(35\%\), and \(38\%\) of the adversarial samples were successful in the white-box, model-specific, and model-agnostic attack scenarios, respectively. Figure 5 shows the digital space evaluation results (performed on the COCO dataset) for X-Detect, Ad-YOLO, and SAC in the white-box, model-specific, and model-agnostic attack scenarios. The figure presents the detection accuracy (DA), true positive rate (TPR), and true negative rate (TNR) of X-Detect's components, Ad-YOLO, and SAC. In the model-specific and model-agnostic attack scenarios, the results presented are the means obtained by each detector (see Sect. 5.3), with standard deviations of 0.034 and 0.017, respectively. The figure shows that X-Detect outperformed Ad-YOLO and SAC on all of the performance metrics. X-Detect achieved the highest TPR with the OED, the highest TNR with the 2-tier ensemble (along with SAC), and the highest DA with the 2-tier ensemble.

To further analyze X-Detect's low FPR, we evaluated X-Detect on a subset of the PASCAL-VOC dataset in the model-agnostic attack scenario. In this evaluation, we employed the YOLOv8 target model, which yielded one of the highest FPRs in the COCO digital evaluation (0.15, 0.25, 0.15, and 0.06 for the OED, SPD, MV ensemble, and 2-tier ensemble, respectively). The evaluation includes 1648 samples, obtained by filtering out samples of non-relevant classes (classes that exist in the PASCAL-VOC dataset but not in the COCO dataset) and samples that the YOLOv8 target model predicted incorrectly. The PASCAL evaluation results showed that X-Detect's FPR was even better in most cases: 0.057, 0.32, 0.051, and 0.026 for the OED, SPD, MV ensemble, and 2-tier ensemble, respectively. These results further emphasize X-Detect's ability to avoid false alarms.

Furthermore, we evaluated the sensitivity of X-Detect to obfuscation in the input scene. In this evaluation, we used the COCO digital evaluation set and evaluated X-Detect's performance on scenes with different levels of obfuscation. In general, as the obfuscation level increases, the target model's performance decreases. Therefore, in this evaluation, we used the YOLOv8 target model, which had one of the best performances on the benign evaluation set (mAP of 0.238) among all of the target models. We gradually increased the obfuscation level, thereby increasing the effect on the target model, up to the point where the model's performance became unacceptable (more than \(70\%\) of the benign images predicted incorrectly). Each scene was first obfuscated and saved in its benign form, and then the adversarial patch was placed on it to create its adversarial form (i.e., the adversarial patch itself was not obfuscated). In addition, in each evaluation, only the obfuscated scenes that were correctly predicted by the target model were used. As expected, as the obfuscation level increased, the number of benign samples correctly predicted by the target model decreased, as did the number of successful adversarial samples. X-Detect's detection results indicate that both the TPR and the FPR slowly increased as the obfuscation level increased (a maximal TPR value of 1.0 and FPR value of 0.33). The increase in the FPR occurred in all configurations of X-Detect (the SPD, the OED, and the MV and 2-tier ensembles) and can be explained by the decrease in the target model's tolerance to scene manipulations (as performed by the SPD) and the decrease in the number of concrete SIFT key points (which are used by the OED). In contrast, while most of X-Detect's configurations maintained the same TPR (the SPD and the MV and 2-tier ensembles), the TPR of the OED increased. This, too, can be explained by the decrease in the number of concrete SIFT key points: as the number of key points decreases, the OED raises more alarms, resulting in both higher TPR and FPR values.

The attacks used in the physical evaluation (Sect. 5.2) decreased the models' performance substantially. In X-Detect's evaluation, only the successful adversarial samples were used; \(80\%\), \(80\%\), \(74\%\), \(33\%\), and \(28\%\) of the adversarial samples were successful in the adaptive, white-box, gray-box, model-specific, and model-agnostic scenarios, respectively. Table 1 presents the results of the physical space evaluation (performed on the Superstore dataset) for X-Detect, Ad-YOLO, and SAC in the white-box, gray-box, model-specific, and model-agnostic scenarios on six metrics: DA, TPR, TNR, false positive rate (FPR), false negative rate (FNR), and inference time (the detector's runtime for a single video). The results of the gray-box, model-specific, and model-agnostic scenarios are the means obtained by each detector, with standard deviations of 0.032, 0.044, and 0.024, respectively; the results in bold are the best results. The results in the table show that X-Detect in its different settings outperformed Ad-YOLO and SAC, with the exception of the inference time metric (\(\sim\)0.4 s). X-Detect achieved the highest TPR with the OED, the highest TNR (along with SAC) with the 2-tier and MV ensembles, and the highest DA with the 2-tier ensemble.

Note that the TPR results of Ad-YOLO and SAC across all experiments can be explained by their detection approach, i.e., they are supervised methods based on adversarial training. Their reliance on adversarial patches during training limits their effectiveness in detecting new, unfamiliar adversarial patches, unlike unsupervised methods such as X-Detect, which typically exhibit greater adaptability in recognizing novel threats.

We also evaluated X-Detect's performance in the adaptive attack scenario. Adaptive attacks require the addition of a component that enables the attack to evade the known defense mechanism (Sect. 5.3). The three main goals of an adaptive attack are: (1) to converge; (2) to deceive the target model (i.e., the object detector); and (3) to evade the given defense (i.e., X-Detect). The first goal refers to the attack's ability to reduce the loss during the optimization process despite the incorporation of the additional defense parameters. If the attack does not converge, the employed defense mechanism is effective, i.e., the attacker cannot compute a patch that deceives the target model and, at the same time, evades the defense method. Therefore, when the attack's optimization process cannot converge, the defense parameters are complex enough to constrain the optimization process, causing the adaptive attack to fail. Assuming that the attack did converge, it may still fail to achieve the second goal of deceiving the target model, since the gradient landscape of the optimization process becomes more complex and restricted when the defense parameters are taken into account. Finally, the third goal is to evade the given defense more effectively than the original attack. To summarize, when comparing the original attack with the adaptive attack, given an effective defense mechanism, we expect that it will be harder for the adaptive attack to converge and that it will be less successful in fooling the target model, but that it will be detected less often by the defense mechanism.

Fig. 7 The physical adaptive attack crafting results

In our evaluation, we attempted to incorporate the exact defense mechanism knowledge within the attack, i.e., the prototype-KNN classifier and the scene processing techniques used by X-Detect. However, the attacks did not converge when the original parameters were used (as shown in Fig. 7), i.e., the attacks failed to produce a patch that deceived both the target model and the defense method. This demonstrates the effectiveness of X-Detect. In the OED adaptive attack, this can be explained by the low number of matching points between the attacked item and the target item's prototypes in the areas where the patch did not cover the item. This outcome also negates the possibility of a successful targeted attack against the object extraction model (on which the OED relies), since extracting a different region of the scene would likewise produce a low number of matching points with the target object. In addition, in the SPD adaptive attack, the scene processing techniques used by X-Detect (the level of random noise and blurring) caused a dramatic and contrastive change to the scene.

To further evaluate the effectiveness of X-Detect, we used more relaxed versions of X-Detect's components within the adaptive attack (to allow the attack to converge). Specifically, we adjusted the \(OE_{SIFT}\) component to have less effect on the complete loss by applying a log to the prototype-KNN classifier's output and using the NLLLoss function. Additionally, we decreased some of the parameters of the SPD's image processing techniques (the blurring from 6 to 1 and the random noise from 0.35 to 0.25), which caused a less dramatic alteration of the scene. Table 2 presents X-Detect's TPR and TNR in the adaptive attack scenario. The results indicate that while these adaptive patches partially deceived the target model, they did not succeed in evading X-Detect, i.e., X-Detect successfully detected most of the adversarial samples while maintaining a TPR of at least \(92\%\). This shows that evading X-Detect is improbable, even when some of its parameters are relaxed. Therefore, X-Detect succeeds in detecting adaptive adversarial patches. Figure 6 presents examples of X-Detect's explainable outputs, which justify the alerts raised.

7 Discussion

When analyzing the attack’s behavior in the crafting phase, we observed that the patch location influenced the attack’s success, i.e., the success rate improved when the patch was placed on the attacked object. Furthermore, in the physical evaluation, we observed that the attacks are sensitive to the attacker’s behavior, such as the angle, speed of item insertion, and patch size.

When analyzing the attacks in the different attack scenarios, we observed that as the knowledge available to the attacker decreased, the number of successful adversarial samples also decreased. The reason for this is that the shift between attack scenarios relies on the patch's transferability. Therefore, the adversarial samples that succeed in the most restricted attack scenario (model-agnostic) can be considered the most robust. Those samples are harder to detect, as reflected in the performance of X-Detect in its different settings. However, all of the results reflect the detection of a single attack; in the shopping cart use case, an attacker would likely try to steal as many items as possible. When considering the occurrence of repeated attacks and their sensitivity to the attacker's behavior, X-Detect's ability to expose the attacker increases exponentially.

The expected overhead in inference time when using X-Detect is an important consideration in real-time applications, i.e., the time added to the OD model's inference time (referred to as \(T_{OD}\)). When examining the expected inference time overhead, we can see differences between the two base detectors. The OED's object extraction model has the same inference time as the OD model and can be run in parallel, i.e., the object extraction model does not add time to the inference process. The OED's subsequent classification model, the prototype-KNN classifier, cannot be run in parallel with the OD model and adds \(T_{KNN}\) to the inference time. Note that \(T_{KNN}\) is smaller than \(T_{OD}\) given the nature of its internal algorithms (the KNN classification and the SIFT feature extraction algorithms). Therefore, the OED's added time for a single scene can be estimated as \(T_{KNN}\). On the other hand, the SPD adds more substantial time to a single inference. The SPD repeatedly queries the OD model with the manipulated scenes, i.e., |SM| times, where SM is the set of manipulated scenes. Since the application time of each image processing technique is negligible, the SPD's added time for a single scene can be estimated as \(|SM| \times T_{OD}\). Note that the time required to aggregate the predictions is relatively short and negligible.
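Combining these estimates, the added inference time of each configuration can be summarized as follows (an illustrative restatement of the estimates discussed in this paragraph, assuming the base detectors of an ensemble run sequentially; it is not a formal result from the evaluation):

$$\begin{aligned} \Delta T_{OED}&\approx T_{KNN},\qquad \Delta T_{SPD} \approx |SM| \cdot T_{OD},\\ \Delta T_{MV}&\approx |SM| \cdot T_{OD} + T_{KNN},\qquad \Delta T_{2\text {-}tier} \approx |SM| \cdot T_{OD} + T_{KNN}\ \text {(only when the SPD raises an alert)} \end{aligned}$$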

In the evaluation, four detection approaches were used: only the OED, only the SPD, an MV ensemble, and a 2-tier ensemble; however, the question of which approach is the most suitable for adversarial detection remains unanswered. Each of the approaches showed different strengths and limitations: (1) the OED detected most of the adversarial samples (high TPR) yet raised alerts on benign samples (high FPR); (2) the SPD detected fewer adversarial samples than the OED (lower TPR) yet raised fewer alerts on benign samples (lower FPR); (3) the MV ensemble reduced the gap between the two base detectors' performance (higher TPR and lower FPR) yet had a longer inference time; and (4) the 2-tier ensemble reduced the MV ensemble's inference time and improved the identification of benign samples (higher TNR and lower FPR) yet detected fewer adversarial samples (lower TPR). Therefore, the selection of the best approach depends on the use case. In the retail domain, it can be assumed that: (1) most customers would not use an adversarial patch to shoplift; (2) wrongly accusing a customer of shoplifting would result in a dissatisfied customer and harm the company's reputation; and (3) a short inference time is vital in real-time applications like the smart shopping cart. Therefore, a company that prioritizes the customer experience would prefer the 2-tier ensemble approach, while a company that prioritizes revenue above all would prefer the OED or MV ensemble approach.

8 Conclusion and future work

In this paper, we presented X-Detect, a novel adversarial detector for OD models that is suitable for real-life settings. In contrast to existing methods, X-Detect is capable of: (1) identifying adversarial samples in near real time; (2) providing explanations for the alerts raised; and (3) handling new attacks. Our evaluation in the digital and physical spaces, which was performed using a smart shopping cart setup, demonstrated that X-Detect outperforms existing methods in the task of distinguishing between benign and adversarial scenes in the four attack scenarios examined while maintaining a 0% FPR. Furthermore, we demonstrated X-Detect's effectiveness under adaptive attacks. However, it is crucial to acknowledge X-Detect's possible limitations: (1) the OED requires prototype samples to be set for every class in advance, which might not always be possible; and (2) in cases where there is no clear prioritization between false alarms and missed alarms, all of the detection approaches should be examined before selecting the most suitable one. Future work may include applying X-Detect in different domains and expanding it to the detection of other attacks, such as backdoor attacks.