1 Introduction

Single-object visual tracking is a fundamental computer vision task that aims to follow the movement of a single target object within a scene in real time [1]. It has numerous applications, for example, video surveillance, robotics, and autonomous driving. Visual tracking typically involves localizing an object in the first frame of a video and then tracking it through subsequent frames; representative trackers include SiamFC [2], SiamRPN [3, 4], SiamRPN++ [5], ATOM [6], DiMP [7], LightTrack [8], and CHASE [9]. However, the task remains challenging due to factors such as changes in lighting conditions, occlusion, and background clutter.

In recent years, the performance of visual tracking models has been threatened by adversarial attacks, for example, FGSM [10], iFGSM [11], and miFGSM [12]. These attacks are designed to manipulate the input data and cause the model to malfunction, resulting in incorrect object tracking. Adversarial attacks are a growing concern in the field of machine learning, as they can undermine the accuracy and reliability of models in critical applications.

Visual tracking attacks involve generating adversarial noise that is added to the frames, causing the tracking model to fail [13,14,15,16,17,18]. The attackers use sophisticated tactics, such as generator models that inject noise into the template or search frames. The perturbations produced by visual object tracking (VOT) attackers are particularly alarming: models such as CSA [15], FBA [16], and DFA [17] can generate adversarial noise in real time, completely disrupting top trackers.

To address this challenge, researchers have proposed various countermeasures to improve the robustness of deep learning models against adversarial attacks [19, 20]. However, these approaches are often ineffective in visual tracking due to fundamental differences between the network structures of classification and tracking models. A classification model centers on feature extraction, whereas a visual tracker comprises three main blocks: feature extraction, template matching, and a region proposal network. Although the feature extraction block is the most important component across deep learning applications, top trackers are robust to subtle noise, as shown by the perturbation analysis in our ablation study: random impulse and Gaussian noise only slightly decrease tracking performance, whereas the adversarial noise generated by CSA [15] and DFA [17] has a lower MAE than random noise yet disrupts performance far more severely.

To overcome this limitation, we propose a novel defense model for visual tracking attacks, specifically designed to denoise adversarial noise before frames are fed to the tracking model. Our approach leverages denoising techniques to remove adversarial noise and improve the quality of the input data, ensuring that the tracking model receives clean and accurate data and making it more robust against adversarial attacks. We further propose a feature-wise defender built on deep learning, leveraging convolutional neural networks (CNNs) for feature extraction and denoising. The proposed defense model is trained on a large dataset of clean and adversarial images; it learns to distinguish between the two and to denoise the adversarial images, improving the performance of the visual tracking model. We evaluate the effectiveness of our defense model on several benchmark datasets and demonstrate that it significantly improves the robustness of visual tracking models against adversarial attacks.

Fig. 1 Comparison of tracking heatmaps for clean images, adversarial noise images, and our denoised images

Our proposed method’s effectiveness is demonstrated in Fig. 1 through a comparison of search-region frames and their corresponding heatmaps for a clean tracker, an adversarial noise attack, and our proposed defensive model. The contributions are described as follows:

  (1) We employ a pixel-wise denoiser and a feature-wise defender to effectively handle suspicious frames when integrated into the target tracker. The objective loss functions handle both clean and adversarial pairs, covering both scenarios.

  (2) Our model successfully defends against adversarial attacks in single-object tracking across various benchmark datasets, such as VOT2018, OTB100, and LaSOT. We introduce an innovative network architecture, called MUNet, which minimizes computational resources while preserving tracking robustness.

  (3) In an additional experiment, MUNet successfully transfers its defensive capability to other top trackers, namely DiMP and DaSiam. The resulting performance closely matches that of the original top trackers, as demonstrated on the OTB100 dataset.

2 Related Works

2.1 Visual-Tracking-Based Siamese Network

In the past decade, the field of object tracking has seen significant advancements in accuracy and speed. One of the earliest successful approaches was the Siamese fully convolutional network (SiamFC) [2] that learned the similarity between pairs of images, allowing it to track an object in a video sequence. Expanding on the success of SiamFC, the Siamese region proposal network (SiamRPN) [3] improved the accuracy of the tracking algorithm by introducing a region proposal network [21] that generated candidate object locations.

SiamRPN++ [5] further improved upon SiamRPN by introducing a deeper and wider backbone network based on ResNet-50 [22], along with a multi-scale feature fusion mechanism to handle scale variations in the tracked object. Another extension to the Siamese architecture is the Siamese Box Adaptive Network (SiamBAN) that uses an attention mechanism to weight the feature maps based on their relevance to the object being tracked, thereby improving the accuracy of bounding box predictions. SiamMask [4] expanded the capabilities of the Siamese architecture to include instance segmentation in addition to object tracking. This was accomplished by incorporating a mask branch into the network that generated pixel-level segmentation masks for the tracked object.

To achieve real-time tracking on CPU devices, the Siamese Cascaded Architecture for Real-time object tracking (SiamCAR) [23] uses a lightweight feature extractor and a proposal generator to achieve high accuracy and speed. SiamFC++ [24] further extends the SiamFC architecture by upgrading the head network structure and adding a regression branch to increase model accuracy. Moreover, it replaces the AlexNet [25] backbone with a GoogLeNet [26] backbone, achieving superior performance with fewer computing resources.

In addition to these architecture-based approaches, neural architecture search (NAS) is a method for automatically designing deep neural network architectures for visual tracking tasks. DARTS [27] optimizes the network architecture with a gradient-based differentiable search strategy, yielding high-performance networks with reduced manual effort. CHASE [9] proposed a robust visual tracker that adapts backbone features to the objective of Siamese tracking networks via a cell-level differentiable neural architecture search mechanism, with early stopping to avoid overfitting. Although NAS-based VOT achieves high tracking performance, NAS requires substantial computing resources for model training.

2.2 Adversarial Attacks for Single Visual Tracking

Recently, adversarial attacks on VOT have received increasing research attention as a means of probing top trackers. Adversarial attacks are not limited to classification tasks; in visual tracking, researchers have sought to generate imperceptible noise that disrupts object tracking. Online learning attacks have been proposed that can successfully disturb top trackers, for example, the Hijacking tracker [13], one-shot adversarial attacks [14], and the IoU attack [18]. However, online training is impractical in visual tracking due to its time-consuming nature; for example, the IoU attack [18] takes 0.625 s per frame at inference time, as illustrated in Table 5. For offline training, Yan et al. [15] proposed an attack trained offline against SiamRPN++ [5] on the GOT-10k dataset [28], named the cooling-shrinking attack (CSA). This attacking model exploits a tracking weakness by decreasing the density of the target heatmap, misleading the predicted bounding box into shrinking. The feature-based attack (FBA) [16] and the diminishing-feature attack (DFA) [17] were introduced to attack only the template frame by minimizing the variance of its feature channels. This perturbation leads to tracker malfunction on both template and search frames. Moreover, DFA transfers its attacking effect to unseen top trackers without learning from their parameters.

In the domain of adversarial attacks on image classification, notable works such as DUNET (defense against adversarial attacks using high-level representation guided denoiser) [19] have demonstrated success against both whitebox and blackbox attacks. However, these defensive mechanisms do not transfer to single-object tracking. In single-object tracking, the model must feed template and search frames at different scales into the tracker simultaneously, whereas DUNET can generate only a single, fixed-size denoised image. Moreover, a denoising model alone fails to defend against adversarial attacks in tracking scenarios: our ablation study demonstrates that the deep denoiser prior (DPIR) [29] fails to defend against DFA [17] at every denoising level.

It is evident that adversarial attacking models pose a significant threat to visual tracking systems. Therefore, investing in and promptly developing adversarial defense mechanisms is imperative to create models that can robustly counter adversarial attacks in tracking scenarios.

We propose an adversarial defense for Siamese network-style visual tracking, named multi-model UNet (MUNet), which defends suspicious images, whether or not they have been infiltrated by adversarial noise, before feeding them to the tracker. MUNet comes in five variants with 3, 4, 5, 6, and 7 pairs of layers; the variants differ in parameter size and inference speed according to the number of layer pairs. The defending performance of our largest model is outstanding, while our fewer-layer models are also efficient and come very close to the largest model.

3 Defensive Model Methodology

3.1 Constrained Optimization of Adversarial Defense

An adversarial attack aims to maximize the loss function \(\ell \) of a machine learning function f(.) between the adversarial data \(x' = x + \epsilon \) and the output target y by adding a perturbation \(\epsilon \) within the norm ball \(B_{\epsilon }(x)\). The attack can be defined as

$$\begin{aligned} x' = \text {argmax}_{x' \in \textit{B}_{\epsilon }(x)} \ell (f(x'),y). \end{aligned}$$
(1)

For defensive purposes, an adversarial defense can be defined in two ways. The first is to minimize the loss function \(\ell \) between the predicted defensive data \(\tilde{x}\) and the clean data x by removing the predicted noise \(\delta \) from the adversarial data \(x'\) using a denoising model generator D(.), as

$$\begin{aligned} \tilde{x} = \text {argmin}_{ \tilde{x} \in \textit{D}_{\delta }(x') } \ell (\tilde{x}, x) . \end{aligned}$$
(2)

The second definition is to minimize the loss function \(\ell \) of the machine learning function f(.) applied to the predicted defensive data \(\tilde{x}\) with respect to the output target y. A defensive model generator D(.) generates a predicted defensive noise \(\delta \) and subtracts it from the adversarial data \(x'\), as described in Eq. (3).

$$\begin{aligned} \tilde{x} = \text {argmin}_{ \tilde{x} \in \textit{D}_{\delta }(x') } \ell (f(\tilde{x}), y). \end{aligned}$$
(3)

3.2 Pixel-Wise Denoiser and Feature-Wise Defender

Liao et al. [19] proposed a defensive model, the pixel-guided denoiser, to defend against both whitebox and blackbox attackers in classification models. DUNET inspires us to design a loss function named the pixel-wise denoiser (PD) to defend against adversarial attacks in single visual tracking, which requires predicting both the denoised template \(\tilde{z}\) and the denoised search \(\tilde{x}\) regions before they are fed to the Siamese head. Making the task more challenging, the PD must denoise an adversarial image while leaving a clean image unchanged. Let \(D_{\delta }(.)\) be a defensive model that processes both the adversarial template \(z'\) and the adversarial search \(x'\) regions simultaneously. In particular, the defensive model should preserve a clean image and retain the tracking performance. The PD loss function can be written as in Eq. (4), where the suspicious template and suspicious search regions are denoted \(z^*\) and \(x^*\).

$$\begin{aligned} L_{PD} = \frac{\Vert z - D_{\delta }(z^*) \Vert _{2}}{3\times w_z\times h_z} + \frac{\Vert x - D_{\delta }(x^*) \Vert _{2}}{3\times w_x\times h_x} \end{aligned}$$
(4)
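To make the objective concrete, the following is a minimal PyTorch sketch of the PD loss in Eq. (4); the defender argument stands for \(D_{\delta }(.)\) and is assumed to return the defended image directly, and all tensor names and shapes are illustrative rather than the authors' exact implementation.

```python
import torch

def pd_loss(defender, z_clean, x_clean, z_sus, x_sus):
    """Pixel-wise denoiser loss of Eq. (4): mean per-pixel L2 distance between
    clean regions and defended regions, for template (z) and search (x)."""
    z_def = defender(z_sus)   # defended template region, assumed shape (3, h_z, w_z)
    x_def = defender(x_sus)   # defended search region, assumed shape (3, h_x, w_x)
    loss_z = torch.norm(z_clean - z_def, p=2) / z_clean.numel()   # divide by 3 * w_z * h_z
    loss_x = torch.norm(x_clean - x_def, p=2) / x_clean.numel()   # divide by 3 * w_x * h_x
    return loss_z + loss_x
```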

To build a robust model for single visual tracking, a pixel-wise denoised image alone may be insufficient, because some adversarial noise can be amplified and still cause the tracker to malfunction. In Eq. (3), a defensive model can instead boost tracking robustness by minimizing an objective function tailored to a specific tracking model, although such training may then be effective only for that model.

The diminishing-feature attack (DFA) [17] demonstrated the transferability of its perturbation from the target tracking model, SiamRPN++ [5], to the blackbox models DaSiam [30] and DiMP [7]. DFA learns to attack only a dominant feature without learning from the classification or regression branch. We therefore adopt the high-level guided denoiser of DUNET [19] as a feature-wise defender. The defensive model generates a predicted defensive template \(\tilde{z}\) and a predicted defensive search \(\tilde{x}\), or preserves a suspicious frame \(\{z^{*}, x^{*}\}\) when it contains no adversarial noise, i.e., when it equals \(\{z, x\}\).

$$\begin{aligned} L_{FD} = \frac{\Vert \varphi (z) - \varphi (D_{\delta }(z^*)) \Vert _{2}}{M \times N \times O_{z}^{w} \times O_{z}^{h}} + \frac{\Vert \varphi (x) - \varphi (D_{\delta }(x^*)) \Vert _{2}}{M\times N \times O_{x}^{w}\times O_{x}^{h}} \end{aligned}$$
(5)

Let \(\varphi (.) \in \mathbb {R}^{M \times N \times O_{.}^{w} \times O_{.}^{h}}\) denote the feature extractor of SiamRPN++ based on ResNet-50 [22]. M is the number of aggregated convolution layers of SiamRPN++, N is the number of feature channels, \(O_{z}^{w}\) and \(O_{z}^{h}\) are the width and height of each feature channel for the template region, and \(O_{x}^{w}\) and \(O_{x}^{h}\) are the width and height of each feature channel for the search region. The feature-wise loss function \(L_{FD}\) aims to reshape the features of the adversarial image to closely match the clean features. The \(L_{FD}\) loss is expressed as the mean L2-norm of the difference between the features of the clean image and those of the defended suspicious image.
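As with the PD loss above, the FD objective can be sketched in a few lines of PyTorch. Here the backbone argument plays the role of \(\varphi (.)\) and is assumed to return a single stacked tensor of shape (M, N, O^h, O^w) per region; in practice SiamRPN++ aggregates a list of per-layer feature maps, so this is an illustrative simplification rather than the authors' implementation.

```python
import torch

def fd_loss(defender, backbone, z_clean, x_clean, z_sus, x_sus):
    """Feature-wise defender loss of Eq. (5): mean L2 distance between the
    backbone features of the clean regions and of the defended regions."""
    feat_z_clean = backbone(z_clean)        # assumed shape (M, N, O_z^h, O_z^w)
    feat_x_clean = backbone(x_clean)        # assumed shape (M, N, O_x^h, O_x^w)
    feat_z_def = backbone(defender(z_sus))  # features of the defended template
    feat_x_def = backbone(defender(x_sus))  # features of the defended search region
    loss_z = torch.norm(feat_z_clean - feat_z_def, p=2) / feat_z_clean.numel()
    loss_x = torch.norm(feat_x_clean - feat_x_def, p=2) / feat_x_clean.numel()
    return loss_z + loss_x
```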

Fig. 2 Network architecture of our proposed MUNet. The malicious regions \(\{z^*,x^*\}\) are fed into MUNet, which can scale the number of convolution blocks, both downward and upward. MUNet generates the predicted noise \(\{\Delta \tilde{z}, \Delta \tilde{x}\}\) as its final output

3.3 Network Architecture

The UNet [31] model is utilized in many applications with generative adversarial purposes, for example, image translation [32, 33], adversarial attacks on visual tracking [15, 17], and adversarial defense in image classification [19]. In this work, the defensive model generator, named MUNet, adopts the UNet structure shown in Fig. 2. MUNet explores combinations of downward and upward blocks to find a suitable trade-off between inference time and defensive performance. The downward convolution blocks are described by Eq. (6).

$$\begin{aligned} \Delta {I^d_i} = {f_i^d (\Delta {I_{i-1}^d})}. \end{aligned}$$
(6)

The output of the predictive noise for downward block i, denoted \(\Delta {I_i^d}\), is obtained by convolving the previous block’s predictive noise \(\Delta {I_{i-1}^d}\) with the convolution function \(f_i^d\). The first input of the downward path, \(\Delta {I_0^d}\), is the suspicious template region \(z^*\) or the suspicious search region \(x^*\). Each downward convolutional block downscales its predictive noise input; this is repeated n times, where n is the number of downward blocks. The final predictive noise output of the downward path, \(\Delta {I_{n}^d}\), becomes the first input of the upward path, \(\Delta {I_{0}^u}\). Each upward block input is fed into the upward convolution function \(f_i^u\) and then concatenated with the predictive noise output of downward block \(n-i\), as described in Eq. (7). The upward operation is repeated n times, equal to the number of downward convolutional blocks, to balance the scaling of the output at each block.

$$\begin{aligned} \Delta {I_{i}^u} = \Delta {I_{n-i}^d} \oplus f_{i}^u(\Delta {I_{i-1}^u}). \end{aligned}$$
(7)

The final predictive noise output of the upward path, \(\Delta {I_{n}^u}\), represents the predicted template noise \(\Delta \tilde{z}\) or the predicted search noise \(\Delta \tilde{x}\), corresponding to the suspicious template input \(z^*\) or the suspicious search input \(x^*\), respectively. The detailed training process of MUNet is given in Algorithm 1. The defensive model D is trained with one of the defensive objectives: the pixel-wise denoiser in Eq. (2) or the feature-wise defender in Eq. (3).

Algorithm 1 MUNet Training for Generating a Defensive Noise
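Since the pseudocode of Algorithm 1 is not reproduced here, the following is a hedged PyTorch sketch of the training loop it summarizes, using the PD or FD loss defined above (for FD, the backbone would be bound in beforehand, e.g., with functools.partial). Only the optimizer settings (ADAM, learning rate \(10^{-3}\), decay \(10^{-4}\), batch size 16, five epochs) follow the paper; the data loader and variable names are illustrative assumptions.

```python
import torch

def train_munet(munet, loss_fn, loader, epochs=5, device="cuda"):
    # Interpreting the paper's momentum decay of 1e-4 as Adam weight decay (assumption)
    optimizer = torch.optim.Adam(munet.parameters(), lr=1e-3, weight_decay=1e-4)
    munet.to(device).train()
    for _ in range(epochs):
        for z_clean, x_clean, z_sus, x_sus in loader:   # clean and suspicious (possibly attacked) pairs
            z_clean, x_clean = z_clean.to(device), x_clean.to(device)
            z_sus, x_sus = z_sus.to(device), x_sus.to(device)
            # MUNet predicts additive noise; the defended image is the input minus that noise
            defender = lambda img: img - munet(img)
            loss = loss_fn(defender, z_clean, x_clean, z_sus, x_sus)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return munet
```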

The network architecture of MUNet is illustrated in Fig. 2. Each down-sampling and up-sampling block scales the input by a factor of two, downward and upward respectively. These blocks consist of a LeakyReLU operation, a 3\(\times \)3 convolutional kernel with 2\(\times \)2 strides and 1\(\times \)1 paddings, and 2D batch normalization, except for the final down-sampling block, which omits batch normalization.
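For illustration, a minimal PyTorch sketch of this block structure and the skip concatenation of Eq. (7) is given below. The channel widths, the transposed-convolution up-sampling, the omission of the last skip concatenation so that the output stays at three channels, and the final interpolation for odd input sizes (e.g., 127 or 255 pixels) are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def down_block(in_ch, out_ch, batch_norm=True):
    layers = [nn.LeakyReLU(0.2, inplace=True),
              nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)]
    if batch_norm:                      # last down block omits batch normalization
        layers.append(nn.BatchNorm2d(out_ch))
    return nn.Sequential(*layers)

def up_block(in_ch, out_ch):
    return nn.Sequential(
        nn.LeakyReLU(0.2, inplace=True),
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch))

class MUNet(nn.Module):
    """n-pair-layer MUNet: n down blocks and n up blocks with skip concatenation."""
    def __init__(self, n_pairs=3, base_ch=32):
        super().__init__()
        chs = [3] + [base_ch * 2 ** i for i in range(n_pairs)]   # e.g. [3, 32, 64, 128]
        self.downs = nn.ModuleList(
            down_block(chs[i], chs[i + 1], batch_norm=(i < n_pairs - 1))
            for i in range(n_pairs))
        self.ups = nn.ModuleList()
        for i in range(n_pairs):
            # the first up block sees only the bottleneck; later ones see the concatenated skip
            in_ch = chs[n_pairs - i] * (1 if i == 0 else 2)
            self.ups.append(up_block(in_ch, chs[n_pairs - i - 1]))

    def forward(self, x):
        skips, out = [], x
        for down in self.downs:
            out = down(out)
            skips.append(out)                  # store Delta I_i^d
        for i, up in enumerate(self.ups):
            out = up(out)                      # f_i^u(Delta I_{i-1}^u)
            if i < len(self.ups) - 1:          # concatenate with Delta I_{n-i}^d per Eq. (7)
                out = torch.cat([skips[-(i + 2)], out], dim=1)
        if out.shape[-2:] != x.shape[-2:]:     # align odd input sizes (e.g. 127 or 255)
            out = F.interpolate(out, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return out                             # predicted noise, same size as the input
```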

Fig. 3 MUNet’s inference diagram for visual tracking using SiamRPN++. The left side illustrates the defending process with MUNet, where the defensive results \(\{\tilde{z},\tilde{x}\}\) are obtained by subtracting the predicted noise \(\{\Delta \tilde{z}, \Delta \tilde{x}\}\) from the suspicious regions \(\{z^*,x^*\}\). These defensive pairs are then fed into SiamRPN++, initiating the subsequent tracking process, as shown on the right

$$\begin{aligned} \tilde{z}&= z^{*} - \Delta \tilde{z},\nonumber \\ \tilde{x}&= x^{*} - \Delta \tilde{x}. \end{aligned}$$
(8)

In the last stage, the defensive template and defensive search regions are obtained by subtracting the predicted template noise and predicted search noise from the suspicious template and suspicious search regions, as depicted in Eq. (8). Subsequently, the defensive pairs are input to the tracking process as part of the standard tracking step, as illustrated in Fig. 3.
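A brief sketch of this defended inference step is given below, assuming the MUNet sketch above; tracker.track(...) is a hypothetical placeholder for the SiamRPN++ tracking call, not an actual API.

```python
import torch

@torch.no_grad()
def defend_and_track(munet, tracker, z_sus, x_sus):
    z_def = z_sus - munet(z_sus)          # Eq. (8): defended template region
    x_def = x_sus - munet(x_sus)          # Eq. (8): defended search region
    return tracker.track(z_def, x_def)    # standard tracking step on the defended pair
```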

4 Experimental Results

4.1 Training and Evaluation

Training We generated adversarial noise offline for both template-region and search-region frames using the CSA [15] and DFA [17] noise generators. The models were trained end-to-end on the GOT-10k dataset [28]. We explore five models with 3, 4, 5, 6, and 7 pairs of convolution layers. With a batch size of 16 image pairs, each iteration consists of 2 clean template frames, 2 CSA template-attack frames, 4 DFA frames, and 8 CSA search-attack frames for template-region learning, and 8 CSA search-attack frames and 8 DFA frames for search-region learning. The training dataset comprises 500,000 pairs of images, and each epoch consists of 100,000 pairs. Adaptive moment estimation (ADAM) [34] is chosen as the optimizer with a momentum decay of \(10^{-4}\) and a fixed learning rate of \(10^{-3}\) in every epoch. Our proposed model converges after five epochs, which takes 7 h of training on an Intel i9-12900KF CPU, an Nvidia RTX 3090 GPU, and 128 GB of RAM, using the PyTorch framework [35].

Evaluation We evaluate MUNet models for five different pairs of layers based on PD denoiser and FD defender, with and without CSA [15] and DFA [17] attacking models. The baseline tracker for all experiments is SiamRPN++ [5]. We utilize benchmark datasets including OTB100 [36], LaSOT [28], UAV123 [37], VOT2018 [38], VOT2019 [39], and GOT-10k [28].

4.2 Comparison Results of Defensive Methods Against Attacking Models

We evaluate the pixel-wise denoiser (PD) and feature-wise defender (FD) against the CSA and DFA attacking models on SiamRPN++, choosing the best-performing of the five pair-layer models and using OTB100 as the benchmark dataset. As illustrated in Table 1, our PD and FD models successfully recover tracking performance under both CSA and DFA attacks for all three attack types: template-region, search-region, and both-region attacks. The quantitative analysis of PD and FD against the CSA and DFA attackers is displayed in Fig. 4, and example tracking images are shown in Fig. 11. The superior performance of the FD method over the PD approach can be attributed to the network architecture of SiamRPN++, in which features have a higher impact than pixel-level similarity to the original image. This effect leads FD to align the defended features more closely with those of the original image than the PD-denoised image does.

4.3 Comparison Results on Various Tracking Datasets

We conduct experiments to evaluate the effectiveness of three different MUNet models, with three, five, and seven pairs of layers, respectively, against both the DFA and CSA attackers using the FD defender on six benchmark datasets: OTB100, LaSOT, UAV123, VOT2018, VOT2019, and GOT-10k. Our results, shown in Table 2, demonstrate that the FD model successfully defends against DFA and CSA attacks on all benchmark datasets.

To determine the best defensive performance among the three layer-pair variants, we evaluate them against the DFA and CSA attackers. Our results indicate that, when the best pair-layer model is chosen, the defensive performance against the DFA attacker decreases only slightly from the original tracker: by 0.6% success on LaSOT, 1.5% success on UAV123, 4.6% EAO on VOT2018, and 1.4% AO on GOT-10k, while the EAO on VOT2019 is fully restored. Similarly, the defensive performance against the CSA attacker decreases only slightly from the original tracker: by 2.2% success on LaSOT, 1.2% EAO on VOT2018, and 3.2% AO on GOT-10k, while performance on UAV123 and VOT2019 is fully restored.

Table 1 Comparison of defensive performance of our PD and FD models with and without CSA and DFA attackers using different attacking methods, including template-region attack, search-region attack, and both-region attack on the OTB100 dataset

4.4 Transferability Experiments: Extending Defensive Methods to Other Top Trackers

The experimental results are presented in Fig. 5. Our defensive methods, PD and FD, are initially trained on SiamRPN++. We extend our experiments to defend against DFA attacks by leveraging the transferability of our defensive methods to other blackbox trackers. Specifically, we select DaSiam and DiMP because DFA’s attack transfers to them in a similar way. For the attacker, we choose DFA and use the OTB100 dataset.

Fig. 4 Quantitative analysis of our PD and FD defensive models with and without both-region attacks by the CSA and DFA attackers on OTB100

4.5 Comparison Results on Other Denoised Model

In this experiment, we compare the tracking performance of SiamRPN++ under various configurations: SiamRPN++, DFA, DFA+DRUNET [29], DFA+PD, and DFA+FD, in terms of noise MAE, success, and precision on OTB100, as illustrated in Table 3. While DRUNET-color excels at denoising when the noise level is 5, its defensive efficiency is notably weak. The experiment reveals that increasing DRUNET-color’s denoising level can only keep the attack from worsening, at the cost of a higher residual noise level; conversely, when the denoising level is decreased, the attack’s impact increases significantly. This illustrates the failure of denoising networks under adversarial attacks.

On the other hand, our proposed models are built on an adversarial training strategy, which overcomes the limitations of denoising networks under adversarial attacks because the adversarial patterns of the selected attack (DFA) have already been learned. The malicious noise is removed by the pixel-wise defense strategy (PD), and further defense is provided by the feature-wise defense strategy (FD), whose objective preserves the tracking features of SiamRPN++ rather than solely denoising the noisy image. Consequently, the feature-wise defense (DFA+FD) outperforms the pixel-wise defense (DFA+PD).

Table 2 Comparison of our FD defensive model with different numbers of layer pairs against the DFA and CSA attackers

4.6 Ablation Study

Performance analysis in different pairs of layers

This experiment explores the best defensive performance, evaluating whether three pairs of layers are sufficient to defend against an adversarial attack on a single visual tracker and whether a model with seven pairs of layers is superior. We compare the defending performance against the DFA [17] and CSA [15] attackers across five different pairs of layers on OTB100. The smallest network consists of three pairs of layers, while the most extensive network has seven pairs of layers. Although the 7-pair-layer model exhibits outstanding performance, all five pair-layer configurations yield only slightly different results against both DFA and CSA attackers. Against the DFA attacker, the 3-pair-layer model shows the lowest performance, with a 4.9% drop in success and a 3.6% drop in precision compared to the original SiamRPN++. The experimental results are displayed in Fig. 6, and the quantitative analysis of a one-pass evaluation is presented in Fig. 7. On the other hand, the 4-pair-layer model exhibits the lowest performance against the CSA attacker, with a 2.5% drop in success and a 1.4% drop in precision compared to the original SiamRPN++. In contrast, the 7-pair-layer model shows a 2.5% drop in success and a 1.8% drop in precision against the DFA attacker, while the 5-pair-layer model demonstrates a 0.4% drop in success and full recovery in precision against the CSA attacker, as illustrated in Fig. 8, with the quantitative analysis of a one-pass evaluation presented in Fig. 9. The results on both DFA and CSA attackers demonstrate that the miniature 3-pair-layer model is sufficient for defense; moreover, it runs at 1626 FPS on the template region and 1205 FPS on the search region. In contrast, the largest model, with 7 pairs of layers, differs little from the 5-pair-layer model while consuming far more computing resources. More details are provided in Table 4.

Fig. 5 Transferability evaluation: our models defending against DFA attacks on OTB100; (left) defense extended to DaSiam, (right) defense extended to DiMP

Table 3 Comparison of our PD and FD defensive models with the DRUNET-color model, based on SiamRPN++ under the DFA attacker
Fig. 6 Comparison of FD defensive results for different n-pair-layer (pl) models against the DFA attacker’s both-region attacks on OTB100

Fig. 7 Quantitative analysis of our FD defensive models against both-region attacks by the DFA attacker on OTB100

Fig. 8 Comparison of FD defensive results for different n-pair-layer (pl) models against the CSA attacker’s both-region attacks on OTB100

Fig. 9 Quantitative analysis of our FD defensive models against both-region attacks by the CSA attacker on OTB100

Table 4 Comparison of the number of parameters, convolution layers, and inference times of our MUNet models
Fig. 10 Both the noise and denoised images are amplified by a factor of 10 for enhanced visibility. The first, third, and fifth columns show adversarial images, while the second, fourth, and sixth columns show their corresponding adversarial noise results and denoised results, respectively

Noise pattern Adversarial noise and denoised results are imperceptible to human vision. For better visibility, both are magnified 10 times, as demonstrated in Fig. 10. PD demonstrates superior pixel recovery when applied to clean input; this observation suggests that the FD approach slightly degrades a clean image, while the PD method does not. On the other hand, when defending against adversarial noise attacks, the FD method outperforms PD, because features have a higher impact on the SiamRPN++ architecture than pixel-level similarity to the original image, and because the feature-wise defender places no limit on the pixel-level difference between the clean and denoised images.

Table 5 Comparison of inference times of attackers with and without defenders

Inference times of attacking and defending models Our proposed defensive model successfully defends the SiamRPN++ [5] model against offline attacking models such as CSA [15] and DFA [17]. Although online learning attacks degrade performance effectively, their inference time is prolonged. We compare the inference times of offline (CSA [15], DFA [17]) and online (IoU attack [18]) attacking models in Table 5. The inference times of our defenders range from 48.1 to 54 FPS, an increase of 4-9 FPS compared to CSA and DFA. In contrast, the inference time of the IoU attack is 1.6 FPS, which makes it impractical either to attack or to defend against in a real application.

The perturbation effect of different adversarial noise We compare noise MAE and performance metrics for different adversarial noise algorithms: impulse noise, Gaussian noise, and DFA noise, as depicted in Table 6. The noise MAE of the random impulse and Gaussian methods exceeds or equals that of the adversarial noise from DFA, yet the random methods only marginally affect SiamRPN++. This phenomenon can be attributed to the robustness of top trackers against subtle noise; overcoming the target tracker therefore constitutes a significant task.
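For reference, the noise MAE reported here can be computed as the mean absolute per-pixel difference between a clean frame and its perturbed counterpart; the sketch below assumes frames in the 0-255 pixel range and is illustrative rather than the authors' exact evaluation code.

```python
import torch

def noise_mae(clean_frame, perturbed_frame):
    """Mean absolute per-pixel difference between clean and perturbed frames."""
    return (perturbed_frame.float() - clean_frame.float()).abs().mean()
```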

Table 6 Comparison of the perturbation effect and noise MAE across different adversarial noise models
Fig. 11 The first row of images depicts the original SiamRPN++. The second, third, and fourth rows illustrate the CSA attacker without and with the PD and FD denoisers, respectively. The fifth, sixth, and seventh rows show the DFA attacker without and with the PD and FD denoisers, respectively. In all images, the ground-truth bounding box is drawn in black

5 Conclusion

In summary, our proposed MUNet significantly improves tracking robustness against adversarial attacks in single-object tracking. Utilizing a pixel-wise denoiser and a feature-wise defender, MUNet generates robustly defended images, effectively countering attackers. Experimental results showcase the model’s strong performance across diverse benchmark datasets, demonstrating its ability to handle both attacked and clean images. Notably, our models achieve these results with minimal impact on execution time: the compact MUNet is sufficient for defending at a remarkable 1625 FPS at inference time. Furthermore, our defensive models transfer successfully to blackbox trackers, rendering them practical for real-world applications.

This work significantly advances adversarial defense for Siamese network-style visual tracking, showcasing remarkable performance across benchmark datasets with only marginal drops in accuracy. Our proposed approach has the potential to substantially improve the robustness and security of visual tracking systems in various real-world scenarios.

Furthermore, this model can potentially be applied to enhance model training in visual tracking. By augmenting the training dataset using attacking models, such as CSA, DFA, or clean images, and subsequently applying our defensive model, we aim to further improve the performance of a target tracker.