1 Introduction

Single-object visual tracking is a fundamental computer vision task that aims to follow the movement of a single target object within a scene in real time [1]. It has numerous applications, for example, video surveillance, robotics, and autonomous driving. Visual tracking typically involves localizing an object in the first frame of a video and then tracking it through subsequent frames; representative trackers include SiamFC [2], SiamRPN [3, 4], SiamRPN++ [5], ATOM [6], DiMP [7], LightTrack [8], and CHASE [9]. However, the task remains challenging due to factors such as changes in lighting conditions, occlusion, and background clutter.

In recent years, the performance of visual tracking models has been threatened by adversarial attacks, for example, FGSM [10], iFGSM [11], and miFGSM [12]. These attacks are designed to manipulate the input data and cause the model to malfunction, resulting in incorrect object tracking. Adversarial attacks are a growing concern in the field of machine learning, as they can undermine the accuracy and reliability of models in critical applications.

Visual tracking attacks involve generating adversarial noise that is added to the frames, causing the tracking model to fail [13,14,15,16,17,18]. The attackers use sophisticated tactics, such as generator models that inject noise into the template or search frames. The perturbations produced by visual object tracking (VOT) attackers are particularly alarming: models such as CSA [15], FBA [16], and DFA [17] can generate adversarial noise in real time, completely disrupting top trackers.

To address this challenge, researchers have proposed various countermeasures to improve the robustness of deep learning models against adversarial attacks [19, 20]. However, these approaches are often ineffective in visual tracking due to fundamental differences between the network structures of classification and tracking models. A classification model centers on feature extraction, whereas a visual tracker comprises three main blocks: feature extraction, template matching, and a region proposal network. Although the feature extraction block is the most important component across deep learning applications, top trackers are robust to subtle noise, as shown by the perturbation analysis in our ablation study: random impulse and Gaussian noise only slightly decrease tracking performance, whereas the adversarial noise generated by CSA [15] and DFA [17] has a lower MAE than random noise yet disrupts performance far more severely.

To overcome this limitation, we propose a novel defense model for visual tracking attacks, specifically designed to denoise adversarial noise before frames are fed to the tracking model. Our approach leverages denoising techniques to remove adversarial noise and improve the quality of the input data, ensuring that the tracking model receives clean and accurate data and making it more robust against adversarial attacks. We further propose a feature-wise defender built on deep learning, leveraging convolutional neural networks (CNNs) for feature extraction and denoising. The proposed defense model is trained on a large dataset of clean and adversarial images; it learns to distinguish between the two and to denoise the adversarial images, improving the performance of the visual tracking model. We evaluate the effectiveness of our defense model on several benchmark datasets and demonstrate that it significantly improves the robustness of visual tracking models against adversarial attacks.

Fig. 1 Comparison of tracking heatmaps for clean images, adversarial noise images, and our denoised images

Our proposed method’s effectiveness is demonstrated in Fig. 1 through a comparison of search-region frames and their corresponding heatmaps for a clean tracker, an adversarial noise attack, and our proposed defensive model. The contributions are described as follows:

  (1) We employ a pixel-wise denoiser and a feature-wise defender to effectively handle suspicious frames when integrated into the target tracker. The objective loss functions handle both clean and adversarial pairs, covering both scenarios.

  (2) Our model successfully defends against adversarial attacks in single-object tracking across various benchmark datasets, such as VOT2018, OTB100, and LaSOT. We introduce an innovative network architecture, called MUNet, which minimizes computational resources while preserving tracking robustness.

  (3) In an additional experiment, MUNet successfully transfers its defensive capability to other top trackers, namely DiMP and DaSiam. The resulting performance closely matches that of the original top trackers, as demonstrated on the OTB100 dataset.

2 Related Works

2.1 Visual-Tracking-Based Siamese Network

In the past decade, the field of object tracking has seen significant advancements in accuracy and speed. One of the earliest successful approaches was the Siamese fully convolutional network (SiamFC) [2] that learned the similarity between pairs of images, allowing it to track an object in a video sequence. Expanding on the success of SiamFC, the Siamese region proposal network (SiamRPN) [3] improved the accuracy of the tracking algorithm by introducing a region proposal network [21] that generated candidate object locations.

SiamRPN++ [5] further improved upon SiamRPN by introducing a deeper and wider backbone network based on ResNet-50 [22], along with a multi-scale feature fusion mechanism to handle scale variations in the tracked object. Another extension to the Siamese architecture is the Siamese Box Adaptive Network (SiamBAN) that uses an attention mechanism to weight the feature maps based on their relevance to the object being tracked, thereby improving the accuracy of bounding box predictions. SiamMask [4] expanded the capabilities of the Siamese architecture to include instance segmentation in addition to object tracking. This was accomplished by incorporating a mask branch into the network that generated pixel-level segmentation masks for the tracked object.

To achieve real-time tracking on CPU devices, the Siamese Cascaded Architecture for Real-time object tracking (SiamCAR) [23] uses a lightweight feature extractor and a proposal generator to achieve high accuracy and speed. SiamFC++ [24] further extends the SiamFC architecture by upgrading the head network structure and adding a regression branch to increase model accuracy. Moreover, it replaces the AlexNet [25] backbone with a GoogLeNet [26] backbone, achieving superior performance with fewer computing resources.

In addition to these architecture-based approaches, neural architecture search (NAS) is a method for automatically designing deep neural network architectures for visual tracking tasks. DARTS [27] optimizes the network architecture with a gradient-based differentiable search strategy, yielding high-performance networks with reduced manual effort. CHASE [9] proposed a robust visual tracker that adapts backbone features to the objective of Siamese tracking networks via a cell-level differentiable neural architecture search mechanism, with early stopping to avoid overfitting. Although NAS-based VOT achieves high tracking performance, NAS requires substantial computing resources for model training.

2.2 Adversarial Attacks for Single Visual Tracking

Recently, adversarial attacks on VOT have received increasing research attention as a means of probing top trackers. Adversarial attacks are not limited to classification tasks; in visual tracking, researchers have sought to generate imperceptible noise that disrupts object tracking. Online learning attacks have been proposed that can successfully disturb top trackers, for example, the Hijacking tracker [13], one-shot adversarial attacks [14], and the IoU attack [18]. However, online training is impractical in visual tracking due to its time-consuming nature; for example, the IoU attack [18] takes 0.625 s per frame at inference time, as illustrated in Table 5. For offline training, Yan et al. [15] proposed an attack trained offline against SiamRPN++ [5] on the GOT-10k dataset [28], named the cooling-shrinking attack (CSA). This attacking model exploits a tracking weakness by decreasing the density of the target heatmap, misleading the predicted bounding box into shrinking. The feature-based attack (FBA) [16] and the diminishing-feature attack (DFA) [17] were introduced to attack only the template frame by minimizing the variance of its feature channels. This perturbation leads to tracker malfunction on both template and search frames. Moreover, DFA transfers its attacking effect to unseen top trackers without learning from their parameters.

In the domain of adversarial attacks on image classification, notable works such as DUNET (defense against adversarial attacks using high-level representation guided denoiser) [19] have demonstrated success against both whitebox and blackbox attacks. However, these defensive mechanisms do not transfer to single-object tracking. In single-object tracking, the model must feed template and search frames at different scales into the tracker simultaneously, whereas DUNET can generate only a single, fixed-size denoised image. Moreover, a denoising model alone fails to defend against adversarial attacks in tracking scenarios: our ablation study demonstrates that the deep denoiser prior (DPIR) [29] fails to defend against DFA [17] at every denoising level.

It is evident that adversarial attacking models pose a significant threat to visual tracking systems. Therefore, investing in and promptly developing adversarial defense mechanisms is imperative to create models that can robustly counter adversarial attacks in tracking scenarios.

We propose an adversarial defense for Siamese network-style visual tracking, named multi-model UNet (MUNet), which defends suspicious images, whether or not they have been infiltrated by adversarial noise, before feeding them to the tracker. MUNet comes in five variants with 3, 4, 5, 6, and 7 pairs of layers; the variants differ in parameter size and inference speed according to the number of layer pairs. The defending performance of our largest model is outstanding, while our fewer-layer models are also efficient and come very close to the largest model.

3 Defensive Model Methodology

3.1 Constrained Optimization of Adversarial Defense

An adversarial attack aims to maximize the loss function \(\ell \) of a machine learning function f(.) between the adversarial data \(x' = x + \epsilon \) and the output target y by adding a perturbation \(\epsilon \) within the norm ball \(B_{\epsilon }(x)\). The attack can be defined as

$$\begin{aligned} x' = \text {argmax}_{x' \in \textit{B}_{\epsilon }(x)} \ell (f(x'),y). \end{aligned}$$
(1)

For defensive purposes, an adversarial defense can be defined in two ways. The first is to minimize the loss function \(\ell \) between the predicted defensive data \(\tilde{x}\) and the clean data x by removing the predicted noise \(\delta \) from the adversarial data \(x'\) using a denoising model generator D(.), as

$$\begin{aligned} \tilde{x} = \text {argmin}_{ \tilde{x} \in \textit{D}_{\delta }(x') } \ell (\tilde{x}, x) . \end{aligned}$$
(2)

The second definition is to minimize the loss function \(\ell \) of the machine learning function f(.) applied to the predicted defensive data \(\tilde{x}\) with respect to the output target y. A defensive model generator D(.) generates a predicted defensive noise \(\delta \) and subtracts it from the adversarial data \(x'\), as described in Eq. (3).

$$\begin{aligned} \tilde{x} = \text {argmin}_{ \tilde{x} \in \textit{D}_{\delta }(x') } \ell (f(\tilde{x}), y). \end{aligned}$$
(3)

3.2 Pixel-Wise Denoiser and Feature-Wise Defender

Liao et al. [19] proposed a defensive model, the pixel-guided denoiser, to defend against both whitebox and blackbox attackers in classification models. DUNET inspires us to design a loss function named the pixel-wise denoiser (PD) to defend against adversarial attacks in single visual tracking, which requires predicting both the denoised template \(\tilde{z}\) and the denoised search \(\tilde{x}\) regions before they are fed to the Siamese head. Making the task more challenging, the PD must denoise an adversarial image while leaving a clean image unchanged. Let \(D_{\delta }(.)\) be a defensive model that processes both the adversarial template \(z'\) and the adversarial search \(x'\) regions simultaneously. In particular, the defensive model should preserve a clean image and retain the tracking performance. The PD loss function can be written as in Eq. (4), where the suspicious template and suspicious search regions are denoted \(z^*\) and \(x^*\).

$$\begin{aligned} L_{PD} = \frac{\Vert z - D_{\delta }(z^*) \Vert _{2}}{3\times w_z\times h_z} + \frac{\Vert x - D_{\delta }(x^*) \Vert _{2}}{3\times w_x\times h_x} \end{aligned}$$
(4)
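To make the objective concrete, the following is a minimal PyTorch sketch of the PD loss in Eq. (4); the defender argument stands for \(D_{\delta }(.)\) and is assumed to return the defended image directly, and all tensor names and shapes are illustrative rather than the authors' exact implementation.

```python
import torch

def pd_loss(defender, z_clean, x_clean, z_sus, x_sus):
    """Pixel-wise denoiser loss of Eq. (4): mean per-pixel L2 distance between
    clean regions and defended regions, for template (z) and search (x)."""
    z_def = defender(z_sus)   # defended template region, assumed shape (3, h_z, w_z)
    x_def = defender(x_sus)   # defended search region, assumed shape (3, h_x, w_x)
    loss_z = torch.norm(z_clean - z_def, p=2) / z_clean.numel()   # divide by 3 * w_z * h_z
    loss_x = torch.norm(x_clean - x_def, p=2) / x_clean.numel()   # divide by 3 * w_x * h_x
    return loss_z + loss_x
```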

To build a robust model for single visual tracking, a pixel-wise denoised image alone may be insufficient, because some adversarial noise can be amplified and still cause the tracker to malfunction. In Eq. (3), a defensive model can instead boost tracking robustness by minimizing an objective function tailored to a specific tracking model, although such training may then be effective only for that model.

The diminishing-feature attack (DFA) [17] demonstrated the transferability of its perturbation from the target tracking model, SiamRPN++ [5], to the blackbox models DaSiam [30] and DiMP [7]. DFA learns to attack only a dominant feature without learning from the classification or regression branch. We therefore adopt the high-level guided denoiser of DUNET [19] as a feature-wise defender. The defensive model generates a predicted defensive template \(\tilde{z}\) and a predicted defensive search \(\tilde{x}\), or preserves a suspicious frame \(\{z^{*}, x^{*}\}\) when it contains no adversarial noise, i.e., when it equals \(\{z, x\}\).

$$\begin{aligned} L_{FD} = \frac{\Vert \varphi (z) - \varphi (D_{\delta }(z^*)) \Vert _{2}}{M \times N \times O_{z}^{w} \times O_{z}^{h}} + \frac{\Vert \varphi (x) - \varphi (D_{\delta }(x^*)) \Vert _{2}}{M\times N \times O_{x}^{w}\times O_{x}^{h}} \end{aligned}$$
(5)

Let \(\varphi (.) \in \mathbb {R}^{M \times N \times O_{.}^{w} \times O_{.}^{h}}\) denote the feature extractor of SiamRPN++ based on ResNet-50 [22]. M is the number of aggregated convolution layers of SiamRPN++, N is the number of feature channels, \(O_{z}^{w}\) and \(O_{z}^{h}\) are the width and height of each feature channel for the template region, and \(O_{x}^{w}\) and \(O_{x}^{h}\) are the width and height of each feature channel for the search region. The feature-wise loss function \(L_{FD}\) aims to reshape the features of the adversarial image to closely match the clean features. The \(L_{FD}\) loss is expressed as the mean L2-norm of the difference between the features of the clean image and those of the defended suspicious image.
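As with the PD loss above, the FD objective can be sketched in a few lines of PyTorch. Here the backbone argument plays the role of \(\varphi (.)\) and is assumed to return a single stacked tensor of shape (M, N, O^h, O^w) per region; in practice SiamRPN++ aggregates a list of per-layer feature maps, so this is an illustrative simplification rather than the authors' implementation.

```python
import torch

def fd_loss(defender, backbone, z_clean, x_clean, z_sus, x_sus):
    """Feature-wise defender loss of Eq. (5): mean L2 distance between the
    backbone features of the clean regions and of the defended regions."""
    feat_z_clean = backbone(z_clean)        # assumed shape (M, N, O_z^h, O_z^w)
    feat_x_clean = backbone(x_clean)        # assumed shape (M, N, O_x^h, O_x^w)
    feat_z_def = backbone(defender(z_sus))  # features of the defended template
    feat_x_def = backbone(defender(x_sus))  # features of the defended search region
    loss_z = torch.norm(feat_z_clean - feat_z_def, p=2) / feat_z_clean.numel()
    loss_x = torch.norm(feat_x_clean - feat_x_def, p=2) / feat_x_clean.numel()
    return loss_z + loss_x
```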

Fig. 2 Network architecture of our proposed MUNet. The malicious regions \(\{z^*,x^*\}\) are fed into MUNet, which can scale the number of convolution blocks, both downward and upward. MUNet generates the predicted noise \(\{\Delta \tilde{z}, \Delta \tilde{x}\}\) as its final output

3.3 Network Architecture

The UNet [31] model is utilized in many applications with generative adversarial purposes, for example, image translation [32, 33], adversarial attacks on visual tracking [15, 17], and adversarial defense in image classification [19]. In this work, the defensive model generator, named MUNet, adopts the UNet structure shown in Fig. 2. MUNet explores combinations of downward and upward blocks to find a suitable trade-off between inference time and defensive performance. The downward convolution blocks are described by Eq. (6).

$$\begin{aligned} \Delta {I^d_i} = {f_i^d (\Delta {I_{i-1}^d})}. \end{aligned}$$
(6)

The output of the predictive noise for downward block i, denoted \(\Delta {I_i^d}\), is obtained by convolving the previous block’s predictive noise \(\Delta {I_{i-1}^d}\) with the convolution function \(f_i^d\). The first input of the downward path, \(\Delta {I_0^d}\), is the suspicious template region \(z^*\) or the suspicious search region \(x^*\). Each downward convolutional block downscales its predictive noise input; this is repeated n times, where n is the number of downward blocks. The final predictive noise output of the downward path, \(\Delta {I_{n}^d}\), becomes the first input of the upward path, \(\Delta {I_{0}^u}\). Each upward block input is fed into the upward convolution function \(f_i^u\) and then concatenated with the predictive noise output of downward block \(n-i\), as described in Eq. (7). The upward operation is repeated n times, equal to the number of downward convolutional blocks, to balance the scaling of the output at each block.

$$\begin{aligned} \Delta {I_{i}^u} = \Delta {I_{n-i}^d} \oplus f_{i}^u(\Delta {I_{i-1}^u}). \end{aligned}$$
(7)

The final predictive noise output of the upward path, \(\Delta {I_{n}^u}\), represents the predicted template noise \(\Delta \tilde{z}\) or the predicted search noise \(\Delta \tilde{x}\), corresponding to the suspicious template input \(z^*\) or the suspicious search input \(x^*\), respectively. The detailed training process of MUNet is given in Algorithm 1. The defensive model D is trained with one of the defensive objectives: the pixel-wise denoiser in Eq. (2) or the feature-wise defender in Eq. (3).

Algorithm 1 MUNet Training for Generating a Defensive Noise
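Since the pseudocode of Algorithm 1 is not reproduced here, the following is a hedged PyTorch sketch of the training loop it summarizes, using the PD or FD loss defined above (for FD, the backbone would be bound in beforehand, e.g., with functools.partial). Only the optimizer settings (ADAM, learning rate \(10^{-3}\), decay \(10^{-4}\), batch size 16, five epochs) follow the paper; the data loader and variable names are illustrative assumptions.

```python
import torch

def train_munet(munet, loss_fn, loader, epochs=5, device="cuda"):
    # Interpreting the paper's momentum decay of 1e-4 as Adam weight decay (assumption)
    optimizer = torch.optim.Adam(munet.parameters(), lr=1e-3, weight_decay=1e-4)
    munet.to(device).train()
    for _ in range(epochs):
        for z_clean, x_clean, z_sus, x_sus in loader:   # clean and suspicious (possibly attacked) pairs
            z_clean, x_clean = z_clean.to(device), x_clean.to(device)
            z_sus, x_sus = z_sus.to(device), x_sus.to(device)
            # MUNet predicts additive noise; the defended image is the input minus that noise
            defender = lambda img: img - munet(img)
            loss = loss_fn(defender, z_clean, x_clean, z_sus, x_sus)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return munet
```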

The network architecture of MUNet is illustrated in Fig. 2. Each down-sampling and up-sampling block scales the input by a factor of two, downward and upward respectively. These blocks consist of a LeakyReLU operation, a 3\(\times \)3 convolutional kernel with 2\(\times \)2 strides and 1\(\times \)1 paddings, and 2D batch normalization, except for the final down-sampling block, which omits batch normalization.
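For illustration, a minimal PyTorch sketch of this block structure and the skip concatenation of Eq. (7) is given below. The channel widths, the transposed-convolution up-sampling, the omission of the last skip concatenation so that the output stays at three channels, and the final interpolation for odd input sizes (e.g., 127 or 255 pixels) are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def down_block(in_ch, out_ch, batch_norm=True):
    layers = [nn.LeakyReLU(0.2, inplace=True),
              nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)]
    if batch_norm:                      # last down block omits batch normalization
        layers.append(nn.BatchNorm2d(out_ch))
    return nn.Sequential(*layers)

def up_block(in_ch, out_ch):
    return nn.Sequential(
        nn.LeakyReLU(0.2, inplace=True),
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch))

class MUNet(nn.Module):
    """n-pair-layer MUNet: n down blocks and n up blocks with skip concatenation."""
    def __init__(self, n_pairs=3, base_ch=32):
        super().__init__()
        chs = [3] + [base_ch * 2 ** i for i in range(n_pairs)]   # e.g. [3, 32, 64, 128]
        self.downs = nn.ModuleList(
            down_block(chs[i], chs[i + 1], batch_norm=(i < n_pairs - 1))
            for i in range(n_pairs))
        self.ups = nn.ModuleList()
        for i in range(n_pairs):
            # the first up block sees only the bottleneck; later ones see the concatenated skip
            in_ch = chs[n_pairs - i] * (1 if i == 0 else 2)
            self.ups.append(up_block(in_ch, chs[n_pairs - i - 1]))

    def forward(self, x):
        skips, out = [], x
        for down in self.downs:
            out = down(out)
            skips.append(out)                  # store Delta I_i^d
        for i, up in enumerate(self.ups):
            out = up(out)                      # f_i^u(Delta I_{i-1}^u)
            if i < len(self.ups) - 1:          # concatenate with Delta I_{n-i}^d per Eq. (7)
                out = torch.cat([skips[-(i + 2)], out], dim=1)
        if out.shape[-2:] != x.shape[-2:]:     # align odd input sizes (e.g. 127 or 255)
            out = F.interpolate(out, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return out                             # predicted noise, same size as the input
```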

Fig. 3 MUNet’s inference diagram for visual tracking using SiamRPN++. The left side illustrates the defending process with MUNet, where the defensive results \(\{\tilde{z},\tilde{x}\}\) are obtained by subtracting the predicted noise \(\{\Delta \tilde{z}, \Delta \tilde{x}\}\) from the suspicious regions \(\{z^*,x^*\}\). These defensive pairs are then fed into SiamRPN++, initiating the subsequent tracking process, as shown on the right

$$\begin{aligned} \tilde{z}&= z^{*} - \Delta \tilde{z},\nonumber \\ \tilde{x}&= x^{*} - \Delta \tilde{x}. \end{aligned}$$
(8)

In the last stage, the defensive template and defensive search regions are obtained by subtracting the predicted template noise and predicted search noise from the suspicious template and suspicious search regions, as depicted in Eq. (8). Subsequently, the defensive pairs are input to the tracking process as part of the standard tracking step, as illustrated in Fig. 3.
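A brief sketch of this defended inference step is given below, assuming the MUNet sketch above; tracker.track(...) is a hypothetical placeholder for the SiamRPN++ tracking call, not an actual API.

```python
import torch

@torch.no_grad()
def defend_and_track(munet, tracker, z_sus, x_sus):
    z_def = z_sus - munet(z_sus)          # Eq. (8): defended template region
    x_def = x_sus - munet(x_sus)          # Eq. (8): defended search region
    return tracker.track(z_def, x_def)    # standard tracking step on the defended pair
```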

4 Experimental Results

4.1 Training and Evaluation

Training We generated adversarial noise offline for both template-region and search-region frames using the CSA [15] and DFA [17] noise generators. The models were trained end-to-end on the GOT-10k dataset [28]. We explore five models with 3, 4, 5, 6, and 7 pairs of convolution layers. With a batch size of 16 image pairs, each iteration consists of 2 clean template frames, 2 CSA template-attack frames, 4 DFA frames, and 8 CSA search-attack frames for template-region learning, and 8 CSA search-attack frames and 8 DFA frames for search-region learning. The training dataset comprises 500,000 pairs of images, and each epoch consists of 100,000 pairs. Adaptive moment estimation (ADAM) [34] is chosen as the optimizer with a momentum decay of \(10^{-4}\) and a fixed learning rate of \(10^{-3}\) in every epoch. Our proposed model converges after five epochs, which takes 7 h of training on an Intel i9-12900KF CPU, an Nvidia RTX 3090 GPU, and 128 GB of RAM, using the PyTorch framework [35].

Evaluation We evaluate MUNet models for five different pairs of layers based on PD denoiser and FD defender, with and without CSA [15] and DFA [17] attacking models. The baseline tracker for all experiments is SiamRPN++ [5]. We utilize benchmark datasets including OTB100 [36], LaSOT [28], UAV123 [37], VOT2018 [38], VOT2019 [39], and GOT-10k [28].

4.2 Comparison Results of Defensive Methods Against Attacking Models

We evaluate the pixel-wise denoiser (PD) and feature-wise defender (FD) against the CSA and DFA attacking models on SiamRPN++, choosing the best-performing of the five pair-layer models and using OTB100 as the benchmark dataset. As illustrated in Table 1, our PD and FD models successfully recover tracking performance under both CSA and DFA attacks for all three attack types: template-region, search-region, and both-region attacks. The quantitative analysis of PD and FD against the CSA and DFA attackers is displayed in Fig. 4, and example tracking images are shown in Fig. 11. The superior performance of the FD method over the PD approach can be attributed to the network architecture of SiamRPN++, in which features have a higher impact than pixel-level similarity to the original image. This effect leads FD to align the defended features more closely with those of the original image than the PD-denoised image does.

4.3 Comparison Results on Various Tracking Datasets

We conduct experiments to evaluate the effectiveness of three different MUNet models, with three, five, and seven pairs of layers, respectively, against both the DFA and CSA attackers using the FD defender on six benchmark datasets: OTB100, LaSOT, UAV123, VOT2018, VOT2019, and GOT-10k. Our results, shown in Table 2, demonstrate that the FD model successfully defends against DFA and CSA attacks on all benchmark datasets.

To determine the best defensive performance among the three layer-pair variants, we evaluate them against the DFA and CSA attackers. Our results indicate that, when the best pair-layer model is chosen, the defensive performance against the DFA attacker decreases only slightly from the original tracker: by 0.6% success on LaSOT, 1.5% success on UAV123, 4.6% EAO on VOT2018, and 1.4% AO on GOT-10k, while the EAO on VOT2019 is fully restored. Similarly, the defensive performance against the CSA attacker decreases only slightly from the original tracker: by 2.2% success on LaSOT, 1.2% EAO on VOT2018, and 3.2% AO on GOT-10k, while performance on UAV123 and VOT2019 is fully restored.

Table 1 Comparison of defensive performance of our PD and FD models with and without CSA and DFA attackers using different attacking methods, including template-region attack, search-region attack, and both-region attack on the OTB100 dataset

4.4 Transferability Experiments: Extending Defensive Methods to Other Top Trackers

The experimental results are presented in Fig. 5. Our defensive methods, PD and FD, are initially trained on SiamRPN++. We extend our experiments to defend against DFA attacks by leveraging the transferability of our defensive methods to other blackbox trackers. Specifically, we select DaSiam and DiMP because DFA’s attack transfers to them in a similar way. For the attacker, we choose DFA and use the OTB100 dataset.

Fig. 4 Quantitative analysis of our PD and FD defensive models with and without both-region attacks by the CSA and DFA attackers on OTB100

4.5 Comparison Results on Other Denoised Model

In this experiment, we compare the tracking performance of SiamRPN++ under various configurations: SiamRPN++, DFA, DFA+DRUNET [29], DFA+PD, and DFA+FD, in terms of noise MAE, success, and precision on OTB100, as illustrated in Table 3. While DRUNET-color excels at denoising when the noise level is 5, its defensive efficiency is notably weak. The experiment reveals that increasing DRUNET-color’s denoising level can only keep the attack from worsening, at the cost of a higher residual noise level; conversely, when the denoising level is decreased, the attack’s impact increases significantly. This illustrates the failure of denoising networks under adversarial attacks.

On the other hand, our proposed models are built on an adversarial training strategy, which overcomes the limitations of denoising networks under adversarial attacks because the adversarial patterns of the selected attack (DFA) have already been learned. The malicious noise is removed by the pixel-wise defense strategy (PD), and further defense is provided by the feature-wise defense strategy (FD), whose objective preserves the tracking features of SiamRPN++ rather than solely denoising the noisy image. Consequently, the feature-wise defense (DFA+FD) outperforms the pixel-wise defense (DFA+PD).

Table 2 Comparison of our FD defensive model with different numbers of layer pairs against the DFA and CSA attackers

4.6 Ablation Study

Performance analysis in different pairs of layers

This experiment explores the best defensive performance, evaluating whether three pairs of layers are sufficient to defend against an adversarial attack on a single visual tracker and whether a model with seven pairs of layers is superior. We compare the defending performance against the DFA [17] and CSA [15] attackers across five different pairs of layers on OTB100. The smallest network consists of three pairs of layers, while the most extensive network has seven pairs of layers. Although the 7-pair-layer model exhibits outstanding performance, all five pair-layer configurations yield only slightly different results against both DFA and CSA attackers. Against the DFA attacker, the 3-pair-layer model shows the lowest performance, with a 4.9% drop in success and a 3.6% drop in precision compared to the original SiamRPN++. The experimental results are displayed in Fig. 6, and the quantitative analysis of a one-pass evaluation is presented in Fig. 7. On the other hand, the 4-pair-layer model exhibits the lowest performance against the CSA attacker, with a 2.5% drop in success and a 1.4% drop in precision compared to the original SiamRPN++. In contrast, the 7-pair-layer model shows a 2.5% drop in success and a 1.8% drop in precision against the DFA attacker, while the 5-pair-layer model demonstrates a 0.4% drop in success and full recovery in precision against the CSA attacker, as illustrated in Fig. 8, with the quantitative analysis of a one-pass evaluation presented in Fig. 9. The results on both DFA and CSA attackers demonstrate that the miniature 3-pair-layer model is sufficient for defense; moreover, it runs at 1626 FPS on the template region and 1205 FPS on the search region. In contrast, the largest model, with 7 pairs of layers, differs little from the 5-pair-layer model while consuming far more computing resources. More details are provided in Table 4.

Fig. 5 Transferability evaluation: our models defending against DFA attacks on OTB100; (left) defense extended to DaSiam, (right) defense extended to DiMP

Table 3 Comparison of our PD and FD defensive models with the DRUNET-color model, based on SiamRPN++ under the DFA attacker
Fig. 6 Comparison of FD defensive results for different n-pair-layer (pl) models against the DFA attacker’s both-region attacks on OTB100

Fig. 7 Quantitative analysis of our FD defensive models against both-region attacks by the DFA attacker on OTB100

Fig. 8 Comparison of FD defensive results for different n-pair-layer (pl) models against the CSA attacker’s both-region attacks on OTB100

Fig. 9 Quantitative analysis of our FD defensive models against both-region attacks by the CSA attacker on OTB100

Table 4 Comparison of the number of parameters, convolution layers, and inference times of our MUNet models
Fig. 10 Both the noise and denoised images are amplified by a factor of 10 for enhanced visibility. The first, third, and fifth columns show adversarial images, while the second, fourth, and sixth columns show their corresponding adversarial noise results and denoised results, respectively

Noise pattern Adversarial noise and denoised results are imperceptible to human vision. For better visibility, both are magnified 10 times, as demonstrated in Fig. 10. PD demonstrates superior pixel recovery when applied to clean input; this observation suggests that the FD approach slightly degrades a clean image, while the PD method does not. On the other hand, when defending against adversarial noise attacks, the FD method outperforms PD, because features have a higher impact on the SiamRPN++ architecture than pixel-level similarity to the original image, and because the feature-wise defender places no limit on the pixel-level difference between the clean and denoised images.

Table 5 Comparison of inference times of attackers with and without defenders

Inference times of attacking and defending models Our proposed defensive model successfully defends the SiamRPN++ [5] model against offline attacking models such as CSA [15] and DFA [17]. Although online learning attacks degrade performance effectively, their inference time is prolonged. We compare the inference times of offline (CSA [15], DFA [17]) and online (IoU attack [18]) attacking models in Table 5. The inference times of our defenders range from 48.1 to 54 FPS, an increase of 4-9 FPS compared to CSA and DFA. In contrast, the inference time of the IoU attack is 1.6 FPS, which makes it impractical either to attack or to defend against in a real application.

The perturbation effect of different adversarial noise We compare noise MAE and performance metrics for different adversarial noise algorithms: impulse noise, Gaussian noise, and DFA noise, as depicted in Table 6. The noise MAE of the random impulse and Gaussian methods exceeds or equals that of the adversarial noise from DFA, yet the random methods only marginally affect SiamRPN++. This phenomenon can be attributed to the robustness of top trackers against subtle noise; overcoming the target tracker therefore constitutes a significant task.
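For reference, the noise MAE reported here can be computed as the mean absolute per-pixel difference between a clean frame and its perturbed counterpart; the sketch below assumes frames in the 0-255 pixel range and is illustrative rather than the authors' exact evaluation code.

```python
import torch

def noise_mae(clean_frame, perturbed_frame):
    """Mean absolute per-pixel difference between clean and perturbed frames."""
    return (perturbed_frame.float() - clean_frame.float()).abs().mean()
```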

Table 6 Comparison of the perturbation effect and noise MAE across different adversarial noise models
Fig. 11 The first row of images depicts the original SiamRPN++. The second, third, and fourth rows illustrate the CSA attacker without and with the PD and FD denoisers, respectively. The fifth, sixth, and seventh rows show the DFA attacker without and with the PD and FD denoisers, respectively. In all images, the ground-truth bounding box is drawn in black

5 Conclusion

In summary, our proposed MUNet significantly improves tracking robustness against adversarial attacks in single-object tracking. Utilizing a pixel-wise denoiser and a feature-wise defender, MUNet generates robustly defended images, effectively countering attackers. Experimental results showcase the model’s strong performance across diverse benchmark datasets, demonstrating its ability to handle both attacked and clean images. Notably, our models achieve these results with minimal impact on execution time: the compact MUNet is sufficient for defending at a remarkable 1625 FPS at inference time. Furthermore, our defensive models transfer successfully to blackbox trackers, rendering them practical for real-world applications.

This work significantly advances adversarial defense for Siamese network-style visual tracking, showcasing remarkable performance across benchmark datasets with only marginal drops in accuracy. Our proposed approach has the potential to substantially improve the robustness and security of visual tracking systems in various real-world scenarios.

Furthermore, this model can potentially be applied to enhance model training in visual tracking. By augmenting the training dataset using attacking models, such as CSA, DFA, or clean images, and subsequently applying our defensive model, we aim to further improve the performance of a target tracker.