DIMBA: Discretely Masked Black-Box Attack in Single Object Tracking

An adversarial attack can force a CNN-based model to produce an incorrect output by craftily manipulating human-imperceptible input perturbations. Exploring such perturbations helps us gain a deeper understanding of the vulnerability of neural networks and provides robustness for deep learning against miscellaneous adversaries. Despite extensive studies on the robustness of image, audio, and NLP models, works on adversarial examples for visual object tracking, especially in a black-box manner, are quite lacking. In this paper, we propose a novel adversarial attack method that generates noise for single object tracking under black-box settings, where perturbations are added only to the initial frames of tracking sequences, making them difficult to notice from the perspective of a whole video clip. Specifically, we divide our algorithm into three components and exploit reinforcement learning to localize important frame patches precisely while reducing unnecessary query overhead. Compared to existing techniques, our method requires fewer queries on the initialized frames of a video to achieve competitive or even better attack performance. We test our algorithm on both long-term and short-term datasets, including OTB100, VOT2018, UAV123, and LaSOT. Extensive experiments demonstrate the effectiveness of our method on three mainstream types of trackers: discrimination, Siamese-based, and reinforcement learning-based trackers.


Introduction
While deep learning has driven breakthroughs across the artificial intelligence and machine learning community over the past decade, several studies have revealed that Deep Neural Networks (DNNs) are vulnerable to adversarial perturbations (Goodfellow et al, 2015) on image processing tasks (Szegedy et al, 2014; Moosavi-Dezfooli et al, 2016; Xie et al, 2017). For images, such perturbations are often too small to be perceptible, yet they can completely fool a DNN classifier, detector, or segmentation analyzer, causing it to predict incorrect categories or contours. This raises great concern in circumstances where deep learning models are rapidly deployed in safety- and security-critical applications, e.g., self-driving cars, surveillance, drones, and robotics (Mnih et al, 2015). Beyond computer vision, recent works also investigate adversarial attacks on other tasks, e.g., natural language processing (Zhang et al, 2019a), audio recognition (Yakura and Sakuma, 2019), and malware detection (Grosse et al, 2017).
Single object tracking (SOT), one of the fundamental problems in computer vision, has recently seen tremendous improvement through DNNs and plays a significant role in practical security applications such as self-driving systems and robotics (Mnih et al, 2015). In terms of the tracking procedure, trackers can be divided into three categories: Siamese-based trackers (Li et al, 2018b; Zhu et al, 2018; Bertinetto et al, 2016; Zhang et al, 2019b), discrimination trackers (Danelljan et al, 2019, 2020), and reinforcement learning-based trackers (Yun et al, 2017a). Siamese-based trackers cast tracking as a one-stage detection problem and locate, on subsequent frames, the object whose feature representation is most similar to the initial template. Discrimination trackers, in contrast, predict object locations through two sub-modules: a target classification module, which introduces dedicated optimization techniques to discriminate between the background and the target object, and a target estimation module, which regresses an intersection-over-union (IoU) score between the ground-truth and predicted bounding boxes. The third category, reinforcement learning-based trackers, formulates the whole tracking procedure as a Markov Decision Process and selects actions according to the agent state at the current step. However, since the concept of the adversarial attack was introduced by (Szegedy et al, 2014), although many follow-up methods have demonstrated various adversaries that deceive deep learning models (Goodfellow et al, 2015; Kurakin et al, 2017; Madry et al, 2019), adversarial robustness of object trackers has not yet been fully explored. As far as we know, only a handful of research works have appeared very recently. For example, (Yan et al, 2020a) proposed a Cooling-Shrinking Loss to train a perturbation generator for an effective and efficient adversarial
attacking algorithm. Moreover, spatial-temporal sparse noise was applied in (Guo et al, 2020) along targeted or untargeted trajectories. By categorizing the tracking problem into classification and regression branches, researchers in (Chen et al, 2020) focused on model-free object tracking with dual attention. However, current attacking techniques for SOT exhibit several limitations that may severely restrict their generality in practice. Specifically, we highlight the following disadvantages. (1) Most tracking adversaries cannot be extended to constrained black-box SOT applications. Given comprehensive knowledge of model architecture and parameters, many approaches can generate effective perturbations over the whole video clip based on network gradients. However, the target network is often inaccessible in safety-critical scenarios, where we can only obtain hard-label predictions during the whole tracking procedure. Therefore, practical black-box attack algorithms are worth exploring. (2) Current methods often compose perturbations on multiple frames. As illustrated above, existing white-box attacks can achieve powerful overall results, but most of them rely on noise attached to a large portion of frames. Although the initial frame of a video plays a vital role in SOT, few works pay attention to it, either in white-box or black-box scenarios. For instance, the Hijacking algorithm (Yan et al, 2020c) generates an adversary on a specific clip of the video, and the IoU attack (Jia et al, 2021) proposes a continuous black-box attack framework imposed from the 2nd frame to the Nth frame. (3) Recent black-box attack algorithms for SOT do not consider computational efficiency.
As far as we know, none of the existing black-box attacks on SOT considers query efficiency. (Liang et al, 2020) presents a transferable attack mode, but it is specialized to white-box cases. (Jia et al, 2021) focuses on temporal correlations between adjacent frames, but its effectiveness heavily relies on the length of the video and the number of queries per frame. Unlike attacks on image classification or segmentation tasks, where a perturbation can be added to a single picture, the evaluation metrics in SOT are determined by the whole video clip. As the number of perturbed frames increases, adversaries become easier to detect. Meanwhile, knowledge of the video gradient is completely lacking in a black-box scenario, so a sacrifice in query times is almost unavoidable to improve adversarial results. We therefore pose a question: can we combine efficiency and effectiveness in black-box attacks on SOT? In other words, can we select the most fragile part of a video to perturb and reach destroyed tracking results more quickly? In this paper, we combine a query-based method with a reinforcement learning framework and propose the Discretely Masked Black-Box Attack (DIMBA) algorithm for SOT, which modifies only the initialized frames of the whole video clip. In contrast to previous works, we work in reverse: we first craft heavy, effective perturbations, then decrease the adversarial magnitude using a modified sign attack method. In summary, the key contributions of our paper are as follows. 1) We formulate the black-box attack problem on SOT in a more practical and query-efficient manner. Compared to recursively generating perturbed results in each frame, we focus on initialized frames, which boosts attack efficiency. 2) To reduce unnecessary perturbations of large adversarial magnitude on specific areas of initialized frames, and to increase the probability of generating perturbations causing similar attack performance within
a smaller perturbing radius, we introduce an A2C (Actor-Critic) grid searching strategy. 3) Comprehensive experiments over the OTB100, UAV123, LaSOT, and VOT2018 datasets show that the DIMBA attack can generate imperceptible perturbations more efficiently and achieve competitive or even better performance compared to SOTA black-box attacks on SOT.
Related Work

Adversarial Attacks on Visual Object Tracking
Wide applications of visual object tracking have led to numerous specialized real-world techniques, which in turn have attracted well-crafted attacks from the adversarial perspective. In the realm of physical-world attacks, (Eykholt et al, 2018) analyzed adversarial stickers on stop signs in the context of autonomous driving to fool YOLO (Redmon et al, 2016). (Jia et al, 2019) proposed a 'tracking hijacking' technique to fool multiple object trackers with imperceptible perturbations computed for object detectors in the perceptual pipeline of autonomous driving. Meanwhile, (Yan et al, 2020a) developed an attacking technique to deceive single object trackers based on SiamRPN++ (Li et al, 2018b); their method trains a generator model to construct adversarial frames under a 'cooling-shrinking' loss, which cools down the hot target regions and forces the bounding boxes to shrink during online tracking. (Huang et al, 2020) delved into physical attacks on object detectors in the wild by developing a universal camouflage for object categories. A one-shot adversarial attack is demonstrated in (Chen et al, 2020) for single object tracking, where inserting a patch in the first frame of the video results in losing the target in the subsequent frames. A spatial-aware attack (SPARK) is proposed in (Guo et al, 2020) to fool online trackers; this approach imposes an L_p constraint over perturbations while computing them incrementally based on previous frames. Extensive experiments show that their adversaries are capable of fooling multiple state-of-the-art trackers.
Differing from previous attacking models in white-box settings, (Jia et al, 2021) explores black-box perturbations by making use of temporally correlated information and incrementally adding noise from the initial frame to subsequent frames. However, it focuses extensively on locally anchored noise between adjacent templates and is devoid of long-term diversity.

Deep Reinforcement Learning
Due to its ability to scale to previously intractable decision-making problems, Deep Reinforcement Learning (DRL) has been a growing area recently. Kickstarting this revolution, (Mnih et al, 2015), for example, first learned to play a range of Atari 2600 video games at a superhuman level directly from pixel-level input, demonstrating that RL agents can be trained on raw, high-dimensional observations from reward signals alone. As another standout success, AlphaGo (Silver et al, 2016) paralleled the historic achievement of IBM's Deep Blue and defeated a human world champion in Go.

Methodology
In this section, we first introduce the preliminaries of our proposed attack method; the details of DIMBA are presented in subsequent sections. The general pipeline of our algorithm is shown in Figure 2. Initialized frames are taken as input (for simplicity, we only consider One Pass Evaluation (OPE) in the following parts). With a momentum-based perturbation generator that simulates the optimal gradient descent direction and exploits the historical noise trajectory, as in MI-FGSM (Dong et al, 2018), and a texture-based approach that selects candidates via spectral residual detection, we accumulate a set of candidate first frames. An Actor-Critic agent then computes the importance of patches segmented equally in the initial frame and selects the least important region under the current state. Finally, an iterative boundary-walking strategy is utilized to compress the perturbation magnitude while maintaining the attack results within a specific region.

Preliminaries
We denote a video sample by v ∈ V ⊂ R^{N×H×W×C}, with N, H, W, C referring to the number of frames, height, width, and number of channels, respectively. A specific frame is denoted v_i (i ∈ {1, ..., N}), where N is the length of video v. Generally, SOT learns a tracking model T(v; θ): V → (B, S) by minimizing the regression loss between ground-truth and predicted bounding boxes in each frame and maximizing the similarity of predicted bounding boxes between adjacent frames. B ∈ R^{N×4} is the localization matrix, where each row [x_i, y_i, w_i, h_i] gives the x-axis and y-axis coordinates, width, and height of the predicted bounding box for v_i. Meanwhile, S collects the highest confidence score for each frame. According to the evaluation method, SOT can be divided into two categories. The first initializes only once per video, also called One Pass Evaluation (OPE). The second can restart the tracker several frames after a failure, as when testing trackers on the Visual Object Tracking Challenge 2018 (Kristan, 2018). The goal of an adversarial attack in SOT is to find an adversarial example v* that fools the network into producing a shifted or even target-lost bounding box in the sequence, while keeping v* within the ε-ball centered at v under an L_p norm, i.e., ‖v* − v‖_p ≤ ε, where p can be 1, 2, or ∞. In this paper, we mainly focus on the L∞ norm and SSIM similarity (Wang et al, 2004) for comparison to clean frames.
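For concreteness, the ε-ball constraint under the L∞ norm amounts to a simple clipping step. A minimal NumPy sketch (the toy frame shape and the [0, 255] pixel range are assumptions):

```python
import numpy as np

def project_linf(v_adv, v, eps):
    """Project an adversarial frame v_adv into the L-infinity eps-ball around
    the clean frame v, then clip back to the valid pixel range [0, 255]."""
    delta = np.clip(v_adv - v, -eps, eps)   # bound each pixel's perturbation
    return np.clip(v + delta, 0.0, 255.0)

# toy example: a single 4x4 grayscale "frame"
v = np.full((4, 4), 100.0)
v_adv = v + np.random.uniform(-30, 30, size=v.shape)  # oversized perturbation
v_proj = project_linf(v_adv, v, eps=8.0)
```

After projection, the perturbation never exceeds ε in any pixel, which is exactly the ‖v* − v‖∞ ≤ ε constraint above.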
Although there are multiple evaluation metrics for SOT across various challenges, we explore the two standards most commonly used for visual tracking, A and R, short for accuracy and robustness. A denotes the discounted average of the IoU scores over all frames whose predicted bounding boxes overlap the ground-truth ones, until the end of the video or reinitialization. R weights the tracking performance according to the number of failed frames in a discounted-reward manner. These two values can be calculated as:

A = (1/N) Σ_{i=1}^{N} γ_a^{⌊i/L⌋} · IoU_i,  R = Σ_{i=1}^{N} γ_r^{⌊i/L⌋} · 1[IoU_i = 0], with IoU_i = 0 if B̂_i and B_i do not overlap, (1)

where IoU_i represents the Intersection over Union between the predicted box B̂_i and the ground truth B_i. γ_a and γ_r are the discount factors for accuracy and robustness, highlighting the impact of future tracking performance; in our work, both are set to 0.9. Similar to SPARK (Guo et al, 2020), we split the video into L-length intervals based on a common frame rate (frames per second, FPS), so that weight factors within the same interval are equal but decrease exponentially in the long-term view. Generally, attacks on SOT can be categorized into untargeted and targeted attacks. An untargeted attack generates an adversarial example from either a long-term or short-term tracking perspective according to object motion, aiming to decrease the average IoU_i over the whole video clip, which in the best case causes the tracker to lose the target. In contrast, a targeted attack focuses on the object trajectory or the shape of the bounding box. In this paper, we mainly focus on untargeted attacks.
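A toy NumPy sketch of this interval-discounted scoring; the per-frame weighting below is our reading of Eq. (1) (frames in the same L-frame interval share one exponentially decayed weight), so treat it as illustrative:

```python
import numpy as np

def accuracy_robustness(ious, gamma_a=0.9, gamma_r=0.9, L=30):
    """Discounted accuracy A and robustness R over one video.
    A frame with IoU == 0 counts as a tracking failure."""
    ious = np.asarray(ious, dtype=float)
    intervals = np.arange(len(ious)) // L       # interval index per frame
    w_a = gamma_a ** intervals                  # accuracy discount per frame
    w_r = gamma_r ** intervals                  # robustness discount per frame
    A = float(np.sum(w_a * ious) / len(ious))   # discounted mean overlap
    R = float(np.sum(w_r * (ious == 0)))        # discounted failure count
    return A, R

# hypothetical per-frame IoUs: tracked for one second, then target lost
ious = [0.8] * 30 + [0.0] * 30
A, R = accuracy_robustness(ious)
```

With these inputs the first 30 frames carry weight 1.0 and the next 30 carry weight 0.9, so A = (30 · 0.8)/60 = 0.4 and R = 30 · 0.9 = 27.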

Heavy Perturbation Generator
In the first stage of our proposed pipeline, we generate a heavily perturbed initial frame. We synergistically exploit texture-based and momentum-based generators to produce adversarial candidates, diversifying adversarial directions and increasing the probability of successful perturbations. Taking texture-based perturbations as an example, we randomly select a certain number of videos from the current dataset and pick frames from these candidates with the same timestamp as the victim frame; in OPE scenarios in particular, the victim frame is frame #0. Then, considering that both human visual systems and video processing models concentrate on target locations contributing more to final results, we apply a Spectral Residual saliency approach (Hou and Zhang, 2007) on each candidate using a pixel-wise mask M_p ∈ {0, 1}^{S×W×H×C}, where S, W, H, C indicate the number of reinitializations (S = 1 for OPE), width, height, and the number of channels for each video. All candidates are then appended to the adversarial set V.

The momentum-based approach, on the other hand, accelerates gradient descent by accumulating a velocity vector in the gradient direction. IoU-Attack (Jia et al, 2021) leverages this concept and extends it to temporal correspondence among continuous frames. Inspired by these works, we present a novel spatial momentum-based approach applied to the initial, and most essential, frame of a video. As illustrated in Algorithm 1, by randomly sampling perturbing directions at each attack level εk, where ε indicates the magnitude of the L∞ norm, we craft adversaries along the historically optimal direction progressively, until we find a successful perturbation on the initial frame or the magnitude of the perturbation exceeds the ε-ball around v_0. Balanced by the trade-off factor ι, if the tracking performance decreases, we update the optimal gradient g_opt with momentum. With the two perturbation generators, we finally obtain an adversarial set V of heavily destroyed initial frames, which we feed into the next part of our pipeline. For simplicity, only the OPE-based case is summarized in Algorithm 1; cases with reinitialization (VOT2018) can be easily extended by repeating the process on all reinitialized frames step by step.

Algorithm 1 Momentum-based perturbation generation in OPE
Input: SOT tracker T, clean video v, adversarial video v* = v, maximum perturbation ε, candidate number C, momentum factor µ, trade-off factor ι, iterations k, initial gradient g_0, tracking performance TP = 1, adversarial candidate set V.
Output: adversarial candidate set V.
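The spectral residual saliency computation used by the texture-based generator can be sketched in plain NumPy; Hou and Zhang (2007) describe the method itself, while the 3×3 averaging kernel and the 2×-mean thresholding rule below are our assumptions:

```python
import numpy as np

def spectral_residual_mask(img, thresh_scale=2.0):
    """Binary saliency mask via the spectral residual method:
    saliency = |IFFT(exp(residual + i*phase))|^2, where the residual is the
    log-amplitude spectrum minus its locally averaged version."""
    f = np.fft.fft2(img)
    log_amp = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    H, W = log_amp.shape
    # local average of the log spectrum via a 3x3 box filter (wrap padding)
    padded = np.pad(log_amp, 1, mode="wrap")
    avg = sum(padded[i:i + H, j:j + W] / 9.0
              for i in range(3) for j in range(3))
    residual = log_amp - avg
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    # threshold: keep pixels whose saliency clearly exceeds the mean
    return (saliency > thresh_scale * saliency.mean()).astype(np.uint8)

# toy frame: flat background with a bright square "object"
img = np.zeros((64, 64))
img[24:40, 24:40] = 1.0
mask = spectral_residual_mask(img)
```

The resulting binary mask plays the role of M_p above, restricting texture perturbations to salient, target-surrounding regions.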

Actor-Critic Key Patch Selection
As illustrated above, some areas of the initial frame are more beneficial than others for the feature representation of the target object. Taking the video Bird1 in Figure 2 for instance, perturbations added to the corners matter much less than those on more significant, bird-surrounding regions. Therefore, removing redundant perturbations from those regions will not affect the overall attack results (or only marginally) but will decrease the adversarial magnitude of the perturbation. As shown in Figure 2, we impose a mask that is split into P × P patches and element-wise composed of all 1s. Considering computational efficiency as well as the average size of video frames across datasets, we treat P as a hyper-parameter and conduct a grid search. We then apply a reinforcement learning (RL)-based key patch selection framework, implemented by an Actor-Critic network Z, to select the least important patch. As shown in the second part of Figure 2, our network contains 5 convolutional layers, each followed by a max-pooling layer, with parameters shared between the Actor and Critic branches, extracting features of the newly added perturbations. However, the shapes of videos can vary even within the same tracking dataset, and resizing them to a fixed size may cause unwanted geometric distortion, which is extremely harmful for localizing objects in SOT. We therefore introduce a Spatial Pyramid Pooling (SPP) (He et al, 2016) strategy on top of the last convolutional layer to remove the fixed-size constraint of the network. Subsequently, we append 3 fully connected layers to estimate the best action the agent should take and the corresponding critic value.
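The SPP idea, pooling a variable-sized feature map into a fixed-length descriptor, can be sketched as follows (the pyramid levels are an assumption; the paper does not specify them):

```python
import numpy as np

def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
    """Max-pool an (H, W, C) feature map into fixed bins at several pyramid
    levels, yielding a size-independent descriptor of length
    C * sum(l*l for l in levels)."""
    H, W, C = feat.shape
    out = []
    for l in levels:
        # integer bin edges covering the map even when H, W aren't divisible by l
        hs = np.linspace(0, H, l + 1).astype(int)
        ws = np.linspace(0, W, l + 1).astype(int)
        for i in range(l):
            for j in range(l):
                out.append(feat[hs[i]:hs[i+1], ws[j]:ws[j+1], :].max(axis=(0, 1)))
    return np.concatenate(out)

# two feature maps of different spatial sizes map to equal-length descriptors
d1 = spatial_pyramid_pool(np.random.rand(17, 23, 8))
d2 = spatial_pyramid_pool(np.random.rand(30, 11, 8))
```

This is what lets the fully connected layers that follow accept frames of any resolution without geometric resizing.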
Generally, we consider key patch selection as a multi-step Markov Decision Process (MDP) with states, actions, a transition function, and a reward function. In our task, the state s_t at time step t is defined as the pixel-wise difference between v_0 and v*_0 masked by the current mask M_t ∈ R^{S×P×P}:

s_t = (v*_0 − v_0) ⊙ M_t, (3)

where ⊙ represents the Hadamard product. At time step 0, M_0 = {1}^{S×P×P}. An action a_t = Z(s_t) is an S × P^2 softmax matrix, indicating the patch in each initialized frame that is least important for successfully tracking the target at time step t. Once the agent chooses an action a_t, we set the corresponding element of M_t to 0; denoting this process as a function F, we update the state to s_{t+1} = F(s_t, a_t). The state s_{t+1} is terminal if a_t ∈ {a_0, a_1, ..., a_{t−1}}, or A(T(v_0 + s_{t+1})) / A(T(v_0 + s_0)) > τ_1, or R(T(v_0 + s_{t+1})) / R(T(v_0 + s_0)) < τ_2. Since SOT is inherently a regression problem with a continuous output space rather than a pure classification problem, a slight manipulation of the adversarial perturbation may be reflected in the final tracking results; we therefore introduce the ratio thresholds τ_1 and τ_2 to keep the attack results within an acceptable scale. Our goal is to delete less important patches while maximizing the long-term expected reward, so the reward at step t is designed to be positive when removing the selected patch keeps the accuracy and robustness ratios within the thresholds τ_1 and τ_2, and to penalize reaching a terminal state otherwise. In the offline training stage, we select a certain number of candidate videos generated in the previous step, then feed them into the policy network π_{θ_p}(a_t | s_t) and the critic network V_{θ_c}(s_t) to maximize the expected long-term reward with the PPO algorithm:

L(θ_p) = E_t [ min( r_t(θ_p) · A_{θ_p}(s_t, a_t), clip(r_t(θ_p), 1 − ρ, 1 + ρ) · A_{θ_p}(s_t, a_t) ) ],

where r_t(θ_p) is the probability ratio between the new and old policies, the advantage A_{θ_p}(s_t, a_t) = Q_{θ_p}(s_t, a_t) − V_{θ_c}(s_t), Q_{θ_p} is the Q-value calculated by discounting future rewards, V_{θ_c} is the critic value generated by the critic network, and ρ denotes the clip parameter that regularizes policy iterations.
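A single environment step of this MDP can be sketched as follows. The tracker queries are abstracted into precomputed `A_ratio` and `R_ratio` values, and the single-sequence (S = 1) mask and row-major patch indexing are our assumptions:

```python
import numpy as np

def mdp_step(mask, action, history, A_ratio, R_ratio, tau1=1.5, tau2=0.4):
    """Zero out the patch chosen by the agent and test the terminal condition:
    the episode ends if the action repeats, the accuracy ratio exceeds tau1,
    or the robustness ratio drops below tau2."""
    P = mask.shape[-1]
    terminal = (action in history) or (A_ratio > tau1) or (R_ratio < tau2)
    if terminal:
        return mask, history, True
    new_mask = mask.copy()
    new_mask[action // P, action % P] = 0   # remove the selected patch
    return new_mask, history + [action], False

mask = np.ones((4, 4))                       # P = 4 grid, all patches active
mask, hist, done = mdp_step(mask, action=5, history=[],
                            A_ratio=1.0, R_ratio=0.8)
```

Repeating the same action on the next step would immediately trigger the terminal condition, matching a_t ∈ {a_0, ..., a_{t−1}} above.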

Sign Attack Module
As indicated in Algorithm 2, after removing less important patch-level perturbations attached to the initial frames of videos, we obtain manipulated adversarial examples together with their tracking accuracy and robustness. We then need a boundary-walking method to compress the noise magnitude while keeping the attack results within a specific scope. As shown in part (c) of Figure 2, we iteratively update the victim frame v_0 until its perturbation magnitude is compressed from ε_1 to ε_3, while maintaining competitive attack results or even strengthening them. (Cheng et al, 2018) show that a black-box attack can be formulated as an optimization problem whose objective function is evaluated by a binary search with additional model queries; a zeroth-order optimization algorithm can then be applied to solve it. In this paper, we exploit the Sign-OPT algorithm in the Sign Attack Module.
In our approach, φ_d and g(φ_d) denote the designated search direction and the corresponding distance from the initial frame v_0 to its nearest adversarial example along φ_d that yields the same or similar tracking results within a predefined threshold. The objective function can be written as

g(φ_d) = min { λ > 0 : AR(T(v_0 + λ · φ_d / ‖φ_d‖)) ≤ κ },

which can be evaluated by a local binary search procedure. As the combined evaluation score on SOT, AR is denoted as

AR = γ · A(T(v_0 + s)) / A(T(v_0 + s_0)) + (1 − γ) · R(T(v_0 + s_0)) / R(T(v_0 + s)),

where γ is the trade-off factor and κ = (γ(τ_1 τ_2 − 1) + 1) / τ_2 is the corresponding threshold.
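The local binary search for g(φ_d) can be sketched as follows; `is_adversarial` stands in for the AR-based success check, and the tolerance and bracketing conditions are our assumptions:

```python
import numpy as np

def binary_search_distance(is_adversarial, v0, phi, lam_hi, tol=1e-3):
    """Smallest step lam along the unit direction phi/||phi|| such that
    v0 + lam * phi_hat still satisfies the adversarial criterion.
    Assumes v0 + lam_hi * phi_hat is adversarial while v0 itself is not."""
    phi_hat = phi / np.linalg.norm(phi)
    lam_lo = 0.0
    while lam_hi - lam_lo > tol:
        mid = (lam_lo + lam_hi) / 2.0
        if is_adversarial(v0 + mid * phi_hat):
            lam_hi = mid      # still adversarial: try a smaller step
        else:
            lam_lo = mid      # too weak: step further out
    return lam_hi

# toy criterion: "adversarial" once we move at least 2.5 units from v0
v0 = np.zeros(4)
phi = np.ones(4)
g_val = binary_search_distance(lambda x: np.linalg.norm(x - v0) >= 2.5,
                               v0, phi, lam_hi=10.0)
```

Each query to `is_adversarial` corresponds to one tracker evaluation, which is why the method below estimates only the sign of distance differences instead of evaluating g exactly.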
Computing g(φ_d + u) − g(φ_d) exactly to estimate the directional derivative would consume a huge number of queries. Given the varied and large dimensions of our input, we instead improve query complexity with an imperfect but informative estimate of the directional derivative: we exploit only the sign value and compute the gradient by sampling K Gaussian vectors:

∇g(φ_d) ≈ (1/K) Σ_{k=1}^{K} sign( g(φ_d + ρ_d u_k) − g(φ_d) ) · u_k, with u_k ~ N(0, I).

When starting an attack on a video, we initialize the perturbing direction as φ_d = v*_0 − v_0, where v*_0 is retrieved by sampling from v_0's candidate adversarial set V, which includes both texture-based and momentum-based perturbations. As detailed in Algorithm 2, trading off the magnitude of adversaries against their tracking performance, we rank the candidate list by TP and L_1 norm and pick the top-n target video clips for the attacked video.
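The sign-based gradient estimate can be sketched as follows; the distance oracle `g` here is a cheap analytic stand-in for the binary-search evaluation, not the tracker itself:

```python
import numpy as np

def sign_grad_estimate(g, phi, K=100, rho=1e-3, seed=0):
    """Sign-OPT style estimate: average sign(g(phi + rho*u) - g(phi)) * u
    over K Gaussian directions u ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    base = g(phi)
    grad = np.zeros_like(phi)
    for _ in range(K):
        u = rng.standard_normal(phi.shape)
        grad += np.sign(g(phi + rho * u) - base) * u   # one query per sample
    return grad / K

# toy distance oracle: g grows with the distance to a hypothetical target
target = np.array([1.0, -2.0, 0.5])
g = lambda phi: np.linalg.norm(phi - target)
phi = np.zeros(3)
est = sign_grad_estimate(g, phi)

# the estimate should point roughly along the true gradient of g at phi
true_dir = (phi - target) / np.linalg.norm(phi - target)
cos = est @ true_dir / (np.linalg.norm(est) * np.linalg.norm(true_dir))
```

Each sample costs a single evaluation of g, so K bounds the query budget per gradient step regardless of the input dimension.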

Experiments
In this section, we describe our experimental settings and analyze the effectiveness of the proposed DIMBA algorithm against different trackers on four challenging short-term and long-term datasets: OTB100 (Wu et al, 2015), VOT2018 (Kristan, 2018), UAV123 (Mueller et al, 2016), and LaSOT (Fan et al, 2019). Part of the qualitative tracking results produced by PrDiMP50 is shown in Figure 5.

Experimental Settings

Victim Models. As mentioned in the introduction, current tracking models can be divided into Siamese-based, discrimination, and reinforcement learning-based trackers. Considering overall tracking performance, we select one or more of the most representative trackers for each category: SiamRPN++ with AlexNet (Krizhevsky et al, 2012), MobileNetV2 (Sandler et al, 2018), and ResNet50 (He et al, 2016) backbones, DaSiamRPN (Zhu et al, 2018), PrDiMP (Danelljan et al, 2020), and the Action-Decision Network (Yun et al, 2017a).
Metrics. To fairly compare our attack results with the original tracking performance and with previous black-box attacks on SOT, standard evaluation methods are used. On OTB100 (Wu et al, 2015), UAV123 (Mueller et al, 2016), and LaSOT (Fan et al, 2019), we report precision and success plots in a one-pass evaluation (OPE) scenario. For the VOT2018 challenge (Kristan, 2018), we use a reinitialization mechanism five frames after the tracker loses the target.
Computing Infrastructure. We conduct experiments on a computer with three Nvidia GeForce RTX 2080Ti GPUs and one Nvidia GeForce RTX 3090 GPU, an Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz, running Ubuntu 18.04.5 LTS.

Implementation Details
Our experiment is implemented in PyTorch. In momentum-based perturbation generation, the maximum noise magnitude ε is 64, the candidate number C is 15, the iteration number k is 128, the momentum factor µ is 0.5, and the trade-off factor ι is 0.4. As with the momentum generator, the texture-based generator produces adversarial sets of capacity C as well.
To pretrain the Actor-Critic network for key patch selection, we set the PPO epoch, clipping parameter ρ, buffer capacity, and maximum gradient norm to 10, 0.2, 500, and 0.5, respectively. For the patch number P, we run a grid search over P ∈ {2, 4, 8, 16, 32}; balancing selection efficiency against the final impact on tracking performance, P is set to 16.
Similarly, the ratio thresholds τ_1 and τ_2 are set to 1.5 and 0.4, the trade-off factor γ is set to 0.4, the video candidate number n is set to 20 out of 30, the gradient candidate number K is 100, and the number of attack queries N_A is 60.

Overall Attack Results
Results on VOT2018. Table 1 compares the overall results of these trackers on the VOT2018 dataset. We evaluate randomly generated noise as well as perturbations computed by the IoU Attack (Jia et al, 2021) and compare them with our proposed method. Specifically, our algorithm outperforms the IoU Attack in accuracy on DaSiamRPN and ADNet by 8.45% and 5.82%, respectively. In terms of robustness, our approach exceeds the IoU Attack on SiamRPN++, DaSiamRPN, and ADNet by 9.32%, 3.21%, and 2.97%. For EAO (Expected Average Overlap) on SiamRPN++ and ADNet, we achieve 6.2% and 7.9% improvements.
Results on OTB100. As shown in Figure 3, we draw success and precision plots for trackers selected according to their categories and tested on OTB100. Compared to the original tracking performance, our black-box attack reduces the AUC score and visibly changes the shape of the curves. We also visualize the results of the white-box One-Shot Attack (Chen et al, 2020) for comparison. Table 2 reports the success and precision rates of original videos, random perturbations, the One-Shot Attack, the IoU Attack, and our method.
Results on UAV123 and LaSOT. As depicted in Figure 4, tracking results of different trackers are reported on UAV123 and LaSOT. With our attack, the AUC scores of the success plots on UAV123 decrease by 4.3%, 10.8%, and 17.4% for PrDiMP, SiamRPN++, and ADNet, respectively. Meanwhile, the corresponding scores on LaSOT are reduced by 6.6%, 9.0%, 22.5%, and 11.8% for PrDiMP, SiamRPN++, DaSiamRPN, and ADNet, respectively.

Ablation Study of Key Patch Selection
We conduct a series of experiments to evaluate the impact of the key patch selection module. The discrimination model PrDiMP is selected as our baseline, and tracking results on VOT2018 are shown in Figure 6. As Figure 6 shows, with key patch selection we need fewer queries in black-box settings to reach a similar perturbation magnitude. Meanwhile, the average IoU scores on unlost frames remain much smaller than those of the DIMBA attack without the key patch selection module.

Comparison with Previous Works
To our understanding, the overall computational complexity of the IoU Attack (Jia et al, 2021) is O(KNL), where K is the number of epochs for choosing perturbations on each frame, N is the candidate number of random noises, and L is the length of the video clip. In our algorithm, by contrast, the query complexity is reduced to O(KN + C), where C is a constant independent of L. The comparison in computational efficiency between the IoU Attack and our approach is given in Table 4. We also compare against the One-Shot Attack (Chen et al, 2020) in Table 5.

Conclusions
In this work, we propose an effective and efficient query-based black-box attack for SOT. An Actor-Critic key patch selection module is exploited to reduce redundant noise and increase query efficiency. Meanwhile, the combination of texture-based and momentum-based perturbation generators diversifies potential adversarial directions and induces heavily damaged tracking performance.
Compared with existing works, our method requires fewer queries on SOT and less perturbation from the perspective of the whole video clip, yet maintains competitive or even better attack results. Experiments on both long-term and short-term datasets across three major categories of trackers demonstrate the effectiveness of our framework. We hope this work elucidates the source of vulnerabilities in these trackers, optimistically paving the way for more robust ones.

Fig. 1
Fig. 1 Visualization of tracking results generated by trackers from three different tracking categories under the DIMBA attack, including SiamRPN++ (Li et al, 2019) (left), ADNet (Yun et al, 2017a) (middle), and PrDiMP (Danelljan et al, 2020) (right). Clipped frames above the charts qualitatively demonstrate the behaviors of trackers with or without attack. Green bounding boxes refer to ground truths, blue ones show original tracking results, and red ones illustrate failed tracking performance. The charts below indicate IoU scores between predicted bounding boxes and ground truths; the tracking performance with and without attack is represented by red and blue lines, respectively.

Fig. 2
Fig. 2 Overview of the DIMBA framework, which contains the heavy perturbation generator, key patch selection, and sign attack module. (a) The Heavy Perturbation Generator initially constructs candidate adversarial videos, originating from either the momentum-based or the texture-based approach; some adversaries are overly perturbed and are therefore sent to subsequent components. (b) Key Patch Selection then sets the mask value of over-perturbed patches to 0 based on an Actor-Critic network, whose structure is shown above. (c) The Sign Attack Module estimates gradients around designated directions optimized in previous steps and computes the final results.

Fig. 3
Fig. 3 Success and Precision Plots of trackers with or without adversarial attacks on OTB100 dataset

Fig. 4
Fig. 4 Success Plots of trackers with or without adversarial attacks on UAV123 and LaSOT

Fig. 5
Fig. 5 Illustration of clean and adversarial tracking results collected from the DIMBA attack on the PrDiMP50 tracker. Blue bounding boxes indicate originally predicted locations, while red ones show the attacked results.

Fig. 6
Fig. 6 Illustration of the ablation study on the key patch selection module of our proposed DIMBA attack. Results are averaged over the OTB100 dataset tracked by PrDiMP50. The left figure shows the fluctuation of perturbation magnitude with respect to query times, while the right one shows the relation between the average per-frame overlap score and the perturbation magnitude.
Algorithm 2 Key Patch Selection and Sign Attack Module in OPE
Input: SOT tracker T, clean video clip v, A2C pretrained policy θ_p, value network parameter θ_c, adversarial candidate set V, video candidate number n, gradient candidate number K, smoothing parameter ρ_d, direction learning step size α, number of attack queries N_A, and initial grid mask M
Output: adversarial example set V
1: Fine-tune the A2C network parameters θ_p and θ_c using the top-n adversarial videos from V, ranked by TP from Algorithm 1 and ‖V_i − v_0‖_∞ in ascending order
2: for i = 0 to n do
3: Apply policy θ_p to obtain the sparse mask M_i
...
8: Use a binary search to compute g(φ_d) with κ = (γ(τ_1 τ_2 − 1) + 1)/τ_2
9: end if
10: for n_A = 0 to N_A do
11: Randomly sample K vectors u_1, ..., u_K from the Gaussian distribution N(0, I)
...: Recompute g(φ_d) as above; V_i = v_0 + g(φ_d) · φ_d
17: end for
18: Rank V
19: return V

Table 1
Results on VOT2018, evaluated using Accuracy, Robustness, and EAO (Expected Average Overlap).

Table 3
Evolving success rate and precision based on perturbations within different scopes.

Table 4
Comparison of average query times between the IoU and DIMBA attacks using SiamRPN++(R).

Table 5
Evaluation on OTB100 between One-Shot Attack and DIMBA.