1 Introduction

Although deep learning has driven breakthroughs across artificial intelligence and machine learning over the past decade, several studies have revealed that Deep Neural Networks (DNNs) are vulnerable to adversarial perturbations (Goodfellow et al., 2015) on image processing tasks (Moosavi-Dezfooli et al., 2016; Szegedy et al., 2014; Xie et al., 2017). For images, such perturbations are often too small to be perceptible, yet they can completely fool a DNN classifier, detector, or segmentation model, causing it to predict incorrect categories or contours. This raises serious concerns as deep learning models are rapidly deployed in safety- and security-critical applications, e.g., self-driving cars, surveillance, drones, and robotics (Mnih et al., 2015). Beyond computer vision, recent works also investigate adversarial attacks on other tasks, e.g., natural language processing (Zhang et al., 2019a), audio recognition (Yakura & Sakuma, 2019), and malware detection (Grosse et al., 2017).

Single object tracking (SOT), one of the fundamental problems in computer vision, has recently improved tremendously through DNNs and plays a significant role in practical security applications such as self-driving systems and robotics (Mnih et al., 2015). In terms of the tracking procedure, trackers can be mainly divided into three categories: Siamese-based trackers (Bertinetto et al., 2016; Li et al., 2018; Zhang et al., 2019b; Zhu et al., 2018), discrimination trackers (Danelljan et al., 2019, 2020), and reinforcement learning-based trackers (Yun et al., 2017). Siamese-based trackers define tracking as a one-stage detection problem and locate, in each subsequent frame, the region whose feature representation is most similar to the initial template; our proposed algorithm fully exploits their reliance on the initial frame, especially the target region. In contrast, discrimination trackers predict object locations based on two sub-modules, target classification and target estimation. The third category, reinforcement learning-based trackers, formulates the whole tracking procedure as a Markov Decision Process and selects actions according to the agent state at the current step. After the concept of adversarial attack was proposed by Szegedy et al. (2014), many follow-up methods demonstrated various adversaries that deceive deep learning models (Goodfellow et al., 2015; Kurakin et al., 2017; Madry et al., 2019), and adversarial attacks on visual object tracking have also been explored by many works. For example, Yan et al. (2020a) proposed a Cooling-Shrinking Loss to train a perturbation generator, yielding an effective and efficient attack algorithm. Moreover, Guo et al. (2020) applied spatial-temporal sparse noise along targeted or untargeted trajectories. By categorizing the tracking problem into classification and regression branches, Chen et al. (2020) focused on free-model object tracking with dual attention.

However, current attack algorithms for SOT exhibit several limitations that may severely restrict their generality in practice. Specifically, we highlight the following disadvantages: (1) Most tracking adversaries cannot be extended to black-box SOT applications. Given comprehensive knowledge of the model architecture and parameters, many approaches can generate effective perturbations over the whole video clip by computing network gradients. However, the target network is often inaccessible in safety-critical scenarios, where only hard-label predictions are available during the whole tracking procedure. Practical black-box attack algorithms are therefore worth exploring. (2) Current methods often compose perturbations on multiple frames. Existing white-box attacks achieve powerful overall results, but most of them rely on noise attached to a large portion of frames. Although the initial frame of a video plays a vital role in SOT, few works pay attention to it, in either white-box or black-box scenarios. For instance, the Hijacking algorithm (Yan et al., 2020b) generates an adversary on a special clip of the video, and the IoU attack (Jia et al., 2021) proposes a continuous black-box attack framework imposed from the 2nd frame to the \(N\)-th frame. (3) Recent query-based black-box attacks on SOT do not consider computational efficiency. As far as we know, none of them takes query efficiency into account. Jia et al. (2021) focus on temporal correlations between adjacent frames. Their gradually increasing perturbation magnitude does influence the tracking performance, but its effectiveness relies heavily on the number of queries for each frame and the randomness of the Gaussian distribution.

Overall, unlike black-box attacks on image classification or segmentation, where perturbations are added to a single picture, the tracking performance in SOT is determined by the whole video clip, and adversaries are detected more easily as the number of perturbed frames increases. Meanwhile, gradient information is completely unavailable in black-box scenarios. It therefore seems that a query-based black-box attack must spend more queries to improve its adversarial results. This leads us to pose the following question:

  • Can we combine efficiency and effectiveness in black-box attack on SOT?

In other words, can we select the most fragile part of a video and obtain heavily shifted tracking results more quickly? In this paper, we propose the Discrete Masked Black-Box attack (DIMBA) algorithm. It is mainly inspired by the mechanism of SiamRPN-based trackers, which balance speed and performance by relying on the initial frame, and it generalizes to other types of trackers. In contrast to previous works, we are the first to introduce a decision-based attack strategy: we craft heavy perturbations on significant regions in the initial frame, then remove unnecessary noise and decrease the adversarial magnitude using a zeroth-order optimization algorithm. In summary, the key contributions of our paper are as follows:

  1. We formulate the query-based black-box attack problem on SOT in a query-efficient manner. Rather than recursively generating perturbations in each frame, we focus only on significant regions in the initial frame, and we are the first to introduce a decision-based attack strategy in adversarial SOT problems.

  2. We introduce a novel grid searching strategy that reduces unnecessary patch-based heavy perturbations on specific areas of the initial frame and increases the probability of generating perturbations with similar attack performance within a smaller perturbation radius.

  3. Comprehensive experiments on the OTB100, UAV123, LaSOT, and VOT2018 datasets show that the DIMBA attack generates perturbations more efficiently and achieves competitive or even better performance compared with SOTA black-box attacks on SOT.

2 Related works

2.1 Adversarial attacks on visual object tracking

The wide application of visual object tracking has led to numerous specialized real-world techniques, which in turn have attracted well-crafted attacks from the adversarial perspective. In the realm of physical-world attacks, Eykholt et al. (2018) analyzed adversarial stickers on stop signs in the context of autonomous driving to fool YOLO (Redmon et al., 2016). Jia et al. (2019) proposed a ‘tracking hijacking’ technique to fool multiple object trackers with imperceptible perturbations computed for object detectors in the perceptual pipeline of autonomous driving. Meanwhile, Yan et al. (2020a) developed an attacking technique to deceive single object trackers based on SiamRPN++ (Li et al., 2019). Their method trains a generator model to construct adversarial frames under a ‘cooling-shrinking’ loss, which is designed to cool down the hot target regions and force the bounding boxes to shrink during online tracking. Huang et al. (2020) delved into physical attacks on object detectors in the wild by developing a universal camouflage for object categories. One-shot attack (Chen et al., 2020) demonstrated that adversaries crafted in the first frame of a video clip can force trackers, especially SiamRPN-based ones, to lose the target in subsequent frames. Guo et al. (2020) proposed a spatial-aware attack (SPARK) to fool online trackers. This approach imposes an \(L_{p}\) constraint over perturbations while computing them incrementally based on previous frames. Extensive experiments show that their adversaries are capable of fooling multiple state-of-the-art trackers.

Unlike the above methods, which are proposed in white-box settings, Jia et al. (2021) explore the black-box attack by utilizing the temporal correspondence between adjacent frames and incrementally adding noise from the second frame onward. From the perspective of attack strategy, however, their method focuses on locally anchored noise between adjacent templates and, because of the temporal momentum, relies excessively on random perturbations in earlier frames being successful, which in essence sacrifices efficiency for generality. In this paper, we therefore make full use of the prior knowledge of search regions in the initial frame, which is especially present in SiamRPN-based trackers, to improve query efficiency. Formulating tracking as a one-shot detection problem, SiamRPN-based trackers aim to locate, in the search region of each frame, objects whose appearance is similar to the initial template. Although search regions are not considered in reinforcement learning-based trackers, the initial frame plays an important role as the starting point of iterative actions in RNN-based frameworks. Likewise, in discriminative tracking, the target classification and location regression modules can be affected by perturbations surrounding the object in the first frame.

In Table 1, we compare our proposed method with previous attack algorithms from different perspectives, including the knowledge of perturbed models, number of frames under adversarial attacks, transferability of adversaries between different trackers, and whether or not the proposed algorithm is a decision-based one.

Table 1 A high-level comparison with previous attack methods on visual object tracking

2.2 Deep reinforcement learning

Due to its ability to scale to previously intractable decision-making problems, Deep Reinforcement Learning (DRL) has been a growing area in recent years. Kickstarting this revolution, Mnih et al. (2015) first learned to play a range of Atari 2600 video games at a superhuman level directly from pixels, demonstrating that RL agents can be trained on raw, high-dimensional observations based on reward signals. As another standout success, AlphaGo (Silver et al., 2016) paralleled the historic achievement of IBM’s Deep Blue and defeated a human world champion in Go.

Over time, several types of RL algorithms have been introduced and they can be divided into three groups: Actor-Only, Critic-Only, and Actor-Critic methods. Policy gradient methods such as REINFORCE algorithms (Williams, 1992) are chiefly Actor-Only and optimized over a large set of parameterized policies. In contrast, Critic-Only methods including Q-learning (Watkins & Dayan, 1992) and SARSA (Sutton & Barto, 2018) approximate solutions to the Bellman equation and learn the optimal value functions. To combine the advantages of Actor-Only and Critic-Only methods, Actor-Critic methods generate continuous actions step by step, while the large variance in the policy gradients of an Actor is reduced by a Critic.

3 Methodology

In this section, we first introduce the preliminaries of our proposed attack method. As shown in Fig. 2, the general pipeline of our algorithm consists of three parts. First, we introduce a momentum-based and a patch-based perturbation generation process to accumulate heavily perturbed frames as candidate examples. Then a key-patch selection module divides the object-surrounding noise into different regions and computes the importance of each, so that we can remove less important patches step by step while keeping approximately the same attack results within a bounded range. Finally, an iterative boundary-walking strategy compresses the perturbation magnitude while maintaining attack results within a specific region. Fig. 1 illustrates how the IoU scores of trackers perturbed by our method change as the frame index increases. For simplicity, only One Pass Evaluation (OPE) is considered in the following sections.

Fig. 1 Visualization of tracking results generated by trackers from three different tracking categories under DIMBA Attack, including SiamRPN++ (Li et al., 2019) (left), ADNet (Yun et al., 2017) (middle), and PrDiMP50 (Danelljan et al., 2020) (right). Clipped frames above the chart qualitatively demonstrate the behaviors of trackers with or without attack. Green bounding boxes refer to ground truths, blue ones show original tracking results, and red ones illustrate failed tracking performance. The charts below indicate IoU scores between predicted bounding boxes and ground truths, and the tracking performance with and without attack is represented by red and blue lines, respectively (Color figure online)

3.1 Preliminaries

We denote a video sample by \(v \in {\mathcal {V}}\subset {{\mathbb {R}}^{N\times H\times W\times C}}\), where N, H, W, and C refer to the number of frames, height, width, and number of channels, respectively. A specific frame is denoted as \(v_{i}\ (i\in \{1,\ldots ,N\})\), where N is the length of video v. Generally, SOT learns a tracking model \({\mathcal {T}}(v;\theta ) : {\mathcal {V}} \rightarrow ({\mathcal {B}}, {\mathcal {S}})\) by minimizing the regression loss between ground truth and predicted bounding boxes in each frame and maximizing the similarity of predicted bounding boxes between adjacent frames. \({\mathcal {B}} \in {\mathbb {R}}^{(N-1)\times 4}\) is the localization matrix, where each row \([x_{i}, y_{i}, w_{i}, h_{i}]\) denotes the x-axis and y-axis coordinates, width, and height of the predicted bounding box for \(v_i\) (the initial frame and its ground truth bounding box are prior knowledge). Meanwhile, \({\mathcal {S}}\) collects the highest confidence score for each frame. SOT evaluation protocols fall into two categories. The first initializes the tracker only once per video and is called One Pass Evaluation (OPE). In contrast, the second restarts the tracker several frames after a failure, as in the Visual Object Tracking Challenge 2018 (Kristan, 2018). The goal of an adversarial attack in SOT is to find an adversarial example \(v^{*}\) that fools the network into predicting shifted or even target-lost bounding boxes in the sequence, while keeping \(v^{*}\) within the \(\epsilon\)-ball centered at v under the \(L_p\) norm, i.e., \(\Vert v^{*} - v\Vert _{p} \le \epsilon\), where p can be 1, 2 or \(\infty\). In this paper, we mainly focus on the \(L_\infty\) norm and SSIM similarity (Wang et al., 2004) for comparison to clean frames.

Although there are multiple evaluation metrics for SOT across various challenges, we adopt the two standards most commonly used for visual tracking, accuracy and robustness, denoted by \({\mathcal {A}}\) and \({\mathcal {R}}\). \({\mathcal {A}}\) is the discounted average of the IoU scores over the frames in which the predicted bounding box overlaps the ground truth, until the end of the video or reinitialization. \({\mathcal {R}}\) weights the tracking performance according to the number of failed frames in a discounted-reward manner. These two values are calculated as:

$$\begin{aligned} {\textit{IoU}}_i = \frac{|{\hat{B}}_i \cap B_i|}{|{\hat{B}}_i \cup B_i|},\quad ro_{i} =\left\{ \begin{array}{ll} 1, &{}\quad {\textit{IoU}}_i \in (0,1], \\ 0, &{}\quad {\textit{else}}. \end{array} \right. \end{aligned}$$
(1)
$$\begin{aligned} {\mathcal {A}} = \frac{1}{N}\sum _{i=1}^{N} (\gamma _{a})^{\lfloor i/L\rfloor }\, {\textit{IoU}}_i\, ro_i ,\quad {\mathcal {R}} = \sum _{i=1}^{N}(\gamma _{r})^{\lfloor i/L\rfloor }\, ro_{i} \end{aligned}$$
(2)

where \({\textit{IoU}}_i\) represents the Intersection over Union between the predicted box \({\hat{B}}_i\) and the ground truth \(B_i\). \(\gamma _{a}\) and \(\gamma _{r}\) are the discount factors for accuracy and robustness, which control the impact of later tracking performance across the video clip. Similar to SPARK (Guo et al., 2020), we split the video into several intervals of length L based on its frames per second (FPS).
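The two metrics in Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: boxes are taken as \([x, y, w, h]\) with \((x, y)\) the top-left corner, frames are indexed from 1, and the function names `iou_xywh` and `accuracy_robustness` are hypothetical rather than part of any released implementation.

```python
import numpy as np

def iou_xywh(box_a, box_b):
    """IoU of two [x, y, w, h] boxes; (x, y) is assumed to be the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_robustness(pred, gt, L, gamma_a=0.9, gamma_r=0.9):
    """Discounted accuracy A and robustness R following Eqs. (1)-(2)."""
    ious = np.array([iou_xywh(p, g) for p, g in zip(pred, gt)])
    ro = (ious > 0).astype(float)          # ro_i = 1 iff the boxes overlap
    k = np.arange(1, len(ious) + 1) // L   # interval index floor(i / L), i = 1..N (assumed)
    A = float(np.mean(gamma_a ** k * ious * ro))
    R = float(np.sum(gamma_r ** k * ro))
    return A, R
```

With perfect predictions, `A` and `R` reduce to the mean and the sum of the discount weights, which makes it easy to see how later intervals contribute less to both scores.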

Generally, attacks on SOT can be categorized into untargeted and targeted attacks. In this paper, we mainly focus on untargeted attacks, generating adversarial videos based on object motions to degrade the overall tracking performance or deviate the tracker across the whole video clip.

3.2 Heavy perturbation generator

In the first stage of our algorithm, we generate a group of heavily perturbed videos as candidate adversarial examples. To diversify perturbations and increase the probability of successful attacks, we jointly exploit patch-based and momentum-based perturbation generators.

In the patch-based perturbation generating process, we randomly select a certain number of candidate videos from the dataset that contains the attacked video. For each candidate video, we randomly pick a frame and crop an image patch using a window of the same size as the ground truth bounding box in the initial frame of the attacked video. This yields a set of cropped patches that can be added to the initial frame of the target video. In classification tasks such as video recognition or human action recognition, each video frame is fed to the underlying model as a whole to extract a feature representation. In contrast, the final objective of every SOT problem is to accurately locate the object in subsequent frames, so it is intuitive to craft noise on the region surrounding the object instead of on marginal regions. To do so, we randomly select a group of areas from the search region of the initial frame and craft patch perturbations over these areas; the resulting perturbed frames are then added to the adversarial candidate set \({\mathcal {V}}\).
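The patch-based generation step described above can be illustrated with the following minimal sketch. Several details are assumptions made only for illustration: the box convention `(x, y, w, h)` with a top-left corner, the choice to overwrite pixels with the donor crop, the "search region" being the box shifted by at most half its size, and the helper name `patch_candidates`.

```python
import numpy as np

def patch_candidates(init_frame, gt_box, donor_frames, num_candidates, rng):
    """Build heavily perturbed copies of the initial frame by pasting donor crops.

    gt_box is (x, y, w, h) with (x, y) the top-left corner (assumed convention).
    Donor frames are assumed to be at least as large as the box.
    """
    x, y, w, h = gt_box
    H, W = init_frame.shape[:2]
    candidates = []
    for _ in range(num_candidates):
        donor = donor_frames[rng.integers(len(donor_frames))]
        dH, dW = donor.shape[:2]
        # Crop a window with the same size as the ground truth box.
        dy = rng.integers(0, dH - h + 1)
        dx = rng.integers(0, dW - w + 1)
        patch = donor[dy:dy + h, dx:dx + w]
        # Paste it at a random offset near the object (hypothetical search
        # region: the box shifted by at most half its size, kept in-frame).
        px = rng.integers(max(0, x - w // 2), min(W - w, x + w // 2) + 1)
        py = rng.integers(max(0, y - h // 2), min(H - h, y + h // 2) + 1)
        cand = init_frame.copy()
        cand[py:py + h, px:px + w] = patch
        candidates.append(cand)
    return candidates
```

Each returned frame differs from the original only inside one box-sized window near the target, which matches the idea of concentrating noise around the object rather than on marginal regions.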

In addition, we propose a momentum-based perturbation generator, which estimates gradient directions by accumulating historical velocity vectors. IoU Attack (Jia et al., 2021) leverages this concept and extends it to the temporal correspondence among consecutive frames. Inspired by this, we explore the spatial correspondence following MI-FGSM. As illustrated in Algorithm 1, after collecting patch-based adversarial candidates in the set \({\mathcal {V}}\), for each perturbing level \(\frac{\epsilon }{k}\), where \(\epsilon\) is the overall adversarial magnitude and k is the number of iterations, we randomly sample C perturbing directions denoted as \(g^{'}\). Adversaries are then crafted progressively along the historically optimal direction until the magnitude of the perturbation exceeds the \(\epsilon\)-ball bound around the initial frame \(v_0\). If the tracking performance decreases, we update the optimal gradient \(g_{opt}\) with momentum, balanced by the trade-off factor \(\iota\). With the momentum-based generator, we obtain optimal adversarial frames at each perturbing level. In particular, if any of these adversaries provides better attack results than the previous patch-based perturbations, we can directly output it, as shown in Fig. 2. Cases with reinitialization (VOT2018) are handled by repeating the previous process on all reinitialized frames step by step.
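A rough sketch of such a momentum-guided random search is given below. It is not Algorithm 1 itself: the black-box oracle `score_fn` (lower means worse tracking), the MI-FGSM-style update `mu * g_opt + g / ||g||_1`, the sign step, and the omission of the trade-off factor \(\iota\) are all simplifying assumptions.

```python
import numpy as np

def momentum_search(v0, score_fn, eps=8.0, k=128, C=25, mu=0.5, rng=None):
    """Random-direction search with momentum inside the L_inf eps-ball around v0.

    score_fn(frame) -> float is a hypothetical black-box tracking score;
    the search keeps a candidate only when the score decreases.
    """
    rng = np.random.default_rng() if rng is None else rng
    v0 = v0.astype(np.float64)
    step = eps / k                      # one perturbing level
    adv, g_opt = v0.copy(), np.zeros_like(v0)
    best = score_fn(adv)
    for _ in range(k):
        for _ in range(C):
            g_new = rng.standard_normal(v0.shape)
            # Blend the historically best direction with a fresh sample.
            g = mu * g_opt + g_new / np.abs(g_new).sum()
            cand = adv + step * np.sign(g)
            cand = np.clip(cand, v0 - eps, v0 + eps)   # stay in the eps-ball
            cand = np.clip(cand, 0.0, 255.0)           # stay a valid image
            s = score_fn(cand)
            if s < best:                # tracking got worse: accept and remember
                best, adv, g_opt = s, cand, g
    return adv, best
```

The defaults mirror the values reported later for \(\epsilon\), k, C, and \(\mu\); in practice each call to `score_fn` would be one query to the tracker, so k times C bounds the query budget of this stage.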

Fig. 2 Overview of the DIMBA framework, which contains a heavy perturbation generator, key patch selection, and a sign attack module. (a) The Heavy Perturbation Generator initially constructs candidate adversarial videos, originating from either the momentum-based approach or the patch-based approach. (b) Key Patch Selection sets the mask value of heavily perturbed patches to 0 based on an Actor-Critic network, whose structure is shown above. (c) The Sign Attack Module estimates gradients around designated directions calculated in previous steps and compresses the adversarial magnitude while maintaining attack results within a specific region

3.3 Key patch selection

As illustrated above, some areas in the initial frame contribute more to the feature representation of the target object than others. Take the video Bird1 in Fig. 2 as an example: perturbations added on the edges of the bounding box affect the tracking performance much less than those surrounding the object. Therefore, removing perturbations attached to those regions does not affect the overall attack results but increases the similarity between the original and perturbed frames. As shown in Fig. 2, we impose a mask that is split into \({\mathcal {P}}\times {\mathcal {P}}\) patches and initialized with all 1s. Considering computational efficiency as well as the average size of video frames across different datasets, we treat \({\mathcal {P}}\) as a hyper-parameter and tune it by grid search. We then apply a reinforcement learning (RL)-based key patch selection framework, implemented by an Actor-Critic network \({\mathcal {Z}}\), to select the least important patch step by step until the RL agent enters a terminal state.

As shown in the second part of Fig. 2, our network contains 5 convolutional layers, each followed by a max-pooling layer. Their parameters are shared between the Actor and Critic branches, and they extract features of the newly added perturbations. However, the shape of videos can vary even within the same tracking dataset. Resizing them to a fixed size may result in unwanted geometric distortion, which is extremely harmful to localizing objects in SOT. Therefore, we introduce a Spatial Pyramid Pooling (SPP) (He et al., 2016) strategy on top of the last convolutional layer to remove the fixed-size constraint of the network. Subsequently, we append 3 fully connected layers to estimate the best action for the agent and the corresponding critic value.

Generally, we consider the key patch selection as a multi-step Markov Decision Process (MDP), which contains states, actions, transition function, and a reward function. In our task, the state \(s_{t}\) at time step t is defined as the pixel-wise difference between \(v_{0}\) and \(v_{0}^*\) masked by the current mask \(M_t\in {\mathbb {R}}^{S\times {\mathcal {P}}\times {\mathcal {P}}}\). It can be denoted as:

$$\begin{aligned} s_{t} = (v_{0}^* - v_{0})\odot M_{t} \end{aligned}$$
(3)

where \(\odot\) represents the Hadamard product. At time step 0, \(M_0\) is \(\{1\}^{S\times {\mathcal {P}}\times {\mathcal {P}}}\). An action \(a_t = {\mathcal {Z}}(s_t)\) refers to an \(S\times {\mathcal {P}}^2\) softmax matrix indicating, for each initial frame, the patch that is least important for successfully tracking the target at time step t. Once the agent chooses an action \(a_{t}\), we set the corresponding element in \(M_t\) to 0.

Denoting this process as a function \({\mathcal {F}}\), we can update the state to

$$\begin{aligned} s_{t+1} = (v_{0}^* - v_{0})\odot {\mathcal {F}}(M_{t}, a_{t}) \end{aligned}$$
(4)
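The state update of Eqs. (3) and (4) can be sketched for a single initial frame as follows. The nearest-patch expansion of the \({\mathcal {P}}\times {\mathcal {P}}\) mask to pixel resolution and the helper names `expand_mask`, `masked_state`, and `apply_action` are assumptions introduced for this illustration; the action is taken to be the flat index of the selected patch.

```python
import numpy as np

def expand_mask(mask, H, W):
    """Expand a P x P patch mask to H x W pixels (nearest-patch assignment)."""
    P = mask.shape[0]
    rows = (np.arange(H) * P) // H
    cols = (np.arange(W) * P) // W
    return mask[np.ix_(rows, cols)]

def masked_state(v0_adv, v0, mask):
    """s_t = (v0* - v0) ⊙ M_t for one H x W x C frame, as in Eq. (3)."""
    H, W = v0.shape[:2]
    pixel_mask = expand_mask(mask, H, W)[..., None]   # broadcast over channels
    return (v0_adv.astype(np.float64) - v0.astype(np.float64)) * pixel_mask

def apply_action(mask, action):
    """F(M_t, a_t): zero out the patch with flat index `action` (Eq. 4)."""
    new_mask = mask.copy()
    new_mask[np.unravel_index(action, mask.shape)] = 0
    return new_mask
```

For example, with an 8 x 8 frame and \({\mathcal {P}} = 4\), each mask entry governs a 2 x 2 pixel block, so removing one patch clears the perturbation on exactly that block and leaves the rest of the state unchanged.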

\(s_{t+1}\) will be the terminal state if \(a_{t} \in \{a_{0}, a_{1}, \ldots , a_{t-1}\}\) or \(\frac{{\mathcal {A}}({\mathcal {T}}(v_{0}+s_{t+1}))}{{\mathcal {A}}({\mathcal {T}}(v_{0}+s_{0}))} > \tau _1\) or \(\frac{{\mathcal {R}}({\mathcal {T}}(v_{0}+s_{t+1}))}{{\mathcal {R}}({\mathcal {T}}(v_{0}+s_{0}))} < \tau _2\). Since SOT is inherently a regression problem within the continuous output space instead of a pure classification problem, slight manipulation of the adversarial perturbation may be reflected in the final tracking results. Therefore we introduce ratio thresholds \(\tau _1\) and \(\tau _2\) to maintain the attack results within an acceptable scale. Generally, our goal is to delete less important patches and maximize the long-term expected reward, therefore we design the reward in step t as

$$\begin{aligned} r_t = \left\{ \begin{array}{ll} 0,&{}\quad a_{t} \in \{a_{0}, a_{1},\ldots , a_{t-1}\};\\ -1,&{}\quad \frac{{\mathcal {A}}({\mathcal {T}}(v_{0}+s_{t+1}))}{{\mathcal {A}}({\mathcal {T}}(v_{0}+s_{0}))}> \tau _1\;{\textit{or}}\;\frac{{\mathcal {R}}({\mathcal {T}}(v_{0}+s_{t+1}))}{{\mathcal {R}}({\mathcal {T}}(v_{0}+s_{0}))} < \tau _2;\\ \gamma \frac{{\mathcal {A}}({\mathcal {T}}(v_{0}+s_{0}))}{{\mathcal {A}}({\mathcal {T}}(v_{0}+s_{t+1}))} +(1-\gamma )\frac{{\mathcal {R}}({\mathcal {T}}(v_{0}+s_{t+1}))}{{\mathcal {R}}({\mathcal {T}}(v_{0}+s_{0}))},&{}\quad {\textit{else}} \end{array}\right. \end{aligned}$$
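The piecewise reward above translates directly into a small function. The sketch below assumes the accuracy and robustness of the tracker on the initial and current perturbed frames have already been measured and are strictly positive; the function name `patch_reward` and its argument names are hypothetical, and the default thresholds follow the values reported in the implementation details.

```python
def patch_reward(acc_new, rob_new, acc_init, rob_init, repeated,
                 tau1=1.5, tau2=0.4, gamma=0.5):
    """Reward for removing one patch, following the three cases of r_t.

    acc_* / rob_* are accuracy and robustness after the current removal
    (new) and with the full perturbation (init); all assumed > 0.
    """
    if repeated:                          # the agent re-selected an old patch
        return 0.0
    if acc_new / acc_init > tau1 or rob_new / rob_init < tau2:
        return -1.0                       # attack result left the accepted range
    return gamma * acc_init / acc_new + (1.0 - gamma) * rob_new / rob_init
```

When a removal leaves both metrics unchanged, the reward is exactly 1, so the agent is encouraged to keep deleting patches as long as the attack result stays within the thresholds.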

In the offline training stage, we select a certain number of candidate videos generated in the previous step, then feed them into the policy network \(\pi _{\theta _p}(a_t\mid s_t)\) and the critic network \(V_{\theta _c}(s_t)\) to maximize the expected long-term reward with the PPO algorithm, whose objective is written as

$$\begin{aligned} L(\theta _{p})=\sum _{(s_t, a_t)}\min \left( \frac{\pi _{\theta _p}(a_t\mid s_t)}{\pi _{\theta _p^{{\textit{old}}}}(a_t\mid s_t)}, {\textit{clip}}\left( \frac{\pi _{\theta _p}(a_t\mid s_t)}{\pi _{\theta _p^{{\textit{old}}}}(a_t\mid s_t)}, 1-\rho , 1+\rho \right) \right) A_{\theta ^{{\textit{old}}}_p}(s_t, a_t) \end{aligned}$$
(5)

where \(A_{\theta _p}(s_t, a_t) = Q_{\theta _p}(s_t, a_t)-V_{\theta _c}(s_t)=\gamma ^{T-t}V(s_T)+\gamma ^{T-t-1}r_{T-1}+\dots +r_t-V_{\theta _c}(s_t)\), \(Q_{\theta _p}\) is the Q-value calculated by discounting future rewards, and \(V_{\theta _c}\) is the critic value generated by the critic network. \(\rho\) denotes the clip parameter that regularizes policy updates.
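To make the advantage and the clipped objective concrete, here is a minimal NumPy sketch that evaluates them for one finished episode. It follows Eq. (5) as written, with the advantage multiplying the minimum of the raw and clipped probability ratios; note that the widely used PPO-Clip objective instead takes the minimum of the two products. The function names and the argument `discount` (the \(\gamma\) of the advantage formula) are illustrative assumptions.

```python
import numpy as np

def advantages(rewards, values, bootstrap_value, discount):
    """A_t = discounted return (bootstrapped with V(s_T)) minus V(s_t)."""
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns - np.asarray(values, dtype=float)

def clipped_surrogate(ratio, adv, rho=0.2):
    """Sum over samples of min(ratio, clip(ratio, 1-rho, 1+rho)) * advantage."""
    ratio = np.asarray(ratio, dtype=float)
    clipped = np.clip(ratio, 1.0 - rho, 1.0 + rho)
    return float(np.sum(np.minimum(ratio, clipped) * np.asarray(adv, dtype=float)))
```

In a training loop, `ratio` would be the new-to-old policy probability ratio of each selected patch, and the surrogate would be maximized with respect to the policy parameters by an automatic-differentiation framework rather than evaluated in NumPy.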

3.4 Sign attack module

As indicated in Algorithm 2, after removing less important patch-level perturbations attached to the initial frames of videos, we obtain the manipulated adversarial examples as well as their tracking accuracy and robustness. We then need a boundary-walking method to compress the noise magnitude while maintaining attack results within a specific scope. As shown in part (c) of Fig. 2, we iteratively update the victim frame \(v_0\) until the perturbation magnitude is compressed from \(\epsilon _1\) to \(\epsilon _3\), while maintaining competitive attack results or even strengthening them. Cheng et al. (2018) state that a black-box attack can be formulated as an optimization problem whose objective function can be evaluated by a binary search with additional model queries. A zeroth-order optimization algorithm can then be applied to solve this problem. In this paper, we exploit the Sign-OPT algorithm in the Sign Attack Module.

In our approach, \(\phi _d\) and \(g(\phi _d)\) indicate our designated search direction and corresponding distance from the initial frame \(v_{0}\) to its nearest adversarial example that has the same or similar tracking results within a predefined threshold along \(\phi _d\). The objective function can be written as

$$\begin{aligned} \underset{\phi _d}{\min }\;g(\phi _d),\quad {\textit{where}}\;g(\phi _d)=\underset{\lambda }{\arg \min }\left( \mathcal{AR}\left( {\mathcal {T}}\left( v_0+\lambda \frac{\phi _d}{\Vert \phi _d\Vert };\theta \right) \right) \le \frac{\gamma (\tau _1\tau _2-1)+1}{\tau _2}\right) \end{aligned}$$
(6)

where \(\tau _1\) and \(\tau _2\) are the hyper-parameters used in Key Patch Selection. As the evaluation result of SOT, \(\mathcal{AR}\) is defined as \(\gamma \frac{{\mathcal {A}}({\mathcal {T}}(v_0+\lambda \frac{\phi _d}{\Vert \phi _d\Vert }))}{{\mathcal {A}}({\mathcal {T}}(v_0+s_0))}+(1-\gamma )\frac{{\mathcal {R}}({\mathcal {T}}(v_0+s_0))}{{\mathcal {R}}({\mathcal {T}}(v_0+\lambda \frac{\phi _d}{\Vert \phi _d\Vert }))}\). Estimating the directional derivative of g accurately from differences \(g(\phi _d +u)-g(\phi _d)\) would consume a huge number of queries and computational resources. Because our inputs vary in size and are high-dimensional, we reduce the query complexity with an imperfect but informative estimate of the directional derivative. Specifically, we exploit only the sign of the difference and compute the gradient by sampling K Gaussian vectors:

$$\begin{aligned} {\hat{\nabla }}g(\phi _d) = \frac{1}{K}\sum _{k=1}^{K}\operatorname{sign}\left( g(\phi _d+\rho _d u_k)-g(\phi _d)\right) u_k \end{aligned}$$
(7)
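The estimator in Eq. (7) and the binary search behind \(g\) can be sketched as follows. This is an illustrative sketch, not the Sign Attack Module itself: `is_adv` stands for a hypothetical boolean oracle that returns whether the thresholded tracking criterion of Eq. (6) is met, `g` is any callable returning the boundary distance along a direction, and the code evaluates Eq. (7) literally with K + 1 calls to `g`.

```python
import numpy as np

def distance_to_boundary(is_adv, v0, phi, hi, tol=1e-3):
    """Binary search for the smallest lambda in (0, hi] with
    is_adv(v0 + lambda * phi / ||phi||) true; returns inf if none is found."""
    d = phi / np.linalg.norm(phi)
    if not is_adv(v0 + hi * d):
        return np.inf
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_adv(v0 + mid * d):
            hi = mid
        else:
            lo = mid
    return hi

def sign_gradient(g, phi, K=100, rho_d=1e-2, rng=None):
    """Sign-based estimate of the gradient of g at phi, as in Eq. (7)."""
    rng = np.random.default_rng() if rng is None else rng
    g0 = g(phi)
    grad = np.zeros_like(phi, dtype=float)
    for _ in range(K):
        u = rng.standard_normal(phi.shape)          # Gaussian probe direction
        grad += np.sign(g(phi + rho_d * u) - g0) * u
    return grad / K
```

A descent step would then move \(\phi _d\) against the estimated gradient and re-run the binary search, so that the distance \(g(\phi _d)\), and hence the perturbation magnitude, shrinks while the attack criterion still holds.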

When starting an attack on a video, we initialize the perturbing direction as \(\phi _d=\frac{v_0^*-v_0}{\Vert v_0^*-v_0\Vert }\), where \(v_0^*\) is sampled from the candidate adversarial set \({\mathcal {V}}\) of \(v_0\), which includes patch-based and momentum-based perturbations. As detailed in Algorithm 2, to trade off the magnitude of adversaries against their tracking performance, we rank the candidate list by \(\mathcal{TP}\) and the \(L_1\) norm and pick the top-n target video clips for the attacked video.

4 Experiments

In this section, we describe our experimental settings and analyze the effectiveness of the proposed DIMBA algorithm against different trackers on four challenging short-term or long-term datasets, including OTB100 (Wu et al., 2015), VOT2018 (Kristan, 2018), UAV123 (Mueller et al., 2016), and LaSOT (Fan et al., 2019). Part of the qualitative tracking results performed by SiamRPN++ is shown in Fig. 3.

Fig. 3 Illustration of clean and adversarial tracking results of SiamRPN++ tested on OTB100. Green bounding boxes indicate ground truth locations, blue ones show originally predicted locations, and red ones show locations predicted under adversarial attack (Color figure online)

4.1 Experimental settings

Victim models As mentioned in Sect. 1, current tracking models can be divided into Siamese-based, discrimination, and reinforcement learning-based trackers. Considering overall tracking performance, we select one or more representative trackers for each category: SiamRPN++ with MobileNetv2 (Sandler et al., 2018) and ResNet50 (He et al., 2016) backbones, DaSiamRPN (Zhu et al., 2018), PrDiMP50 (Danelljan et al., 2020), TrTr (Zhao et al., 2021), and Action-Decision Network (ADNet) (Yun et al., 2017). Specifically, SiamRPN++(R) uses ResNet50 as the backbone model, while SiamRPN++(M) uses MobileNetv2.

Metrics To fairly compare our attack results with the original tracking performance and previous black-box attacks on SOT, we use standard evaluation methods. When testing DIMBA on OTB100 (Wu et al., 2015), UAV123 (Mueller et al., 2016) and LaSOT (Fan et al., 2019), we report precision and success rate in a one-pass evaluation (OPE) scenario. For the VOT2018 challenge (Kristan, 2018), we apply a reinitialization mechanism five frames after the tracker loses the target.

Computing infrastructures We conduct experiments on a computer with three Nvidia GeForce RTX 2080Ti and one Nvidia GeForce RTX 3090 GPUs, an Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz, running Ubuntu 18.04.5 LTS.

4.2 Implementation details

Our experiment is implemented in PyTorch. In momentum-based perturbation generation, the maximum noise magnitude \(\epsilon\) is 8 (following the imperceptible settings of One-Shot Attack), the candidate number C is 25, the iteration number k is 128, the momentum factor \(\mu\) is 0.5, and the trade-off factor \(\iota\) is 0.4. Like the momentum-based generator, the patch-based generator produces adversarial sets with capacity C. \(\gamma _a\) and \(\gamma _r\) are both set to 0.9.

To pre-train the Actor-Critic network for key patch selection, we set the number of PPO epochs, the clipping parameter \(\rho\), the buffer capacity, and the maximum gradient norm to 10, 0.2, 500, and 0.5, respectively. For the patch number \({\mathcal {P}}\), we run a grid search over 2, 4, 8, 16, and 32. To balance selection efficiency against the final impact on tracking performance, we set \({\mathcal {P}}\) to 16.
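For readers unfamiliar with the clipping parameter, the sketch below shows the standard PPO clipped surrogate objective with \(\rho = 0.2\). We assume here that the Actor-Critic network is trained with this standard form; the function name and the list-based inputs are hypothetical.

```python
def ppo_clipped_objective(ratios, advantages, rho=0.2):
    """Average PPO clipped surrogate objective over a batch.

    ratios: probability ratios pi_new(a|s) / pi_old(a|s), one per sample.
    advantages: advantage estimates, one per sample.
    """
    terms = []
    for r, a in zip(ratios, advantages):
        # Restrict the ratio to [1 - rho, 1 + rho].
        clipped = max(1.0 - rho, min(1.0 + rho, r))
        # Take the pessimistic (smaller) of the two surrogate terms.
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)
```

The clip removes the incentive to move the policy ratio outside the interval around 1, which limits how far a single update can shift the patch selection policy.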

In the same way, the ratio thresholds \(\tau _1\) and \(\tau _2\) are set to 1.5 and 0.4, respectively. The trade-off factor \(\gamma\) is set to 0.5, the video candidate number n is set to 20 out of 30, the gradient candidate number K is set to 100, and the number of attack queries \({\mathcal{N}}_{{\mathcal{A}}}\) is 60.

4.3 Overall attack results

Results on VOT2018 Table 2 compares the overall results of these trackers on the VOT2018 dataset. We evaluate randomly generated noise as well as perturbations computed by IoU Attack (Jia et al., 2021) and compare both with our proposed method. Specifically, our algorithm outperforms IoU Attack in terms of accuracy on DaSiamRPN and ADNet by 8.45% and 5.82%, respectively. In terms of robustness, our approach exceeds IoU Attack on SiamRPN++(ResNet50), DaSiamRPN, and ADNet by 9.32%, 3.21%, and 2.97%, respectively. For EAO (expected average overlap) on SiamRPN++ and ADNet, we achieve improvements of 6.2% and 7.9%, respectively.

Table 2 Attack results of SiamRPN++ (Li et al., 2019), DaSiamRPN (Zhu et al., 2018), PrDiMP50 (Danelljan et al., 2020), ADNet (Yun et al., 2017), and TrTr (Zhao et al., 2021) on VOT2018 (Kristan, 2018), evaluated using Accuracy, Robustness, and EAO (expected average overlap)

Results on OTB100 Fig. 4 shows the success and precision plots of trackers selected from each category and tested on OTB100. Compared with the original tracking performance, our black-box attack reduces the AUC score and visibly changes the shape of the curves. For comparison, we also plot the results of the white-box One-Shot Attack (Chen et al., 2020). Table 3 reports the success and precision rates for original videos, random perturbations, One-Shot Attack, IoU Attack, and our method.

Fig. 4

Illustration of success plots and precision plots tested on OTB100, UAV123, and LaSOT. Success plots are summarized by the AUC over different overlap thresholds, while precision plots are evaluated at a center location error threshold of 20 pixels. The numbers in brackets in front of tracker names denote AUC scores

Table 3 Attack results of SiamRPN++(ResNet50), SiamRPN++(MobileNetv2), DaSiamRPN, ADNet, TrTr, and PrDiMP50 on OTB100 (Wu et al., 2015), evaluated using success rate and precision

Results on UAV123 and LaSOT Fig. 4 also shows the tracking results of different trackers on UAV123 and LaSOT. With our attack, the AUC scores of the success plots on UAV123 decrease by 4.3%, 10.8%, and 17.4% for PrDiMP50, SiamRPN++(ResNet50), and ADNet, respectively. Likewise, the AUC scores of the success plots on LaSOT decrease by 6.6%, 9.0%, 22.5%, and 11.8% for PrDiMP50, SiamRPN++, DaSiamRPN, and ADNet, respectively.

4.4 Ablation study of key patch selection

We conduct a series of experiments to evaluate the impact of the key patch selection module. SiamRPN++(R), DaSiamRPN, and ADNet are selected as baselines, and tracking results on OTB100 and UAV123 are shown in Fig. 5. With Key Patch Selection, fewer queries are needed in the black-box setting to reach a similar perturbation magnitude \(\epsilon\). Moreover, as shown in Fig. 5, the average IoU scores in each subinterval under Key Patch Selection mostly remain lower than those obtained with random patch selection or without the Key Patch Selection module.

Fig. 5

Illustration of the ablation study on the key patch selection module of our proposed DIMBA attack. Results are obtained on OTB100 and UAV123 with SiamRPN++(ResNet50), DaSiamRPN, and ADNet. Yellow bars indicate the percentage of query times in 8 subintervals from 0–200. Red lines represent ratio changes in \(l_\infty\)-norm adversarial magnitude, while blue lines show changes of average overlap scores in each interval. One set of line markers denotes changes under our proposed key patch selection algorithm, and the other refers to randomly selected patches (Color figure online)
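The per-interval statistics in this figure can be sketched as a simple binning step. The snippet below is one possible reading of the caption, not the plotting code used for Fig. 5: it assumes each record is a hypothetical `(query_count, iou)` pair, splits the range 0 to 200 into 8 equal subintervals, and reports the share of records and the mean IoU per subinterval.

```python
def bin_queries(records, lo=0, hi=200, bins=8):
    """Bin (query_count, iou) pairs into equal-width subintervals.

    Returns (shares, mean_iou): the percentage of records per bin and
    the mean IoU per bin (None for empty bins).
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    iou_sums = [0.0] * bins
    for queries, overlap in records:
        if not lo <= queries <= hi:
            continue  # ignore records outside the plotted range
        # The upper edge (queries == hi) falls into the last bin.
        idx = min(int((queries - lo) / width), bins - 1)
        counts[idx] += 1
        iou_sums[idx] += overlap
    total = sum(counts)
    shares = [100.0 * c / total if total else 0.0 for c in counts]
    mean_iou = [s / c if c else None for s, c in zip(iou_sums, counts)]
    return shares, mean_iou
```

Under this reading, a selection strategy is better when its mass sits in the lower query bins while its mean IoU per bin stays low.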

4.5 Comparison with previous works

To our understanding, the overall computational complexity of IoU Attack (Jia et al., 2021) is \({\mathcal {O}}(KNL)\), where K is the number of epochs for choosing perturbations on each frame, N is the number of random noise candidates, and L is the length of the video clip. In our algorithm, the query complexity is reduced to \({\mathcal {O}}(KN+C)\), where C is a constant independent of L. Table 4 compares the query efficiency of the query-based black-box IoU Attack and our proposed method. We also compare the SSIM similarity between clean and adversarial videos to assess the visual side effects of our algorithm; the results are shown in Table 5. Except for some specific cases, our algorithm achieves better SSIM similarity than the query-based IoU Attack.
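The difference between the two complexity terms can be illustrated with a small worked example. The values of K, N, C, and L used in the usage note are hypothetical and chosen only to show the scaling; asymptotic notation hides constant factors, so these are not the measured query counts reported in Table 4.

```python
def per_frame_attack_queries(K, N, L):
    """Query count that scales as K * N * L (one search per frame)."""
    return K * N * L

def clip_level_attack_queries(K, N, C):
    """Query count that scales as K * N + C, independent of clip length."""
    return K * N + C

def query_growth(K, N, C, lengths):
    """Tabulate both counts for several clip lengths L."""
    return [
        (L, per_frame_attack_queries(K, N, L), clip_level_attack_queries(K, N, C))
        for L in lengths
    ]
```

For instance, with the hypothetical values K = 2, N = 3, and C = 25, `query_growth(2, 3, 25, [100, 1000])` gives 600 and 6000 queries for the per-frame scheme but 31 for the clip-level scheme at both lengths, which is why the gap widens on long-term datasets.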

Table 4 Comparison of average query times between IoU Attack and our proposed method, tracked by SiamRPN++(R), DaSiamRPN, PrDiMP50, and ADNet, tested on OTB100, UAV123, and LaSOT
Table 5 Average SSIM similarity between clean videos and perturbed videos from OTB100, UAV123, and LaSOT, tracked by SiamRPN++(R), PrDiMP50, and ADNet
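For reference, the SSIM index used in Table 5 can be sketched as follows. This is a simplified single-window version computed over all pixels of a frame at once, whereas standard implementations average SSIM over local windows; the flat-list frame representation, the 0 to 255 data range, and the function name are assumptions for illustration.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized frames (flat pixel lists)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Standard stabilizing constants with k1 = 0.01 and k2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

A video-level score can then be obtained by averaging the per-frame values between the clean and perturbed clips; values closer to 1 indicate a less visible perturbation.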

5 Conclusions

In this work, we propose an effective and efficient query-based black-box attack for SOT. An Actor-Critic key patch selection module is used to reduce redundant noise and increase query efficiency. Meanwhile, the combination of patch-based and momentum-based perturbation generators diversifies the potential adversarial directions and severely degrades tracking performance. Compared with existing works, our method requires fewer queries on SOT and less perturbation over a whole video clip, yet achieves competitive or even better attack results. Experiments on both long-term and short-term datasets across three major categories of trackers demonstrate the effectiveness of our framework. We hope this work can help elucidate the source of vulnerabilities in these trackers and pave the way for more powerful ones.