
1 Introduction

Deep neural networks (DNNs or networks, for simplicity) have been developed for a variety of tasks, including malware detection [11], abnormal network activity detection [31], and self-driving cars [5, 6, 32]. A classification network N can be used as a decision-making algorithm: given an input \(\alpha \), it suggests a decision \(N(\alpha )\) among a set of possible decisions. While the accuracy of neural networks has greatly improved, matching the cognitive ability of humans [17], they are susceptible to adversarial examples [4, 33]. An adversarial example is an input which, though initially classified correctly, is misclassified after a minor, perhaps imperceptible, perturbation. Adversarial examples pose challenges for self-driving cars, where neural network solutions have been proposed for tasks such as end-to-end steering [6], road segmentation [5], and traffic sign classification [32]. In the context of steering and road segmentation, an adversarial example may cause a car to steer off the road or drive into barriers, and misclassifying traffic signs may cause a vehicle to drive into oncoming traffic. Figure 1 shows an image of a traffic light correctly classified by a state-of-the-art network which is then misclassified after only a few pixels have been changed. Though somewhat artificial, since in practice the controller would rely on additional sensor input when making a decision, such cases strongly suggest that, before deployment in safety-critical tasks, the resilience (or robustness) of DNNs to adversarial examples must be strengthened.

Fig. 1. An adversarial example for the YOLO object recognition network.

A number of approaches have been proposed to search for adversarial examples (see Related Work). They are based on computing the gradients [12], along which a heuristic search moves; computing a Jacobian-based saliency map [27], based on which pixels are selected to be changed; transforming the existence of adversarial examples into an optimisation problem [8], on which an optimisation algorithm can be applied; transforming the existence of adversarial examples into a constraint solving problem [15], on which a constraint solver can be applied; or discretising the neighbourhood of a point and searching it exhaustively in a layer-by-layer manner [14]. All these approaches assume some knowledge about the network, e.g., the architecture or the parameters, which can vary as the network continuously learns and adapts to new data, and, with a few exceptions [26] that access the penultimate layer, do not explore the feature maps of the networks.

In this paper, we propose a feature-guided approach to test the resilience of image classifier networks against adversarial examples. While convolutional neural networks (CNN) have been successful in classification tasks, their feature extraction capability is not well understood [37]. The discovery of adversarial examples has called into question CNN’s ability to robustly handle input with diverse structural and compositional elements. On the other hand, state-of-the-art feature extraction methods are able to deterministically and efficiently extract structural elements of an image regardless of scale, rotation or transformation. A key observation of this paper is that feature extraction methods enable us to identify elements of an image which are most vulnerable to a visual system such as a CNN.

Leveraging knowledge of the human perception system, existing object detection techniques detect instances of semantic objects of a certain class (such as animals, buildings, or cars) in digital images and videos by identifying their features. We use the scale-invariant feature transform approach, or SIFT [20], to detect features, which is achieved with no knowledge of the network in a black-box manner. Using the SIFT features, whose number is much smaller than the number of pixels, we represent the image as a two-dimensional Gaussian mixture model. This reduction in dimensionality allows us to efficiently target the exploration at salient features, similarly to human perception. We formulate the process of crafting adversarial examples as a two-player turn-based stochastic game, where player \(\mathtt{I}\) selects features and player \(\mathtt{II}\) then selects pixels within the selected features and a manipulation instruction. After both players have made their choices, the image is modified according to the manipulation instruction, and the game continues. While player \(\mathtt{I}\) aims to minimise the distance to an adversarial example, player \(\mathtt{II}\) can be cooperative, adversarial, or nature who samples the pixels according to the Gaussian mixture model. We show that, theoretically, the two-player game can converge to the optimal strategy, and that the optimal strategy represents a globally minimal adversarial image. We also consider safety guarantees for Lipschitz networks and identify conditions to ensure that no adversarial examples exist.

We implement a software package, in which a Monte Carlo tree search (MCTS) algorithm is employed to find asymptotically optimal strategies for both players, with player \(\mathtt{II}\) being a cooperator. The algorithm is anytime, meaning that it can be terminated with time-out bounds provided by the user and, when terminated, it returns the best strategies it has found for both players. The experiments on networks trained on benchmark datasets such as MNIST [18] and CIFAR10 [1] show that, even without knowledge of the network and using relatively little time (1 min for every image), the algorithm already achieves competitive performance against existing adversarial example crafting algorithms. We also experiment on several state-of-the-art networks, including the winner of the Nexar traffic light challenge [25], the real-time object detection system YOLO, and VGG16 [3] from the ImageNet competition, where, surprisingly, we show that the algorithm can return adversarial examples even with very limited resources (e.g., running time of less than a second), including the one in Fig. 1 for YOLO. Further, since the SIFT method is scale and rotation invariant, we can counter claims in the recent paper [21] that adversarial examples are not invariant to changes in scale or angle in the physical domain.

Our software package is well suited to safety testing and decision support for DNNs in safety-critical applications. First, the MCTS algorithm can be used offline to evaluate the network’s robustness against adversarial examples on a given set of images. The asymptotically optimal strategy achievable by the MCTS algorithm enables a theoretical guarantee of safety, i.e., the network is safe when the algorithm cannot find adversarial examples. The algorithm is guaranteed to terminate, but running it to termination may be impractical, so we provide an alternative termination criterion. Second, the MCTS algorithm, in view of its time efficiency, has the potential to be deployed on-board for real-time decision support.

An extended version of this paper, which includes additional explanations and experimental results, is available from [36].

2 Preliminaries

Let N be a network with a set C of classes. Given an input \(\alpha \) and a class \(c \in C\), we use \(N(\alpha ,c)\) to denote the confidence (expressed as a probability value obtained from normalising the score) of N believing that \(\alpha \) is in class c. Moreover, we write \(N(\alpha ) = \arg \max _{c\in C} N(\alpha ,c)\) for the class into which N classifies \(\alpha \). For our discussion of image classification networks, the input domain \(\mathrm{D}\) is a vector space, which in most cases can be represented as \(\mathrm{I\!R_{[0,255]}^{w\times h\times ch}}\), where w, h, ch are the width, height, and number of channels of an image, respectively, and we let \(P_0\) be the set of input dimensions, of which there are \(w\times h\times ch\). In the following, we may refer to an element in \(w\times h\) as a pixel and an element in \(P_0\) as a dimension. We remark that dimensions are normalised as real values in [0, 1]. Image classifiers employ a distance function to compare images. Ideally, such a distance should reflect perceptual similarity between images, comparable to human perception. However, in practice \(L_k\) distances are used instead, typically \(L_0\), \(L_1\) (Manhattan distance), \(L_2\) (Euclidean distance), and \(L_\infty \) (Chebyshev distance). We also work with \(L_k\) distances but emphasise that our method can be adapted to other distances. In the following, we write \(||\alpha _1-\alpha _2||_{k}\) with \(k\ge 0\) for the distance between two images \(\alpha _1\) and \(\alpha _2\) with respect to the \(L_k\) measurement.

Given an image \(\alpha \), a distance measure \(L_k\), and a distance d, we define \(\eta (\alpha ,k,d)=\{\alpha ' ~|~ ||\alpha '-\alpha ||_{k} \le d\}\) as the set of points whose distance to \(\alpha \) is no greater than d with respect to \(L_k\). Next we define adversarial examples, as well as what we mean by targeted and non-targeted safety.
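
To make the notation concrete, the following sketch (ours, not part of the paper's tool) computes the \(L_k\) distances above and checks membership in \(\eta (\alpha ,k,d)\), assuming images are NumPy arrays whose dimensions are normalised to [0, 1].

```python
import numpy as np

def lk_distance(a1: np.ndarray, a2: np.ndarray, k: float) -> float:
    """L_k distance between two images, viewed as vectors of dimensions."""
    diff = (a1 - a2).ravel()
    if k == 0:                       # L_0: number of changed dimensions
        return float(np.count_nonzero(diff))
    if np.isinf(k):                  # L_infinity: Chebyshev distance
        return float(np.max(np.abs(diff)))
    return float(np.sum(np.abs(diff) ** k) ** (1.0 / k))

def in_eta(alpha_prime: np.ndarray, alpha: np.ndarray, k: float, d: float) -> bool:
    """Membership in eta(alpha, k, d): points within distance d of alpha w.r.t. L_k."""
    return lk_distance(alpha_prime, alpha, k) <= d
```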

Definition 1

Given an input \(\alpha \in \mathrm{D}\), a distance measure \(L_k\) for some \(k\ge 0\), and a distance d, an adversarial example \(\alpha '\) of class \(c\ne N(\alpha )\) is such that \(\alpha '\in \eta (\alpha ,k,d)\), \(N(\alpha )\ne N(\alpha ')\), and \(N(\alpha ')=c\). Moreover, we write \(adv_{N,k,d}(\alpha ,c)\) for the set of adversarial examples of class c and let \(adv_{N,k,d}(\alpha )=\bigcup _{c\in C, c\ne N(\alpha )}adv_{N,k,d}(\alpha ,c)\). A targeted safety of class c is defined as \(adv_{N,k,d}(\alpha ,c)=\emptyset \), and a non-targeted safety is defined as \(adv_{N,k,d}(\alpha )=\emptyset \).

Feature Extraction. The Scale Invariant Feature Transform (SIFT) algorithm [20], a reliable technique for extracting features from an image, makes object localization and tracking possible without the use of neural networks. Generally, the SIFT algorithm proceeds through the following steps: scale-space extrema detection (detecting relatively darker or lighter areas in the image), keypoint localization (determining the exact position of these areas), and keypoint descriptor assignment (understanding the context of the image with respect to its local area). Human perception of an image or an object can be reasonably represented as a set of features (referred to as keypoints in SIFT) of different sizes and response strengths, see [35] and the Appendix of [36] for more detail. Let \(\varLambda (\alpha )\) be a set of features of the image \(\alpha \) such that each feature \(\lambda \in \varLambda (\alpha )\) is a tuple \((\lambda _x,\lambda _y,\lambda _s,\lambda _r)\), where \((\lambda _x,\lambda _y)\) is the coordinate of the feature in the image, \(\lambda _s\) is the size of the feature, and \(\lambda _r\) is the response strength of the feature. The SIFT procedures implemented in standard libraries such as OpenCV may return more information, which we do not use.
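
The tuple representation above maps directly onto OpenCV's SIFT keypoints. The following sketch (ours, assuming a recent OpenCV build where SIFT is exposed as cv2.SIFT_create) extracts \(\varLambda (\alpha )\) as a list of \((\lambda _x,\lambda _y,\lambda _s,\lambda _r)\) tuples.

```python
import cv2
import numpy as np

def extract_features(alpha: np.ndarray):
    """Return Lambda(alpha): a list of (x, y, size, response) tuples for SIFT keypoints.

    `alpha` is assumed to have values in [0, 1]; OpenCV's detector expects an
    8-bit grayscale image, so we convert first.
    """
    gray = (alpha * 255).astype(np.uint8)
    if gray.ndim == 3:
        gray = cv2.cvtColor(gray, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()          # available in opencv-python >= 4.4
    keypoints = sift.detect(gray, None)
    return [(kp.pt[0], kp.pt[1], kp.size, kp.response) for kp in keypoints]
```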

Fig. 2. Illustration of the transformation of an image into a saliency distribution. (a) The original image \(\alpha \), provided by ImageNet. (b) The image marked with relevant keypoints \(\varLambda (\alpha )\). (c) The heatmap of the Gaussian mixture model \(\mathcal{G}(\varLambda (\alpha ))\).

On their own, keypoints are not guaranteed to cover every pixel in the image, so, to ensure a comprehensive and flexible safety analysis, we use these keypoints as the basis for a Gaussian mixture model. Figure 2 shows the original image (a) and this image annotated with keypoints (b).

Gaussian Mixture Model. Given an image \(\alpha \) and its set \(\varLambda (\alpha )\) of keypoints, we define for \(\lambda _i\in \varLambda (\alpha )\) a two-dimensional Gaussian distribution \(\mathcal{G}_i\) such that, for pixel \((p_x,p_y)\), we have

$$\begin{aligned} \mathcal{G}_{i,x} = \dfrac{1}{\sqrt{2\pi \lambda _{i,s}^{2}}} exp\big (\dfrac{-(p_x-\lambda _{i,x})^2}{2\lambda _{i,s}^{2}}\big )~~~~~ \mathcal{G}_{i,y} = \dfrac{1}{\sqrt{2\pi \lambda _{i,s}^{2}}} exp\big (\dfrac{-(p_y-\lambda _{i,y})^2}{2\lambda _{i,s}^{2}}\big ) \end{aligned}$$
(1)

where the standard deviation is the size \(\lambda _{i,s}\) of the keypoint and the mean is its location \((\lambda _{i,x}, \lambda _{i,y})\). To complete the model, we define a set of weights \(\varPhi = \{\phi _i\}_{ i \in \{1,2,...,k\} }\) such that \(k=|\varLambda (\alpha )|\) and \(\phi _i = \lambda _{i,r}/\sum _{j=1}^{k}\lambda _{j,r} \). Then, we can construct a Gaussian mixture model \(\mathcal{G}\) by combining the distribution components with the weights as coefficients, i.e., \(\mathcal{G}_{x} = \sum _{i=1}^k \phi _i\times \mathcal{G}_{i,x}\) and \(\mathcal{G}_{y} = \sum _{i=1}^k \phi _i\times \mathcal{G}_{i,y}\). The two-dimensional distributions are discrete and separable, and therefore their realization is tractable and independent, which improves the efficiency of computation. Let \(\mathcal{G}(\varLambda (\alpha ))\) be the Gaussian mixture model obtained from \(\varLambda (\alpha )\), and G be the set of Gaussian mixture models. In Fig. 2 we illustrate the transformation of an image into a saliency distribution.
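
As a sketch of one way the saliency distribution could be realised and sampled (ours; the helper names are illustrative), one can pick a mixture component with probability \(\phi _i\) and then draw the pixel coordinates from the corresponding clipped Gaussians:

```python
import numpy as np

def build_mixture(features):
    """Weights phi_i proportional to keypoint response strengths lambda_{i,r}."""
    responses = np.array([r for (_, _, _, r) in features], dtype=float)
    return responses / responses.sum()

def sample_pixel(features, phi, width, height, rng=None):
    """Sample a pixel (x, y) from the Gaussian mixture G(Lambda(alpha)).

    Choose a component i with probability phi_i, then draw x and y independently
    from Gaussians centred at the keypoint location with std. dev. lambda_{i,s},
    rounded and clipped to the image grid.
    """
    if rng is None:
        rng = np.random.default_rng()
    i = rng.choice(len(features), p=phi)
    lx, ly, ls, _ = features[i]
    x = int(np.clip(np.rint(rng.normal(lx, ls)), 0, width - 1))
    y = int(np.clip(np.rint(rng.normal(ly, ls)), 0, height - 1))
    return x, y
```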

Pixel Manipulation. We now define the operations that we consider for manipulating images. We write \(\alpha (x,y,z)\) for the value of the z-channel (typically RGB or grey-scale values) of the pixel positioned at (xy) on the image \(\alpha \). Let \(I=\{+,-\}\) be a set of manipulation instructions and \(\tau \) be a positive real number representing the manipulation magnitude, then we can define pixel manipulations \(\delta _{X,i}: \mathrm{D}\rightarrow \mathrm{D}\) for X a subset of input pixels and \(i\in I\):

$$ \delta _{X,i}(\alpha )(x,y,z) = \left\{ \begin{array}{ll} \alpha (x,y,z) + \tau , &{} \text {if } (x,y)\in X \text { and } i= + \\ \alpha (x,y,z) - \tau , &{} \text {if } (x,y)\in X \text { and } i= - \\ \alpha (x,y,z) &{} \text {otherwise}\\ \end{array} \right. $$

for all pixels \((x,y)\) and channels \(z\in \{1,2,3\}\). Note that if the values are bounded, e.g., within [0, 1], \(\delta _{X,i}(\alpha )(x,y,z)\) needs to be restricted to lie within the bounds. For simplicity, in our experiments and comparisons we allow a manipulation to choose either the upper bound or the lower bound according to the instruction \(i\). For example, in Fig. 1, the actual manipulation considered is to set the manipulated dimensions to the value 1.
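
A minimal sketch of \(\delta _{X,i}\) (ours), assuming images stored as arrays indexed by (row, column, channel) with values in [0, 1]:

```python
import numpy as np

def delta(alpha: np.ndarray, X, i: str, tau: float) -> np.ndarray:
    """Pixel manipulation delta_{X,i}: add (i == '+') or subtract (i == '-') tau
    on every channel of the pixels in X, clipping the result to [0, 1]."""
    manipulated = alpha.copy()
    step = tau if i == '+' else -tau
    for (x, y) in X:
        manipulated[y, x, ...] = np.clip(alpha[y, x, ...] + step, 0.0, 1.0)
    return manipulated
```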

3 Safety Against Manipulations

Recall that every image represents a point in the input vector space \(\mathrm{D}\). Most existing investigations of the safety (or robustness) of DNNs focus on optimising the movement of a point along the gradient direction of some function obtained from the network (see Related Work for more detail). Therefore, these approaches rely on knowledge about the DNN. Arguably, this reliance holds also for the black-box approach proposed in [26], which uses a new surrogate network trained on data sampled from the original network. Furthermore, the current understanding of the transferability of adversarial examples (i.e., that an adversarial example found for one network can also serve as an adversarial example for another network, trained on different data) is based entirely on empirical experiments [26]. The conflict between this understanding of transferability and existing approaches to crafting adversarial examples can be gleaned from an observation made in [19] that gradient directions of different models are orthogonal to each other. A reasonable interpretation is that transferable adversarial examples, if they exist, do not rely on the gradient direction suggested by a network but instead may be specific to the input.

In this paper, we propose a feature-guided approach which, instead of using the gradient direction as the guide for optimisation, searches for adversarial examples by targeting and manipulating image features as recognised by human perception capability. We extract features using SIFT, which is a reasonable proxy for human perception and enables dimensionality reduction through the Gaussian mixture representation (see [29]). Our method requires neither knowledge about the network nor massive sampling of the network for data to train a new network, and is therefore a black-box approach.

Game-Based Approach. We formulate the search for adversarial examples as a two-player turn-based stochastic game, where player \(\mathtt{I}\) selects features and player \(\mathtt{II}\) then selects pixels within the selected features and a manipulation instruction. While player \(\mathtt{I}\) aims to minimise the distance to an adversarial example, player \(\mathtt{II}\) can be cooperative, adversarial, or nature who samples the pixels according to the Gaussian mixture model. To give more intuition for feature-guided search, in Appendix of [36] we demonstrate how the distribution of the Gaussian mixture model representation evolves for different adversarial examples.

We define the objective function in terms of the \(L_k\) distance and view the distance to an adversarial example as a measure of its severity. Note that the sets \(adv_{N,k,d}(\alpha ,c)\) and \(adv_{N,k,d}(\alpha )\) of adversarial examples can be infinite.

Definition 2

Among all adversarial examples in the set \(adv_{N,k,d}(\alpha ,c)\) (or \(adv_{N,k,d}(\alpha )\)), find \(\alpha '\) with the minimum distance to the original image \(\alpha \):

$$\begin{aligned} \arg \min _{\alpha '} \{ sev_\alpha (\alpha ')~|~\alpha ' \in adv_{N,k,d}(\alpha ,c) (\text {or } adv_{N,k,d}(\alpha ))\} \end{aligned}$$
(2)

where \(sev_\alpha (\alpha ') = ||\alpha - \alpha '||_{k}\) is the severity of the adversarial example \(\alpha '\) against the original image \(\alpha \).

We remark that the choice of \(L_k\) will affect perceptual similarity, see Appendix of [36].

Crafting Adversarial Examples as a Two-Player Turn-Based Game. Assume two players \(\mathtt{I}\) and \(\mathtt{II}\). Let \(M(\alpha ,k,d)=(S\cup (S\times \varLambda (\alpha )),s_0,\{T_a\}_{a \in \{\mathtt{I},\mathtt{II}\}},L)\) be a game model, where S is a set of game states belonging to player \(\mathtt{I}\) such that each state represents an image in \( \eta (\alpha ,k,d)\), and \(S\times \varLambda (\alpha )\) is a set of game states belonging to player \(\mathtt{II}\) where \(\varLambda (\alpha )\) is a set of features (keypoints) of image \(\alpha \). We write \(\alpha (s)\) for the image associated to the state \(s\in S\). \(s_0\in S\) is the initial game state such that \(\alpha (s_0)\) is the original image \(\alpha \). The transition relation \(T_\mathtt{I}: S \times \varLambda (\alpha )\rightarrow S \times \varLambda (\alpha )\) is defined as \(T_\mathtt{I}(s,\lambda )=(s,\lambda )\), and transition relation \(T_\mathtt{II}: (S \times \varLambda (\alpha ))\times \mathcal {P}(P_0)\times I\rightarrow S\) is defined as \(T_\mathtt{II}((s,\lambda ),X,i)=\delta _{X,i}(\alpha (s))\), where \(\delta _{X,i}\) is a pixel manipulation defined in Sect. 2. Intuitively, on every game state \(s\in S\), player \(\mathtt{I}\) will choose a keypoint \(\lambda \), and, in response to this, player \(\mathtt{II}\) will choose a pair \((X,i)\), where X is a set of input dimensions and \(i\) is a manipulation instruction. The labelling function \(L:S\cup (S\times \varLambda (\alpha ))\rightarrow C\times G\) assigns to each state s or \((s,\lambda )\) a class \(N(\alpha (s))\) and a two-dimensional Gaussian mixture model \(\mathcal{G}(\varLambda (\alpha (s)))\).

A path (or game play) of the game model is a sequence \(s_1u_1s_2u_2...\) of game states such that, for all \(k\ge 1\), we have \(u_k = T_{\mathtt{I}}(s_k,\lambda _k)\) for some feature \(\lambda _k\) and \(s_{k+1}=T_\mathtt{II}((s_k,\lambda _k),X_k,i_k)\) for some \((X_k,i_k)\). Let \(last(\rho )\) be the last state of a finite path \(\rho \) and \(Path_a^F\) be the set of finite paths such that \(last(\rho )\) belongs to player \(a\in \{\mathtt{I},\mathtt{II}\}\). A stochastic strategy \(\sigma _\mathtt{I}: Path_\mathtt{I}^F\rightarrow \mathcal {D}(\varLambda (\alpha ))\) of player \(\mathtt{I}\) maps each finite path to a distribution over the next actions, and similarly for \(\sigma _\mathtt{II}:Path_\mathtt{II}^F\rightarrow \mathcal {D}(\mathcal {P}(P_0)\times I)\) for player \(\mathtt{II}\). We call \(\sigma = (\sigma _\mathtt{I},\sigma _\mathtt{II})\) a strategy profile. In this section, we only discuss targeted safety for a given target class c (see Definition 1). All the notations and results can be easily adapted to work with non-targeted safety.
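
As a small illustration of one game round (ours; it reuses the hypothetical sample_pixel and delta helpers from the earlier sketches and lets both players act randomly, as in the simulation phase described later):

```python
import numpy as np

def play_round(alpha_s, features, phi, tau, width, height, rng=None):
    """One round of the game on the image alpha(s)."""
    if rng is None:
        rng = np.random.default_rng()
    # Player I: choose a keypoint lambda, here sampled by response strength.
    lam = rng.choice(len(features), p=phi)
    # Player II (as nature): choose a pixel around the chosen keypoint and an instruction.
    x, y = sample_pixel([features[lam]], np.array([1.0]), width, height, rng)
    instruction = rng.choice(['+', '-'])
    # Transition T_II((s, lambda), X, i) = delta_{X,i}(alpha(s)).
    return delta(alpha_s, [(x, y)], instruction, tau)
```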

In the following, we define a reward \(R(\sigma ,\rho )\) for a given strategy profile \(\sigma = (\sigma _\mathtt{I},\sigma _\mathtt{II})\) and a finite path \(\rho \in \bigcup _{a\in \{\mathtt{I},\mathtt{II}\}}Path_a^F\). The idea of the reward is to accumulate a measure of severity of the adversarial example found over a path. Note that, given \(\sigma \), the game becomes a fully probabilistic system. Let \(\alpha _\rho ' = \alpha (last(\rho ))\) be the image associated with the last state of the path \(\rho \). We write \(t(\rho )\) for the expression \(N(\alpha _\rho ')=c \vee || \alpha _\rho '-\alpha ||_{k} > d\), representing that the path has reached a state whose associated image either is in the target class c or lies outside the region \( \eta (\alpha ,k,d)\). The path \(\rho \) can be terminated whenever \(t(\rho )\) is satisfied. It is not hard to see that, due to the constraints in Definition 1, every infinite path has a finite prefix which can be terminated. Then we define the reward function \(R(\sigma ,\rho ) = \)

$$ \left\{ \begin{array}{lll} 1/ sev_\alpha (\alpha _\rho ') &{} \text {if } t(\rho ) \text { and }\rho \in Path_\mathtt{I}^F \\ \mathop \sum \nolimits _{\lambda \in \varLambda (\alpha )} \sigma _\mathtt{I}(\rho )(\lambda ) \cdot R(\sigma ,\rho T_\mathtt{I}(last(\rho ),\lambda )) &{} \text {if } \lnot t(\rho ) \text { and }\rho \in Path_\mathtt{I}^F \\ \mathop \sum \nolimits _{(X,i)\in \mathcal {P}(P_0)\times I} \sigma _\mathtt{II}(\rho )(X,i) \cdot R(\sigma ,\rho T_\mathtt{II}(last(\rho ),X,i)) &{} \text {if } \rho \in Path_\mathtt{II}^F \end{array} \right. $$

where \(\sigma _\mathtt{I}(\rho )(\lambda )\) is the probability of selecting \(\lambda \) on \(\rho \) by player \(\mathtt{I}\), and \(\sigma _\mathtt{II}(\rho )(X,i)\) is the probability of selecting \((X,i)\) based on \(\rho \) by player \(\mathtt{II}\). We note that a path only terminates on player \(\mathtt{I}\) states.

Intuitively, if an adversarial example is found then the reward assigned is the inverse of its severity (minimal distance), and otherwise it is the weighted sum of the rewards of its children. Thus, a strategy \(\sigma _\mathtt{I}\) that maximises the reward will need to minimise the severity \(sev_\alpha (\alpha _\rho ')\), which is the objective of the problem defined in Definition 2.

Definition 3

The goal of the game is for player \(\mathtt{I}\) to choose a strategy \(\sigma _{\mathtt{I}}\) to maximise the reward \(R((\sigma _\mathtt{I},\sigma _\mathtt{II}),s_0) \) of the initial state \(s_0\), based on the strategy \(\sigma _{\mathtt{II}}\) of the player \(\mathtt{II}\), i.e.,

$$\begin{aligned} \arg \max _{\sigma _{\mathtt{I}}} \mathtt{opt}_{\sigma _{\mathtt{II}}} R((\sigma _\mathtt{I},\sigma _\mathtt{II}),s_0). \end{aligned}$$
(3)

where option \(\mathtt{opt}_{\sigma _{\mathtt{II}}}\) can be \(\max _{\sigma _{\mathtt{II}}} \), \(\min _{\sigma _{\mathtt{II}}} \), or \(\mathtt{nat}_{\sigma _{\mathtt{II}}}\), according to which player \(\mathtt{II}\) acts as a cooperator, an adversary, or nature who samples the distribution \(\mathcal{G}(\varLambda (\alpha ))\) for pixels and randomly chooses the manipulation instruction.

A strategy \(\sigma \) is called deterministic if \(\sigma (\rho )\) is a Dirac distribution, and is called memoryless if \(\sigma (\rho )=\sigma (last(\rho ))\) for all finite paths \(\rho \). We have the following result.

Theorem 1

Deterministic and memoryless strategies suffice for player \(\mathtt{I}\), when \(\mathtt{opt}_{\sigma _{\mathtt{II}}} \in \{\max _{\sigma _{\mathtt{II}}}, \min _{\sigma _{\mathtt{II}}}, \mathtt{nat}_{\sigma _{\mathtt{II}}}\}\).

Complexity of the Problem. As a by-product of Theorem 1, the theoretical complexity of the problem (i.e., determining whether \(adv_{N,k,d}(\alpha ,c)=\emptyset \)) is in PTIME, with respect to the size of the game model \(M(\alpha ,k,d)\). However, even if we only consider finite paths (and therefore a finite system), the number of states (and therefore the size of the system) is \(O(|P_0 |^{h})\) for h the length of the longest finite path of the system without a terminating state. While the precise size of \(O(|P_0|^{h})\) is dependent on the problem (including the image \(\alpha \) and the difficulty of crafting an adversarial example), it is roughly \(O(50000^{100})\) for the images used in the ImageNet competition and \(O(1000^{20})\) for smaller images such as CIFAR10 and MNIST. This is beyond the capability of existing approaches for exact or \(\epsilon \)-approximate computation of probability (e.g., reduction to linear programming, value iteration, and policy iteration, etc) that are used in probabilistic verification.

4 Monte Carlo Tree Search for Asymptotically Optimal Strategy

In this section, we present an approach based on Monte Carlo tree search (MCTS) [9] to find an optimal strategy asymptotically. We also show that the optimal strategy, if achieved, represents the best adversarial example with respect to the objective in Definition 2, under some conditions.

We first consider the case of \(\mathtt{opt}_{\sigma _{\mathtt{II}}}=\max _{\sigma _{\mathtt{II}}}\). An MCTS algorithm, whose pseudo-code is presented in Algorithm 1, gradually expands a partial game tree by sampling the strategy space of the model \(M(\alpha ,k,d)\). With the upper confidence bound (UCB) [16] as the exploration-exploitation tradeoff, MCTS has a theoretical guarantee that it converges to the optimal solution when the game tree is fully explored. The algorithm mainly follows the standard MCTS procedure, with a few adaptations. We use two termination conditions \(tc_1\) and \(tc_2\) to control the pace of the algorithm. More specifically, \(tc_1\) controls whether the entire procedure should be terminated, and \(tc_2\) controls when a move should be made. The termination conditions can be, e.g., bounds on the number of iterations. On the partial tree, every node maintains a pair (r, n), which represents the accumulated reward r and the number of visits n, respectively. The \(selection\) procedure travels from the root to a leaf according to an exploration-exploitation balance, i.e., UCB [16]. After expanding the leaf node so that its children are added to the partial tree, we call \(Simulation\) to run a simulation on every child node. A simulation on a new node is a play of the game from that node until it terminates. Players act randomly during the simulation. Every simulation terminates when reaching a terminating node \(\alpha '\), on which a reward \(1/ sev_\alpha (\alpha ')\) can be computed. This reward is then backpropagated from the new child node through its ancestors until reaching the root. Every time a new reward v is backpropagated through a node, we update its associated pair to \((r+v,n+1)\). The procedure bestChild(root) returns the child of the root which has the highest value of r/n. The other two cases are similar except for the choice of the next move (i.e., Line 12). Instead of choosing the best child, a child is chosen by sampling \(\mathcal{G}(\varLambda (\alpha ))\) for the case of \(\mathtt{opt}_{\sigma _{\mathtt{II}}}=\mathtt{nat}_{\sigma _{\mathtt{II}}}\), and the worst child is chosen for the case of \(\mathtt{opt}_{\sigma _{\mathtt{II}}}=\min _{\sigma _{\mathtt{II}}}\). We remark that the game is not zero-sum when \(\mathtt{opt}_{\sigma _{\mathtt{II}}}\in \{\mathtt{nat}_{\sigma _{\mathtt{II}}},\max _{\sigma _{\mathtt{II}}}\}\).

Algorithm 1. Monte Carlo tree search for the crafting of adversarial examples.
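
Since the pseudo-code of Algorithm 1 is not reproduced here, the following is a minimal sketch (ours) of the MCTS loop described above, for the cooperative case \(\mathtt{opt}_{\sigma _{\mathtt{II}}}=\max _{\sigma _{\mathtt{II}}}\): UCB selection, expansion, random simulation, and backpropagation of the reward \(1/sev_\alpha (\alpha ')\). The game-specific callbacks are assumptions, every non-terminating state is assumed to have at least one available move, and line numbers referenced in the text refer to the paper's Algorithm 1 rather than to this sketch.

```python
import math
import random

class Node:
    """A node of the partial game tree, maintaining the pair (r, n)."""
    def __init__(self, state, parent=None):
        self.state = state        # a game state (an image, possibly with a chosen keypoint)
        self.parent = parent
        self.children = []
        self.untried = None       # moves not yet expanded from this node
        self.r = 0.0              # accumulated reward
        self.n = 0                # number of visits

def ucb(child, parent, c=math.sqrt(2)):
    """Upper confidence bound used as the exploration-exploitation balance."""
    if child.n == 0:
        return float('inf')
    return child.r / child.n + c * math.sqrt(math.log(parent.n) / child.n)

def mcts(root_state, moves_of, apply_move, terminal, reward, iterations=1000):
    """Minimal MCTS loop: UCB selection, expansion, random simulation, backpropagation.

    moves_of(state)         -> iterable of available moves
    apply_move(state, move) -> successor state
    terminal(state)         -> True when t(rho) holds for the associated image
    reward(state)           -> 1 / sev_alpha(alpha') for a terminating state
    """
    root = Node(root_state)
    root.untried = list(moves_of(root_state))
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes using UCB.
        while not node.untried and node.children and not terminal(node.state):
            node = max(node.children, key=lambda ch: ucb(ch, node))
        # Expansion: add one unexplored child to the partial tree.
        if node.untried and not terminal(node.state):
            move = node.untried.pop()
            child = Node(apply_move(node.state, move), parent=node)
            child.untried = list(moves_of(child.state))
            node.children.append(child)
            node = child
        # Simulation: both players act randomly until a terminating state is reached.
        state = node.state
        while not terminal(state):
            state = apply_move(state, random.choice(list(moves_of(state))))
        v = reward(state)
        # Backpropagation: update (r, n) on the path back to the root.
        while node is not None:
            node.r += v
            node.n += 1
            node = node.parent
    # bestChild(root): the child with the highest average reward r/n (cooperative case).
    if not root.children:
        return root
    return max(root.children, key=lambda ch: ch.r / ch.n if ch.n else 0.0)
```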

Severity Interval from the Game. Assume that we have fixed termination conditions \(tc_1\) and \(tc_2\) and target class c. Given an option \(\mathtt{opt}_{\sigma _{\mathtt{II}}}\) for player \(\mathtt{II}\), we have an MCTS algorithm to compute an adversarial example \(\alpha '\). Let \(sev(M(\alpha ,k,d),\mathtt{opt}_{\sigma _{\mathtt{II}}})\) be \(sev_{\alpha }(\alpha ')\), where \(\alpha '\) is the adversarial example returned by running Algorithm 1 over the inputs \(M(\alpha ,k,d)\), \(tc_1\), \(tc_2\), c for a given \(\mathtt{opt}_{\sigma _{\mathtt{II}}}\). Then there exists a severity interval \(SI(\alpha ,k,d)\) with respect to the role of player \(\mathtt{II}\):

$$\begin{aligned}{}[sev(M(\alpha ,k,d), \max _{\sigma _{\mathtt{II}}}), ~~sev(M(\alpha ,k,d), \min _{\sigma _{\mathtt{II}}}) ]. \end{aligned}$$
(4)

Moreover, we have that \(sev(M(\alpha ,k,d), \mathtt{nat}_{\sigma _{\mathtt{II}}})\in SI(\alpha ,k,d)\).

Safety Guarantee via Optimal Strategy. Recall that \(\tau \), a positive real number, is the manipulation magnitude used in pixel manipulations. An image \(\alpha ' \in \eta (\alpha ,k,d)\) is a \(\tau \)-grid image if for all dimensions \(p\in P_0\) we have \(|\alpha '(p)-\alpha (p)| = n~*~\tau \) for some \(n\ge 0\). Let \(G(\alpha ,k,d)\) be the set of \(\tau \)-grid images in \(\eta (\alpha ,k,d)\). First of all, we have the following conclusion for the case when player \(\mathtt{II}\) is cooperative.

Theorem 2

Let \(\alpha ' \in \eta (\alpha ,k,d)\) be any \(\tau \)-grid image such that \(\alpha ' \in adv_{N,k,d}(\alpha ,c)\), where c is the targeted class. Then we have that \( sev_\alpha (\alpha ') \ge sev(M(\alpha ,k,d), \max _{\sigma _{\mathtt{II}}})\).

Intuitively, the theorem says that the algorithm can find the optimal adversarial example from the set of \(\tau \)-grid images. The idea of the proof is to show that every \(\tau \)-grid image can be reached by some game play. In the following, we show that, if the network is Lipschitz continuous, we need only consider \(\tau \)-grid images when \(\tau \) is small enough. Then, together with the above theorem, we can conclude that our algorithm is both sound and complete.
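
For concreteness, a small (hypothetical) helper checking the \(\tau \)-grid property used in Theorem 2, assuming images as NumPy arrays:

```python
import numpy as np

def is_tau_grid(alpha_prime, alpha, tau, tol=1e-9):
    """Check that every dimension of alpha' differs from alpha by an integer
    multiple of tau (the defining property of a tau-grid image)."""
    multiples = np.abs(alpha_prime - alpha) / tau
    return bool(np.all(np.abs(multiples - np.round(multiples)) <= tol))
```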

Further, we say that an image \(\alpha _1\in \eta (\alpha ,k,d)\) is a misclassification aggregator with respect to a number \(\beta >0\) if, for any \(\alpha _2\in \eta (\alpha _1,1,\beta )\), we have that \(N(\alpha _2) \ne N(\alpha )\) implies \(N(\alpha _1) \ne N(\alpha )\). Intuitively, if a misclassification aggregator \(\alpha _1\) with respect to \(\beta \) is classified correctly then all input images in \(\eta (\alpha _1,1,\beta )\) are classified correctly. We remark that the region \(\eta (\alpha _1,1,\beta )\) is defined with respect to the \(L_1\) metric, but can also be defined using \(L_{k'}\), some \(k'\), without affecting the results if \(\eta (\alpha ,k,d) \subseteq \bigcup _{\alpha _1\in G(\alpha ,k,d)}\eta (\alpha _1,k',\tau /2)\). Then we have the following theorem.

Theorem 3

If all \(\tau \)-grid images are misclassification aggregators with respect to \(\tau /2\), and \(sev(M(\alpha ,k,d), \max _{\sigma _{\mathtt{II}}}) > d\), then \(adv_{N,k,d}(\alpha ,c)=\emptyset \).

Note that \(sev(M(\alpha ,k,d), \max _{\sigma _{\mathtt{II}}}) > d\) means that none of the \(\tau \)-grid images in \(\eta (\alpha ,k,d)\) is an adversarial example. The theorem suggests that, to achieve a complete safety verification, one may gradually decrease \(\tau \) until either \(sev(M(\alpha ,k,d), \max _{\sigma _{\mathtt{II}}}) \le d\), in which case we claim the network is unsafe, or the condition that all \(\tau \)-grid images are misclassification aggregators with respect to \(\tau /2\) is satisfied, in which case we claim the network is safe. In the following, we discuss how to decide the largest \(\tau \) for a Lipschitz network, in order to satisfy that condition and therefore achieve a complete verification using our approach.

Definition 4

Network N is a Lipschitz network with respect to the distance \(L_k\) and a constant \(\hbar > 0 \) if, for all \(\alpha ,\alpha '\in \mathrm{D}\), we have \(|N(\alpha ', N(\alpha ))-N(\alpha , N(\alpha ))| < \hbar \cdot ||\alpha ' - \alpha ||_{k} \).

Note that all networks whose inputs are bounded, including all image classification networks we studied, are Lipschitz networks. Specifically, it is shown in [30] that most known types of layers, including fully-connected, convolutional, ReLU, maxpooling, sigmoid, softmax, etc., are Lipschitz continuous. Moreover, we let \(\ell \) be the minimum confidence gap for a class change, i.e.,

$$ \ell = \min \{|N(\alpha ', N(\alpha ))-N(\alpha , N(\alpha ))| ~~|~~ \alpha ,\alpha '\in \mathrm{D}, N(\alpha ')\ne N(\alpha ) \}. $$

The value of \(\ell \) is in [0, 1], dependent on the network, and can be estimated by examining all input examples \(\alpha '\) in the training and test data sets, or computed with provable guarantees by reachability analysis [30]. The following theorem can be seen as an instantiation of Theorem 3 by using Lipschitz continuity with \(\tau \le \frac{2 \ell }{ \hbar }\) to implement the misclassification aggregator.

Theorem 4

Let N be a Lipschitz network with respect to \(L_1\) and a constant \(\hbar \). Then, when \(\tau \le \frac{2 \ell }{ \hbar }\) and \(sev(M(\alpha ,k,d), \max _{\sigma _{\mathtt{II}}}) > d\), we have that \(adv_{N,k,d}(\alpha ,c)=\emptyset \).
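
To illustrate how this bound might be applied, the sketch below (ours; model is a hypothetical callable returning the probability vector \(N(\alpha ,\cdot )\)) estimates \(\ell \) empirically from a finite set of images and derives the manipulation magnitude \(\tau \le \frac{2 \ell }{ \hbar }\). As noted above, such an estimate of \(\ell \) is only empirical; a provable bound requires reachability analysis [30].

```python
import numpy as np

def estimate_confidence_gap(model, images):
    """Empirically estimate ell over a finite set of images.

    Follows the definition of ell: the minimum |N(a', N(a)) - N(a, N(a))|
    over pairs (a, a') that the network classifies differently.
    """
    probs = [np.asarray(model(a)) for a in images]
    classes = [int(np.argmax(p)) for p in probs]
    ell = 1.0
    for i in range(len(images)):
        for j in range(len(images)):
            if classes[i] != classes[j]:
                ell = min(ell, float(abs(probs[j][classes[i]] - probs[i][classes[i]])))
    return ell

def manipulation_magnitude(ell, hbar):
    """Largest tau satisfying tau <= 2 * ell / hbar, as required by Theorem 4."""
    return 2.0 * ell / hbar
```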

\(1/\epsilon \)-convergence. Because we are working with a finite game, MCTS is guaranteed to converge when the game tree is fully expanded. In the worst case, however, convergence may take a very long time. In practice, we can work with \(1/\epsilon \)-convergence by letting the program terminate when the current best adversarial example has not been improved (by finding a less severe one) for \(\lceil 1/\epsilon \rceil \) iterations, where \(\epsilon >0\) is a small real number.
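
A minimal sketch (ours) of how this termination criterion could be wired into the search loop:

```python
import math

class EpsilonConvergence:
    """Terminate the search once the best severity has not improved
    for ceil(1 / eps) consecutive iterations."""
    def __init__(self, eps):
        self.patience = math.ceil(1.0 / eps)
        self.best = float('inf')
        self.stale = 0

    def update(self, severity):
        """Record the severity found in this iteration; return True when the search should stop."""
        if severity < self.best:
            self.best, self.stale = severity, 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

For example, with \(\epsilon = 0.05\) the search stops after 20 consecutive iterations without improvement.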

5 Experimental Results

For our experiments, we let player \(\mathtt{II}\) be a cooperator, and its move \((X,i)\) is such that for all \((x_1,y_1,z_1),(x_2,y_2,z_2) \in X\) we have \(x_1=x_2\) and \(y_1=y_2\), i.e., one pixel (comprising 3 dimensions for color images or 1 dimension for grey-scale images) is changed in every move. When running simulations (Line 10 of Algorithm 1), we let \(\sigma _\mathtt{I}(\lambda )=\lambda _r/\sum _{\lambda ' \in \varLambda (\alpha )}\lambda '_r\) for all keypoints \(\lambda \in \varLambda (\alpha )\) and \(\mathtt{opt}_{\sigma _{\mathtt{II}}}=\mathtt{nat}_{\sigma _{\mathtt{II}}}\). That is, player \(\mathtt{I}\) follows a stochastic strategy that chooses a keypoint according to its response strength, and player \(\mathtt{II}\) is nature. In this section, we compare our method with existing approaches, show convergence of the MCTS algorithm on limited runs, evaluate safety-critical networks trained on traffic light images, and counter a claim made in a recent paper regarding adversarial examples in physical domains.

Comparison with Existing Approaches. We compare our approach to two state-of-the-art methods on two image classification networks, trained on the well known benchmark datasets MNIST and CIFAR10. The MNIST image dataset contains images of size \(28\times 28\) with one channel, and the network is trained with the source code given in [2]. The trained network is of medium size with 600,810 real-valued parameters, and achieves state-of-the-art accuracy, exceeding 99%. It has 12 layers, within which there are 2 convolutional layers, as well as layers such as ReLU, dropout, fully-connected layers and a softmax layer. The CIFAR10 dataset contains small images, \(32\times 32\), with three channels, and the network is trained with the source code from [1] for more than 12 hours. The trained network has 1,250,858 real-valued parameters and includes convolutional layers, ReLU layers, max-pooling layers, dropout layers, fully-connected layers, and a softmax layer. For both networks, the images are preprocessed to make the value of each dimension lie within the bound [0, 1]. We randomly select 1000 images \(\{\alpha _i\}_{i\in \{1..1000\}}\) from each dataset for non-targeted safety testing. The numbers in Table 1 are average distances defined as \(\frac{1}{1000}\cdot \sum _{i=1}^{1000}||\alpha _i - \alpha _i'||_{0}\), where \(\alpha _i'\) is the adversarial image of \(\alpha _i\) returned by the algorithm. Table 1 gives a comparison with the other two approaches (CW [8] and JSMA [27]). The numbers for CW and JSMA are taken from [8], where additional optimisations have been conducted over the original JSMA. According to [27], the original JSMA has an average distance of 40 for MNIST.

Table 1. CW vs. Game (this paper) vs. JSMA

Our experiments are conducted by setting the termination conditions \(tc_1 = 20\) s and \(tc_2 = 60\) s for every image. Note that JSMA needs several minutes to handle an image, and CW is 10 times slower than JSMA [8]. From the table, we can see that, already within this limited computation time, our game-based approach achieves a significant margin over the optimised JSMA, which is based on saliency distributions, although it is not able to beat the optimisation-based approach CW. We also mention that, in [14], the un-optimised JSMA produces adversarial examples with smaller average \(L_2\) distance than FGSM [12] and DLV on its single-path algorithm [14]. The Appendix of [36] provides illustrative examples exhibiting the manipulations that the three algorithms performed on the images.

Convergence in Limited Runs. To demonstrate convergence of our algorithm, we plot the evolution of three variables related to the adversarial severity \(sev_\alpha (\alpha ')\) against the number of iterations. The variable best (in blue) is the smallest severity found so far. The variable current (in orange) is the severity returned in the current iteration. The variable window (in green) is the average severity returned in the past 10 iterations. The blue and orange plots may overlap because we let the algorithm return the best example when it fails to find an adversarial example in some iteration. The experiments are terminated with \(1/\epsilon \)-convergence for different values of \(\epsilon \), such as 0.1 or 0.05. The green plot approaching the other two provides empirical evidence of convergence. In Fig. 3 we show that two MNIST images converge over fewer than 50 iterations on manipulations of 2 pixels, and we have confirmed that they represent optimal strategies of the players. We also work with other state-of-the-art networks such as the VGG16 network [3] from the ImageNet competition. Examples of convergence are provided in the Appendix of [36].

Fig. 3. (a) Image of a two classified as a seven with 70% confidence and (b) the demonstration of convergence. (c) Image of a six classified as a five with 50% confidence and (d) the demonstration of convergence. (Color figure online)

Evaluating Safety-Critical Networks. We explore the possibility of applying our game-based approach to support real-time decision making and testing, for which the algorithm needs to be highly efficient, requiring only seconds to execute a task.

We apply our method to a network used for classifying traffic light images collected from dashboard cameras. The Nexar traffic light challenge [25] made over eighteen thousand dashboard camera images publicly available. Each image is labeled either green, if the traffic light appearing in the image is green, or red, if the traffic light appearing in the image is red, or null if there is no traffic light appearing in the image. We test the winner of the challenge, which scored an accuracy above 90% [7], on 1000 of these images. Despite each input being 37632-dimensional (112\(\,\times \,\)112\(\,\times \,\)3), our algorithm reports that the manipulation of an average of 4.85 dimensions changes the network classification. Each image was processed by the algorithm in 0.303 s (which includes time to read and write images), i.e., 304 s were taken to test all 1000 images. We illustrate the results of our analysis of the network in Fig. 4. Though the images are easy for humans to classify, only one pixel change causes the network to make potentially disastrous decisions, particularly for the case of a red light misclassified as green. To explore this particular situation in greater depth, we use a targeted safety MCTS procedure on the same 1000 images, aiming to manipulate images into the green class. We do not consider images which are already classified as green. Of the remaining 500 images, our algorithm is able to change all image classifications to green with worryingly low severities, namely an average \(L_0\) of 3.23. On average, this targeted procedure returns an adversarial example in 0.21 s per image. The Appendix of [36] provides some further examples.

Fig. 4. Adversarial examples generated on Nexar data demonstrate a lack of robustness. (a) Green light classified as red with confidence 56% after one pixel change. (b) Green light classified as red with confidence 76% after one pixel change. (c) Red light classified as green with 90% confidence after one pixel change. (Color figure online)

Fig. 5. (Left) Adversarial examples in the physical domain remain adversarial at multiple angles: top images are classified correctly as traffic lights, bottom images are classified incorrectly as either ovens, TV screens, or microwaves. (Right) Adversarial examples in the physical domain remain adversarial at multiple scales: top images are correctly classified as traffic lights, bottom images are classified incorrectly as ovens or microwaves (with the center light being misclassified as a pizza in the bottom right instance).

Counter-Claim to Statements in [21]. A recent paper [21] argued that, under specific circumstances, there is no need to worry about adversarial examples because they are not invariant to changes in scale or angle in the physical domain. Our SIFT-based approach, which is inherently scale and rotation invariant, can easily counter such claims. To demonstrate this, we conducted tests similar to those in [21]. We set up the YOLO network, took pictures of a few traffic lights in Oxford, United Kingdom, and generated adversarial examples on these images. We then printed the adversarial example shown in Fig. 1 and photographed it at several different angles and scales to test whether it remains misclassified. The results are shown in Fig. 5. In [21] it is suggested that realistic camera movements – those which change the angle and distance of the viewer – reduce the phenomenon of adversarial examples to a curiosity rather than a safety concern. Here, we show that our adversarial examples, which are generated using scale and rotation invariant methods, defeat these claims.

6 Related Work

We review work concerning the safety (and robustness) of deep neural networks. Rather than attempting to be exhaustive, we cover only the approaches most directly related to ours.

White-Box Heuristic Approaches. In [34], Szegedy et al. find a targeted adversarial example by running the L-BFGS algorithm, which minimises the \(L_2\) distance between the images while maintaining the misclassification. The Fast Gradient Sign Method (FGSM) [12], a refinement of L-BFGS, takes as inputs the parameters \(\theta \) of the model, the input \(\alpha \) to the model, and the target label y, and computes a linearized version of the cost function with respect to \(\theta \) to obtain a manipulation direction. After the manipulation direction is fixed, a small constant value \(\tau \) is taken as the magnitude of the manipulation. Carlini and Wagner [8] adapt the optimisation problem proposed in [34] to obtain a set of optimisation problems for \(L_0\), \(L_2\), and \(L_\infty \) attacks. They claim better performance than FGSM and the Jacobian-based Saliency Map Attack (JSMA) with their \(L_2\) attack, in which for every pixel \(x_i\) a new real-valued variable \(w_i\) is introduced and the optimisation is then conducted by letting \(x_i\) move along the gradient direction of \(\tanh (w_i)\). In contrast to the optimisation approaches, JSMA [27] uses a loss function to create a “saliency map” of the image, which indicates the importance of each pixel for the network’s decision. A greedy algorithm is used to gradually modify the most important pixels. In [23], an optimisation approach (such as [34]) is applied iteratively to a set of images, one by one, to obtain an accumulated manipulation, which is expected to make a number of inputs misclassified. [22] replaces the softmax layer in a deep network with a multiclass SVM and then finds adversarial examples by performing a gradient computation.

White-Box Verification Approaches. Compared with heuristic search approaches, the verification approaches aim to provide guarantees on the safety of DNNs. An early verification approach [28] encodes the entire network as a set of constraints. The constraints can then be solved with a SAT solver. [15] improves on [28] by handling the ReLU activation functions. The Simplex method for linear programming is extended to work with the piecewise linear ReLU functions that cannot be expressed using linear programming. The approach can scale up to networks with 300 ReLU nodes. In recent work [13] the input vector space is partitioned using clustering and then the method of [15] is used to check the individual partitions. DLV [14] uses multi-path search and layer-by-layer refinement to exhaustively explore a finite region of the vector spaces associated with the input layer or the hidden layers, and scales to work with state-of-the-art networks such as VGG16.

Black-Box Algorithms. The methods in [26] evaluate a network by generating a synthetic data set, training a surrogate model, and then applying white-box attack techniques to the surrogate model. [24] randomly searches the vector space around the input image for changes which will cause a misclassification. It shows that in some instances this method is efficient and able to indicate where salient areas of the image exist.

7 Conclusion

In this paper we present a novel feature-guided black-box algorithm for evaluating the resilience of deep neural networks against adversarial examples. Our algorithm employs the SIFT method for feature extraction, provides a theoretical safety guarantee under certain restrictions, and is very efficient, opening up the possibility of deployment in real-time decision support. We develop a software package and demonstrate its applicability on a variety of state-of-the-art networks and benchmarks. While we have detected many instabilities in state-of-the-art networks, we have not yet found a network that is safe. Future work includes a comparison with the Bayesian inference method for identifying adversarial examples [10].