Adversarial machine learning phases of matter

We study the robustness of machine learning approaches to adversarial perturbations, with a focus on supervised learning scenarios. We find that typical phase classifiers based on deep neural networks are extremely vulnerable to adversarial perturbations: adding a tiny amount of carefully crafted noise to the original legitimate examples causes the classifiers to make incorrect predictions with notably high confidence. Through the lens of activation maps, we find that some important underlying physical principles and symmetries are not adequately captured, even by classifiers with near-perfect performance. This explains why adversarial perturbations that fool these classifiers exist. In addition, we find that after adversarial training, the classifiers become more consistent with physical laws and consequently more robust to certain kinds of adversarial perturbations. Our results provide valuable guidance for future theoretical and experimental studies on applying machine learning techniques to condensed matter physics.

An important question of both theoretical and experimental relevance concerns the reliability of such machine learning approaches to condensed matter physics: are these approaches robust to adversarial perturbations, which are deliberately crafted to fool the classifiers? In the realm of adversarial machine learning [35][36][37][38][39][40][41], it has been shown that machine learning models can be surprisingly vulnerable to adversarial perturbations if the dimension of the data is high enough: one can often synthesize small, imperceptible perturbations of the input data that cause the model to make highly confident but erroneous predictions. A prominent adversarial example that clearly manifests such vulnerability of classifiers based on deep neural networks was first observed by Szegedy et al. [42], where adding a small adversarial perturbation, although unnoticeable to human eyes, causes the classifier to miscategorize a panda as a gibbon with confidence larger than 99%. In this paper, we investigate the vulnerability of machine learning approaches in the context of classifying different phases of matter, with a focus on supervised learning based on deep neural networks (see Fig. 1 for an illustration).
We find that typical phase classifiers based on deep neural networks are likewise extremely vulnerable to adversarial perturbations. This is demonstrated through two concrete examples, which cover different phases of matter (including both symmetry-breaking and symmetry-protected topological phases) and different strategies to obtain the adversarial perturbations. To better understand why these adversarial examples can fool the classifier in the physics context, we open up the neural network and use an idea borrowed from the machine learning community, called the activation map [43,44], to study how the classifier infers different phases of matter. Further, we show that an adversarial training-based defense strategy improves both the classifiers' ability to resist specific perturbations and how well they capture the underlying physical principles. Our results shed light on the fledgling field of machine learning applications in condensed matter physics, which may provide an important paradigm for future theoretical and experimental studies as the field matures.

Fig. 1 (partial caption): … [32], a trained neural network (i.e., the classifier) can successfully predict its corresponding Chern number with nearly unit accuracy. However, if we add a tiny adversarial perturbation (which is imperceptible to human eyes) to the original image, the same classifier will misclassify the resulting image into an incorrect category with nearly unit confidence.
To begin with, we introduce the main ideas of adversarial machine learning, which involves the generation of and defense against adversarial examples [35][36][37][38]. Adversarial examples are instances with small, intentionally crafted perturbations that cause the classifier to make incorrect predictions. In the supervised learning scenario, we have a labeled training data set $D_n = \{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$, a classifier $h(\cdot;\theta)$, and a loss function $L$ to evaluate the classifier's performance. The task of generating adversarial examples can be reduced to an optimization problem: searching for a bounded perturbation that maximizes the loss function [45]:
$$\max_{\delta \in \Delta} L\big(h(x^{(i)} + \delta; \theta), y^{(i)}\big). \qquad (1)$$
In the machine learning literature, a number of methods have been proposed to solve the above optimization problem, along with corresponding defense strategies [39][46][47][48][49]. We employ some of these methods, together with one general defense strategy, adversarial training, on two concrete examples: one concerns the conventional paramagnetic/ferromagnetic phases of the two-dimensional classical Ising model [14,16,17]; the other involves topological phases with experimental raw data generated by a solid-state quantum simulator [31].
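For a small discrete perturbation set, Eq. (1) can in principle be solved by exhaustive search. The sketch below assumes $\Delta$ is the set of single spin flips on a $\{0, 1\}$ lattice and uses a hypothetical logistic-regression stand-in for $h$ (the names `predict_proba` and `best_single_flip`, the weights, and the lattice size are illustrative, not from the paper). It only shows what "maximize the loss over a bounded perturbation set" means concretely.

```python
import numpy as np

def predict_proba(x, w, b):
    """Hypothetical stand-in classifier: returns [P(ferro), P(para)]."""
    p_ferro = 1.0 / (1.0 + np.exp(-(np.sum(w * x) + b)))
    return np.array([p_ferro, 1.0 - p_ferro])

def best_single_flip(x, y, w, b):
    """Exhaustively solve max_{delta in Delta} L(h(x + delta), y) with
    Delta = {flip exactly one spin} and the cross-entropy loss -log P(y)."""
    best_pos, best_loss = None, -np.inf
    for pos in np.ndindex(x.shape):
        x_adv = x.copy()
        x_adv[pos] = 1 - x_adv[pos]              # flip one spin (0 <-> 1)
        loss = -np.log(predict_proba(x_adv, w, b)[y])
        if loss > best_loss:
            best_pos, best_loss = pos, loss
    return best_pos, best_loss

rng = np.random.default_rng(0)
w = rng.normal(size=(5, 5))                      # hypothetical weights
x = np.ones((5, 5), dtype=int)                   # all spins encoded as 1
pos, loss = best_single_flip(x, y=0, w=w, b=0.0) # y = 0: ferromagnetic label
print(pos, loss)
```

Exhaustive search needs $|\Delta|$ model evaluations; for several flips on a $30 \times 30$ lattice this grows combinatorially, which is why heuristic or gradient-based solvers are used in practice.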
The ferromagnetic Ising model. The first example we consider is the ferromagnetic Ising model defined on a 2D square lattice, $H_{\text{Ising}} = -J\sum_{\langle ij\rangle}\sigma_i^z\sigma_j^z$, where $\langle ij\rangle$ runs over nearest-neighbor pairs, the Ising variables are $\sigma_i^z = \pm 1$, and the coupling strength $J \equiv 1$ is set to be the energy unit. This model features a well-understood phase transition at the critical temperature $T_c = 2/\ln(1+\sqrt{2}) \approx 2.269$ [50], between a high-temperature paramagnetic phase and a low-temperature ferromagnetic phase. In the context of machine learning phases of matter, different pioneering approaches, including those based on supervised learning [16], unsupervised learning [14], or a confusion scheme combining both [17], have been introduced to classify the ferromagnetic/paramagnetic phases hosted by the above 2D Ising model. In particular, Carrasquilla and Melko first explored a supervised learning scheme based on a fully connected feed-forward neural network [16]. They used equilibrium spin configurations sampled from Monte Carlo simulations to train the network and demonstrated that after training it can correctly classify new samples with notably high accuracy. Moreover, by scanning the temperature, the network can also locate the transition temperature $T_c$ and extract the critical exponents that are crucial in the study of phase transitions.
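The training data here are equilibrium Monte Carlo samples of $H_{\text{Ising}}$. As a rough illustration only (not the sampler used in [16]), a single-spin-flip Metropolis sketch with periodic boundaries is shown below; the lattice size, sweep count, and the mapping of up spins to 1 in the hypothetical helper `to_input` are assumptions.

```python
import numpy as np

J = 1.0
T_C = 2.0 / np.log(1.0 + np.sqrt(2.0))          # Onsager critical temperature, about 2.269

def metropolis_sweep(spins, beta, rng):
    """One sweep = L*L single-spin-flip Metropolis updates (periodic boundaries)."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(0, L, size=2)
        nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2.0 * J * spins[i, j] * nn          # energy change if spin (i, j) flips
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] = -spins[i, j]

def sample_abs_magnetization(T, L=10, sweeps=200, seed=0):
    """Mean |M| over the second half of the run, starting from all spins up."""
    rng = np.random.default_rng(seed)
    spins = np.ones((L, L), dtype=int)
    mags = []
    for t in range(sweeps):
        metropolis_sweep(spins, 1.0 / T, rng)
        if t >= sweeps // 2:
            mags.append(abs(spins.mean()))       # M = (1/N) sum_i sigma_i
    return float(np.mean(mags)), spins

def to_input(spins):
    """Flatten to a {0,1} vector; mapping up -> 1 is an assumption."""
    return (spins == 1).astype(int).ravel()

m_low, _ = sample_abs_magnetization(T=1.0)       # deep in the ferromagnetic phase
m_high, _ = sample_abs_magnetization(T=5.0)      # deep in the paramagnetic phase
print(T_C, m_low, m_high)
```

Labels for supervised training would then follow from whether the sampling temperature lies below or above $T_c$.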
To study the robustness of these machine learning approaches to adversarial perturbations, we first train a classifier whose performance is comparable to that of the ones shown in [16]. After training, the network classifies data from the test set with an accuracy larger than 97% [45]. We then try to obtain adversarial perturbations to attack this seemingly ideal classifier. It is natural to consider a discrete attack in this scenario, since a spin configuration of the Ising model can only be changed discretely, by flipping spins. We apply the differential evolution algorithm (DEA) [51] to the Monte Carlo sampled spin configurations and obtain the corresponding adversarial perturbations. A concrete example found by DEA is illustrated in Fig. 2(a-b). To understand why this adversarial example, crafted with a tiny change, leads to misclassification, we dissect the classifier by estimating each position's importance to the final prediction, which we call the activation map of the classifier [45]. In Fig. 2(c) we depict the activation map for the ferromagnetic phase. It is evident that the classifier makes its prediction mainly based on positions with large activation values (dark colors). The position where the adversarial spin flip happens in Fig. 2(b) has an activation value of 3.28, the fourth largest among all 900 positions. We then enumerate all positions with activation values larger than 2.6 and find that single spin flips that change the contribution to the ferromagnetic phase from positive to negative all lead to misclassification [45]. The values of the activation map at different positions are found to be nonuniform, which contradicts the physical knowledge that each spin contributes equally to the order parameter $M$. This explains why the classifier is vulnerable to these particular spin flips. We remark that in the traditional machine learning realm of classifying daily-life images (such as images of cats and dogs), such an explanation is unattainable due to the absence of a sharply defined "order parameter".
Obtaining adversarial examples for topological phases is more challenging, since topological invariants capture global properties of the system and are insensitive to local perturbations. We consider a three-band model for 3D chiral topological insulators (TIs) [67,68]: $H_{\text{TI}} = \sum_{\mathbf{k}\in\text{BZ}}\sum_{\mu\nu} c^\dagger_{\mathbf{k},\mu}(H_{\mathbf{k}})_{\mu\nu}c_{\mathbf{k},\nu}$, with $c^\dagger_{\mathbf{k},\mu}$ the fermion creation operator at momentum $\mathbf{k} = (k_x, k_y, k_z)$ in the orbital (spin) state $\mu = -1, 0, 1$, where the summation is over the Brillouin zone (BZ); $H_{\mathbf{k}} = \lambda_1\sin k_x + \lambda_2\sin k_y + \lambda_6\sin k_z - \lambda_7(\cos k_x + \cos k_y + \cos k_z + h)$ denotes the single-particle Hamiltonian, with $\lambda_{1,2,6,7}$ being four traceless Gell-Mann matrices [67]. The topological properties of each band $\eta$ can be characterized by a topological invariant $\chi^{(\eta)}$ [69], and it is straightforward to obtain that $\chi^{(m)} = 0$, $1$, and $-2$ for $|h| > 3$, $1 < |h| < 3$, and $|h| < 1$, respectively. Recently, an experiment has been carried out to simulate $H_{\text{TI}}$ with the electron spins in an NV center, and a demonstration of the supervised learning approach to topological phases has been reported [31]. Using the measured density matrices in momentum space (which can be obtained through quantum state tomography) as input data, a trained 3D convolutional neural network (CNN) can correctly identify distinct topological phases with exceptionally high success probability, even when a large portion of the experimentally generated raw data is dropped or inaccessible. Here, we show that this approach is highly vulnerable to adversarial perturbations.
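The phase boundaries $|h| = 1$ and $|h| = 3$ coincide with gap closings of $H_{\mathbf{k}}$. A minimal numpy sketch, assuming the standard Gell-Mann convention for $\lambda_{1,2,6,7}$, builds $H_{\mathbf{k}}$, shows that its spectrum has the form $\{-E_{\mathbf{k}}, 0, E_{\mathbf{k}}\}$ (a flat middle band), and scans the minimum gap on a $10 \times 10 \times 10$ grid that contains the points $k_i \in \{0, \pi\}$ where the gap can close. It does not compute $\chi^{(m)}$ itself.

```python
import numpy as np

# Four of the eight Gell-Mann matrices (standard convention, assumed to match [67])
LAM1 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=complex)
LAM2 = np.array([[0, -1j, 0], [1j, 0, 0], [0, 0, 0]], dtype=complex)
LAM6 = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]], dtype=complex)
LAM7 = np.array([[0, 0, 0], [0, 0, -1j], [0, 1j, 0]], dtype=complex)

def h_k(k, h):
    """Single-particle Hamiltonian H_k of the three-band chiral TI model."""
    kx, ky, kz = k
    m = np.cos(kx) + np.cos(ky) + np.cos(kz) + h
    return LAM1 * np.sin(kx) + LAM2 * np.sin(ky) + LAM6 * np.sin(kz) - LAM7 * m

def min_gap(h, n=10):
    """Smallest gap between the flat middle band and the upper band on an n^3 grid."""
    ks = 2 * np.pi * np.arange(n) / n            # contains 0 and pi for even n
    gap = np.inf
    for kx in ks:
        for ky in ks:
            for kz in ks:
                e = np.linalg.eigvalsh(h_k((kx, ky, kz), h))   # ascending eigenvalues
                gap = min(gap, e[2] - e[1])
    return gap

k0 = (0.3, -1.1, 2.0)
print(np.linalg.eigvalsh(h_k(k0, 0.5)))          # pattern (-E, 0, E)
for h in (0.5, 1.0, 2.0, 3.0, 4.0):
    print(h, min_gap(h))                          # gap closes (~0) only at h = 1 and h = 3
```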
We first train a 3D CNN with numerically simulated data. The training curve is shown in Fig. 3(a), and the accuracy on validation data saturates at a high value (≈ 99%) [45]. After training, we fix the model parameters of the CNN and use the Fast Gradient Sign Method (FGSM) [48], projected gradient descent (PGD) [48], and the Momentum Iterative Method (MIM) [49] to generate adversarial perturbations [70]. We then track the classification probabilities of a sample with $\chi^{(m)} = 0$ as functions of the MIM iterations. $P(\chi^{(m)} = 0)$ decreases rapidly as the iteration number increases and converges to a small value (≈ 2%) after about eight iterations. Meanwhile, $P(\chi^{(m)} = 1)$ increases rapidly and converges to a large value (≈ 98%), indicating a misclassification: after about eight iterations, the sample originally from the category $\chi^{(m)} = 0$ is classified as belonging to the category $\chi^{(m)} = 1$ with a confidence level ≈ 98% [45]. We note that a direct calculation of the topological invariant through integration confirms that $\chi^{(m)} = 0$ for this adversarial example, indicating that the tiny perturbation would not affect the traditional methods.

It is more challenging to figure out why the topological phase classifier is so vulnerable to the adversarial perturbation. Since the convolutional kernels are repeatedly applied to different spatial windows, there is no straightforward way to calculate the importance of each location, as we did for the Ising classifier. Instead, we study the activation maps of all convolutional kernels in the first convolutional layer. We find that the sixth kernel has totally different activation patterns for topologically trivial and nontrivial phases, which acts as a strong indicator for the classifier to distinguish these phases [45]. Specifically, the activation patterns for the $\chi^{(m)} = 0$ and $\chi^{(m)} = 1$ phases are illustrated in Fig. 3(c-d). We then compare the sixth kernel's activation maps for the legitimate sample and the adversarial example. As shown in Fig. 3(e-f), the tiny adversarial perturbation makes the activation map much more correlated with the $\chi^{(m)} = 1$ ones, which gives the classifier high confidence to assert that the adversarial example belongs to the $\chi^{(m)} = 1$ phase [45]. This explains why adversarial examples can deceive the classifier.
The above two examples clearly demonstrate the vulnerability of machine learning approaches to classifying different phases of matter. Although we have only focused on these two examples, we mention that adversarial perturbations exist ubiquitously in learning various phases (independent of the learning model and input data type), and the methods used above can also be used to generate the desired adversarial perturbations for other phases. From a theoretical computer science perspective, the vulnerability of phase classifiers can be understood as a consequence of the strong "No Free Lunch" theorem: there exists an intrinsic tension between adversarial robustness and generalization accuracy [71][72][73]. The data distributions in the scenarios of learning phases of matter typically satisfy the $W_2$ Talagrand transportation-cost inequality, and thus a phase classifier could be adversarially deceived with high probability [74].
Adversarial training. In adversarial machine learning, a number of countermeasures against adversarial examples have been developed [75,76]. Adversarial training, whose essential idea is to first generate a substantial number of adversarial examples and then retrain the classifier with both the original data and the crafted data, is one such countermeasure for making classifiers more robust. To study how it works for machine learning phases of matter, we apply adversarial training to the 3D CNN classifier used for classifying topological phases. Partial results are plotted in Fig. 4(a-b). While the classifier's performance on legitimate examples remains intact, its test accuracy on adversarial examples increases significantly (by the end of adversarial training, from about 60% to 98%). This indicates that the retrained classifier is robust to the adversarial examples generated by the corresponding attacks.
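The retraining loop described above (craft adversarial examples against the current model, then train on original plus crafted data) can be sketched on a toy problem. This is not the paper's 3D CNN setup: it assumes logistic regression on hypothetical two-class Gaussian data, FGSM as the attack, and full-batch gradient descent; `EPS`, `epochs`, and `lr` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(X, y, w, b, eps):
    """FGSM on a batch for logistic regression with cross-entropy loss."""
    p = sigmoid(X @ w + b)
    grad_x = (p - y)[:, None] * w[None, :]       # dL/dx for each sample
    return X + eps * np.sign(grad_x)

def train(X, y, eps=0.0, epochs=300, lr=0.1):
    """Full-batch gradient descent; if eps > 0, every epoch also trains on FGSM
    examples crafted against the current model (adversarial training)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        if eps > 0:
            Xb = np.vstack([X, fgsm(X, y, w, b, eps)])
            yb = np.concatenate([y, y])
        else:
            Xb, yb = X, y
        p = sigmoid(Xb @ w + b)
        w -= lr * Xb.T @ (p - yb) / len(yb)
        b -= lr * np.mean(p - yb)
    return w, b

def accuracy(X, y, w, b):
    return np.mean((sigmoid(X @ w + b) > 0.5) == y)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])  # toy data
y = np.concatenate([np.zeros(200), np.ones(200)])
EPS = 0.5
for name, e in [("standard", 0.0), ("adversarial", EPS)]:
    w, b = train(X, y, eps=e)
    print(name, accuracy(X, y, w, b), accuracy(fgsm(X, y, w, b, EPS), y, w, b))
```

Regenerating the crafted examples every epoch matters: examples crafted once against the initial model would quickly stop being adversarial for the updated model.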
We also study whether adversarial training helps classifiers capture physical quantities more faithfully. As shown in Fig. 2(d), the activation map of the Ising model classifier becomes much flatter after adversarial training (the standard deviation across positions is reduced from 0.88 to 0.20). This indicates that after adversarial training, the output of the classifier is more consistent with the physical order parameter, the magnetization, to which each spin contributes equally, and hence more robust to adversarial perturbations. This is also reflected in the fact that, after adversarial training, the classifier identifies the phase transition point more accurately, as shown in Fig. 4(c-d).
Discussion and conclusion. We mention that adversarial training is effective only against adversarial examples crafted on the original classifier. The defense may not work for black-box attacks [77,78], in which an adversary generates malicious examples on a locally trained substitute model. To deal with such transferred black-box attacks, one may explore the recently proposed ensemble adversarial training method, which retrains the classifier with adversarial examples generated from multiple sources [79]. In the future, it would be interesting and desirable to find other defense strategies to strengthen the robustness of phase classifiers to adversarial perturbations. In addition, an experimental demonstration of adversarial learning of phases of matter, together with defense strategies, would be an important step towards reliable practical applications of machine learning in physics.
In summary, we have studied the robustness of machine learning approaches in classifying different phases of matter. Our discussion mainly focuses on supervised learning based on deep neural networks, but generalizations to other types of learning models (such as unsupervised learning or reinforcement learning) and other types of phases are possible. Through two concrete examples, we have demonstrated explicitly that typical phase classifiers based on deep neural networks are extremely vulnerable to tiny adversarial perturbations. We have studied the explainability of adversarial examples and demonstrated that adversarial training significantly improves the robustness of phase classifiers by helping the model learn the underlying physical principles and symmetries. Our results reveal a novel vulnerability aspect of the growing field of machine learning phases of matter, which would benefit future studies across condensed matter physics, machine learning, and artificial intelligence.
We thank Christopher Monroe, John Preskill, Nana Liu, Peter Wittek, Ignacio Cirac, Roger Colbeck, Yi Zhang, Peter …

… where $\epsilon_{\mu\nu\tau}$ is the Levi-Civita symbol with $\mu, \nu, \tau \in \{x, y, z\}$, and the Berry connection is …

Supplementary Material for: Vulnerability of Machine Learning Phases of Matter

I. METHODS FOR GENERATING ADVERSARIAL PERTURBATIONS
In the main text, we have shown that machine learning approaches to phases of matter based on deep neural networks are extremely vulnerable to adversarial examples: adding a tiny amount of carefully crafted perturbation, which is imperceptible to human eyes, to the original legitimate data causes the phase classifiers to make incorrect predictions with a high confidence level. In this section, we give more technical details on how to obtain the adversarial perturbations.
As discussed in the main text, in supervised learning the training data are labeled, $D = \{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$, and the task of obtaining adversarial examples reduces to solving the following optimization problem:
$$\max_{\delta\in\Delta} L\big(h(x^{(i)} + \delta;\theta), y^{(i)}\big). \qquad (S1)$$
In the adversarial machine learning literature, a number of methods have been introduced to deal with the above optimization problem. We consider two scenarios in this paper. One is the discrete attack scenario, where the adversarial perturbations are discrete and the original legitimate samples are modified by discrete values; the other is the continuous attack scenario, where the perturbations are continuous and the original legitimate samples are modified continuously. For the discrete attack scenario, we mainly apply the differential evolution algorithm [46,47], a population-based optimization algorithm for solving complex multi-modal problems that has recently been used to generate one-pixel adversarial perturbations that fool deep neural networks in image recognition [51]. For the continuous attack scenario, we use a number of attack methods, including the fast gradient sign method (FGSM) [39,48], projected gradient descent (PGD) [48], and the momentum iterative method (MIM) [49].
For the ferromagnetic Ising model, we apply both discrete and continuous attacks, whereas for topological phases of matter we apply only continuous attacks. We use cleverhans [70] to implement FGSM, PGD, and MIM for both the Ising and chiral topological insulator cases. In each case, we produce the adversarial samples from the original legitimate training set. We define the success ratio as the proportion of adversarial samples that successfully fool the classifier. In the following, we briefly sketch the essential ideas of each attack method used in this paper and provide pseudocode illustrating how it works.
A. Differential evolution algorithm
Differential evolution is a population-based optimization algorithm and is arguably one of the most powerful stochastic real-parameter optimization algorithms for solving complex multi-modal optimization problems [46,47]. It belongs to the general class of evolutionary algorithms, and its computational steps are quite similar to those of a standard evolutionary algorithm. Unlike traditional evolutionary algorithms, however, differential evolution perturbs the members of the current population with scaled differences of randomly chosen distinct population members. More specifically, during each iteration we generate a new set of candidate solutions (called children) from the current population (parents), compare each child with its corresponding parent, and replace the parent if the child has a higher fitness value.
We apply the differential evolution algorithm in the "black-box" setting to generate adversarial examples for the ferromagnetic Ising model [51], where we assume no prior information about the classifier's internal structure and only discrete changes of the samples can be made. We first generate some counterfeit samples by randomly flipping a number of magnetic moments of the legitimate sample. We denote these samples as $X_1, X_2, \dots, X_n$, where $n$ is the population size. We then feed these samples into the classifier to obtain the confidence probability for each configuration. Next, we produce new counterfeit samples, called children, from the prior samples. Specifically, for each component $X_i(s)$ of the parent $X_i$, the corresponding component of the child $\tilde X_i$ is generated by the rule
$$\tilde X_i(s) = \begin{cases} X_j(s) + M\,[X_k(s) - X_l(s)], & p < P,\\ X_i(s), & \text{otherwise},\end{cases}$$
where $p$ is drawn uniformly from $U(0,1)$ and $M$ is the mutation factor (a larger $M$ leads to a larger search radius but takes longer to converge). The indices $j, k, l$ are distinct and chosen randomly from $[n]\setminus\{i\}$, and $P$ is called the crossover probability. If these children perform better (i.e., give a higher confidence probability for the wrong classification category), we replace their corresponding parents with them.
We repeat this procedure for several iterations until it converges and the desired adversarial samples are obtained. For the Ising case considered in this paper, we denote each child as a sequence $\{(a, b, s)\}_N$ of $N$ triples, where $(a, b)$ is the position of a flipped spin, $s$ is the spin value after flipping (restricted to be either 0 or 1), and $N$ is the number of changed spins. A pseudocode representation of the differential evolution algorithm is shown in Algorithm 1.
It is worth mentioning that the differential evolution algorithm cannot guarantee that the optimal solution will be obtained; the algorithm may only yield a local optimum. In our scenario, this means that the adversarial examples we obtain may not be the most effective ones for fooling the classifier.

Algorithm 1 The Differential Evolution Algorithm
Input A legitimate sample (x, y), the trained model h(·; θ), the loss function L.
Input The iteration number T, the population size n, the mutation factor M, the crossover probability P, the number N of spins to flip.
Output An adversarial example x*.
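Set the position bound B to be the shape of x
Initialize the population X_1, X_2, ..., X_n, each a sequence of N triples (a, b, s) with (a, b) drawn within B and s ∈ {0, 1}
for t = 1, 2, ..., T do
  for i = 1, 2, ..., n do
    Generate children: randomly choose distinct j, k, l ∈ [n] \ {i}
    for s = 1, 2, ..., N do
      Randomly pick p from U(0, 1)
      if p < P then
        X̃_i(s) = X_j(s) + M (X_k(s) − X_l(s)), restricted to the bound B
      else
        X̃_i(s) = X_i(s)
      end if
    end for
    if X̃_i applied to x gives a higher confidence probability for the wrong category than X_i then
      X_i = X̃_i
    end if
  end for
end for
Find the X_p among {X_1, X_2, ..., X_n} that has the highest confidence probability for the wrong classification category, apply X_p to x to get x*
return x*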
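A runnable sketch of Algorithm 1 is given below. It follows the generation rule in the text (a scaled difference of two random members added to a third, with per-component crossover probability P). Rounding and clipping each mutated triple back into the bound B is an assumption needed to keep positions and spin values discrete; the toy classifier, the "hot" position, and all hyperparameters are hypothetical. The attack only queries `wrong_conf`, matching the black-box setting.

```python
import numpy as np

def apply_candidate(x, cand):
    """Apply a candidate {(a, b, s)}_N: set spin x[a, b] to the value s in {0, 1}."""
    x_adv = x.copy()
    for a, b, s in cand:
        x_adv[a, b] = s
    return x_adv

def de_attack(x, wrong_conf, n_flip=1, pop=30, iters=30, M=0.5, P=0.7, seed=0):
    """Black-box differential-evolution attack; wrong_conf(x) only queries the model."""
    rng = np.random.default_rng(seed)
    upper = np.array([x.shape[0] - 1, x.shape[1] - 1, 1])   # bound B for (a, b), and s <= 1
    def restrict(c):                                         # assumption: round, then clip into B
        return np.clip(np.rint(c), 0, upper).astype(int)
    X = np.stack([np.column_stack([rng.integers(0, x.shape[0], n_flip),
                                   rng.integers(0, x.shape[1], n_flip),
                                   rng.integers(0, 2, n_flip)]) for _ in range(pop)])
    fit = np.array([wrong_conf(apply_candidate(x, c)) for c in X])
    for _ in range(iters):
        for i in range(pop):
            j, k, l = rng.choice([m for m in range(pop) if m != i], size=3, replace=False)
            child = X[i].copy()
            for s in range(n_flip):
                if rng.random() < P:                         # crossover with probability P
                    child[s] = restrict(X[j, s] + M * (X[k, s] - X[l, s]))
            f = wrong_conf(apply_candidate(x, child))
            if f > fit[i]:                                   # child replaces parent if fitter
                X[i], fit[i] = child, f
    best = X[np.argmax(fit)]
    return apply_candidate(x, best), float(fit.max())

# Hypothetical logistic "classifier" that over-relies on one spin
rng = np.random.default_rng(1)
w = 0.1 * rng.normal(size=(6, 6))
w[2, 3] = 5.0
x = np.ones((6, 6), dtype=int)                    # legitimate ferromagnetic-like sample
b = 2.0 - w.sum()                                 # clean logit is exactly 2

def wrong_conf(x_in):
    """Confidence for the wrong (paramagnetic) label."""
    return 1.0 - 1.0 / (1.0 + np.exp(-(np.sum(w * x_in) + b)))

x_adv, conf = de_attack(x, wrong_conf)
print(np.argwhere(x_adv != x), conf)
```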

B. Fast gradient sign method
The fast gradient sign method is a simple one-step scheme for solving Eq. (S1) and has been widely used in the adversarial machine learning community [39,48].Before introducing this method, let us first introduce the fast gradient method (FGM).
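We work in a white-box attack setting, where full information about the classifier is assumed. Our goal is to maximize the loss function for a particular input $x$ to generate the adversarial sample $x^*$. Since we know all parameters of the model, we can compute the direction of fastest increase of the loss at $x$, which is its gradient, $\nabla_x L(h(x;\theta), y)$. FGM is a one-step attack that perturbs $x$ along the direction of the gradient with one particular stepsize $\epsilon$; for example, with the $l_2$ norm,
$$x^* = x + \epsilon\,\frac{\nabla_x L(h(x;\theta), y)}{\|\nabla_x L(h(x;\theta), y)\|_2}.$$
More generally, the perturbation is constrained within an $l_p$-norm bound,
$$\|x^* - x\|_p \le \epsilon.$$
If we take the $l_\infty$-norm bound, we get a simple rule for obtaining the adversarial perturbation via FGSM:
$$x^* = x + \epsilon\,\mathrm{sign}\big(\nabla_x L(h(x;\theta), y)\big).$$
For different problems, there are other particular perturbation bounds as well. One of the most useful is the rectangular-box bound, where each component of the adversarial sample is bounded by constants, $x_{\min} \le x \le x_{\max}$. For example, for the chiral topological insulators studied in this paper, we require that every component of $x$ be bounded by [− …

Algorithm 2 The Fast Gradient Sign Method
Input The trained model h(·; θ), loss function L, the legitimate sample (x, y).
Input The perturbation strength ε, upper and lower bound x_min, x_max.
Output An adversarial example x*.
δ = ε sign(∇_x L(h(x; θ), y))
for each component i do
  x*(i) = x(i) + δ_i
  if x*(i) > x_max then
    x*(i) = x_max
  else if x*(i) < x_min then
    x*(i) = x_min
  end if
end for
return x*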
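FGSM with the rectangular-box bound amounts to one signed-gradient step followed by componentwise clipping. The numpy sketch below is not the cleverhans implementation used in the paper; it assumes a hypothetical logistic-regression model, whose input gradient of the binary cross-entropy is $(p - y)\,w$, in place of the trained network.

```python
import numpy as np

def fgsm(x, grad, eps, x_min, x_max):
    """FGSM with the rectangular-box bound: one l_inf step of size eps, then clip."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, x_min, x_max)          # enforce x_min <= x* <= x_max componentwise

def logistic_loss_and_grad(x, y, w, b):
    """Hypothetical model p = sigmoid(w.x + b); returns cross-entropy and dL/dx."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, (p - y) * w

rng = np.random.default_rng(0)
w, b = rng.normal(size=20), 0.0
x = rng.uniform(-1.0, 1.0, size=20)
loss, g = logistic_loss_and_grad(x, 1, w, b)
x_adv = fgsm(x, g, eps=0.1, x_min=-1.0, x_max=1.0)
loss_adv, _ = logistic_loss_and_grad(x_adv, 1, w, b)
print(loss, loss_adv)                             # the loss does not decrease
```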

C. Projected gradient descent method
As shown in Eq. (S5), one may interpret FGSM as a simple one-step scheme for maximizing the inner part of the saddle point formulation. With a small stepsize, FGSM may perform well, but with a large stepsize it can perform poorly, since the gradient of the loss function may change significantly within the step. A more powerful method to deal with this problem is its multi-step variant, called the projected gradient descent method (PGD). The basic idea of PGD is to apply FGSM multiple ($T$) times and perform projections iteratively to ensure that the perturbation stays within an appropriate region [48]. At each step, we check whether the proposed update has moved out of the region and, if it has, project it back. The update rule is therefore
$$x_{t+1} = \pi_C\Big(x_t + \alpha\,\mathrm{sign}\big(\nabla_x L(h(x_t;\theta), y)\big)\Big),$$
where $\alpha = \epsilon/T$ is the stepsize and $\pi_C$ is the projection operation that maps points outside the chosen region [denoted as $\Delta$ in Eq. (S1)] back into it. In our scenarios, the permitted region restricts every component of $x$ to lie in $[x_{\min}, x_{\max}]$; therefore, $\pi_C$ is simply the componentwise projection onto this interval.

Algorithm 3 Projected Gradient Descent
Input The trained model h(·; θ), loss function L, the legitimate sample (x, y).
Input The perturbation strength ε, iteration number T, upper and lower bound x_min, x_max.
Output An adversarial example x*.
α = ε/T, x_0 = x
for t = 0, 1, ..., T − 1 do
  δ = α sign(∇_x L(h(x_t; θ), y))
  for each component i do
    x_{t+1}(i) = min(max(x_t(i) + δ_i, x_min), x_max)
  end for
end for
return x* = x_T
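The PGD iteration described above can be sketched in numpy as follows. The gradient comes from a caller-supplied `grad_fn`; here it is a hypothetical logistic-regression model, not the paper's CNN. Because the box $[x_{\min}, x_{\max}]$ contains $x$ and the projection onto it never moves $x_t$ farther from $x$, the total perturbation stays within $T\alpha = \epsilon$ without a separate $\epsilon$-ball projection.

```python
import numpy as np

def pgd(x, y, grad_fn, eps, T, x_min, x_max):
    """T signed-gradient steps of size alpha = eps / T, each followed by the
    componentwise projection pi_C onto the box [x_min, x_max]."""
    alpha = eps / T
    x_t = x.copy()
    for _ in range(T):
        x_t = np.clip(x_t + alpha * np.sign(grad_fn(x_t, y)), x_min, x_max)
    return x_t

rng = np.random.default_rng(0)
w = rng.normal(size=20)                          # hypothetical logistic-regression weights

def p_true(x):
    return 1.0 / (1.0 + np.exp(-(w @ x)))        # confidence for the true label y = 1

def grad_fn(x, y):
    return (p_true(x) - y) * w                   # d/dx of the binary cross-entropy

x = rng.uniform(-1.0, 1.0, size=20)
x_adv = pgd(x, 1, grad_fn, eps=0.2, T=10, x_min=-1.0, x_max=1.0)
print(p_true(x), p_true(x_adv))
```

For a linear model the gradient sign never changes, so PGD coincides with FGSM up to clipping; the iterative form pays off for nonlinear classifiers such as the 3D CNN.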

D. Momentum iterative method
FGSM assumes that the sign of the gradient of the loss function does not change around the data point, and it generates an adversarial example by applying the sign of the gradient to a legitimate example only once. However, in many practical applications this assumption may not hold when the distortion is large, so that the adversarial example generated by FGSM "underfits" the model. On the other hand, iterative variants of FGSM such as PGD move the counterfeit examples gradually in the direction of the sign of the gradient in each iteration, and can hence easily fall into poor local extrema and "overfit" the model. To resolve this dilemma, one can integrate momentum into iterative FGSM so as to stabilize the update directions and escape from local extrema [49]. This is the essential idea of the momentum iterative method.
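For a $T$-iteration attack with an $l_\infty$-norm constraint $\epsilon$, in every iteration we compute the gradient of the loss, normalize it by its $l_1$ norm, and add it to the velocity accumulated in the previous iteration, scaled by a decay factor $\mu$:
$$g_{t+1} = \mu\, g_t + \frac{\nabla_x L(h(x_t;\theta), y)}{\|\nabla_x L(h(x_t;\theta), y)\|_1},$$
with $g_0 = 0$ and $x_0 = x$. The update rule is
$$x_{t+1} = \mathrm{clip}_{[x_{\min}, x_{\max}]}\big(x_t + \alpha\,\mathrm{sign}(g_{t+1})\big),$$
where $\alpha = \epsilon/T$ is the stepsize. A pseudocode representation of the momentum iterative method is shown in Algorithm 4.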

Algorithm 4 Momentum Iterative Method
Input The trained model h(·; θ), loss function L, the legitimate sample (x, y).
Input The perturbation strength ε, iteration number T, decay factor µ, upper and lower bound x_min, x_max.
Output An adversarial example x*.
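A numpy sketch of the momentum iterative method with the inputs listed in Algorithm 4 is given below. It accumulates the $l_1$-normalized gradient with decay factor $\mu$ and steps along the sign of the accumulated velocity, clipping to $[x_{\min}, x_{\max}]$. The logistic-regression `grad_fn` is a hypothetical stand-in for the trained model, and the small constant guarding against a zero gradient is an implementation assumption.

```python
import numpy as np

def mim(x, y, grad_fn, eps, T, mu, x_min, x_max):
    """Momentum iterative method (l_inf budget eps, box-clipped)."""
    alpha = eps / T
    g = np.zeros_like(x)                          # accumulated velocity, g_0 = 0
    x_t = x.copy()
    for _ in range(T):
        grad = grad_fn(x_t, y)
        g = mu * g + grad / max(np.sum(np.abs(grad)), 1e-12)   # decayed velocity + l1-normalized gradient
        x_t = np.clip(x_t + alpha * np.sign(g), x_min, x_max)
    return x_t

rng = np.random.default_rng(0)
w = rng.normal(size=20)                           # hypothetical logistic-regression weights

def p_of(x):
    return 1.0 / (1.0 + np.exp(-(w @ x)))         # confidence for the true label y = 1

def grad_fn(x, y):
    return (p_of(x) - y) * w                      # d/dx of the binary cross-entropy

x = rng.uniform(-1.0, 1.0, size=20)
x_adv = mim(x, 1, grad_fn, eps=0.1, T=10, mu=1.0, x_min=-1.0, x_max=1.0)
print(p_of(x), p_of(x_adv))                       # confidence in the true class drops
```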

II. MORE DETAILS ON THE TWO CONCRETE EXAMPLES
In this section, we provide more technical details on the neural network structures of the classifiers, the training process, and the analysis of adversarial examples. In addition, we provide more numerical simulation results for both the Ising model and the chiral topological insulator.

A. The ferromagnetic Ising model
In the main text, we have shown that adding an adversarial perturbation as small as a single pixel can lead the classifier to misclassify a spin-configuration image from the ferromagnetic phase into the paramagnetic category. In this example, our phase classifier is a fully connected feedforward neural network composed of an input layer with 900 neurons, a hidden layer with 100 sigmoid neurons, and an output layer with two sigmoid neurons, as shown in Fig. S1(a). The input data are equilibrium spin configurations sampled from Monte Carlo simulations, as in Ref. [16]. We use 0 and 1 to represent whether a spin is up or down. The lattice size is fixed to 30 × 30, so the input data x are {0, 1} arrays of length 900. The training and validation sets are both generated numerically with Monte Carlo simulations [16], and their sizes are 90000 and 10000, respectively. We use RMSprop as the optimizer with a batch size of 256, and the learning rate is set to $10^{-3}$. In Fig. S1(b), we plot the results of the training process. The accuracy increases (the loss decreases) as the number of epochs increases, and after 15 epochs the network classifies samples from the validation/test set with an accuracy larger than 97%.
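The continuous attacks need the gradient of the loss with respect to the input. A numpy sketch of the 900-100-2 sigmoid network's forward pass and input gradient is shown below, checked against finite differences. The weights are random and untrained, output index 0 is taken as the ferromagnetic class (following the activation-map discussion in this section), and the loss (binary cross-entropy summed over the two sigmoid outputs) is an assumption, since the text does not state it for this network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng, n_in=900, n_hidden=100, n_out=2, scale=0.05):
    """Random (untrained) parameters with the shapes of the 900-100-2 classifier."""
    return {"W1": rng.normal(0, scale, (n_hidden, n_in)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, scale, (n_out, n_hidden)), "b2": np.zeros(n_out)}

def forward(params, x):
    a1 = sigmoid(params["W1"] @ x + params["b1"])            # 100 sigmoid hidden units
    return sigmoid(params["W2"] @ a1 + params["b2"]), a1     # 2 sigmoid outputs

def loss_and_input_grad(params, x, y_onehot):
    """Binary cross-entropy summed over the two outputs, and its gradient dL/dx."""
    P, a1 = forward(params, x)
    loss = -np.sum(y_onehot * np.log(P) + (1 - y_onehot) * np.log(1 - P))
    d_z2 = P - y_onehot                                      # sigmoid + cross-entropy
    d_z1 = (params["W2"].T @ d_z2) * a1 * (1 - a1)
    return loss, params["W1"].T @ d_z1

rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.integers(0, 2, 900).astype(float)                    # a {0,1} spin configuration
y_ferro = np.array([1.0, 0.0])                               # output 0 = ferromagnetic
loss, grad_x = loss_and_input_grad(params, x, y_ferro)
print(forward(params, x)[0], loss, grad_x.shape)
```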
After training, we fix the parameters of the classifier and use different methods to generate adversarial examples. The first method we use is the differential evolution algorithm, which is a discrete attack method. Fig. 2(a) of the main text gives an adversarial example that differs from the original legitimate one by only a single pixel. Intuitively, if we modify the original sample by flipping more spins, the confidence probability for the classifier to misclassify the modified sample will increase. This is verified in our numerical simulations, and some of our results are shown in Fig. S2. In Fig. S2(a), we randomly choose a legitimate sample from the ferromagnetic phase, which is shown in Fig. S2(b). Without changing any pixel (flipping any spin), the classifier correctly identifies the sample as belonging to the ferromagnetic phase with a confidence level ≈ 80%. However, this confidence probability decreases rapidly as the number of pixels that are allowed to change increases, as clearly demonstrated in Fig. S2(a). Fig. S2(c) and Fig. S2(d) plot two adversarial examples with one and five pixels of the original sample (Fig. S2(b)) changed, respectively. For this particular legitimate sample, changing one pixel (five pixels) leads the classifier to misclassify it with confidence 52% (90%). We note that to obtain Fig. 2(b) and Fig. 2(c) of the main text, we used the same hyperparameters as in Fig. S2.
As discussed in the main text, we may also regard $H_{\text{Ising}}$ as a quantum Hamiltonian and the input data as the local magnetization, so that the input data can be modified continuously. In this case, we can use different methods, such as FGSM, PGD, and MIM discussed in Sec. I, to generate adversarial examples. Some of our results are shown in Fig. S3. In Fig. S3(a), we randomly choose a sample from the paramagnetic phase, which is plotted in Fig. S3(b). At the beginning, the classifier correctly identifies this sample as belonging to the paramagnetic category with confidence ≥ 99%. We then use MIM to modify the original sample; after around three iterations, the classifier begins to make incorrect predictions, and after ten iterations it misclassifies the sample as belonging to the ferromagnetic phase with confidence ≥ 90%. Fig. S3(c) shows the corresponding adversarial example obtained by MIM after ten iterations.

Fig. S3 (partial caption): … It is clear that after around three iterations, the classifier begins to misclassify the samples, and after ten iterations the slightly modified samples are identified as belonging to the ferromagnetic category with confidence ≥ 90%. (b) A randomly chosen legitimate sample from the paramagnetic phase. (c) An adversarial example obtained by MIM, which is slightly different from the original sample. Here, the perturbation is restricted to $\|\delta\|_\infty \le 0.1$.

Table S1. The result of single spin flips. The original spin configuration is classified as ferromagnetic with confidence $P_{\text{ferro}} = 72\%$. "Position" is the spin index on the 30 × 30 lattice; "original spin" is the original spin direction in the legitimate example; "value in AM" is the position's value in the activation map; "P(ferro)" is the classifier's confidence that the sample is ferromagnetic after a single spin flip on the original legitimate example. Flipping a spin so that its sign becomes opposite to the corresponding value in the AM dramatically decreases the confidence of the ferromagnetic phase.

In the main text [Fig. 2(c)], we briefly introduced the idea of the classifier's activation map for the Ising model; we now explain in detail how we obtain this activation map. Since the classifier used for the Ising model is rather simple, its inference can be written exactly as
$$P = \sigma\big(W_2\,\sigma(W_1 x + b_1) + b_2\big),$$
where $\sigma$ is the elementwise sigmoid function, $x$ is the input Ising configuration with shape 900 × 1, $W_1$ is the first-layer weight with shape 100 × 900, and $b_1$ is the first-layer bias with shape 100 × 1. Similarly, $W_2$ is the second-layer weight with shape 2 × 100 and $b_2$ is the second-layer bias with shape 2 × 1. Linearizing the sigmoid functions, we see that $P_1$, the first element of the final prediction $P$, which represents the confidence of the ferromagnetic phase, is governed to first approximation by $(W_2 W_1)_{1,:}\,x$. Therefore, we define $(W_2 W_1)_{1,:}$, the first row of $W_2 W_1$, as the activation map $W_{\text{ferro}}$. $W_{\text{ferro}}$ has shape 1 × 900 and approximately represents $\partial P_{\text{ferro}}/\partial x$, i.e., the weight of each spin in the final prediction of the ferromagnetic phase.
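The activation map is a single matrix product. The numpy sketch below computes $W_{\text{ferro}} = (W_2 W_1)_{1,:}$ for random, untrained weights with the stated shapes (the weights are hypothetical), reports its standard deviation (the flatness measure quoted in the main text), and checks that if both sigmoids are replaced by the identity, the ferromagnetic pre-activation changes by exactly $W_{\text{ferro}}[i]$ when $x_i$ increases by 1.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.05, (100, 900)), np.zeros(100)   # hypothetical, untrained weights
W2, b2 = rng.normal(0, 0.05, (2, 100)), np.zeros(2)

# Activation map: first row of W2 @ W1 (row 0 = ferromagnetic output), shape (900,)
W_ferro = (W2 @ W1)[0]
am_image = W_ferro.reshape(30, 30)               # view on the 30 x 30 lattice
print(W_ferro.shape, W_ferro.std())              # a perfectly flat map would have std 0

def linearized_logit(x):
    """Ferromagnetic pre-activation with both sigmoids replaced by the identity."""
    return (W2 @ (W1 @ x + b1) + b2)[0]

x = rng.integers(0, 2, 900).astype(float)
i = int(np.argmax(W_ferro))
e = np.zeros(900)
e[i] = 1.0
print(linearized_logit(x + e) - linearized_logit(x), W_ferro[i])   # equal up to rounding
```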
A spin with the same sign as the activation-map value at its position pushes the classifier toward the ferromagnetic phase, and a spin with the opposite sign pushes it away. This explains why a single spin flip can change the classifier's prediction dramatically, either increasing or decreasing the ferromagnetic confidence. Using the legitimate example shown in Fig. 2(a) of the main text, we try single spin flips at all positions where the activation map has a value larger than 2.6. The results are listed in Table S1. Flipping such a spin so that its sign becomes opposite to the corresponding AM value can make the classifier incorrectly identify the example as paramagnetic (i.e., P(ferro) < 0.5). This result, combined with the highly non-uniform activation map, shows that the classifier's prediction relies heavily on only a few particular spins. This does not agree with the physically defined order parameter $M = |\sum_{i=1}^{N}\sigma_i|/N$, to which every spin contributes equally. The model therefore does not fully capture the underlying physical principles.
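The single-flip experiment behind Table S1 can be mimicked with a toy scorer. The minimal NumPy sketch below is not the trained classifier. `toy_prob_ferro` is a hypothetical logistic scorer whose weight vector is nearly flat except at one spin (index 100). It stands in for a highly non-uniform activation map. The sketch contrasts the scorer's response with the order parameter $M$, which any single flip changes by exactly $2/N$.

```python
import numpy as np

def magnetization(spins):
    """Order parameter M = |sum_i sigma_i| / N of a flattened +/-1 configuration."""
    return abs(spins.sum()) / spins.size

def single_flip_scan(spins, positions, prob_ferro):
    """Flip one spin at a time; return (position, original spin, P(ferro) after the flip)."""
    results = []
    for j in positions:
        flipped = spins.copy()
        flipped[j] = -flipped[j]
        results.append((j, spins[j], prob_ferro(flipped)))
    return results

# Hypothetical scorer: nearly flat weights except one dominant spin (index 100),
# mimicking a highly non-uniform activation map.
N = 900
w = np.full(N, 0.001)
w[100] = 5.0

def toy_prob_ferro(spins):
    return 1.0 / (1.0 + np.exp(-(w @ spins)))

spins = np.ones(N)
spins[:95] = -1.0                      # M = 710 / 900
scan = single_flip_scan(spins, [100, 200], toy_prob_ferro)
for j, original, p in scan:
    print(j, original, round(p, 4))

# Any single flip changes M by exactly 2 / N, whichever spin is flipped.
flipped = spins.copy()
flipped[100] = -flipped[100]
print(abs(magnetization(spins) - magnetization(flipped)) * N)
```

Flipping spin 100 drives the toy confidence far below 0.5. Flipping spin 200 barely changes it, even though both flips change $M$ by the same amount.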

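The momentum iterative method used for Fig. S3 can be illustrated with a minimal NumPy sketch. Components that leave $[x_{\min}, x_{\max}]$ are reflected with $x \to 2x_{\max} - x$ (or $2x_{\min} - x$), as in the MIM pseudocode of this supplement. Several choices here are assumptions: the initialization $x_0 = x$, $a_0 = 0$; the step size $\alpha = \epsilon/T$; and the L1 normalization of the gradient, as in the original momentum iterative method. The classifier is also hypothetical: a logistic regression on 900 local magnetizations stands in for the trained network.

```python
import numpy as np

def mim(x, grad_fn, eps, n_iter=10, mu=1.0, x_min=-1.0, x_max=1.0):
    """Momentum iterative method with reflection at the data bounds.

    grad_fn(x) returns the gradient of the loss with respect to the input.
    Assumptions: x_0 = x, a_0 = 0, alpha = eps / n_iter, L1-normalized gradient.
    """
    alpha = eps / n_iter
    x_adv = np.array(x, dtype=float)
    a = np.zeros_like(x_adv)
    for _ in range(n_iter):
        g = grad_fn(x_adv)
        a = mu * a + g / max(np.abs(g).sum(), 1e-12)   # accumulate normalized gradients
        x_adv = x_adv + alpha * np.sign(a)              # signed step on every component
        # Reflect components that leave [x_min, x_max] back into the interval.
        x_adv = np.where(x_adv > x_max, 2 * x_max - x_adv, x_adv)
        x_adv = np.where(x_adv < x_min, 2 * x_min - x_adv, x_adv)
    return x_adv

# Hypothetical stand-in classifier: logistic regression on 900 local magnetizations.
rng = np.random.default_rng(2)
w = rng.normal(scale=0.05, size=900)
x = rng.uniform(-0.8, 0.8, size=900)

def prob_ferro(v):
    return 1.0 / (1.0 + np.exp(-(w @ v)))

y = float(prob_ferro(x) > 0.5)          # label the sample as the model currently predicts

def loss(v):
    p = prob_ferro(v)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_loss(v):
    return (prob_ferro(v) - y) * w       # d(cross-entropy)/dx for logistic regression

x_adv = mim(x, grad_loss, eps=0.1)
print(loss(x), loss(x_adv), np.abs(x_adv - x).max())
```

With $\alpha = \epsilon/T$, the total movement of each component is at most $\epsilon$. Reflection never moves a component farther from its legitimate value, so the $\|\delta\|_\infty \le \epsilon$ bound holds without a separate projection.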
B. Topological phases of matter
For the example of topological phases of matter, the classifier is a 3D convolutional neural network (CNN), as shown in Fig. S5(a). It consists of:
- two 3D convolution layers;
- a 3D max-pooling layer;
- a dropout layer with rate 0.4 to avoid overfitting;
- a flattening layer connected to two fully connected layers with dropout rate 0.55.

The output layer is a softmax layer that outputs the probabilities of the three possible topological phases. We use RMSprop as the optimizer with a batch size of 128. The loss function is the cross-entropy, and the learning rate is set to $10^{-3}$.
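The training setup above can be illustrated with a minimal NumPy sketch of three-class softmax cross-entropy optimized by RMSprop. It is not the 3D CNN. A linear softmax layer on hypothetical 16-dimensional features stands in for the network. The RMSprop decay rate 0.9 and $\epsilon = 10^{-7}$ are assumed values. Only the learning rate $10^{-3}$, the batch size 128, the cross-entropy loss and the three output classes come from the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # row-wise shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean categorical cross-entropy; labels are integer class indices."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

class RMSprop:
    """Plain RMSprop: v <- rho*v + (1-rho)*g^2, theta <- theta - lr*g/(sqrt(v)+eps)."""
    def __init__(self, lr=1e-3, rho=0.9, eps=1e-7):
        self.lr, self.rho, self.eps = lr, rho, eps
        self.v = None

    def step(self, theta, grad):
        if self.v is None:
            self.v = np.zeros_like(theta)
        self.v = self.rho * self.v + (1 - self.rho) * grad**2
        return theta - self.lr * grad / (np.sqrt(self.v) + self.eps)

rng = np.random.default_rng(1)
B, D, C = 128, 16, 3            # batch size and classes from the text; D is hypothetical
X = rng.normal(size=(B, D))     # stand-in for flattened CNN features
labels = rng.integers(0, C, size=B)
X[np.arange(B), labels] += 3.0  # make the toy batch learnable
W = np.zeros((D, C))
opt = RMSprop(lr=1e-3)

losses = []
for _ in range(200):
    P = softmax(X @ W)
    losses.append(cross_entropy(P, labels))
    Y = np.eye(C)[labels]
    grad = X.T @ (P - Y) / B    # gradient of the mean cross-entropy w.r.t. W
    W = opt.step(W, grad)
print(losses[0], losses[-1])
```

With all weights at zero, the initial loss is $\ln 3$, since all three phases start equally likely.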
In our scenario, the input data are the density matrices on a 10 × 10 × 10 momentum grid. Following Ref. [31], we express each density matrix $\rho$ as

$$\rho = \frac{1}{3}\left(I + \sqrt{3}\,\mathbf{b}\cdot\boldsymbol{\lambda}\right), \tag{S10}$$

where $\boldsymbol{\lambda}$ is a vector consisting of the eight Gell-Mann matrices, $I$ is the three-by-three identity matrix, and $\mathbf{b} = (b_1, b_2, \cdots, b_8)$ with $b_i = \frac{\sqrt{3}}{2}\,\mathrm{tr}(\rho\lambda_i)$. In this representation, each sample of the input data has the form 10 × 10 × 10 × 8. It can be regarded as a 10 × 10 × 10-pixel image with eight color channels. To train the network, we numerically generate 5001 samples as the training set and 2001 samples as the validation set, with the parameter $h$ varied uniformly from −5 to 5. The training process is shown in Fig. S5(b). The training accuracy increases rapidly at the beginning and then saturates at a high value (≈ 99%). The training loss decreases rapidly at the beginning and then saturates at a small value (≈ 0.05). The classifier therefore performs remarkably well on legitimate samples.

After training, we use three different methods, FGSM, PGD, and MIM, to generate adversarial examples. All three methods work notably well. With the perturbation bounded by $\|\delta\|_\infty \le 0.2$, they achieve a success ratio larger than 76%: for more than 76% of the legitimate samples, they output a corresponding adversarial example.

To obtain Fig. 3(b) of the main text, we randomly choose a sample from the category with topological invariant $\chi^{(m)} = 1$. At the beginning, the classifier identifies this sample correctly with almost unit confidence (≈ 99.5%). We then use MIM to generate adversarial perturbations with the restriction $\|\delta\|_\infty \le 0.05$. After about three iterations, the classifier begins to misclassify the slightly modified sample into the category $\chi^{(m)} = 0$. The confidence of the misclassification approaches 98% after four iterations and saturates at this value. The original legitimate sample is shown in Fig. S4(a), and Fig. S4(b) plots the corresponding adversarial example obtained by MIM. As discussed for Eq. (S10), each density matrix is represented by a vector $\mathbf{b}$ of length eight. For ease of visualization, Fig. S4(a-b) plots only the first component of $\mathbf{b}$, namely $b_1$, at each momentum point. One can also use the experimental data obtained recently in Ref. [31] with a solid-state simulator to generate adversarial examples. This is also observed in our numerical simulations.

To understand why this adversarial example can mislead the powerful topological phase classifier, we study the activation map of each kernel in the first convolutional layer. The sixth kernel has totally different activation patterns on topologically trivial and nontrivial phases. This implies that the classifier uses this kernel as a strong indicator of the different phases. To examine this in more depth, we calculate the correlations between the sixth kernel's activation maps on the $\chi^{(m)} = 0, 1, -2$ phases. The metric we use is the normalized cross correlation (NCC) $\rho$, defined as

$$\rho(I_1, I_2) = \frac{\sigma(I_1, I_2)}{\sqrt{D(I_1)\,D(I_2)}},$$

where $I_1$ and $I_2$ are two activation maps, $D(\cdot)$ is the variance, and $\sigma(\cdot,\cdot)$ is the covariance. It is easy to verify that $|\rho| \le 1$, and a larger $\rho$ indicates a stronger correlation. The results are shown in Table S2. The activation maps of the topologically nontrivial phases ($\chi^{(m)} = 1, -2$) are highly correlated with each other ($\rho = 0.971$). They have little correlation with the topologically trivial phase ($\rho = 0.110$ and $0.244$, respectively).

The adversarial example works because it increases the correlation with the activation maps of the incorrect phase. Fig. 3(c-f) of the main text shows one layer of the sixth kernel's activation maps. Here we show the layers with $k_z = -\pi, 0, 4\pi/5$ in Fig. S4(c-f), corresponding to the layers in Fig. S4(a-b). The activation map of the adversarial example ($I_{\text{adv}}$) clearly becomes more correlated with the average activation map of $\chi = 1$ ($I_{\chi=1}$). Table S2 confirms this: the NCC with $I_{\chi=1}$ changes from 0.235 to 0.689. On the adversarial example, the sixth kernel therefore behaves more like it does on the $\chi = 1$ phase, which leads the classifier to an incorrect prediction.

III. ADVERSARIAL TRAINING
The adversarial machine learning literature has developed a number of methods to make deep neural networks more robust to adversarial perturbations. The simplest and most straightforward one is adversarial training [48]. Its essential idea has two steps. First, generate a substantial number of adversarial examples with certain attack methods. Then, retrain the classifier with both the original legitimate data and the crafted data. After retraining, the classifier becomes more immune to the corresponding attacks, and its robustness to adversarial perturbations is enhanced.

Fig. 4(a-b) of the main text plots the results of adversarial training against the FGSM and PGD attacks for the 3D CNN. To obtain this figure, we retrain the network on a training set of 5001 legitimate samples and their 5001 corresponding adversarial samples. After every epoch, we calculate the accuracy on the legitimate samples and on the adversarial samples separately. The loss is calculated on both legitimate and adversarial samples. The adversarial samples used for training are different in every epoch. In each epoch, we use the current model and the legitimate samples to generate adversarial samples, and then train the model on both legitimate and adversarial samples. In the next epoch, we generate new adversarial samples with the updated model parameters. The accuracy for both the legitimate and adversarial samples increases with the number of epochs and saturates at values larger than 0.96. The retrained classifier is therefore much more robust to adversarial perturbations generated by the corresponding attack methods.

Fig. 2(d) of the main text shows that the activation map of the Ising model classifier becomes much flatter after adversarial training. This is consistent with the symmetry of the order parameter $M$, to which every spin contributes equally. It also makes adversarial training effective for identifying the phase transition point. Fig. 4(c-d) of the main text shows the Ising model classifier's predictions for different $M$ and different spin configurations before and after adversarial training. We consider only the 30 × 30 lattice, on which $M$ changes in steps of 2/900. We therefore divide $M$ from 0 to 1 into 450 intervals and randomly generate 100 samples in each interval. We generate samples according to $M$ rather than $T$ in order to resolve the classifier's behaviour within the small range around the transition point. In Fig. 4(c-d) of the main text, the classifier gives similar predictions for samples with the same $M$, and the confidence of the ferromagnetic phase decreases as $M$ decreases. The classifier has therefore learned the correct classification rule based on the order parameter $M$. Fig. S6 shows the relation between the confidence of the ferromagnetic phase and $M$, averaged over 100 different spin configurations in each $M$ interval. After adversarial training, the prediction curve becomes sharper and smoother. The transition point identified at the confidence threshold $P_{\text{ferro}} = 0.5$ also moves closer to the theoretical critical temperature $T_c$. The classifier can therefore identify the phase transition point more precisely after adversarial training.

Adversarial training also helps the topological phase classifier make better inferences. From Table S2, the activation map of the sixth kernel on the topologically trivial phase ($I_{\chi=0}$) becomes less correlated with those of the topologically nontrivial phases ($I_{\chi=1}$ and $I_{\chi=-2}$). In addition, $I_{\chi=1}$ and $I_{\chi=-2}$ become less correlated with each other. This indicates that the kernel begins to learn to distinguish the two topologically nontrivial phases. Fig. S4(g-h) plots the sixth kernel's activation maps on the legitimate and adversarial examples after adversarial training. $I_{\text{adv}}$ is now much less correlated with $I_{\chi=1}$, which Table S2 also confirms. All these results imply that the classifier has learned a more robust convolutional kernel, which better distinguishes the three topological phases.

The effectiveness of adversarial training can also be seen from another angle. One cannot expect a universal defense strategy that makes phase classifiers robust to all types of adversarial perturbations. However, defending against a certain adversarial attack can help the model better learn the physical principles behind the problem. To support this claim, we apply the discrete attack to the Ising model classifier after adversarial training against a continuous attack (FGSM in this example). The result is shown in Fig. S7. We generate adversarial examples from the test set, which is obtained with the Monte Carlo method at 41 different temperatures, with 250 samples per temperature. At each temperature, we enumerate all one-spin and two-spin flips on all 250 samples. We then compute the ratio of samples for which such a flip makes the model misclassify. After adversarial training, the refined classifier is more robust to discrete spin flips, and the peaks move closer to the phase transition point.

FIG. 1. A schematic illustration of the vulnerability of machine learning phases of matter. Consider a clean image, such as the time-of-flight image obtained in a recent cold-atom experiment [32]. A trained neural network (i.e., the classifier) can successfully predict its corresponding Chern number with nearly unit accuracy. However, if we add a tiny adversarial perturbation (imperceptible to human eyes) to the original image, the same classifier misclassifies the resulting image into an incorrect category with nearly unit confidence.

FIG. 2. (a) A legitimate sample of the spin configuration in the ferromagnetic phase with $M = |\sum_{i=1}^{N}\sigma_i|/N = 0.791$. (b) An adversarial example obtained by the differential evolution algorithm (DEA). It differs from the original legitimate sample only by one flipped spin (in the red circle). (c) The activation map (AM) of the original classifier. Spins at positions with darker colors contribute more to the confidence of the ferromagnetic phase. (d) The activation map of the classifier after adversarial training. The map becomes much flatter, and the variance of the activation values across positions becomes much smaller.

FIG. 3. (a) The average accuracy and loss of the 3D convolutional neural network that classifies the topological phases. (b) The classification probabilities as a function of the iteration number, with adversarial examples obtained by the momentum iterative method. After around two iterations, the network begins to misclassify the samples. (c-f) The activation maps (AM) of the sixth kernel in the first convolutional layer under different settings. (c) The average AM on all samples in the test set with $\chi^{(m)} = 0$. (d) The average AM on $\chi^{(m)} = 1$. (e) The AM obtained with the legitimate sample as the input to the topological phase classifier. (f) The AM obtained with the adversarial example as input.

FIG. 4. The effectiveness of adversarial training. (a) We first numerically generate adequate adversarial examples with FGSM, and then retrain the CNN with both the legitimate and adversarial data. (b) Similar adversarial training for defense against the PGD attack. (c) The original classifier's output, representing the confidence of the ferromagnetic phase. "Configurations" denotes different spin configurations with the same magnetization $M$. The corresponding temperature of each $M$ (listed in parentheses) is calculated by Onsager's formula [50]. The original classifier identifies the transition point at $T = 2.255$. (d) The refined classifier's output after adversarial training against the BIM attack. The identified transition point changes to $T = 2.268$, which is closer to $T_c = 2.269$.

    for t = 1, ..., T do
        Input x_{t−1} into h to get ∇_x L(h(x_{t−1}; θ), y)
        a_t = μ · a_{t−1} + ∇_x L(h(x_{t−1}; θ), y) / ||∇_x L(h(x_{t−1}; θ), y)||
        for each component x_{t−1}(j) of x_{t−1} do
            δ_j = α · sign(a_t)(j)
            x_t(j) = x_{t−1}(j) + δ_j
            if x_t(j) > x_max then
                x_t(j) = π_C(x_t(j)) = 2x_max − x_t(j)
            end if
            if x_t(j) < x_min then
                x_t(j) = π_C(x_t(j)) = 2x_min − x_t(j)
            end if
        end for
    end for
    return x* = x_T

FIG. S1. Machine learning ferromagnetic/paramagnetic phases of the Ising model. (a) The classifier is a fully connected feed-forward neural network. It consists of three layers:
- one input layer with 900 neurons in one-to-one correspondence with the spins of the Ising model;
- one hidden layer with 100 sigmoid neurons;
- one output layer with two softmax neurons that output the probabilities of the paramagnetic and ferromagnetic phases.

(b) The training process. The classifier is trained with numerically simulated data at 40 different temperatures from $T = 0$ to $T = 3.54$. The training set contains 90000 samples, each an array of length 900. The validation set has 10000 samples, and the test set has 10250 samples. We use the RMSprop optimizer with a batch size of 256 and a learning rate of $10^{-3}$. The accuracy is the percentage of correct classifications, and the loss is the cross-entropy.

FIG. S2. Performance of the differential evolution algorithm for the ferromagnetic Ising model. In this figure, the population size $n$ is set to 100 and the mutual factor $M$ to 30. The algorithm stops and returns the adversarial samples when the confidence probability converges with an increasing number of iterations. (a) The confidence probabilities for the ferromagnetic and paramagnetic categories versus the number of pixels allowed to be changed (i.e., spins allowed to be flipped). We randomly choose a legitimate sample (here the 5067th sample in the validation set), which the neural network correctly classifies as ferromagnetic with confidence 80%. (b) The legitimate sample from the ferromagnetic phase. Each small square corresponds to a spin, and black (white) means that the corresponding spin points down (up). (c) An adversarial sample with one pixel changed. The classifier misclassifies this modified sample into the paramagnetic category with confidence 52%. (d) An adversarial sample with five pixels changed. The classifier misclassifies this modified sample into the paramagnetic phase with confidence 90%.

FIG. S3. Performance of the momentum iterative method (MIM) for the Ising model. (a) The classification probabilities of the ferromagnetic and paramagnetic phases as a function of the iteration number. After around three iterations, the classifier begins to misclassify the samples. After ten iterations, the slightly modified samples are identified as belonging to the ferromagnetic category with confidence ≥ 90%. (b) A randomly chosen legitimate sample from the paramagnetic phase. (c) An adversarial example obtained by MIM, which differs only slightly from the original sample. Here, the perturbation is restricted to $\|\delta\|_\infty \le 0.1$.

FIG. S4. (a) A legitimate sample of the first component of the input data, which is related to the density matrix in momentum space. Only the slices at $k_z = -\pi, 0, 4\pi/5$ are displayed. (b) An adversarial example obtained by FGM. It differs from the sample in (a) only by a tiny perturbation and has average fidelity 0.997. (c)(d)(e)(f) The activation maps (AM) of the sixth kernel in the first convolutional layer under different settings. (c) The average AM on all samples in the test set with $\chi^{(m)} = 0$. (d) The average AM on $\chi^{(m)} = 1$. (e) The AM obtained with the legitimate sample (a) as the input to the original classifier. (f) The AM obtained with the adversarial example (b) as input. (g)(h) The same as (e)(f), but for the classifier after adversarial training.

FIG. S5. Learning topological phases with a 3D convolutional neural network (CNN). (a) The structure of the CNN phase classifier. (b) The training process, with the cross-entropy as the loss function. After ten epochs, the classifier identifies samples from both the training and validation sets with accuracy ≈ 99%.

FIG. S7. Results of one-pixel and two-pixel attacks on the original classifier and on the refined classifier after adversarial training against FGSM attacks. (a)(b) The results of the one-pixel attack. The model is more vulnerable near the critical temperature $T_c$. After adversarial training, the ratio of adversarial samples decreases substantially and the peak moves exactly to $T_c$, so the model becomes more robust. (c)(d) The results of the two-pixel attack, from which a similar conclusion can be drawn.

1, 1]. If the adversarial sample has components exceeding this bound, FGSM simply sets each such component to either $x_{\min}$ or $x_{\max}$. A pseudocode representation of the fast gradient sign method is shown in Algorithm 2.

TABLE S2. The normalized cross correlations between different kinds of activation maps of the sixth kernel, before and after adversarial training. $I_{\chi=i}$ is the average activation map for the $\chi = i$ phase, where $\chi = 0$ is topologically trivial and $\chi = 1, -2$ are topologically nontrivial. $I_{\text{leg}}$ is the activation map of the legitimate example with $\chi = 0$, and $I_{\text{adv}}$ is the activation map of the adversarial example.
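The normalized cross correlation reported in Table S2 can be computed with a minimal NumPy sketch. The NCC is the covariance of two activation maps divided by the square root of the product of their variances. The 10 × 10 × 10 maps below are random placeholders, not the sixth kernel's actual activations, and the output shape of the first convolutional layer is an assumption.

```python
import numpy as np

def ncc(I1, I2):
    """Normalized cross correlation: cov(I1, I2) / sqrt(var(I1) * var(I2))."""
    a = np.asarray(I1, dtype=float).ravel()
    b = np.asarray(I2, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    # The 1/n normalization of covariance and variances cancels in the ratio.
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Hypothetical 10 x 10 x 10 activation maps (the real ones come from the sixth kernel).
rng = np.random.default_rng(3)
I_ref = rng.normal(size=(10, 10, 10))
I_close = I_ref + 0.3 * rng.normal(size=(10, 10, 10))   # similar pattern plus noise
I_other = rng.normal(size=(10, 10, 10))                 # unrelated pattern

print(round(ncc(I_ref, I_close), 3), round(ncc(I_ref, I_other), 3))
print(ncc(I_ref, 2.0 * I_ref + 5.0))   # invariant to positive scaling and offset
```

Because the NCC is invariant to positive rescaling and constant offsets, it compares the spatial pattern of two activation maps rather than their overall magnitude. This suits comparing a single example's map with a phase-averaged map.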