Tandem Deep Learning Side-Channel Attack on FPGA Implementation of AES

Side-channel attacks have become a realistic threat to implementations of cryptographic algorithms, especially with the help of deep-learning techniques. The majority of recently demonstrated deep-learning side-channel attacks use a single neural network classifier to extract the secret from implementations of cryptographic algorithms. The potential benefits of combining multiple classifiers using the ensemble learning method have not been fully explored in the side-channel attack’s context. In this paper, we propose a tandem approach for the attack in which multiple models are trained on different attack points but are used in parallel to recover the key. Such an approach allows us to considerably reduce (33.5% on average) the number of traces required to recover the key from an FPGA implementation of AES by power analysis. We also show that not all combinations of classifiers improve the attack efficiency.


Introduction
Deep-learning Side-Channel Attacks (DL-SCAs) utilize deep-learning models to bypass the theoretical strength of cryptographic algorithms. Many attacks on software implementations of the Advanced Encryption Standard (AES) have been demonstrated recently. In [1][2][3], the effect of changing the hyperparameters of deep-learning models for side-channel attacks is investigated. Afterwards, Wu et al. [4] and Rijsdijk et al. [5] provide two different approaches for tuning neural networks' hyperparameters automatically. In [6], a monobit-model technique to improve the attack efficiency is presented. To explore how board diversity affects the attack accuracy of trained deep-learning models, Wang et al. [7] show to what extent a model trained for one device can lead to successful attacks on another device. To mitigate the effect of board diversity, references [8][9][10] propose a cross-device approach, which trains models on traces captured from multiple devices. In addition, in [11], the recently proposed federated learning framework [12] is applied to improve the efficiency of attacks on an 8-bit ATxmega128D4 microcontroller implementation of AES-128.
However, software implementations of AES are relatively easy to break using side-channel analysis because instructions are computed sequentially [13]. In hardware implementations, computations are performed in parallel. Therefore, DL-SCAs on hardware implementations are inherently more difficult, especially in advanced technologies. The traces of two well-known public datasets, DPA contest V2 [14] and AES_HD [15], are captured from Xilinx Virtex-5 series FPGAs. Many attacks have been demonstrated on these two datasets. In [15], the Random Forest (RF) technique requires more than 5000 traces to recover a subkey. Masure et al. [16] investigate the theoretical soundness of Convolutional Neural Networks (CNNs) in the context of side-channel analysis, and references [17][18][19] demonstrate successful attacks on Virtex-5 FPGAs using CNNs. On a lightweight implementation of AES on an Artix-7 FPGA [20], a non-profiled attack is able to recover the key with 3700 traces. Apart from FPGAs, [21] shows the effectiveness of CNN-based side-channel attacks on ASICs. Table 1 shows a summary of previous attacks on hardware implementations of AES. To the best of the authors' knowledge, previous works did not consider the potential of combining multiple deep-learning classifiers in DL-SCAs on hardware implementations. When traces are particularly noisy, an ensemble of multiple models can outperform a single classifier. It is also necessary to test models on devices manufactured using advanced technologies.
To address these limitations and to further improve the attack efficiency, we propose a tandem deep-learning side-channel attack. It is inspired by a machine learning meta-algorithm called Adaptive Boosting (AdaBoost) [22], which is a subset of ensemble learning [23]. In AdaBoost, different classifiers (weak classifiers) are trained on the same training set. These weak classifiers are combined to form a boosted classifier (strong classifier). In our approach, several different, separately trained deep-learning models are used in an ensemble, and we multiply the models' outputs to reduce the generalization error. Since different models usually do not make the same errors on the test set, an ensemble of multiple models is expected to perform better than its members [24]. To reduce the generalization error, we train different classifiers (weak classifiers) on different training sets, which are labeled by different attack points. These weak classifiers are combined to form a boosted classifier (tandem model), which is able to achieve a more efficient attack.
In this paper, we show that while our best single CNN classifier requires 251 traces on average for a successful attack, the tandem model requires 167, which is a 33.5% reduction.

Paper organization
The rest of the paper is organized as follows. The next section provides background information on software and hardware implementations of AES, and reviews how deep-learning-based side-channel attacks work. The third section reviews existing deep-learning side-channel attacks on two well-known publicly available datasets for hardware implementations of AES. The fourth section describes the equipment used in the experiments. The fifth section explains how different deep-learning models are used in a tandem. The sixth section presents the experimental results. The last section concludes this paper.

Background
In this section, we start by reviewing AES-128 and comparing hardware and software implementations of AES. Afterwards, we review deep-learning techniques and CNN.

AES-128
AES-128 [27] is a symmetric encryption algorithm, which takes a 128-bit block of plaintext and a 128-bit key as inputs. Figure 1 shows the flow of the AES-128 algorithm and the three attack points used in our experiments. AES-128 contains 10 rounds in total. Except for the last round, each round has 4 steps: SubBytes, ShiftRows, MixColumns and AddRoundKey. The last round omits MixColumns. The SubBytes procedure is a byte-to-byte substitution using a lookup table called Substitution Box (SBox). Like any block cipher, AES can be used in several modes of operation. In this paper we use Electronic Codebook (ECB) mode, in which the message is divided into blocks and each block is encrypted separately.

Hardware vs. Software Implementations of AES
In software implementations of AES, leakage is time-dependent and samples are less noisy since instructions are carried out one by one [18]. This makes it easier for deep-learning models to learn features from traces. On the other hand, hardware implementations of AES execute operations in parallel. Therefore, traces captured from hardware implementations contain overlapping features of all subkeys, which makes side-channel analysis inherently more difficult, especially in advanced technologies. For example, Fig. 2a, b shows power traces captured from an 8-bit ATxmega128D4 microcontroller implementation and a Xilinx Artix-7 FPGA implementation of AES-128, respectively, during the execution of the SBox operations of the first round of AES. In the trace from the microcontroller, 16 SBox computations are executed sequentially. To recover each key byte, an attacker can build a specialized model on the specific part of the trace. However, FPGAs execute all 16 SBox computations in parallel, and a model built for only one subkey needs to handle the overlap caused by the other 15 SBox computations. This makes the single-model attack less efficient.

Deep-Learning Side-Channel Attacks
Deep learning [24] is a subset of machine learning which uses neural networks to explore different levels of representative features of data for classification or prediction. Deep-learning models start with simple features and, by combining them layer by layer, progressively extract more complex features. Given the training data and a certain set of parameters, deep-learning models are able to perform particular tasks such as classification [28].
The aim of DL-SCAs is to use deep-learning models to classify a set of power traces $\mathcal{T} = \{T_1, T_2, \ldots, T_m\}$ based on their labels in order to derive the secret key, where $m$ is the number of traces and $T_i$ denotes a single trace. The label of trace $T_i$ is denoted as $l(T_i) \in \mathcal{L}$, where $\mathcal{L} = \{0, 1, \ldots, 255\}$ is the set of intermediate data values processed at the attack point. To recover the full 128-bit key $K$ of AES-128, a divide-and-conquer strategy is typically applied: the key $K$ is divided into 8-bit parts $K_k \in \mathcal{K}$, called subkeys, for $k \in \{1, 2, \ldots, 16\}$, where $\mathcal{K} = \{0, 1, \ldots, 255\}$, and the subkeys are recovered independently.
Deep-learning side-channel attacks are usually composed of two stages: the profiling stage (Fig. 3a) and the attack stage (Fig. 3b).
At the profiling stage, the attacker first uses the profiling device to encrypt a large number of plaintexts using known keys and captures the traces. The model is trained on the labeled traces to learn the correlation between traces and keys. A neural network can be viewed as a mapping $N: \mathbb{R}^n \to [0,1]^{|\mathcal{L}|}$, which maps a trace $T_i \in \mathbb{R}^n$ into a score vector $S_i = N(T_i) \in [0,1]^{|\mathcal{L}|}$ whose elements $s_{i,j}$ represent the probability that the label $l(T_i)$ has the value $j \in \{0, 1, \ldots, 255\}$, where $n$ is the number of data points in $T_i$.
At the attack stage, the attacker uses the victim device to encrypt a small number of plaintexts and records the corresponding traces. Using the trained model to classify the traces captured from the victim device, the attacker obtains the corresponding intermediate data and hence derives the subkey. The mapping from labels to subkeys can be described as a retrieve function $F: [0,1]^{|\mathcal{L}|} \to [0,1]^{|\mathcal{K}|}$. Through the one-to-one mapping $F$, a guess vector $P_i = F(S_i)$ is obtained from the score vector $S_i$. Each element $p_{i,j}$ of $P_i$ represents the probability that the subkey is $j$. Let $\mathcal{P} = \{p_0, p_1, \ldots, p_{255}\}$ denote the cumulative guess vector, which is the element-wise product of all guess vectors generated by classifying $m$ traces:

$$p_j = \prod_{i=1}^{m} p_{i,j}.$$

The attacker then selects the subkey $K_k = j$ which has the largest probability in $\mathcal{P}$:

$$K_k = \arg\max_{j \in \mathcal{K}} p_j.$$

We use $K_k^*$ to denote the real subkey. Once $K_k = K_k^*$, the subkey is recovered successfully.
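The attack stage described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `cumulative_guess` and `best_subkey` are hypothetical, the retrieve function is assumed to be the simple one-to-one mapping "subkey = label XOR ciphertext byte", and the element-wise product is computed as a sum of logarithms (with a small `eps`) to avoid floating-point underflow.

```python
import math

def cumulative_guess(score_vectors, ciphertext_bytes, eps=1e-40):
    """Accumulate per-trace score vectors into one log-domain guess vector.

    score_vectors[i][label] is the classifier's probability that trace i
    carries that label; ciphertext_bytes[i] is the known ciphertext byte.
    """
    log_guess = [0.0] * 256
    for scores, c in zip(score_vectors, ciphertext_bytes):
        for label, prob in enumerate(scores):
            key = label ^ c  # assumed retrieve function: label -> subkey
            # A sum of logs is the element-wise product in the log domain.
            log_guess[key] += math.log(prob + eps)
    return log_guess

def best_subkey(log_guess):
    """Return the subkey candidate with the largest cumulative score."""
    return max(range(256), key=lambda j: log_guess[j])
```

Working in the log domain does not change the arg max, since the logarithm is monotonic.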
Since operations in hardware implementations of AES are executed in parallel, traces captured from FPGAs are particularly noisy. In this scenario, CNN-based side-channel attacks appear well suited to handling noisy traces. CNNs were originally introduced for image, speech and time series processing and for document recognition [29]. The strength of a CNN is that different network layers can learn features of the input data at different levels. A typical CNN contains three types of layers: convolutional layers for filtering, pooling layers for downsampling and Fully-Connected (FC) layers for projection. CNNs have been successfully applied to bypass trace misalignment and to overcome jitter-based countermeasures [30]. CNNs have also been used to break protected AES implementations [31][32][33].

Previous Work
This section reviews some existing deep-learning side-channel attacks on two well-known publicly available datasets for hardware implementations of AES, called DPA v2 and AES_HD.
The DPA Contest v2 [14] was organized by the VLSI research group of the COMELEC department of the French university Télécom ParisTech. The acquisitions were performed on a SASEBO GII board [38] implementation of AES-128. The board features the Xilinx Virtex-5 [39] LX30/LX50 as the target FPGA for implementation evaluation. The AES_HD dataset [15] consists of EM measurements of an unprotected AES-128 implementation on the Xilinx Virtex-5 FPGA of a SASEBO GII evaluation board. The implementation is written in VHDL in a round-based architecture that takes 11 clock cycles for each encryption. The dataset has 500K traces in total with 500K randomly generated plaintexts, and each trace contains 1250 samples.
Deep-learning techniques were first used to assist power analysis in 2013 [40] when a three-layer MLP network was trained to break a Smart Card implementation of AES-128 which contains an 8-bit microcontroller PIC16F84 [41].
Apart from MLPs, Maghrebi et al. [18] investigate how other types of deep-learning models could make side-channel attacks more efficient, based on three different datasets. To break a Virtex-5 FPGA implementation of AES (DPA v2 dataset), the CNN and Autoencoder (AE)-based approaches in [18] require roughly 200 traces on average to recover the secret key. With a template attack, about 400 traces are required.
To pursue the line of work on CNN-based power analysis, Cagli et al. [30] apply a CNN with a data augmentation technique [42] to bypass trace misalignment and to overcome jitter-based countermeasures. Cagli et al. [30] first point out that the conventional template attack strategy has difficulty dealing with trace misalignment, which forces the attacker to carefully realign the captured traces. Afterwards, they experimentally show that the CNN-based strategy greatly simplifies the attack roadmap since it waives the requirement for trace realignment and precise selection of points of interest.
To further improve the efficiency of attacks on hardware implementations of AES, Jin et al. [36] introduce an attention mechanism called Convolutional Block Attention Module (CBAM). They incorporate the proposed CBAM module into their CNN architecture, which helps the model to find the informative points of traces. In their experiments, the enhanced CNN model requires 2100 traces to recover a subkey from a Virtex-5 FPGA implementation of AES (AES_HD dataset).
To explain the role of each hyperparameter of neural networks during the feature selection phase in the context of side-channel attacks, Zaid et al. [37] use three visualization techniques to show the inner workings of models: Weight Visualization [43], Gradient Visualization [44] and Heatmaps [45]. With the help of these visualization approaches, Zaid et al. [37] need 1050 traces to recover a subkey from a Virtex-5 FPGA implementation of AES (AES_HD dataset).
In [6], to train neural networks, each bit of the intermediate data processed at the attack point is used as one label. Thus, when considering a subkey which is a byte, there are 8 labels in total. This technique is presented to overcome the curse of class imbalance, since each bit is nearly uniformly distributed, in contrast to labels under the Hamming weight (HW) leakage model. To break a Virtex-5 FPGA implementation of AES (AES_HD dataset), the multi-label model in [6] uses 831 traces to recover a subkey. Tables 2 and 3 summarize some existing deep-learning side-channel attacks on Virtex-5 FPGA implementations of AES.

In this work, we go one step further and focus on a Xilinx Artix-7 FPGA implementation of AES-128. Unlike Virtex-5 FPGAs, which are manufactured using 65 nm process technology, Artix-7 FPGAs are manufactured using 28 nm process technology. The more advanced manufacturing process makes the attack particularly difficult. Besides, none of these existing works takes the impact of board diversity into consideration. They train deep-learning models on traces captured from the victim device, which requires unlimited access to the target. Clearly, this condition is unlikely to hold in a real attack scenario. In our experiments, we train and test deep-learning models on traces captured from different devices to mitigate the effect of board diversity.

Figure 4 shows two Xilinx Artix-7 FPGAs manufactured using 28 nm High-K Metal Gate (HKMG) process technology. In the sequel, we call these two boards FPGA1 and FPGA2, respectively. They are programmed with the same implementation of AES-128 in Electronic Codebook (ECB) mode of operation. We use a ChipWhisperer-Lite [46] with a 40 MHz sampling rate for trace capture. In our experiments, FPGA1 is used as the profiling board and FPGA2 is the victim board.

Attack Point
An attack point is a selected intermediate state which can be used to describe the power consumption. To form the proposed 3-classifier tandem model, three attack points are selected from the last round of AES-128, since the last round does not have the MixColumns procedure (see Fig. 1). It has only 3 operations: SubBytes, ShiftRows and AddRoundKey. The SubBytes procedure is a byte-to-byte substitution using a lookup table called Substitution Box (SBox). We denote these three attack points as $x_1$, $x_2$ and $x_3$. In their definitions, $C_k$ represents the $k$th 8-bit ciphertext byte, and sft_row$^{-1}()$ and SBox$^{-1}()$ denote the inverses of ShiftRows and SubBytes, respectively. Attack point $x_1$ is the input of the last round, $x_2$ is the output of the ShiftRows operation, and $x_3$ is the XOR of the input and the output of the last round. Note that $x_3$ represents switching activity, which is known to be the dominant fraction of the total power consumed by a CMOS device.
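The three attack points can be illustrated with a short Python sketch that works backwards from a ciphertext and a last-round key byte. It is only one plausible reading of the verbal definitions above: the column-major byte layout follows FIPS-197, the helper names (`shift_rows_source`, `attack_points`) are hypothetical, and the choice to XOR $x_1$ with the ciphertext byte stored at the same state position when forming $x_3$ is an assumption, not a formula taken from the paper.

```python
def _gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def _build_sbox():
    """Generate the AES SBox: field inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the multiplicative inverse of x
                inv = _gf_mul(inv, x)
        y = inv
        for s in range(1, 5):  # XOR with the four left rotations of inv
            y ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox

SBOX = _build_sbox()
INV_SBOX = [0] * 256
for _i, _v in enumerate(SBOX):
    INV_SBOX[_v] = _i

def shift_rows_source(p):
    """Index of the state byte that ShiftRows moves to position p."""
    row, col = p % 4, p // 4
    return row + 4 * ((col + row) % 4)

def attack_points(ciphertext, key_byte, p):
    """Return (x1, x2, x3) for ciphertext position p and a key-byte guess."""
    x2 = ciphertext[p] ^ key_byte   # state byte after SubBytes and ShiftRows
    x1 = INV_SBOX[x2]               # corresponding input byte of the last round
    # Assumption: x3 pairs x1 with the ciphertext byte that overwrites it.
    x3 = x1 ^ ciphertext[shift_rows_source(p)]
    return x1, x2, x3
```

During profiling the key byte is known, so each trace can be labeled with all three values; during the attack the same relations are inverted for every key guess.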

Model Structure
CNNs have been successfully applied to bypass trace misalignment and to overcome jitter-based countermeasures [30]. The layer structures of our local CNN classifiers are shown in Table 4. We use the identity model as the power model, which assumes that the power consumption is proportional to the data processed at the attack point. Three CNN classifiers, referred to as classifiers 1, 2 and 3, are trained on traces labeled by attack points $x_1$, $x_2$ and $x_3$, respectively. We use the categorical cross-entropy loss to quantify the classification error and the RMSprop optimizer to tune the internal parameters.

Tandem Deep-Learning Model
As shown in Fig. 5, three CNN classifiers are trained on the same traces, but labeled by different attack points. To retrieve the subkey $K_k$ from $x_1$, $x_2$ and $x_3$, we define three different retrieve functions $R_1$, $R_2$ and $R_3$, one for each attack point. During the attack stage, the 3 local classifiers are used individually to classify the traces captured from the victim board and obtain their own cumulative guess vectors $\mathcal{P}_1$, $\mathcal{P}_2$ and $\mathcal{P}_3$, which represent the classification results of the 3 local classifiers. Afterwards, we multiply these classification results element-wise and obtain the final guess vector $\tilde{\mathcal{P}} = \mathcal{P}_1 \times \mathcal{P}_2 \times \mathcal{P}_3$ to form the tandem model.
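The combination step can be sketched as follows. This is a minimal illustration with hypothetical names (`tandem_guess`, `recovered_subkey`); it assumes each local classifier has already produced a 256-entry cumulative guess vector indexed by subkey candidate, and it sums logarithms instead of multiplying probabilities, which leaves the arg max unchanged while avoiding underflow.

```python
import math

def tandem_guess(local_guess_vectors, eps=1e-40):
    """Element-wise product of cumulative guess vectors, in the log domain."""
    combined = [0.0] * 256
    for guess in local_guess_vectors:
        for j, p in enumerate(guess):
            combined[j] += math.log(p + eps)
    return combined

def recovered_subkey(combined):
    """Subkey candidate with the highest combined score."""
    return max(range(256), key=lambda j: combined[j])
```

The product rewards candidates that all classifiers rate highly. If each local classifier ranks a different wrong candidate first but all of them rank the correct subkey second, the combined vector still places the correct subkey first.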

Estimation Metrics
Rank
The rank of a key $K$, $\mathrm{Rank}(K)$, is the number of keys with a higher probability than $K$:

$$\mathrm{Rank}(K) = \left|\{K' \in \mathcal{K} : \Pr(K') > \Pr(K)\}\right|.$$

Guessing Entropy
The Guessing Entropy is the expected rank among all possible keys: $GE = \mathbb{E}_{K \in \mathcal{K}}(\mathrm{Rank}(K))$. If subkeys are recovered individually, then the guessing entropy is computed for each subkey $K_k$ separately and the Partial Guessing Entropy (PGE) is used as the estimation metric [47].
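Both metrics are simple to compute from guess vectors, as the sketch below shows. The function names are hypothetical, and the estimator is an assumption: the expectation is approximated by averaging the rank of the real subkey over a number of independent tests.

```python
def rank(guess_vector, true_key):
    """Number of candidates with a strictly higher score than the true key."""
    p_true = guess_vector[true_key]
    return sum(1 for p in guess_vector if p > p_true)

def partial_guessing_entropy(guess_vectors, true_key):
    """Average rank of the true subkey over several independent tests."""
    ranks = [rank(g, true_key) for g in guess_vectors]
    return sum(ranks) / len(ranks)
```

A rank of 0 means that no candidate scores higher than the real subkey, so the subkey is recovered.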

Experimental Results
In this section, we first evaluate how non-profiling attacks such as Correlation Power Analysis (CPA) [48] perform on traces captured from Artix-7 FPGA implementations of AES. Afterwards, we investigate the average number of traces required to recover the key using a single CNN classifier without the tandem approach. Next, we test to what extent the 2-classifier and 3-classifier tandem models can improve the attack efficiency. Finally, for completeness, we investigate how the result changes if tandem models are built by combining classifiers trained on the same attack point.

Correlation Power Analysis
To show how non-profiling attacks such as CPA perform on an Artix-7 FPGA implementation of AES, in this section we present CPA results for 5K traces. We use three different attack points with the identity power model. Figure 6a-c shows the correlation results for all 16 subkeys for attack points $x_1$, $x_2$ and $x_3$, respectively.
As we can see from the PGE plots in Fig. 6a, b, CPA cannot recover any subkey for any of the selected attack points within 5K traces without key enumeration. Attack point $x_3$ achieves the best CPA result, in which the minimum rank is 1 and the maximum is 249. Note that once the rank reaches 0, the key is recovered.
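For readers unfamiliar with CPA, the idea can be sketched in pure Python. This is an illustrative toy, not the setup used for Fig. 6: it assumes the attack point is the ciphertext byte XORed with the key guess, an identity power model in which leakage grows with the data value (so the signed correlation is used), and hypothetical function names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def cpa_rank_keys(traces, ciphertext_bytes):
    """Return all 256 key guesses sorted from most to least likely."""
    n_samples = len(traces[0])
    columns = [[t[s] for t in traces] for s in range(n_samples)]
    peak = []
    for guess in range(256):
        # Hypothetical leakage under the identity power model.
        hypo = [c ^ guess for c in ciphertext_bytes]
        peak.append(max(pearson(hypo, col) for col in columns))
    return sorted(range(256), key=lambda g: -peak[g])
```

The position of the real subkey in the returned list corresponds to its rank, so position 0 means the subkey is recovered.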

Single-Classifier Model
In this section, our experiments are designed to show how many traces are required to recover the key using single-classifier models trained on traces labeled by different attack points. For each attack point, we train classifiers with a learning rate of 0.0001, no learning rate decay, no dropout, and a batch size of 256. The CNNs are trained using the RMSprop optimizer. To select the best number of epochs, for each of the three classifiers we trained 10 models using $e$ epochs, for $e \in \{10, 20, \ldots, 100\}$. At each iteration, the model is stored instead of being overwritten. The resulting best numbers of epochs are shown in Table 5.
Classifiers 1, 2 and 3 are trained on 1,000K traces captured from FPGA1 and labeled by attack points $x_1$, $x_2$ and $x_3$, respectively, with 200K traces randomly set aside for validation. We have two test sets of the same size: the first contains 50K traces captured from FPGA1 and the second contains traces captured from FPGA2. For a single test, 1K … where min and max are the minimum and the maximum data points in T.

Figure 7a-c shows the PGE of classifiers 1, 2 and 3 tested on traces captured from FPGA1 and FPGA2, respectively. Classifier 1 is able to recover the key using 524 traces captured from FPGA1, and 815 traces from FPGA2, on average. For classifier 2, the results are 533 and 672 traces. Classifier 3 is the best model: it can recover the key using 251 traces captured from FPGA1, and 342 traces from FPGA2. These results are summarized in Table 6.

Classifier 3 uses fewer traces to recover the key than the other classifiers, which indicates that $x_3$ is more efficient than the other attack points for breaking both FPGA1 and FPGA2. This is an expected result since classifier 3 uses the attack point defined by the Hamming distance between two states, while classifiers 1 and 2 use the attack points defined by the values of the states themselves (identity power model). It is known that, for hardware implementations, the Hamming distance is a better power model than the identity, since the total power consumption is dominated by the dynamic power consumption which, in turn, is determined by the switching activity of logic gates [50]. A larger Hamming distance implies a higher switching activity.
Next, we combine classifiers into a tandem.

2-Classifier Tandem Model
The 2-classifier tandem model is built by combining 2 of the 3 CNN classifiers. Figure 8 shows the PGE results and Table 6 shows the average number of traces used by the 2-classifier tandem models to break both FPGA1 and FPGA2. We notice that, except for the tandem model built by combining classifiers 1 and 2, all other 2-classifier tandem models use fewer traces than the classifiers they include. Compared to our best single classifier (classifier 3), the tandem model combining classifiers 1 and 3 uses 30.7% fewer power traces to recover the key of FPGA1 and 21.1% fewer for FPGA2. Also, the tandem model combining classifiers 2 and 3 uses 33.5% fewer power traces to recover the key of FPGA1 and 11.4% fewer for FPGA2. However, the tandem model which combines classifier 1 and 2 needs to use 10.9% more traces to break FPGA2 than classifier 1, which is the best local classifier of this tandem model. Our explanation is that attack points 1 and 2 do not provide enough diversity.

Fig. 7 Average PGE of CNN classifiers tested on traces captured from FPGA1 and FPGA2. To compute the average, 500 tests were performed for each classifier. For each test, 5000 traces were randomly selected from 50K traces

3-Classifier Tandem Model
Our 3-classifier tandem model is built by combining classifiers 1, 2 and 3, which utilizes all available attack points. Figure 8(d) shows the PGE result and Table 6 shows the average number of traces. Compared to the results of the 2-classifier models, adding one more classifier indeed improves the attack efficiency when the target device is different from the profiling device. The 3-classifier model uses 172 traces to recover the key of FPGA1 and 219 traces for FPGA2, on average. Compared to our best single classifier, it uses 31.5% fewer traces to break FPGA1 and 36.0% fewer traces for FPGA2. Table 6 shows that a tandem with a larger number of classifiers seems to be more robust to manufacturing process variations, since it needs the fewest traces for the attack on FPGA2. We can also see that, for FPGA1, all tandems that include the classifier based on attack point 3 achieve comparable results.
Besides, since it is easier to attack the device which was used for training, the experimental results for FPGA1 can be treated as the lower bound on the number of traces required for a successful attack. Attacking another device is more difficult due to manufacturing process variations.
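The percentage reductions quoted for the 3-classifier tandem follow directly from the average trace counts reported above. A small sketch (the function name `trace_reduction` is hypothetical) makes the arithmetic explicit:

```python
def trace_reduction(single, tandem):
    """Percentage of traces saved by the tandem relative to a single classifier."""
    return 100.0 * (single - tandem) / single

# Classifier 3 versus the 3-classifier tandem, using the averages from the text.
print(round(trace_reduction(251, 172), 1))  # FPGA1: 31.5
print(round(trace_reduction(342, 219), 1))  # FPGA2: 36.0
```

The same formula applied to 251 and 167 traces gives the 33.5% reduction reported for the best 2-classifier tandem on FPGA1.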

Tandem Model with the Same Attack Point
To verify that it is important to use different attack points to train classifiers for building a tandem model, we further train 4 CNN classifiers on attack point 3 with different numbers of epochs. Then, we combine them into a tandem. Table 7 shows the average number of traces required to recover the key. From Table 7, we can conclude that building tandems from multiple classifiers trained on the same attack point is not a good strategy.

Conclusion
By combining multiple deep-learning classifiers trained on different attack points, the proposed tandem model is able to achieve a more efficient attack on an Artix-7 FPGA implementation of AES. Compared to the conventional single-classifier attack, the tandem model with multiple attack points can significantly improve the attack efficiency. We show that one of our 2-classifier tandem models is able to use 33.5% fewer traces to break the profiling device (FPGA1). We also show that our 3-classifier tandem model is able to use 36.0% fewer traces to break the victim device (FPGA2). Finally, we show that it is important to use different attack points to build the tandem model.