1 Introduction

Deep learning using a deep neural network (DNN), a machine-learning technique, has seen rapid development and has been applied to image-recognition systems for surveillance cameras and autonomous vehicles. It has also been put to practical use in security-related applications such as the detection of malicious emails.

Conventional side-channel attacks (SCAs), such as differential power analysis (DPA) [1], correlation power analysis (CPA) [2], and the template attack [3], require statistical signal processing of the power waveform based on knowledge of the target cryptographic algorithm's implementation. Furthermore, a high level of skill is required to analyze cryptographic circuits with SCA countermeasures such as random-number masking and/or clock jitter.

Deep-learning SCAs (DL-SCAs), which apply deep learning to SCAs, have been proposed [4]. They make it possible to attack cryptographic circuits protected with conventional SCA countermeasures without knowledge of those countermeasures.

SCAs are classified as profiled or non-profiled. In profiled attacks, a leakage model is constructed using side-channel information from a freely controllable device identical to the target device (profiling device). The attack is then carried out using the constructed leakage model and side-channel information obtained from the target device. One example of profiled attacks is a template attack. Non-profiled attacks use only the side-channel information obtained from the target device. One example of non-profiled attacks is a CPA. Profiled attacks have been commonly used for DL-SCAs, but non-profiled DL-SCAs were introduced in 2018 [5].

1.1 Related works

In 2016, Maghrebi carried out DL-SCAs using convolutional layers for the first time against an AES software implementation protected with masking countermeasures and showed that convolutional layers are effective for SCAs [4]. Cagli et al. carried out DL-SCAs on AES software and hardware implementations protected with a random-jitter countermeasure using data augmentation and showed that the attack is possible without an alignment process [6].

Hou carried out a CPA on DPA Contest V4, a dataset of an AES software implementation protected with the Rotating Sboxes Masking (RSM) countermeasure, and successfully attacked it after preprocessing to determine the offset and mask values [7]. Gilmore also successfully attacked DPA Contest V4 by using a multi-layer perceptron (MLP) [8].

In 2018, Prouff released the ANSSI SCA Database (ASCAD), a public dataset of an AES software implementation protected with the table re-computation masking countermeasure, and carried out profiled DL-SCAs to examine which DNN parameters, such as the number of convolutional layers and filters, are suitable for DL-SCAs [9].

In 2018, Timon proposed an attack that applies deep learning to non-profiled attacks (non-profiled DL-SCAs) and showed its effectiveness against AES software implementations protected with the random-jitter countermeasure and the masking countermeasure used in ASCAD. However, the successful attack on the masking countermeasure was demonstrated on only one partial key and on a limited set of points of interest (PoI) selected after analyzing the leakage points of the mask and the masked intermediate values [5, 10]. Other related studies on non-profiled DL-SCAs include a method using side-channel information from 1D traces transformed into a 2D image format [11] and an attack against a hiding countermeasure based on correlated noise generation [12].

Table 1 Differences in contributions between the previous study [10] and this study

1.2 Contents of this paper

In this study, we attacked ASCAD and our experimental AES-128 software implemented on an Atmel Xmega128 connected to a power-measurement field-programmable gate array using NewAE's ChipWhisperer environment [13]. We carried out non-profiled DL-SCAs against two types of masking countermeasures: table re-computation masking (used in ASCAD) and RSM, which had not been successfully attacked by non-profiled DL-SCAs so far. The contributions of this paper are as follows:

(1) We provide guidelines on how to select the DL models and the attack points on the waveforms (i.e., PoI) to be used in non-profiled DL-SCAs. We examined several MLP structures and selected the one appropriate for non-profiled DL-SCAs. Although the previous study [10] carried out such attacks on a limited PoI identified after analyzing the side-channel leaks, we show that DL-SCAs also succeed on a wide range of sampling points without identifying a limited PoI through prior analysis.

(2) The previous study [10] used the least significant bit (LSB) or most significant bit (MSB) of the S-Box output value as the training label. We explore which bit is better to use on two datasets.

(3) We discuss why the attack performance depends on the bit selected as the label when attacking AES protected with the RSM countermeasure. We then present a hint on choosing fixed mask values for the RSM countermeasure that are effective against non-profiled DL-SCAs.

(4) The over-fitting problem has to be avoided for successful DL-SCAs. We introduce regularization as a countermeasure against over-fitting and successfully reveal all partial keys against the two masking countermeasures, i.e., table re-computation masking (ASCAD) and RSM.

(5) We carry out successful non-profiled DL-SCAs against software-implemented AES protected with the RSM countermeasure. It should be noted that the mask values used in RSM are not used in the DL-SCAs. While all 16 partial keys are revealed with 50,000 traces by the DL-SCAs, no partial keys are revealed by a conventional 1st-order CPA.

(6) In DL-SCAs, gradient visualization has been used for analyzing the SCA leakage points learned by a model [14]. SHAP, an explainable-AI technique, was proposed to explain the outputs of machine-learning models [15]. We apply this idea to identify the leakage points of the mask values by observing the change in attack performance when part of the input waveform is removed from the selected PoI.

Table 1 summarizes the differences between the previous study [10] and this study. The rest of the paper is organized as follows. We introduce deep learning in Sect. 2 and the experimental environment and non-profiled DL-SCAs in Sect. 3. In Sect. 4, we present guidelines for non-profiled DL-SCAs, and in Sect. 5, we present the results of non-profiled DL-SCAs on AES protected with the RSM countermeasure. In Sect. 6, we discuss the side-channel leakages learned by the network model used in the non-profiled DL-SCAs. In Sect. 7, we discuss why the attack performance changes depending on the bit selected as the label when attacking AES protected with the RSM countermeasure.

2 Deep learning techniques

2.1 Structure of neural network

A neural network consists of several layers of neurons, where a neuron is a mathematical model of a neuron in the human brain. A neural network consists of an input layer, an output layer, and several intermediate layers placed between them; this structure is called an MLP. Neural networks for classification tasks are trained on a training dataset \(D =\{(x_{n}, y_{n})\}^{N}_{n = 1}\), which is a set of N pairs of inputs x and correct labels y. The network learns to minimize the error between the correct label y and the predicted value \({\hat{y}}\) and thereby obtains the ability to discriminate.

2.2 Over-fitting and validation accuracy

Over-fitting refers to a situation in which the network adapts too closely to the training dataset and fails to classify unknown data that differ from the training data. During training, we observe the classification accuracy and loss on the validation dataset (data unknown to the network) to monitor over-fitting. Under over-fitting, the classification accuracy on the validation dataset (hereafter, validation accuracy) becomes lower than that on the training dataset (hereafter, training accuracy). Similarly, while over-fitting progresses, the classification loss on the training dataset decreases, but the loss on the validation dataset increases. In the following experiments, we used validation accuracy to evaluate the trained model.
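The following sketch illustrates how training and validation accuracy can be tracked to detect over-fitting. It is only an illustration: the TensorFlow/Keras framework, the toy data, and the layer sizes are our assumptions, not the exact setup described later in this paper.

```python
# Minimal sketch of monitoring training vs. validation accuracy with Keras
# (framework and toy data are assumptions, not the paper's exact setup).
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 700).astype("float32")                 # placeholder "traces"
y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")   # placeholder binary labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation="relu", input_shape=(700,)),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.BinaryAccuracy(name="accuracy")])

# Hold out 20% of the data as the validation dataset.
hist = model.fit(x, y, validation_split=0.2, epochs=10, batch_size=100, verbose=0)

# A growing gap (training accuracy up, validation accuracy down) signals over-fitting.
for epoch, (acc, val_acc) in enumerate(zip(hist.history["accuracy"],
                                           hist.history["val_accuracy"])):
    print(f"epoch {epoch}: train_acc={acc:.3f}  val_acc={val_acc:.3f}")
```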

2.3 Regularization of weights in DNNs models

Regularization is a method for preventing over-fitting and should improve the generalization performance of network models. The experiments described in Sects. 4 and 5 show that regularization gives a significant advantage in the performance of non-profiled DL-SCAs compared with results without it.

L1 regularization, L2 regularization, and L1/L2 regularization, which applies both L1 and L2 regularization, are mainly used in deep learning. When regularization is not used, the network learns to minimize the loss function L(x). If L1 regularization is used, the weight parameters \(w_{i}\) of the DNN are determined to minimize \(L^{'}(x)\), which adds the L1 regularization term as a penalty to the loss function L(x), as shown in Eq. 1.

$$\begin{aligned} L^{'}(x) = L(x) + \lambda _{1}\sum _{i=0}^{n} \mid w_{i}\mid , \end{aligned}$$
(1)

where \(\lambda _{1}\) is the L1 regularization parameter that adjusts the strength of the regularization.

Similarly, if L2 regularization is used, \(w_{i}\) is determined to minimize \(L^{'}(x)\), which includes the L2 regularization term as a penalty in addition to loss function L(x), as shown in Eq. 2.

$$\begin{aligned} L^{'}(x) = L(x) + \lambda _{2}\sum _{i=0}^{n} \mid w_{i} \mid ^{2}, \end{aligned}$$
(2)

where \(\lambda _{2}\) is the L2 regularization parameter that adjusts the strength of the regularization.

If L1/L2 regularization is used, \(w_{i}\) is determined to minimize the \(L^{'}(x)\), which includes the L1/L2 regularization term as a penalty in addition to loss function L(x), as shown in Eq. 3.

$$\begin{aligned} L^{'}(x) = L(x) + \lambda _{1}\sum _{i=0}^{n} \mid w_{i}\mid + \lambda _{2}\sum _{i=0}^{n} \mid w_{i}\mid ^{2}, \end{aligned}$$
(3)

where \(\lambda _{1}\) and \(\lambda _{2}\) are the L1 and L2 regularization parameter respectively.

In each case, the magnitude of the weights \(w_{i}\) is penalized, which prevents the weights from becoming too large and helps avoid over-fitting.
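As a small numerical illustration of Eqs. 1-3 (our own sketch; the weight and loss values are arbitrary), the penalty terms can be computed from the weights as follows.

```python
# Numerical illustration of the penalty terms in Eqs. 1-3 (values are arbitrary).
import numpy as np

w = np.array([0.5, -1.2, 0.3, 2.0])    # example weight parameters w_i
lam1, lam2 = 1e-5, 1e-5                # regularization parameters lambda_1, lambda_2
base_loss = 0.42                       # L(x): some base loss value

l1_term = lam1 * np.sum(np.abs(w))     # L1 penalty (Eq. 1)
l2_term = lam2 * np.sum(w ** 2)        # L2 penalty (Eq. 2)

loss_l1 = base_loss + l1_term                  # L'(x) with L1 regularization
loss_l2 = base_loss + l2_term                  # L'(x) with L2 regularization
loss_l1_l2 = base_loss + l1_term + l2_term     # L'(x) with L1/L2 regularization (Eq. 3)
print(loss_l1, loss_l2, loss_l1_l2)
```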

3 Experiments on non-profiled DL-SCAs

3.1 Non-profiled DL-SCA procedure

A non-profiled DL-SCA is an attack that applies deep learning to non-profiled attacks that use side-channel information obtained only from the target device [10].

First, the attacker obtains N power-consumption waveforms \(T (t_{n} \in T, 1 \le n \le N)\) from cryptographic operations with a fixed key \(k_{secret}\). In the case of byte-by-byte attacks against AES, the attacker also prepares network models \(M_{k} (0 \le k \le 2^{8}-1)\) for each key candidate k. Each model is trained to output a specific intermediate value (label) \(h_{k,n}\) computed from the key candidate k and the plaintext \(p_{n} (1 \le n \le N)\), where \(p_{n}\) is the n-th plaintext of the plaintext dataset. The key \(k^{*}\) corresponding to the model \(M_{k^{*}}\) with the highest accuracy after training is identified as the correct key. The procedure of the non-profiled DL-SCA is shown in Algorithm 1. In all experiments, the optimizer was Adam, the learning rate was 0.001, the batch size was 1,000, the number of epochs was 100, and the loss function was the mean squared error. These hyperparameters are the same as those used in the previous study [10].

Algorithm 1 Non-profiled DL-SCA procedure
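The per-key-candidate training loop can be sketched as follows. This is our own reconstruction of Algorithm 1 under the stated hyperparameters; the TensorFlow/Keras framework, the activation functions, the placeholder data, and the reduced candidate range are assumptions made to keep the example short and self-contained.

```python
# Sketch of the non-profiled DL-SCA loop (our reconstruction of Algorithm 1).
# Assumptions: TensorFlow 2.x; `traces` and `plaintexts` are placeholders for the
# attacker's measured data; only a few key candidates are trained here.
import numpy as np
import tensorflow as tf

def aes_sbox():
    """Build the standard AES S-box from the GF(2^8) inverse and affine transform."""
    def gf_mul(a, b):
        p = 0
        for _ in range(8):
            if b & 1:
                p ^= a
            carry = a & 0x80
            a = (a << 1) & 0xFF
            if carry:
                a ^= 0x1B
            b >>= 1
        return p
    inv = [0] * 256
    for x in range(1, 256):
        y = 1
        while gf_mul(x, y) != 1:
            y += 1
        inv[x] = y
    rotl = lambda v, r: ((v << r) | (v >> (8 - r))) & 0xFF
    return np.array([inv[x] ^ rotl(inv[x], 1) ^ rotl(inv[x], 2)
                     ^ rotl(inv[x], 3) ^ rotl(inv[x], 4) ^ 0x63
                     for x in range(256)], dtype=np.uint8)

SBOX = aes_sbox()

def lsb_labels(plaintexts, k):
    """Label h_{k,n}: LSB of S-Box(p XOR k) under key candidate k."""
    return (SBOX[plaintexts ^ k] & 1).astype("float32")

def build_mlp(n_samples):
    """Reference MLP from [10]: two intermediate layers with 20 and 10 nodes."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(20, activation="relu", input_shape=(n_samples,)),
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse",
                  metrics=[tf.keras.metrics.BinaryAccuracy(name="accuracy")])
    return model

# Placeholder data: in a real attack these are measured traces and known plaintexts.
N, S = 3000, 700
traces = np.random.rand(N, S).astype("float32")
plaintexts = np.random.randint(0, 256, size=N)

best_acc = {}
for k in range(4):                      # in practice: range(256)
    model = build_mlp(S)
    hist = model.fit(traces, lsb_labels(plaintexts, k),
                     validation_split=0.2, epochs=100, batch_size=1000, verbose=0)
    best_acc[k] = max(hist.history["val_accuracy"])

k_guess = max(best_acc, key=best_acc.get)   # candidate with highest validation accuracy
print("estimated key byte:", hex(k_guess))
```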

3.2 Waveform datasets used in the experiments

We carried out non-profiled DL-SCAs against AES-128 software implementation protected with the table re-computation masking and RSM countermeasures. The two datasets used in the experiments are as follows.

(1) ASCAD (countermeasure: table re-computation masking) [9]

ASCAD is a dataset that provides electromagnetic (EM) emission waveforms measured during the operation of an AES software implementation protected with the table re-computation masking countermeasure on the 8-bit AVR ATMega8515. The mask values for the 0th and 1st bytes are set to 0, so these two bytes are effectively unprotected. The EM waveforms were measured using a digital oscilloscope with a sampling rate of 2 GS/s. ASCAD discloses the waveform segment covering the 2nd-byte SubBytes processing in the first round as the PoI for the 2nd-byte attack (see the loading sketch after Fig. 1).

(2) In-house dataset measured by ChipWhisperer (countermeasure: RSM)

The DPA Contest V4.2 dataset is available as a public dataset for an AES software implementation protected with the RSM countermeasure [16]. However, this dataset is not suitable for non-profiled DL-SCA evaluation because the key is changed every 5,000 waveforms. Therefore, we prepared an in-house dataset using the Atmel 8-bit microcontroller Xmega128, with source code modified from that available on the DPA Contest website. The power consumption during AES processing was measured using an A/D converter (sampling rate: 29 MS/s) implemented in ChipWhisperer-Lite [13]. The experimental condition for waveform acquisition is shown in Fig. 1 (see the capture sketch after Fig. 1).

Fig. 1 Experimental condition for acquiring power consumption waveform on AES processing
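For reference, the two datasets can be handled with short scripts like the following sketches. The first loads ASCAD traces and plaintexts; it assumes the published ASCAD.h5 HDF5 layout (groups such as Attack_traces holding traces and metadata), which should be checked against the version actually downloaded.

```python
# Hedged sketch of loading ASCAD traces with h5py.
# Assumption: the standard ASCAD.h5 layout (groups 'Profiling_traces' and
# 'Attack_traces', each holding 'traces', 'labels' and 'metadata').
import h5py
import numpy as np

with h5py.File("ASCAD.h5", "r") as f:
    traces = np.array(f["Attack_traces/traces"])     # EM traces (PoI window)
    metadata = f["Attack_traces/metadata"][()]       # structured array
    plaintexts = np.array(metadata["plaintext"])     # 16-byte plaintexts
    # For a non-profiled attack only traces and plaintexts are used;
    # the keys and masks stored in the metadata are ignored.
print(traces.shape, plaintexts.shape)
```

The second is a fixed-key capture loop of the kind used for the in-house dataset. It assumes the ChipWhisperer 5.x Python API and a SimpleSerial-style AES target; setup calls and names may differ for other versions or firmware, so treat it as a rough outline rather than the exact script we used.

```python
# Hedged sketch of a fixed-key capture loop with ChipWhisperer-Lite (CW 5.x API assumed).
import numpy as np
import chipwhisperer as cw

scope = cw.scope()
target = cw.target(scope)          # SimpleSerial target on the Xmega128
scope.default_setup()

key = bytearray(16)                # fixed secret key (placeholder value)
traces, plaintexts = [], []
for _ in range(50000):
    pt = bytearray(np.random.randint(0, 256, 16, dtype=np.uint8).tobytes())
    trace = cw.capture_trace(scope, target, pt, key)   # returns None on timeout
    if trace is None:
        continue
    traces.append(trace.wave)
    plaintexts.append(np.frombuffer(bytes(pt), dtype=np.uint8))

np.save("traces.npy", np.array(traces))
np.save("plaintexts.npy", np.array(plaintexts))
```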

4 Guidelines for non-profiled DL-SCA

In this section, we present the results of attacks on the 2nd byte of ASCAD to provide guidelines on how to select DNN models and attack points on waveforms for non-profiled DL-SCAs. We discuss the attack performance with various numbers of layers and nodes in the network model (MLP), and whether it is necessary to select a small PoI after analyzing the leakage points of intermediate values.

We also discuss how to improve the attack with regularization and show that the attack performance varies depending on the labeling method.

4.1 Table re-computation masking countermeasure

The table re-computation masking countermeasure, applied in ASCAD, prevents side-channel leakage by using an S-Box table computed with two random numbers. The algorithm of AES protected with the table re-computation masking countermeasure is shown in Algorithm 2 in the Appendix. This masked S-Box is used in the SubBytes layer because simple additive masking cannot be applied to a nonlinear transformation [17, 18]. Figure 21 in the Appendix shows an overview of the first round of AES protected with the table re-computation masking countermeasure. The masked S-Box is pre-computed using the 1-byte random mask values \(m_{in}\) and \(m_{out}\), as shown in the figure. The input plaintext is also masked by a 16-byte random value \(m_{1}\). Since the values of all nodes in Fig. 21 are masked by random masks that are unknown to the attacker, leakages of \(p\oplus k_{1}\) and S-Box(\(p\oplus k_{1}\)) do not appear in the side-channel information and cannot be attacked by a conventional 1st-order CPA. The masks \(m_{1}\), \(m_{in}\), and \(m_{out}\) are changed when the next plaintext is encrypted. Non-profiled DL-SCAs are performed using the method described in Sect. 3.1 under the assumption that the attacker does not know that a masking countermeasure is implemented in the target. Therefore, we train the DNNs \(M_{k}(0\le k\le 255)\) using the LSB of the output S-Box(\(p\oplus k\)) in the first round, which is computed only from the key candidate k and the plaintext \(p_{n} (1 \le n \le N)\). By monitoring validation accuracy during training, we estimate the key candidate \(k^*\) corresponding to \(M_{k^{*}}\) with the highest accuracy as the correct key.
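To make the masking scheme concrete, the following sketch shows how a masked S-Box table can be re-computed for fresh mask values \(m_{in}\) and \(m_{out}\). This is our own illustration (not the ASCAD firmware), and it reuses the aes_sbox helper from the sketch in Sect. 3.1; the mask, plaintext, and key values are arbitrary examples.

```python
# Illustration of table re-computation masking (not the actual ASCAD firmware).
# Reuses aes_sbox() from the sketch in Sect. 3.1.
import numpy as np

SBOX = aes_sbox()

def recompute_masked_sbox(m_in, m_out):
    """S-Box'(x) = S-Box(x ^ m_in) ^ m_out, rebuilt for every encryption."""
    x = np.arange(256, dtype=np.uint8)
    return SBOX[x ^ m_in] ^ m_out

# One masked SubBytes step for a single state byte:
m_in, m_out = 0x3C, 0xA7                    # fresh random masks (example values)
masked_sbox = recompute_masked_sbox(m_in, m_out)
p, k = 0x12, 0x34                            # plaintext and key byte (example values)
masked_out = masked_sbox[(p ^ k) ^ m_in]     # equals S-Box(p ^ k) ^ m_out
assert masked_out ^ m_out == SBOX[p ^ k]     # unmasking recovers the true S-Box output
```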

4.2 Layers and nodes in network model

We used the MLP with two intermediate layers (20 nodes in the first layer and 10 nodes in the second layer) from the previous work [10] as the reference network model. We also observed the change in attack performance when the number of intermediate layers and the number of nodes were varied to identify the appropriate network for our study.

We first verified how the attack performance changes with the number of intermediate layers of the MLP. The number of intermediate layers with 20 nodes was changed from 2 to 8, and attacks were carried out against the 2nd byte of ASCAD. We used 30,000 waveforms (training waveforms: 24,000, validation waveforms: 6,000). As an evaluation metric, we observed the difference in validation accuracy between the correct key and the incorrect key with the highest accuracy. Figure 2a shows this difference in accuracy when the number of intermediate layers was changed. Increasing the number of layers did not improve the accuracy of the correct label compared with the two-layer result.

We then verified how the attack performance changes with the number of nodes in the intermediate layers of the MLP, with the number of layers fixed to 2. Using the same evaluation method as in Fig. 2a, Fig. 2b shows the difference in accuracy of the correct label when the number of nodes was changed. As in the case of changing the number of layers, no significant difference was observed when the number of nodes increased.

We also evaluated the network model with convolutional layers used in the previous study [10], but its attack performance was not better than that of the MLPs; its results are shown in Fig. 2 as black lines.

In the previous study [10], it was stated that the use of more complex networks may improve the performance of non-profiled DL-SCAs, but according to our experimental results, the complexity of the network did not make a significant difference. Therefore, also considering that a more complex network takes longer to train, we used the same network as the previous study [10], with two intermediate layers (20 nodes in the first layer and 10 nodes in the second layer), in the subsequent experiments.

Fig. 2 Dependence of the validation accuracy on the number of nodes and intermediate layers in the attack against the 2nd byte of ASCAD

4.3 How to choose PoI

In this section, we discuss how the attack performance changes depending on the choice of PoI. In ASCAD, the 2nd-byte SubBytes processing point in the first round, which is provided as the PoI, was identified using the mask values \(m_{1}\), \(m_{in}\), and \(m_{out}\) extracted from the side-channel information. Assuming a realistic attack scenario in which the attacker has no prior knowledge of the target, it is reasonable to select the region in which all SubBytes of the first round of AES are processed from the raw waveform of ASCAD, as shown in Fig. 3. We examined whether the attack performance changes when the attack point is the 2nd-byte processing point specified in ASCAD (Regular ASCAD PoI in Fig. 3, sample points: 700) or the entire SubBytes processing region (Full SubBytes PoI in Fig. 3, sample points: 34,000). When the Full SubBytes processing area is selected as the attack point, the number of input samples is reduced to 1,700 by averaging every 20 points. We used 30,000 waveforms (training waveforms: 24,000, validation waveforms: 6,000) to carry out the attack. Figure 4a shows the validation accuracy of each estimated key when the attack point was the Regular ASCAD PoI, and Fig. 4b shows the validation accuracy when the attack point was the Full SubBytes PoI. The attack succeeded even with the wider attack point.
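The reduction of the Full SubBytes PoI from 34,000 to 1,700 samples by averaging every 20 points can be done as in the following sketch (our illustration; the trace array is a small placeholder).

```python
# Downsample the Full SubBytes PoI by averaging every 20 consecutive samples
# (34,000 -> 1,700 points); the trace array here is a small placeholder.
import numpy as np

traces = np.random.rand(1000, 34000).astype("float32")   # 30,000 traces in the real attack
window = 20
reduced = traces.reshape(traces.shape[0], -1, window).mean(axis=2)
print(reduced.shape)   # (1000, 1700)
```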

The larger the number of attack points, the more epochs are required to estimate the correct key, but the final accuracy is slightly higher with a larger number of attack points. In the attack on ASCAD, described in detail in Sect. 6, the network learned the leakage of \(m_{out}\). This mask value is used in the SubBytes processing of all bytes, so the Full SubBytes PoI provides more hints (leaks of \(m_{out}\)) to the model. Therefore, the final accuracy is likely to be higher when the Full SubBytes PoI is used as the attack point.

Fig. 3 The raw waveform of ASCAD and the attack points used in the attack (Regular ASCAD and Full SubBytes in the first round)

Fig. 4 Results of the attack on the 2nd byte when using the Regular ASCAD PoI and the Full SubBytes PoI (validation accuracy per estimated key)

4.4 Effect of regularization

In Sect. 4.3, we showed that an attack against the 2nd byte of ASCAD is possible even if the attack point is selected widely to include the SubBytes processing of all bytes (Full SubBytes PoI) in the first round. Since this PoI covers the SubBytes processing of all bytes, we used the same 30,000 waveforms to attack all bytes. We observed the validation accuracy and found that we could recover 15 partial keys, but we could not obtain the 3rd partial key.

The fact that the validation accuracy of the model trained on the correct-key dataset did not increase suggests that the trained network model lacks generalization performance. To increase validation accuracy (i.e., to improve generalization performance), we applied regularization of the weight parameters of the DNNs, as described in Sect. 2.3. We tried several regularization settings and adopted the L1/L2 regularization shown in Eq. 3, which achieved the best result. Regularization parameters (\(\lambda _{1}\): 0.00001, \(\lambda _{2}\): 0.00001) were set for the 1st dense layer and (\(\lambda _{1}\): 0.0008, \(\lambda _{2}\): 0.0008) for the 2nd dense layer, and the attack was conducted again.
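In Keras terms, these settings correspond roughly to the following model definition (our sketch; the framework and activation functions are assumptions, while the \(\lambda\) values are those stated above).

```python
# MLP with the L1/L2 regularization parameters used in this attack
# (framework and activations are our assumptions; lambda values as stated above).
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_regularized_mlp(n_samples):
    return tf.keras.Sequential([
        layers.Dense(20, activation="relu", input_shape=(n_samples,),
                     kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-5)),
        layers.Dense(10, activation="relu",
                     kernel_regularizer=regularizers.l1_l2(l1=8e-4, l2=8e-4)),
        layers.Dense(1, activation="sigmoid"),
    ])

model = build_regularized_mlp(1700)          # 1,700 samples after averaging
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse",
              metrics=[tf.keras.metrics.BinaryAccuracy(name="accuracy")])
model.summary()
```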

Figure 5 shows the validation accuracy for the 3rd-byte attack without (a) and with (b) regularization. Comparing (a) with (b), the accuracy of the model trained with regularization is clearly higher. Figure 6 shows the results of the non-profiled DL-SCAs with regularization that we carried out against all bytes of ASCAD. These results indicate the effectiveness of regularization.

Fig. 5 Validation accuracy of each estimated key for attack on 3rd byte of ASCAD

Fig. 6 The result of non-profiled DL-SCAs against all bytes of ASCAD. The vertical axis shows the validation accuracy, and the horizontal axis shows the number of epochs

4.5 Comparison in labeling

The previous work [10] on non-profiled DL-SCAs only evaluated attacks with the LSB or MSB of the S-Box output value as the label. We evaluated attacks with every bit of the S-Box output value as the label. The bits are denoted \(b_0\) to \(b_7\) in the binary representation of the S-Box output value, as shown in Eq. 4.

$$\begin{aligned} \text {S-Box}\_\text {out} = (b_{7}b_{6}b_{5}b_{4}\quad b_{3}b_{2}b_{1}b_{0})_{2}, \end{aligned}$$
(4)

where \(b_{7}\) represents MSB and \(b_{0}\) represents LSB.
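Extracting an arbitrary bit \(b_i\) of the S-Box output as the training label is straightforward, as in the following sketch (our illustration, reusing the aes_sbox helper from the sketch in Sect. 3.1).

```python
# Label = bit b_i of S-Box(p XOR k); reuses aes_sbox() from the sketch in Sect. 3.1.
import numpy as np

SBOX = aes_sbox()

def bit_labels(plaintexts, k, i):
    """Return bit b_i (i = 0 for LSB, 7 for MSB) of S-Box(p XOR k) as labels."""
    return ((SBOX[plaintexts ^ k] >> i) & 1).astype("float32")

p = np.random.randint(0, 256, size=5)
print(bit_labels(p, k=0x2B, i=0), bit_labels(p, k=0x2B, i=7))   # LSB and MSB labels
```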

We evaluated attack performance with each bit as the training label. As in Sect. 4.4, we carried out attacks against all bytes using 30,000 waveforms, Full SubBytes PoI, and regularization.

Figure 7 shows the number of recovered partial keys per epoch when each bit was used as the training label. The metric used for the attack was validation accuracy. When the LSB (\(b_{0}\)) was selected, all bytes were successfully recovered, as mentioned in Sect. 4.4. In contrast, only the two unprotected bytes and a few masked bytes were revealed when \(b_{7}\) (MSB) to \(b_{2}\) was selected. It is notable that 15 of the 16 partial keys were successfully recovered when \(b_1\) was selected.

These results indicate that LSB (\(b_{0}\)) achieves the best attack performance. Similar experiments will be carried out for the RSM countermeasure in Sect. 5.4, but the appropriate bit for attacks may vary depending on the implementation of countermeasures.

Fig. 7 Comparison of attack results for each bit selected as a label (successfully recovered bytes per epoch)

4.6 Comparison with 1st-order CPA

A CPA is a typical and powerful non-profiled attack. The results of the 1st-order CPA with 30,000 waveforms, the same number of waveforms as in the DL-SCAs we carried out, are shown in Fig. 8. The CPA was carried out by applying the Hamming-weight (HW) model to the S-Box output value S-Box(\(p\oplus k_1\)) in the first round of AES. Figure 6 in Sect. 4.4 shows the results of the non-profiled DL-SCAs with regularization that we carried out against all bytes of ASCAD.
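For reference, a 1st-order CPA with the HW model can be implemented as in the following sketch (our illustration, reusing the aes_sbox helper from Sect. 3.1; the trace data are placeholders).

```python
# 1st-order CPA with the Hamming-weight model on S-Box(p XOR k) (first round).
# Reuses aes_sbox() from the sketch in Sect. 3.1; traces/plaintexts are placeholders.
import numpy as np

SBOX = aes_sbox()
HW = np.array([bin(v).count("1") for v in range(256)])

def cpa_byte(traces, plaintexts):
    """Return the key candidate with the highest absolute correlation peak."""
    n, s = traces.shape
    tc = traces - traces.mean(axis=0)                    # centered traces
    peaks = np.zeros(256)
    for k in range(256):
        h = HW[SBOX[plaintexts ^ k]].astype("float64")   # hypothetical leakage
        hc = h - h.mean()
        corr = (hc @ tc) / (np.sqrt((hc ** 2).sum()) * np.sqrt((tc ** 2).sum(axis=0)))
        peaks[k] = np.max(np.abs(corr))
    return int(np.argmax(peaks)), peaks

traces = np.random.rand(5000, 700)                       # 30,000 traces in the real attack
plaintexts = np.random.randint(0, 256, size=5000)
best_k, peaks = cpa_byte(traces, plaintexts)
print("best key candidate:", hex(best_k))
```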

Both the CPA and the DL-SCAs succeeded on the 0th and 1st bytes, on which the masking countermeasure was disabled. For the 5th byte, the correlation value of the correct key was the highest, but all other bytes showed no correlation peak. Therefore, the CPA against the masking countermeasure almost completely failed. The non-profiled DL-SCAs with regularization succeeded on the 2nd and subsequent bytes, on which the masking countermeasure was effective, even without the knowledge that a masking countermeasure was applied.

Figure 9 shows the number of partial keys successfully recovered as the number of waveforms was increased up to 30,000. The CPA was not successful except for the two unmasked bytes and the 5th byte. On the other hand, non-profiled DL-SCAs without regularization successfully attacked 15 bytes with 30,000 waveforms, and all 16 bytes with 25,000 waveforms when regularization was applied. These results mean that non-profiled DL-SCAs have higher attack performance than the typical (1st-order) CPA. Against the masking countermeasure described in this section, a conventional attack is still possible if the internal mask value can be estimated using a higher-order CPA. However, the fact that an attacker who does not know the internal processing of the countermeasure can still obtain all partial keys using non-profiled DL-SCAs suggests that such an attack increases the threat of side-channel attacks.

Fig. 8 The result of CPA against all bytes of ASCAD. The vertical axis shows the correlation, and the horizontal axis shows the sampling point

Fig. 9 Comparison of attack results between non-profiled DL-SCAs with and without regularization (successfully recovered bytes per epoch)

5 Non-profiled DL-SCAs against AES-128 RSM

5.1 Implementation of RSM countermeasure

The RSM (Rotating Sboxes Masking) countermeasure uses 16 types of \(\{S\text {-}Box^{'}_{i}\}_{0\le i\le 15}\) pre-computed with 16-byte fixed mask values m[i]\(({0\le i\le 15})\) as the S-Box used in SubBytes processing. The algorithm of AES protected with the RSM countermeasure is also shown in Algorithm 3 in Appendix. Before and after the S-Box transformation, masking is executed on the intermediate values with m[i] to prevent side-channel leakage [19]. The 16 types of S-Boxes used in encryption are calculated by using Eq. 5.

$$\begin{aligned} S\text {-}Box^{'}_{i}(x) = S\text {-}Box(x\oplus m[i])\oplus m[(i+1)\bmod 16]. \end{aligned}$$
(5)

Figure 22 in appendix shows an overview of the first round of AES protected with the RSM countermeasure. The offset that appears in the algorithm and figure is a parameter that specifies which of the 16 S-Boxes and mask values should be used in the first round of processing for each byte. Thus, the possible values of offset are from 0 to 15. The S-Box used in the SubBytes layer is rotated in each round. The fixed mask value m[0:15] = [0x03, 0x0c, 0x35, 0x3a, 0x50, 0x5f, 0x66, 0x69, 0x96, 0x99, 0xa0, 0xaf, 0xc5, 0xca, 0xf3, 0xfc] used in our experiment is the same as that used in DPA Contest V4.2 [16].
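The 16 masked S-Boxes of Eq. 5 can be pre-computed as in the following sketch (our own illustration, not the DPA Contest source code; it uses the mask values quoted above and reuses the aes_sbox helper from Sect. 3.1).

```python
# Pre-compute the 16 RSM masked S-boxes of Eq. 5 (illustrative sketch).
# Reuses aes_sbox() from the sketch in Sect. 3.1; mask values as in DPA Contest V4.2.
import numpy as np

SBOX = aes_sbox()
M = np.array([0x03, 0x0c, 0x35, 0x3a, 0x50, 0x5f, 0x66, 0x69,
              0x96, 0x99, 0xa0, 0xaf, 0xc5, 0xca, 0xf3, 0xfc], dtype=np.uint8)

x = np.arange(256, dtype=np.uint8)
masked_sboxes = np.stack([SBOX[x ^ M[i]] ^ M[(i + 1) % 16] for i in range(16)])

# Sanity check of Eq. 5 for one input byte and one table index:
i, v = 3, 0xAB
assert masked_sboxes[i][v ^ M[i]] == SBOX[v] ^ M[(i + 1) % 16]
```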

5.2 PoI determination on power traces

We experimented with the AES software implementation protected with the RSM countermeasure published by DPA Contest V4.2, excluding the shuffling countermeasure for byte processing.

We collected 50,000 power-consumption waveforms (training waveforms: 40,000, validation waveforms: 10,000) by using the experimental environment described in Sect. 3.2. The power-consumption waveform is shown in Fig. 10.

When carrying out the non-profiled DL-SCAs, we assumed that the attacker has no knowledge of the AES implementation techniques in the target device. Therefore, the fixed 16-byte mask value of RSM is not used in the attack because the attacker does not know that the RSM countermeasure is in place. We also assumed that the attacker can only roughly select the power-consumption waveform, so we used the entire region including the pre-processing (mask processing) and the SubBytes, ShiftRows, and MixColumns processing in the first round as the PoI (sample points of the clipped waveform: 3,850). The attacker might not include the mask-processing part in the PoI; we discuss such a case in Sect. 6.2.

Fig. 10 Power-consumption waveforms we collected and PoI

5.3 Results of non-profiled DL-SCAs against AES with RSM countermeasure

As with the experiments discussed in Sect. 4, we carried out non-profiled DL-SCAs using the method discussed in Sect. 3.1. We trained the MLPs using the LSB of S-Box(\(p\oplus k_1\)) in the first round, which is computed only from the key candidate k and the plaintext \(p_{n} (1 \le n \le N)\). The key candidate with the highest accuracy was estimated as the correct key. To show the effect of regularization on the RSM countermeasure, we conducted attacks both with and without regularization (the regularization parameters were the same as those discussed in Sect. 4.4). Figure 11 shows the validation accuracy for each estimated key with and without regularization when attacking the 1st byte. The accuracy of the model that learned the correct-key dataset with regularization was much higher than that without regularization. Figure 12 shows the relationship between the number of revealed partial keys and the number of training epochs. When regularization was not applied, only 10 out of 16 bytes were successfully attacked. The results of the non-profiled DL-SCAs with regularization against all bytes of the RSM countermeasure are shown in Fig. 13. By applying regularization and observing validation accuracy, we found that all the partial keys were successfully revealed. These results suggest that attacks against the RSM countermeasure (as well as ASCAD) are possible even if the attacker does not know that the countermeasure is implemented in the target or the internal processing of the mask values.

Fig. 11 Validation accuracy of each estimated key for RSM 1st byte attack

Fig. 12 Comparison of attack results between non-profiled DL-SCAs with and without regularization (successfully recovered bytes per epoch)

Fig. 13 The result of non-profiled DL-SCAs against all bytes of RSM. The vertical axis shows the validation accuracy, and the horizontal axis shows the number of epochs

5.4 Comparison in labeling

As in Sect. 4.5, we evaluated whether the attack performance against the RSM countermeasure depends on the bit selected as the label.

Figure 14 shows the number of recovered partial keys per epoch when each bit was used as the training label. When the MSB (\(b_{7}\)) or LSB (\(b_{0}\)) was selected, all bytes were successfully recovered. In contrast, not all partial keys were recovered when \(b_{6}\) to \(b_{2}\) was selected. It is notable that all partial keys were successfully attacked with fewer epochs than with the LSB when \(b_{3}\) was selected.

Figure 15 shows the number of partial keys successfully recovered with MSB labeling (training with \(b_{7}\) as the label) and LSB labeling (training with \(b_{0}\) as the label) as the number of waveforms was increased up to 50,000. Comparing with Fig. 14, the labels that required fewer epochs to recover all partial keys also required fewer waveforms. LSB labeling achieved superior attack performance on the ASCAD dataset, but MSB labeling achieved the best attack performance in the attack against the RSM countermeasure. This result and the results of the attacks against ASCAD with each bit mentioned in Sect. 4.5 indicate that the appropriate bit for the training label may vary depending on the implementation of the countermeasure algorithm. In Sect. 7, we discuss why the attack performance depends on the bit position used as the label when attacking AES protected with the RSM countermeasure.

Note that we explored the best bit for the label, but this is not a realistic setup in a non-profiled scenario. Therefore, it is worth adopting multi-label classification [20], which refers to all of the bits as labels, to achieve stable performance without bit exploration.

Fig. 14 Comparison of attack results for each bit selected as a label (successfully recovered bytes per epoch)

Fig. 15 Comparison of attack results for non-profiled DL-SCAs with LSB labeling, MSB labeling, and CPA

5.5 Comparison with 1st-order CPA

Using the same HW model as in Sect. 4.6, the results of the CPA with 50,000 waveforms, the same number of waveforms as in the non-profiled DL-SCAs we carried out, are shown in Fig. 16. Figure 13 in Sect. 5.3 shows the results of the non-profiled DL-SCAs with regularization that we carried out against the RSM countermeasure.

The CPA results contain peaks, but the correlation for an incorrect key has the highest value, and the attack failed for all bytes. The results of the non-profiled DL-SCAs are shown in Fig. 13; the attacks against all partial keys were successful. This suggests that an attack against the RSM countermeasure (as well as ASCAD) is possible even if the attacker does not know that the countermeasure is implemented in the target or the internal processing of the mask values.

Fig. 16 The result of CPA against all bytes of RSM. The vertical axis shows the correlation, and the horizontal axis shows the sampling point

6 Discussion of the model’s learning points

6.1 Evaluation with the gradient visualization

To determine whether the network learns the leaks, we utilize the gradient visualization (GV) technique [14]. We calculate the GV score \(\nabla t_n\) as:

$$\begin{aligned} \nabla t_{n} = \frac{\partial }{\partial t_{n}}\textrm{MSE}(M_{k_{secret}}(t_{n}), \textrm{LSB}(\textrm{SBox}(p_{n} \oplus k_{secret}))), \end{aligned}$$
(6)

where the function \(\textrm{MSE}(\cdot )\) calculates the mean squared error, \(\textrm{LSB}(\cdot )\) extracts the LSB, \(\textrm{SBox}(\cdot )\) computes the S-Box output value, \(M_{k_{secret}}\) is the DNN model trained with the correct key, \(t_n\) is the n-th waveform in the dataset T, and \(p_n\) is the plaintext corresponding to \(t_n\).

In the previous study [10], Eq. 7 is mainly used to determine which part of the waveform the network learns from.

$$\begin{aligned} S_{j} = \sum \limits _{n=1}^N (\nabla t_{n,j} \times t_{n,j}), \end{aligned}$$
(7)

where \(\nabla t_{n,j}\) denotes the j-th sample of \(\nabla t_{n}\), \(t_{n,j}\) denotes the j-th sample of the n-th waveform, and N denotes the number of waveforms used.
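A GV computation along the lines of Eqs. 6 and 7 can be written with automatic differentiation, as in the following sketch (our illustration, assuming a TensorFlow/Keras model such as the one in the sketch in Sect. 3.1; the trained model, traces, and labels are inputs supplied by the caller).

```python
# Gradient visualization (Eq. 6): gradient of the MSE between model output and
# label with respect to the input trace. Assumes a trained tf.keras model.
import numpy as np
import tensorflow as tf

def gradient_visualization(model, traces, labels):
    """Return d MSE(model(t), label) / d t for each trace t (shape: N x samples)."""
    t = tf.convert_to_tensor(traces, dtype=tf.float32)
    y = tf.convert_to_tensor(labels.reshape(-1, 1), dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(t)
        pred = model(t)
        loss = tf.reduce_mean(tf.square(pred - y), axis=1)   # per-trace MSE
    # Summing the per-trace losses keeps the per-trace gradients independent.
    return tape.gradient(tf.reduce_sum(loss), t).numpy()

def gv_score(grads, traces):
    """Eq. 7: aggregate over traces, weighting each gradient by the trace itself."""
    return np.sum(grads * traces, axis=0)
```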

We apply Eq. 6 to the model trained in the attack against ASCAD with regularization described in Sect. 4.4. To analyze what the network has learned from the gradient visualization, it is necessary to understand which operations are executed in which parts of the waveform. We therefore computed the correlation between the mask values and the power-consumption waveform, and the correlation between the masked intermediate values and the power-consumption waveform. The upper waveform of Fig. 17 shows the correlations between the power consumption and the LSBs of the following values:

  • Blue) 2nd-byte mask value: \(m_{1}\)

  • Orange) 2nd-byte S-Box output masked by \(m_{1}\): S-Box(\(p \oplus k_1\)) \(\oplus m_{1}\)

  • Green) Mask value: \(m_{out}\)

  • Black) 2nd-byte S-Box output masked by \(m_{out}\): S-Box(\(p\oplus k_1\)) \(\oplus m_{out}\)

It can be concluded that the operations using these values are carried out where the correlation peaks appear. The lower waveform of Fig. 17 shows the gradient visualization \(\nabla t_{n}\) at each sample point when the validation data waveform is input to the network trained with the training dataset of the correct key for the 2nd-byte attack. Since the peak locations of \(\nabla t_{n}\) overlap the peak locations of the four correlations of \(m_{1}\), \(m_{out}\), S-Box(\(p\oplus k_1\)) \( \oplus m_{1}\) and S-Box(\(p \oplus k_1\)) \(\oplus m_{out}\), it seems that the network learns the leakage of the two sets of mask and masked S-Box output values. Figure 23 in appendix shows an overview of the processing of the attack points during an attack on the table re-computation masking countermeasure in Sect. 4.4. From Figs. 17 and 23, we can consider that by XORing \(m_{out}\) with S-Box(\(p\oplus k_1\)) \(\oplus m_{out}\), which is the output of the pre-computed S-Box’, the LSB of the correct label S-Box(\(p\oplus k_1\)) is derived. Similarly, learning S-Box(\(p\oplus k_1\)) \(\oplus m_{1}\) and \(m_{1}\) can also deliver the correct label LSB(S-Box(\(p\oplus k_1\))).

We conducted the same analysis for the attack against AES protected with the RSM countermeasure. The upper part of Fig. 18 shows the correlations between the power-consumption and the LSB of the following values:

  • Blue) 1st-byte mask value m[offset[1]+1]

  • Orange) 1st-byte masked S-Box output: S-Box(\(p\oplus k_1\)) \(\oplus \) m[offset[1]+1]

The lower part of Fig. 18 shows \(\nabla t_{n}\) when the validation data are input to the network trained with the training dataset of the correct key for the 1st-byte attack carried out in Sect. 5.3. From Fig. 18, it appears that the model learns the leakages of S-Box(\(p\oplus k_1\))\(\oplus \)m[offset[1]+1] and m[offset[1]+1]. Figure 24 in the Appendix shows an overview of the processing at the attack points during an attack on the RSM countermeasure in Sect. 5.3. From Fig. 24, there is only one masked S-Box output value, S-Box(\(p\oplus k_1\))\(\oplus \)m[offset[1]+1], that appears during the cryptographic processing of the first round.

From the above results, in non-profiled DL-SCAs against both masking countermeasures, the network model learns the mask value m and the masked S-Box output value S-Box(\(p\oplus k_1\))\(\oplus m\) from the power-consumption waveform. The network model unmasks the learned masked S-Box output value using the mask value and estimates the LSB of the unmasked S-Box output value, LSB(S-Box(\(p\oplus k_1\))), which is set as the label.

Fig. 17 Correlation of mask and masked S-Box output (top) and \(\nabla t_{n}\) (bottom) (using 2nd byte of ASCAD)

Fig. 18 Correlation of mask and masked S-Box output (top) and \(\nabla t_{n}\) (bottom) (using 1st byte of RSM countermeasure dataset)

6.2 Evaluation with partial removal of power traces

Research is progressing on approaches, such as SHAP, that explain the outputs of machine-learning models [15]. As a simple approach in this direction, we observe the change in attack performance depending on whether the leakage of the mask is included in or excluded from the PoI, and thereby verify whether the model in fact learns the leakage points of the mask in non-profiled DL-SCAs. We removed the samples before the 1,100th point, corresponding to the "mask processing" (as shown in Fig. 24 in the Appendix), from the attack points discussed in Sect. 5.2. The attack points were therefore only the SubBytes, ShiftRows, and MixColumns processing points. We applied regularization, as in Sect. 5.3, and conducted the attack. Figure 19 shows the number of successfully recovered partial keys per epoch obtained by observing validation accuracy.

In contrast to the results in Sect. 5.3, where the mask-processing points were included, only 3 out of 16 partial keys were successfully recovered at epoch 100 when validation accuracy was observed. Figure 20 also shows \(\nabla t_{n}\) derived in the same manner as in Sect. 6.1. In contrast to Fig. 18, when the mask-processing area was excluded from the PoI, no significant peak appeared in \(\nabla t_{n}\), indicating that the model cannot effectively learn the side-channel leakages.

These results confirm that the model learns the leakage of the mask and that the leakage points of the mask need to be included in the PoI in DL-SCAs. The RSM countermeasure does not use the mask values in the SubBytes processing itself. Therefore, a wide range of attack points including the mask processing should be chosen; as discussed in Sect. 4.3, non-profiled DL-SCAs remain possible even with a wide range of attack points.

Fig. 19 Results of non-profiled DL-SCAs against AES software implementation protected with RSM countermeasure when the waveform on mask processing is partially removed

Fig. 20 Correlation of mask and masked S-Box output (top) and \(\nabla t_{n}\) (bottom) on RSM (0-1,100 points are removed)

Table 2 The fixed mask values used in the RSM countermeasure

7 Discussion about change of attack performance by the selected bit

In this section, we discuss why the attack performance changes depending on the bit selected as the label in attacks against AES protected with the RSM countermeasure, as shown in Fig. 14.

The leakage of the masked S-Box output S-Box(\(p\oplus k_1\)) \(\oplus \) \(m[offset+1]\) appears in the waveforms of AES protected with the RSM countermeasure. However, the label value used for the non-profiled DL-SCA is one bit of S-Box(\(p\oplus k_1\)), so the network model needs to obtain information about \(m[offset+1]\) from the waveform for a successful attack. From the experimental results in Sect. 6.2, attacks against the RSM countermeasure do not succeed if the waveform region that computes \(p[i]\oplus k[i]\oplus m[offset]\) (corresponding to 'mask processing' in Fig. 24 in the Appendix) is removed from the PoI. From this experiment, we assume that the network model estimates \(m[offset+1]\) from the m[offset] leaks in the 'mask processing' region by exploiting the correlation between m[offset] and \(m[offset+1]\).

Table 2 shows the fixed mask values used in the RSM countermeasure. The bit transition ratio in the last line of Table 2 indicates how often each bit changes between successive mask values. We hypothesized that bits that do not change between m[offset] and \(m[offset+1]\), or bits that always flip, are highly correlated, and thus the network model can predict \(m[offset+1]\) from m[offset]. Comparing the bit transition ratios in Table 2 with the bit-position dependence of the attacks in Fig. 14, the bit transition ratios of \(b_0\) (LSB), \(b_3\), and \(b_7\) (MSB), with which all partial keys could be attacked, are 14/16, 16/16, and 2/16, respectively. On the other hand, the bit transition ratios of \(b_2\) and \(b_5\), which were difficult to attack even when the number of epochs was increased, are 10/16 and 8/16. We found that it is difficult to attack all partial keys if a bit with a low correlation between m[offset] and \(m[offset+1]\) is chosen as the label. Therefore, the hypothesis that the network model predicts \(m[offset+1]\) from m[offset] is consistent with the experimental results.
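The bit transition ratios of Table 2 can be reproduced directly from the fixed mask values given in Sect. 5.1, as in the following sketch (our illustration), which yields the ratios quoted above (e.g., 14/16 for \(b_0\), 16/16 for \(b_3\), and 2/16 for \(b_7\)).

```python
# Reproduce the bit transition ratios of Table 2 from the RSM fixed mask values.
M = [0x03, 0x0c, 0x35, 0x3a, 0x50, 0x5f, 0x66, 0x69,
     0x96, 0x99, 0xa0, 0xaf, 0xc5, 0xca, 0xf3, 0xfc]

for bit in range(8):
    # Count how often bit `bit` differs between m[i] and m[(i+1) mod 16].
    transitions = sum(((M[i] ^ M[(i + 1) % 16]) >> bit) & 1 for i in range(16))
    print(f"b{bit}: {transitions}/16")
```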

To summarize this discussion, the successful bit selections shown in Fig. 14 can be explained by the mask values used in the RSM countermeasure (Table 2). In future work, a method to automatically search for the optimal bit position without knowledge of the mask values used in the RSM countermeasure needs to be developed.

8 Conclusion

We reported the results of non-profiled DL-SCAs against AES software implementations protected with two types of masking countermeasures: the table re-computation masking countermeasure and the RSM countermeasure. From the results of the attacks against these two masking countermeasures, we clarify practical guidelines for non-profiled DL-SCAs and the analysis of leakage points as follows.

(1) An MLP network model with two intermediate layers of 20 and 10 nodes is sufficient for non-profiled DL-SCAs. According to the experimental results on ASCAD, there is no need to increase the number of layers and nodes.

(2) The range of waveform samples to be learned, i.e., the PoI, can be specified as a large range that includes the S-Box processing of all bytes; it does not need to be narrowed to byte-by-byte computational regions.

(3) Applying L1/L2 regularization to the weight parameters of the DNNs during non-profiled DL-SCAs improves attack performance and allowed us to recover all partial keys for both masking countermeasures.

(4) The best labeling method for non-profiled DL-SCAs depends on the implementation of the countermeasure algorithm. In attacks against the RSM countermeasure, the attack performance varies with the bit selected as the label because of the fixed mask values used.

(5) By analyzing the information learned by the network model using gradient visualization, we found that the network model learns the mask value m and the masked S-Box output value S-Box(\(p\oplus k_1\))\(\oplus m\) from the side-channel information. The DNNs infer the correct label S-Box(\(p\oplus k_1\)) from these values, so the accuracy for the correct key increases.

(6) As a new way to evaluate what the model has learned, we showed that the performance of non-profiled DL-SCAs worsens when the processing that uses the mask value m in the RSM implementation is removed from the PoI. This confirms that the mask value m is being estimated from the waveform.

Non-profiled DL-SCAs against AES software implementation protected with a masking countermeasure do not require any special attacking method (e.g., mask-value estimation before attacking, as used in the conventional 2nd-order CPA) to recover all partial keys. Attackers do not need to refine the PoI of the acquired waveform. Therefore, we can conclude that the threat of a realistic side-channel attack has increased since such an attack is successful without knowledge of the side-channel countermeasure implementation used by the target. In the future, non-profiled DL-SCAs will become more common for the evaluation of implemented side-channel attack countermeasures. It will also be necessary to study implementation schemes that are more resistant to DL-SCAs.