A side-channel attack on a masked and shuffled software implementation of Saber

In this paper, we show that a software implementation of IND-CCA-secure Saber key encapsulation mechanism protected by first-order masking and shuffling can be broken by deep learning-based power analysis. Using an ensemble of deep neural networks trained at the profiling stage, we can recover the session key and the secret key from 257 × N and 24 × 257 × N traces, respectively, where N is the number of repetitions of the same measurement. The value of N depends on the implementation of the algorithm, the type of device under attack, environmental factors, acquisition noise, etc.; in our experiments N = 10 is sufficient for a successful attack. The neural networks are trained on a combination of 80% of traces from the profiling device with a known shuffling order and 20% of traces from the device under attack captured for all-0 and all-1 messages. “Spicing” the training set with traces from the device under attack helps us minimize the negative effect of inter-device variability.


Introduction
Public-key cryptographic schemes used today depend on the intractability of certain mathematical problems such as integer factorization or the discrete logarithm. However, if large-scale quantum computers become a reality, it will be possible to solve these problems in polynomial time using Shor's algorithm [51]. Even though it will take many years to construct a large-scale quantum computer, the need for long-term security makes it urgent to investigate new solutions.
To address this need, the National Institute of Standards and Technology (NIST) started in 2016 a process for standardization of post-quantum cryptographic primitives, NIST PQC. Candidate primitives rely on problems that are not tographic algorithms [31]. The general protection method against timing attacks is to make implementations such that all instructions are executed in constant time, a standard assumption for software implementations. The timing channel can be extended to consider cache-timing attacks [11], where time variation due to memory management in the executing device is considered. A typical example of an exploit is the use of look-up tables.
Even with constant-time implementations and avoiding implementation weaknesses such as the use of look-up tables, a software implementation is still vulnerable to attacks if the power consumption or electromagnetic (EM) emissions from the CPU can be measured [1,30]. In such cases, more advanced countermeasures are required. The main tools are techniques such as masking [14], shuffling [56], insertion of random delays through dummy operations [16], constant-weight encoding [33] and code polymorphism [7].
Differential side-channel analysis pioneered by Kocher et al. [30] was the first breakthrough in the area. The second major advance was the introduction of deep learning-based side-channel analysis. Apart from improving the differential attacks' effectiveness (e.g., four instead of 400 power measurements are needed to extract a key from a USIM [10]), the latter enabled non-differential message/key recovery attacks on NIST PQC candidates [40,52,54], as well as attacks on true random number generators [39] and Physical Unclonable Functions [63]. Deep learning-based side-channel attacks can overcome traditional countermeasures, including Boolean masking [40], jitter [12] and code polymorphism [34].
Our contributions: In this paper, an extension of ASHES'21 [41], we present the first side-channel attack on a masked and shuffled implementation of CCA-secure Saber KEM. Until now, these countermeasures combined together were believed to provide adequate protection against power and EM analysis. Additionally, in this extension, we delve deeper into the workings of neural network models and pinpoint the specific assembly instructions that leak information, thereby identifying potential areas for future protection.
We show how to recover the session key and the long-term secret key by deep learning-based power analysis from 257 × N and 24 × 257 × N traces, respectively, captured during the execution of the decapsulation algorithm, where N is the number of repetitions of the same measurement. The value of N depends on the implementation, environmental factors, acquisition noise, etc.; in our experiments, N = 10 is enough for a successful attack without any enumeration.
Similarly to the attack on a first-order masked Saber [40], our deep neural networks learn a higher-order model directly, without explicitly extracting random masks at each execution. However, since we attack an implementation in which the message bits are shuffled, it is not possible to directly recover the message from a single trace, as in [40]. Only the message Hamming weight (HW) can be derived. To find the order of message bits, traces for 256 additional decapsulations have to be captured and analyzed for each chosen ciphertext (hence ×257).
We quantify the success rate of the message HW recovery as a function of the success rate of a single message bit recovery and show that the latter should be of the order of 0.999 to recover the message HW with a high probability. To increase the success rate of a single message bit recovery, we introduce a novel approach for training neural networks which uses a combination of traces from the profiling device with a known shuffling order and traces from the device under attack captured for all-0 and all-1 messages. We also use an ensemble of models to increase the success rate of message recovery from the derived HWs.
The remainder of this paper is organized as follows. Section 3 gives the necessary background on the Saber algorithm and profiled side-channel attacks. Section 4 describes the implementation of masked and shuffled Saber KEM which is used in our experiments. Section 5 presents equipment for trace acquisition. Section 6 shows how points of interest are located in side-channel measurements. Sections 7 and 8 describe the profiling and the attack stages, respectively. Section 9 summarizes the experimental results. Section 10 concludes the paper and describes future work.

Previous work
In this section, we describe previous work on implementations and attacks of the NIST PQC lattice-based candidates.

Implementations
The first side-channel protected implementation of a lattice-based cryptosystem was presented in [49] followed by [48], based on masking. Masking involves doing linear operations twice, whereas nonlinear operations need more complex solutions decreasing the speed substantially. The implementation approach in [49] increases the number of CPU cycles on an ARM Cortex-M4 by a factor of more than 5 compared to a standard implementation, see [6, p. 2].
These protected implementations focus on Chosen-Plaintext Attack (CPA)-secure lattice schemes, but more relevant are secure primitives designed to withstand Chosen-Ciphertext Attacks (CCA). CCA-secure primitives are usually obtained from a CPA-secure primitive using a transform, such as the Fujisaki-Okamoto (FO) transform or some variation of it [28]. The CCA-transform is itself susceptible to side-channel attacks and should be protected [47]. Examples of recent masked implementations are: [42] of a KEM similar to NewHope; and [5,22,35] being lattice-based signature schemes.
At the time the work presented in this paper was carried out, only one of the round 3 finalists of the NIST PQC, Saber, had a protected software implementation available [6]. The implementation utilizes a first-order masking of the Saber CCA-secure decapsulation algorithm with an overhead factor of only 2.5 compared to an unmasked implementation. This side-channel secure version can be built with relatively simple building blocks compared to other candidates, resulting in a small overhead. The masked implementation of Saber is based on masked logical shifting on arithmetic shares and a masked binomial sampler. The work includes experimental validation of the implementation on the Cortex-M4 general-purpose processor.

Attacks
Early side-channel attacks on NIST PQC project candidates targeted unprotected implementations. In [52] message recovery attacks on the unprotected encapsulation part of round 3 candidates CRYSTALS-Kyber and Saber and round 3 alternate candidate FrodoKEM using a single power trace were described. In [47], near-field EM side-channel assisted chosen ciphertext attacks applicable to six round 2 candidates were presented. In [62], unprotected Kyber was attacked as a case study using near-field EM side-channels. A way of turning a message recovery attack to a secret key recovery attack was proposed using, e.g., 184 traces for 98% success rate. In [55] another power/EM-based secret key recovery attack on some round 3 candidate KEMs based on the FO transform and its variants was presented. In [25], similar ideas were used for timing attacks. The resistance of an unprotected Saber to amplitude-modulated EM emanations was investigated in [60] and [59].
In [45], the authors improve the key recovery attacks on unprotected implementations of three NIST PQC finalists, including Saber. They also discuss how to attack masked implementations by attacking shares individually. However, no actual attack on masked Saber was carried out. The first attack on a first-order masked implementation of the IND-CCA-secure Saber KEM was demonstrated in [40]. The attack recovers both the session key and the secret key using a deep neural network trained at the profiling stage. The chosen ciphertext-based secret key recovery attack requires 24 traces. The ciphertexts are constructed using a novel error-correction code-based method which allows for correcting single-bit errors and detecting double-bit errors in the recovered messages. This waives the requirement for a perfect message recovery, making the attack more realistic. An attack applying the method of [40] to a first-order masked implementation of Kyber was presented in [58], targeting the message encoding vulnerability found in [52].
More recently, in [8], side-channel attacks on two implementations of masked polynomial comparison were demonstrated on the example of Kyber. Polynomial multiplication is also a target of the attacks on unprotected implementations of all lattice-based NIST PQC finalists presented in [37], where Correlation Power Analysis is used.

Background
This section describes Saber and profiled side-channel attacks. A more detailed description of Saber can be found in [18].

Saber design description
Saber is a package of cryptographic algorithms whose security relies on the hardness of the Module Learning With Rounding problem (Mod-LWR) [18]. It contains a CPA-secure public key encryption scheme, Saber.PKE, and a CCA-secure key encapsulation mechanism, Saber.KEM, based on a post-quantum version of the Fujisaki-Okamoto transform [21].
Pseudo-codes of Saber.PKE and Saber.KEM are shown in Figs. 1 and 2, respectively. We follow the notation of [40].
Fig. 1 Description of Saber.PKE from [18]
Fig. 2 Description of Saber.KEM from [18]
Let $\mathbb{Z}_q$ be the ring of integers modulo $q$ and $R_q$ be the quotient ring $\mathbb{Z}_q[X]/(X^n + 1)$. The rank of the module is denoted by $l$. The rounding modulus is denoted by $p$.
The notation $x \leftarrow \chi(S)$ denotes sampling $x$ according to a distribution $\chi$ over a set $S$. The uniform distribution is denoted by $\mathcal{U}$. The centered binomial distribution with parameter $\mu$ is denoted by $\beta_\mu$, where $\mu$ is an even positive integer. The term $\beta_\mu(R_q^{l \times k}; r)$ generates a matrix in $R_q^{l \times k}$ where the coefficients of polynomials in $R_q$ are sampled in a deterministic manner from $\beta_\mu$ using seed $r$. The functions F, G and H are SHA3-256, SHA3-512 and SHA3-256 hash functions, respectively. The function gen is an extendable output function which is used to generate a pseudorandom matrix $A \in R_q^{l \times l}$ from $\mathit{seed}_A$. It is instantiated with SHAKE-128.
The bitwise right shift operation is denoted by "$\gg$". It is extendable to polynomials and matrices by performing the shift coefficient-wise. To allow for an efficient implementation, Saber design uses power-of-two moduli $q$, $p$ and $T$, namely $q = 2^{\epsilon_q}$, $p = 2^{\epsilon_p}$ and $T = 2^{\epsilon_T}$. In order to implement rounding operations by a simple bit shift, three constants are used: polynomials $h_1 \in R_q$ and $h_2 \in R_q$ with all coefficients being $2^{\epsilon_q - \epsilon_p - 1}$ and $2^{\epsilon_p - 2} - 2^{\epsilon_p - \epsilon_T - 1} + 2^{\epsilon_q - \epsilon_p - 1}$, respectively, and a constant vector $h \in R_q^{l \times 1}$ in which each polynomial is equal to $h_1$.
In the round 3 Saber document [18], three sets of parameters are proposed for the security levels of NIST-I, NIST-III and NIST-V: LightSaber, Saber and FireSaber, respectively (see Table 1). All results presented in this paper are for Saber, but it is trivial to extend them to the other versions. Saber uses $n = 256$, $l = 3$, $q = 2^{13}$, $p = 2^{10}$, $T = 2^{4}$ and $\mu = 8$. Its decryption failure probability is bounded by $2^{-136}$.
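To illustrate why adding a constant before a bit shift implements rounding, the following minimal sketch uses the Saber moduli q = 2^13 and p = 2^10, for which the rounding constant is 2^(13-10-1) = 4. The function names `round_q_to_p` and `round_reference` are ours and do not come from the Saber reference code; the sketch covers a single coefficient only.

```python
from fractions import Fraction
from math import floor

EPS_Q, EPS_P = 13, 10             # q = 2**13, p = 2**10 (Saber parameter set)
H1 = 1 << (EPS_Q - EPS_P - 1)     # rounding constant 2**(eps_q - eps_p - 1) = 4

def round_q_to_p(x: int) -> int:
    """Scale a coefficient from Z_q to Z_p with rounding, using add-then-shift."""
    q_mask = (1 << EPS_Q) - 1     # reduction mod q is a bit mask for power-of-two q
    return ((x + H1) & q_mask) >> (EPS_Q - EPS_P)

def round_reference(x: int) -> int:
    """Same value via exact round-half-up of x * p / q, reduced mod p."""
    scaled = Fraction(x, 1 << (EPS_Q - EPS_P))
    return floor(scaled + Fraction(1, 2)) % (1 << EPS_P)
```

Because both moduli are powers of two, the modular reduction is a mask and the division is a shift, which is what makes this step cheap in software.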

Profiled side-channel attacks
Side-channel attacks can be carried out in two settings: profiled and non-profiled. Profiled attacks first learn a leakage profile of the targeted cryptographic algorithm's implementation using a device similar to the device under attack, called the profiling device. The profiling can be done by creating a template [3,13,27], or training a neural network model [10,12,29,32]. Then, the resulting template/model is used to recover the secret variable, e.g., the key, from the device under attack [32]. Non-profiled attacks target the device under attack directly, without a profiling stage [53].
Profiled side-channel attacks typically assume that:
(1) The attacker has at least one profiling device similar to the device under attack which runs the same implementation.
(2) The attacker has full control over the profiling device.
(3) The attacker has direct physical access to the device under attack to measure side-channel signals for chosen inputs.

Implementation of masked and shuffled Saber KEM
All experiments presented in this paper are performed on a first-order masked and shuffled implementation of Saber which we created ourselves. To the best of our knowledge, no implementations of Saber protected by both masking and shuffling countermeasures are available at present. We used the first-order masked implementation of Saber presented in [6] as a base and added shuffling on top as described in Sect. 4.2.

Masking
Masking is a well-known countermeasure against power/EM analysis [14]. First-order masking protects against attacks leveraging information in the first-order statistical moment. A first-order masking partitions any sensitive variable $x$ into two shares, $x_1$ and $x_2$, such that $x = x_1 \bullet x_2$, and executes all operations separately on the shares. The operator "•" depends on the type of masking, e.g., it is "+" in arithmetic masking and "⊕" in Boolean masking.
Fig. 3 The masked implementation of Saber.PKE.Dec() from [6] (left) and the presented masked and shuffled implementation of Saber.PKE.Dec() (right)
Carrying out operations on the shares $x_1$ and $x_2$ prevents leakage of side-channel information related to $x$ as computations do not explicitly involve $x$. Instead, $x_1$ and $x_2$ are linked to the leakage. Since the shares are randomized at each execution of the algorithm, they are not expected to contain exploitable information about $x$. The randomization is usually done by assigning a random mask $r$ to one share and computing the other share as $x - r$ for arithmetic masking or $x \oplus r$ for Boolean masking.
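The two sharing rules can be sketched as follows. This is an illustration only, not the masked Saber code of [6]: the function names are ours, and the modulus is assumed to be the Saber value q = 2^13.

```python
import secrets

Q = 1 << 13  # assumed modulus, the Saber value q = 2**13

def arithmetic_mask(x: int, q: int = Q) -> tuple[int, int]:
    """Split x into shares (r, x - r mod q) using a fresh random mask r."""
    r = secrets.randbelow(q)
    return r, (x - r) % q

def arithmetic_unmask(x1: int, x2: int, q: int = Q) -> int:
    """Recombine arithmetic shares: x = x1 + x2 mod q."""
    return (x1 + x2) % q

def boolean_mask(x: int, bits: int = 13) -> tuple[int, int]:
    """Split x into shares (r, x XOR r) using a fresh random mask r."""
    r = secrets.randbits(bits)
    return r, x ^ r

def boolean_unmask(x1: int, x2: int) -> int:
    """Recombine Boolean shares: x = x1 XOR x2."""
    return x1 ^ x2
```

Each call draws a new mask, so the individual shares differ between executions even when x is fixed.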
A challenge in masking lattice-based cryptosystems is the integration of bitwise operations with arithmetic masking which requires methods for secure conversion between masked representations. Saber can be efficiently masked due to specific features of its design: power-of-two moduli $q$, $p$ and $T$, and limited noise sampling of LWR. Due to the former, modular reductions are basically free. The latter implies that only the secret key $s$ has to be sampled securely. In contrast, LWE-based schemes also need to securely sample two additional error vectors.
Masking duplicates most linear operations, but requires more complex routines for nonlinear operations. The first-order masked implementation of Saber presented in [6] uses a custom primitive for masked logical shifting on arithmetic shares, called poly_A2A(), and an adapted masked binomial sampler from [50]. Particular attention is devoted in [6] to the protection of the decapsulation algorithm since it involves operations with the long-term secret key $s$. At its first step (see Saber.KEM.Decaps() in Fig. 2), the decapsulation algorithm calls Saber.PKE.Dec() to decrypt the input ciphertext $c$. Figure 3 shows the implementation of Saber.PKE.Dec() from [6] called indcpa_kem_dec_masked().
To perform masked logical shifting, the authors of [6] recognize that, for power-of-two moduli, the conventional method of first performing an A2B conversion and then shifting the Boolean shares is wasteful. This is because the lower bits are first computed only to be immediately discarded by shifting them out. Their novel primitive, poly_A2A() (see Fig. 3), avoids computing the Boolean sharing of the lower bits completely, leading to reduced computational and memory overheads.

Shuffling
Shuffling is another well-known countermeasure against power/EM analysis [56]. We use the modern version of the Fisher-Yates (FY) algorithm [20] which generates a random permutation of a finite sequence. The generated sequence is used as the loop iterator to index the inner loop function's data processing. This effectively scrambles the order in which the elements of an array are processed as opposed to the linear sequence of a non-shuffled loop. Shuffling makes power analysis and neural network training significantly more difficult as it removes the linear correlation of the index sequence with time.
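A minimal sketch of the idea is given below. It is written in Python for readability and is not the C/assembly code running on the target; `fy_gen` is our stand-in for a Fisher-Yates index generator, and `process_shuffled` is a hypothetical helper showing how the permutation drives the loop.

```python
import secrets

def fy_gen(n: int) -> list[int]:
    """Return a uniformly random permutation of 0..n-1 (modern Fisher-Yates)."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = secrets.randbelow(i + 1)          # uniform index in 0..i
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def process_shuffled(data, func):
    """Apply func to every element, visiting the indices in a random order."""
    out = [None] * len(data)
    for idx in fy_gen(len(data)):             # shuffled loop iterator
        out[idx] = func(data[idx])
    return out
```

The result is independent of the visiting order; only the time at which each element is processed changes from one execution to the next.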
Figure 3 shows our masked and shuffled implementation of the decryption algorithm Saber.PKE.Dec(), called indcpa_kem_dec_masked_and_shuffled(). We implement bitwise shuffling of a 256-bit message in the primitive poly_A2A() by calling the FY_Gen() function to randomly permute a list of the same length (see poly_A2A_shuffled()). The shuffled values (in the range from 0 to 255) are then referenced at the start of every loop iteration, resulting in a randomized execution order.
Figure 4a and b compares the inner loops of the assembly code of the poly_A2A procedure before and after adding shuffling on top of masking. One can see that the inclusion of the FY_Gen() function has a minimal effect. It changes only the lines that reference the store/load offsets. Therefore, the side-channel leakage which is not related to FY index generation is expected to be similar in both implementations.
We also implement bytewise shuffling of a message in the procedure POL2MSG() (see POL2MSG_shuffled()) by calling FY_Gen() to randomly permute a list of length equal to the number of bytes, 32.

Known vulnerabilities
In previous work, a number of vulnerabilities were discovered in the non-masked LWE/LWR-based PKE/KEMs [2, 37, 44-47, 52, 55]. One is the Incremental-Storage vulnerability resulting from an incremental update of the decrypted message in memory during message decoding [45]. It was further observed in [45] that a non-masked implementation of the decoding function contains two points with exploitable Incremental-Storage vulnerability. The first one is where the message bits are computed and stored in a 16-bit memory location in an unpacked fashion. Since the memory location can take only two possible values, 0 or 1, an attacker can recover the message bit by distinguishing between 0 and 1. The second point is in the POL2MSG() procedure where the decoded message bits are packed into a byte array in memory. There have been many attacks put forth against the Fujisaki-Okamoto (FO) transform commonly found in lattice-based KEMs, such as [26,54] as well as [9] which targets a masked version of the FO's comparison operation through a collision attack.
In [40], it was demonstrated that, despite partitioning the message into two shares in a first-order masked implementation of Saber, the leakage point in the POL2MSG() procedure can still be exploited. In addition, a new leakage point in the poly_A2A() procedure was discovered (highlighted in red in Fig. 3). The attacks presented in this paper are based on the corresponding point in poly_A2A_shuffled() (highlighted in red in Fig. 3).
Fig. 5 Equipment for trace acquisition

Equipment for trace acquisition
The equipment we use for trace acquisition consists of the ChipWhisperer-Lite board, the CW308 UFO board and two CW308T-STM32F4 target boards (see Fig. 5).
The ChipWhisperer is a hardware security evaluation toolkit based on a low-cost open hardware platform and an open-source software [38]. It can be used to measure power consumption and to make communication between the target device and the computer easier. Power is measured over a shunt resistor connected between the power supply and the target device. ChipWhisperer-Lite employs a synchronous capture method, which greatly improves trace synchronization while also lowering the required sample rate and data storage.
The CW308 UFO board is a general-purpose platform for evaluating multiple targets [17]. The target board is plugged into a dedicated U connector.
The target board CW308T-STM32F4 contains a 32-bit ARM Cortex-M4 CPU with STM32F415-RGT6 device. The board operates at 24 MHz and it is sampled at 24 MHz, i.e., 1 point per clock cycle.
In our experiments, the Cortex-M4 CPU is programmed with the masked and shuffled Saber implementation described in the previous section. The implementation is compiled with arm-none-eabi-gcc at the highest level of compiler optimization -O3 (recommended default) which is typically the most difficult to break by side-channel analysis [52].

Locating points of interest
The attacks on unprotected implementations of LWE/LWR-based KEMs [47,52] typically locate leakage points in side-channel measurements using techniques such as Test Vector Leakage Assessment (TVLA) [24], or Correlation Power Analysis (CPA). However, such methods are not applicable to a protected implementation since masked implementations change random masks for each execution and shuffled implementations change the shuffling order for each execution.
In this section, we describe our method for locating points of interest in a masked and shuffled implementation of Saber. Figure 6a shows a power trace obtained by averaging 50K measurements made during the execution of Saber.KEM.Decaps() for random ciphertexts. We can clearly see different blocks with different structure. Our target is the poly_A2A() procedure which processes 256 message bits one-by-one. The segment of Fig. 6a marked by two red lines is a possible candidate. By zooming in, see Fig. 6b and c, one can verify that the number of repeating peaks is indeed 256.
By measuring the distance between the peaks, we can find that the processing of one bit by poly_A2A() takes 51 points. This parameter is referred to as bit_offset in the sequel. Since for poly_A2A() the shares A[i] and R[i] are processed immediately following each other (see line 4 of poly_A2A() in Fig. 3), bit_offset contains both shares.
By locating the first peak, we can find the starting point of the poly_A2A() procedure. This parameter is referred to as offset. Note that we need to know neither the value of a random mask nor the shuffling order to compute offset and bit_offset.
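The peak-based estimate of offset and bit_offset can be sketched as below. This is an assumption-laden illustration rather than the procedure used in the experiments: the helper names and the fixed amplitude threshold are ours, and the input is a synthetic, noise-free trace with 256 peaks spaced 51 samples apart.

```python
from statistics import median

def find_peaks(trace, threshold):
    """Indices of local maxima whose value exceeds the threshold."""
    return [i for i in range(1, len(trace) - 1)
            if trace[i] > threshold and trace[i - 1] < trace[i] >= trace[i + 1]]

def estimate_offsets(trace, threshold):
    """Return (offset, bit_offset): first peak position and median peak spacing."""
    peaks = find_peaks(trace, threshold)
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    return peaks[0], int(median(spacings))

# Synthetic averaged trace: 256 peaks, 51 samples apart, first peak at sample 100.
synthetic = [0.0] * (100 + 256 * 51)
for b in range(256):
    synthetic[100 + b * 51] = 1.0
```

On a measured averaged trace the threshold would have to be chosen by inspection, and the median spacing makes the estimate tolerant to a few spurious or missed peaks.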

Profiling stage
The aim of profiling is to construct a neural network model capable of distinguishing between the message bit values "0" and "1." At the attack stage, we use this model to count the number of "1"s in the message in order to determine its HW.
We use neural networks with a multilayer perceptron (MLP) architecture shown in Table 2. It is the same as the one in [40] except for the input size. This architecture was selected using the grid search algorithm [23] which trains a model for every joint specification of hyperparameter values in the Cartesian product of the set of values for each individual hyperparameter.
During training, we use the Nadam optimizer [19], which is an extension of RMSprop with Nesterov momentum, with a learning rate of 0.001 and a numerical stability constant epsilon=1e-08. Binary cross-entropy is used as a loss function. The training is run for a maximum of 100 epochs, with a batch size of 128 and early stopping. 70% of the training set is used for training, and 30% is used for validation.
Unlike [40] where eight models were trained, one for each bit position of a byte, we train a single model capable of recovering all message bits. This is accomplished by composing the training set as a union of trace intervals corresponding to individual bit processing. As a result, we get a universal model which has "learned" features for all 256 bits. Using a cut-and-join technique like this, we can increase the size of the training set by a factor of 256 without having to capture 256 times as many traces. For example, the 2M training set used in our experiments is composed from 7.8K captured traces. On an ARM Cortex-M4 running at 24 MHz, it takes less than 17 min to capture the latter and 3 days to capture the former.
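The cut-and-join step can be sketched as follows. The function name and argument layout are ours; the sketch assumes that the processing order of the bits is known (shuffling deactivated) or irrelevant (all-0 and all-1 messages), so that the b-th interval can be labeled with a known bit value.

```python
def cut_and_join(traces, messages, offset, bit_offset, in_size, n_bits=256):
    """Split each trace into per-bit intervals and label each with its message bit.

    traces   : list of sample sequences, one per decapsulation
    messages : list of bit lists, messages[t][b] in {0, 1}
    Returns (samples, labels), each of length len(traces) * n_bits.
    """
    samples, labels = [], []
    for trace, msg in zip(traces, messages):
        for b in range(n_bits):
            start = offset + b * bit_offset   # first sample of the b-th interval
            samples.append(trace[start:start + in_size])
            labels.append(msg[b])
    return samples, labels
```

With 256 intervals per trace, 7.8K captured traces yield 7,800 × 256 = 1,996,800 training examples, i.e., roughly 2M.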
The cut-and-join technique is applicable to the poly_A2A() leakage point because the poly_A2A() procedure processes all message bits in the same way during their storage in memory. Thus, traces representing the execution of poly_A2A() appear identical for all message bits except the first and last, as we can see from Fig. 7. Because of the Cortex-M4's three-stage pipeline, the next instruction begins before the previous instruction has finished. As a result, the power consumed during the processing of the first and the last bits differs from the power consumed during the processing of other bits.
Similarly to [40], we defeat masking by training models on traces containing the bits of both shares labeled by the value of the corresponding message bit. Thus, the models are capable of recovering the message bits directly, without explicitly extracting the mask. However, since the message bits are also shuffled in our case, we cannot train on traces captured from the device under attack for random messages, as in [40], because the order of bits (and thus training labels) is unknown. Instead, we train on a combination of traces from the profiling device running an implementation with deactivated shuffling, and traces from the device under attack captured for all-0 and all-1 messages. Obviously, the labels of all bits are the same for all-0 and all-1 messages. In the experimental results section, we show that such a combined strategy helps us minimize the negative effect of device variability on the model's classification accuracy. We also show that training on 100% of traces from the device under attack captured for all-0 and all-1 messages is not the best choice because traces of all-0 and all-1 messages do not allow the neural network to learn all possible features due to the above-mentioned impact of the previous and next instructions on power consumption.
Fig. 7 Average power traces representing the processing of message bits 0, 1 and 255 by poly_A2A() (for 10K measurements). Traces for the remaining bits look similar to the trace of bit 1
The effect of device variability on the shape of traces is illustrated in Fig. 8. Figure 8 shows two plots obtained by averaging 10K traces captured from the profiling device $D_P$ (blue) and the device under attack $D_A$ (orange) during the processing of the message by poly_A2A(). The blue plot is difficult to see because it is covered by the orange plot to a large extent. Figure 9 shows Welch's t-test [61] results for the same 10K trace sets from $D_P$ and $D_A$ computed as:
$$t = \frac{\mu_P - \mu_A}{\sqrt{\frac{\sigma_P^2}{n_P} + \frac{\sigma_A^2}{n_A}}},$$
where $\mu_P/\mu_A$, $\sigma_P/\sigma_A$ and $n_P/n_A$ are the mean, standard deviation and the size of the trace sets from $D_P/D_A$. We can see that there are noticeable differences between traces. The bottom peak corresponds to the point 51. Later, we show that side-channel data in the interval around this point is crucial for accurate class prediction.
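A point-wise Welch's t-statistic between two trace sets can be computed as in the sketch below. It follows the standard definition of the test with sample variances; the function names are ours, and the code assumes every time sample has non-zero variance in at least one of the two sets.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(samples_p, samples_a):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    n_p, n_a = len(samples_p), len(samples_a)
    var_p, var_a = variance(samples_p), variance(samples_a)   # sample variances
    return (mean(samples_p) - mean(samples_a)) / sqrt(var_p / n_p + var_a / n_a)

def welch_t_trace(traces_p, traces_a):
    """Apply welch_t independently at every time sample of two trace sets."""
    columns_p = zip(*traces_p)    # one tuple of measurements per time sample
    columns_a = zip(*traces_a)
    return [welch_t(cp, ca) for cp, ca in zip(columns_p, columns_a)]
```

Large absolute values of the statistic mark the time samples at which the two devices differ most.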
The pseudo-code of the profiling algorithm is shown in Fig. 10. TrainModel() takes as input the number of traces to be captured, $\tau$, the neural network's input size, in_size, and a parameter $k \in I$, $I = \{x \in \mathbb{R} \mid 0 \le x \le 1\}$, which defines which fraction of traces is captured from the profiling device, $D_P$. For example, $k = 0.8$ means that 80% of traces are from $D_P$. The rest of the traces are captured from the device under attack, $D_A$, for all-0 and all-1 messages in equal parts, $r = (1 - k)/2$.
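As a small worked example of this split, the sketch below turns a trace budget and the fraction k into three counts. The function name and the rounding rule are our assumptions; the value 10,000 used in the usage note is hypothetical.

```python
def split_trace_budget(tau: int, k: float):
    """Return (from_profiling_device, all_zero_from_target, all_one_from_target)."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    r = (1.0 - k) / 2.0              # fraction for each of the all-0 / all-1 sets
    n_each = round(tau * r)          # rounded to a whole number of traces
    return tau - 2 * n_each, n_each, n_each
```

For a budget of 10,000 traces and k = 0.8, this gives 8,000 traces from the profiling device and 1,000 each for all-0 and all-1 messages from the device under attack.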
At step 1, the ComposeTrainingSet() procedure is called to create a set of training traces, $T$, and the corresponding set of labels, $L$. In ComposeTrainingSet(), $k \times \tau$ messages are selected at random and encrypted by a fixed public key. The profiling device $D_P$, which is running an implementation with deactivated shuffling, is used to decapsulate the resulting set of ciphertexts. During its execution, the power traces are captured.
Similarly, $\tau \times r$ all-0 and $\tau \times r$ all-1 messages are generated and encrypted. The device under attack $D_A$ is used to decapsulate the resulting ciphertexts, and the power traces are captured (steps 4-8).
Next, the initial offset, offset, and the distance between the message bits in $T$, bit_offset, are determined as described in Sect. 6. Finally, the cut-and-join technique is used to divide $T$ into intervals representing individual message bit processing and to generate the set of labels $L$ containing the corresponding message bit values.

Attack stage
To defeat the combined masking and shuffling countermeasures, we make use of the existing key and message recovery techniques presented in [40] and [45] for masked-only and shuffled-only LWE/LWR-based KEMs, respectively, and introduce two new algorithms.
In this section, we outline the main steps of the proposed secret and session key recovery approaches, then describe the key and message recovery techniques from [40] and [45], and finally present the new algorithms.

Secret key recovery
The secret key is recovered as follows:
(1) Construct 24 chosen ciphertexts $c_1, \ldots, c_{24}$

Session key recovery
Assume that the adversary has a properly generated ciphertext $c$ which is decapsulated by the device under attack. The adversary follows the steps (2)-(5) of the secret key recovery algorithm described above to extract the message $m$ contained in $c$ from 257 × N power traces. Given $m$, he/she computes $(\hat{K}, r) = G(pkh, m)$ and gets the session key as $K = H(\hat{K}, c)$.

Chosen ciphertext construction
In [40], an approach based on error-correcting codes (ECC) was introduced to recover the secret key from masked Saber.We use the same chosen ciphertexts as in [40] for recovering the secret key from masked and shuffled Saber.
The ciphertexts are constructed as where the pairs $(k_0, k_1)$ are listed in Table 3. The approach in [40] works because decryption of $(c_m, b)$ yields the message

Bit-flip technique
In [45], a technique called bit-flip was introduced to recover the message m contained in ciphertext c which is decapsulated by the device under attack implementing a shuffled LWE/LWR-based KEM algorithm.We use a "fuzzy" version of this technique, presented in Sect.8.4, for recovering messages contained in 24 chosen ciphertexts which are decapsulated by the device under attack implementing masked and shuffled Saber.Given a ciphertext c = (c m , b ), the bit-flip technique [45] constructs 256 ciphertexts c j , j ∈ {0, . . ., 255}, in which the value of the center of the integer ring Z q is subtracted from the jth coefficient of c m .Since the message polynomial is only additively hidden within the ciphertext, this results in a ciphertext decrypting m j which is equal m = Saber.PKE.Dec(s, c) with jth bit flipped.
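For c and each c_j, a side-channel HW classifier is applied to find the HW of m and of each m_j, for j ∈ {0, …, 255}. In [45], the HW classifier is constructed by the template approach. From the obtained HWs, the message m is recovered bit-by-bit as follows. Flipping a "0" bit raises the HW by one and flipping a "1" bit lowers it by one, so

$$m[j] = \begin{cases} 0, & \text{if } HW(m_j) = HW(m) + 1, \\ 1, & \text{if } HW(m_j) = HW(m) - 1. \end{cases}$$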
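To illustrate why subtracting the center of the ring from one coefficient flips exactly one message bit, the sketch below uses a toy additive encoding. It is not Saber's decryption: the modulus `Q`, the noise values and the helpers `toy_encode`, `toy_decode` and `bit_flip` are assumptions made only for this illustration.

```python
Q = 8192  # toy modulus, assumed for illustration only


def toy_encode(bits, noise):
    """Hide each bit additively: 0 maps near 0, 1 maps near Q/2 (mod Q)."""
    return [(b * (Q // 2) + e) % Q for b, e in zip(bits, noise)]


def toy_decode(coeffs):
    """Decode a coefficient to 1 if it lies closer to Q/2 than to 0."""
    return [1 if Q // 4 <= v < 3 * Q // 4 else 0 for v in coeffs]


def bit_flip(coeffs, j):
    """Subtract the center of Z_Q from the j-th coefficient only."""
    flipped = list(coeffs)
    flipped[j] = (flipped[j] - Q // 2) % Q
    return flipped


message = [1, 0, 0, 1, 1, 0, 1, 0]
noise = [3, -2, 5, -7, 0, 1, -4, 6]
c = toy_encode(message, noise)

for j in range(len(message)):
    m_j = toy_decode(bit_flip(c, j))
    # Positions where the decoded message differs from the original one.
    changed = [i for i, (a, b) in enumerate(zip(message, m_j)) if a != b]
    print(j, changed)  # only position j changes
```

Because the shift by Q/2 moves a coefficient from the "0" region to the "1" region (or back) without touching the other coefficients, the HW of the decrypted message changes by exactly one.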

Message HW recovery algorithm
In this section, we present the algorithm RecoverHW(), which we use to recover the HW of the messages contained in the 24 chosen ciphertexts and their bit-flipped versions. Its pseudo-code is shown in Fig. 11. RecoverHW() takes as input the neural network trained at the profiling stage, NN, the neural network's input size, in_size, the initial offset, offset, the distance between the bits, bit_offset, the ciphertext c for which the message HW has to be recovered, and the degree of repetition of the same measurement, N.

Fig. 11 Message HW recovery algorithm
First, N trials are performed to recover the HW of the message m contained in c. In each trial i, the device under attack is used to decapsulate c, and a power trace T_i is captured during its execution (step 2). For each of the 256 bit positions b ∈ {0, 1, …, 255} (representing the message bits in an unknown shuffled order), the interval corresponding to the processing of bit b in T_i is located based on offset and bit_offset (steps 5-6). This interval is fed into the neural network NN trained during the profiling stage to determine whether the message bit in position b has a value of "0" or "1." If the resulting score s_b is greater than 0.5 (i.e., "1" has a higher probability), the HW is incremented. Otherwise, the HW is not changed.
The final HW is then determined by first removing the outliers and then computing the median of the remaining HWs (steps 13-14). An outlier is defined as a HW that differs from the median HW by more than 10%. We explored a variety of combining methods; the one we present consistently outperformed the others in our experiments.
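The description of RecoverHW() can be sketched as follows. This is a minimal illustration rather than the pseudo-code of Fig. 11: the callable `capture_trace` stands in for decapsulating c on the device and recording one trace, `nn` stands in for the trained network and returns a score in [0, 1], and the fallback for an empty set of inliers is an assumption.

```python
from statistics import median


def combine_hw(hws, tolerance=0.10):
    """Drop HWs deviating from the median by more than 10%, then take the median."""
    med = median(hws)
    kept = [h for h in hws if abs(h - med) <= tolerance * med]
    # Assumption: if every value is rejected, fall back to the plain median.
    return median(kept) if kept else med


def recover_hw(nn, in_size, offset, bit_offset, capture_trace, n_rep, n_bits=256):
    """Estimate the HW of the message from n_rep traces of the same ciphertext."""
    hws = []
    for _ in range(n_rep):
        trace = capture_trace()  # one decapsulation of the ciphertext
        hw = 0
        for b in range(n_bits):
            # Interval of the trace in which the b-th (shuffled) bit is processed.
            start = offset + b * bit_offset
            segment = trace[start:start + in_size]
            if nn(segment) > 0.5:  # "1" is more likely than "0"
                hw += 1
        hws.append(hw)
    return combine_hw(hws)


print(combine_hw([128, 128, 129, 127, 200, 128, 128, 60, 128, 129]))
```

Since only the count of "1" scores is used, the result does not depend on the order in which the shuffled bits are processed.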

Message recovery algorithm
In this section, we present a "fuzzy" version of the bit-flip technique, RecoverMessage().We construct 256 ciphertexts containing bit-flipped messages in the same way as in the original method [45].However, we take a different approach to deciding the final message bit values.We also quantify the success rate of message HW recovery as a function of the success rate of single-bit recovery.
The pseudo-code is shown in Fig. 12. RecoverMessage() takes as input the same parameters as the RecoverHW() algorithm. First, the HW of m contained in c is recovered by calling RecoverHW(). Then, the following loop is repeated 256 times: for each i ∈ {0, …, 255}, the ciphertext c_i is constructed using BitFlip(), i.e., the value of the center of Z_q is subtracted from the ith coefficient of c. The HW of m_i contained in c_i is recovered by calling RecoverHW(). If HW(m_i) > HW(m), the ith bit of m is assigned "0." If HW(m_i) < HW(m), the ith bit of m is assigned "1." Otherwise, the ith bit of m is assigned "2" to indicate that the bit is not recovered correctly. In the experiments, we call this case a detectable error.
Next, we quantify the probability of recovering the message HW as a function of the probability of recovering a single message bit. The property below assumes that the message is balanced, i.e., has an equal number of "1"s and "0"s.
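The decision rule of RecoverMessage() can be sketched as below. The callables `recover_hw_of` (an HW estimator for a given ciphertext) and `bit_flip` (the BitFlip() construction) are placeholders assumed for this illustration; the sketch is not the pseudo-code of Fig. 12.

```python
def recover_message(recover_hw_of, ciphertext, bit_flip, n_bits=256):
    """Return a list with 0 or 1 per recovered bit and 2 for a detectable error."""
    hw_m = recover_hw_of(ciphertext)
    bits = []
    for i in range(n_bits):
        hw_mi = recover_hw_of(bit_flip(ciphertext, i))
        if hw_mi > hw_m:
            bits.append(0)  # flipping raised the HW, so the bit was "0"
        elif hw_mi < hw_m:
            bits.append(1)  # flipping lowered the HW, so the bit was "1"
        else:
            bits.append(2)  # HW unchanged: impossible for a correct estimate
    return bits


# Toy usage: the "ciphertext" is the message itself and the HW is exact.
msg = [1, 0, 1, 1, 0, 0, 1, 0]
flip = lambda c, i: [b ^ (k == i) for k, b in enumerate(c)]
print(recover_message(sum, msg, flip, n_bits=len(msg)))
```

Unlike the exact rule of [45], the comparison only uses the sign of the HW difference, so an HW estimate that is off by a small amount can still yield the correct bit.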
Property 1 Let m be a balanced n-bit binary message. If p is the success rate of single-bit recovery and bit errors are mutually independent events, then the success rate of message HW recovery is given by:

$$\sum_{k=0}^{n/2} \binom{n/2}{k}^{2} p^{\,n-2k} (1-p)^{2k}. \qquad (3)$$

Proof The proof is based on the fact that, if, for any 0 ≤ k ≤ n/2, k message bits change as 0 → 1 and another k message bits change as 1 → 0, then the message HW does not change.
An n-bit balanced binary message has n/2 "0"s and n/2 "1"s. There are $\binom{n/2}{k}$ ways to select k elements from a set of size n/2. Thus, for a fixed k, the number of possible 2k-bit errors in which k bits flip in one direction and the other k bits flip in the other direction is $\binom{n/2}{k}^{2}$. Since the probability of a given 2k-bit error in an n-bit message is $p^{\,n-2k}(1-p)^{2k}$, summing over k gives (3).
Using Property 1, we can estimate the success rate of single-bit recovery required to recover the message HW. Table 4 lists some examples. According to the table, the success rate of single-bit recovery should be of the order of 0.999 to recover the message HW with a high probability.
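The expression in Property 1 is easy to evaluate numerically. The sketch below sums over k the squared binomial coefficient times the probability of a 2k-bit error, as in the proof; the function name `hw_success_rate` is illustrative, and the printed values are not claimed to reproduce Table 4.

```python
from math import comb


def hw_success_rate(p, n=256):
    """Probability that the HW of a balanced n-bit message is recovered."""
    half = n // 2
    return sum(
        comb(half, k) ** 2 * p ** (n - 2 * k) * (1 - p) ** (2 * k)
        for k in range(half + 1)
    )


for p in (0.99, 0.999, 0.9999):
    print(p, round(hw_success_rate(p), 4))
```

The k = 0 term, p^n, is the probability that no bit is misclassified; the remaining terms account for errors that cancel out in the HW.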

Experimental results
In the experiments, we use two identical CW303 ARM devices, D_P and D_A. D_P is the profiling device. We have complete control over D_P, which means we can reload it with a different implementation, change its secret key, etc. D_A is the device under attack. We use D_A to capture traces for key recovery and a part of the traces for training.

Message recovery
In this section, we evaluate the impact of training set composition on the success rate of the RecoverMessage() algorithm. We also justify why the use of an ensemble of models can further improve the success rate.
We trained MLP models on trace sets of size 2M with varying proportions of D_P and D_A traces, denoted by D_P : D_A. We tried five cases: D_P : D_A ∈ {0:100, 20:80, 50:50, 80:20, 100:0}. The notation x : y means that x% of traces are from D_P and y% are from D_A. Recall from Sect. 7 that traces from D_P are captured for random messages, while traces from D_A are captured for all-0 and all-1 messages in equal proportion. D_P runs an implementation with deactivated shuffling. For each fraction D_P : D_A = x : y, we trained ten models with the architecture in Table 2 using TrainModel() with input parameters τ = 2M and k = x/100 and selected the best.
We tested the models on ten different ciphertexts created by encrypting a random message with a randomly selected public key. To recover the message, 257 × N traces from D_A were captured for each ciphertext, for N = 10, 15 and 20 (5140 traces for N = 20).
Table 5 lists the number of detected and undetected errors for each of the ten test sets. Recall that detected errors are those for which RecoverMessage() returns "2" as the message bit value. The ability to detect errors is very useful since e detected bit errors can be handled by enumerating 2^e possible choices, computing (K̂, r) = G(pkh, m) and then checking if c = Saber.PKE.Enc(pk, m; r).
We can see from Table 5 that the model trained on a combination of 80% of traces from D_P and 20% of traces from D_A produces the best results. Including traces from the device under attack in the training set helps mitigate the negative effect of device variability on classification accuracy.
We can also see that training on 100% of traces from the device under attack captured for all-0 and all-1 messages is not the best choice. As we mentioned in Sect. 7, due to the Cortex-M4's three-stage pipeline, the power consumed during the processing of a given message bit depends not only on the value of that bit, but also on the values of previous bits. Therefore, traces of all-0 and all-1 messages do not allow the neural network to learn all possible features.
Training on 100% of traces from the profiling device is the worst option. One could argue that such an option has the advantage of allowing profiling to be completed prior to the attack. However, thanks to the cut-and-join technique, we only need to capture 1.5K traces from D_A to contribute 20% of traces to the 2M training set, which takes less than 4 min. As a result, composing the training set as 80:20 has no significant effect on the time required to physically access D_A.
Table 5 also shows that, for N = 15 and lower, all models have some undetected errors. It is possible to improve the success rate by increasing the value of N. However, a larger N increases the capture time for attack traces, which is undesirable. Thus, in the experiments that follow, rather than increasing N, we use an ensemble of k models to improve the success rate of message recovery. The ensemble approach increases training time, but this is not as critical as increasing access time to D_A.
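Handling detected errors by enumeration can be sketched as follows. The predicate `is_consistent` is a placeholder for the re-encryption check (deriving r from the candidate message and comparing the resulting ciphertext with c); it and the function names are assumptions made for this illustration.

```python
from itertools import product


def enumerate_candidates(bits):
    """Yield every message obtained by replacing each "2" with 0 or 1."""
    unknown = [i for i, b in enumerate(bits) if b == 2]
    for choice in product((0, 1), repeat=len(unknown)):  # 2^e combinations
        candidate = list(bits)
        for pos, val in zip(unknown, choice):
            candidate[pos] = val
        yield candidate


def resolve_detected_errors(bits, is_consistent):
    """Return the first candidate that passes the check, or None."""
    for candidate in enumerate_candidates(bits):
        if is_consistent(candidate):
            return candidate
    return None


true_message = [1, 0, 1, 1, 0]
recovered = [1, 2, 1, 2, 0]  # two detected errors, so four candidates
print(resolve_detected_errors(recovered, lambda m: m == true_message))
```

Undetected errors cannot be repaired this way, because the positions to enumerate are unknown; in that case the function returns None.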
It is known [23] that, on average, an ensemble of models performs at least as well as any of its members. Furthermore, if the members make independent errors, then the ensemble performs considerably better than its members. This can be justified as follows. Suppose that each model makes an error $\epsilon_i$ on each test example, and the errors are drawn from a zero-mean multivariate normal distribution with variances $E[\epsilon_i^2] = v$ and covariances $E[\epsilon_i \epsilon_j] = c$. Then, the error made by the average prediction of all the ensemble models is $\frac{1}{k}\sum_i \epsilon_i$. So, the expected squared error of the ensemble predictor is given by [23]:

$$E\left[\left(\frac{1}{k}\sum_i \epsilon_i\right)^{2}\right] = \frac{1}{k^2} E\left[\sum_i \left(\epsilon_i^2 + \sum_{j \neq i} \epsilon_i \epsilon_j\right)\right] = \frac{1}{k} v + \frac{k-1}{k} c.$$

From the above we can conclude that, if the errors are perfectly correlated and c = v, then the expected squared error of the ensemble is v, i.e., the ensemble brings no improvement.
In contrast, if the errors are independent and c = 0, then the expected squared error of the ensemble reduces to v/k, i.e., it is inversely proportional to the ensemble size k.
The ensemble approach helps us in practice because different models typically do not make all the same errors on the test set, as we can see from the results in Table 5. This might be due to differences in the models' parameters after training. In Sect. 9.3, we present an example illustrating these differences.
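The dependence of the ensemble error on the variance v and covariance c can be checked with a small simulation. The sketch below draws correlated Gaussian errors by mixing one shared component with independent ones, which is one convenient construction assumed here (it requires 0 ≤ c ≤ v); the function names are illustrative.

```python
import random


def expected_ensemble_error(v, c, k):
    """Closed form: v / k + (k - 1) * c / k."""
    return v / k + (k - 1) * c / k


def simulate_ensemble_error(v, c, k, trials=20000, seed=0):
    """Empirical mean squared error of the averaged prediction."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # component common to all models
        errors = [
            (c ** 0.5) * shared + ((v - c) ** 0.5) * rng.gauss(0.0, 1.0)
            for _ in range(k)
        ]  # each error has variance v; each pair has covariance c
        total += (sum(errors) / k) ** 2
    return total / trials


for c in (0.0, 0.3, 1.0):
    print(c, expected_ensemble_error(1.0, c, 5), round(simulate_ensemble_error(1.0, c, 5), 3))
```

The two limiting cases discussed above correspond to c = 1.0 (no gain) and c = 0.0 (error reduced by the factor k).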

Secret key recovery
To evaluate the success rate of the secret key recovery attack, we captured ten test sets of 24 × 257 × N traces representing the decapsulation of ciphertexts constructed following steps 1-3 of the procedure in Sect. 8.1, for N = 10, 15 and 20. Each test set was captured for a different secret key.
To recover the secret messages contained in the ciphertexts, we use an ensemble of the best models obtained during training. The ensemble method is known to be useful in side-channel analysis [43,57]. Table 6 shows the results for ensembles of size up to 5 for different N. The k models in the ensemble are trained on the same training set with D_P : D_A = 80 : 20.
The output of an ensemble of k models is obtained as follows. For each j ∈ {0, 1, …, 255}, models that result in m[j] = 2 (i.e., a detected error) are excluded from voting, and then the mean of the m[j]s produced by the remaining models is computed. If the mean is ≤ 0.5, the jth message bit is set to "0"; otherwise it is set to "1." Finally, the secret key is derived from the 24 recovered messages as described in Sect. 8.2.
Since we use the ECC-based method [40], which is able to correct single errors and detect one additional error in the recovered message, we can mark the positions of the detected incorrect key coefficients for later enumeration. With d detected incorrect key coefficients, 9^d enumerations are required to find the true key. For example, for N = 10 and k = 5, 9^1.2 ≈ 14 enumerations are required.
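The voting rule can be sketched as below. Each model's output is assumed to be a list of values in {0, 1, 2}; the behaviour when every model reports "2" for a bit is not specified in the text, so returning 2 in that case is an assumption of this sketch.

```python
def ensemble_vote(predictions):
    """Combine per-model messages (lists of 0/1/2) into one message."""
    n_bits = len(predictions[0])
    result = []
    for j in range(n_bits):
        # Models that flagged a detected error abstain from the vote.
        votes = [p[j] for p in predictions if p[j] != 2]
        if not votes:
            result.append(2)  # assumption: keep the bit marked as undecided
        elif sum(votes) / len(votes) <= 0.5:
            result.append(0)
        else:
            result.append(1)
    return result


models_out = [
    [0, 1, 2, 2],
    [0, 1, 1, 2],
    [1, 1, 0, 2],
]
print(ensemble_vote(models_out))  # [0, 1, 0, 2]
```

Note that a tie (mean exactly 0.5) resolves to "0" under the stated rule, as in the third position of the example.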
Undetected errors are positions that are not handled by the ECC.We determine them by comparing the recovered key to the true key.One can see from Table 6 that, for N ≥ 10 and k ≥ 3, there are no undetected errors.Certainly, the values of N and k may vary depending on the implementation, environmental conditions, acquisition method, etc.
The last three columns of Table 6 show the time required for capturing traces and message recovery, as well as the average key enumeration time on a PC with a 16-core processor running at 4.3 GHz and 64 GB of RAM (simple single-threaded implementation). The sign "-" means that key enumeration is not feasible. Note that capture requires physical access to the device under attack, whereas post-processing steps do not.

Analysis of neural network models
It is challenging to explain how neural network models take their decisions.However, making an attempt is important because it could help in locating and fixing vulnerabilities in the implementation under attack.It might also aid in model optimization.

Feature analysis
To assess the significance of various input features for the models, we use two techniques: (1) weight analysis, and (2) stuck-at-0 fault injection.
Both methods have been shown to be effective in previous attacks on lattice-based PKE/KEMs [60].
Figure 13c shows the gamma parameters, γ, of the input Batch Normalization layer of the five MLP models in Table 5 after training. The model trained on the dataset composed as D_P : D_A = x : y is referred to as DP_x_DA_y.h5 in the legend. Recall that Batch Normalization first standardizes the input values X of the layer using their respective mean, μ, and standard deviation, σ, as X_norm = (X − μ)/σ, and then applies the scaling, γ (gamma), and offset, β (beta), parameters to the result: X′ = (γ * X_norm) + β. The parameters γ and β are learned during training. More specifically, the backpropagation algorithm is adjusted to operate on the transformed inputs, and the error is used to update the new scaling and offset parameters learned by the model. Thus, a higher value of γ indicates a higher importance of the corresponding input feature in the decision taken by the model. We can see that there are substantial differences in the weights of different trace points.
The contribution of each feature becomes even clearer after the stuck-at-0 fault injection analysis. Figure 13a and b show how the prediction accuracy of the models DP_80_DA_20.h5 and DP_100_DA_0.h5, respectively, is affected by setting each single point p of a test trace to 0 before inference (implying that the model takes its decision without the data sampled at that point). If the prediction accuracy drops to the random guess accuracy of 0.5, the point p is important.
In Fig. 13a and b, we can see that there are points in the interval [45:53] whose removal drops the accuracy to that of a random guess. This shows that the most important input features for the model's decision are located there. This, in turn, implies that the computations performed by the implementation of Saber during the corresponding clock cycles leak exploitable side-channel information. By doing a clock-cycle-accurate analysis of the assembly code of poly_A2A_shuffled() in Fig. 4b, one can link these computations to the store register halfword instructions strh.w r3, [r4, r1, lsl 1] and strh.w r2, [r5, r1, lsl 1], which store a halfword from a register to memory. These instructions implement line 6 of the C code of poly_A2A_shuffled() marked in red in Fig. 3.
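The Batch Normalization transform and the use of γ as a rough importance score can be sketched as follows. This is a simplified illustration: practical implementations also add a small constant to the variance for numerical stability, and the helper names are assumptions, not part of the paper's scripts.

```python
def batch_norm(x, mean, std, gamma, beta):
    """Apply gamma * (x - mean) / std + beta to every input feature."""
    return [g * (xi - m) / s + b for xi, m, s, g, b in zip(x, mean, std, gamma, beta)]


def rank_by_gamma(gamma):
    """Return feature indices sorted from largest to smallest |gamma|."""
    return sorted(range(len(gamma)), key=lambda i: abs(gamma[i]), reverse=True)


trace_points = [2.0, 4.0]
print(batch_norm(trace_points, mean=[1.0, 2.0], std=[1.0, 2.0], gamma=[0.5, 3.0], beta=[0.0, 1.0]))
print(rank_by_gamma([0.1, -2.0, 0.7]))  # [1, 2, 0]
```

A feature whose γ is close to zero is scaled down before reaching the next layer, which is why the magnitude of γ is read as a proxy for the weight of the corresponding trace point.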
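The stuck-at-0 analysis can be sketched as below. The callable `model` stands in for a trained network returning a score in [0, 1]; it, the toy traces and the function name are assumptions made for this illustration.

```python
def stuck_at_0_accuracy(model, traces, labels):
    """Accuracy of the model when each single trace point is forced to 0."""
    n_points = len(traces[0])
    accuracies = []
    for p in range(n_points):
        correct = 0
        for trace, label in zip(traces, labels):
            faulty = list(trace)
            faulty[p] = 0.0  # remove the information sampled at point p
            prediction = 1 if model(faulty) > 0.5 else 0
            correct += prediction == label
        accuracies.append(correct / len(traces))
    return accuracies


# Toy data: only point 2 depends on the label, and the toy model reads it directly.
traces = [[0.3, 0.7, 1.0, 0.2], [0.3, 0.7, 0.0, 0.2]] * 4
labels = [1, 0] * 4
print(stuck_at_0_accuracy(lambda t: t[2], traces, labels))  # [1.0, 1.0, 0.5, 1.0]
```

In the toy example the accuracy falls to 0.5 only at the point carrying the leakage, which mirrors how important points are identified in Fig. 13a and b.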

Model comparison
Figure 13 also illustrates that different models can be unequally "sensitive" to the same data point. This may result in these models making different inference errors on the same test set.
By comparing Fig. 13a and b, we can see that the deletion of some points may affect the models DP_80_DA_20.h5 and DP_100_DA_0.h5 differently. For example, the deletion of point 42 decreases the prediction accuracy of the model DP_80_DA_20.h5 only to 87%, while for the model DP_100_DA_0.h5 the reduction is to 65%. Conversely, the deletion of point 54 drops the prediction accuracy of the model DP_80_DA_20.h5 to 56%, while the accuracy of the model DP_100_DA_0.h5 reduces only to 89%. While all models follow a similar "pattern" of weights in Fig. 13c, for some data points their values differ. This also applies to the parameters of the subsequent layers, causing an avalanche effect. As a result, the same data point might contribute unequally to the decisions of different models. According to Table 5, the model DP_80_DA_20.h5 is more successful than DP_100_DA_0.h5 in making correct predictions. Thus, the weights learned by DP_80_DA_20.h5 are apparently closer to the optimum than the ones learned by DP_100_DA_0.h5.

Conclusion
We demonstrated that it is possible to break a masked and shuffled implementation of Saber KEM using deep learning-based power analysis.Earlier it was believed that the masked and shuffled countermeasures, when combined, provide adequate protection against side-channel attacks.The presented message and key recovery attacks are not specific to Saber and might be applicable to other LWE/LWR-based PKE/KEMs, including CRYSTALS-Kyber [4] which has been recently selected for standardization by NIST [36].
The traces, models and scripts, along with a video demonstration of the attack, are publicly available at https://drive.google.com/drive/folders/1NBf1oLO81UTSf_Z4HRRScb-RMIRzRNzc
Future work includes designing stronger countermeasures for LWE/LWR-based PKE/KEMs.

Fig. 4 Assembly code of masked poly_A2A inner loop before and after adding shuffling on the top

Fig. 8 Comparison of average power traces of D_P and D_A

Fig. 13 a and b The average prediction accuracy of the models DP_80_DA_20.h5 and DP_100_DA_0.h5, respectively, for the case when a given data point is stuck to 0; c Gamma parameters of the input Batch Normalization layer of five models

Table 1
Proposed parameters of round 3 Saber

Table 2
The MLP architecture

(2) For each c_i, i ∈ {1, …, 24}, construct 256 ciphertexts c_{i,0}, …, c_{i,255} such that c_{i,j} decrypts to m_{i,j} = Saber.PKE.Dec(s, c_{i,j}), which is equal to the message m_i = Saber.PKE.Dec(s, c_i) with the jth bit flipped, for j ∈ {0, …, 255}. The procedure is described in Sect. 8.3.
(3) For each of the 24 × 257 resulting ciphertexts, acquire a power trace during the decapsulation of the ciphertext by the device under attack. Repeat N times for each ciphertext.
(4) Use the acquired 24 × 257 × N power traces to recover the messages m_i contained in the ciphertexts c_i, for all i ∈ {1, …, 24}, using the RecoverMessage() algorithm presented in Sect. 8.4.
(5) Derive the secret key from the 24 recovered messages m_1, …, m_24 as described in Sect. 8.2.

extended Hamming code composed from the eight message bits. The first 256 secret key coefficients are derived from messages recovered from c_1, …, c_8, the second 256 coefficients from c_9, …, c_16, and the last 256 coefficients from c_17, …, c_24.

Table 5
The impact of training set composition on message recovery success rate

Table 6
Success rate of key recovery (average for 10 tests)