Systematic Side-Channel Analysis of Curve25519 with Machine Learning

Profiling attacks, especially those based on machine learning, have proved to be very successful in recent years for the side-channel analysis of symmetric-key cryptographic implementations. At the same time, results for implementations of asymmetric-key cryptosystems remain sparse. This paper considers several machine learning techniques to mount side-channel attacks on two implementations of scalar multiplication on the elliptic curve Curve25519. The first implementation follows the baseline implementation with complete formulae as used for EdDSA in WolfSSL, where we exploit power consumption as a side channel. The second implementation features several countermeasures, and in this case, we analyze electromagnetic emanations to find side-channel leakage. Most techniques considered in this work result in potent attacks. The method of choice appears to be convolutional neural networks (CNNs), which can break the first implementation with only a single measurement in the attack phase. The same convolutional neural network has demonstrated excellent performance when attacking AES implementations. Our results show that some common ground can be established when using deep learning for profiling attacks on very different cryptographic algorithms and their corresponding implementations.


Introduction
Various cyber-physical devices have become integral parts of our lives. They provide basic services and, as such, also need to fulfill appropriate security requirements. Designing such secure devices is not easy due to the limited resources available for implementations and the need to provide resilience against various attacks. In the last decades, implementation attacks have emerged as real threats and as the most potent attacks. In implementation attacks, the attacker does not aim at the weaknesses of an algorithm, but at the weaknesses in its implementations [23]. One powerful category of implementation attacks is profiled side-channel analysis (SCA), where the attacker has access to a profiling device that she uses to learn about the leakage from the device under attack. Profiled SCA uses a broad set of methods to conduct the attack.
In the last few years, attacks based on machine learning classification have proved to be very successful against symmetric-key cryptography [20-22, 35, 39]. On the other hand, profiled SCAs on public-key cryptography implementations are much scarcer [8,25,38].
While the current state-of-the-art results on profiled SCA and public-key cryptography suggest that targets can be broken with relatively small effort, many questions remain unanswered. For instance, it is not yet clear what the benefits of countermeasures are against machine learning-based attacks. What is more, public-key cryptography has different use cases and parameters, which result in classification problems with a significantly different number of classes than one commonly encounters when attacking, e.g., block ciphers. Finally, in profiled SCA on symmetric ciphers, we are slowly moving away from scenarios where the only interesting aspect is the attack performance. Indeed, the SCA community is now becoming interested not only in questions like interpretability [24,32,45] and explainability [46] of deep learning attacks, but also in building methodologies [50] and frameworks [33,34] for objective analysis.
This paper considers profiled side-channel attacks on two implementations of scalar multiplication on one of the most popular elliptic curves for applications, i.e., Curve25519. The first implementation is the baseline implementation with the complete formulae as used for EdDSA in WolfSSL. The second implementation also includes several countermeasures. To evaluate the security of those implementations, we consider seven different profiled methods. Additionally, we investigate the influence of the dimensionality reduction technique. By doing this, we aim to fill the knowledge gap and give insights into the performance of different profiled methods. Finally, we compare the differences in the attack performance when considering protected and non-protected implementations.
This paper is based on the work "One Trace Is All It Takes: Machine Learning-Based Side-Channel Attack on EdDSA" [48]. The main differences are: 1. We provide results for an additional target, protected with countermeasures. 2. We provide results for several more profiled methods and different dimensionality reduction steps. 3. We investigate the applicability of one visualization technique for deep learning when attacking public-key implementations.
The rest of this paper is organized as follows. In Section 2, we give details about EdDSA and scalar multiplication procedure. Afterwards, we discuss the profiled methods we use in our experiments. Section 3 provides details about the attacker model, the datasets we use, hyperparameter tuning, and dimensionality reduction. In Section 4, we provide experimental results for both targets. In Section 5, we discuss related works. Finally, in Section 6, we conclude the paper and offer some potential future research directions.

Background
In this section, we start by introducing the elliptic curve scalar multiplication operation and the EdDSA algorithm.
After that, we discuss profiling attacks that we use in our experiments.

Elliptic Curve Digital Signature Algorithm
In the context of public-key cryptography, one important feature is the (entity) authentication between two parties. This feature assures party B that party A has sent a message M and that this message is original and unaltered. Authentication can be performed by the Digital Signature Algorithm (DSA). Nowadays, public-key cryptography for constrained devices typically implies elliptic curve cryptography (ECC) as the successor of RSA because it achieves a higher security level with smaller key lengths, saving resources such as memory, power, and energy. The security of ECC algorithms is based on the difficulty of the Elliptic Curve Discrete Logarithm Problem (ECDLP), which states that while it is easy and efficient to compute $Q = k \cdot P$, it is "difficult" to find k with knowledge of Q and P.
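The asymmetry behind the ECDLP can be illustrated on a deliberately tiny curve. The sketch below is not Curve25519: it assumes a textbook short Weierstrass curve over F_17 with a group of order 19, and the function names (`add`, `scalar_mult`, `brute_force_dlog`) are hypothetical. Computing k · P takes a handful of doublings and additions, while recovering k here is only possible because the group is small enough to enumerate.

```python
# Toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative only, NOT Curve25519).
P_MOD, A = 17, 2
G = (5, 1)    # generator of a group of order 19
INF = None    # point at infinity


def add(p, q):
    """Affine point addition on the toy curve."""
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF  # p and q are inverses of each other
    if p == q:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def scalar_mult(k, p):
    """Left-to-right double-and-add: computes k * p."""
    acc = INF
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        if bit == "1":
            acc = add(acc, p)
    return acc


def brute_force_dlog(q, p, order):
    """Exhaustive search for k with k * p == q; feasible only for tiny groups."""
    acc = INF
    for k in range(order):
        if acc == q:
            return k
        acc = add(acc, p)
    return None
```

For a group of cryptographic size, the loop in `brute_force_dlog` is infeasible, which is why side-channel leakage of k during `scalar_mult`-like computations is the attacker's practical route.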
EdDSA [4] is a variant of the Schnorr digital signature scheme [42] using twisted Edwards curves, a family of elliptic curves that uses unified formulas, enabling speedups for specific curve parameters. This algorithm proposes a deterministic generation of the ephemeral key, different for every message, to prevent flaws from a biased random number generator. The ephemeral key r is derived from the hash value of the message M and the auxiliary key b, generating a unique ephemeral public key R for every message.
EdDSA, with the parameters of Curve25519, is referred to as Ed25519 [3]. EdDSA scheme for signature generation and verification is described in Algorithm 1, where the notation (x, . . . , y) denotes the concatenation of the elements. The hash function H is SHA-512 [29]. The key length is of size u = 256. We denote the private key with k, the private scalar a is the first part of the private key's hashed value, and the auxiliary key b is the second part. We denote the ephemeral key with r and M is the message.
After the signature generation, party A sends (M, R, S), i.e., the message along with the signature pair (R, S) to B. The verification of the signature is done by B with steps 10 to 11. If the last equation is verified, it represents a point on the elliptic curve, and the signature is correct, ensuring that the message can be trusted as an authentic message from A.

Elliptic Curve Scalar Multiplication
We focus on two types of implementations of EC scalar multiplication. The first implementation is of EdDSA using Ed25519 as in WolfSSL. This implementation is based on the work of Bernstein et al. [4] and is a window-based method with radix 16, making use of a precomputed table containing the results of the scalar multiplications $16^i |r_i| \cdot G$, where $r_i \in [-8, 7] \cap \mathbb{Z}$ and G is the base point of Curve25519. This method is popular because of its tradeoff between memory usage and computation speed, but also because the implementation is constant-time and features neither secret-dependent branch conditions nor secret-dependent array indices, and hence is presumably secure against timing attacks.
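To make the signed radix-16 representation concrete, the following sketch rewrites a scalar as a sum of digits in [−8, 7] times powers of 16. It is one common way to obtain such digits and is not taken from the WolfSSL source; the function name `recode_radix16` and the returned final carry are assumptions made for illustration.

```python
def recode_radix16(scalar, digits=64):
    """Rewrite scalar as sum(e[i] * 16**i) + carry * 16**digits, e[i] in [-8, 7]."""
    # Start from the plain unsigned nibbles (values 0..15).
    e = [(scalar >> (4 * i)) & 0xF for i in range(digits)]
    carry = 0
    for i in range(digits):
        e[i] += carry
        carry = (e[i] + 8) >> 4   # 1 when the digit is >= 8, else 0
        e[i] -= carry << 4        # move 16 into the next digit
    return e, carry


def reconstruct(e, carry):
    """Inverse of recode_radix16, used here only as a sanity check."""
    return sum(d * 16 ** i for i, d in enumerate(e)) + carry * 16 ** len(e)
```

Each digit then selects one precomputed multiple of G by its absolute value, with the sign handled separately, so an attacker who classifies every digit from a single trace learns the whole scalar.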
Here, information leaking from the value loaded from memory by the function ge_select is used to recover e, which can then be easily connected to the ephemeral key r. More details are given in the remainder of this paper. We can attack this implementation and extract the ephemeral key r from Step 5 in Algorithm 1.
The second implementation we focus on is the Montgomery Ladder scalar multiplication as used in μNaCl [14]. The implementation employs arithmetic-based conditional swap and is additionally protected with projective coordinate re-randomization and scalar randomization. The traces used to analyze this implementation are obtained from a publicly available dataset [11]. All details on this implementation, including the additional countermeasures, are described in [27].
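The idea of an arithmetic-based conditional swap inside a Montgomery ladder can be sketched as follows. This is a toy stand-in, assuming modular exponentiation in place of curve arithmetic; it does not reproduce the μNaCl code or its randomization countermeasures, and the names `cswap` and `ladder_pow` are hypothetical.

```python
def cswap(a, b, bit, width_mask):
    """Swap a and b when bit == 1 using only arithmetic and logic, no branch."""
    mask = (-bit) & width_mask   # all ones if bit == 1, all zeros otherwise
    t = mask & (a ^ b)
    return a ^ t, b ^ t


def ladder_pow(base, exponent, modulus, bits):
    """Montgomery ladder for base**exponent mod modulus (toy stand-in for k*P)."""
    width_mask = (1 << modulus.bit_length()) - 1
    r0, r1 = 1, base % modulus
    for i in reversed(range(bits)):
        bit = (exponent >> i) & 1
        r0, r1 = cswap(r0, r1, bit, width_mask)
        r1 = (r0 * r1) % modulus   # same "add" step for every bit
        r0 = (r0 * r0) % modulus   # same "double" step for every bit
        r0, r1 = cswap(r0, r1, bit, width_mask)
    return r0
```

Every iteration executes the same operations regardless of the key bit, so the only bit-dependent value is the swap mask. This is why each ladder iteration in the dataset can be labeled with a single condition bit.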

Random Forest (RF)
Random forest is an ensemble learning method that consists of a number of decision trees [6]. Decision trees consist of combinations of Boolean decisions on a different random subset of attributes of input data (called bootstrap sampling). For each node of each tree, the best split is taken among these randomly chosen attributes. Random forest is a stochastic algorithm since it has two sources of randomness: bootstrap sampling and attribute selection at node splitting. While the random forest has several hyperparameters to tune, we investigate the influence of the number of trees in the forest, where we do not pose any limits on the tree size.

Support Vector Machines (SVM)
Support vector machine is a kernel-based machine learning family of methods used to classify linearly separable and linearly inseparable data [47]. The idea for linearly inseparable data is to transform them into a higher dimensional space using a kernel function, wherein the data can usually be classified with higher accuracy. The scikit-learn implementation we use considers libsvm's C-SVC classifier [31] that implements an SMO-type algorithm [16]. This implementation of SVM learning is widely used because it is simpler and faster compared to older methods. The multi-class support is handled according to a one-vs-one scheme. We investigate two variations of SVM: with a linear kernel and with a radial kernel. Linear kernel-based SVM has the penalty hyperparameter C of the error term. Radial kernel-based SVM has two significant hyperparameters to tune: the cost of the margin C and the kernel coefficient γ.

Convolutional Neural Networks (CNNs)
Convolutional neural networks, like other types of neural networks, have several layers where each layer is made up of neurons, as depicted in Fig. 1. Every neuron in a layer computes a weighted combination of an input set by a net input function (e.g., the sum function in neurons of a fully connected layer) from which a nonlinear activation function produces an output. When the output is different from zero, we say that the neuron activation feeds the next layer as its input. Layers with a convolution function as the net input function are referred to as convolutional layers and are the core building blocks in a CNN. Pooling layers are commonly used after a convolution layer to sample down local regions and create spatial regions of interest. The last fully connected layers of a CNN behave as a classifier for the extracted features from the inputs.
Fig. 1 Anatomy of a neuron
In this work, we start from the VGG-16 architecture introduced in [43] for image recognition. This architecture was also recently applied for SCA on AES [20] and EdDSA [48]. This CNN architecture also uses the following elements: 1. Batch normalization to normalize the input layer by applying standard scaling on the activations of the previous layer. 2. Flatten layer to transform input data of rank greater than two into a one-dimensional feature vector used in the fully connected layer. 3. Dropout (randomly dropping out units (both hidden and visible) in a neural network with a certain probability at each batch) as a regularization technique for reducing overfitting by preventing complex co-adaptations on the training data.
The architecture of a CNN depends on a large number of hyperparameters, so choosing hyperparameters for each different application is an engineering challenge. The choices made in this paper are discussed in Section 4.

Gradient Boosting (XGB)
Gradient boosting for classification is an algorithm that trains several weak learners (i.e., decision trees that perform poorly on the classification problem) and combines their predictions to make one stronger learner. Gradient boosting differs from the random forest in the way the decision trees are built. While in a random forest classifier each tree is trained independently using random samples of the data, each decision tree in gradient boosting depends on the predictions of the previously trained trees in order to correct their errors. Gradient tree boosting is composed of a concatenation of several smaller decision trees. We used the extreme gradient boosting (XGB) implementation of gradient boosting, designed by Chen and Guestrin [10], which uses a sparsity-aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate learning.

Naive Bayes (NB)
The Gaussian Naive Bayes classifier is a classification algorithm that applies Bayes' theorem with the "naive" assumption, i.e., that every pair of features is conditionally independent given the class. In addition, each feature is assumed to follow a Gaussian distribution within a class. The Naive Bayes method is highly scalable with the number of features and requires only a few representative features per class to achieve a satisfying performance.
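The two assumptions can be read off directly from a minimal implementation. The sketch below is a from-scratch illustration rather than the classifier used in the experiments; the names `fit_gnb` and `predict_gnb` and the small variance floor `eps` are assumptions. Each class stores one mean and one variance per feature, and the per-feature log-likelihoods are simply summed because of the independence assumption.

```python
import math
from collections import defaultdict


def fit_gnb(traces, labels, eps=1e-9):
    """Estimate a log-prior and per-feature Gaussian parameters for each class."""
    by_class = defaultdict(list)
    for trace, label in zip(traces, labels):
        by_class[label].append(trace)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance strictly positive for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(traces)), means, variances)
    return model


def predict_gnb(model, trace):
    """Return the class with the highest log-posterior for one trace."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * s2) - (x - m) ** 2 / (2 * s2)
            for x, m, s2 in zip(trace, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)
```

Replacing the per-feature variances with a full covariance matrix per class would turn this into a template-style model, at the cost of many more parameters to estimate.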

Template Attack (TA)
The template attack relies on Bayes' theorem and considers the features to be dependent. Commonly, the template attack relies on a normal distribution [9], and it assumes that each $P(X = x \mid Y = y)$ follows a (multivariate) Gaussian distribution parameterized by its mean and covariance matrix for each class Y. Choudary and Kuhn proposed using one pooled covariance matrix averaged over all classes Y to cope with statistical difficulties and the resulting lower efficiency [12]. In our experiments, we use this version of the attack.

Attacker Model
The general recommendation for EdDSA, as well as other ECDSA implementations, is to select different ephemeral private keys r for each different signature. When this is not applied and the same r is used for different messages, the two resulting signature pairs $(R, S)$ and $(R, S')$ for messages $M$ and $M'$, respectively, can be used to recover r as $r = (z - z')(S - S')^{-1}$, where $z$ and $z'$ represent a majority of leftmost bits of $H(M)$ and $H(M')$ interpreted as integers. Finally, the private scalar a is exposed as $a = R^{-1}(Sr - z)$ and can be misused by the attacker to forge new signatures. The attacker's aim is the same as for every ECDSA attack: recover the secret scalar a. The difference is that the attacker cannot acquire two signatures with the same random r, but can still recover the secret scalar in two different ways. The first method consists of attacking the hash function's implementation to recover b from the computation of ephemeral private key [40]. The second one attacks the implementation of the scalar multiplication during the ephemeral public key's computation to infer it in a single trace [48]. In this paper, we consider only the profiled attacks, i.e., those based on the supervised machine
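The two recovery formulas for a reused ephemeral key can be checked numerically. The sketch below assumes the ECDSA-style signing relation S = r^{-1}(z + R·a) mod n that those formulas imply, works purely with integers modulo the prime order of the Ed25519 base point, and treats R as an arbitrary integer derived from the ephemeral public key. The names `sign_s` and `recover_from_reuse` are hypothetical.

```python
# Prime order of the Ed25519 base point.
N = 2**252 + 27742317777372353535851937790883648493


def sign_s(a, r, R, z):
    """ECDSA-style S = r^-1 * (z + R*a) mod N (assumed signing relation)."""
    return pow(r, -1, N) * (z + R * a) % N


def recover_from_reuse(R, S1, z1, S2, z2):
    """Recover the reused ephemeral key r and then the private scalar a."""
    r = (z1 - z2) * pow((S1 - S2) % N, -1, N) % N
    a = pow(R, -1, N) * (S1 * r - z1) % N
    return r, a
```

Since a deterministic nonce rules out this two-signature scenario, the attacks considered here instead target r (or the material it is derived from) through side-channel leakage; once r is known for one signature, the second formula yields a in the same way.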

SCA Datasets
We analyze two publicly available datasets targeting elliptic curve scalar multiplication on Curve25519 for microcontrollers. The first dataset consists of power traces of a baseline implementation, and the second dataset consists of electromagnetic traces of a more protected implementation.

Baseline Implementation Dataset
We consider a dataset of scalar multiplication on Curve25519. The implementation follows the baseline implementation of the scalar multiplication algorithm as in [48]. The traces contain power measurements collected from a Piñata development board (https://www.riscure.com/product/pinata-training-target/) based on a 32-bit STM32F4 microcontroller with an ARM-based architecture, running at a clock frequency of 168 MHz. The device is running the Ed25519 implementation of WolfSSL 3.10.2. The target is the EC scalar multiplication of the ephemeral key and the base point of curve Ed25519 (as explained in Section 3.1). Because of the chosen implementation, it is possible to profile the full scalar nibble by nibble in a horizontal fashion. The dataset is thus composed of multiple separate nibble computations.
The dataset has 6400 labeled traces of 1000 features each, with the associated nibble value. In Fig. 2, we give the signal-to-noise ratio of this dataset. The SNR is high and reaches a maximum value of 12.9. Such a high SNR is the consequence of dealing with power leakages that are less noisy than usual EM leakages. The leakage is essentially located between points 50 and 700, where several features seem to leak information about the handled nibble.
Fig. 3 Signal-to-noise ratio for the protected implementation dataset
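A per-feature signal-to-noise ratio of this kind can be estimated as sketched below. This assumes the common SCA definition (variance of the class means divided by the mean of the class variances); the exact estimator behind the figures is not stated here, and the function name `snr_per_feature` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def snr_per_feature(traces, labels):
    """SNR_j = Var_over_classes(mean of feature j) / Mean_over_classes(var of feature j)."""
    groups = defaultdict(list)
    for trace, label in zip(traces, labels):
        groups[label].append(trace)
    snr = []
    for j in range(len(traces[0])):
        class_means = [mean([t[j] for t in rows]) for rows in groups.values()]
        class_vars = [pvariance([t[j] for t in rows]) for rows in groups.values()]
        noise = mean(class_vars)
        signal = pvariance(class_means)
        snr.append(signal / noise if noise > 0 else float("inf"))
    return snr
```

Features whose class means differ strongly relative to the within-class spread obtain a high value, which is how the leaking region of a trace shows up as peaks.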

Protected Implementation Dataset
The traces in the protected dataset are taken from a publicly available dataset [11]. This set contains electromagnetic traces coming from 5997 executions of Curve25519 μNaCl Montgomery Ladder scalar multiplication running on the Piñata target, the same as in Section 3.2.1. The implementation employs an arithmetic-based conditional swap and is additionally protected with the projective coordinate re-randomization and scalar randomization. Each trace from the dataset represents a single iteration of the Montgomery Ladder scalar multiplication that is cut from the whole execution trace; each such trace is labeled with the corresponding cswap condition bit. Furthermore, all these cut traces (5997 × 255 = 1,529,235) are aligned to exploit the leakage efficiently. Details about the implementation and how the traces are aligned are in [27]. Figure 3 represents the SNR of the dataset for the bit model. This SNR is relatively flat except for two peaks where the leakage of the data is stronger. One is located before feature 3000 and the second after feature 5000. The noise level is high for an EM dataset but is smaller than the other dataset based on power traces.

Evaluation Metrics
To examine the feasibility and performance of our attack, we use two different metrics. We first compare the performance using the accuracy metric since it is a standard metric in machine learning. The accuracy metric represents the fraction of the measurements that are classified correctly. The second metric we use is the success rate, as it is an SCA metric that gives a more concrete idea of the power of the attacker [44]. Let us consider the setting where we have A attack traces. As the result of an attack, we output a key guessing vector $v = [v_1, v_2, \ldots, v_{|K|}]$ in decreasing order of probability, with $|K|$ being the size of the keyspace. Then, the success rate is the average empirical probability that $v_1$ is equal to the correct key.
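As a minimal sketch, the success rate can be computed from per-attack score vectors as follows, assuming each attack yields one score per key candidate. The function names and the `order` parameter (which generalizes the metric to "the correct key is among the first o candidates") are illustrative choices.

```python
def rank_of_correct(scores, correct):
    """Position (1 = most likely) of the correct key in the guessing vector."""
    guessing_vector = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return guessing_vector.index(correct) + 1


def success_rate(all_scores, correct_keys, order=1):
    """Fraction of attacks in which the correct key is ranked within `order`."""
    hits = sum(rank_of_correct(scores, correct) <= order
               for scores, correct in zip(all_scores, correct_keys))
    return hits / len(correct_keys)
```

When every attack uses a single trace and `order` is 1, this quantity coincides with the classification accuracy.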

Dimensionality Reduction
For computational reasons, one may want to analyze only the most informative features from the dataset's traces. Consequently, we explore several different settings where we use all the features in a trace or conduct dimensionality reduction. For dimensionality reduction, we use a method called principal component analysis. Principal component analysis (PCA) is a linear dimensionality reduction method that uses Singular Value Decomposition (SVD) of the data matrix to project it to a lower dimensional space [5]. PCA creates a new set of features (called principal components) that are linearly uncorrelated and form a new orthogonal coordinate system. The number of components is the same as the number of original features. The components are arranged so that the first component covers the largest variance by a projection of the original data, and the following components cover less and less of the remaining data variance. The projection contains (weighted) contributions from all the original features. Not all principal components need to be kept in the transformed dataset. Since the components are sorted by decreasing covered variance, keeping the first L components maximizes the retained variance of the original data and minimizes the reconstruction error of the data transformation for that number of components. While PCA is meant to select the principal information from data, there is no guarantee that the reduced data form will give better results for profiling attacks than its complete form.

Hyperparameter Tuning
Most machine learning methods are parametric and require some hyperparameters to be tuned before the training phase. Depending on this pre-tuning, the trained classifier will potentially have a different outcome. The different classification methods we used are trained with a wide set of hyperparameters as detailed in this section. The exact used hyperparameters are listed in Tables 1 and 4.
TA We use the Template Attack with a pooled covariance matrix [12]. This method has no hyperparameters to tune.
NB We do not conduct hyperparameter tuning as the method is non-parametric (i.e., there are no hyperparameters to tune).
RF We tune the number of decision trees. We consider the following number of trees: 50, 100, 500.
SVM For the linear kernel, the hyperparameter to optimize is the penalty parameter C. We search for the best C in the range $[1, 10^5]$ in logarithmic space. For the radial basis function (RBF) kernel, we have two hyperparameters to tune: the penalty C and the kernel coefficient γ. The search for best hyperparameters is done within $C = [1, 10^5]$ and $\gamma = [-5, 2]$ in logarithmic spaces.
XGB In the same fashion as the random forest classifier, we set the hyperparameter exploration for the number of trees to 50, 100, and 300. We impose a maximum depth for each tree from 1 to 3 nodes, to force each tree to be a weak learner.
CNN The chosen hyperparameters for VGG-16 follow several rules that have been adapted for SCA in [20] or [39] and that we describe here:
1. The model is composed of several convolution blocks and ends with a dropout layer followed by a fully connected layer and an output layer with the Softmax activation function.
2. Convolutional and fully connected layers use the ReLU activation function (max(0, x)).
3. A convolution block is composed of one convolution layer followed by a pooling layer.
4. An additional batch normalization layer is applied for every odd-numbered convolution block and precedes the pooling layer.
5. The filter size for convolution layers is set to 3.
6. The number of filters $n_{\text{filters},i}$ in a convolution block $i$ increases according to the following rule: $n_{\text{filters},i} = \min(2^i \cdot n_{\text{filters},1}, 512)$ for every layer $i \geq 0$, and we choose $n_{\text{filters},1} = 8$.
7. The stride of the pooling layers equals two and halves the input data for each block.
8. Convolution blocks follow each other until the size of the input data is reduced to 1.
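The rules above determine the network depth from the input length alone, which the following sketch makes explicit. It is only a layout calculator, not the Keras model itself, and it rests on three assumptions: the filter count doubles per block and is capped at 512, the convolutions preserve the input length, and each pooling layer halves the length with floor division. The function name `vgg_like_layout` is hypothetical.

```python
def vgg_like_layout(n_features, first_filters=8, max_filters=512):
    """List the convolution blocks implied by the design rules for one input size."""
    blocks = []
    size, i = n_features, 0
    while size > 1:
        filters = min(2 ** i * first_filters, max_filters)  # assumed cap at 512
        size //= 2                                          # pooling with stride 2
        blocks.append({
            "block": i + 1,
            "filters": filters,
            "batch_norm": (i + 1) % 2 == 1,  # odd-numbered blocks only
            "output_size": size,
        })
        i += 1
    return blocks
```

Under these assumptions, a 1000-feature input yields nine blocks, which agrees with the nine convolutional layers reported for the baseline architecture, while longer traces yield deeper networks.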

Results
In this section, we first present results for the baseline implementation and the protected implementation afterward. We finish the section with results on visualization and discussion. The best results in Tables 2 and 5 are given in italics.

Baseline Implementation
After the conducted training phase of all the different classifiers with their hyperparameters, we list in Table 1 the best hyperparameter combinations for each machine learning model. The resulting CNN architecture for a 1000-feature input is depicted in Fig. 4. Other architectures will have a different number of convolutional blocks and a number of weights depending on the number of features of the input.
In Table 2, we give the accuracy score for different profiling methods when considering the recovery of a single nibble of the key. We can see that all profiling techniques reach excellent performance with accuracy above 95%. When considering all available features (1000), CNN performs the best and achieves an accuracy of 100%. Both SVM (linear and RBF) and RF have the same accuracy. SVM's performance is interesting since the same value for linear and RBF kernel indicates there is no advantage of using higher dimensional space, which means that the classes are linearly separable. Finally, NB, XGB, and TA still perform well, but we conclude they reach the worst results compared with other methods. PCA results in lower accuracy scores for most of the considered techniques. When considering 500 or 100 PCA components, the TA's results slightly improve, while RF and CNN results slightly decrease. SVM with both kernels can reach minimally higher accuracy when considering 500 PCA components. When considering the scenario with only the ten most important PCA components, all the results deteriorate compared with the results with 1000 features, and SVM performs the best.
To conclude, all techniques exhibit strong performance, but CNN is the best if no dimensionality reduction is applied. There, the maximum accuracy is obtained after only a few epochs (see Figs. 6 and 7). If dimensionality reduction is applied, CNN shows a progressive performance deterioration. This behavior should not come as a surprise since CNNs are usually used with the raw features (i.e., no pre-processing). Applying such techniques could reduce the performance due to a loss of information and changes in the spatial representation of features. Interestingly, TA and SVM are very stable methods, regardless of the number of used features (components), and those methods show the best performance for a reduced number of features settings.
In Fig. 5, we present a success rate with orders up to 10 for all profiling methods on the dataset without applying PCA. Recall that a success rate of order o is the probability that the correct subkey is ranked among the first o candidates of the guessing vector. While CNN has a 100% success rate of order 1, other methods achieve the perfect score only for orders greater than 6.
The results for all methods are similar in the recovery of a single nibble from the key. To have an idea of how well these methods perform for the recovery of a full 256-bit key, we apply classification on the successive 64 nibbles. We obtain an intuition of the resulting accuracy by considering the cumulative probability $P_c$ of the probabilities of recovery of one nibble $P_s$: $P_c = P_s^{64}$ (see Table 3). The cumulative accuracy obtained in such a way can be interpreted as the predictive first-order success rate of a full key for the different methods in terms of a security metric.
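The effect of the exponent is easy to underestimate, so a small numeric sketch may help. It assumes that the per-symbol recoveries are independent, which is what the product form implies; the function names and the example accuracies used in the checks are illustrative and are not values from the tables.

```python
def full_key_success(p_single, n_symbols):
    """Probability of recovering all symbols when each succeeds with p_single."""
    return p_single ** n_symbols


def required_single_accuracy(p_full, n_symbols):
    """Per-symbol accuracy needed to reach a target full-key success probability."""
    return p_full ** (1.0 / n_symbols)
```

For instance, a hypothetical per-nibble accuracy of 99% already drops to roughly 53% over 64 nibbles, so small differences between classifiers on one symbol translate into large differences for the full key.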
From these results, the best result is obtained with CNN when no dimensionality reduction is applied. Other methods are nonetheless powerful profiling attacks with up to 95% performance to recover the full key on the first guess with the best choice of hyperparameters and dimensionality reduction. When considering the results after dimensionality reduction, SVM is the best performing technique when using 500 PCA components.
As can be observed from Figs. 6 and 7, both the scenarios without dimensionality reduction and with dimensionality reduction to 100 and 500 components reach the maximal performance very fast. On the other hand, the scenario with 10 PCA components does not reach the maximal performance within 100 epochs since the validation accuracy does not start to decrease. Still, even longer experiments do not show further improvement in the performance, which indicates that the network simply learned all that is possible and that there is no more information that can be used to increase the performance further. Finally, the fast increase in training and validation accuracy, and the stable behavior of profiling
Fig. 4 CNN architecture, as implemented in Keras. This architecture takes a 1000-feature input and consists of nine convolutional layers followed by max pooling layers. For each odd convolutional layer, there is a batch normalization layer before the pooling layer. At the end of the network, there is one fully connected layer

Protected Implementation
We list the selected hyperparameters for the protected implementation in Table 4. The protected implementation dataset contains more features per trace than the other dataset. Therefore, the number of trainable parameters for machine learning methods greatly increases, increasing the models' training load. We experimented with RF, NB, and XGB and left out SVM (both with linear and RBF kernel) as this method's training becomes too expensive. We show the accuracy results for all tested methods on the protected implementation dataset in Table 5. Notice that, contrary to the previously considered dataset, not all profiling techniques have good performance, and most of them are even close to random guessing. Still, some profiling methods can reach above 99% accuracy, where the best results are obtained with CNN. When PCA is applied, random forest performs poorly with 50.2% accuracy for ten and 1000 components, which is not better than one could expect from random guessing. However, this method turns out to be quite efficient on the raw features and reaches an accuracy of 93% for one bit recovery.
Fig. 6 Accuracy of the CNN method over 100 epochs for the baseline implementation dataset
Naive Bayes and XGB perform poorly regardless of the hyperparameters explored and of whether dimensionality reduction is applied. The accuracy stays around random guessing when PCA is applied with ten and 1000 components, and does not go above 60% in the best case. Naive Bayes and XGB are simple classifiers and, considering their accuracy score on
The template attack performs well, and the more features are taken, the better the results. The best accuracy score for the template attack is obtained when all features are kept, and it reaches 99% accuracy. When PCA is applied and 1000 components are selected, the accuracy falls to 89% (which is, in fact, the best result for all considered techniques). Finally, when the number of selected components is reduced to 10, the accuracy falls to 52%.
CNN is a highly efficient method only when considering the dataset without applying the PCA method, where it reaches an accuracy above 99%. As we can see in Figs. 8 and 9, when PCA is applied, while the training loss and accuracy seem to fit the training set, the model fails to generalize and converge on the validation set given the chosen number of traces and epochs.
We can evaluate the accuracy of the different methods to predict a 256-bit scalar by computing the cumulative probability of success of a single bit over 256 attempts. The cumulative probability $P_c$ for a 256-bit key, given a single-bit recovery probability $P_s$, is $P_c = P_s^{256}$. Here, only the methods with a single-bit accuracy above 99% are worth considering as the other methods have a cumulative probability close to 0. For example, the cumulative accuracy for the random forest with 5500 features is 8%, and CNN with 5500 features is 98%.
Fig. 8 Accuracy of the CNN method over 100 epochs on the protected implementation dataset

Visualization of the Integrated Gradient
For CNNs, various visualization techniques have been developed to help researchers understand what input features influence the neural network predictions. These tools are interesting in side-channel analysis to evaluate if a network bases its prediction on the part of the trace where the leakage is the strongest. We note that visualization techniques proved to be a helpful tool when considering profiled SCA and block ciphers [17,24]. We use here the integrated gradient method [30]. In this method, the higher the gradient value is, the more important the feature is for the model's prediction.
Fig. 9 Loss of the CNN method over 100 epochs on the protected implementation dataset
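The integrated gradient attribution can be sketched without a deep learning framework. The example below assumes the standard definition (input minus baseline, multiplied by the gradient averaged along the straight path from the baseline to the input) and uses a hypothetical logistic unit as a stand-in for a trained network; the finite-difference gradient, the all-zero baseline in the usage, and all names are illustrative choices.

```python
import math


def model(x, w=(2.0, -1.0, 0.5), b=0.1):
    """Hypothetical stand-in for one network output: a logistic unit."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def gradient(f, x, h=1e-5):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of the integrated gradients of f at x."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(f, point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]
```

A useful sanity check is that the attributions sum to approximately `model(x) - model(baseline)`. For a trace, plotting the attribution of each sample point then indicates which parts of the trace the network relies on.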
From Figs. 10 and 11, we can notice that when we apply principal component analysis, the network tends to rely more on the first features. After applying PCA, the features are reorganized and ranked from the most important to the least important feature. When considering the dataset without applying PCA, the features' order is the same as those sampled with the oscilloscope. We can notice interesting similarities between the SNR of the unprotected implementation (Fig. 2) and the integrated gradient of the CNN. The interpretation of the integrated gradient obtained for the CNN trained on the protected implementation dataset is less evident, as the high peaks do not correspond to the leaking features indicated by the SNR (see Fig. 3). When comparing the visualization results for both datasets, the similarity between the baseline results for the full number of features and after dimensionality reduction indicates that the performance should be similar, which is confirmed by the accuracy results. On the other hand, we see striking differences between the two visualizations for the protected implementation, where the one with 1000 features cannot concentrate on the most important elements, which is again evident from the accuracy results.
Fig. 10 Integrated gradient method applied to CNN trained on the baseline implementation dataset
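The reordering of features by PCA can be illustrated with a short sketch. It assumes NumPy is available and uses synthetic traces with one hypothetical leaking sample; the helper names, trace dimensions, and leakage model are assumptions for illustration and do not correspond to the datasets used here. Note that PCA ranks components by explained variance, which coincides with leakage only when the leakage dominates the variance.

```python
import numpy as np


def pca_fit(traces, n_components):
    """Fit PCA via SVD of the mean-centered traces.
    Returns the mean, the top principal directions, and their variances."""
    mean = traces.mean(axis=0)
    centered = traces - mean
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = (s ** 2) / (traces.shape[0] - 1)
    return mean, vt[:n_components], explained_variance[:n_components]


def pca_transform(traces, mean, components):
    """Project traces onto the retained principal directions."""
    return (traces - mean) @ components.T


# Synthetic data: 200 traces of 50 samples, Gaussian noise everywhere.
rng = np.random.default_rng(0)
n_traces, n_samples = 200, 50
bits = rng.integers(0, 2, size=n_traces)
traces = rng.normal(0.0, 1.0, size=(n_traces, n_samples))
traces[:, 20] += 5.0 * bits  # hypothetical leaking sample at index 20

mean, components, variance = pca_fit(traces, n_components=5)
reduced = pca_transform(traces, mean, components)

print(reduced.shape)                      # traces in the reduced feature space
print(np.argmax(np.abs(components[0])))   # sample dominating the first component
```

In this toy setting the first component is dominated by the leaking sample, so a model trained on the reduced traces would naturally concentrate on the first features, in line with the behavior described above.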

General Remarks
The obtained results allow us to infer some more general recommendations one could follow when attacking ECC with profiled SCA:

Related Work
In 2003, Chari et al. [9] introduced the template attack (TA), a powerful SCA method from the information-theoretic point of view, which became a standard tool for profiled SCA. As straightforward implementations of TA can be computationally intensive, one option for more efficient computation is to use only a single covariance matrix, which is referred to as the so-called pooled template attack, presented by Choudary and Kuhn [12]. There, the authors were able to template a LOAD instruction and recover all 8 bits treated with a guessing entropy equal to 0. Several works applied machine learning methods to SCA of block ciphers because they resemble general profiling techniques. Two methods stand out particularly in profiled SCA, namely support vector machines [21,22,36,41] and random forest [18,35,41]. A few other works also experimented with naive Bayes [36] and gradient boosting methods [37,49]. With the general evolution in the field of deep learning, more and more works deal with neural networks for SCA and often show top performance. Most of the research concentrated on either multilayer perceptrons or convolutional neural networks [7,13,22,37].
There is a large body of work considering profiling techniques for symmetric-key ciphers, but much less for public-key cryptography, especially ECC. Template attacks on ECC trace back to an attack on ECDSA, as demonstrated by Medwed and Oswald in 2009 [26]. That work showed TA to be efficient for attacking SPA-resistant ECDSA with the P192 NIST curve on a 32-bit microcontroller [25]. Heyszl presented another template attack on ECC in [19]. That attack exploited register location-based leakage using a high-resolution inductive EM probe. Another approach to attack ECC is the so-called online template attack [1,2,15,30]. The first three approaches [1,2,15] use correlation to match the template traces to the whole attacked traces, while the fourth attack [30] instead employs several machine learning distinguishers.
Lerman et al. considered a template attack and several machine learning techniques to attack RSA. However, the targeted implementation was not secure, making the comparison with non-machine learning techniques less favorable [21]. Nascimento et al. applied a horizontal attack on an ECC implementation for the AVR ATmega microcontroller, targeting the side-channel leakage of the cmov operation. Their approach to side-channel analysis is similar to ours, but they do not use deep learning in the analysis [28]. Note that this approach was extended to unsupervised settings using clustering [27]. Poussier et al. used horizontal attacks and linear regression to conduct an attack on ECC implementations, but their approach cannot be classified as deep learning [38]. Carbone et al. used deep learning to attack a secure implementation of RSA [8]. The results from that paper show that deep learning can reach strong performance against secure implementations of RSA.

Conclusions
In this paper, we consider several profiling methods to attack Curve25519 in both unprotected and protected settings. The results show that the unprotected implementation is easy to attack with many techniques, where good results are achieved even after dimensionality reduction. We observe a significantly different behavior for the protected dataset, where only the CNN can easily break the target implementation. What is more, most of the other methods perform on the level of random guessing. For this dataset, we also see a strong negative influence of dimensionality reduction. Finally, our results with the integrated gradient visualization indicate that such methods are useful in evaluating a CNN's behavior. Indeed, when there are clear peaks in the integrated gradient, this maps to a simple classification task and, consequently, powerful attack performance.
We plan to investigate whether standard machine learning metrics like accuracy have fewer issues for public-key cryptography implementations than are reported for symmetric-key ciphers. As this gap between machine learning and side-channel metrics represents one of the most significant challenges in the SCA community today, insights about public-key particularities are needed.