## Abstract

Recent works have demonstrated that deep learning algorithms are efficient for conducting security evaluations of embedded systems and have many advantages over other methods. Unfortunately, their *hyper-parametrization* has often been kept secret by the authors, who only discussed the main design principles and the attack efficiencies in some specific contexts. This is clearly an important limitation of previous works since (1) the parametrization is known to be a challenging question in machine learning, (2) it does not allow for the reproducibility of the presented results and (3) it does not allow general conclusions to be drawn. This paper aims to address these limitations in several ways. First, completing recent works, we propose a study of deep learning algorithms when applied in the context of side-channel analysis and we discuss the links with the classical template attacks. Secondly, for the first time, we address the question of the choice of the hyper-parameters for the class of convolutional neural networks. Several benchmarks and rationales are given in the context of the analysis of a challenging masked implementation of the AES algorithm. Interestingly, our work shows that the approach followed to design the VGG-16 architecture used for image recognition also seems to be sound when it comes to fixing an architecture for side-channel analysis. To enable perfect reproducibility of our tests, this work also introduces an open platform including all the sources of the target implementation together with the campaign of electromagnetic measurements exploited in our benchmarks. This open database, named ASCAD, is the first one in its category and it has been specified to serve as a common basis for further works on this subject.


## Notes

Some libraries (such as hyperopt or hyperas, [8]) could have been tested to automate the search for accurate hyper-parameters in pre-defined sets. However, since they often perform a random search of the best parameters ([7]), they do not allow studying the impact of each hyper-parameter independently of the others on the side-channel attack success rate. Moreover, they have been defined to maximize classical machine learning evaluation metrics and not SCA ranking functions, which require a batch of test traces.

We have validated that the code and the full project can be easily tested with the Chipwhisperer platform developed by C. O’Flynn [52].

In template attacks the profiling set and the attack set are assumed to be different, namely the traces \(\varvec{\ell }_{i}\) involved in (2) have not been used for the profiling.

The name *generative* is due to the fact that it is possible to generate synthetic traces by sampling from such probability distributions.

When no ambiguity is present we will simply call the architecture ones *hyper-parameters*.

We insist here on the fact that the model is trained from scratch at each iteration of the loop over *t*.

and also different values of \(k^{\star }\) if this is relevant for the attacked algorithm.

Another metric, the *prediction error* (PE), is sometimes used in combination with the accuracy: it is defined as the expected error of the model over the training sets; \(\mathsf {PE}_{N_{\text {train}}}(\hat{\mathbf {\text {g}}}_{}) = 1 - \mathsf {ACC}_{N_{\text {train}}}(\hat{\mathbf {\text {g}}}_{})\).

The SNR is sometimes named *F*-Test to refer to its original introduction by Fisher [18]. For a noisy observation \(L_{t}\) at time sample *t* of an event \(Z\), it is defined as \(\mathsf {Var}[\mathsf {E}_{}[L_{t}\mid Z ] ]/\mathsf {E}_{}[\mathsf {Var}[L_{t}\mid Z ] ]\).

Another possibility would have been to target \(\text {state0}[3] = \text {sbox}(p[3]\oplus k[3])\oplus r[3]\) which is manipulated at the end of Step 8 in Algorithm 1.

Note that some peaks appearing in Fig. 1b have not been selected.

They are called *Fully-Connected* because each *i*-th input coordinate is *connected* to each *j*-th output via the \(\mathbf{A}[i,j]\) weight. FC layers can be seen as a special case of the linear layers where not all the connections are necessarily present. The absence of some (*i*, *j*) connections can be formalized as a constraint for the matrix \(\mathbf{A}\) consisting in forcing to 0 its (*i*, *j*)-th coordinates.

Amount of units by which a filter shifts across the trace.

*patches* in the machine learning language.

Ambiguity: Neural networks with many layers are sometimes called *Deep Neural Networks*, where the *depth* corresponds to the number of layers.

To prevent underflow, the log-softmax is usually preferred if several classification outputs must be combined.

where each layer of the same type appearing in the composition is not to be intended as exactly the same function (e.g. with same input/output dimensions), but as a function of the same form.

Straightforwardly customized to apply on 1-dimensional inputs of 700 units and outputs of 256 units.

Leading to 10 training sets of size 45,000 and 10 test sets of size 5000 to perform the 10-fold cross-validation.

Having 50 epochs and a batch size equal to 50 is also a good trade-off, but between two options that seem equivalent, we preferred the solution with the highest number of epochs.

For the sake of completeness, we have also tested the SCANet model introduced in [54]. This did not yield good performance on our dataset: we obtained a mean rank of approximately 128 for each of our desynchronizations 0, 50 and 100.

Additionally, beware that in this paper training and testing are used in the context of cross-validation and are subsets of the profiling dataset \(\mathcal {D}_{\text {profiling}}\).

We recommend performing the cross-validation only with the profiling set.

Stride pooling consists in taking the first value on each input window defined by the stride.

## References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/. Software available from tensorflow.org

Akkar, M.L., Giraud, C.: An Implementation of DES and AES, Secure against Some Attacks. In: Ç. Koç, D., Naccache, D., Paar, C. (eds.) Cryptographic Hardware and Embedded Systems–CHES 2001. Lecture Notes in Computer Science, vol. 2162, pp. 309–318. Springer, Berlin (2001)

ANSSI: Ascad database (2018). https://github.com/ANSSI-FR/ASCAD

ANSSI: secaes-atmega8515 (2018). https://github.com/ANSSI-FR/secAES-ATmega8515

Bartkewitz, T., Lemke-Rust, K.: Efficient template attacks based on probabilistic multi-class support vector machines. In: Mangard, S. (ed.) Smart Card Research and Advanced Applications CARDIS. Lecture Notes in Computer Science, vol. 7771, pp. 263–276. Springer, Berlin (2013). https://doi.org/10.1007/978-3-642-37288-9_18

Bengio, Y., Grandvalet, Y.: Bias in estimating the variance of k-fold cross-validation. In: Duchesne, P., Rémillard, B. (eds.) Statistical modeling and analysis for complex data problems, pp. 75–95. Springer, Berlin (2005)

Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. **13**(Feb), 281–305 (2012)

Bergstra, J., Yamins, D., Cox, D.D.: Hyperopt: a python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in Science Conference, pp. 13–20 (2013)

Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)

Breiman, L., et al.: Heuristics of instability and stabilization in model selection. Ann. Stat. **24**(6), 2350–2383 (1996)

Brier, E., Clavier, C., Olivier, F.: Correlation power analysis with a leakage model. In: Joye, M., Quisquater, J.J. (eds.) Cryptographic Hardware and Embedded Systems–CHES 2004. Lecture Notes in Computer Science, vol. 3156, pp. 16–29. Springer, Berlin (2004)

Cagli, E., Dumas, C., Prouff, E.: Kernel discriminant analysis for information extraction in the presence of masking. In: K. Lemke-Rust, M. Tunstall (eds.) Smart Card Research and Advanced Applications-15th International Conference, CARDIS 2016, Cannes, France, 7–9 November 2016, Revised Selected Papers, Lecture Notes in Computer Science, vol. 10146, pp. 1–22. Springer, Berlin (2016). https://doi.org/10.1007/978-3-319-54669-8_1

Cagli, E., Dumas, C., Prouff, E.: Convolutional neural networks with data augmentation against jitter-based countermeasures - profiling attacks without pre-processing. In: W. Fischer, N. Homma (eds.) Cryptographic Hardware and Embedded Systems-CHES 2017-19th International Conference, Taipei, Taiwan, September 25–28 2017, Proceedings, Lecture Notes in Computer Science, vol. 10529, pp. 45–68. Springer, Berlin (2017). https://doi.org/10.1007/978-3-319-66787-4_3

Chari, S., Rao, J., Rohatgi, P.: Template attacks. In: Kaliski Jr., B., Koç, Ç., Paar, C. (eds.) Cryptographic Hardware and Embedded Systems-CHES 2002. Lecture Notes in Computer Science, vol. 2523, pp. 13–29. Springer, Berlin (2002)

Chollet, F., et al.: Keras (2015). https://github.com/fchollet/keras

Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. **20**(3), 273–297 (1995). https://doi.org/10.1007/BF00994018

Doget, J., Prouff, E., Rivain, M., Standaert, F.X.: Univariate side channel attacks and leakage modeling. J. Cryptogr. Eng. **1**(2), 123–144 (2011)

Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. Ser. A. **222**, 309–368 (1922). https://doi.org/10.1098/rsta.1922.0009

Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. **7**(7), 179–188 (1936)

Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, New York (2001)

Gilmore, R., Hanley, N., O’Neill, M.: Neural network based attack on a masked implementation of AES. In: IEEE International Symposium on Hardware Oriented Security and Trust, HOST 2015, Washington, DC, USA, 5–7 May 2015, pp. 106–111. IEEE Computer Society (2015). https://doi.org/10.1109/HST.2015.7140247

Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)

Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011)

Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)

Goodfellow, I.J., Bengio, Y., Courville, A.C.: Deep Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge (2016)

Group, H.: The hdf group. https://www.hdfgroup.org/

Group, H.: HDF5 For Python. http://www.h5py.org/

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

Heuser, A., Zohner, M.: Intelligent machine homicide-breaking cryptographic devices using support vector machines. In: W. Schindler, S.A. Huss (eds.) Constructive Side-Channel Analysis and Secure Design-Third International Workshop, COSADE 2012, Darmstadt, Germany, 3–4 May 2012. Proceedings, Lecture Notes in Computer Science, vol. 7275, pp. 249–264. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-29912-4_18

Hospodar, G., Gierlichs, B., Mulder, E.D., Verbauwhede, I., Vandewalle, J.: Machine learning in side-channel analysis: a first study. J. Cryptogr. Eng. **1**(4), 293–302 (2011). https://doi.org/10.1007/s13389-011-0023-x

Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR (2015). arXiv:1502.03167

Jarrett, K., Kavukcuoglu, K., LeCun, Y., et al.: What is the best multi-stage architecture for object recognition? In: 2009 IEEE 12th International Conference on Computer Vision, pp. 2146–2153. IEEE (2009)

Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks (2017). arXiv preprint arXiv:1706.02515

Kocher, P., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.) Advances in Cryptology-CRYPTO’99. Lecture Notes in Computer Science, vol. 1666, pp. 388–397. Springer, Berlin (1999)

Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM **60**(6), 84–90 (2017). https://doi.org/10.1145/3065386

LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks **3361**(10), 1995 (1995)

LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. **1**(4), 541–551 (1989)

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE **86**(11), 2278–2324 (1998)

LeCun, Y., Cortes, C., Burges, C.J.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/

LeCun, Y., Huang, F.J.: Loss functions for discriminative training of energy-based models. In: R.G. Cowell, Z. Ghahramani (eds.) Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, 6–8 January 2005. Society for Artificial Intelligence and Statistics (2005). http://www.gatsby.ucl.ac.uk/aistats/fullpapers/207.pdf

Lerman, L., Bontempi, G., Markowitch, O.: Power analysis attack: an approach based on machine learning. IJACT **3**(2), 97–115 (2014). https://doi.org/10.1504/IJACT.2014.062722

Lerman, L., Medeiros, S.F., Bontempi, G., Markowitch, O.: A machine learning approach against a masked AES. In: Friedman, J., Hastie, T., Tibshirani, R. (eds.) The Elements of Statistical Learning. Springer Series in Statistics, vol. 1, pp. 61–75. Springer, New York (2014). https://doi.org/10.1007/978-3-319-08302-5_5

Lerman, L., Poussier, R., Bontempi, G., Markowitch, O., Standaert, F.: Template attacks vs. machine learning revisited (and the curse of dimensionality in side-channel analysis). In: S. Mangard, A.Y. Poschmann (eds.) Constructive Side-Channel Analysis and Secure Design-6th International Workshop, COSADE 2015, Berlin, Germany, 13–14 April 2015. Revised Selected Papers, Lecture Notes in Computer Science, vol. 9064, pp. 20–33. Springer, Berlin (2015). https://doi.org/10.1007/978-3-319-21476-4_2

Maghrebi, H., Portigliatti, T., Prouff, E.: Breaking cryptographic implementations using deep learning techniques. In: C. Carlet, M.A. Hasan, V. Saraswat (eds.) Security, Privacy, and Applied Cryptography Engineering-6th International Conference, SPACE 2016, Hyderabad, India, 14–18 December 2016. Proceedings, Lecture Notes in Computer Science, vol. 10076, pp. 3–26. Springer, Berlin (2016). https://doi.org/10.1007/978-3-319-49445-6_1

Mangard, S., Pramstaller, N., Oswald, E.: Successfully attacking masked AES hardware implementations. In: Rao, J., Sunar, B. (eds.) Cryptographic Hardware and Embedded Systems-CHES 2005. Lecture Notes in Computer Science, vol. 3659, pp. 157–171. Springer, Berlin (2005)

Martinasek, Z., Dzurenda, P., Malina, L.: Profiling power analysis attack based on MLP in DPA contest V4.2. In: 39th International Conference on Telecommunications and Signal Processing, TSP 2016, Vienna, Austria, 27–29 June 2016, pp. 223–226. IEEE (2016). https://doi.org/10.1109/TSP.2016.7760865

Martinasek, Z., Hajny, J., Malina, L.: Optimization of power analysis using neural network. In: Francillon, A., Rohatgi, P. (eds.) Smart Card Research and Advanced Applications-12th International Conference, CARDIS 2013, Berlin, Germany, 27–29 November 2013. Revised Selected Papers, Lecture Notes in Computer Science, vol. 8419, pp. 94–107. Springer, Berlin. https://doi.org/10.1007/978-3-319-08302-5_7

Martinasek, Z., Malina, L., Trasy, K.: Profiling power analysis attack based on multi-layer perceptron network. Comput. Probl. Sci. Eng. **343**, 317 (2015)

McAllester, D.A., Hazan, T., Keshet, J.: Direct loss minimization for structured prediction. In: J.D. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel, A. Culotta (eds.) Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a Meeting Held 6–9 December 2010, Vancouver, British Columbia, Canada, pp. 1594–1602. Curran Associates, Inc., Red Hook (2010). http://papers.nips.cc/paper/4069-direct-loss-minimization-for-structured-prediction

Messerges, T.: Using second-order power analysis to attack DPA resistant software. In: Koç, Ç., Paar, C. (eds.) Cryptographic Hardware and Embedded Systems-CHES 2000. Lecture Notes in Computer Science, vol. 1965, pp. 238–251. Springer, Berlin (2000)

Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Fürnkranz, J., Joachims, T. (eds.) Proceedings of the 27th International Conference on Machine Learning (ICML-10), 21–24 June 2010, Haifa, Israel, pp. 807–814. Omnipress, Madison (2010)

O’Flynn, C., Chen, Z.D.: Chipwhisperer: An open-source platform for hardware embedded security research. In: E. Prouff (ed.) Constructive Side-Channel Analysis and Secure Design-5th International Workshop, COSADE 2014, Paris, France, 13–15 April 2014. Revised Selected Papers, Lecture Notes in Computer Science, vol. 8622, pp. 243–260. Springer, Berlin (2014). https://doi.org/10.1007/978-3-319-10175-0_17

Pearson, K.: On lines and planes of closest fit to systems of points in space. Philos. Mag. **2**(11), 559–572 (1901)

Picek, S., Samiotis, I.P., Heuser, A., Kim, J., Bhasin, S., Legay, A.: On the Performance of Deep Learning for Side-channel Analysis. IACR Cryptology. ePrint Archive **2018**, 004 (2018). http://eprint.iacr.org/2018/004

Prouff, E., Rivain, M.: A generic method for secure SBox implementation. In: Kim, S., Yung, M., Lee, H.W. (eds.) WISA. Lecture Notes in Computer Science, vol. 4867, pp. 227–244. Springer, Berlin (2008)

Rokach, L., Maimon, O.: Data Mining with Decision Trees: Theory and Applications. World Scientific Publishing Co., Inc, River Edge (2008)

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. **115**(3), 211–252 (2015)

Schindler, W.: Advanced stochastic methods in side channel analysis on block ciphers in the presence of masking. J. Math. Cryptol. **2**, 291–310 (2008)

Schindler, W., Lemke, K., Paar, C.: A Stochastic model for differential side channel cryptanalysis. In: Rao, J., Sunar, B. (eds.) Cryptographic Hardware and Embedded Systems–CHES 2005. Lecture Notes in Computer Science, vol. 3659. Springer, Berlin (2005)

Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv preprint arXiv:1409.1556

Song, Y., Schwing, A.G., Zemel, R.S., Urtasun, R.: Direct loss minimization for training deep neural nets. CoRR (2015). arXiv:1511.06411

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

Weston, J., Watkins, C.: Multi-class support vector machines. Technical Report CSD-TR-98-04, Royal Holloway, University of London (1998)

Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833. Springer (2014)


## Appendices

### A signal-to-noise characterization of the target operations

See Fig. 15.
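As a reminder of what such a characterization computes, the SNR at time sample \(t\) is \(\mathsf {Var}[\mathsf {E}[L_{t}\mid Z ] ]/\mathsf {E}[\mathsf {Var}[L_{t}\mid Z ] ]\). The following is a minimal NumPy sketch of an empirical estimator of this ratio. The function name `snr` and the synthetic data are illustrative assumptions; it is not the script used to produce Fig. 15.

```python
import numpy as np

def snr(traces, labels):
    """Empirical per-sample SNR: Var[E[L_t | Z]] / E[Var[L_t | Z]]."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts / counts.sum()                    # empirical P(Z = z)
    means = np.array([traces[labels == z].mean(axis=0) for z in classes])
    variances = np.array([traces[labels == z].var(axis=0) for z in classes])
    grand_mean = weights @ means                       # E[E[L_t | Z]]
    signal = weights @ (means - grand_mean) ** 2       # Var[E[L_t | Z]]
    noise = weights @ variances                        # E[Var[L_t | Z]]
    return signal / noise

# Synthetic demo (made-up data): only time sample 3 depends on Z.
rng = np.random.default_rng(0)
z = rng.integers(0, 4, size=4000)
traces = rng.normal(0.0, 1.0, size=(4000, 8))
traces[:, 3] += z                                      # leakage of Z at sample 3
snr_values = snr(traces, z)
peak = int(np.argmax(snr_values))
```

On real traces, `labels` would be the value of a targeted intermediate variable (for instance a mask or a masked sbox output), and peaks of `snr_values` indicate the time samples where that variable leaks.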

### B The new ASCAD database

### B.1 Trace format

For the storage of the observations and the metadata (plaintext/ciphertext/key/mask values), we chose to use the current version 5 of the *Hierarchical Data Format* (HDF5). HDF5 is a multi-purpose hierarchical container format capable of storing large numerical datasets with their metadata. The specification is open and the tools are open source. The development of HDF5 is done by the HDF Group, a non-profit corporation [26]. An HDF5 file contains a POSIX-like hierarchy of numerical arrays (aka datasets) organized within groups and subgroups. Effectively, HDF5 may be seen as a file system within a file, where files are datasets and folders are groups. Moreover, HDF5 also supports lossless compression of datasets. To manipulate our HDF5 files we used the h5py Python package [27].

Our HDF5 file \(\texttt {ATMega8515\_raw\_traces.h5}\) is composed of two datasets within two groups: metadata and traces. The type of traces is HDF5 Scalar Dataset (i.e. it may be viewed as a 2-dimensional array of 8-bit integers, the first dimension being the observation index, the second dimension being a time index, and 8-bit integer being the type of each measurement). The type of metadata is HDF5 Compound Dataset, which is similar to a struct in the C language. The members of the compound dataset metadata are *plaintext*, *ciphertext*, *key* and *mask*, which are all arrays of 16 unsigned 8-bit integers. The first 14 elements of the *mask* array correspond to the masks \(r[3],\cdots ,r[16]\) in Algorithm 1 and the last two elements correspond to \(r_{\text {in}}\) and \(r_{\text {out}}\) respectively (as explained before, the masks \(r[1]\) and \(r[2]\) have been forced to 0 for test/validation purposes). We give an overview of this structure in Fig. 16.
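To make the layout concrete, the sketch below mirrors the two datasets in memory with NumPy: a 2-dimensional array of 8-bit samples and a structured array whose fields follow the four members named above. The toy sizes, the random contents and the choice of a signed 8-bit type for the samples are assumptions made for illustration. h5py accepts NumPy structured dtypes for compound datasets, so an array of this dtype can be stored as such.

```python
import numpy as np

# Compound record mirroring the four members named in the text;
# each member is an array of 16 unsigned 8-bit integers.
metadata_dtype = np.dtype([
    ("plaintext", np.uint8, (16,)),
    ("ciphertext", np.uint8, (16,)),
    ("key", np.uint8, (16,)),
    ("mask", np.uint8, (16,)),
])

n_traces, n_samples = 5, 1000        # toy sizes, not those of the real file
rng = np.random.default_rng(0)

# traces: observation index first, time index second (signed type is an assumption)
traces = rng.integers(-128, 128, size=(n_traces, n_samples), dtype=np.int8)

metadata = np.zeros(n_traces, dtype=metadata_dtype)
metadata["mask"] = rng.integers(0, 256, size=(n_traces, 16), dtype=np.uint8)

# Split the mask array as described: 14 state masks r[3..16], then r_in and r_out.
state_masks = metadata["mask"][:, :14]
r_in = metadata["mask"][:, 14]
r_out = metadata["mask"][:, 15]
```

Each metadata record occupies 64 bytes (four members of 16 bytes), which is why keeping all metadata alongside the traces costs little.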

### B.2 MNIST database and adaptations to SCA

Our raw traces format, as described in the previous subsection, is a classical representation of data for SCA analysis. This however suffers from some issues when considering it in the light of ML analysis:

- When considering a classification problem, one wants to get explicit and distinct classes where each trace is sorted (i.e. labelled) to help with the profiling phase.
- From the traces in \(\texttt {ATMega8515\_raw\_traces.h5}\), it is not clear which dataset is to be used for training, and which is to be used for the tests and the accuracy computation.
- Finally, the raw \(\texttt {ATMega8515\_raw\_traces.h5}\) file does not contain explicit *labels* for the SCA classification problem, though these can be computed given the *plaintext* and *key* metadata.

The **MNIST database** [39] is a reference in the ML image classification community, allowing any new machine learning algorithm to be fairly compared to the state-of-the-art results. The efficiency of a new algorithm is tested against the classification of \(28\times 28\) pixels greyscale, normalized and centered images of handwritten digits. The database is split in groups, each one containing data and labels:

1. The *training dataset* group (50,000 samples) contains the samples used during the training phase. This group is composed of the raw images in a file, and their labels with the same index in another file.
2. Similarly, the *test dataset* group (10,000 samples) is composed of the raw images in a file, and their labels with the same index in another file.

Following the path of the MNIST database, we propose a novel approach that fits the needs of testing ML algorithms against the SCA classification problems described in previous sections. We provide a database ASCAD with labelled datasets that will allow the *SCA community to objectively compare the efficiency of ML and DL methods*. To fit the SCA context, we have adapted the so-called MNIST training and test concepts to the more appropriate *profiling* and *attack* semantics as introduced and described in Sect. 2.2.1.^{Footnote 22}

The database information is extracted from the raw data file \(\texttt {ATMega8515\_raw\_traces.h5}\), and its structure is presented on Fig. 17. For the sake of efficiency and simplicity, the HDF5 file format has been kept for our ASCAD database. The new file \(\texttt {ASCAD.h5}\) is composed of:

- two main groups: Profiling_traces for profiling which contains \(N_p\) information, and Attack_traces for attacking which contains \(N_a\) information.^{Footnote 23} In our case, over the 60,000 labelled traces, we have chosen \(N_p= 50,000\) and \(N_a= 10,000\).
- In each main group, we find three HDF5 datasets:
  - the traces dataset contains the raw traces zoomed in on the 700 samples window of interest: the [45400..46100] interval containing the relevant information as previously described (only keeping the relevant samples in the traces allows to have a reasonably sized database),
  - the labels dataset contains the labels (following the ML classification meaning) for each trace. In our case, the value of the byte \(\text {sbox}(p[3]\oplus k[3])\) is the label of interest, leading to 256 possible classes (the sequel of the article discusses this choice, and compares it to other possible classes such as using the Hamming weight of \(\text {sbox}(p[3]\oplus k[3])\)). In Remark 2, we explain how this labelling over the outputs of the sbox processing can be simply converted into a labelling over the different key candidates. It is to be noted that the masks *are not used* when computing the labels.
  - The metadata dataset contains the information related to each trace in a HDF5 compound (aka structure), taken from \(\texttt {ATMega8515\_raw\_traces.h5}\) almost without any modification (an additional field is added, see below). From a strict ML perspective, this metadata is useless since the labels are the only necessary information to check the efficiency of an algorithm. These data are however useful from a SCA perspective since the *plaintext* byte \(p[3]\) is necessary to extract the estimated \(\hat{k}[3]\) from the label values, and the real value of the key byte \(k[3]\) is useful for the key ranking with regard to each class probability. Even though only \(p[3]\) and \(k[3]\) are useful for key ranking, we have decided to keep all the other metadata (the other *plaintext* and *key* bytes, the *ciphertext* and the *masks*) for the sake of completeness: the size of this metadata is very reasonable. Finally, a *desync* field is added to the compound structure: this uint32 field represents the optional random desynchronization applied to the traces, which simulates a jitter as explained hereafter.
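The labelling rule and its inversion can be sketched as follows. The AES sbox is recomputed from its algebraic definition so that the block is self-contained. The helper names and the toy plaintexts and key are illustrative assumptions, as is the convention that the paper's \(p[3]\) and \(k[3]\) (assumed 1-based) correspond to index 2 in 0-based Python arrays.

```python
import numpy as np

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def sbox_byte(x):
    inv = 1
    for _ in range(254):              # x^254 is the inverse of x (0 maps to 0)
        inv = gf_mul(inv, x)
    out = 0x63
    for s in range(5):                # affine map: XOR of left rotations by 0..4 bits
        out ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
    return out

SBOX = np.array([sbox_byte(x) for x in range(256)], dtype=np.uint8)
INV_SBOX = np.zeros(256, dtype=np.uint8)
INV_SBOX[SBOX] = np.arange(256, dtype=np.uint8)

BYTE = 2  # assumed 0-based position of the paper's p[3] and k[3]

def compute_labels(plaintext, key):
    """Label of each trace: sbox(p[3] xor k[3]); the masks are not involved."""
    return SBOX[plaintext[:, BYTE] ^ key[:, BYTE]]

def key_from_label(label, plaintext):
    """Convert a label (sbox output) back into the matching key-byte candidate."""
    return INV_SBOX[label] ^ plaintext[:, BYTE]

# Toy metadata (made-up values): 4 random plaintexts under one fixed key.
rng = np.random.default_rng(0)
plaintext = rng.integers(0, 256, size=(4, 16), dtype=np.uint8)
key = np.tile(np.arange(16, dtype=np.uint8), (4, 1))
labels = compute_labels(plaintext, key)
recovered = key_from_label(labels, plaintext)
```

The second function illustrates why \(p[3]\) must be kept in the metadata: a predicted label only becomes a key-byte candidate once it is combined with the plaintext byte.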

We feel that our ASCAD database is versatile enough to check the efficiency and accuracy of ML and DL algorithms applied to side-channel analysis, and we also aim at providing general purpose python scripts that will ease the process of:

- creating new databases following the same structure to attack other outputs in other AES rounds (with data extracted from \(\texttt {ATMega8515\_raw\_traces.h5}\) or any other similarly structured HDF5 file),
- modifying the profiling and attack datasets sizes and index to check their effect,
- adding a parametrized desynchronization to the traces to check the efficiency of the algorithm against jitter, and its impacts on the hyper-parameters. See the sequel of the article for a discussion on this.

As a benchmarking baseline, we will actually provide three HDF5 files that form our reference database:

- \(\texttt {ASCAD.h5}\), which contains profiling and attack datasets as previously described. The traces are synchronized and there is no jitter,
- \(\texttt {ASCAD\_desync50.h5}\), which contains traces with a 50 samples window maximum jitter,
- \(\texttt {ASCAD\_desync100.h5}\), which contains traces with a 100 samples window maximum jitter.

The method used to simulate the jitter is described in Sect. 3.3.2.
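Since Sect. 3.3.2 is not reproduced in this appendix, the sketch below only illustrates one plausible way of producing such files: each extracted window is shifted by an offset drawn uniformly below the maximum jitter, and the offset is recorded as the uint32 *desync* value. The function name, the uniform draw and the toy trace layout are assumptions, not a restatement of the authors' procedure.

```python
import numpy as np

def extract_with_desync(raw_traces, start, length, max_desync, rng):
    """Cut a `length`-sample window per trace, shifted by a random offset < max_desync."""
    n = raw_traces.shape[0]
    if max_desync > 0:
        desync = rng.integers(0, max_desync, size=n)
    else:
        desync = np.zeros(n, dtype=np.int64)          # synchronized traces
    windows = np.stack([
        raw_traces[i, start + d : start + d + length]
        for i, d in enumerate(desync)
    ])
    return windows, desync.astype(np.uint32)

# Toy raw traces (made-up): every sample equals its own time index,
# so the first value of each window reveals the applied shift.
raw = np.tile(np.arange(2000), (6, 1))
rng = np.random.default_rng(0)
windows, desync = extract_with_desync(raw, start=500, length=700,
                                      max_desync=50, rng=rng)
```

With `max_desync=0` the function returns the synchronized windows, which corresponds to the first of the three reference files.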

### C Multi-layer perceptrons (MLP)

### C.1 Core principles and constructions

*Multi-Layer Perceptrons* (MLPs) are associated with a model/function \(\hat{\mathbf {\text {g}}}_{}\) that is composed of multiple linear functions and some nonlinear *activation functions* which are efficiently computable and whose derivatives are bounded and efficient to evaluate. In short, an MLP can be defined as follows:

\[ \hat{\mathbf {\text {g}}}_{}(\varvec{\ell }_{}) = \mathsf {s}\circ \lambda _n\circ \alpha _{n-1}\circ \lambda _{n-1}\circ \cdots \circ \alpha _1\circ \lambda _1(\varvec{\ell }_{}) \qquad (10) \]

where:

- the \(\lambda _i\) functions are the so-called *Fully-Connected* (FC) layers and are expressible as affine functions: denoting \(\varvec{\ell }_{}\) the \(D\)-dimensional input of an FC, its output is given by \(\mathbf{A}\varvec{\ell }_{} + \mathbf {B}\), \(\mathbf{A}\in \mathbb {R}^{C\times D}\) being a matrix of weights and \(\mathbf {B}\in \mathbb {R}^C\) a vector of biases,
- the \(\alpha _i\) are the so-called *activation functions* (ACT): an activation function is a nonlinear real function that is applied independently to each coordinate of its input (e.g. the ReLU activation function applies \(\max (0,x)\) to each coordinate *x*),
- \(\mathsf {s}\) is the so-called *softmax* function (SOFT): \(\mathsf {s}(\varvec{\ell }_{})[i] = \frac{e^{\varvec{\ell }_{}[i]}}{\sum _{j}e^{\varvec{\ell }_{}[j]}}\).

In the rest of the paper, \({\texttt {MLP}}(n_{\text {layer}}, n_{\text {units}}, \alpha )\) will denote an MLP architecture with \(n_{\text {layer}}\) layers, \(n_{\text {units}}\) units (a.k.a. nodes or neurons) and \(\alpha \) as activation function for each hidden layer. Such an MLP corresponds to (10) with \(\alpha _i=\alpha \) for every *i*, with \(n=n_{\text {layer}}\), and with \(\lambda _i\) defined for \(D=C=n_{\text {units}}\) if \(i\in [2...n_{\text {layer}}-1]\) and for \((C,D)=(n_{\text {units}},700)\) if \(i=1\) and \((C,D)= (256,n_{\text {units}})\) for \(i=n_{\text {layer}}\) (indeed inputs of the model are 700-dimensional leakage traces while the outputs are in [0..255]).
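The dimensions given above can be checked with a short NumPy sketch of the forward pass of \({\texttt {MLP}}(n_{\text {layer}}, n_{\text {units}}, \alpha )\). The function names are hypothetical, the weights are random rather than trained, and the Glorot-style uniform initialization is used only to obtain plausible values. This is an illustration of the composition of FC layers, activations and softmax, not the benchmark implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())            # shifting by the max avoids overflow
    return e / e.sum()

def init_mlp(n_layer, n_units, rng, n_in=700, n_out=256):
    """Return n_layer (A, B) pairs with the shapes (C, D) described in the text."""
    dims = [n_in] + [n_units] * (n_layer - 1) + [n_out]
    layers = []
    for d, c in zip(dims[:-1], dims[1:]):
        limit = np.sqrt(6.0 / (d + c))  # Glorot-style uniform bound (assumed)
        layers.append((rng.uniform(-limit, limit, size=(c, d)), np.zeros(c)))
    return layers

def mlp_forward(layers, trace, act=relu):
    x = trace
    for A, B in layers[:-1]:
        x = act(A @ x + B)              # FC layer followed by the activation
    A, B = layers[-1]
    return softmax(A @ x + B)           # last FC layer followed by the softmax

rng = np.random.default_rng(0)
layers = init_mlp(n_layer=6, n_units=200, rng=rng)
probs = mlp_forward(layers, rng.normal(size=700))
```

For `n_layer=6` and `n_units=200` the first matrix has shape (200, 700), the four intermediate ones (200, 200) and the last one (256, 200), so the output is a probability vector over the 256 classes.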

### C.2 Choice of the hyper-parameters

As explained in Sect. 3.3, the strategy we applied to tune the hyper-parameters is divided into three steps. First, we fix a base architecture that we denote \(\hbox {MLP}_\mathrm{base}\) corresponding to \({\texttt {MLP}}(6,200,ReLU)\), the 6-layer MLP with 200 units and the ReLU activation function for each layer. Secondly, we tune the training parameters, leading to the parametrization of a procedure \(\texttt {Training}(n_{\text {epochs}},{batch\_size},{optimizer},{learning\_rate})\). Then, different variations of \(\hbox {MLP}_\mathrm{base}\) are tested by studying the impact of each architecture parameter on the model efficiency after training with the procedure fixed during the previous step. The full strategy aims at providing us with an architecture \(\hbox {MLP}_\mathrm{best}\) and a training procedure that are good w.r.t. the evaluation metrics listed in Sect. 2.4.2.

#### C.2.1 Training parameters

This subsection aims at studying how the mean rank of the side-channel attack involving the trained model is impacted by the length of the training dataset, the number of epochs, the batch size and the learning rates/optimizers.
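The rank metric itself is defined in Sect. 2.4.2, which is not reproduced here. As a reminder of the idea, the sketch below ranks the correct key byte after a batch of attack traces by summing, for every key candidate, the log-probabilities that the model assigns to the label this candidate would imply. The function name, the random permutation standing in for the AES sbox and the synthetic model outputs are all assumptions made for illustration.

```python
import numpy as np

def key_rank(log_probs, plaintext_bytes, table, true_key):
    """Rank of true_key (0 = best) after accumulating log-probabilities.

    log_probs: (N, 256) log-probabilities output by the model for N traces.
    table:     256-entry lookup mapping (plaintext xor key) to a label.
    """
    candidates = np.arange(256)
    # Label hypothesised for each trace under each key candidate: shape (N, 256)
    idx = plaintext_bytes[:, None].astype(np.intp) ^ candidates[None, :]
    hyp = table[idx].astype(np.intp)
    scores = np.take_along_axis(log_probs, hyp, axis=1).sum(axis=0)
    return int(np.sum(scores > scores[true_key]))

rng = np.random.default_rng(1)
table = rng.permutation(256)          # stand-in for the AES sbox (assumption)
true_key = 0x2A
plaintexts = rng.integers(0, 256, size=50)

# Synthetic "model": puts probability 0.9 on the correct label of each trace.
true_labels = table[plaintexts ^ true_key]
probs = np.full((50, 256), 0.1 / 255)
probs[np.arange(50), true_labels] = 0.9
rank = key_rank(np.log(probs), plaintexts, table, true_key)
```

Averaging such ranks over the folds of a cross-validation, for an increasing number of attack traces, gives a mean rank curve of the kind discussed in this subsection.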

First we evaluate the impact of the size \(N_{\text {train}}\) of the training set on the success of a neural-network-based SCA. We performed a 10-fold cross-validation with different dataset sizes, while keeping a constant computational time during the training step for a fair comparison. This is done by adapting the number of epochs to the number of traces in the dataset. We expect the performance of the model to increase with the size of the training set up to a certain threshold that determines the optimal number of traces. The neural network used for this experiment is \(\hbox {MLP}_\mathrm{base}\) trained with the RMSProp optimizer, a learning rate of \(10^{-5}\) and a batch size of 100. The weights are initialized from a uniform distribution of mean 0 as defined in Glorot and Bengio’s article [22]. Figure 18 shows the mean rank function for different sizes of training set. Our empirical results on the full ATMega8515_raw_traces.h5 show that approximately 50,000 training traces are required for a full success of the attack/test in less than 1000 traces. That is why ASCAD is composed of a profiling set \(\mathcal {D}_{\text {profiling}}\) of size 50,000 and an attack set \(\mathcal {D}_{\text {attack}}\) of size 10,000. Based on these results, the benchmarks in the rest of the paper were performed on ASCAD profiling traces \(\mathcal {D}_{\text {profiling}}\).
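The two mechanisms used in this experiment, the 10-fold split and the adaptation of the number of epochs to the dataset size, can be sketched as follows. The helper names are hypothetical, and "constant computational time" is interpreted here as keeping the product of the number of traces and the number of epochs constant, which is an assumption. The reference budget of 200 epochs for 50,000 traces is used only as an example.

```python
import numpy as np

def kfold_indices(n_traces, k, rng):
    """Yield (train, test) index arrays for a k-fold cross-validation."""
    folds = np.array_split(rng.permutation(n_traces), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def epochs_for_budget(n_train, ref_n_train, ref_epochs):
    """Number of epochs keeping n_train * n_epochs roughly constant (assumed budget)."""
    return max(1, round(ref_epochs * ref_n_train / n_train))

# 10-fold split of a 50,000-trace profiling set.
splits = list(kfold_indices(50_000, 10, np.random.default_rng(0)))

# Halving the dataset doubles the number of epochs under this budget.
epochs_half = epochs_for_budget(25_000, ref_n_train=50_000, ref_epochs=200)
```

With 50,000 profiling traces each fold trains on 45,000 traces and tests on the remaining 5,000, and the model is re-initialized for every fold.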

Then we select the best values for the number of epochs and the batch size of the training step. Figure 19 shows the empirical results for different values \(n_{\text {epochs}}\) with a 10-fold cross-validation on \(\mathcal {D}_{\text {profiling}}\). We notice that the number of epochs has a significant impact on the rank functions. Taking into account the trade-off between computation time and SCA-efficiency, the best results are obtained either with 400 epochs and a batch size of 500, or with 200 epochs and a batch size of 100. However, the latter pair of parameters gives the best accuracy and the most stable rank functions, which leads us to select these values for the rest of our benchmarks on MLP. We insist on the fact that these values are obtained as a trade-off that allows us to perform multiple cross-validations in a reasonable amount of time. When the batch size is fixed to 100, we can obtain better results by increasing the number of epochs and consequently the training time. Therefore, in the case of a single SCA attack in a given amount of time, we recommend fixing the batch size to 100 and progressively increasing the number of epochs beyond 200 until the time allotted to the training step is reached (in our experimental results we did not notice any improvement in the SCA-efficiency after 800 epochs).
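One way to compare the two candidate settings is to count the number of gradient updates they perform. The sketch below assumes the usual 10-fold convention in which each fold trains on 9/10 of the 50,000 profiling traces; the helper name is hypothetical.

```python
import math


def gradient_updates(n_train, n_epochs, batch_size):
    """Number of mini-batch updates performed over a whole training run."""
    return n_epochs * math.ceil(n_train / batch_size)


# 10-fold cross-validation: each fold trains on 9/10 of the profiling set.
n_fold_train = 50_000 * 9 // 10

updates_400_500 = gradient_updates(n_fold_train, n_epochs=400, batch_size=500)
updates_200_100 = gradient_updates(n_fold_train, n_epochs=200, batch_size=100)
```

With these assumptions, the setting (200 epochs, batch size 100) performs 90,000 updates against 36,000 for (400 epochs, batch size 500), although it makes half as many passes over the data.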

The last training parameters that we tune are the gradient descent optimization method (also called optimizer) and the learning rate. Empirical results in Figs. 20 and 21 show that these parameters also have a high impact on the success of the attack. We managed to obtain good results with \({optimizer}=RMSProp\) and a learning rate equal to \(10^{-5}\) (which confirms the soundness of the choices made for experiments reported in Figs. 18 and 19).
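For reference, a single RMSProp update can be sketched in plain Python as follows. The decay rate `rho` and the stabilizing constant `eps` are common default values and are assumptions here; the paper only reports the optimizer name and the learning rate.

```python
import math


def rmsprop_step(weights, grads, cache, learning_rate=1e-5, rho=0.9, eps=1e-7):
    """Apply one RMSProp update and return the new weights and running cache.

    `cache` holds the running average of the squared gradients, one entry
    per weight. `rho` and `eps` are assumed defaults, not paper values.
    """
    new_weights, new_cache = [], []
    for w, g, c in zip(weights, grads, cache):
        c = rho * c + (1.0 - rho) * g * g  # running mean of g^2
        new_cache.append(c)
        new_weights.append(w - learning_rate * g / (math.sqrt(c) + eps))
    return new_weights, new_cache


w, c = rmsprop_step([1.0], [2.0], [0.0])
```

Because each gradient is divided by the root of its running squared mean, the step size is roughly bounded by the learning rate itself, which helps explain why this parameter has such a strong effect on the attack.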

#### 1.2.2 C.2.2 Architecture parameters

As described in the previous subsection, an MLP architecture is characterized by three architecture hyper-parameters: the number of layers, the number of units of each layer and the activation functions. In this section, we use the training procedure \(\texttt {Training}(200,100,RMSProp,10^{-5})\) determined in the previous section and we revisit our initial choice \(\hbox {MLP}_\mathrm{base}\) to challenge its hyper-parameters.

First we evaluate the optimal number of layers with a fixed number of nodes: models \({\texttt {MLP}}(n_{\text {layers}},200,ReLU)\) are trained for different values \(n_{\text {layers}}\in [3..11]\). Figure 22 plots the mean rank function, the mean accuracy and the average training time. All the mean rank functions converge to 0 when the number of traces increases. However, the 6-layer MLP has a slight advantage with fewer than 600 traces and has the best mean accuracy.

Then we evaluate the optimal number of units per layer. Small values lead to simple models that are not powerful enough to represent the dataset trends, whereas high values lead to complex models that are difficult to train and more susceptible to over-fitting. We limit our empirical study to MLPs with the same number of units per layer. Figure 23 shows the obtained results. With the previously fixed training parameters, the performance of the attack seems to increase once the number of units per layer equals or exceeds 200.
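To give an idea of how model complexity grows with these two hyper-parameters, the sketch below counts the trainable parameters of a fully connected network. It rests on three assumptions that are not stated in this section: the input traces have 700 samples, the output layer has 256 classes, and \(n_{\text {layers}}\) counts the output layer. The function name is hypothetical.

```python
def mlp_parameter_count(n_layers, n_units, input_dim=700, n_classes=256):
    """Weights plus biases of a dense network with n_layers layers in total.

    Assumes n_layers - 1 hidden layers of n_units units followed by one
    output layer of n_classes units (input_dim and n_classes are assumed).
    """
    sizes = [input_dim] + [n_units] * (n_layers - 1) + [n_classes]
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


base_size = mlp_parameter_count(6, 200)
wide_size = mlp_parameter_count(6, 400)
deep_size = mlp_parameter_count(11, 200)
```

Under these assumptions the base architecture has 352,456 parameters. Doubling the width increases the count far more than adding layers does, since the hidden-to-hidden weight matrices grow quadratically with the number of units.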

Finally we study the effect of the activation function on the performance of the neural network. Since their introduction in deep learning, Rectified Linear Units (ReLUs) have proved to be the most suitable choice for a number of problems, most notably in image recognition [23, 32, 51]. The obtained networks have sparse representations, and the simple definition of the \(ReLU(x) = \max (0,x)\) activation function allows quick computations. Figure 24 plots the experimental results obtained with \({\texttt {MLP}}(6,200,\alpha )\) for different activation functions \(\alpha \). The best results are obtained with *ReLU*, *tanh* and *softsign*, which is a variation of *tanh*. We select the *ReLU* activation function since it provides state-of-the-art results and its computation time is lower than that of the two other functions.
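The three best-performing activation functions have simple closed forms, recalled in the minimal sketch below (scalar versions, standard library only). The softsign definition \(x/(1+|x|)\) is the usual one and is not spelled out in the text.

```python
import math


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def softsign(x):
    """Softsign: x / (1 + |x|), a tanh-like function that saturates slowly."""
    return x / (1.0 + abs(x))


def tanh(x):
    """Hyperbolic tangent, computed with the standard library."""
    return math.tanh(x)


samples = [-3.0, -1.0, 0.0, 1.0, 3.0]
table = [(x, relu(x), tanh(x), softsign(x)) for x in samples]
```

ReLU requires only a comparison, whereas tanh needs an exponential evaluation and softsign a division, which is consistent with the lower computation time reported for ReLU.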

Benchmarks reported in this section confirm that the architecture \({\texttt {MLP}}(6,200,ReLU)\) offers a good compromise between efficiency and computational time when trained with the following parameters: \(\texttt {Training}(200,100,RMSProp,10^{-5})\). In the rest of this paper, this architecture is denoted \(\hbox {MLP}_\mathrm{best}\). We insist on the fact that \(\hbox {MLP}_\mathrm{best}\) has a decent SCA-efficiency with 200 epochs, but this efficiency continues to improve as the number of epochs increases up to 800 (in our experiments we did not notice any significant improvement after 800 epochs). Hence, depending on the amount of time allocated to the training of \(\hbox {MLP}_\mathrm{best}\), it may be worthwhile to increase the number of epochs in the range [200..800].

### 1.3 C.3 Open discussions

#### 1.3.1 C.3.1 Self-normalizing neural networks

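Recently, a new type of MLP called Self-Normalizing Neural Networks (SNN) has been introduced in [33]. It aims to improve the robustness of MLPs against perturbations during the training step and to reduce the variance of the training error. Its architecture is a slight variation of the standard MLP architecture: the activation function, called “scaled exponential linear unit” (SELU), is given by:

\[ \mathrm{selu}(x) = \lambda {\left\{ \begin{array}{ll} x &{} \text {if } x > 0, \\ \alpha \left( e^{x} - 1\right) &{} \text {if } x \le 0, \end{array}\right. } \]

where \(\lambda > 1\) and \(\alpha > 0\) are the fixed constants specified in [33].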

Furthermore, the initialization of the weights is performed from a standard normal distribution. These two modifications imply that the neural network is self-normalizing, i.e. the mean and variance of the activation functions across the different layers stay within small pre-defined intervals. This new architecture outperformed standard MLPs on a number of benchmarks, including MNIST.
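A scalar SELU can be sketched as follows. The numerical constants are the values commonly quoted for the self-normalizing fixed point with zero mean and unit variance; they are not given in this section and are therefore stated here as an assumption.

```python
import math

# Constants commonly quoted for SELU (assumed, not taken from this section).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit."""
    if x > 0:
        return SELU_LAMBDA * x
    return SELU_LAMBDA * SELU_ALPHA * (math.exp(x) - 1.0)


# The negative branch saturates at -lambda * alpha for very negative inputs.
lower_bound = -SELU_LAMBDA * SELU_ALPHA
```

Unlike ReLU, the function takes negative values, which is what allows the mean of the activations to be pulled back towards zero from one layer to the next.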

We test an SNN architecture with 6 layers and 200 units per layer on ASCAD and we compare it with \(\hbox {MLP}_\mathrm{best}\). Experimental results in Fig. 25 show that the rank functions of the two architectures are very similar: the SNN architecture does not seem to bring any significant improvement in the context of SCA. The accuracy is slightly higher with SNN, as expected from a machine learning perspective, but this has no influence on the overall rank function.

#### 1.3.2 C.3.2 Hamming weight versus identity labelling

We test our \(\hbox {MLP}_\mathrm{best}\) architecture on the SCA dataset with a labelling of the traces modified to take the Hamming weight of the sensitive value instead of the value itself. This labelling strategy reduces the number of classes to predict (9 values for the Hamming weight instead of 256 values for a byte). Consequently, the model trained on the new dataset is less complex than the model trained on the full values. We also modify the computation of the rank function in (2) by taking into account the distribution of the Hamming weight values. The corresponding rank functions are plotted in Fig. 26. They show that the new labelling strategy is less effective. Indeed, even if the Hamming weight model is less complex and requires a smaller number of epochs for the training step, the conditional probability approximated by the neural network is less discriminating (which is a consequence of the reduced number of classes). Moreover, the weighting coefficients in (2) (deduced from the Hamming weight distribution for uniform data) may increase the variance of the rank (viewed as a random variable) since, e.g., an error on a value with Hamming weight 0 or 8 counts \(\binom{8}{4} =70\) times as much as an error on a value of Hamming weight 4. Finally, assuming that the deterministic part of the leakage corresponds to a Hamming weight may be an incorrect abstraction and may induce errors in the modelling.
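The imbalance between Hamming weight classes can be checked directly. The sketch below relabels the 256 byte values by their Hamming weight and counts the size of each of the 9 classes for uniformly distributed data; the helper name is hypothetical.

```python
from math import comb


def hamming_weight(byte):
    """Number of bits set to 1 in a byte value (0..255)."""
    return bin(byte).count("1")


# Size of each of the 9 classes when the byte is uniformly distributed.
class_sizes = [0] * 9
for value in range(256):
    class_sizes[hamming_weight(value)] += 1

# Class w contains exactly C(8, w) byte values.
binomials = [comb(8, w) for w in range(9)]

# Ratio between the most and least populated classes.
imbalance = class_sizes[4] // class_sizes[0]
```

Classes 0 and 8 each contain a single byte value while class 4 contains 70, which is the factor 70 mentioned above.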

#### 1.3.3 C.3.3 Comparison with template attacks

We compare \(\hbox {MLP}_\mathrm{best}\) with standard template attacks (aka quadratic discriminant analysis, or QDA in the machine learning community). We first perform an unsupervised dimension reduction to extract meaningful features. For this task we use a classical PCA which is parametrized by the number of components to extract. Then the classification task is performed with a QDA (i.e. Template Attacks). Note that, contrary to QDA, neural networks do not require the preprocessing feature extraction step since this task is realized by the first layers of the networks. Figure 27 shows the results obtained with different numbers of components extracted from the PCA.

#### 1.3.4 C.3.4 First-order attacks

By using the mask values contained in ASCAD, it is possible to compute the masked output of the first-round AES Sbox:

\[ z = \mathrm{Sbox}(p[3] \oplus k[3]) \oplus r_{\text {out}}, \]

where \(z\) is the sensitive value and \(p[3]\), \(k[3]\), \(r_{\text {out}}\) are the plaintext byte, the key byte and the mask byte, respectively.

Therefore, we can mount a first-order SCA by labelling the traces with the masked output values and we can test the performance of \(\hbox {MLP}_\mathrm{best}\) in this weaker context. The results in Fig. 28 show that, without any modification in the architecture and the training parameters, \(\hbox {MLP}_\mathrm{best}\) easily succeeds in this attack. The rank functions converge to 0 with 20 epochs and only 4 traces are required to determine the correct key. We also managed to obtain an accuracy of 0.028, and we did not notice any over-fitting with 200 epochs.
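The labelling and ranking steps of such a first-order attack can be sketched as follows. Several points are assumptions: the label is taken to be \(\mathrm{Sbox}(p[3] \oplus k[3]) \oplus r_{\text {out}}\), the mask bytes are assumed to be known for the attack traces as well, and the rank is computed with a standard log-likelihood accumulation rather than with the exact formula (2) of the paper. All function names are hypothetical and the probability vectors are synthetic, not model outputs.

```python
import math


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result


def aes_sbox(x):
    """AES Sbox: field inversion (x^254, with 0 mapped to 0), then affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    rot = lambda v, s: ((v << s) | (v >> (8 - s))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [aes_sbox(x) for x in range(256)]


def masked_label(p, k, r_out):
    """Masked Sbox output used as the class label of a trace."""
    return SBOX[p ^ k] ^ r_out


def key_rank(probas, plaintexts, masks, true_key):
    """Rank of the true key after accumulating log-likelihoods over traces."""
    scores = [0.0] * 256
    for proba, p, r in zip(probas, plaintexts, masks):
        for k in range(256):
            scores[k] += math.log(proba[masked_label(p, k, r)] + 1e-40)
    return sum(1 for k in range(256) if scores[k] > scores[true_key])


# Synthetic check: a classifier putting 0.9 on the correct label.
true_key = 0x2B
plaintexts = [0x00, 0x11, 0x22, 0x33]
masks = [0xA5, 0x5A, 0x0F, 0xF0]
probas = []
for p, r in zip(plaintexts, masks):
    proba = [0.1 / 255] * 256
    proba[masked_label(p, true_key, r)] = 0.9
    probas.append(proba)
rank = key_rank(probas, plaintexts, masks, true_key)
```

A rank of 0 means that the correct key byte has the highest accumulated score. With a real model, `probas` would be the softmax outputs obtained on the attack traces.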

### 1.4 C.4 Efficiency results on ASCAD database

We trained \(\hbox {MLP}_\mathrm{best}\) on ASCAD Database with and without desynchronization and with different numbers of epochs. As shown in Fig. 29, MLP is very sensitive to desynchronization and increasing the number of epochs is not enough to get better results.
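Desynchronization can be simulated by extracting the window of interest from a longer raw trace at a random offset. The sketch below is only an illustration of this idea: the function name, the window position, the window length and the maximum shift are hypothetical and do not reproduce the parameters used to build the ASCAD variants.

```python
import random


def extract_window(raw_trace, start, length, max_shift, rng):
    """Return a window of `length` samples starting at `start` + random shift.

    The shift is drawn uniformly in [0, max_shift]; max_shift = 0 gives a
    perfectly synchronized window.
    """
    shift = rng.randint(0, max_shift)
    window = raw_trace[start + shift:start + shift + length]
    return window, shift


rng = random.Random(0)            # fixed seed for reproducibility
raw_trace = list(range(1000))     # placeholder samples, not real measurements
window, shift = extract_window(raw_trace, start=100, length=700,
                               max_shift=50, rng=rng)
```

Since a fully connected layer assigns one weight to each sample position, a leakage that moves from one position to another between traces is spread over many inputs, which is consistent with the sensitivity observed for the MLP.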

### D Example of tested CNN architecture

See Fig. 30.

### E CNN: Supplementary materials for the hyper-parameters’ choice

*Pooling Layer and Padding* We tested three different options for all the pooling layers of \(\hbox {CNN}_\mathrm{base}\): max pooling, average pooling, and *stride pooling*^{Footnote 24}. Contrary to standard CNN architectures used in computer vision that rely on max pooling, we obtained our best results with average pooling layers (Fig. 31).
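The three pooling options can be illustrated on a one-dimensional signal. In the sketch below, max and average pooling follow their usual definitions with non-overlapping windows; stride pooling is interpreted as plain subsampling (keeping the first sample of each window), which is an assumption since the footnote defining it is not reproduced here.

```python
def max_pool(signal, size):
    """Maximum of each non-overlapping window of `size` samples."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]


def average_pool(signal, size):
    """Mean of each non-overlapping window of `size` samples."""
    return [sum(signal[i:i + size]) / size
            for i in range(0, len(signal) - size + 1, size)]


def stride_pool(signal, size):
    """Subsampling: first sample of each window (assumed interpretation)."""
    return [signal[i] for i in range(0, len(signal) - size + 1, size)]


signal = [1, 3, 2, 8, 5, 5]
pooled = {
    "max": max_pool(signal, 2),
    "average": average_pool(signal, 2),
    "stride": stride_pool(signal, 2),
}
```

Average pooling is the only one of the three that combines every sample of the window into its output; max and stride pooling each retain a single sample per window.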

*Padding* Finally we tested two configurations of padding. Results in Fig. 32 show that this parameter does not have a significant impact on the SCA-efficiency.

## About this article

### Cite this article

Benadjila, R., Prouff, E., Strullu, R. *et al.* Deep learning for side-channel analysis and introduction to ASCAD database. *J Cryptogr Eng* **10**, 163–188 (2020). https://doi.org/10.1007/s13389-019-00220-8
