
Learning When to Stop: A Mutual Information Approach to Prevent Overfitting in Profiled Side-Channel Analysis

Conference paper
Constructive Side-Channel Analysis and Secure Design (COSADE 2021)
Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12910)

Abstract

Today, deep neural networks are a common choice for conducting profiled side-channel analysis. Unfortunately, it is not trivial to find neural network hyperparameters that result in top-performing attacks. One hyperparameter that steers the training process is the number of epochs for which the network is trained. If the training is too short, the network does not reach its full capacity, while if the training is too long, the network overfits and cannot generalize to unseen examples. In this paper, we tackle the problem of determining the correct epoch at which to stop the training in deep learning-based side-channel analysis. We demonstrate that the amount of information, or, more precisely, the mutual information transferred to the output layer, can be measured and used as a reference metric to determine the epoch at which the network offers optimal generalization. To validate the proposed methodology, we provide extensive experimental results.

This work was supported by the European Union’s H2020 Programme under grant agreement number ICT-731591 (REASSURE).


Notes

  1. It is also possible for a machine learning model to underfit if the training is stopped too early. Still, this is usually of less concern, as the resulting machine learning model would still generalize to unseen data but not use its full potential, i.e., the attack would not be as powerful as possible.

  2. A saturating activation function squeezes the input data, i.e., the output is bounded to a certain range.

  3. Note that the information plane figures show different layers, but it is not possible to recognize a specific layer just by "observing" the graph, i.e., there is no pre-specified behavior for a specific layer. We store and plot the data for each layer separately.



Appendices

A Bin Size Estimators

To estimate the probability density, a critical step is to determine the bin width, which is the user-supplied parameter of the histogram estimator. The results of our experiments are shown in Figs. 13, 14, and 15. As shown in the plots on the right, any bin width larger than 15 leads to a final key rank lower than 4 (a key rank equal to 1 indicates successful key recovery). The key rank is computed for a separate test set and is obtained by selecting the machine learning model at the epoch that gives the highest \(I(T_n;Y)\) for each tested bin width. The plots on the left side of Figs. 13, 14, and 15 show the value of \(I(T_n;Y)\) w.r.t. the number of epochs for all tested bin sizes. As we can see, if the bin size is too small, the mutual information \(I(T_n;Y)\) barely changes.
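As an illustration, the following minimal Python sketch shows such a histogram-based estimator. It is not the implementation used for the experiments: it assumes that T is a NumPy array of output-layer activations (one row per trace) and y the array of integer labels, and it computes \(I(T;Y) = H(T) - H(T|Y)\) after discretizing the activations into a fixed number of equally spaced bins.

    import numpy as np

    def entropy_of_rows(rows):
        # Shannon entropy of the empirical distribution over distinct rows.
        _, counts = np.unique(rows, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def mutual_information(T, y, num_bins=100):
        # Histogram estimator: discretize the activations into equally spaced bins.
        edges = np.linspace(T.min(), T.max(), num_bins + 1)
        T_binned = np.digitize(T, edges)
        h_t = entropy_of_rows(T_binned)                       # H(T)
        h_t_given_y = 0.0                                     # H(T|Y)
        for label in np.unique(y):
            mask = (y == label)
            h_t_given_y += mask.mean() * entropy_of_rows(T_binned[mask])
        return h_t - h_t_given_y                              # I(T;Y)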

Fig. 13. (to be viewed in color) The influence of the number of bins in the calculation of \(I(T_n;Y)\) for the ASCAD dataset. Average \(I(T_n;Y)\) w.r.t. the number of epochs for different bin sizes (left). The final key rank when the maximum value of \(I(T_n;Y)\) is used as a reference metric, for different bin sizes (1 to 256) (right).

Fig. 14. (to be viewed in color) The influence of the number of bins in the calculation of \(I(T_n;Y)\) for the DPAv4 dataset. Average \(I(T_n;Y)\) w.r.t. the number of epochs for different bin sizes (left). The final key rank when the maximum value of \(I(T_n;Y)\) is used as a reference metric, for different bin sizes (1 to 256) (right).

Figure 16 shows results for different well-known estimators compared with fixing the number of bins to 100. We tested the Freedman-Diaconis ('fd'), 'sturges', 'auto' (the maximum of the 'fd' and 'sturges' estimators), 'rice', 'scott', square-root ('sqrt'), and 'doane' estimators. The results show the guessing entropy obtained when early stopping uses \(I(T_{n};Y)\) as the metric, for each of the tested bin size estimators. Notice that they all lead to successful key recovery with similar guessing entropy convergence for all tested datasets and leakage models.
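The estimator names above match NumPy's built-in bin-width rules. Under that assumption, the number of bins each rule yields for a given set of output-layer activations can be obtained as in the sketch below, where the random data only stands in for real activations; the resulting bin count is then fed to the histogram-based mutual information estimator.

    import numpy as np

    def bins_from_estimator(activations, rule):
        # np.histogram_bin_edges implements the named rules ('fd', 'sturges', ...).
        edges = np.histogram_bin_edges(activations.ravel(), bins=rule)
        return len(edges) - 1

    T = np.random.rand(1000, 256)   # placeholder for output-layer activations
    for rule in ['fd', 'sturges', 'auto', 'rice', 'scott', 'sqrt', 'doane']:
        print(rule, bins_from_estimator(T, rule))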

B From Information Path to the Best Epoch to Stop the Training

We use the information path to visualize how much information each hidden layer has learned about the true labels (Y). The amount of information learned is an estimate of how well each hidden layer fits the distribution of Y, which is directly derived from the selected leakage model.

We repeat the experiment for the ASCAD random key dataset (Fig. 17, MLP architecture). Using the visualization provided by the information path, we confirm that all hidden layers are transformed representations of the input traces and contain information about Y. However, the estimate of the mutual information between Y and the output layer probabilities (after the Softmax activation layer) best captures the prediction capability of the network. In our experiments, the best epoch to stop the training is the epoch where \(I(T_{n};Y)\) reaches its maximum value.
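As a sketch of how this selection rule can be integrated into training, assuming TensorFlow/Keras and the mutual_information helper sketched in Appendix A, a callback can evaluate \(I(T_{n};Y)\) on the validation set after every epoch and restore the weights of the epoch where it was maximal. The names and structure below are illustrative, not the authors' implementation.

    import numpy as np
    import tensorflow as tf

    class MaxMutualInformation(tf.keras.callbacks.Callback):
        def __init__(self, x_val, y_val_labels, num_bins=100):
            super().__init__()
            self.x_val, self.y_val = x_val, y_val_labels
            self.num_bins = num_bins
            self.best_mi, self.best_epoch, self.best_weights = -np.inf, None, None

        def on_epoch_end(self, epoch, logs=None):
            T = self.model.predict(self.x_val, verbose=0)   # output-layer probabilities
            mi = mutual_information(T, self.y_val, self.num_bins)
            if mi > self.best_mi:
                self.best_mi, self.best_epoch = mi, epoch
                self.best_weights = self.model.get_weights()

        def on_train_end(self, logs=None):
            # Roll back to the epoch with the maximal validation I(T_n;Y).
            if self.best_weights is not None:
                self.model.set_weights(self.best_weights)

Passing an instance of this callback to model.fit then leaves the model with the weights from the selected epoch.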

Fig. 15. (to be viewed in color) The influence of the number of bins in the calculation of \(I(T_i;Y)\) for the CHES CTF dataset. Average \(I(T_i;Y)\) w.r.t. the number of epochs for different bin sizes (left). The final key rank when the maximum value of \(I(T_i;Y)\) is used as a reference metric, for different bin sizes (1 to 256) (right).

Fig. 16. Guessing entropy results when early stopping is conducted with mutual information as a metric for different bin size estimators.

Figure 17a shows the evolution of \(I(T_{i};Y)\) during neural network training for all hidden layers, while Fig. 17b provides the guessing entropy results. We provide results obtained by selecting the best epoch from the maximum value of \(I(T_{i};Y)\) for each of the hidden layers, as suggested during the review process. These particular results illustrate that only the last layer is indicative of good early-stopping performance. We conclude that the information path can lead to optimal choices for the number of training epochs in profiling side-channel analysis. Note, however, that as a general solution, selecting the best epoch from \(I(T_{n};Y)\) (output layer) provided better results across multiple datasets.

C On the Length of the Generalization Interval

As stated, we aim to reach the generalization interval and then stop the training. By doing so, we ensure that the trained machine learning model will generalize to unseen data. The question remains how difficult it is to stop within the generalization interval. Intuitively, the shorter the interval, the easier it is to miss it. Ideally, we aim for a neural network that reaches the generalization interval relatively quickly and stays in that interval for a longer period. Before discussing how to obtain a long generalization interval, we must ensure that it occurs at all and that we do not pass directly from the underfitting phase to the overfitting phase.

Regularization techniques can help prevent a deep neural network from overfitting during the training process. To check the impact of regularization on the neural network and its generalization interval, we use the information plane, as it provides a visual indication of the relationship between \(I(X;T_{n})\) and \(I(T_{n};Y)\). The maximum value of \(I(T_{n};Y)\) during training indicates an epoch at which the neural network should be inside the generalization interval for the training process, as defined in Definition 3. When the network does not include any regularization technique in its hyperparameter configuration, the trained model has a higher chance of overfitting the training data.

Figure 18 depicts results for a CNN with and without regularization (dropout). This experiment is conducted on a proprietary unprotected software AES implementation (STM32 microcontroller). We consider 6 000 traces for the training set and 1 000 traces for the validation set, both with fixed keys. The traces contain 400 features. Observing Fig. 18b for the case without regularization, we see that the mutual information \(I(T_{n};Y)\) reaches a maximum value (where the distributions \(T_n\) and Y are obtained from the validation set), after which \(I(T_{n};Y)\) for the validation set decreases continuously while \(I(T_{n};Y)\) for the training set stays at its maximum value. Additionally, \(I(T_{n};Y)\) indicates that the generalization phase lasts shorter than one would infer from accuracy, as illustrated in Fig. 18a.

Fig. 17. 4-layer MLP trained with the ASCAD random key dataset.

Fig. 18. Convolutional neural network configurations (learning rate \(= 0.001\), Adam optimizer, batch size \(=400\), randomly uniform initialized weights).

Figures 18a and 18b also show the accuracy and \(I(T_{n};Y)\), respectively, for the training and validation sets obtained from a regularized CNN with dropout. After 200 epochs, the training accuracy has not reached 100%, which is the desired outcome for a regularized neural network. At the same time, the validation accuracy reaches approximately 56%, a significantly higher value than the 51% obtained without regularization, as shown in Fig. 18a. The mutual information \(I(T_{n};Y)\) for the validation set (see Fig. 18b) reaches its maximum value and stays at this level for longer, indicating that the same generalization level is kept until at least epoch 100. Consequently, as the value of \(I(T_{n};Y)\) stays high for more training epochs, the neural network provides better generalization over those epochs. Again, accuracy cannot indicate the same phenomenon, as its value remains stable (albeit of different magnitude for the validation set) for both regularized and non-regularized networks.

The neural network configurations (with and without dropout) are illustrated in Fig. 19. The "R" and "S" labels refer to ReLU and Softmax, respectively. The number under a layer block indicates the number of neurons for dense layers ("D") and the dropout rate for dropout layers. An illustrative code sketch of such a configuration follows the figure caption below.

Fig. 19. Convolutional neural network configurations (learning rate \(= 0.001\), Adam optimizer, batch size \(=400\), randomly uniform initialized weights).
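For illustration only, the sketch below builds such a pair of Keras models. The learning rate, optimizer, and input size follow the caption and the text, while the number of convolutional filters, the dense-layer widths, the dropout rate, and the number of classes are assumptions, since the exact values appear only in Fig. 19.

    import tensorflow as tf

    def build_cnn(num_features=400, num_classes=9, use_dropout=True):
        # num_classes depends on the leakage model (e.g., 9 for Hamming weight,
        # 256 for the identity model); the value here is an assumption.
        inputs = tf.keras.Input(shape=(num_features, 1))
        x = tf.keras.layers.Conv1D(8, kernel_size=3, activation='relu')(inputs)  # assumed
        x = tf.keras.layers.AveragePooling1D(pool_size=2)(x)
        x = tf.keras.layers.Flatten()(x)
        for units in (200, 200):                        # "D" blocks with "R" (ReLU), assumed widths
            x = tf.keras.layers.Dense(units, activation='relu')(x)
            if use_dropout:
                x = tf.keras.layers.Dropout(0.5)(x)     # assumed dropout rate
        outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)    # "S"
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                      loss='categorical_crossentropy', metrics=['accuracy'])
        return model

Training the two variants (use_dropout=True and use_dropout=False) with batch size 400, as stated in the caption, would then correspond to the comparison discussed above.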

D DPAv4 Results

For the DPAv4 dataset, we consider 34 000 traces for the training set and 2 000 traces for the validation set. An additional 2 000 traces are used as the test set. These results were obtained from 100 training runs of a CNN with fixed hyperparameters. Figure 20 shows the guessing entropy and success rate obtained from the selected metrics (accuracy, recall, loss, key rank, and maximum \(I(T_{n};Y)\)) on the validation set. Selecting the model at the epoch with the maximum \(I(T_{n};Y)\) for the validation set provides the best results for both SR and GE. As for the ASCAD dataset in the HW leakage model, \(I(T_{n};Y)\) gives better results than the validation key rank for a small number of attack traces. Again, this happens due to the influence of the validation set on the trained model. Interestingly, training for all 50 epochs leads to overfitting, but the same behavior occurs if we stop the training based on loss, recall, or accuracy.
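For reference, the following minimal sketch (not the evaluation code used in the paper) shows how a key rank can be computed from the network's output probabilities under the Hamming weight leakage model. Here, probs holds the softmax outputs over the nine Hamming weight classes, plaintexts the values of the attacked plaintext byte, and sbox is assumed to be the 256-entry AES S-box as a NumPy array; rank 1 means the correct key byte scores highest.

    import numpy as np

    def key_rank(probs, plaintexts, sbox, true_key_byte):
        # Accumulate, for every key guess, the log-probabilities of the class
        # HW(Sbox(plaintext XOR guess)) over all attack traces.
        log_p = np.log(probs + 1e-36)
        scores = np.zeros(256)
        for guess in range(256):
            classes = [bin(int(sbox[p ^ guess])).count('1') for p in plaintexts]
            scores[guess] = log_p[np.arange(len(plaintexts)), classes].sum()
        order = np.argsort(scores)[::-1]                 # best guess first
        return int(np.where(order == true_key_byte)[0][0]) + 1

Guessing entropy is then this key rank averaged over many attacks on randomly drawn subsets of attack traces, and the (first-order) success rate is the fraction of attacks in which the correct key byte is ranked first.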

Fig. 20. Results on DPAv4 for the Hamming weight leakage model, CNN architecture.

Fig. 21. Results on DPAv4 for the Hamming weight leakage model, CNN architecture.

Observe from Figs. 21a and 21b that the network achieves its maximum \(I(T_{n};Y)\) value between epochs 10 and 16. Figure 21c confirms that around 10 epochs are required to reach a guessing entropy of 1. Additionally, the behavior stays relatively stable up to epoch 38 (there is no deterioration up to epoch 15, and afterward there are only slight changes in GE). Finally, in Fig. 21d, the validation key rank agrees with \(I(T_{n};Y)\), reaching its maximal frequency values for epochs 11 to 15 (cf. Fig. 21b).


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Perin, G., Buhan, I., Picek, S. (2021). Learning When to Stop: A Mutual Information Approach to Prevent Overfitting in Profiled Side-Channel Analysis. In: Bhasin, S., De Santis, F. (eds.) Constructive Side-Channel Analysis and Secure Design. COSADE 2021. Lecture Notes in Computer Science, vol. 12910. Springer, Cham. https://doi.org/10.1007/978-3-030-89915-8_3


  • DOI: https://doi.org/10.1007/978-3-030-89915-8_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89914-1

  • Online ISBN: 978-3-030-89915-8

