
Learning When to Stop: A Mutual Information Approach to Prevent Overfitting in Profiled Side-Channel Analysis

Conference paper
Constructive Side-Channel Analysis and Secure Design (COSADE 2021)
Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12910)

Abstract

Today, deep neural networks are a common choice for conducting profiled side-channel analysis. Unfortunately, it is not trivial to find neural network hyperparameters that result in top-performing attacks. One hyperparameter that steers the training process is the number of epochs for which the network is trained. If the training is too short, the network does not reach its full capacity, while if the training is too long, the network overfits and cannot generalize to unseen examples. In this paper, we tackle the problem of determining the correct epoch at which to stop the training in deep learning-based side-channel analysis. We demonstrate that the amount of information, or, more precisely, the mutual information transferred to the output layer, can be measured and used as a reference metric to determine the epoch at which the network offers optimal generalization. To validate the proposed methodology, we provide extensive experimental results.

This work was supported by the European Union’s H2020 Programme under grant agreement number ICT-731591 (REASSURE).


Notes

  1. It is also possible for a machine learning model to underfit if the training is stopped too early. Still, this is usually of less concern, as the resulting machine learning model would still generalize to unseen data but not use its full potential, i.e., the attack would not be as powerful as possible.

  2. A saturating activation function squeezes the input data, i.e., the output is bounded to a certain range.

  3. Note that the information plane figures show different layers, but it is not possible to recognize a specific layer just by "observing" the graph, i.e., there is no pre-specified behavior for a specific layer. We store and plot the data for each layer separately.



Appendices

A Bin Size Estimators

To estimate the probability density, a critical step is to determine the bin width, which is the user-supplied parameter of the histogram estimator. The results of our experiments are shown in Figs. 13, 14, and 15. As shown in the plots on the right, any bin width larger than 15 leads to a final key rank lower than 4 (a key rank equal to 1 indicates successful key recovery). The key rank is computed for a separate test set and is obtained by selecting the machine learning model at the epoch that gives the highest \(I(T_n;Y)\) for each tested bin width. The plots on the left side of Figs. 13, 14, and 15 show the value of \(I(T_n;Y)\) w.r.t. the number of epochs for all tested bin sizes. As we can see, if the bin size is too small, the mutual information \(I(T_n;Y)\) barely changes.
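As an illustration, the following minimal Python sketch shows such a histogram-based estimator. It is not the implementation used for the experiments: it assumes that T is a NumPy array of output-layer activations (one row per trace) and y the array of integer labels, and it computes \(I(T;Y) = H(T) - H(T|Y)\) after discretizing the activations into a fixed number of equally spaced bins.

    import numpy as np

    def entropy_of_rows(rows):
        # Shannon entropy of the empirical distribution over distinct rows.
        _, counts = np.unique(rows, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def mutual_information(T, y, num_bins=100):
        # Histogram estimator: discretize the activations into equally spaced bins.
        edges = np.linspace(T.min(), T.max(), num_bins + 1)
        T_binned = np.digitize(T, edges)
        h_t = entropy_of_rows(T_binned)                       # H(T)
        h_t_given_y = 0.0                                     # H(T|Y)
        for label in np.unique(y):
            mask = (y == label)
            h_t_given_y += mask.mean() * entropy_of_rows(T_binned[mask])
        return h_t - h_t_given_y                              # I(T;Y)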

Fig. 13. (to be viewed in color) The influence of the number of bins in the calculation of \(I(T_n;Y)\) for the ASCAD dataset. Average \(I(T_n;Y)\) w.r.t. the number of epochs for different bin sizes (left). The final key rank when the maximum value of \(I(T_n;Y)\) is used as a reference metric, for different bin sizes (1 to 256) (right).

Fig. 14. (to be viewed in color) The influence of the number of bins in the calculation of \(I(T_n;Y)\) for the DPAv4 dataset. Average \(I(T_n;Y)\) w.r.t. the number of epochs for different bin sizes (left). The final key rank when the maximum value of \(I(T_n;Y)\) is used as a reference metric, for different bin sizes (1 to 256) (right).

Figure 16 shows results for different well-known estimators compared with fixing the number of bins to 100. We tested the Freedman-Diaconis ('fd'), 'sturges', 'auto' (the maximum of the 'fd' and 'sturges' estimators), 'rice', 'scott', square-root ('sqrt'), and 'doane' estimators. The results show the guessing entropy obtained when early stopping uses \(I(T_{n};Y)\) as the metric, for each of the tested bin size estimators. Notice that they all lead to successful key recovery with similar guessing entropy convergence for all tested datasets and leakage models.
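The estimator names above match NumPy's built-in bin-width rules. Under that assumption, the number of bins each rule yields for a given set of output-layer activations can be obtained as in the sketch below, where the random data only stands in for real activations; the resulting bin count is then fed to the histogram-based mutual information estimator.

    import numpy as np

    def bins_from_estimator(activations, rule):
        # np.histogram_bin_edges implements the named rules ('fd', 'sturges', ...).
        edges = np.histogram_bin_edges(activations.ravel(), bins=rule)
        return len(edges) - 1

    T = np.random.rand(1000, 256)   # placeholder for output-layer activations
    for rule in ['fd', 'sturges', 'auto', 'rice', 'scott', 'sqrt', 'doane']:
        print(rule, bins_from_estimator(T, rule))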

B From Information Path to the Best Epoch to Stop the Training

We use the information path to visualize how much information each hidden layer has learned about the true labels (Y). The amount of information learned is an estimate of how well each hidden layer fits the distribution of Y, which is directly derived from the selected leakage model.

We repeat the experiment for the ASCAD random key dataset (Fig. 17, MLP architecture). Using the visualization provided by the information path, we confirm that all hidden layers are transformed representations of the input traces and contain information about Y. However, the estimate of the mutual information between Y and the output layer probabilities (after the Softmax activation layer) best captures the prediction capability of the network. In our experiments, the best epoch to stop the training is the epoch where \(I(T_{n};Y)\) reaches its maximum value.
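As a sketch of how this selection rule can be integrated into training, assuming TensorFlow/Keras and the mutual_information helper sketched in Appendix A, a callback can evaluate \(I(T_{n};Y)\) on the validation set after every epoch and restore the weights of the epoch where it was maximal. The names and structure below are illustrative, not the authors' implementation.

    import numpy as np
    import tensorflow as tf

    class MaxMutualInformation(tf.keras.callbacks.Callback):
        def __init__(self, x_val, y_val_labels, num_bins=100):
            super().__init__()
            self.x_val, self.y_val = x_val, y_val_labels
            self.num_bins = num_bins
            self.best_mi, self.best_epoch, self.best_weights = -np.inf, None, None

        def on_epoch_end(self, epoch, logs=None):
            T = self.model.predict(self.x_val, verbose=0)   # output-layer probabilities
            mi = mutual_information(T, self.y_val, self.num_bins)
            if mi > self.best_mi:
                self.best_mi, self.best_epoch = mi, epoch
                self.best_weights = self.model.get_weights()

        def on_train_end(self, logs=None):
            # Roll back to the epoch with the maximal validation I(T_n;Y).
            if self.best_weights is not None:
                self.model.set_weights(self.best_weights)

Passing an instance of this callback to model.fit then leaves the model with the weights from the selected epoch.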

Fig. 15. (to be viewed in color) The influence of the number of bins in the calculation of \(I(T_i;Y)\) for the CHES CTF dataset. Average \(I(T_i;Y)\) w.r.t. the number of epochs for different bin sizes (left). The final key rank when the maximum value of \(I(T_i;Y)\) is used as a reference metric, for different bin sizes (1 to 256) (right).

Fig. 16. Guessing entropy results when early stopping is conducted with mutual information as a metric for different bin size estimators.

Figure 17a shows the evolution of \(I(T_{i};Y)\) during neural network training for all hidden layers, while Fig. 17b provides the guessing entropy results. We provide results obtained by selecting the best epoch from the maximum value of \(I(T_{i};Y)\) for each of the hidden layers, as suggested during the review process. These particular results illustrate that only the last layer is indicative of good early-stopping performance. We conclude that the information path can lead to optimal choices for the number of training epochs in profiling side-channel analysis. Note, however, that as a general solution, selecting the best epoch from \(I(T_{n};Y)\) (output layer) provided better results across multiple datasets.

C On the Length of the Generalization Interval

As stated, we aim to reach the generalization interval and then stop the training. By doing so, we ensure that the trained machine learning model will generalize to unseen data. The question remains how difficult it is to stop within the generalization interval. Intuitively, the shorter the interval, the easier it is to miss it. Ideally, we aim for a neural network that reaches the generalization interval relatively quickly and stays in that interval for a longer period. Before discussing how to obtain a long generalization interval, we must ensure that it occurs at all and that we do not pass directly from the underfitting phase to the overfitting phase.

Regularization techniques can help prevent a deep neural network from overfitting during the training process. To check the impact of regularization on the neural network and its generalization interval, we use the information plane, as it provides a visual indication of the relationship between \(I(X;T_{n})\) and \(I(T_{n};Y)\). The maximum value of \(I(T_{n};Y)\) during training indicates an epoch at which the neural network should be inside the generalization interval for the training process, as defined in Definition 3. When the network does not include any regularization technique in its hyperparameter configuration, the trained model has a higher chance of overfitting the training data.

Figure 18 depicts results for a CNN with and without regularization (dropout). This experiment is conducted on a proprietary unprotected software AES implementation (STM32 microcontroller). We consider 6 000 traces for the training set and 1 000 traces for the validation set, both with fixed keys. The traces contain 400 features. Observing Fig. 18b for the case without regularization, we see that the mutual information \(I(T_{n};Y)\) reaches a maximum value (where the distributions \(T_n\) and Y are obtained from the validation set), after which \(I(T_{n};Y)\) for the validation set decreases continuously while \(I(T_{n};Y)\) for the training set stays at its maximum value. Additionally, \(I(T_{n};Y)\) indicates that the generalization phase lasts shorter than one would infer from accuracy, as illustrated in Fig. 18a.

Fig. 17. 4-layer MLP trained with the ASCAD random key dataset.

Fig. 18. Convolutional neural network configurations (learning rate \(= 0.001\), Adam optimizer, batch size \(=400\), randomly uniform initialized weights).

Figures 18a and 18b also show the accuracy and \(I(T_{n};Y)\), respectively, for the training and validation sets obtained from a regularized CNN with dropout. After 200 epochs, the training accuracy has not reached 100%, which is the desired outcome for a regularized neural network. At the same time, the validation accuracy reaches approximately 56%, a significantly higher value than the 51% obtained without regularization, as shown in Fig. 18a. The mutual information \(I(T_{n};Y)\) for the validation set (see Fig. 18b) reaches its maximum value and stays at this level for longer, indicating that the same generalization level is kept until at least epoch 100. Consequently, as the value of \(I(T_{n};Y)\) stays high for more training epochs, the neural network provides better generalization over those epochs. Again, accuracy cannot indicate the same phenomenon, as its value remains stable (albeit of different magnitude for the validation set) for both regularized and non-regularized networks.

The neural network configurations (with and without dropout) are illustrated in Fig. 19. The "R" and "S" labels refer to ReLU and Softmax, respectively. The number under a layer block indicates the number of neurons for dense layers ("D") and the dropout rate for dropout layers. An illustrative code sketch of such a configuration follows the figure caption below.

Fig. 19. Convolutional neural network configurations (learning rate \(= 0.001\), Adam optimizer, batch size \(=400\), randomly uniform initialized weights).
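For illustration only, the sketch below builds such a pair of Keras models. The learning rate, optimizer, and input size follow the caption and the text, while the number of convolutional filters, the dense-layer widths, the dropout rate, and the number of classes are assumptions, since the exact values appear only in Fig. 19.

    import tensorflow as tf

    def build_cnn(num_features=400, num_classes=9, use_dropout=True):
        # num_classes depends on the leakage model (e.g., 9 for Hamming weight,
        # 256 for the identity model); the value here is an assumption.
        inputs = tf.keras.Input(shape=(num_features, 1))
        x = tf.keras.layers.Conv1D(8, kernel_size=3, activation='relu')(inputs)  # assumed
        x = tf.keras.layers.AveragePooling1D(pool_size=2)(x)
        x = tf.keras.layers.Flatten()(x)
        for units in (200, 200):                        # "D" blocks with "R" (ReLU), assumed widths
            x = tf.keras.layers.Dense(units, activation='relu')(x)
            if use_dropout:
                x = tf.keras.layers.Dropout(0.5)(x)     # assumed dropout rate
        outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)    # "S"
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                      loss='categorical_crossentropy', metrics=['accuracy'])
        return model

Training the two variants (use_dropout=True and use_dropout=False) with batch size 400, as stated in the caption, would then correspond to the comparison discussed above.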

D DPAv4 Results

For the DPAv4 dataset, we consider 34 000 traces for the training set and 2 000 traces for the validation set. An additional 2 000 traces are used as the test set. These results were obtained from 100 training runs of a CNN with fixed hyperparameters. Figure 20 shows the guessing entropy and success rate obtained from the selected metrics (accuracy, recall, loss, key rank, and maximum \(I(T_{n};Y)\)) on the validation set. Selecting the model at the epoch with the maximum \(I(T_{n};Y)\) for the validation set provides the best results for both SR and GE. As for the ASCAD dataset in the HW leakage model, \(I(T_{n};Y)\) gives better results than the validation key rank for a small number of attack traces. Again, this happens due to the influence of the validation set on the trained model. Interestingly, training for all 50 epochs leads to overfitting, but the same behavior occurs if we stop the training based on loss, recall, or accuracy.
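For reference, the following minimal sketch (not the evaluation code used in the paper) shows how a key rank can be computed from the network's output probabilities under the Hamming weight leakage model. Here, probs holds the softmax outputs over the nine Hamming weight classes, plaintexts the values of the attacked plaintext byte, and sbox is assumed to be the 256-entry AES S-box as a NumPy array; rank 1 means the correct key byte scores highest.

    import numpy as np

    def key_rank(probs, plaintexts, sbox, true_key_byte):
        # Accumulate, for every key guess, the log-probabilities of the class
        # HW(Sbox(plaintext XOR guess)) over all attack traces.
        log_p = np.log(probs + 1e-36)
        scores = np.zeros(256)
        for guess in range(256):
            classes = [bin(int(sbox[p ^ guess])).count('1') for p in plaintexts]
            scores[guess] = log_p[np.arange(len(plaintexts)), classes].sum()
        order = np.argsort(scores)[::-1]                 # best guess first
        return int(np.where(order == true_key_byte)[0][0]) + 1

Guessing entropy is then this key rank averaged over many attacks on randomly drawn subsets of attack traces, and the (first-order) success rate is the fraction of attacks in which the correct key byte is ranked first.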

Fig. 20. Results on DPAv4 for the Hamming weight leakage model, CNN architecture.

Fig. 21. Results on DPAv4 for the Hamming weight leakage model, CNN architecture.

Observe from Figs. 21a and 21b that the network achieves its maximum \(I(T_{n};Y)\) value between epochs 10 and 16. Figure 21c confirms that around 10 epochs are required to reach a guessing entropy of 1. Additionally, the behavior stays relatively stable up to epoch 38 (there is no deterioration up to epoch 15, and afterward there are only slight changes in GE). Finally, in Fig. 21d, the validation key rank agrees with \(I(T_{n};Y)\), reaching its maximal frequency values for epochs 11 to 15 (cf. Fig. 21b).


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Perin, G., Buhan, I., Picek, S. (2021). Learning When to Stop: A Mutual Information Approach to Prevent Overfitting in Profiled Side-Channel Analysis. In: Bhasin, S., De Santis, F. (eds.) Constructive Side-Channel Analysis and Secure Design. COSADE 2021. Lecture Notes in Computer Science, vol. 12910. Springer, Cham. https://doi.org/10.1007/978-3-030-89915-8_3


  • DOI: https://doi.org/10.1007/978-3-030-89915-8_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89914-1

  • Online ISBN: 978-3-030-89915-8

