Autoencoder-enabled Model Portability for Reducing Hyperparameter Tuning Efforts in Side-channel Analysis

Hyperparameter tuning represents one of the main challenges in deep learning-based profiling side-channel analysis. For each different side-channel dataset, the typical procedure to find a profiling model is applying hyperparameter tuning from scratch. The main reason is that side-channel measurements from various targets contain different underlying leakage distributions. Consequently, the same profiling model hyperparameters are usually not equally efficient for other targets. This paper considers autoencoders for dimensionality reduction to verify if encoded datasets from different targets enable the portability of profiling models and architectures. Successful portability reduces the hyperparameter tuning efforts, as profiling model tuning is eliminated for the new dataset, and tuning autoencoders is simpler. We first search for the best autoencoder for each dataset and the best profiling model when the encoded dataset becomes the training set. Our results show no significant difference in tuning efforts using original and encoded traces, meaning that encoded data reliably represents the original data. Next, we verify how portable the best profiling model is among different datasets. Our results show that tuning autoencoders enables and improves portability while reducing the effort in hyperparameter search for profiling models. Lastly, we present a transfer learning case where dimensionality reduction might be necessary if the model is tuned for a dataset with fewer features than the new dataset.


Introduction
Hardware and software implementations of cryptographic algorithms may leak unintended and measurable side-channel information such as power consumption, electromagnetic emissions, and execution time. Although mathematically secure, these cryptographic implementations may become vulnerable to side-channel attacks (SCAs). SCA is an implementation attack mainly categorized into direct and two-stage attacks. Direct attacks, also known as non-profiled SCA, mainly consist of simple power analysis [17], differential power analysis [18], and correlation power analysis [5]. These attacks exploit the statistical dependency between leaked side-channel information and secret cryptographic keys. Recovering the secret depends on running the attack over all possible key hypotheses through a divide-and-conquer strategy and selecting an efficient statistical distinguisher (e.g., Pearson correlation, difference-of-means, or mutual information). On the other hand, a two-stage or profiling SCA [11] can evaluate the security of a cryptographic implementation by assuming a stronger adversary. Profiling SCA assumes that a potential adversary has an open device (identical to the target one) that provides conditions to learn a profiling model by reprogramming the key and input data to the cryptographic algorithm. Depending on how much knowledge the adversary is assumed to possess (e.g., source code and access to the secret randomness of the implementation), profiling SCA allows the deployment of worst-case (i.e., white-box) or black-box security assessment.
Countermeasures such as masking and hiding are often considered to mitigate SCA. For twenty years, Gaussian template attacks (GTA) [11] have proven to be theoretically the best option to test the worst-case security of SCA countermeasures [6]. Deep learning (DL) has been widely investigated as an alternative profiling SCA solution in the last few years. The results with real-world datasets have demonstrated that deep neural networks provide several practical advantages in comparison to GTA, such as skipping points-of-interest or feature selection from raw measurements [20,26], relaxing assumptions about the underlying leakage distribution, and being less sensitive to trace desynchronization [8,33,37]. However, together with large training times, the main open challenge for DL-based SCA is hyperparameter tuning. In [27], the authors suggested that hyperparameter tuning should be taken as one of the adversarial assumptions, together with the number of profiling and attack measurements. However, verifying the correctness and reliability of a DL-based profiling model concerning its hyperparameters is still difficult. Even advanced hyperparameter search algorithms [29,34] cannot guarantee that the obtained best model delivers a reliable security assessment.
Hyperparameter tuning is a trade-off between time effort and neural network performance, as there is no proven best way to tune the network in a reasonable time. According to [27], the maximum number of searched DL models should be considered when inferring the target's security. A profiling SCA process that is unbounded in the number of hyperparameter tuning models (or learnability capacity) would be able to deliver a reliable security assessment. However, as the number of searched models is always limited in reality, one would like to optimize the model search process by reducing the hyperparameter tuning effort while ensuring that a reliable and efficient DL model is always found and trained within the available computation bounds. In other words, by applying DL-based profiling SCA, the security evaluator wants to ensure that an assessment deeming the target secure (i.e., one where the attack fails to recover the secret) results from an SCA-secure implementation instead of a wrongly tuned profiling attack. One way to reduce the hyperparameter search effort across different targets is to apply preprocessing techniques on raw side-channel measurements, such as points-of-interest selection or dimensionality reduction.
In this paper, we consider only dimensionality reduction because points-of-interest selection tends to be inefficient due to the presence of masking countermeasures in the evaluated datasets. We assume a black-box threat model, i.e., an adversary without access to secret masks during the profiling and attack phases. We consider autoencoders for dimensionality reduction, which have already been used in several other SCA applications [19,23,35], as discussed in Section 2. Our primary goal is to verify whether efforts can be moved from tuning a profiling deep neural network model to tuning an autoencoder by reusing profiling models across different datasets. Our main contributions are:
1. We experimentally confirm that the standard reconstruction error metric for autoencoders works well for SCA settings. Moreover, the data encoded with autoencoders stays relevant: we show that tuning efforts on encoded data are similar to tuning on original traces.
2. We demonstrate that the portability of profiling model hyperparameters is possible. We apply the same best profiling model across different datasets encoded into the same dimension with the obtained best autoencoders. Thus, the same profiling model obtained for one dataset can be utilized for other datasets, which reduces the hyperparameter search effort.
3. We show through transfer learning that our best profiling model can be applied to different datasets, eliminating hyperparameter tuning of the profiling model and reducing training time.
The analysis provided in this paper contributes to making DL-based SCA more practical for security evaluations of cryptographic implementations protected with first-order Boolean masking schemes. The paper is organized as follows. Related work is discussed in Section 2. We explain the necessary details on deep learning-based side-channel attacks, autoencoders, and transfer learning, followed by a description of the utilized datasets, in Section 3. Section 4 details our experimental setup, explaining the steps of our analysis and the several use cases we consider. Experimental results are reported in Section 5. We conclude the paper in Section 6, briefly discussing possible future work.

Related Work
Several research papers address the portability issue between profiling and target devices of the same type. For instance, Choudary and Kuhn [12] employed different devices in template attacks (TAs) and utilized Fisher's Linear Discriminant Analysis and Principal Component Analysis to enhance TA performance. Another approach proposed in several papers involves using multiple devices to improve the attack performance [4,13]. In Cao et al. [9], the authors suggest enhancing the pre-trained model on a profiling set with unlabeled traces from the target. They propose a loss function that consists of the standard classification task and minimizing the distribution discrepancy between data traces. In another work by Cao et al. [10], a method inspired by Generative Adversarial Networks (GANs) is presented, where an encoder replaces the generator to reduce the data discrepancy before the attack. Transfer learning is also proposed to transfer knowledge from the profiling device to the target device. Genevey-Metat et al. [15] investigate several scenarios that differ in the position and type of EM probe, side-channel information, and device samples belonging to the same family. All the aforementioned works discussed portability between profiling and target device samples, while our approach utilizes public data from different devices and acquisitions.
In Thapar et al. [31], the authors use transfer learning to accelerate the attack by reusing a model trained on a different target. However, they fixed their input size for model reuse. In contrast, we do not restrict the input size of the initial model we aim to reuse, as we can achieve effective dimension reduction and latent representation with autoencoders for our target device.
Autoencoders have been used to reduce noise and alignment issues in measurements, improving the performance of traditional and deep learning methods in non-profiling SCA [19]. Similarly, Wu and Picek [35] use autoencoders (denoising autoencoder) as a preprocessing method to remove the traces' noise and enable training with clean data in profiling attacks. Paguada et al. [23] utilize autoencoders to reduce the dimensionality of SCA traces, leading to lower complexity of the profiling phase, reduced computational time, and improved classification performance. Thus, autoencoders have predominantly been employed to enhance attack performance on the same dataset (target) by mitigating countermeasures and reducing the traces' complexity. Similarly, our work employs autoencoders to obtain a reduced latent representation of the given input (SCA trace). However, unlike the related works, our approach utilizes autoencoders to facilitate attacks on different datasets (targets).

Deep Learning-based Side-channel Attacks
In deep learning-based side-channel attacks (DL-based SCA), the main goal is to train the deep neural network parameters θ with training data D by minimizing a loss function L. Each instance of the training data consists of a tuple (x_i, y_i), where x_i is a one-dimensional vector representing the i-th side-channel measurement (or trace) in the dataset D. The index i ranges from 0 to the size of the dataset, |D|. The term y_i refers to the label (or class) associated with x_i.
Labeling a dataset requires the definition of a leakage model and a selection function. In SCA, the main leakage models are identity (ID), Hamming weight (HW), Hamming distance (HD), and bit-level models. The selection function defines the attacked intermediate value and could be, for instance, an S-Box output byte in the first encryption round, i.e., y_i = S-Box(d_j ⊕ k_j), where d_j and k_j are the j-th plaintext and key bytes (j ∈ [0, 15] for a key size of 128 bits), respectively.
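To make this concrete, the following minimal NumPy sketch (with our own helper names, not the paper's code) labels traces under the ID and HW leakage models, given the targeted plaintext and key bytes as uint8 arrays:

```python
import numpy as np

# Standard AES S-box (256 entries).
SBOX = np.array([
    0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
    0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
    0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
    0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
    0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
    0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
    0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
    0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
    0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
    0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
    0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
    0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
    0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
    0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
    0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
    0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16,
], dtype=np.uint8)

def hamming_weight(values):
    # Number of set bits per byte (labels 0..8).
    return np.unpackbits(values[:, None], axis=1).sum(axis=1)

def label_traces(plaintext_byte, key_byte, leakage_model="HW"):
    """Label each trace with the first-round S-box output,
    y_i = S-Box(d_j XOR k_j), under the chosen leakage model."""
    z = SBOX[np.bitwise_xor(plaintext_byte, key_byte)]
    return z if leakage_model == "ID" else hamming_weight(z)
```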
From the training set D, we select a subset V to validate the trained model. This model is later tested on a separate dataset A collected from the attacked device that we refer to as the attack set. Since the goal is to obtain the secret key from A (or a single byte of the key), we use guessing entropy (GE) [30] to assess the attack performance. The best possible neural network model is the one that requires minimal attack complexity, which is measured in terms of the minimum number of attack traces that are necessary to successfully recover the key [7].
To compute GE, we first predict the validation or attack set and obtain class probabilities p_{i,y_i} for each trace i. As the labels y_i are derived from a key-dependent selection function, we obtain the log-likelihood l_k of a certain key byte k_j ∈ [0, 255] as

\[ l_k = \sum_{i=1}^{N_a} \log p_{i,y_i}, \]

where N_a is the number of traces in the predicted set. This process is then repeated for all possible key byte hypotheses. Each hypothesis will define different labels y_i for each trace. The key rank of the correct key k* is obtained by sorting all l_k values and returning the position of l_{k*} associated with the correct key byte k*. The GE of the correct key, ge*, is given by an empirical process in which we repeat the key rank computation multiple times (each time with a different, randomly selected subset from the attack or validation set). We obtain an average log-likelihood or key guessing vector g and take the average position of the correct key k* inside g. When ge* = 1, we say that the model successfully recovers the key with N_a attack traces. The minimum number of traces to retrieve the key is referred to as N_{ge*=1}. Although the primary goal of training a deep neural network in the SCA context is to minimize N_{ge*=1}, the models in this paper are still trained with a categorical cross-entropy loss function. In [22], the authors showed that minimizing this loss function is aligned with minimizing N_{ge*=1}.
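The empirical procedure above can be summarized in a short sketch. The following is our own illustration, not the authors' implementation; it reuses SBOX and hamming_weight from the labeling sketch in the previous subsection:

```python
import numpy as np

def guessing_entropy(probs, plaintext_byte, correct_key, n_attack=3000,
                     n_repeats=100, leakage_model="HW"):
    """Empirical guessing entropy of the correct key byte.

    probs: (N, n_classes) class probabilities predicted by the model.
    plaintext_byte: (N,) targeted plaintext byte per trace (uint8).
    Returns the average rank of correct_key (rank 1 = key recovered).
    """
    n = len(probs)
    log_p = np.log(probs + 1e-36)            # avoid log(0)
    # Per-trace log-probability of the label induced by each key hypothesis.
    ll = np.zeros((n, 256))
    for k in range(256):
        z = SBOX[np.bitwise_xor(plaintext_byte, k)]
        y = z if leakage_model == "ID" else hamming_weight(z)
        ll[:, k] = log_p[np.arange(n), y]
    ranks = []
    for _ in range(n_repeats):
        idx = np.random.choice(n, n_attack, replace=False)
        l_k = ll[idx].sum(axis=0)            # summed log-likelihood per key
        order = np.argsort(l_k)[::-1]        # most likely key hypothesis first
        ranks.append(int(np.where(order == correct_key)[0][0]) + 1)
    return float(np.mean(ranks))
```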

Autoencoders (AEs)
An autoencoder (AE) is a self-supervised neural network used for data compression, dimensionality reduction, generating new data, denoising, etc. The authors in [16] first used autoencoders for dimensionality reduction. Different autoencoders, such as denoising or variational autoencoders, are described in [1,25]. We use autoencoders for dimensionality reduction to learn, in an unsupervised manner, an informative smaller representation of the data. We consider deep autoencoders since they are often better than shallow or linear counterparts. While variational autoencoders are very popular, they are more helpful in generating new data, which is different from our goal here.
Autoencoders usually consist of an encoder and a decoder part. The encoder takes the original input and learns a function that encodes the data into a representation given by a latent space. In dimensionality reduction, the input dimension is reduced in the latent space. That middle layer is known as the "bottleneck layer", as it holds the data's compressed representation. The decoder function then reconstructs the original input from the encoded data. Both encoder and decoder are neural networks, commonly symmetrical, having the same type and number of layers with the same layer sizes.
The objective function of the autoencoder minimizes the difference between input and output while preserving the relevant information. The compressed data is evaluated by the decoder's ability to reconstruct the original input from it, so the common metric is Mean Squared Error (MSE). The output of an autoencoder is (a reconstruction of) its input, so MSE is calculated as

\[ \text{MSE} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \hat{x}_i)^2, \]

where x_i is the original observation, x̂_i its reconstruction, and m the number of inputs (samples). In SCA, x_i is the side-channel trace with n features, for which the distance from x̂_i is again the MSE. Therefore, we do not use labels as in profiling models and do not need any leakage model. In this work, we search for the best autoencoders, following the information from [25] for defining the hyperparameter tuning space.
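As an illustration of the symmetric design described above, the following Keras sketch builds an MLP autoencoder with a bottleneck layer and an MSE objective; all hyperparameter defaults are placeholders rather than the best-found values:

```python
from tensorflow.keras import layers, models

def build_mlp_autoencoder(n_features, latent_size=400, n_layers=1,
                          n_units=400, activation="elu"):
    """Symmetric MLP autoencoder: encoder and decoder mirror each other,
    trained to reconstruct the input under an MSE loss."""
    inp = layers.Input(shape=(n_features,))
    x = inp
    for _ in range(n_layers):                      # encoder
        x = layers.Dense(n_units, activation=activation)(x)
    latent = layers.Dense(latent_size, activation=activation,
                          name="latent")(x)        # bottleneck layer
    x = latent
    for _ in range(n_layers):                      # mirrored decoder
        x = layers.Dense(n_units, activation=activation)(x)
    out = layers.Dense(n_features, activation="linear")(x)
    autoencoder = models.Model(inp, out)
    encoder = models.Model(inp, latent)            # used to encode datasets
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder
```

After training with autoencoder.fit(x, x, ...), the separate encoder model produces the reduced latent representation of the traces.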

Transfer Learning
Transfer learning (TL) in machine learning focuses on transferring knowledge across domains and aims to leverage knowledge from a related domain to improve learning in a new task (target domain). The success of transfer learning depends on many factors, such as the relevance between the source and target domains and the learner's (model's) capacity to find transferable and valuable knowledge across the two domains. Transfer learning can be categorized based on the feature space between the two domains and the availability of the labels. More information on categorizations of TL is found in surveys, e.g., [24,32,38]. Our case belongs to inductive transfer learning, where we have labels for both the source and target domains (different intermediate values belonging to a specific dataset). We aim to achieve high performance in the target task. There are many approaches to transfer learning, and they depend on what we aim to transfer. In our case, we use parameter-based TL to transfer knowledge at the model/parameter level: we take models trained on one dataset and use them for different datasets. Our main objective is to obtain accurate predictions in the target domain for the new task. Specifically, we train the model to learn the correct key k* of another dataset. We do this with parameter sharing: starting from a neural network trained for the source task, we share (freeze) most of the layers and fine-tune the last few layers to obtain a network that works for the targeted task. We keep the first layers since the first layers in deep neural networks appear not to be specific to particular datasets or tasks [36].
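A minimal Keras sketch of this parameter-sharing scheme is shown below; the number of fine-tuned layers, the optimizer, and the learning rate are placeholder assumptions, not the paper's exact settings:

```python
import tensorflow as tf
from tensorflow.keras import models

def transfer_model(source_model, n_trainable_layers=1, learning_rate=1e-4):
    """Parameter-based transfer learning sketch: copy the source model's
    trained weights, freeze all layers except the last few, and fine-tune
    only those on the target dataset."""
    model = models.clone_model(source_model)   # same architecture
    model.set_weights(source_model.get_weights())
    for layer in model.layers[:-n_trainable_layers]:
        layer.trainable = False                # shared (frozen) early layers
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```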

Datasets
We describe the three datasets used in our experiments. For all datasets, we use 5 000 traces for validation and another 5 000 traces as the attack set, both for profiling attacks and autoencoders. To calculate GE, each key rank calculation uses 3 000 traces randomly chosen from those 5 000.

DPAcontest v4.2
The DPAcontest v4.2 dataset (here referred to as DPAv4.2) is the second implementation available in the DPAcontest v4 [3]. It is an improved version implemented in software on an 8-bit Atmel ATMega-163 smart card and corrects several leaks identified in its previous generation. This dataset represents the power consumption of the first AES encryption round, and the AES implementation is protected with the Rotating S-box Masking (RSM) countermeasure. The dataset contains a total of 80 000 traces, each with 1 704 402 sample points. In our experiments, we trim the traces to the interval representing the processing of the 13-th S-box byte, ranging from sample 305 000 to 315 000 of the original measurements. We then apply a resampling process with a resampling window of 10 and a step of 5, resulting in 2 000 samples per measurement. We use 70 000 traces for training (which contain 14 different keys).
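One plausible implementation of this trimming-and-resampling step is sketched below; averaging each window is our assumption about the resampling operation, which is not further specified here:

```python
import numpy as np

def resample(traces, window=10, step=5):
    """Sliding-window resampling: average `window` consecutive samples and
    slide by `step`. Over the 10 000-sample trimmed interval, window=10 and
    step=5 yield roughly 2 000 samples per trace."""
    n_samples = traces.shape[1]
    starts = np.arange(0, n_samples - window + 1, step)
    return np.stack([traces[:, s:s + window].mean(axis=1) for s in starts],
                    axis=1)

# Trim to the 13-th S-box interval, then resample.
# trimmed = raw_traces[:, 305000:315000]; reduced = resample(trimmed)
```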

ASCAD
The ASCAD dataset with a fixed key (ASCADf), along with the ASCAD dataset with a random key (ASCADr), consists of measurements from masked AES on the 8-bit ATmega8515 MCU target without any specific hiding countermeasures activated on the target [2]. For the ASCADf dataset, the key is fixed for all measurements. We have 50 000 training traces with 700 features per trace. The ASCADr dataset corresponds to a second campaign with the same target and setup as ASCADf. However, in this setting, the key is variable for 66% of the measurements. We use 200 000 training traces with 1 400 features per trace.

Experimental Setup
In this section, we provide details about our experimental setup. The process starts with a hyperparameter search to find the best autoencoders for different datasets. Before that, we verify that the MSE metric is appropriate, i.e., that it keeps the side-channel leakage in the reconstructed traces. Then we verify whether searching for profiling neural network models remains similar when we train the models with the encoded datasets. We compare the attack performance of profiling models trained with encoded and original datasets. That is necessary to validate that encoded data stays relevant without worsening tuning efforts. Next, we reuse profiling models' hyperparameters across multiple datasets, as tuning on encoded data is shown to be equivalent to tuning on original datasets. We consider portability from encoded data to other encoded data and from original to encoded data. The first case enables universal models where all datasets are represented in a similar latent space. The second case addresses the portability of architectures between different feature spaces. Finally, we explore transfer learning advantages utilizing autoencoders and profiling models. To summarize, we apply the following steps:
1. Search for the best latent space size for all datasets based on two datasets.
2. Search for the best autoencoders with the lowest Mean Squared Error (MSE) by setting the best-found latent space size.
3. Compare the performance of profiling models when trained with original and encoded traces of the datasets.
4. Investigate the portability of the best profiling model hyperparameters trained with an encoded dataset to other encoded datasets. All datasets are encoded into the same latent dimension by using the best-found autoencoders.
5. Investigate the portability of the best profiling model hyperparameters trained with an original dataset to other encoded datasets. This concept is used when a new dataset has more features than the original dataset. The new dataset is encoded into the same dimension as the original dataset using the best-found autoencoders.
6. Investigate transfer learning of the best profiling model trained with an original dataset to other datasets encoded into the same dimension using the best autoencoders.
The overall structure of our experimental setup and the corresponding steps are shown in Figure 1. Additionally, the source code is publicly available.

Autoencoder Architectures
[Fig. 1: Experimental setup. The term n refers to the number of features in the datasets. We denote h as the set of architecture hyperparameters and θ as the trainable parameters (weights and biases) in the portability cases.]

We consider the following CNN and MLP autoencoder structures:
- ae_mlp: autoencoders given by symmetric encoder and decoder blocks, in which all layers have the same number of neurons. The latent size can be smaller than, equal to, or larger than the number of neurons in the previous layers.
- ae_mlp_dcr: autoencoders with a decreasing number of neurons in subsequent layers (with possible repetition). We do not ensure that the layer before the latent space is strictly larger than (or equal to) the latent dimension.
- ae_mlp_str_dcr: autoencoders with a strictly decreasing (str_dcr) number of neurons in subsequent encoder layers. The latent size is smaller than the number of neurons in the previous layer, although layers with repeated sizes before the latent layer are still possible.
We have several options for MLP autoencoders, while the most common choice is ae_mlp_str_dcr. A decreasing number of neurons in the encoder and a symmetrical decoder are commonly chosen because, intuitively, decreasing the number of neurons forces generalization and seems useful for dimensionality reduction. The real benefit of this structure is possibly the lower computation cost compared to alternatives. However, as in classification, other options can be explored. Thus, we test the possibilities mentioned above, where the number of neurons is not consistently decreasing and the latent size does not strictly follow the decreasing pattern.
In the described autoencoder types, the encoder and decoder with MLP structure are always symmetrical, which means that the number of layers is the same in both encoder and decoder blocks. Also, the layer sizes are symmetrically decreasing in the encoder while increasing in the decoder. While common, this is again not strictly required and can be explored. Intuition again says it makes the most sense for the decoder to follow a reverse structure of its encoder counterpart, but other possibilities can be similarly capable of good performance. For this setting, we keep the traditional symmetrical design.
The CNN autoencoder uses convolutional blocks similar to those reported in [25]. Specifically, we use a convolutional layer followed by a pooling layer in the encoder. While [25] specified Max pooling, we allow both Max and Average pooling in the hyperparameter selection. For the decoder, we use upsampling followed by a standard convolutional layer. Since there are more options, the convolutional autoencoder (ConvAE) structure is more complex to define than the MLP autoencoder. In the literature, we observe versions of ConvAE increasing and decreasing the number of filters while the kernel size and pooling size remain the same. In some cases, kernel sizes were changing. Thus, there is no specific best way to structure the ConvAE. In our case, we increase the number of filters in the encoder because the kernel size and pooling reduce the number of features, sometimes to only one. Thus, having more filters in those deeper layers ensures that after flattening, we have more than one neuron before the last fully-connected layer. We increase the number of filters per layer following the expression nb_filters · 2^i, where i is the order of the layer plus one. In the decoder, with the combination of upsampling and a standard convolutional layer, upsampling increases the number of features, while the kernel size again decreases it. Thus, we keep the same expression for increasing the number of filters. Both the encoder and decoder end with a flatten layer followed by a fully-connected layer with the number of neurons equal to the latent size in the encoder and the input size in the decoder.
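The following Keras sketch illustrates this ConvAE structure; the defaults (filter counts, kernel and pooling sizes, activation) are placeholders, not the tuned values:

```python
from tensorflow.keras import layers, models

def build_cnn_autoencoder(n_features, latent_size=400, n_conv_layers=2,
                          nb_filters=16, kernel_size=10, pool_size=2):
    """ConvAE sketch: conv + pooling blocks with the filter count growing
    as nb_filters * 2**(i + 1) in the encoder, upsampling + conv blocks in
    the decoder, and each half ending with Flatten followed by a Dense
    layer (latent size in the encoder, input size in the decoder)."""
    inp = layers.Input(shape=(n_features, 1))
    x = inp
    for i in range(n_conv_layers):                       # encoder
        x = layers.Conv1D(nb_filters * 2 ** (i + 1), kernel_size,
                          activation="elu", padding="same")(x)
        x = layers.MaxPooling1D(pool_size)(x)            # or AveragePooling1D
    x = layers.Flatten()(x)
    latent = layers.Dense(latent_size, name="latent")(x)
    x = layers.Reshape((latent_size, 1))(latent)
    for i in range(n_conv_layers):                       # decoder
        x = layers.UpSampling1D(pool_size)(x)
        x = layers.Conv1D(nb_filters * 2 ** (i + 1), kernel_size,
                          activation="elu", padding="same")(x)
    x = layers.Flatten()(x)
    out = layers.Reshape((n_features, 1))(layers.Dense(n_features)(x))
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")          # trained on (x, x)
    return model
```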
In this work, we tested different structures of MLP autoencoders. More analysis should still be done for the CNN autoencoder, as the described structure is one of many possibilities. We leave this exploration of CNN structures for future work.

Autoencoder Metric Analysis
Autoencoders for dimensionality reduction imply finding a reduced representation of the input data through a latent space. To assess the quality of the reduction and the obtained latent representation, the most common error metric is the Mean Squared Error (MSE):

\[ \text{MSE} = \frac{1}{m \cdot n} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2, \]

where x_{ij} is the j-th feature value of the i-th original side-channel observation and x̂_{ij} its reconstruction, m is the number of traces (inputs), and n is the number of features in a side-channel trace. Minimizing the MSE leads to a good reconstruction of the original input.
To verify that the reconstructed traces preserve side-channel leakage, we additionally compute the signal-to-noise ratio (SNR). For each of the 256 values of an intermediate variable v, we compute the mean vector μ_v and the variance vector σ²_v over the traces labeled with v:

\[ \mu_v = \frac{1}{N_v} \sum_{i:\, y_i = v} x_i, \qquad \sigma^2_v = \frac{1}{N_v} \sum_{i:\, y_i = v} (x_i - \mu_v)^2, \]

where N_v is the number of side-channel traces represented or labeled with the intermediate variable v. Next, we obtain the mean vector of all 256 variance vectors σ²_v:

\[ \overline{\sigma^2} = \frac{1}{256} \sum_{v=0}^{255} \sigma^2_v, \]

and the variance of the mean vectors μ_v:

\[ \text{Var}(\mu) = \frac{1}{256} \sum_{v=0}^{255} \left( \mu_v - \frac{1}{256} \sum_{w=0}^{255} \mu_w \right)^2. \]

Finally, SNR is given by:

\[ \text{SNR} = \frac{\text{Var}(\mu)}{\overline{\sigma^2}}. \quad (8) \]

The SNR from Eq. (8) results in a vector with the same length as the side-channel traces. We compute this vector for the original and reconstructed traces. Then, we take the maximum SNR peak obtained with the original traces and subtract from it the value at that exact location in the SNR obtained from the reconstructed traces. In the result figures, we refer to this as SNR diff.
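A compact NumPy sketch of this SNR computation and the SNR diff, under the assumption that every class value occurs in the labeled set, could look as follows:

```python
import numpy as np

def snr(traces, labels, n_classes=256):
    """SNR per sample point: variance of the class means divided by the
    mean of the class variances, following Eq. (8)."""
    means = np.stack([traces[labels == v].mean(axis=0)
                      for v in range(n_classes)])
    variances = np.stack([traces[labels == v].var(axis=0)
                          for v in range(n_classes)])
    return means.var(axis=0) / variances.mean(axis=0)

def snr_diff(original, reconstructed, labels):
    """Maximum SNR peak of the original traces minus the reconstructed
    traces' SNR at that same sample point ("SNR diff" in the figures)."""
    snr_orig, snr_rec = snr(original, labels), snr(reconstructed, labels)
    peak = int(np.argmax(snr_orig))
    return snr_orig[peak] - snr_rec[peak]
```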

Autoencoders Search
In this section, we deploy a random search to find the best latent space size for autoencoders based on experiments with two datasets. After defining the best latent space size, we deploy a random search to find the best autoencoder architecture for MLP and CNN-based structures. We also obtain the best autoencoder types. The datasets are then encoded with these best autoencoders. Finally, we deploy another random search to find the best profiling model trained with the encoded datasets. We include the third dataset later for the portability experiments, while also evaluating how well the decisions made on the two other datasets for latent size and autoencoder type apply to new datasets.

Assessing MSE Metric with SNR
We conduct a random search on autoencoders using MSE as the loss function for achieving good reconstruction from the latent space. In these experiments, we consider SNR to verify that minimizing MSE is a meaningful objective when tuning autoencoder hyperparameters. Here, we are not searching for the best latent space size, so we fix the latent dimension to 100 features in all cases to evaluate the MSE metric.
The hyperparameter search spaces for all autoencoder types are listed in Tables 19 and 20 in Appendix A. We randomly search for 20 models with each of the four autoencoder types and calculate the SNR difference, as described before, between the SNR vectors obtained from original and reconstructed traces. The analysis is conducted for the Hamming weight and identity leakage models with v = S-Box(d_j ⊕ k_j) ⊕ m_i as the intermediate variable to compute the SNR.
Results are shown in Figure 2 for the DPAv4.2 and ASCADr datasets. The x-axis in Figures 2a and 2b shows the maximum peak SNR value difference between the original and reconstructed traces. The corresponding MSE value for each autoencoder is on the y-axis. These figures show that MSE increases as the SNR peak difference increases, regardless of the autoencoder type and the leakage model used in the SNR calculations. The vertical lines occur when the reconstructed traces result in insignificant SNR peak values, indicating that side-channel leakages concerning v = S-Box(d_j ⊕ k_j) ⊕ m_i are not preserved in the reconstructed traces. Negative SNR difference values on the x-axis indicate that the reconstructed trace has a higher SNR value than the original trace at the same sample point, which means that the corresponding autoencoder preserves and even amplifies the side-channel leakages concerning v. However, MSE is not designed to lead the autoencoder (AE) to amplify any such SNR peak, as it is minimized by correct reconstruction of the trace without amplifications. Next, we consider the Pearson correlation coefficient ρ to test whether there is a positive (linear) correlation between MSE and SNR differences. Indeed, from the results in Table 1, we see a high correlation until the vertical lines (maximum difference in the SNR values). Specifically, for both datasets, the correlation is stronger for MSE below 0.5. For the DPAv4.2 dataset, the maximum correlation is for MSE values below 0.25, and in the case of ASCADr, below 0.5. Therefore, we conclude that minimizing MSE is a meaningful objective error function to optimize autoencoder models for the given datasets.

Searching for the Best Latent Size

In this section, we use random search to compare different latent space sizes. The latent sizes we consider are 20, 40, 50, 100, 200, 250, 400, and 500 for all autoencoder types, except that for ae_mlp_str_dcr, we do not use the latent size 500, as we also limit the search to 400 neurons per layer (see Table 19). By choosing these latent sizes, we ensure that the bottleneck layer in the autoencoder is always smaller than the input layer (which contains the same number of units as the input side-channel trace dimension). The datasets evaluated in this section contain 2 000 features (DPAv4.2) and 1 400 features (ASCADr), which is significantly larger than the chosen latent space sizes given by the bottleneck layer. The hyperparameter search space (Tables 19 and 20) is the same as in the metric analysis provided in the previous section. The hyperparameters for the autoencoders are chosen at random, and we train 20 autoencoder models per latent size, dataset, and autoencoder type combination. The total number of autoencoder combinations in this search is 62.
The main idea here is to verify whether a specific latent space size tends to provide the lowest MSE among the searched ones, regardless of the dataset and autoencoder (AE) type. Considering our two datasets and four autoencoder types, we have eight cases, each testing eight or seven latent sizes (AE type ae_mlp_str_dcr does not use latent size 500, leading to seven latent sizes instead of eight). We apply the following procedure to obtain the best latent size:
1. For each of the 62 combinations, we extract the autoencoder (out of 20) with the lowest MSE for that dataset, autoencoder type, and latent size.
2. For each autoencoder type and dataset combination, we rank the latent sizes based on the best (lowest) MSE. The latent size of the model with the lowest MSE gets rank 1, being the best one.
3. For each latent space size, we average the eight ranks coming from the dataset-AE type combinations.
Table 2 shows the average ranks of the latent sizes. The left side of the table covers all AE types except ae_mlp_str_dcr, as that one must have a decreasing structure, which cannot be achieved with a latent size of 500 given our limit of 400 neurons per layer. On the right side of Table 2, we order all latent sizes according to the average rank, excluding 500, and include the results from ae_mlp_str_dcr.
Next, we use the Friedman test [14] across all autoencoder types and the two datasets (ASCADr and DPAv4.2). This test determines whether there is a statistically significant difference between the means of three or more groups in which the same subjects appear in each group. In our case, the groups are based on latent sizes, and the subjects are dataset-AE type combinations. The comparison is based on the lowest MSE obtained. The Friedman test calculates a test statistic Q using the ranks of the samples in the groups. The Q value has to be greater than the critical value of Q for a selected significance level α to reject the null hypothesis. Commonly, a significance level α of 0.05 works well [28]. We determine the critical value from the Chi-Square distribution table with k − 1 degrees of freedom, where k is the number of groups, and the selected significance level α. The p-value is the probability of obtaining test results at least as extreme as the observed result under the assumption that the null hypothesis is correct. The null hypothesis for the Friedman test is that the means of the groups are the same. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. The null hypothesis can be rejected if the p-value is below α.
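For illustration, the test can be run with scipy on a results table shaped as described (rows as subjects, columns as groups); the data below is a random placeholder for the actual MSE results:

```python
import numpy as np
from scipy import stats

# Placeholder results table: rows are the eight dataset-AE type combinations
# (subjects), columns are the latent sizes (groups), and each entry is the
# lowest MSE found for that combination.
rng = np.random.default_rng(0)
mse_table = rng.random((8, 7))

# Each group (latent size) is passed as one sample of eight measurements.
q_stat, p_value = stats.friedmanchisquare(*mse_table.T)
print(f"Q = {q_stat:.3f}, p = {p_value:.4f}")
# Reject the null hypothesis of equal group means when p < alpha = 0.05.
```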
We report the Friedman test results in the bottom row of Table 2. Since the p-value is below 0.05, we conclude that the difference between the mean values of the groups (latent sizes) is statistically significant. Additionally, the test statistic Q on the left part is greater than the critical value of 14.07 for 7 degrees of freedom. On the right side, the test statistic is greater than the critical value of 12.59 for α = 0.05 and 6 degrees of freedom. We perform the Nemenyi post-hoc test to determine which groups have different means. The results are shown in Appendix B in Tables 23 and 24, corresponding to the cases without and with the ae_mlp_str_dcr type. The values in the tables are p-values; if a value is below 0.05, the two groups (column-row combination) have statistically significantly different means. The lowest MSE values for models with latent sizes 400 and 200 differ significantly from models with lower latent sizes (20, 40, and 50). For latent sizes 100 and 250, the difference is significant only compared to latent size 20 in the case with the ae_mlp_str_dcr type. Thus, we select latent sizes 200 and 400 to find the best autoencoders using a random search.

Selecting the Best Autoencoders
After finding the best latent sizes for the ASCADr and DPAv4.2 datasets, we randomly search for an additional 80 autoencoder models to obtain a total of 100 models for the latent space sizes of 200 and 400. The hyperparameter search space for each autoencoder type stays the same as in the search for the best latent size. From these 100 autoencoder models, we select the best autoencoder for each dataset. Table 3 shows the MSE of the best autoencoder for each of the given latent sizes (200 or 400) and autoencoder types per dataset. We see that for ASCADr, latent size 400 always results in a lower MSE. For DPAv4.2, ae_mlp and ae_mlp_dcr had better results with 200 features in the latent space. However, we can conclude that the best autoencoder types are ae_cnn and ae_mlp_str_dcr, with the lowest MSE in both datasets obtained using latent size 400. Since we initially allow only up to 400 neurons per layer, with a latent space size of 400, the ae_mlp_str_dcr autoencoder type could not create a bottleneck architecture with a decreasing number of neurons in the consecutive layers of the encoder. Thus, we repeated the random search for another 100 models for this autoencoder type, allowing layers with 500 and 600 neurons. Table 4 shows the minimum, mean, median, and maximum MSE found in the 100 models for the two datasets. Note that this table shows MSE results when the autoencoder contains layers with 400 neurons and MSE results when layers can include 400, 500, and 600 neurons.
For the DPAv4.2 dataset, an autoencoder with up to 400 neurons per layer results in a lower MSE than when we allow 400, 500, and 600 neurons per layer. The hyperparameters of the best models for both cases are in Table 5. Note that this architecture has one hidden layer with 400 neurons between the input layer and the layer with the specified latent size, and the decoder is symmetrical. When allowing 500 and 600 neurons in the random search, the best-found autoencoder has an architecture with two hidden layers with 400 neurons in the encoder and decoder. The two models also differ in batch size, activation function, learning rate, and weight initialization, but the optimizer is the same. The best model in the second case does not use a larger number of neurons in the first layers.
The autoencoder for the ASCADr dataset has a lower minimal MSE when the random search includes layers with 500 and 600 neurons. However, the ability to use a larger number of neurons in layers closer to the input layer was not utilized. In both cases for the ASCADr dataset, the best model has the same architecture: one hidden layer with 400 neurons for the encoder and decoder and a bottleneck layer with 400 neurons. The models differ in activation function and weight initialization, while batch size, learning rate, and optimizer are the same. As the best autoencoders have the same architecture (layers and neurons) for both datasets, the slight difference in performance comes from the other hyperparameters.
In general, the results in Table 4 indicate that a random search only including the option of 400 neurons per layer delivers better MSE values than when we allow more neurons per layer. Mean, median, and maximum MSE are always lower when only 400 neurons are permitted. These results are confirmed for both datasets.
Based on these results, the best autoencoder we use in further experiments is the MLP autoencoder for the DPAv4.2 dataset with an MSE of 0.03321. For the ASCADr dataset, we use the MLP autoencoder with an MSE of 0.12179. The autoencoder with the CNN structure achieves an even better MSE: 0.026 for DPAv4.2 and 0.1187 for ASCADr. The hyperparameters for the best ae_cnn autoencoders are in Table 6. We use these four autoencoders in the further experiments, denoted ae_mlp_dpav42_best, ae_mlp_ascadr_best, ae_cnn_dpav42_best, and ae_cnn_ascadr_best.

Are Encoded Datasets as Good as Original Datasets?
After defining the best ae_mlp_str_dcr and ae_cnn autoencoder structures for the DPAv4.2 and ASCADr datasets, we investigate whether the encoded datasets keep relevant leakage information when they are used as training and attack datasets.
We run a random search to find different MLP and CNN profiling models, and we compare the search performance using original and encoded traces obtained from the best autoencoders listed in Tables 5 and 6. The hyperparameter search space for the profiling models is shown in Table 21. Training, validation, and attack sets are labeled with S-box(d_2 ⊕ k_2) (the third S-box output byte in the first AES encryption round; note that we start counting from byte index 0) for ASCADr and S-box(d_12 ⊕ k_12) (the 13-th S-box output byte in the first AES encryption round) for DPAv4.2. We consider the Hamming weight (HW) and identity (ID) leakage models. We search for 100 models for each combination of the leakage model and profiling model type (MLP-ID, MLP-HW, CNN-ID, and CNN-HW). We measure how many out of the 100 random models reach ge* = 1 for a given number of validation traces. With that information, we compare whether a good model is easier to obtain using original or encoded traces within the same hyperparameter search space. If a similar number of models out of 100 reach ge* = 1 with original and encoded traces, encoded traces preserve enough information and can be used for training profiling models in SCA. Results for both datasets are shown in Figure 3. From the results, we see that for the ASCADr dataset, only in the case with MLP and the ID leakage model did we obtain the same number of models with ge* = 1. Looking at the number of traces N_{ge*=1}, using original traces, on average 1 462.8 traces are necessary, while for the dataset encoded with ae_mlp_ascadr_best, we need on average N_{ge*=1} = 1 205.9 traces. For the other attack setups, using original traces led to more models with ge* = 1. However, using encoded data was not much worse.
On the other hand, for DPAv4.2, in three out of four attack settings, we obtained more models with ge* = 1 when using traces encoded with ae_mlp_dpav42_best or ae_cnn_dpav42_best. Those cases are the MLP profiling model with both leakage models and the CNN profiling model with the ID leakage model. The result with the CNN profiling model and the HW leakage model is again close in performance between data encoded with ae_cnn_dpav42_best and original traces.
We use the Friedman test on the eight scenarios presented in Figure 3 to see whether there is a statistical difference between training on encoded and original traces. We obtain a test statistic of 2.7742 and a p-value of 0.2498. The null hypothesis for the Friedman test is that there is no statistically significant difference in the means of the numbers of models reaching ge* = 1 out of 100 runs. Since the p-value is not below 0.05, we cannot reject it, so all the setups lead to similar performance. To conclude, using original traces is not statistically significantly better than using encoded data, meaning that encoded data preserves relevant features that can be used in a profiling attack.
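The comparison procedure above boils down to a simple counting loop; the sketch below uses hypothetical builder callbacks in place of the actual search code and the guessing_entropy sketch from Section 3:

```python
def count_successful_models(sample_hyperparameters, build_profiling_model,
                            x_train, y_train, x_val, plaintext_byte,
                            correct_key, n_models=100):
    """Draw n_models random profiling models and count how many reach
    ge* = 1 on the validation set."""
    successes = 0
    for _ in range(n_models):
        hp = sample_hyperparameters()              # random hyperparameters
        model = build_profiling_model(hp)
        model.fit(x_train, y_train, epochs=hp["epochs"],
                  batch_size=hp["batch_size"], verbose=0)
        probs = model.predict(x_val, verbose=0)
        if guessing_entropy(probs, plaintext_byte, correct_key) == 1:
            successes += 1
    return successes
```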

The Portability of Profiling Models
In this section, we verify the efficiency of the best profiling model (obtained in the previous section) when it is found through hyperparameter tuning on one dataset but applied to different datasets. We include a third dataset to show portability from one dataset to two others. This way, we can answer the following question: can we move effort from profiling model tuning into autoencoder tuning to reuse the same profiling model across multiple encoded side-channel datasets?

Portability of Encoded-Data Trained Profiling Model to Different Encoded Datasets
We start by verifying the portability of a best-found profiling model trained with an encoded dataset with respect to other encoded datasets. That is possible because we encode all datasets into the same encoding dimension, i.e., all encoded datasets contain an equal number of features. We take the best MLP and CNN profiling models trained with the encoded DPAv4.2 dataset for both leakage models. Their attack performance is shown in Tables 7 and 8, while their hyperparameters are presented in Tables 25 and 26, respectively. We test the performance of those architectures on the encoded ASCADr and ASCADf datasets. For ASCADr, we already have the best autoencoders with the latent size of 400 (see Tables 5 and 6 for hyperparameters and MSE). We additionally train the ae_mlp_str_dcr and ae_cnn autoencoders for the ASCADf dataset to encode it to 400 features per trace as well. The hyperparameter ranges used to find the best ae_cnn autoencoder with the random search are the same as those considered for the DPAv4.2 and ASCADr datasets (see Table 20). To find the best ae_mlp_str_dcr autoencoder for the ASCADf dataset, we again consider the random search settings shown in Table 19. However, we allow the number of neurons per layer for a latent size of 400 to be [400, 500, 600, 700]. The hyperparameters of the best autoencoders ae_mlp_ascadf_best and ae_cnn_ascadf_best for ASCADf with latent size 400 are reported in Table 9. Before the training, we performed standardization on the encoded datasets. Standardization is a typical preprocessing method before training in SCA and other domains. However, later we also test without standardization to observe its effects.
The results with the encoded ASCADr indicate superior performance compared to the results obtained with the encoded ASCADf. For MLP with the identity leakage model, when this architecture was found with the ae_cnn_dpav42_best-encoded DPAv4.2 dataset, performance on the encoded ASCADr is slightly worse, with 19 and 51 models out of 100 reaching ge* = 1. As mentioned, the results with the encoded ASCADf are not good, specifically for the cases with ASCADf encoded with ae_mlp_ascadf_best. The poor performance might come from the fact that the encoded data do not share comparable features despite the equal latent size. We observe that the architecture of that autoencoder differs from the architectures for the other two datasets. To improve these results for the encoded ASCADf, and since for ASCADf we allowed more than 400 neurons per layer (which was not the case for the other datasets), we again train 100 ae_mlp_str_dcr autoencoders for ASCADf but with only 400 neurons per layer and a latent size of 400. This way, the autoencoder architecture is more similar to the autoencoders of the other datasets. The resulting features also become more comparable, which could improve the performance. The hyperparameters of the best autoencoder for this case are in Table 11.
The results using the ae_mlp_ascadf_best-encoded ASCADf from the described search are in Table 12. Here, we see an improvement, which indicates that our hypothesis on the similarity of the latent representations with DPAv4.2 and ASCADr could be true. Accordingly, this is a crucial remark to consider if universal models are to be built: as the feature space becomes more similar, the portability becomes easier. Despite the MSE being slightly worse than before, the attack performance is better since the representations are more comparable. After improving ae_mlp_ascadf_best, we also test the best MLP and CNN profiling architectures from Tables 7 and 8 without data standardization. The results are in Table 13 and show that for ASCADr, we have similarly successful behavior in comparison to the results from Table 10, where data standardization was applied. For the encoded ASCADf, we compare the results with Table 12 for ae_mlp_ascadf_best, as that autoencoder was used for encoding since it was shown to be better. Additionally, we compare with Table 10 for ae_cnn_ascadf_best. The results with and without standardization for ASCADf are also similar. Our analysis demonstrates that reusing profiling models trained on an encoded dataset is possible. That reduces hyperparameter tuning efforts when considering new encoded datasets, where the effort is moved to tuning the autoencoder. Additionally, a universal profiling architecture is then something we can consider on autoencoder-encoded data. Moreover, tuning autoencoders is easier, as optimizing MSE is more straightforward.

Portability of Original-Data Trained Profiling Model to Different Original and Encoded Datasets
In this section, we test the portability of a best-found profiling model architecture (from the random search) when it is trained on an original (i.e., not encoded) dataset. For that, we consider ASCADf, which contains 700 features. We made this choice because ASCADf has fewer features than ASCADr and DPAv4.2, which contain 1 400 and 2 000 features, respectively, in their original versions. In this case, to reuse that architecture, we need to decrease the number of features of the other datasets to the size of the data used in training. However, since the input layer is a dedicated first layer in neural networks, we can also replace that first layer to reuse the architecture. In that case, we can keep the original number of features of the new datasets. Our goal is to verify whether the best-found profiling architecture for ASCADf also provides good attack performance when trained with the encoded and original ASCADr and DPAv4.2 datasets. Therefore, we have three cases per dataset: using original data and using data encoded with two different AE types.
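The following sketch illustrates reusing an architecture with a replaced input layer; best_hp is a hypothetical dictionary holding the searched MLP hyperparameters, and only the input shape changes between datasets:

```python
from tensorflow.keras import layers, models

def rebuild_with_new_input(best_hp, n_features, n_classes):
    """Reuse an architecture (not its weights) on data with a different
    number of features: keep the searched hyperparameters but replace
    the input layer."""
    inp = layers.Input(shape=(n_features,))       # only this layer changes
    x = inp
    for _ in range(best_hp["n_layers"]):
        x = layers.Dense(best_hp["n_units"],
                         activation=best_hp["activation"])(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer=best_hp["optimizer"],
                  loss="categorical_crossentropy")
    return model
```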
Since this time we have to encode ASCADr and DPAv4.2 into 700 features, we again run a hyperparameter search to find the best autoencoders, which are reported in Table 14 with their corresponding MSE values. The hyperparameter search spaces are shown in Tables 20 and 22. Table 15 shows the results with the best MLP and CNN profiling architectures found for the original ASCADf dataset. We ran a random search for 100 models using the hyperparameter search space from Table 21. The hyperparameters for the models with results presented in Table 15 can be found in Table 27.
Since we reuse only the architecture and not the trained parameters (weights and biases), we modify the input layer to use the original ASCADr and DPAv4.2 datasets, which have more features than the original ASCADf. This way, we take the best architectures from Table 15 and train them with the original ASCADr and DPAv4.2 datasets as well as with their encoded versions obtained with the best-found autoencoders listed in Table 14. For each dataset, profiling model architecture, and leakage model, we run 100 trainings and compare the number of times the model reaches ge* = 1. The analysis is done both with and without data standardization.
The results in Table 16 show that the best-found architecture provides good performance even if we directly use the original traces from the DPAv4.2 and ASCADr datasets. However, with DPAv4.2, the best-found CNN architectures are less successful. For the encoded DPAv4.2 dataset, the results are better than with original traces, leading to similar and often better performance. With the dataset encoded with ae_cnn_dpav42_best_700, we got better results without standardization, and for the dataset encoded with ae_mlp_dpav42_best_700, it was better to use standardization. Using original traces was already very successful for the ASCADr dataset, so using encoded data is less valuable, but it still shows good performance when the dataset is encoded with ae_cnn_ascadr_best_700, especially with standardization. On the other hand, using ae_mlp_ascadr_best_700-encoded data usually resulted in worse outcomes. Considering the standardization of encoded data, we see that it was slightly beneficial for data encoded with both ae_mlp_ascadr_best_700 and ae_cnn_ascadr_best_700. Statistically, however, we cannot claim that it is always necessary to use standardization. On the other hand, based on our results, models trained with encoded data perform similarly or better in most experiments than those trained with original data. In Table 16, the cases where the performance is worse are marked in red. Thus, we conclude that using encoded data to reuse a profiling attack architecture tuned on other datasets' original traces can be done despite the different feature spaces. Moreover, encoded data is beneficial when the attack with the original data is unsuccessful. Hyperparameter tuning for new datasets can be significantly reduced in that way. Again, tuning is more straightforward for autoencoders: minimizing MSE does not require the typical attack phase with GE calculations needed in classification.

Transfer Learning with Profiling Models to Different Encoded Datasets
We also test the benefit of autoencoders in the context of transfer learning. Again, we take the same best profiling models for the ASCADf dataset (Table 15), and we retrain the last layer to obtain the secret key byte for the new dataset. In this case, the input must be the same size since we also reuse the trained parameters (weights and biases). Therefore, we encoded the ASCADr and DPAv4.2 datasets for transfer learning to the profiling model input size. Additionally, we again test with and without standardization of the encoded data. Since the training is faster, as we train only one layer, we have a setting where we train epoch by epoch, calculating the GE after each epoch and stopping when we reach ge* = 1. The maximum number of epochs is 100. The other setting is running training for a given number of epochs, which is 100, as in all our experiments. In the results shown in Table 17, when the datasets were encoded using the ae_cnn_*_best_700 autoencoders, we see that we reached ge* = 1 in all cases. Often the necessary number of epochs is small. However, when using standardization of the encoded data, we see that with DPAv4.2, we reach ge* = 1 in fewer cases.
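The epoch-by-epoch setting can be sketched as follows, reusing the guessing_entropy sketch from Section 3; the batch size is a placeholder:

```python
def train_with_ge_check(model, x_train, y_train, x_val, plaintext_byte,
                        correct_key, max_epochs=100, batch_size=400):
    """Train one epoch at a time and stop as soon as ge* = 1 is reached,
    mirroring the epoch-by-epoch transfer learning setting."""
    for epoch in range(max_epochs):
        model.fit(x_train, y_train, epochs=1, batch_size=batch_size,
                  verbose=0)
        probs = model.predict(x_val, verbose=0)
        ge = guessing_entropy(probs, plaintext_byte, correct_key)
        if ge == 1:
            return epoch + 1                  # epochs needed to reach ge* = 1
    return None                               # ge* = 1 not reached
```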
Standardization for the encoded ASCADr did not have much influence. Table 18 shows the attack results when the datasets are encoded with ae_mlp_*_best_700. Here, we see that the performance is a bit worse. However, we can still reach ge* = 1 in some cases without standardization. For DPAv4.2, profiling models using the identity leakage model did not reach ge* = 1. Using CNN, we see that it got close to one with ge* = 1.65 and ge* = 1.15, so we believe this can be corrected using, e.g., more epochs. Thus, we increased the number of epochs to 150 and got ge* = 1 within N_{ge*=1} = 854 traces. Similarly, we select other specific cases that did not reach ge* = 1, and we experiment with the number of epochs and with training the last two layers instead of one to verify whether these modifications obtain better performance. Since using standardization primarily led to worse results, we only experimented without standardization, changing the number of epochs and the number of layers we train.
Specifically, for the MLP and HW combination with DPAv4.2, we reach ge* = 1 in 12 epochs when checking the GE after every epoch. Training for 100 epochs at once, GE gets worse (3.8). Thus, we tested with only 50 epochs and reached N_{ge*=1} = 1 993. The combination of MLP and the ID leakage model is the worst, with the minimal GE being 67.8 in 41 epochs and 115.75 after 100 epochs. Therefore, we tested multiple modifications: training the two last layers in the model, again with a different number of epochs (100, 150, and 200). The lowest GE occurs with 150 epochs training one layer, where we reach ge* = 98.8, and training the two last layers with 200 and 150 epochs, reaching ge* = 104.7 and ge* = 99.55, respectively. Similarly, we do this for the ASCADr dataset. In the case of MLP and the ID leakage model, we again tested all cases as with DPAv4.2, and the improvement happens only when training the two last layers with 200 epochs, getting ge* = 1.25. In the combination of CNN and ID, the minimal GE we get is 2.55 after 100 epochs, and when we train epoch by epoch, the minimum GE is 1.8 at epoch 96. The results indicate that the model has the capacity to learn the new dataset. We tried adding more epochs, 150 and 200, but did not reach better results (GE was 2.5 and 4.5, respectively). Also, with only 50 epochs, we get worse results, with a minimal GE of 35.45. We reached ge* = 1.3 by training the two last layers for 150 epochs. While not investigated, it seems that early stopping could help in this case, as it could prevent GE from increasing after a certain number of epochs. The last combination we tested is MLP and HW, where we reach a GE of 1 when training epoch by epoch with 50 epochs. In most cases, we could get GE close to 1. In many cases, we also see capacity in the model to learn the new dataset, where early stopping could be beneficial, as GE seems to deteriorate after some epochs. On the other hand, the training modifications did not help reach ge* = 1 for the MLP and the ID leakage model for the DPAv4.2 dataset. Possibly, the autoencoder requires improvements, but we also see that the results for this specific case were better with standardization, so including standardization with the training modifications might help. Additionally, from the setup with training epoch by epoch, the minimum GE occurs around 50 epochs and gets larger as we train for the entire 100 epochs. Thus, early stopping might also be beneficial in this case, along with other training alternatives.
Here, we see that the results using data encoded with ae_cnn_*_best_700 are better than those encoded with ae_mlp_*_best_700, both with and without standardization. In both cases, standardization made performance worse, so with transfer learning, we could opt not to use standardization, at least when the reused model is trained on the original dataset and then used for encoded data. However, more exploration can be done, as the sample might be small. Additionally, if we use transfer learning from a model trained on encoded data and then apply it to new encoded data, this conclusion about standardization may not be valid. Still, our experiments show significant benefits of transfer learning, where tuning the profiling model for a new dataset is eliminated and the training time reduced. That holds even though the data lies in different feature spaces, as the model is trained on original data and then transferred to the encoded data of other datasets.

Conclusions and Future Work
In this work, we proposed autoencoders to decrease the hyperparameter tuning effort of profiling models for new datasets. Hyperparameter tuning for profiling models in SCA is a necessary but time-consuming task, and those efforts are needed for each specific dataset. Thus, we propose reusing profiling models to reduce the effort for each new dataset by using autoencoders. The commonly used metric for autoencoders is MSE, which we showed to be positively correlated with the SNR difference between the original and reconstructed traces. Tuning autoencoders requires less effort, as the MSE metric directly reflects the reconstruction goal. On the contrary, with the classification of intermediate values, we need to perform GE calculations to validate the performance of the profiling model. Since those calculations are computationally expensive, they are not done during training, contrary to the MSE computation. Therefore, AEs are easier to tune and train than profiling models in SCA.
Concerning the observed results, we show that the hyperparameter tuning was not significantly better with original traces, which means that encoded data does keep relevant information for the attacks. We consider three portability cases enabled with autoencoders.
- Reusing a profiling architecture trained on one encoded dataset for other encoded datasets: This approach comes close to finding a universal profiling model, where all the datasets get encoded to the same feature size using autoencoders and then attacked with the same attack architecture. The results show good performance over the encoded datasets.
- Reusing a profiling architecture trained on an original dataset for other original and encoded datasets: Encoding a new dataset to the feature size the architecture was tuned for allows reuse despite different feature spaces, and encoded data is especially beneficial when the attack with the original data is unsuccessful.
- Transfer learning from a model trained on an original dataset to other encoded datasets: In most cases, we directly reach ge* = 1. In other cases, we also reach ge* = 1 with a bit longer training or training more layers. The benefit of transfer learning enabled by autoencoders is that we eliminate the hyperparameter tuning of the profiling model and significantly reduce training time for the new dataset.
In future work, CNN autoencoder types need to be more thoroughly investigated, as they are more powerful than MLPs for feature extraction. On the other hand, we should study what is represented in the latent space of autoencoders for SCA traces. We can also compare autoencoders as feature processing tools with classical approaches, such as principal component analysis (PCA). Finally, instead of running a DL-based SCA attack on encoded data, performing classical SCA on AE-encoded data would be interesting.

A Hyperparameter Search Spaces
We execute a random search over hyperparameter search spaces for autoencoders and profiling models. This section reports the hyperparameter search spaces for all of our experiments. The hyperparameter search space for MLP autoencoders in the initial experiments with the metric analysis and the best latent size search is in Table 19. The differences are in the number of neurons per layer for the different autoencoder types we use. Table 20 shows the search space for CNN autoencoders. For profiling models, the hyperparameter search space for MLP and CNN is in Table 21.
Lastly, we use autoencoders with latent size 700, so we report in Table 22 the number of layers and neurons per layer we allow. Other hyperparameters stay the same as in Table 19 and Table 20.

B Statistical Tests
Using a Friedman test, we identify that there is indeed a significant difference in the means of the groups. However, we need to find out which ones differ specifically, so a post-hoc test is necessary. One such test is the Nemenyi test, and using Python packages, we obtain the results for different latent sizes in Tables 23 and 24. Latent dimensions are in the rows and columns. The Nemenyi post-hoc test returns the p-values for each pairwise comparison of means. Using a significance level α = 0.05, the pairwise latent sizes with a significant difference are bolded. Table 23 shows the pairwise comparison for eight latent sizes because we exclude the results for the ae_mlp_str_dcr type, as with the specified hyperparameter search it did not work for latent size 500. Table 24 shows results including that AE type but excluding the latent size 500.
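For reference, one such Python package is scikit-posthocs; the sketch below runs the Nemenyi post-hoc test on a placeholder table with the same layout as in the Friedman test:

```python
import numpy as np
import scikit_posthocs as sp

# Rows are the dataset-AE type combinations (blocks), columns are the
# latent sizes (groups); random placeholder data stands in for the MSEs.
rng = np.random.default_rng(0)
mse_table = rng.random((8, 7))

# Pairwise p-values for all latent-size pairs; entries below 0.05 indicate
# a statistically significant difference between the two group means.
p_values = sp.posthoc_nemenyi_friedman(mse_table)
print(p_values.round(3))
```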

C Hyperparameters Values of Models from Experiments
This section provides information on the hyperparameter values for the models we use in our experiments. In Table 25, we show the hyperparameters of the best MLP and CNN profiling models for the DPAv4.2 dataset when encoded with the best-found ae_mlp_dpav42_best. These models correspond to the results in Table 7. Corresponding to Table 8, we provide the hyperparameter values for the best MLP and CNN models for the DPAv4.2 dataset when encoded with the best-found ae_cnn_dpav42_best in Table 26. The hyperparameter values of the best MLP and CNN models for the ASCADf dataset are shown in Table 27. The attack performance of those models is shown in Table 15.