Template attacks on nano-scale CMOS devices

Profiled attacks are widely considered to be the most powerful form of side-channel analysis attacks. A common variant is the Gaussian template attack, which fits a Gaussian distribution to model the behavior of the target device. Since profiled attacks build the model based on a device identical to the target device, manufacturing variation is an important factor for the success of such attacks. As feature sizes shrink, the influence of manufacturing variation on the power consumption of integrated circuits increases. It has been warned that this issue might render template attacks less effective. We evaluate this assumption on an ASIC design manufactured in a 40 nm technology. We characterize the introduced variation and show that it can be easily mitigated. By performing attacks on multiple samples of the same ASIC, we show that template attacks on small technology sizes are still successful.


Introduction
For today's embedded systems dealing with cryptographic primitives and secrets involved in cryptographic operations, side-channel analysis (SCA) attacks are considered one of the most serious threats. Especially differential power analysis (DPA) attacks as introduced by Kocher et al. [15], and later extended to correlation power analysis (CPA) attacks [3], have been proven to be powerful tools to extract secrets from cryptographic devices when the attacker has physical access to the target [10,23,24]. Along the same line, measuring the electromagnetic emanation (EM) of the device instead of its power consumption can lead to stronger attacks [11], since EM signals can be localized and are usually less influenced by unrelated parts of the circuit.
In general, such multi-query attacks are conducted under a black-box scenario, where no (or little) information about the device-under-test (DUT) is known. In contrast, profiling SCA attacks have access to and full control over a device identical to the DUT, with which the power consumption (or EM) characteristics of the device can be studied. Such a device is referred to as the profiling device. In short, during the profiling phase, the attacker collects as many measurements as required from such a device, whose entire intermediate values are known. Using such profiles, the attack is conducted on the DUT by means of a very small number of measurements (ideally a single one). The first and most popular method in this area is the Gaussian template attack (TA) introduced by Chari et al. [6]. During the profiling phase, based on the value of a chosen intermediate value, the attacker estimates multivariate Gaussian distributions (so-called profiles) using the measurements collected from the profiling device. Later, the attacker makes use of the profiles to predict the targeted intermediate value in each SCA measurement collected from the DUT. The Gaussian distribution is the most commonly used distribution for this kind of attack, although it is also possible to base it on other distributions. Due to the multivariate nature of this method, it can recover the secret using fewer measurements than attacks under a black-box scenario. Profiling attacks are even able to directly target secret values which are independent of the given inputs. For example, such attacks can target the key chunks (e.g., bytes) when they are transferred from memory to registers, e.g., for key schedules.

This work is partly supported by the German Research Foundation (DFG) through the Project 393207943 "Security for Internet of Things with Low Energy and Low Power Consumption (GreenSec)" and Germany's Excellence Strategy EXC 2092 CASA, 390781972.

Bastian Richter, bastian.richter@rub.de, Horst Görtz Institute, Ruhr University Bochum, Universitätsstr. 150, 44801 Bochum, Germany
Since the key schedule is independent of the cipher input (plaintext/ciphertext), the associated leakages cannot be exploited by multi-query DPA/CPA attacks. Apart from Gaussian template attacks, there are also other forms of profiling attacks which make use of machine learning techniques like support vector machines (SVM) [14,16] or deep neural networks (DNN) [17,19]. Notably, DNNs have shown very promising results when applied in cases where the implementation is protected by means of temporal noise, i.e., a randomized or jittery clock [4].
The steady shrinking of CMOS feature sizes continuously increases the speed and power efficiency of modern devices. The smaller structures consume significantly less dynamic power due to the lower resulting capacitances in the gates and in the shorter interconnects [33]. So far, mostly the dynamic power consumption contributed to the information leakage exploited by SCA attacks. However, due to the increase in static power consumption in smaller technology nodes, it can also be considered a source of information leakage [21,22,26]. Djukanovic et al. [9] and Bellizia et al. [2] also examined the application of multivariate attacks to static leakage by first investigating the influence of temperature on the information leakage and then creating multiple measurements at different temperatures for one attack to acquire multiple dimensions. The decrease in dynamic power consumption may complicate power analysis attacks on circuits built using the newest technologies, which can be considered a positive development.
More importantly, process variation is more pronounced in newer, smaller technology nodes. Since the concept behind profiling SCA attacks relies on the similarity of the profiling device and the DUT (and their corresponding power/energy characteristics), severe process variation has been expected to increasingly hinder successful profiling SCA attacks. Renauld et al. [27] examined the power variability of a nano-scale chip and its influence on SCA attacks. Based on practical experiments on a 65 nm circuit, they concluded that the variability of the power consumption patterns of different chips would make it very challenging to successfully conduct template attacks. Hence, it is expected that with decreasing feature sizes and more intensive process variation in the future, such attacks will become more and more difficult. According to personal communication with the authors, the device under test in [27] was a single AES S-box implemented as a fully combinatorial circuit without any registers at the input/output or control logic. The 8-bit input and 8-bit output signals of the S-box were provided as physical I/O pins of the chip. Although the authors placed external register banks (on PCB level) at the input and output of the chip, this leads to very dominant changes in the power consumption curves when the input of the S-box alters. Such changes are due to the activity of the energy-consuming I/O cells and the fan-out of the chip and appear as strong noise in the measurement. Further, ASIC samples differ slightly in their packaging, e.g., in the length of the bonding wires. Since no register is integrated into the targeted ASIC, the changes on the S-box input and output pins lead to varying amounts of power consumption in different ASIC samples. This can explain the variability that the authors observed in [27].
It is noteworthy that the susceptibility of devices to profiling attacks is usually examined under the worst-case scenario. More precisely, a single device is used in both the profiling and attack phases. Under such circumstances, it is examined whether the attack is successful when the DUT and the profiling device have very similar (ideally identical) power consumption characteristics. In this work, we conduct Gaussian template attacks on the AES encryption function implemented with a 40 nm ASIC standard-cell library. In contrast to the worst-case scenario, we evaluate the real-world applicability of such profiling attacks by examining 11 ASIC sample chips. This also includes analyzing a full AES implementation compared to the single S-box of [27]. Our experiments also enabled examining different intermediate values and models for the templates, which turned out to highly affect the results. As a side note, every ASIC sample includes seven AES encryption cores with identical netlists but different placement and routing. This allowed us to quantify how strongly the routing influences such profiling attacks when the profiling and attack devices do not share the same placement and routing. This can be of high interest when dealing with the selling and cloning of third-party IP cores.
In short, we found that the attacks are still easily possible. Based on our experimental results (only valid for the underlying technology of our prototyped ASIC samples), the increasing process variation in modern technology nodes does not strongly affect the success and feasibility of profiling attacks. To be more precise, such variations are easily compensated by already-available portability methods like mean compensation [20], usually used to compensate for other variations, e.g., in the measurement setup.

Template attacks
In Gaussian template attacks, it is assumed that the adversary is able to obtain a device identical to the DUT [6]. This enables a profiling phase in which the attacker uses the profiling device to build multivariate Gaussian models for the leakage associated with intermediate values. The built models can then be applied to attack the DUT with a small number of measurements.

Profiling phase
Following the notation of [7], we build multivariate Gaussian distributions for the leakages associated with the value k ∈ S, with S being the set of values of a chosen intermediate of the calculation. This can either be a plain value or a model like the Hamming distance (HD) of consecutive values stored in a register. For each value in S, we measure n traces, each represented as x = (t_1, ..., t_m) with m sample points t_i ∈ R. The m sample points in x are usually selected or compressed from the originally measured traces (this is discussed in more detail in Sect. 2.1.3). X_k ∈ R^{n×m} is then a matrix of measurements for the value k, with x_{k,i} representing the i-th row of matrix X_k. We can then compute the parameters needed to describe the multivariate Gaussian distribution, which are the sample mean vector x̄_k ∈ R^m and the sample covariance matrix S_k ∈ R^{m×m} for each value k ∈ S:

x̄_k = (1/n) Σ_{i=1}^{n} x_{k,i},    (1)

S_k = (1/(n-1)) Σ_{i=1}^{n} (x_{k,i} - x̄_k)^T (x_{k,i} - x̄_k).    (2)
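As an illustration of the profiling step (a minimal sketch with our own naming, not the authors' code), estimating a mean vector and covariance matrix per category reduces to:

```python
import numpy as np

def build_templates(traces, labels):
    """Estimate a Gaussian template (mean vector, covariance matrix)
    for every category k observed in `labels`.

    traces: (n_total, m) array of compressed traces (m points of interest)
    labels: (n_total,) array with the category value k of each trace
    """
    templates = {}
    for k in np.unique(labels):
        X_k = traces[labels == k]           # all traces of category k
        mean_k = X_k.mean(axis=0)           # sample mean vector, shape (m,)
        cov_k = np.cov(X_k, rowvar=False)   # sample covariance, shape (m, m)
        templates[k] = (mean_k, cov_k)
    return templates
```

`np.cov(..., rowvar=False)` applies the usual 1/(n-1) normalization from Eq. (2).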

Attack phase
In the attack phase, the profile x̄_k and S_k for a selected k ∈ S is applied to a single trace y measured from the DUT by calculating the Gaussian probability density function (pdf):

pdf(y; x̄_k, S_k) = 1 / sqrt((2π)^m · det(S_k)) · exp(-1/2 · (y - x̄_k) S_k^{-1} (y - x̄_k)^T).    (3)

Repeating this for all k ∈ S results in a set of values that can be used as discriminant scores to rank the k candidates. Knowing the input (or output) associated with the attack trace y, each key candidate is assigned to a category k and therefore receives its pdf score. Since a single attack trace is often not sufficient to recover the key, the results of multiple attack traces should be combined. In this context, it is reasonable to calculate the logarithm of the pdf instead:

log pdf(y; x̄_k, S_k) = -1/2 · (m · log(2π) + log det(S_k) + (y - x̄_k) S_k^{-1} (y - x̄_k)^T).    (4)

It offers the advantage that the scores acquired for different attack traces can simply be summed up. It also prevents some numerical instabilities which can occur during the computation. When exploiting the leakages associated with an intermediate value which depends not only on the secret key but also on the algorithm input (or output), multiple attack traces can be recorded for different inputs to exploit the full distribution. This process is usually called a template DPA [18].
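A sketch of this attack phase is given below (our own naming): the log-pdf scores of all attack traces are summed per key candidate. The `category_of` mapping from a key hypothesis and the known input of a trace to a category k is a hypothetical interface that depends on the chosen model.

```python
import numpy as np

def log_pdf(y, mean, cov):
    """Logarithm of the multivariate Gaussian pdf (the log-pdf form)."""
    m = len(mean)
    d = y - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (m * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def score_candidates(attack_traces, category_of, templates, n_cand=256):
    """Sum the log-pdf scores of all attack traces for every key candidate.

    category_of(key, i): maps a key hypothesis and trace index i (via the
                         known input of trace i) to a category k
    templates:           dict k -> (mean vector, covariance matrix)
    """
    scores = np.zeros(n_cand)
    for key in range(n_cand):
        for i, y in enumerate(attack_traces):
            mean_k, cov_k = templates[category_of(key, i)]
            scores[key] += log_pdf(y, mean_k, cov_k)
    return scores  # the highest score marks the most likely candidate
```

Summing log-pdf values corresponds to multiplying the per-trace probabilities while avoiding numerical underflow.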

Points of interest
The selection of points of interest, i.e., the points which contain the highest amount of information associated with the chosen intermediate value, is a crucial step in the preparation of template attacks. Although Gaussian multivariate templates can benefit from correlated noise in additional points, there is a sweet spot in the number of added points which has to be found. Including additional non-informative points can degrade the matching with the calculated templates. Further, the computational complexity depends quadratically on the number of points used for the templates. In order to select these points, the attacker can either use some metric to directly select certain points of the traces or use dimensionality-reduction methods to compress a part of the trace into a small number of derived points. Several metrics are available for directly selecting the points of interest. Estimating the difference of means (DOM) over different k values has been proposed in the original work introducing Gaussian template attacks [6]. Other methods include (1) the sum of squared differences (SOSD) [12], which amplifies larger differences and prevents the cancellation of smaller signals with alternating signs, and (2) the points with maximum correlation as the result of a CPA. Leakage detection tests like the t-test [28] might also be used but have to be modified to detect the leakage of specific intermediate values and not of the whole computation. Alternatively, the signal-to-noise ratio (SNR) [18] can be used, which is defined as the ratio between the variance of the categories' means and the variance of the traces within the same category:

SNR = Var(E(X_k)) / E(Var(X_k)),    (5)

with Var(.) standing for the variance and E(.) for the expected value, where the outer operators are taken over the categories k and the inner ones over the traces within a category. Dimensionality reduction is most often performed by means of principal component analysis (PCA) [1] or linear discriminant analysis (LDA) [29].
While PCA maps the input dimensions (points of the traces) to orthogonal dimensions with maximum variance, LDA maximizes the ratio between inter-class and intra-class variance in the consecutively added orthogonal dimensions. In our experiments, we did not use any dimensionality reduction. In order to allow a comparison of our results with those of [27], and to avoid distorting the results by a dependence on the preprocessing step, we select the points of interest by means of the SNR (more details in Sect. 3).
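The SNR-based point selection can be sketched as follows (function and variable names are our own):

```python
import numpy as np

def snr(traces, labels):
    """Per-sample-point SNR: the variance of the per-category mean traces
    divided by the average within-category variance."""
    categories = np.unique(labels)
    means = np.stack([traces[labels == k].mean(axis=0) for k in categories])
    noise = np.stack([traces[labels == k].var(axis=0) for k in categories])
    return means.var(axis=0) / noise.mean(axis=0)

def top_pois(traces, labels, n):
    """Indices of the n sample points with the highest SNR."""
    return np.argsort(snr(traces, labels))[::-1][:n]
```

Sample points carrying no information about the category yield an SNR close to zero and are therefore never selected.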

Known improvements
There are different factors which can cause problems when building the templates or lead to a mismatch between the created templates and the measurements collected from the DUT. These might be minimal changes, e.g., environmental ones, during the measurements, or numerical and statistical issues when building the templates with too few traces. Below, we shortly restate the common techniques known to compensate for such effects.

Pooled covariance matrix
If too few traces are used to calculate the covariance matrix, it can happen that it is singular and thus not invertible (see Eqs. (3) and (4)). This often occurs if the categories of the TA are not equally likely, so that some categories receive only few profiling traces. As an example, this can happen when the input of the profiling device is selected randomly while the categories are defined based on the HD of an intermediate value. This can cause the covariance to be poorly estimated and thus inaccurate. Because classical TAs assume Gaussian noise for each sample point, templates with the same points of interest often exhibit the same or a very similar covariance. Hence, it is possible to build a pooled covariance matrix for all templates instead of separate ones for each category. In our experiments, since we worked with a high number of training traces, we did not encounter the aforementioned problems. Thus, we calculated separate covariance matrices for each category.
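A pooled covariance matrix could be estimated as in the following sketch (our own naming): the mean-free scatter is accumulated over all categories, which implicitly weights each category by its number of traces.

```python
import numpy as np

def pooled_covariance(traces, labels):
    """One covariance matrix shared by all templates: the scatter of the
    mean-free traces, pooled over all categories."""
    n, m = traces.shape
    categories = np.unique(labels)
    centered = np.empty_like(traces, dtype=float)
    for k in categories:
        idx = labels == k
        # remove each category's own mean so only the noise remains
        centered[idx] = traces[idx] - traces[idx].mean(axis=0)
    # unbiased normalization: one degree of freedom lost per category mean
    return centered.T @ centered / (n - len(categories))
```

With a single category this reduces to the ordinary sample covariance of Eq. (2).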

Mean adjustment
The main factors which interfere with the matching are parasitic resistance and capacitance introduced by the measurement setup. These parameters can be different in case of the profiling device and the DUT. Also, tolerances in the manufacturing or the packaging of the chip can have an influence, e.g., increased resistance due to longer bond wires. It has been shown in [8] that the main difference between measurements of different devices on the same setup is usually a DC offset.
In order to tolerate this, the attacker can shift the attack traces to make their mean equal to that of the profiling traces:

x_adjusted = x - x̄_attack + x̄_train.    (6)

Here, x̄_train is the mean over all training traces and x̄_attack the mean over all attack traces. The attacker then uses x_adjusted for the attack. When performing a template DPA, it is initially not known how many traces are needed to succeed. Hence, one could record a set of traces of a predefined size and calculate its mean. But this mean then includes information from traces which might not be used for the rest of the attack, so the number of traces actually used for the attack would be misleading. Thus, we decided to estimate x̄_attack incrementally and updated it for each trace added to the attack set. The requirement to calculate the attack device's mean trace restricts the attack to cases in which enough attack traces recorded with different inputs are available to properly approximate the mean, i.e., a reasonably uniform sample over the different classes should be acquired to avoid a bias in the mean approximation. For example, a one-trace attack is not possible if mean adjustment is needed. Methods based on dimensionality reduction can help here as they might be able to compensate for the mean difference within one trace.
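The incremental estimation of x̄_attack might look as follows (a sketch under our own naming, not the authors' exact code): the running mean of the attack set is updated after each added trace and used to shift that trace toward the profiling mean.

```python
import numpy as np

def mean_adjust_incremental(attack_traces, profiling_mean):
    """Shift each attack trace by the difference between the running mean
    of the attack set (re-estimated after every added trace) and the
    mean over all profiling traces."""
    running_sum = np.zeros_like(profiling_mean, dtype=float)
    adjusted = []
    for i, x in enumerate(attack_traces, start=1):
        running_sum += x
        attack_mean = running_sum / i          # incremental estimate of the attack mean
        adjusted.append(x - attack_mean + profiling_mean)
    return np.stack(adjusted)
```

Note that early traces are adjusted with a mean estimated from very few traces, so the compensation improves as the attack set grows.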

Key ranking/enumeration
Performing a TA results in probabilities (logarithmic ones when Eq. (4) is used) for each candidate for a key portion (e.g., a byte). The question is then how difficult it would be to perform a search for the full key, e.g., when 16 different TAs are performed, each for one byte of a round key of an AES encryption. The first problem is to find an algorithm which searches through the key space in an optimal way based on the probabilities resulting from the TAs. A key enumeration algorithm that performs such a key search is presented in [31]. Unfortunately, launching an actual attack is very time-consuming and might be infeasible if the remaining entropy is too high. Thus, to estimate how difficult an attack might be in the future or with higher computational power, an algorithm is needed to estimate the rank of the correct key in a security evaluation scenario. Such a ranking algorithm is presented in [32]. This algorithm works by carving boxes out of the key space, thereby approximating the volume of the key space with higher probability than the correct key. The algorithm continues until all boxes defined by the sub-key candidates are processed or a given tightness of the bounds is achieved.
For the evaluation of our experiments, we used Algorithm 2.1 introduced in [13]. Suppose that the target key is split into N_p portions, e.g., into 16 parts in the case of an AES-128 round key. The algorithm operates on N_p histograms generated over the probabilities (or log-probabilities) resulting from the N_p template attacks, given the set of measurements Y. To this end, the range between the minimum and maximum of all probabilities (or log-probabilities) is divided into N_bin bins.
For each of the N_p key parts, the bins of the respective histogram H_i are incremented for each key candidate whose probability falls into the respective range, resulting in N_p histograms. As input, the algorithm receives these histograms H_i, 1 ≤ i ≤ N_p, and the probability Pr[k*|Y] of the correct key given the measurements Y. The histograms are iteratively convolved with each other, and the rank of the correct key is estimated by summing up the values in the bins representing higher probabilities than that of the correct key. By increasing N_bin, the bounds of the estimation can be tightened. This algorithm offers a very fast estimation of the remaining entropy, which is the base-2 logarithm of the rank, with tight bounds. As we later use the average ranking returned by the algorithm to calculate the entropy, it is more precisely the remaining guessing entropy [30], which directly relates to the average remaining workload of the side-channel attacker. However, the results of a ranking algorithm are only relevant if there is a corresponding key enumeration algorithm with which the correct key would achieve the calculated rank. In this case, the corresponding enumeration algorithm was presented by Poussier et al. [25].
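A heavily simplified version of such a histogram-convolution rank estimation (our own naming, omitting the careful lower/upper bounding of [13]) could look like this:

```python
import numpy as np

def estimate_rank(log_probs, correct_key, n_bins=2048):
    """Simplified histogram-convolution key-rank estimation.

    log_probs[i][k]: log-probability of candidate k for key part i
    correct_key:     tuple of the correct sub-key values
    Returns an estimate of the number of full keys scoring at least
    as high as the correct one.
    """
    n_parts = len(log_probs)
    lo = min(a.min() for a in log_probs)
    hi = max(a.max() for a in log_probs)
    edges = np.linspace(lo, hi, n_bins + 1)
    # one histogram per key part over its candidates' log-probabilities
    hists = [np.histogram(a, bins=edges)[0].astype(float) for a in log_probs]
    conv = hists[0]
    for h in hists[1:]:
        conv = np.convolve(conv, h)        # distribution of summed scores
    # bin of the correct key's summed score in the convolved histogram
    correct_score = sum(a[k] for a, k in zip(log_probs, correct_key))
    bin_width = edges[1] - edges[0]
    idx = min(int((correct_score - n_parts * lo) / bin_width), conv.size - 1)
    return conv[idx:].sum()                # keys ranked at or above the key
```

Convolving the histograms counts, per summed-score bin, how many full-key combinations reach that score, so the tail sum above the correct key's bin approximates its rank.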

Target
The target of our experiments is an implementation of the AES block cipher on an ASIC prototype which has been manufactured in a 40 nm technology and is bonded into a JLCC68 package. Each ASIC sample contains 7 AES cores which are synthesized from the same RTL design and thus have the same netlist (except for the adjustment of drive strengths) but differ in their placement and routing. The cores are placed next to each other in a defined area, as illustrated by the different colors in Fig. 1. The underlying implementation follows a byte-serial architecture, i.e., only one instance of the S-box is implemented. Trivially, in order to hold the cipher state, the design contains a 128-bit register (marked as Data Reg in Fig. 2) in which each byte can be addressed individually. The S-box, which is based on Canright's design [5], is split up by two registers Z and C before and after the inversion, which enables pipelining the operations. After the plaintext is byte-serially loaded into the Data Reg, the SubBytes operation is performed byte-wise in a particular order fitting the ShiftRows operation. This is enabled by the pipeline structure formed by the registers in the S-box module. Since the corresponding SubKey byte is XORed to the S-box input, the AddRoundKey, SubBytes, and ShiftRows operations are performed in 18 clock cycles in total. Afterwards, MixColumns is initiated by storing four bytes of one column into the Mix Col In register. The Quarter MixColumn module calculates one byte of the MixColumns output, which is stored back into the Data Reg. By rotating Mix Col In, the other bytes of the MixColumns output are calculated. In total, the entire MixColumns operation is performed in 32 clock cycles (8 clock cycles per column).
Further, the same S-box module is used by the KeySchedule module (not shown in Fig. 2) to calculate the next SubKey. Hence, 6 clock cycles are spent during the KeySchedule to perform the four required S-box lookups, and 12 clock cycles for the XOR operations on the remaining SubKey bytes. In sum, except for the last one (which only needs 36 clock cycles due to the missing MixColumns operation), every cipher round (including the KeySchedule) needs 68 clock cycles, and a full encryption terminates in 648 clock cycles, excluding the clock cycles required to load the plaintext bytes and send out the ciphertext bytes.
For practical measurements, we made use of a single toolkit board which hosts our packaged ASIC samples by means of a socket. In order to measure the SCA leakages of each ASIC sample, we just exchanged the chip. The SCA traces have been recorded by a Teledyne LeCroy HDO6054 digital sampling oscilloscope at a sampling rate of 1.25 GS/s. For all measurements, the AES core was clocked at 4 MHz by an internal oscillator. After some initial tests, we decided to use a Tektronix CT-2 AC current probe placed in the V_DD path, since it provided measurements with less noise than measuring the voltage drop over a shunt resistor. Additionally, the output signal of the current probe was amplified by a Mini-Circuits ZFL-1000LN+ amplifier (with 20 dB gain).
We had access to 11 ASIC samples of the fabricated design. For each of the 7 cores in every ASIC sample, we recorded 10 million profiling traces with random plaintexts and random keys. Additionally, for each (core, chip) combination, we collected 1000 sets of attack traces, each of which contains 1000 traces measured for a fixed (but arbitrarily selected) key while the plaintext was provided randomly. In other words, we collected 10 million profiling traces and 1 million attack traces, where each set of 1000 attack traces belongs to a unique key. This enables us to first examine under which model (used to define the categories in TAs) the measured traces show a high dependency on the intermediate values. Then, we can apply this model in inter-chip TAs to evaluate how manufacturing variability and the setup affect the portability of the templates. More precisely, in inter-chip attacks we perform TAs on core x of chip y while the profiles have been built using the traces measured from the same core x of a chip z ≠ y. We further evaluate the influence of the placement and routing of the target core by performing intra-chip attacks, i.e., we conduct TAs on core x of chip y using the profiles constructed from a core z ≠ x of the same chip y.

Model selection
Since the AES architecture processes the S-boxes serially, the update of the Data Reg is an obvious candidate for an 8-bit model. At the beginning of the first cipher round, the register contains the plaintext, which will then be overwritten by the S-box lookups in ShiftRows order. The resulting model is the HD between these consecutive values, i.e., HW(P_i ⊕ SR_i), with P_i and SR_i the i-th byte of the plaintext and of the ShiftRows output, respectively. Considering the largest combinatorial circuit of the design (i.e., the S-box), the HD of consecutive S-box output values, again in ShiftRows order, HW(SR_i ⊕ SR_{i+1}), is also expected to be a valid model. There are also the C and Z registers in the S-box module (see Fig. 2) which enable the pipelining. Thus, we also examined the HD of consecutive values in these two registers, i.e., HW(C_i ⊕ C_{i+1}) and HW(Z_i ⊕ Z_{i+1}) in ShiftRows order. For completeness, we also added the HW of the S-box output HW(SB_i) to the list of our considered models. Another large combinatorial circuit in the underlying AES core is the Quarter MixColumns module. Since it processes 32-bit key-dependent intermediate values, we did not include its leakage in our list. In addition to the HW in the aforementioned models, we considered their pure 8-bit values as well, i.e., without the HW operator.
In order to compare the considered models, we estimated the SNR for a single S-box lookup (i.e., one index i in the aforementioned models) following the concept illustrated in [18]. The results are shown in Fig. 3 (SNR of different models for one S-box lookup) and indicate that the HD between the plaintext and the ShiftRows output stands out with the highest SNR of 0.68, more than twice as high as the second best model with 0.32. The same is observed for the other S-box lookups, i.e., other indices i. For each byte, the best model depends on only one key byte. This implies that, when using this model in an attack, we have a direct 8-bit key candidate since the plaintext is known, in contrast to the second best model HW(SR_i ⊕ SR_{i+1}), where two key bytes would have to be guessed. All models without the HW operator showed a smaller SNR than their corresponding HW model.
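For illustration, the category labels of the best model HW(P_i ⊕ SR_i) could be derived as below. The column-major byte ordering of the AES state and the `sbox` table (passed as a parameter here rather than spelled out) reflect our own assumptions, not the authors' code.

```python
import numpy as np

# ShiftRows source index: output byte i (row i % 4, column i // 4 in the
# column-major AES state) comes from input byte SHIFT_ROWS[i]
SHIFT_ROWS = [(i + 4 * (i % 4)) % 16 for i in range(16)]

def hd_labels(plaintexts, key, sbox, i):
    """Category HW(P_i ^ SR_i) for state byte i: the Hamming distance between
    the plaintext byte and the first-round SubBytes/ShiftRows output byte
    that overwrites it in the Data Reg."""
    src = SHIFT_ROWS[i]
    return np.array([bin(p[i] ^ sbox[p[src] ^ key[src]]).count("1")
                     for p in plaintexts])
```

With known plaintexts, each label depends on a single key byte (key[src]), which is exactly why this model yields a direct 8-bit key candidate.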

Attack in worst-case scenario
When a single device is used for both profiling and attack, this defines a baseline for the best result which can be achieved by an attacker, i.e., the worst-case scenario with respect to vulnerability. Consequently, we use this setting, analyzing a single core in a single chip, for profiling and for tuning the parameters of the subsequent attacks.

Points of interest
Another important step after choosing a model is the selection of the points of interest (POIs) corresponding to the selected model. Since we do not apply a dimensionality-reduction algorithm, we chose our POIs based on the SNR. We chose either the n points with the highest SNR or additionally considered a minimum distance of d time samples between any two selected points. This can be performed efficiently by first calculating the templates for a high number of POIs, e.g., 200 points, kept ordered by SNR. Then, in the attack phase, we can pick the points which fulfill our requirements, e.g., the 20 points with the highest SNR and a minimum distance of 3, and adjust the templates accordingly. Such an adjustment is done very efficiently by copying the corresponding elements of the mean vector and the covariance matrix. Only the inverse of the covariance matrix needs to be recalculated.
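The greedy selection with a minimum distance could be sketched as follows (our own naming):

```python
import numpy as np

def select_pois(snr_values, n, min_dist=1):
    """Greedily pick up to n sample points in decreasing SNR order,
    skipping any point closer than min_dist to an already selected one."""
    order = np.argsort(np.asarray(snr_values))[::-1]   # highest SNR first
    chosen = []
    for p in order:
        if all(abs(int(p) - q) >= min_dist for q in chosen):
            chosen.append(int(p))
        if len(chosen) == n:
            break
    return chosen
```

With min_dist=1 this degenerates to simply taking the n highest-SNR points.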
As stated in Sect. 3.2, we made use of the HD between the plaintext and the ShiftRows output as the model to build the templates. For the selection of POIs, we examined different parameters to find an optimal combination. We checked the number of points from 200 down to 100 in steps of 25, and from 100 down to 5 in steps of 5 points, always selecting the points with the highest SNR. Our experiments indicate that 100 points lead to the best result for our targeted chip. We also tested a minimum distance between the points from 2 up to 5 samples, which did not improve the attacks. Hence, we build our templates by means of the 100 points exhibiting the highest SNR for the model HW(P_i ⊕ SR_i). The POIs were selected on Core 1 of Chip 1 and then kept for the remaining tests.

Influence of environmental noise
In order to quantify the influence of variations in the measurement setup, including the temperature, we first collected the profiling and attack traces for all cores in each chip directly after each other. We refer to this set of measurements as 'old'. This means that the setup ran with different chips for many days and has been dis- and reconnected multiple times between the measurements (to swap the chips). After finishing the measurements for all cores and all chips, we recorded one more set of profiling traces for the first core of the first chip, which we refer to as 'new'. This means that (as given in Sect. 3.1) we have 1000 sets of attack traces (each containing 1000 traces for a fixed key), while two sets of 'old' and 'new' profiling traces are available (each consisting of 10 million traces).
We have considered four cases for performing the attacks: two cases trivially correspond to the 'old' and 'new' profiling traces, and both are repeated with the mean adjustment explained in Sect. 2.2.2 applied. For each case, we performed 1000 different attacks and applied the key-ranking algorithm given in Sect. 2.3 to obtain the remaining entropy. Figure 4 shows the average remaining entropy for all four cases over the number of used attack traces.
Without mean adjustment, the attack does not work well when using the 'new' set of profiling traces. It achieves an average remaining entropy of 113 bits after 1000 attack traces. In comparison, when the 'old' profiling traces are used, the attack needs 830 traces to achieve a remaining entropy of less than 1 bit. The entropy already reaches 8 bits after 350 traces, at which point the correct key can be found in a space of 2^8 by an enumeration algorithm, e.g., [25]. When the mean is adjusted, the attacks in both cases are significantly improved: the remaining entropy of less than 1 bit is achieved after 120 traces using the 'old' profiling traces and after 140 traces by the 'new' set (Fig. 4: remaining entropy of attacks using 'old' and 'new' profiling traces with and without mean adjustment (MA), averaged over 1000 attacks). Hence, for the following experiments, we assume that the differences caused by the measurement setup are mostly compensated by the mean adjustment.

Inter-chip attack
After assessing the worst-case scenario, we proceed to the real-world scenario where the training and attack traces do not belong to the same device. To this end, we concentrated on one particular core in all 11 chips. In order to get an intuition about the degree of variability between different chips, Fig. 5a shows the corresponding 11 mean traces estimated for a certain category of the underlying model, i.e., HW(P_i ⊕ SR_i). For comparison, Fig. 5b presents the mean traces of all 9 categories belonging to a single chip. This is indeed the signal that we are trying to exploit in the attacks. As the figures show, the variability between the chips is greater than the actual exploitable signal.

(Fig. 7: number of required traces to achieve a remaining entropy of less than 1 bit in inter-chip attacks on a unique core, with mean adjustment, averaged over 1000 attacks)

Figure 6a shows the result of the attacks on this core on all chips using the templates built based on a single chip. Directly applying the templates to the attack traces leads to widely differing results, even when using 1000 attack traces. This indeed confirms the assumed problems of transferring the templates to other chips. Therefore, as discussed before, the mean adjustment technique (see Sect. 2.2.2) should be applied during the attack phase, which drastically improves the results. Doing so, the remaining entropy for all attacks similarly reached 0 bits after around 110-120 attack traces, independent of whether the profiling and attack devices are different or the same (see Fig. 6b). This confirms that applying the mean adjustment compensates not only for the differences in the measurement setup and environmental noise but also for the manufacturing variation of the chips.
Subsequently, concentrating on a single core, we repeated this scenario for all combinations, i.e., all chips for profiling and all chips for measuring the attack traces. Figure 7 lists the number of traces required to achieve a remaining entropy of less than 1 bit for all combinations of profiling and attack chips. It clearly highlights the influence of the measurement variations. While the attacks on, e.g., chips 1 and 7 succeed with 100 to 120 traces, the attacks on chip 8 strongly deviate from the others and require 160 to 200 traces. Also, the templates built based on chip 11 perform worse compared to all other cases.
It is noteworthy that we repeated the same experiments on all 6 other AES cores of the targeted chips. Due to their similarity to the presented results, we omit showing the corresponding outcomes.

Intra-chip attack
Another interesting aspect is whether cores with the same architecture and even the same netlist exhibit similar leakage characteristics when placed and routed differently. To examine this, we performed template attacks on all cores using the templates built based on a certain core. Performing the attacks without any adjustment was not successful: when the attack core is not the same as the profiling core, the remaining entropy stays above 100 bits, as can be seen in Fig. 8a. Applying mean adjustment makes the key recovery feasible again. While the attack on the same core as used for profiling needs around 110 traces to reach an entropy of 0, between 350 and 1000 traces are needed to reach the same remaining entropy when attacking the other cores (see Fig. 8b). Hence, a major component of the difference between the cores' leakage is a DC offset, which is compensated by the mean adjustment. It also means that the significant components of the power consumption are only partially affected by the routing, and the relative difference between the values is still approximately preserved. Note that another possible compensation is to additionally adjust the standard deviation [20]. This might reduce the difference between the cores' leakage characteristics even further. In contrast to mean adjustment, however, the variances have to be estimated for each category of the underlying model. This leaves only a low number of samples per category, since the aim of TAs is to make use of a very low number of attack traces. Hence, variance adjustment is not necessarily helpful when the number of attack traces is limited, as is the case in our experiments.
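As a rough illustration of the variance-adjustment idea from [20], the following sketch standardizes the attack traces toward the profiling device's statistics. For brevity it shows a global variant; as noted above, the proper per-category variant would estimate the standard deviations separately for each class of the model. Names are illustrative, not from [20]:

```python
import numpy as np

def mean_and_variance_adjust(attack_traces, prof_mean, prof_std, eps=1e-12):
    """Match the attack traces' mean and standard deviation to the
    profiling device's statistics (global sketch, not per-category).

    attack_traces : (n_traces, n_samples) array from the target device
    prof_mean     : (n_samples,) mean trace from the profiling device
    prof_std      : (n_samples,) standard deviation from the profiling device
    """
    att_mean = attack_traces.mean(axis=0)
    att_std = attack_traces.std(axis=0)
    # standardize, then rescale to the profiling statistics
    return (attack_traces - att_mean) / (att_std + eps) * prof_std + prof_mean
```

The caveat from the text applies directly: with few attack traces, the per-category standard deviations are estimated from very few samples, so the rescaling can add more estimation noise than it removes.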

Conclusion
We have shown that template attacks still pose a high risk to integrated circuits manufactured in small technology nodes like 40 nm. While the manufacturing variation in our samples clearly leads to variation in the power consumption, even exceeding the actual data-dependent leakage, this can easily be accounted for by adjusting the mean of the attack measurements. This way, the attacks on different chips achieved nearly the same performance as on the profiling chips. Our results differ from those of Renauld et al. [27], who concluded that template attacks will be very challenging for small technology nodes after evaluating a 65 nm prototype with no internal registers around the s-box. This highlights the importance of evaluating the whole cipher design, or at least placing registers around the circuits, as the I/O pins can be a major source of noise.
Additionally, we have shown that template attacks on the same RTL design but with different routing are possible with mean adjustment. Although these do not perform as well as attacks on the same core, they are still feasible.

Future work
In this work, we intentionally kept the conducted template attack as basic as possible to better expose the variation introduced by the manufacturing. Aside from the mean adjustment, we did not apply any other preprocessing. However, often some form of dimensionality reduction is performed on the traces to reduce the number of sample points and to make the attack more robust against noise. One future aspect might be to also consider methods like PCA [1] or LDA [29]. Even smaller technologies can further increase the variability, and there is also the question of whether, at some technology node, other effects might introduce variabilities which cannot be easily corrected.
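To make the dimensionality-reduction idea concrete, a minimal PCA sketch in the spirit of [1] might project each trace onto its leading principal components before template building; a real evaluation would likely use a tuned library implementation instead. Names and the SVD-based approach are our illustrative choices:

```python
import numpy as np

def pca_reduce(traces, n_components):
    """Project traces onto their leading principal components.

    traces       : (n_traces, n_samples) matrix of measured traces
    n_components : number of retained dimensions (points of interest)
    Minimal sketch for illustration, not the preprocessing of this work.
    """
    centered = traces - traces.mean(axis=0)
    # rows of Vt are the principal directions, sorted by singular value
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T
```

Reducing each trace from thousands of sample points to a handful of components keeps the template covariance matrices small and well-conditioned, which is one reason such preprocessing tends to make the attack more robust against noise.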
Based on the results of our attacks on different placement and routing, it might be interesting to see whether it is possible to use devices which are of the same chip family but not identical for profiling, assuming that these use the same RTL design for the underlying cipher.