
1 Introduction

Side-channel attacks are a major threat to the security of modern embedded devices. If no particular attention is paid, the exploitation of physical leakages such as the power consumption and the electromagnetic radiation of a cryptographic implementation can lead to successful key recoveries, e.g., [2, 16, 27, 44, 58]. As a consequence, a vast literature has proposed potential solutions to defeat such attacks.

Probably the most investigated and best understood protection against side-channel attacks is masking [12, 15, 46]. The underlying principle of masking is to represent any sensitive variable in the implementation by d shares in such a way that the computations are performed only on these shares. Assuming that the leakages of the shares are independent of each other, a successful key-recovery attack needs to observe at least the dth-order statistical moment of the leakage distributions, where the corresponding complexity increases exponentially with d.

However, the independence of the leakages associated with the shares is an assumption which is usually violated in hardware applications. As an example, the masked AES Sbox designs [11, 39], where the glitches are ignored, failed in practice to satisfy the desired security level, i.e., first-order resistance [25, 32]. Instead, based on Boolean masking and multiparty computation, threshold implementations (TI) [37, 38] can ensure first-order resistance in the presence of glitches. Indeed, not only are its underlying principles sound and realistic, but practical investigations have also confirmed its effectiveness [4, 33]. Naturally, higher-order attacks are feasible on TI designs [4, 26], which motivated the work presented in [5], where the concept of higher-order TI is introduced by extending the definitions to any order. Apart from its significant overhead (e.g., requiring at least \(d=5\) shares for second-order security), the note given in [45], later practically confirmed in [49], made clear that the definitions of higher-order TI remain valid only in univariate scenarios.

Our Contribution. Indeed, it is known to the community that hiding techniques (in particular power-equalizing approaches) are not capable of preventing key-recovery attacks on their own. It is always suggested that such techniques should be combined with other countermeasures, but the benefit of such a combination has never truly been examined for a hardware platform. More precisely, exploiting higher-order leakages becomes extremely hard in practice when the leakage traces are sufficiently noisy [43]. Along the same lines, power-equalization schemes are also expected to reduce the signal (versus the noise) and have the same effect. To the best of our knowledge, the only work which tried to proceed toward this goal is [30], where a flawed masking scheme [11] has been implemented in a glitch-free setting. However, no particular attention was paid to equalizing the power there, so it does not constitute a concrete hiding technique.

Our contribution in this work is to examine the benefit of combining two sound hardware-based countermeasures. More precisely, we aim at considering a provably (first-order) secure masking scheme (TI) and realize it under the principles of a proper power-equalizing technique (GliFreD). We pursue an investigation of our combined construction compared with:

  • the same masking design (first-order TI) without employing any hiding technique, and

  • the second-order TI of the same design excluding any power-equalization scheme.

Such comparisons, with respect to the data complexity of leakage detection as well as the time and area overheads of the designs, allow us to gain an overview of the tradeoff between the gains and overheads of the different countermeasures as well as their combination.

Since the design overheads are application specific, we consider two design methodologies: first, a fully serialized architecture for lightweight applications with the KATAN-32 cipher and second, a parallelized architecture for high-speed applications with the PRESENT cipher. Amongst our achievements in this work, which include a second-order TI of PRESENT, are the designs we developed with a combination of GliFreD and the first-order TI (of both KATAN-32 and PRESENT), which were shown to be secure using up to 1 billion power traces measured from a Spartan-6 FPGA platform.

2 GliFreD

Dual-rail Precharge Logic (DPL) schemes are popular side-channel countermeasures for hardware circuits and are assigned to the group of hiding techniques. Each DPL scheme places two complementary (true and false) circuits on a device to ideally decorrelate the power consumption from the processed data. All DPL schemes have to deal with some implementation challenges. The three major challenges that FPGA-based DPL designers face are early propagation, glitches, and different wire capacitances of coupled signals. GliFreD is a DPL scheme exclusively designed for FPGAs, and is amongst the few schemes which address all three problems [56].

To overcome the aforementioned problems, GliFreD defines the following design methodology. Each Look-Up Table (LUT) instance is connected to two global control signals, CLK and active; the latter toggles at half the frequency of the former. These control signals determine whether the LUTs reside in the precharge or the evaluation phase. Hence, the regulated LUT transitions avoid early evaluation as defined in [50]. To prevent the propagation of the LUT output transition, a register is connected to each LUT output. However, a single register stage in a DPL circuit contradicts the requirement of a constant number of gate and register transitions per clock cycle [28], as inconstant and data-dependent transitions would result in data-dependent leakage. Therefore, the GliFreD principles require placing an even number of register stages between any two connected LUTs in the circuit. Consequently, GliFreD forms a pipeline architecture which prevents glitches by halting the propagation of a signal after each LUT. Figure 1(a) shows the timing diagram of a GliFreD circuit.

Similar to many DPL schemes, GliFreD also needs to place a dual of the circuit. Copying the routing structure is currently the best known way in FPGAs to keep the wire capacitances of the false circuit as close as possible to those of the true circuit. Hence, to perform the circuit dualization, i.e., placing the false circuit, a second, horizontally shifted instance of the true circuit is placed on the FPGA. The copy process is performed at the netlist level to pass on the routing information to the false circuit.

GliFreD allows an arbitrary LUT configuration; however, since both control signals CLK and active should be connected to each LUT, the function f each LUT can realize is limited to a 4-to-1 look-up table. The output of each LUT can be seen as \(\mathrm {O}=\mathtt{active } \cdot \overline{\mathtt{CLK }} \cdot f(\mathrm {I}_2,\ldots ,\mathrm {I}_5)\), while the corresponding dual function (of the false circuit) becomes \(\overline{\mathrm {O}}=\mathtt{active } \cdot \overline{\mathtt{CLK }} \cdot \overline{f(\overline{\mathrm {I}_2},\ldots ,\overline{\mathrm {I}_5})}\). Figure 1 shows the GliFreD counterpart of an exemplary function

$$\begin{aligned} y = x_0 + x_0x_3 + x_2x_3 + x_3x_4 + x_3x_6 + x_0x_7 + x_2x_7, \end{aligned}$$
(1)

whose standard implementation is shown in Fig. 1(b).

Fig. 1. An exemplary function implemented in a standard 6-to-1 LUT architecture and its GliFreD representation including the timing diagram
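As an illustration of the two LUT output equations above, the following Python sketch models a single true/false LUT pair at the bit level. It is only a behavioral sketch under stated assumptions: the 4-input function `f` is a hypothetical example (it is not the function of Eq. (1)), the false rail is assumed to carry the complemented input signals, and no timing, routing, or register stage is modeled. The check confirms that both rails are 0 whenever the LUT is not evaluating, and that exactly one rail is 1 during evaluation (active = 1, CLK = 0).

```python
from itertools import product

def f(i2, i3, i4, i5):
    # Hypothetical 4-input Boolean function, chosen only for illustration.
    return i2 ^ (i3 & i4) ^ (i4 & i5)

def lut_true(active, clk, ins):
    # O = active * not(CLK) * f(I2..I5)
    return active & (1 - clk) & f(*ins)

def lut_false(active, clk, ins_bar):
    # O_bar = active * not(CLK) * not(f(not(I2_bar), ..., not(I5_bar)))
    recovered = tuple(1 - b for b in ins_bar)
    return active & (1 - clk) & (1 - f(*recovered))

def check_dual_rail():
    for active, clk in product((0, 1), repeat=2):
        for ins in product((0, 1), repeat=4):
            # Assumption: the false rail carries the complemented inputs.
            ins_bar = tuple(1 - b for b in ins)
            o = lut_true(active, clk, ins)
            o_bar = lut_false(active, clk, ins_bar)
            if active == 1 and clk == 0:
                if o + o_bar != 1:        # evaluation: exactly one rail fires
                    return False
            elif (o, o_bar) != (0, 0):    # otherwise: both rails stay at 0
                return False
    return True
```

Calling `check_dual_rail()` exercises all 64 combinations of control signals and inputs.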

Since the output of each LUT is buffered by a register, the critical path in a GliFreD circuit is minimized, allowing the circuit to run at high frequencies. To this end, the delay between the CLK and active signals should be kept to a minimum (see Fig. 1(a)), which can be achieved by forcing the active signal to be routed through the clock trees. The GliFreD design methodology offers the ability to transfer a design into a fully-pipelined architecture, hence achieving a high throughput in combination with a high clock frequency. In general, large combinatorial circuits cause glitches which propagate through the whole circuit. Since GliFreD prevents those glitches, it may also reduce the power consumption. In small combinatorial circuits this benefit fades and is dominated by the increased amount of resources the GliFreD circuit utilizes. Overall, GliFreD is a resource-costly solution. The LUT overhead (at most 8) required to form a GliFreD circuit strongly depends on the original design structure. Compared to the LUT utilization, GliFreD causes a massive register overhead and hence an increased latency. The register overhead cannot be trivially estimated and depends on the LUT depth, the LUT width, and the number of registers in the original design.

3 Case Studies

Before giving the details of our case studies, we briefly restate the concept behind threshold implementation.

3.1 Threshold Implementation

As stated before, the masking scheme which we consider in this work is threshold implementation (TI) introduced and extended in [4, 5, 37, 38]. Let us denote an intermediate value of a cipher by \({{\varvec{x}}}\) made of s single-bit signals \(\langle x_1,\ldots ,x_s\rangle \). The underlying concept of TI is to use Boolean masking to represent \({{\varvec{x}}}\) in a shared form \(({{\varvec{x}}}^1,\ldots ,{{\varvec{x}}}^n)\), where \({{\varvec{x}}}=\bigoplus {{\varvec{x}}}^i\) and each \({{\varvec{x}}}^i\) similarly denotes a vector of s single-bit signals \(\langle x^i_1,\ldots ,x^i_s\rangle \). A linear function l(.) can be trivially applied over the shares of \({{\varvec{x}}}\) as \(l({{\varvec{x}}}) =\bigoplus l({{\varvec{x}}}^i)\). However, the realization of non-linear functions, e.g., an Sbox, over Boolean masked data is challenging. Following the concept of TI, if the algebraic degree of the underlying Sbox is denoted by t and the desired security order by d, the minimum number of shares to realize the Sbox under the TI settings is \(n=t\,d+1\). Further, such a TI Sbox provides the output \({{\varvec{y}}}=S({\varvec{x}})\) in a shared form \(({{\varvec{y}}}^1,\ldots ,{{\varvec{y}}}^m)\) with at least \(m=\displaystyle {\left( {\begin{array}{c}n\\ t\end{array}}\right) }\) shares. Note that the bit lengths of \({{\varvec{x}}}\) and \({{\varvec{y}}}\) (respectively of their shared forms) are not necessarily the same since S(.) might not be a bijection, e.g., in the case of DES.
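The share counts above can be made concrete with a few lines of Python. This is a minimal sketch; the helper names `ti_input_shares`, `ti_output_shares`, `share` and `unshare` are illustrative and do not come from the cited works. It evaluates \(n=t\,d+1\) and \(m=\binom{n}{t}\), and shows a Boolean sharing whose XOR recombines to the original value.

```python
import secrets
from functools import reduce
from math import comb
from operator import xor

def ti_input_shares(t, d):
    """Minimum number of input shares n = t*d + 1 for degree t and order d."""
    return t * d + 1

def ti_output_shares(t, d):
    """Minimum number of output shares m = C(n, t)."""
    return comb(ti_input_shares(t, d), t)

def share(x, n, bits):
    """Split x into n Boolean shares: n-1 random masks plus a correction share."""
    masks = [secrets.randbits(bits) for _ in range(n - 1)]
    return masks + [reduce(xor, masks, x)]

def unshare(shares):
    """Recombine the shares by XOR."""
    return reduce(xor, shares, 0)

# Quadratic function (t = 2): 3 shares at first order, 5 shares at second order,
# and 10 output shares at second order before any reduction.
# Cubic function (t = 3): 4 shares at first order, 7 shares at second order.
```

For example, `unshare(share(0xA, 5, 4))` returns `0xA` regardless of the random masks drawn.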

Each output share \({{\varvec{y}}}^{j\in \{1,\ldots ,m\}}\) is given by a component function \(f^j(.)\) over a subset of the input shares. To achieve the dth-order security, any selection of d component functions \(f^{j\in \{1,\ldots ,m\}}(.)\) should be independent of at least one input share.

Since the security of masking schemes is based on the uniform distribution of the masks, the output of a TI Sbox must be also uniform as it is used as input in further parts of the implementation. To express the uniformity under the TI concept suppose that for a certain input \(\mathbf {x}\) all possible sharings \(\mathcal {X}=\Big \{({{\varvec{x}}}^1,\ldots ,{{\varvec{x}}}^n)|\mathbf {x}=\bigoplus {{\varvec{x}}}^i\Big \}\) are given to a TI Sbox. The set made by the output shares, i.e., \(\Big \{\big (f^1(.),\ldots ,f^m(.)\big )|({{\varvec{x}}}^1,\ldots ,{{\varvec{x}}}^n) \in \mathcal {X}\Big \}\), should be drawn uniformly from the set \(\mathcal {Y}=\Big \{({{\varvec{y}}}^1,\ldots ,{{\varvec{y}}}^m)|\mathbf {y}=\bigoplus {{\varvec{y}}}^i\Big \}\) as all possible sharings of \(\mathbf {y}=S(\mathbf {x})\).

This uniformity check should be performed individually for every \(\mathbf {x}\in \{0,1\}^s\). We should note that for \(d\,>\,1\) where \(m\,>\,n\) the uniformity cannot be achieved. Hence, some of the registered output shares should be combined to reduce the number of output shares to n. Afterward the uniformity can be examined. For more detailed information we refer to the original articles [5, 38].

3.2 KATAN-32

As stated in Sect. 2, the overhead and performance of a GliFreD circuit depend on the nature of the underlying application. If the target design is made of small combinatorial circuits, the overhead of the resulting GliFreD circuit is minimal. Therefore, KATAN [10], which benefits from a serialized architecture with very small combinatorial logic, is a suitable candidate for our investigations. Further, both first- and second-order uniform TI representations of its non-linear functions are given in [5], allowing us to develop the design with minimal effort.

The architecture of our designs is based on those given in [5]. Figure 2(a) shows an overview of such a serialized architecture considering the KATAN-32 encryption engine with a 32-bit plaintext and an 80-bit symmetric key. The plaintext and key are serially loaded into the registers, and after 254 clock cycles the ciphertext can be taken from the state register. The first-order TI of KATAN-32 with 3 shares (the minimum settings) needs the state (shift) registers to be tripled. Similar to that of [5], we do not represent the key (and the corresponding shift register) in a shared form. The XOR operations are easily repeated for each share, and the non-linear functions, which are limited to the AND/XOR module (involved in functions \(f_a\) and \(f_b\) of Fig. 2(a)), need to be realized under the concept of the first-order TI. An AND/XOR function receives a 3-bit input (a, b, c) and gives a single-bit output y as

$$\begin{aligned} y=a+bc. \end{aligned}$$

Following the concept of direct sharing [6] the component functions (given in [5]) which realize a uniform first-order TI can be derived as

$$\begin{aligned} f^{i,j}(\langle a^i,b^i,c^i\rangle ,\langle a^j,b^j,c^j\rangle ) = a^j + b^jc^j + b^ic^j + b^jc^i, \end{aligned}$$
(2)

where each output share is made by an instance of such a component function as

$$\begin{aligned} y^1=f^{1,2}(.,.), \qquad y^2=f^{2,3}(.,.), \qquad y^3=f^{3,1}(.,.). \end{aligned}$$
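The properties claimed for this sharing can be checked exhaustively in software. The following Python sketch (an illustration, not part of the hardware designs) implements the component function of Eq. (2) and the three output shares, and then verifies two things over all \(2^9\) input sharings: the output shares XOR to \(a+bc\), and for every unshared input each of the four valid output sharings occurs equally often (16 times), which is the uniformity condition of Sect. 3.1. Non-completeness holds by construction, since every call to `f` receives only two of the three input shares.

```python
from collections import Counter
from itertools import product

def f(si, sj):
    """Component function of Eq. (2); si and sj are (a, b, c) share triples."""
    ai, bi, ci = si
    aj, bj, cj = sj
    return aj ^ (bj & cj) ^ (bi & cj) ^ (bj & ci)

def ti_and_xor(s1, s2, s3):
    """Output shares y^1 = f^{1,2}, y^2 = f^{2,3}, y^3 = f^{3,1}."""
    return (f(s1, s2), f(s2, s3), f(s3, s1))

def check_first_order():
    counts = {}  # unshared input (a, b, c) -> Counter of output sharings
    for bits in product((0, 1), repeat=9):
        s1, s2, s3 = bits[0:3], bits[3:6], bits[6:9]
        a, b, c = (s1[k] ^ s2[k] ^ s3[k] for k in range(3))
        y = ti_and_xor(s1, s2, s3)
        if y[0] ^ y[1] ^ y[2] != a ^ (b & c):   # correctness
            return False
        counts.setdefault((a, b, c), Counter())[y] += 1
    # Uniformity: 64 input sharings per input, 4 valid output sharings each.
    return all(len(cnt) == 4 and set(cnt.values()) == {16}
               for cnt in counts.values())
```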

The same procedure is followed to realize the second-order TI of KATAN-32. First, the minimum number of shares is increased to 5, and all state registers and linear functions need to be repeated accordingly. Further, a second-order TI representation of AND/XOR module (given in [5]) can be derived from Eq. (2) and the following component function

$$\begin{aligned} g^{i,j}(\langle a^i,b^i,c^i\rangle ,\langle a^j,b^j,c^j\rangle ) = b^ic^j + b^jc^i. \end{aligned}$$
(3)

In such a case, the output shares are made as

$$\begin{aligned} y^1=f^{1,2}(.,.),~~~y^2=f^{1,3}(.,.),~~~y^3=f^{1,4}(.,.),~~~y^4=f^{5,1}(.,.),~~~y^5=f^{2,5}(.,.), \end{aligned}$$

and

$$\begin{aligned} y^6=g^{2,3}(.,.),~~~y^7=g^{2,4}(.,.),~~~y^8=g^{3,4}(.,.),~~~y^9=g^{3,5}(.,.),~~~y^{10}=g^{4,5}(.,.). \end{aligned}$$

As mentioned before, in a second-order case the output shares should be combined after being registered in order to reduce the number of shares back to 5. In this case, the reduction is done as

$$\begin{aligned} z^{i\in \{1,\ldots ,4\}}=y^i,~~~~~ z^5=y^5+y^6+y^7+y^8+y^9+y^{10}, \end{aligned}$$

thereby achieving a uniform second-order TI of the AND/XOR module [5]. For further clarification, the formulas for all the component functions are given in the extended version of this article [35].
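The second-order construction can be checked in the same exhaustive manner. The sketch below is again only an illustration in Python; the list names `F_PAIRS` and `G_PAIRS` are introduced here to encode the index pairs of the ten output shares listed above. It verifies over all \(2^{15}\) input sharings that the five reduced shares \(z^1,\ldots ,z^5\) XOR to \(a+bc\) and that, for every unshared input, each of the 16 valid output sharings occurs equally often (256 times).

```python
from collections import Counter
from itertools import product

F_PAIRS = [(1, 2), (1, 3), (1, 4), (5, 1), (2, 5)]   # y^1 .. y^5
G_PAIRS = [(2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]   # y^6 .. y^10

def f(si, sj):
    """Eq. (2): a^j + b^j c^j + b^i c^j + b^j c^i."""
    ai, bi, ci = si
    aj, bj, cj = sj
    return aj ^ (bj & cj) ^ (bi & cj) ^ (bj & ci)

def g(si, sj):
    """Eq. (3): b^i c^j + b^j c^i."""
    ai, bi, ci = si
    aj, bj, cj = sj
    return (bi & cj) ^ (bj & ci)

def ti2_and_xor(s):
    """s[1..5] are (a, b, c) share triples; returns the reduced shares z."""
    y = [f(s[i], s[j]) for i, j in F_PAIRS] + [g(s[i], s[j]) for i, j in G_PAIRS]
    z5 = 0
    for v in y[4:]:          # z^5 = y^5 + y^6 + ... + y^10
        z5 ^= v
    return tuple(y[:4]) + (z5,)

def check_second_order():
    counts = {}
    for bits in product((0, 1), repeat=15):
        s = [None] + [bits[3 * k:3 * k + 3] for k in range(5)]
        a = b = c = 0
        for k in range(1, 6):
            a ^= s[k][0]
            b ^= s[k][1]
            c ^= s[k][2]
        z = ti2_and_xor(s)
        if z[0] ^ z[1] ^ z[2] ^ z[3] ^ z[4] != a ^ (b & c):   # correctness
            return False
        counts.setdefault((a, b, c), Counter())[z] += 1
    # Uniformity: 4096 input sharings per input, 16 valid output sharings each.
    return all(len(cnt) == 16 and set(cnt.values()) == {256}
               for cnt in counts.values())
```

Since every component function reads only two input shares, any two of them together touch at most four of the five shares, which is the second-order non-completeness requirement.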

Fig. 2. Architecture of the case studies, first (\(d=1\)) and second (\(d=2\)) order TI

3.3 PRESENT

As the second target we selected the PRESENT cipher [9] to be implemented in a round-based fashion. As Fig. 2(b) shows, 16 instances of the Sbox in addition to the PLayer operate in parallel to compute one cipher round. The reason for choosing such a target is to have an application for GliFreD with large combinatorial circuit compared to that of KATAN. Also, due to a possibility to decompose the PRESENT Sbox – as we express below – we are able to develop its uniform first- and second-order TI representations. We should note that we have not selected the AES as a target because its first-order TI (in [4, 33]) can only be realized by remasking (requiring multiple fresh mask bits per clock cycle) and furthermore there is not yet a clear roadmap how to realize its second-order TI.

Similar to the case of KATAN, the first-order (respectively second-order) TI of the targeted PRESENT architecture employs a 3-share (respectively 5-share) Boolean masking. The PLayer (realized by routing in the round-based architecture) is repeated on each share, and the key XOR is applied on only one share as the 80-bit key is not represented in a shared form. Clearly the remaining part is the TI representation of the PRESENT Sbox. Previously Poschmann et al. [42] have shown a decomposition and a uniform first-order TI of such an Sbox. However, below we present another decomposition allowing us to develop both its first- and second-order uniform TI representations.

The PRESENT Sbox \(S({{\varvec{x}}})={\varvec{y}}\) is a cubic bijection (i.e., with algebraic degree \(t=3\)) leading to minimum \(n=4\) and \(n=7\) shares in the first- and second-order TI settings respectively. Therefore, it is preferable to decompose the Sbox into two (at most) quadratic bijections F and G, in such a way that \(S({{\varvec{x}}})=F(G({{\varvec{x}}}))\) (i.e., \(S=F \circ G\)). If so, each of F and G can be shared with \(n=3\) and \(n=5\) (for first- and second-order TI). According to the classifications given in [7], the PRESENT Sbox belongs to the cubic class \(\mathcal {C}_{266}\). It means that there exist affine transformations A and B, where \(S({{\varvec{x}}})=B(\mathcal {C}_{266}(A({{\varvec{x}}})))\). In other words, S and \(\mathcal {C}_{266}\) are affine equivalent. To find the affine functions the algorithm given in [8] can be used; indeed there exist 4 such pairs of affine functions. Also, as stated in [7], \(\mathcal {C}_{266}\) can be decomposed into two quadratic bijections. One of the possibilities is \(\mathcal {Q}_{294}\times \mathcal {Q}_{299}\). It means that there exist three affine functions \(A_1\), \(A_2\), \(A_3\), where \(\mathcal {C}_{266}=A_3\circ \mathcal {Q}_{299}\circ A_2\circ \mathcal {Q}_{294}\circ A_1\). Since \(\mathcal {C}_{266}\) and S are affine equivalent, there also exist three affine functions to decompose the PRESENT Sbox as

$$\begin{aligned} S({{\varvec{x}}})=A_3\Bigg (\mathcal {Q}_{299}\bigg (A_2\Big (\mathcal {Q}_{294}\big (A_1({{\varvec{x}}})\big )\Big )\bigg )\Bigg ). \end{aligned}$$
(4)

We have found 229,376 such 3-tuples of affine bijections, and we have selected one of the simplest solutions with respect to the number of terms in their Algebraic Normal Form (ANF), which directly affects the size of the corresponding circuit.
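The degree argument behind this decomposition can be reproduced with a short Python sketch that computes the algebraic degree of a 4-bit Sbox from its lookup table via the Möbius transform. The PRESENT Sbox table is taken from the cipher specification (it is not listed in this section), and the tables of \(\mathcal {Q}_{294}\) and \(\mathcal {Q}_{299}\) are those stated further below in this section; the function name `anf_degree` is illustrative.

```python
def anf_degree(table, nbits=4):
    """Algebraic degree of an Sbox: highest-degree monomial in any output bit."""
    degree = 0
    for out_bit in range(nbits):
        coeffs = [(y >> out_bit) & 1 for y in table]
        # In-place Moebius transform: truth table -> ANF coefficients.
        for k in range(nbits):
            for x in range(len(coeffs)):
                if x & (1 << k):
                    coeffs[x] ^= coeffs[x ^ (1 << k)]
        for monomial, c in enumerate(coeffs):
            if c:
                degree = max(degree, bin(monomial).count("1"))
    return degree

# PRESENT Sbox as given in the cipher specification.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
Q294 = [int(ch, 16) for ch in "0123456789BAEFDC"]
Q299 = [int(ch, 16) for ch in "012345678ACEB9FD"]

def min_shares(table, d):
    """n = t*d + 1 with t the algebraic degree of the Sbox."""
    return anf_degree(table) * d + 1
```

With these tables, `min_shares(PRESENT_SBOX, 2)` evaluates to 7, whereas `min_shares(Q294, 2)` and `min_shares(Q299, 2)` evaluate to 5, which is the motivation for the decomposition.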

The next step is to provide the uniform first-order TI of the quadratic bijections \(\mathcal {Q}_{294}\) and \(\mathcal {Q}_{299}\) which can be easily achieved by direct sharing [7]. For \(\mathcal {Q}_{294}\):0123456789BAEFDC we can write

$$\begin{aligned} e = a + bd, \qquad f= b + cd, \qquad g = c, \qquad h = d, \end{aligned}$$
(5)

with \(\langle a,b,c,d\rangle \) the 4-bit input, \(\langle e,f,g,h\rangle \) the 4-bit output, and a and e the least significant bits. The component functions of the first-order TI of \(\mathcal {Q}_{294}\) can be derived by \(f_{\mathcal {Q}_{294}}^{i,j}(\langle a^i,b^i,c^i,d^i\rangle ,\langle a^j,b^j,c^j,d^j\rangle )=\langle e,f,g,h\rangle \) as

$$\begin{aligned} e = a^i + b^id^i + d^ib^j + b^id^j \qquad g = c^i\nonumber \\ f = b^i + c^id^i + d^ic^j + c^id^j \qquad h = d^i \end{aligned}$$
(6)

The three 4-bit output shares provided by \(f_{\mathcal {Q}_{294}}^{2,3}(.,.)\), \(f_{\mathcal {Q}_{294}}^{3,1}(.,.)\) and \(f_{\mathcal {Q}_{294}}^{1,2}(.,.)\) make a uniform first-order TI of \(\mathcal {Q}_{294}\).

Following the same principle for \(\mathcal {Q}_{299}\):012345678ACEB9FD as

$$\begin{aligned} e=a+ad+cd, \qquad f=b+ad+bd+cd, \qquad g=c+bd+cd, \qquad h=d, \end{aligned}$$
(7)

we can define the component function \(f_{\mathcal {Q}_{299}}^{i,j}(\langle a^i,b^i,c^i,d^i\rangle ,\langle a^j,b^j,c^j,d^j\rangle )=\langle e,f,g,h\rangle \) as

$$\begin{aligned} e&= a^i + (a^id^i + d^ia^j + a^id^j) + (c^id^i + d^ic^j + c^id^j) \nonumber \\ f&= b^i + (a^id^i + d^ia^j + a^id^j) + (b^id^i + d^ib^j + b^id^j) + (c^id^i + d^ic^j + c^id^j) \nonumber \\ g&= c^i + (b^id^i + d^ib^j + b^id^j) + (c^id^i + d^ic^j + c^id^j) \nonumber \\ h&= d^i. \end{aligned}$$
(8)

Similarly, three 4-bit output shares provided by \(f_{\mathcal {Q}_{299}}^{2,3}(.,.)\), \(f_{\mathcal {Q}_{299}}^{3,1}(.,.)\) and \(f_{\mathcal {Q}_{299}}^{1,2}(.,.)\) make a uniform first-order TI of \(\mathcal {Q}_{299}\).

Since the affine transformations \(A_1\), \(A_2\), \(A_3\) do not change the uniformity and should be applied on each 4-bit share separately, the decomposition in Eq. (4) provides a 3-share uniform first-order TI of the PRESENT Sbox. It should be noted that registers are required to be placed between the component functions of \(\mathcal {Q}_{294}\) and \(\mathcal {Q}_{299}\) to avoid the propagation of the glitches (see Fig. 3). Note that the affine function \(A_2\) can be freely placed before or after the intermediate register.
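The two first-order sharings can be verified in software. The Python sketch below is an illustration only: the helper names are hypothetical, and the affine functions \(A_1\), \(A_2\), \(A_3\) are omitted because they are not listed in this section. It implements the component functions of Eqs. (6) and (8), and checks exhaustively over all \(16^3\) input sharings that the three output shares XOR to the table value of \(\mathcal {Q}_{294}\) and \(\mathcal {Q}_{299}\), respectively. Since both functions are bijections with as many output shares as input shares, uniformity is equivalent to the shared function being a permutation of the 12-bit sharing space, which is also tested.

```python
from itertools import product

Q294 = [int(ch, 16) for ch in "0123456789BAEFDC"]
Q299 = [int(ch, 16) for ch in "012345678ACEB9FD"]

def bits(x):
    """Split a nibble into (a, b, c, d) with a as the least significant bit."""
    return tuple((x >> k) & 1 for k in range(4))

def pack(e, f, g, h):
    """Combine (e, f, g, h) into a nibble with e as the least significant bit."""
    return e | (f << 1) | (g << 2) | (h << 3)

def cross(xi, di, xj, dj):
    """Direct sharing of the product x*d: x^i d^i + d^i x^j + x^i d^j."""
    return (xi & di) ^ (di & xj) ^ (xi & dj)

def f_q294(si, sj):
    """Component function of Eq. (6)."""
    ai, bi, ci, di = bits(si)
    _, bj, cj, dj = bits(sj)
    return pack(ai ^ cross(bi, di, bj, dj), bi ^ cross(ci, di, cj, dj), ci, di)

def f_q299(si, sj):
    """Component function of Eq. (8)."""
    ai, bi, ci, di = bits(si)
    aj, bj, cj, dj = bits(sj)
    ad = cross(ai, di, aj, dj)
    bd = cross(bi, di, bj, dj)
    cd = cross(ci, di, cj, dj)
    return pack(ai ^ ad ^ cd, bi ^ ad ^ bd ^ cd, ci ^ bd ^ cd, di)

def shared(comp, s1, s2, s3):
    """Output shares given by f^{2,3}, f^{3,1} and f^{1,2}."""
    return comp(s2, s3), comp(s3, s1), comp(s1, s2)

def is_correct(comp, table):
    for s1, s2, s3 in product(range(16), repeat=3):
        y1, y2, y3 = shared(comp, s1, s2, s3)
        if y1 ^ y2 ^ y3 != table[s1 ^ s2 ^ s3]:
            return False
    return True

def is_permutation(comp):
    images = {shared(comp, *s) for s in product(range(16), repeat=3)}
    return len(images) == 16 ** 3
```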

Fig. 3. A first-order TI of the PRESENT Sbox: \(S({{\varvec{x}}})={\varvec{y}}\)

For the second-order TI representations in addition to the above expressed component functions, we define \(g_{\mathcal {Q}_{294}}^{i,j}(\langle a^i,b^i,c^i,d^i\rangle ,\langle a^j,b^j,c^j,d^j\rangle )=\langle e,f,g,h\rangle \) as

$$\begin{aligned} e = d^ib^j + b^id^j \qquad g = 0\nonumber \\ f = d^ic^j + c^id^j \qquad h = 0. \end{aligned}$$
(9)

The 4-bit output shares \({{\varvec{y}}}^{i\in \{1,\ldots ,10\}}\) are provided by

$$\begin{aligned} {{\varvec{y}}}^1=f_{\mathcal {Q}_{294}}^{2,3}(.,.),&{{\varvec{y}}}^2=f_{\mathcal {Q}_{294}}^{3,4}(.,.),&{{\varvec{y}}}^3=f_{\mathcal {Q}_{294}}^{4,5}(.,.),&{{\varvec{y}}}^4=f_{\mathcal {Q}_{294}}^{5,1}(.,.),\nonumber \\ {{\varvec{y}}}^5=f_{\mathcal {Q}_{294}}^{1,2}(.,.),&{{\varvec{y}}}^6=g_{\mathcal {Q}_{294}}^{2,4}(.,.),&{{\varvec{y}}}^7=g_{\mathcal {Q}_{294}}^{3,5}(.,.),&{{\varvec{y}}}^8=g_{\mathcal {Q}_{294}}^{1,4}(.,.),\nonumber \\&{{\varvec{y}}}^9=g_{\mathcal {Q}_{294}}^{2,5}(.,.),&{{\varvec{y}}}^{10}=g_{\mathcal {Q}_{294}}^{1,3}(.,.). \end{aligned}$$
(10)

After a clock cycle, when \({{\varvec{y}}}^{i\in \{1,\ldots ,10\}}\) are stored in dedicated registers, the output shares should be combined as

$$\begin{aligned} {{{\varvec{z}}}}^{i\in \{1,\ldots ,5\}}={\varvec{y}}^{i}+{\varvec{y}}^{i+5}, \end{aligned}$$
(11)

which provides the uniform second-order TI of \(\mathcal {Q}_{294}\).

The same procedure is valid in case of \(\mathcal {Q}_{299}\) considering the component function \(g_{\mathcal {Q}_{299}}^{i,j}(\langle a^i,b^i,c^i,d^i\rangle ,\langle a^j,b^j,c^j,d^j\rangle )=\langle e,f,g,h\rangle \) as

$$\begin{aligned} e&= d^ia^j + d^ic^j + a^id^j + c^id^j \nonumber \\ f&= d^ia^j + d^ib^j + d^ic^j + a^id^j + b^id^j + c^id^j\nonumber \\ g&= d^ib^j + d^ic^j + b^id^j + c^id^j\nonumber \\ h&= 0. \end{aligned}$$
(12)

By changing the indices from \(_{\mathcal {Q}_{294}}\) to \(_{\mathcal {Q}_{299}}\) in Eq. (10) and later applying the reduction in Eq. (11), a uniform second-order TI of \(\mathcal {Q}_{299}\) is achieved. Hence by means of these component functions in addition to the affine transformations, we can realize a uniform second-order TI of the PRESENT Sbox. Figure 4 shows the graphical view of such a construction, and all the required formulas are given in the extended version of this article [35]. Note that the registers after the affine function \(A_2\) can instead be placed before \(A_2\), right after the reduction from 10 to 5 shares.
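The functional correctness of the 5-share construction, i.e., that the reduced shares of Eq. (11) recombine to \(\mathcal {Q}_{294}({{\varvec{x}}})\) and \(\mathcal {Q}_{299}({{\varvec{x}}})\), can be spot-checked with the following Python sketch. It is an illustration under stated assumptions: the helper names are hypothetical, the affine functions are omitted, the check uses randomly sampled sharings rather than all \(16^5\) of them, and uniformity is not examined here.

```python
import random

Q294 = [int(ch, 16) for ch in "0123456789BAEFDC"]
Q299 = [int(ch, 16) for ch in "012345678ACEB9FD"]
F_PAIRS = [(2, 3), (3, 4), (4, 5), (5, 1), (1, 2)]   # y^1 .. y^5 in Eq. (10)
G_PAIRS = [(2, 4), (3, 5), (1, 4), (2, 5), (1, 3)]   # y^6 .. y^10 in Eq. (10)

def bits(x):
    return tuple((x >> k) & 1 for k in range(4))      # (a, b, c, d), a = LSB

def pack(e, f, g, h):
    return e | (f << 1) | (g << 2) | (h << 3)

def mix(xi, di, xj, dj):
    """Cross terms d^i x^j + x^i d^j of a shared product x*d."""
    return (di & xj) ^ (xi & dj)

def f_q294(si, sj):                                   # Eq. (6)
    ai, bi, ci, di = bits(si)
    _, bj, cj, dj = bits(sj)
    return pack(ai ^ (bi & di) ^ mix(bi, di, bj, dj),
                bi ^ (ci & di) ^ mix(ci, di, cj, dj), ci, di)

def g_q294(si, sj):                                   # Eq. (9)
    _, bi, ci, di = bits(si)
    _, bj, cj, dj = bits(sj)
    return pack(mix(bi, di, bj, dj), mix(ci, di, cj, dj), 0, 0)

def f_q299(si, sj):                                   # Eq. (8)
    ai, bi, ci, di = bits(si)
    aj, bj, cj, dj = bits(sj)
    ad = (ai & di) ^ mix(ai, di, aj, dj)
    bd = (bi & di) ^ mix(bi, di, bj, dj)
    cd = (ci & di) ^ mix(ci, di, cj, dj)
    return pack(ai ^ ad ^ cd, bi ^ ad ^ bd ^ cd, ci ^ bd ^ cd, di)

def g_q299(si, sj):                                   # Eq. (12)
    ai, bi, ci, di = bits(si)
    aj, bj, cj, dj = bits(sj)
    ad = mix(ai, di, aj, dj)
    bd = mix(bi, di, bj, dj)
    cd = mix(ci, di, cj, dj)
    return pack(ad ^ cd, ad ^ bd ^ cd, bd ^ cd, 0)

def second_order_ti(f, g, s):
    """s maps share index 1..5 to a nibble; returns z^1..z^5 of Eq. (11)."""
    y = [f(s[i], s[j]) for i, j in F_PAIRS] + [g(s[i], s[j]) for i, j in G_PAIRS]
    return [y[k] ^ y[k + 5] for k in range(5)]

def is_correct(f, g, table, trials=5000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        s = {k: rng.randrange(16) for k in range(1, 6)}
        z = second_order_ti(f, g, s)
        if z[0] ^ z[1] ^ z[2] ^ z[3] ^ z[4] != table[s[1] ^ s[2] ^ s[3] ^ s[4] ^ s[5]]:
            return False
    return True
```

Each reduced share \(z^i\) combines one f-type and one g-type component function after the register stage, so the sketch models only the values, not the timing of the reduction.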

Fig. 4. A second-order TI of the PRESENT Sbox: \(S({\varvec{x}})={\varvec{y}}\)

3.4 Implementation

Based on the specifications given above and considering a Spartan-6 FPGA (indeed the XC6SLX75 of SAKURA-G [1]) we implemented six designs. The first three are different profiles of KATAN-32, and the next three designs realize the encryption of PRESENT with a round-based architecture. For each of the targeted ciphers we implemented

  • the first-order TI, i.e., KATAN-1st and PRESENT-1st profiles,

  • the second-order TI, i.e., KATAN-2nd and PRESENT-2nd profiles, and

  • the first-order TI with GliFreD, i.e., KATAN-1st-G and PRESENT-1st-G profiles.

Although we did not consider any constraints on the placement and routing of the four non-GliFreD profiles, following the principles of GliFreD the corresponding profiles have been realized by first defining an area on the target FPGA where the components of the true part of the GliFreD circuit should be placed. After finishing the placement and routing, the corresponding dual circuit, i.e., the false part of the GliFreD circuit, has been cloned and dualized by means of the RapidSmith tool [22]. As a reference, the circuits shown in Fig. 1 are the normal and GliFreD realizations of the least significant bit e of Eq. (8).

Due to its serialized ring architecture, the KATAN-1st-G profile does not form a pipeline. The most important difference between such a profile and its original one (KATAN-1st) is on the one hand the number of required clock cycles to finish an encryption (i.e., latency) which is doubled and on the other hand the raised achievable clock frequency due to the minimal LUT depth. The max LUT depth in GliFreD circuits is 1, hence a very short critical path. However, the PRESENT-1st-G profile is implemented in a fully-pipelined way, so that the round-based architecture is able to hold 11 different cipher states. Hence, after \(32\times 11\times 2=704\) clock cycles, 11 encryptions with the same key are performed. The pipelined architecture naturally increases the register utilization of the components but provides a much higher throughput.

Table 1. Details about the implemented profiles. The values given in this table are taken from the post route synthesis report of Xilinx ISE 14.7.

Table 1 compares the overhead and performance of different design profiles. It indeed gives an overview on the disadvantage (area and time overheads) as well as the advantage (throughput) of employing GliFreD with respect to two different design architectures, i.e., a fully-serialized one which is register oriented (KATAN-1st-G) and a round-based one which is combinatorial oriented (PRESENT-1st-G). As shown by Table 1, although the resource utilization and the latency of the GliFreD profiles are drastically increased, the throughput is still kept comparable with the original design profiles. Such achievements are mainly due to the naturally-minimized critical paths in the GliFreD designs allowing a high clock frequency.

4 Empirical Results

In addition to the performance and overhead figures given in Sect. 3.4, we practically examined the ability of each of our six developed designs to avoid side-channel leakages.

Setup. The experimental platform is a SAKURA-G [1] equipped with a Xilinx Spartan-6 FPGA. The side-channel leakages have been measured by collecting power consumption traces of the underlying FPGA by means of a Teledyne LeCroy HRO 66Zi digital oscilloscope at a sampling frequency of \(500\,\mathrm {MS/s}\) and a limited bandwidth of \(20\,\mathrm {MHz}\). Due to the low peak-to-peak amplitude of the signals we also made use of the amplifier embedded on the SAKURA board. For all six design profiles, the target FPGA operated at a frequency of \(24\,\mathrm {MHz}\) during the collection of the power traces. Our intuition about the power traces measured from our platform is that they are heavily filtered by the measurement setup including the shunt resistor, chip packaging, printed circuit board (PCB), and probes. Measuring the power traces with high bandwidth (\(>20\,\mathrm {MHz}\)) leads to higher electrical noise. We have examined this behavior and observed leakages more easily when the bandwidth is limited. Note that this intuition does not hold true in the case of EM measurements.

Fig. 5. KATAN-1st profile, sample trace and non-specific t-test results using 1,000,000 traces

It is noteworthy that such a frequency of operation has intentionally been chosen in order to (i) cover the full power trace length in the measurements, as the KATAN profiles need 254 clock cycles after the data has been loaded (respectively 508 for KATAN-1st-G), and (ii) cause the power peaks of adjacent clock cycles to slightly overlap each other. The latter has been considered with respect to the note given in [45] that the second-order TI can still be vulnerable to a second-order bivariate attack. Recalling the techniques introduced in [31], employing certain amplifiers or running the device at a high clock frequency converts multivariate leakages to univariate ones. It has been shown in [49] that a second-order TI design can actually exhibit a univariate second-order leakage if the measurement setup employs certain components, e.g., DC blockers and/or amplifiers. Hence, operating the device at \(24\,\mathrm {MHz}\) allows us to easily cover the long traces in the measurements and to provide particular situations where second-order TI profiles may demonstrate second-order leakage.

Evaluation. As the evaluation metric we employed the leakage assessment methodology of [17, 48] which is based on the Student’s t-test. The reason for such a choice is twofold. First, the t-test can examine the existence of detectable leakages without performing any key-recovery attack, which significantly eases the evaluation process, particularly where higher-order leakages using millions of traces should be examined. Moreover, the efficiency of the state-of-the-art key-recovery attacks strongly depends on the targeted intermediate value and the underlying (power) model. Second, the same leakage assessment technique (more precisely the non-specific t-test, also known as the fixed vs. random test) has been used to examine the resistance of different threshold implementations (for example see [5, 49]). In order to keep our evaluations comparable with the former ones, we employed the same evaluation method.

Fig. 6. KATAN-2nd profile, sample trace and non-specific t-test results using 100,000,000 traces

In a non-specific t-test the leakages associated to a fixed input (plaintext in case of encryption) are compared to that of random inputs while the key in all the measurements is kept constant. Such a test gives a level of confidence to conclude that the leakages related to the process of the fixed input are different to those of the random inputs. If so, an attack is expected to be feasible to exploit the leakage and recover the secrets. For more detailed information we refer the interested reader to [5, 17].

Fig. 7. KATAN-1st-G profile, sample trace and non-specific t-test results using 1,000,000,000 traces

Fig. 8. PRESENT-1st profile, sample trace and non-specific t-test results using 10,000,000 traces

It is noteworthy that all the tests we performed here are based on a univariate scenario. In other words, we did not run any combination function on different sample points of each collected power trace. Further, we followed the same principle explained in [5, 48] to conduct the tests at higher orders. It means that we made the power traces mean-free squared (at each sample point independently), i.e., \((X-\mu )^2\) for the second-order evaluations, and standardized cubed, i.e., \(\displaystyle {\Big (\frac{X-\mu }{\sigma }\Big )^3}\) for the third-order evaluations. In general, the pre-processing is done by \(\displaystyle {\Big (\frac{X-\mu }{\sigma }\Big )^d}\) for the analyses at order \(d>2\), with X as a random variable denoting the power traces (at a particular sample point), and \(\mu \) and \(\sigma ^2\) as the sample mean and sample variance (at the same sample point) respectively. Indeed, the pre-processing steps required for higher-order evaluations correspond to the centered and standardized higher-order statistical moments (for more information see [26, 34]).
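The following Python sketch illustrates the univariate fixed vs. random t-test with the pre-processing described above, applied at a single sample point. It is a toy model rather than the evaluation code used for the measurements. Several points are assumptions: Welch's form of the t-statistic is used, \(\mu \) and \(\sigma \) are computed per group (fixed and random separately), the simulated leakage is the sum of two Boolean share bits plus Gaussian noise, and the threshold \(|t|>4.5\) mentioned in the comments is the value commonly used in the leakage-assessment literature and is not stated in this section.

```python
import random
from statistics import fmean, variance

def preprocess(points, order):
    """Order 1: raw; order 2: (X - mu)^2; order d > 2: ((X - mu) / sigma)^d."""
    if order == 1:
        return list(points)
    mu = fmean(points)                    # assumption: mean taken per group
    centered = [x - mu for x in points]
    if order == 2:
        return [c * c for c in centered]
    sigma = variance(points) ** 0.5       # sample standard deviation
    return [(c / sigma) ** order for c in centered]

def welch_t(group0, group1):
    """Welch's t-statistic between two groups of observations."""
    m0, m1 = fmean(group0), fmean(group1)
    v0, v1 = variance(group0), variance(group1)
    return (m0 - m1) / (v0 / len(group0) + v1 / len(group1)) ** 0.5

def t_test(fixed, rand, order):
    return welch_t(preprocess(fixed, order), preprocess(rand, order))

def simulate(n, seed=1, sigma=1.0):
    """Toy 2-share leakage at one sample point: (x ^ m) + m + noise."""
    rng = random.Random(seed)
    fixed, rand = [], []
    for _ in range(n):
        m = rng.getrandbits(1)
        fixed.append((1 ^ m) + m + rng.gauss(0, sigma))   # fixed input x = 1
        x, m = rng.getrandbits(1), rng.getrandbits(1)
        rand.append((x ^ m) + m + rng.gauss(0, sigma))    # random input
    return fixed, rand

# In this toy model both groups have the same mean, so the first-order
# statistic is expected to stay small, while the variances differ and the
# second-order statistic is expected to exceed the usual |t| > 4.5 threshold.
```

For instance, `fixed, rand = simulate(10000)` followed by `t_test(fixed, rand, 2)` yields a second-order statistic far larger in magnitude than `t_test(fixed, rand, 1)`.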

Fig. 9. PRESENT-2nd profile, sample trace and non-specific t-test results using 300,000,000 traces

We start our evaluations with the KATAN-1st profile. Figure 5(a) shows a corresponding sample power trace. Note that the collected power traces do not cover the time period in which the plaintext and key are serially loaded into the shift registers. To get an overview of the quality of the measurement setup and to verify the employed evaluation metric, for the first analysis we turned the PRNG off, thereby forcing all masks used for sharing the plaintexts to zero. As shown in Fig. 5(b), the first-order t-test exhibits clearly detectable leakage using a few tens of thousands of traces. By keeping the PRNG active and conducting the same non-specific t-tests up to the third order using 1,000,000 traces, we obtained the curves shown in Fig. 5, which indeed confirm first-order resistance and vulnerability at the second and third orders, as expected.

Fig. 10. PRESENT-1st-G profile, sample trace and non-specific t-test results using 1,000,000,000 traces

For the KATAN-2nd profile we had to collect many more traces to be able to observe the higher-order leakages. This is due to the high order of sharing, i.e., at least 5 shares (see Sect. 3.1) in the case of a second-order TI. In fact, we observed the fourth- and fifth-order leakages using approximately 100,000,000 traces, as shown in Fig. 6. However, in order to examine the issue reported in [45] (by operating the target at \(24\,\mathrm {MHz}\)), we continued collecting traces up to 500,000,000. We did not observe any second-order leakage, while the fourth- and fifth-order leakages became detectable, as expected, with higher confidence. We should refer here to the issue addressed in [45] and the detectable second-order leakage reported in [49]. Based on the explanations of [45], a second-order bivariate leakage should be detectable, but such a bivariate leakage is not necessarily detectable from consecutive clock cycles, whose leakages can be additively combined by means of an amplifier or by running the device at a high clock frequency [31]. In the application of [49], the consecutive clock cycles apparently exhibit such a bivariate leakage, but this does not hold true for the serialized KATAN architecture. Further, compared to our design profiles, the constructions in [49] make use of a kind of remasking, which is a different methodology to ensure uniformity.

Following the same scenario, we performed the evaluations on the KATAN-1st-G profile and collected 1,000,000,000 traces to perform the same t-tests up to the third order. The corresponding results, depicted in Fig. 7, indeed confirm the effectiveness of the underlying hiding technique in significantly hardening higher-order attacks. The result of this profile can be compared to that of the KATAN-1st profile (Fig. 5), where 1,000,000 traces are adequate to observe the second- and third-order leakages.

The same leakage assessment technique has been applied to the three profiles of the round-based PRESENT architecture, and the corresponding results are shown in Figs. 8, 9 and 10. For the PRESENT-1st profile we required 10,000,000 traces to observe the second- and third-order leakages. Likewise, 300,000,000 traces were necessary for the PRESENT-2nd profile to exhibit fourth- and fifth-order leakages. We should again draw the reader’s attention to the fact that no second-order leakage could be observed from the PRESENT-2nd profile. We indeed continued our evaluations on this profile by measuring 1,000,000,000 traces as well as by using different fixed inputs (with respect to the non-specific t-tests), but in none of the tests did we observe a detectable second-order leakage. As an example, we give the results of one such test with 1,000,000,000 traces in the extended version of this article [35], where the third-order leakage also becomes detectable. Finally, similar to the KATAN GliFreD design, we collected 1,000,000,000 traces and conducted the same non-specific t-tests on the PRESENT-1st-G profile, which still shows no detectable leakage at the first, second, and third orders.

Discussion. Comparing the presented practical results, at first glance it can be noticed that the GliFreD profiles consume more energy than the other corresponding profiles. They also increase the number of required clock cycles (latency), particularly in the case of the PRESENT design, as its combinatorial circuit has a greater depth than that of the KATAN design. However, their achievement, i.e., hiding the higher-order leakages to make higher-order attacks practically infeasible, is confirmed. Hence, it can be concluded that the combination of such a power-equalization technique and a proper masking scheme (i.e., first-order TI) gives a high level of confidence to argue the practical infeasibility of key-recovery attacks.

Our comparisons are limited to the second-order TI of KATAN and PRESENT, but they can be extended to higher-order TI designs. However, as the desired order of security increases, the number of shares and the number of required internal PRNGs increase accordingly (e.g., at least 7 and 9 shares for third- and fourth-order TI, respectively). Note that the numbers given in Table 1 exclude the area required for the PRNGs.
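The share counts quoted above (5, 7 and 9 for second-, third- and fourth-order TI) are consistent with a lower bound of t·d + 1 input shares for a function of algebraic degree t at security order d. The short sketch below assumes this bound and quadratic functions (t = 2); neither is stated explicitly in this section, and the function name is hypothetical.

```python
def min_input_shares(security_order, algebraic_degree=2):
    """Assumed lower bound on the number of input shares: t * d + 1."""
    return algebraic_degree * security_order + 1

# Second-, third- and fourth-order TI of a quadratic function.
shares = [min_input_shares(d) for d in (2, 3, 4)]
print(shares)  # prints [5, 7, 9]
```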

Nonetheless, due to the local separation of the false and true parts in GliFreD circuits, the resistance of our proposed method against higher-order EM attacks is still an open question and should be addressed in future work. Further, GliFreD is designed exclusively for FPGAs and uses the fixed LUT structure to realize the Boolean functions of a circuit. Naively transferring this logic style to ASICs may not lead to the expected results, especially with respect to the area overhead. The idea of combining TI with DPL styles can be adopted for ASICs by employing one of the logic styles designed for ASICs in addition to a customized router.