1 Introduction

When designing a block cipher, one needs to consider many possible cryptanalysis attacks and often find the best trade-off between security, speed, ease of implementation, etc. Besides the two main directions in the form of linear [1] and differential [2] cryptanalysis, today the most prominent attacks come from the implementation attacks group, where side-channel attacks (SCAs) play an important role. To protect against SCAs, one common option is to use countermeasures such as hiding or masking schemes [3], of which one well-known example is the threshold implementation [4]. However, such countermeasures come with a cost when implementing ciphers. In more resource-constrained environments, one often does not have enough resources to implement standard ciphers like AES and therefore needs to use lightweight cryptography. However, even lightweight ciphers can be too resource demanding, especially once the cost of countermeasures is added. Therefore, although countermeasures represent the way to go when considering SCA protection, there is no countermeasure (at least at the current state of the research) that offers sufficient protection against any attack while being cheap enough to be implemented in any environment.

In this paper, we consider how to improve SCA resilience of ciphers without imposing any extra cost. This is possible by considering the inherent resilience of ciphers. We particularly concentrate on block ciphers which utilize S-boxes and therefore study the resilience of S-boxes against side-channel attacks.

In the case of SCA concentrating only on one bit of the S-box output, a theoretical connection between the side-channel resistance and the differential uniformity of S-boxes has been found in [5]. In particular, the authors showed that the higher the side-channel resistance, the smaller the differential resistance. However, as we show, this result does not straightforwardly extend to more complex leakage models such as the Hamming weight of the S-box output, which is the most prominent leakage model in side-channel analysis when considering Correlation Power Analysis (CPA) [6]. We therefore investigate S-box parameters which may influence the side-channel resistance while still yielding good or optimal cryptographic properties. The (almost) preservation of the Hamming weight and a small Hamming distance between x and F(x) are two properties each of which could, from an intuitive perspective, strengthen the resistance to SCA. Our theoretical and empirical findings show that notably in the case of exactly preserving the Hamming weight, the SCA resilience is improved. Moreover, we relax this assumption and investigate S-boxes that almost preserve the Hamming weight. For our study, we employ the confusion coefficient [7] as a metric for side-channel resistance. Besides the signal-to-noise ratio and the number of observed measurements, the confusion coefficient is the factor influencing the success rate of CPA and, moreover, it is the only factor that depends on the underlying algorithm and thus on the S-box. More precisely, our main contributions are:

  1.

    We calculate (resp. bound from above) the confusion coefficient of a function F in the two scenarios where:

    (a)

      x and F(x) have the same Hamming weight.

    (b)

      on average, F(x) has a Hamming weight near that of x.

  2.

    We observe that the S-boxes with no difference between the Hamming weights of their input and output have nonlinearity equal to 0; more generally, the same happens when the Hamming weights of x and F(x) always have the same parity. Such functions are of course to be avoided from a cryptanalysis perspective. Furthermore, we show more generally that, for every S-box F, denoting by \(d_{w_H}\) the number of inputs x for which the Hamming weights of x and F(x) have different parities, F has nonlinearity at most \(d_{w_H}\). This implies that if the number of inputs x such that \(w_H(x)\ne w_H(F(x))\) is at most \(d_{w_H}\), the nonlinearity is at most \(d_{w_H}\). We show in Example 2 that this does not, however, necessarily make the S-box weak. We emphasize that although these observations could be regarded as trivial, they have practical consequences.

  3.

    We show the connection between the number of fixed points in a function F and its nonlinearity.

  4.

    We show that S-boxes such that F(x) lies at a small Hamming distance from x (or more generally from an affine function of x) cannot have high nonlinearity although the obtainable values are not too bad for \(n = 4, 8\).

  5.

    In the practical part, we confirm our theoretical findings about the connection between (almost) preserving the Hamming weight and the confusion coefficient by investigating several S-boxes.

  6.

    We investigate the relationship between the confusion coefficient of different key guesses and evaluate a number of S-boxes used in today’s ciphers to show that their SCA resilience can significantly differ.

2 Preliminaries

2.1 Generalities on S-Boxes

Let n, m be positive integers, i.e., \(n, m \in \mathbb {N}^+\). We denote by \(\mathbb {F}_{2}^{n}\) the n-dimensional vector space over \(\mathbb {F}_{2}\), where \(\mathbb {F}_{2}\) is the Galois field with two elements, and by \(\mathbb {F}_{2^n}\) the finite field with \(2^n\) elements. Further, for any set S, we denote \(S \backslash \{0\}\) by \(S^{*}\). The usual inner product of a and b in \(\mathbb {F}_{2}^n\) equals \(a\cdot b = \bigoplus _{i=1}^{n} a_{i}b_{i}\).

The Hamming weight \(w_H(a)\) of a vector a, where \(a \in \mathbb {F}_{2}^{n}\), is the number of non-zero positions in the vector. An (n, m)-function is any mapping F from \(\mathbb {F}_{2}^{n}\) to \(\mathbb {F}_{2}^{m}\). An (n, m)-function F can be defined as a vector \(F = (f_1,\cdots ,f_m)\), where the Boolean functions \(f_i: \mathbb {F}_2^n \rightarrow \mathbb {F}_2\) for \(i \in \{1, \cdots , m\}\) are called the coordinate functions of F.

The component functions of an (n, m)-function F are all the linear combinations of the coordinate functions with non all-zero coefficients. Since for every n, there exists a field \(\mathbb {F}_{2^n}\) of order \(2^n\), we can endow the vector space \(\mathbb {F}_2^n\) with the structure of that field when convenient. If the vector space \(\mathbb {F}_2^n\) is identified with the field \(\mathbb {F}_{2^n}\), then we can take \(a\cdot b = tr (ab)\), where \(tr(x) = x + x^2 + \ldots +x^{2^{n-1}}\) is the trace function from \(\mathbb {F}_{2^n}\) to \(\mathbb {F}_{2}\). The addition of elements of the finite field \(\mathbb {F}_{2^n}\) is denoted with “+”, as usual in mathematics. Since we often identify \(\mathbb {F}_{2}^n\) with \(\mathbb {F}_{2^n}\), when there is no ambiguity we denote the addition of vectors of \(\mathbb {F}_{2}^n\) for \(n>1\) with “+” as well.

An (n, m)-function F is balanced if it takes every value of \(\mathbb {F}_{2}^{m}\) the same number \(2^{n - m}\) of times.

The Walsh-Hadamard transform of an (n, m)-function F is (see, e.g., [8]):

$$\begin{aligned} W_{F} (a, v) = \sum _{x \in \mathbb {F}_{2}^{n}} (-1)^{v\cdot F(x) + a\cdot x}, \ a \in \mathbb {F}_{2}^{n}, \ v \in \mathbb {F}_{2}^{m}. \end{aligned}$$

The nonlinearity nl of an (n, m)-function F equals the minimum nonlinearity of all its component functions \(v\cdot F\), where \(v \in \mathbb {F}_{2}^{m*}\) [8, 9]:

$$\begin{aligned} nl = 2^{n-1} - \frac{1}{2} \max _{a \in \mathbb {F}_{2}^{n},\, v \in \mathbb {F}_{2}^{m*}} |W_{F}(a,v)|. \end{aligned}$$
The nonlinearity of any (n, m)-function F is bounded above by the so-called covering radius bound:

$$\begin{aligned} nl \le 2^{n-1} - 2^{\frac{n}{2}-1}. \end{aligned}$$
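To make these definitions concrete, the sketch below computes \(W_F(a,v)\) and the nonlinearity by brute force from a lookup table. The helper names (`dot`, `walsh`, `nonlinearity`) are our own, and the 4-bit PRESENT S-box is used purely as an illustrative input; it is not one of the S-boxes studied later.

```python
def dot(a, b):
    """Inner product a . b on F_2^n: parity of the bitwise AND."""
    return bin(a & b).count("1") & 1

def walsh(F, n, m, a, v):
    """Walsh-Hadamard coefficient W_F(a, v) = sum_x (-1)^(v.F(x) + a.x)."""
    return sum((-1) ** (dot(v, F[x]) ^ dot(a, x)) for x in range(2 ** n))

def nonlinearity(F, n, m):
    """nl = 2^(n-1) - (1/2) * max over a and v != 0 of |W_F(a, v)|."""
    wmax = max(abs(walsh(F, n, m, a, v))
               for v in range(1, 2 ** m)
               for a in range(2 ** n))
    return 2 ** (n - 1) - wmax // 2

# The 4-bit PRESENT S-box, used here only as an example lookup table
PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
nl = nonlinearity(PRESENT, 4, 4)  # for n = 4 the covering radius bound is 6
```

For \(n=4\) the covering radius bound gives \(nl \le 8 - 2 = 6\); the PRESENT S-box attains the optimal value 4 for 4-bit permutations.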

In the case \(m=n\), a better bound exists. The nonlinearity of any (n, n)-function F is bounded above by the so-called Sidelnikov-Chabaud-Vaudenay bound [10]:

$$\begin{aligned} nl \le 2^{n-1} - 2^{\frac{n-1}{2}}. \end{aligned}$$

Bound (4) is an equality if and only if F is an Almost Bent (AB) function, by definition of AB functions [8].

Let F be a function from \(\mathbb {F}_2^n\) into \(\mathbb {F}_2^m\) with \(a \in \mathbb {F}_2^n\) and \(b \in \mathbb {F}_2^m\). We denote:

$$\begin{aligned} D_F (a, b) = \left\{ x \in \mathbb {F}_2^n : F(x)+F(x+a) =b\right\} . \end{aligned}$$

The entry at position (a, b) of the difference distribution table equals the cardinality of the set \(D_F (a, b)\) and is denoted by \(\delta (a, b)\). The differential uniformity \(\delta _F\) is then defined as [11]:

$$\begin{aligned} \delta _F = \max _{\begin{array}{c} a \ne 0, b \end{array}} \delta (a, b). \end{aligned}$$

Functions that have differential uniformity equal to 2 are called Almost Perfect Nonlinear (APN) functions. Every AB function is also APN, but the converse does not hold in general. AB functions exist only for an odd number of variables, while APN functions also exist for an even number of variables. When discussing the differential uniformity of permutations, the best possible (and known) value is 2 for any odd n and also for \(n = 6\). For n even and larger than 6, this is an open question. The differential uniformity of the inverse function \(F(x)=x^{2^n-2}\) equals 4 when n is even and 2 when n is odd.
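The definition above is easy to evaluate exhaustively for small S-boxes; the sketch below tabulates \(\delta (a,b)\) and returns \(\delta _F\). The helper name `diff_uniformity` is ours, and the PRESENT S-box again serves only as an example input.

```python
def diff_uniformity(F, n):
    """delta_F = max over a != 0 and b of |{x : F(x) + F(x + a) = b}|."""
    delta = 0
    for a in range(1, 2 ** n):
        counts = [0] * (2 ** n)
        for x in range(2 ** n):
            counts[F[x] ^ F[x ^ a]] += 1  # b = F(x) + F(x + a)
        delta = max(delta, max(counts))
    return delta

PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
# PRESENT is differentially 4-uniform, the best possible for 4-bit permutations
```

Note that for the identity mapping, \(F(x)+F(x+a)=a\) for every x, so every row of the table concentrates on a single value b and \(\delta _F = 2^n\).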

2.2 Side-Channel Resistance

Side-channel attacks analyze physical leakage that is unintentionally emitted during cryptographic operations in a device (e.g., through the power consumption [12] or electromagnetic emanation [13]). This side-channel leakage is statistically dependent on the intermediate processed values involving the secret key, which makes it possible to retrieve the secret from the measured data. In particular, as the attacker wants to retrieve the secret key, he makes predictions (hypotheses) on a small enumerable chunk (e.g., byte) of an intermediate state using all possible key values.

The side-channel resistance of implementations against Correlation Power Analysis (CPA) [6] depends on three factors: the number of measurement traces, the signal-to-noise ratio (SNR) [14], and the confusion coefficient [7]. The relationship between the three factors is linear in the case of low SNR [15]. The confusion coefficient measures the discrepancy between the hypothesis of an intermediate state using the correct (secret) key and any hypothesis made with a (wrong) key assumption. Therefore, as one compares possible intermediate processed values, the confusion coefficient depends on the underlying cryptographic algorithm and thus, if the attacker targets an S-box operation, on the side-channel resistance of that S-box. More precisely, let us assume the attacker exploits an intermediate processed value \(F(k_c + t)\) during the first round that depends on the secret key \(k_c \in \mathbb F_2^n\), an n-bit chunk of the plaintext \(t \in \mathbb {F}_2^n\), and an S-box function F. Moreover, let us make the commonly accepted assumption that the device leaks side-channel information as the Hamming weight (see, e.g., [14]) of intermediate values with additive noise N:

$$\begin{aligned} w_H(F(k_c + t)) + N. \end{aligned}$$

As the secret key \(k_c\) is unknown to the attacker, he computes for each key guess \(k_g \in \mathbb {F}_2^n\) a hypothesis about the intermediate state:

$$\begin{aligned} y_{k_g,t} = y(k_g,t) = w_H(F(k_g + t)) \end{aligned}$$

of the deterministic part of the leakage in Eq. (7). Interestingly, these hypotheses are not independent and their discrepancy is characterized by the confusion coefficient. Originally, in [7], the confusion coefficient was introduced for (n, 1) Boolean functions:

$$\begin{aligned} \kappa (k_c,k_g)= Pr[(y(k_c,T))\ne (y(k_g,T))], \end{aligned}$$

with T being the random variable whose realization is t. In [5], the authors related \(\kappa (k_c,k_g)\) in Eq. (9) to \(\delta _F\) and showed that the higher the side-channel resistance, the smaller the differential resistance (that is, the higher \(\delta _F\)). In fact, \(\kappa (k_c,k_g)\) is represented as

$$\begin{aligned} \frac{1}{2^n} \sum _{t\in \mathbb F_2^n}\left( F(t + k_c) + F(t + k_g)\right) , \end{aligned}$$

which can then be straightforwardly connected to \(\delta _F\) for 1-bit models.

In [16] the authors extended \(\kappa (k_c,k_g)\) to the general multi-bit case for CPA and thus to (n, m)-functions F. In this paper, we use the definition given in [15], which is a standardized version of the confusion coefficient given in [16] and thus a natural extension of Eq. (9):

$$\begin{aligned} \kappa (k_c,k_g) = \mathbb E\Bigl \{\Bigl (\frac{1}{2}({y(k_c,T)-y(k_g,T)})\Bigr )^2\Bigr \}, \end{aligned}$$

where y is assumed to be standardized (i.e., \(\mathbb E(y(\cdot ,T))=0, Var(y(\cdot ,T))=1\)). More specifically, Eq. (11) enables us to compare confusion coefficients for different functions F. By substituting \(y(\cdot )\) with Eq. (8) and denoting \(x = t \oplus k_c\) and \(a=k_c+k_g\), we can write \(\kappa (k_c,k_g)\) as

$$\begin{aligned} \kappa (k_c,k_g) = \mathbb {E}\left( \left( \frac{w_H(F(x))-w_H(F(x+a))}{\sqrt{n}}\right) ^2\right) . \end{aligned}$$
Now, it is easy to see from Eq. (12) that we cannot straightforwardly derive a connection to \(\delta _F\) for (n, m)-functions. More precisely, for \(m = 1\) the square is just 4 times the value of \(F(t)+ F(t+a)\), and the confusion coefficient then equals \(\delta (a, 1)\). For \(m>1\) we have the square of the difference between the weights of F(t) and \(F(t+ a)\), which is not 4 times the weight of \(b=F(t)+ F(t + a)\), because the \(1\rightarrow 0\) and the \(0\rightarrow 1\) bit transitions count with their signs in the sum. So there is no direct connection with \(\delta _F\) anymore.
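The standardized coefficient of Eq. (11) can be estimated directly from the key hypotheses. The sketch below does so over all plaintext chunks t; the helper names (`hamming`, `confusion`) are ours, and the PRESENT S-box is used only as a stand-in target.

```python
from statistics import mean, pstdev

def hamming(x):
    return bin(x).count("1")

def confusion(F, n, kc, kg):
    """Standardized confusion coefficient, Eq. (11):
    E[ ((y(kc, T) - y(kg, T)) / 2)^2 ], with y standardized over all t."""
    def y(k):
        raw = [hamming(F[k ^ t]) for t in range(2 ** n)]
        mu, sd = mean(raw), pstdev(raw)
        return [(r - mu) / sd for r in raw]
    yc, yg = y(kc), y(kg)
    return mean(((c - g) / 2) ** 2 for c, g in zip(yc, yg))

PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
kappas = [confusion(PRESENT, 4, 0, kg) for kg in range(1, 16)]
# kappa = 0 (or 1) would mean kg is indistinguishable from the correct key
```

Since y is standardized, \(\kappa = (1-\rho )/2\) for correlation \(\rho \) between the two hypothesis vectors, so all values lie in \([0,1]\).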

As a decisive criterion for the comparison between confusion coefficients, the minimum value of \(\kappa (k_c,k_g)\) was specified in [15], as it relates to the success rate when the SNR is low. Note that the higher the minimum of the confusion coefficient, the lower the side-channel resilience. This comes from the fact that the lower the confusion coefficient, the smaller the (Euclidean) distance between the correct key \(k_c\) and a key guess \(k_g\), and thus the harder it is for an attacker to distinguish whether the leakage arises from a computation with \(k_c\) or \(k_g\). A detailed discussion of this will be given in Subsect. 5.2. On the other hand, in [17] the authors use \(var(\kappa (k_c,k_g))\) as a criterion, where smaller values indicate lower side-channel resilience. Our experiments in Sect. 5 show that both metrics coincide with the empirical resilience obtained using simulations.

In the case \(\kappa (k_c,k_g)=0\) or \(\kappa (k_c,k_g)=1\) for some \(k_g \ne k_c\), CPA is not able to distinguish between \(k_c\) and this key guess \(k_g\) and will thus fail to uniquely reveal the secret key, even if the number of measurements goes to infinity. More precisely, \(\kappa (k_c,k_g)=0\) means that for a key guess \(k_g\) one observes exactly the same intermediate values (see Eq. (8)) as for the correct key \(k_c\). Conversely, for \(\kappa (k_c,k_g)=1\) one observes the complementary values (as can be seen from Eqs. (9) and (11)); however, as CPA takes the absolute value of the correlation (due to hardware-related properties [14]), an attacker again cannot distinguish between \(k_c\) and \(k_g\) in this case. In general, normalized confusion coefficient values close to 0.5 indicate that \(k_c\) and \(k_g\) can be easily distinguished (see Eq. (9)). We will show in Sect. 3 and empirically confirm in Sect. 5 that in the case of preserving \(w_H\) there exists a key guess \(k_g\) such that \(\kappa (k_c,k_g)=1\).

3 S-Boxes (Almost) Preserving the Hamming Weight

3.1 Relation to the Confusion Coefficient

To obtain, for an (n, m)-function F, a connection between the confusion coefficient and the Hamming weight preservation (i.e., the fact that, for every x, F(x) has the same Hamming weight as x) or, more generally, a limited average Hamming weight modification, we start with Eq. (12). For any function F, we have:

$$\begin{aligned} \kappa (k_c,k_g) = \frac{1}{2n}\, \mathbb {E} \left( \left( \sum _{i=1}^n(-1)^{F_i(x)}\right) ^2- \left( \sum _{i=1}^n(-1)^{F_i(x)}\right) \left( \sum _{i=1}^n (-1)^{F_i(x+a)}\right) \right) . \end{aligned}$$
Lemma 1 addresses the case where F preserves the Hamming weight, whereas the scenario in which F modifies the Hamming weight in a limited way is described in Lemma 2. Note that the first scenario is a particular case of the second.

Lemma 1

For an (n, n)-function F such that, for every x, F(x) has the same Hamming weight as x, the confusion coefficient equals \(\frac{w_H(a)}{n}\).


Proof

If F preserves the Hamming weight, that is, if \(w_H(F(x))=w_H(x)\) for every x (or more generally, if F is the composition of a function preserving the weight by an affine isomorphism on the right), then the confusion coefficient \(\kappa (k_c,k_g)=\mathbb {E}\left( \left( \frac{w_H(F(x))-w_H(F(x+a))}{\sqrt{n}}\right) ^2\right) ,\) where \(a=k_c+k_g\), becomes \(\mathbb {E}\left( \left( \frac{w_H(x)-w_H(x+a)}{\sqrt{n}}\right) ^2\right) \), and by applying Eq. (13) (which is valid for every F) to \(F=Id\), we obtain:

$$\begin{aligned}&\frac{1}{2n} \mathbb {E} \left( \left( \sum _{i=1}^n(-1)^{x_i}\right) ^2- \left( \sum _{i=1}^n(-1)^{x_i}\right) \left( \sum _{i=1}^n (-1)^{x_i+a_i}\right) \right) \\ \nonumber =&\,\frac{1}{2n} \mathbb {E} \left( \sum _{1\le i,j\le n}(-1)^{x_i+x_j}- \sum _{1\le i,j\le n}(-1)^{x_i+x_j+a_j}\right) \nonumber . \end{aligned}$$

The expectations of all these sums for \(i\ne j\) are null (since the character sums of nonzero linear functions are null), and we obtain:

$$\begin{aligned} \frac{1}{2n} \mathbb {E} \left( n- \sum _{1\le i\le n}(-1)^{a_i}\right) =\frac{1}{n} \mathbb {E} \left( w_H(a)\right) =\frac{w_H(a)}{n}. \end{aligned}$$

Example 1

For \(n=4\), Lemma 1 gives \(\min _{k_c \ne k_g} \kappa (k_c,k_g) = 0.25\) and for \(w_H(a)=n\) we have \(\kappa (k_c,k_g) = 1\), which means that the CPA distinguisher is not able to distinguish between these two hypotheses \(k_g\) and \(k_c\) (see Subsect. 2.2). Note that we give a more detailed discussion about the results and their ramifications in Sect. 5.
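Lemma 1 is easy to check exhaustively. The sketch below evaluates the confusion coefficient in the form of Eq. (12) for the identity mapping, which trivially preserves the Hamming weight; the helper names are ours.

```python
def hamming(x):
    return bin(x).count("1")

def kappa(F, n, a):
    """Confusion coefficient in the form of Eq. (12):
    E[ ((w_H(F(x)) - w_H(F(x + a))) / sqrt(n))^2 ], averaged over all x."""
    return sum((hamming(F[x]) - hamming(F[x ^ a])) ** 2
               for x in range(2 ** n)) / (n * 2 ** n)

n = 4
identity = list(range(2 ** n))  # trivially weight-preserving
for a in range(1, 2 ** n):
    # Lemma 1: kappa equals w_H(a) / n exactly
    assert kappa(identity, n, a) == hamming(a) / n
# a = 0xF (w_H = 4) gives kappa = 1: that key guess cannot be distinguished
```

All values here are exact dyadic fractions, so the floating-point comparison is exact.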

Lemma 2

For an (n, n)-function F such that, on average, F(x) has a Hamming weight near that of x, more precisely, such that \(\sum _x |w_H(F(x))-w_H(x)|\le d_{w_H}\) for some number \(d_{w_H}\), the standardized confusion coefficient is bounded above by \(\frac{w_H(a)}{n}+ \frac{4d_{w_H}}{2^n} \).


Proof

If \(\mathbb {E}(|w_H(F(x))-w_H(x)|)\le \frac{d_{w_H}}{2^n}\), then according to Lemma 1 and its proof, the confusion coefficient \(\kappa (k_c,k_g)=\) \(\mathbb {E}\left( \left( \frac{w_H(F(x))-w_H(F(x+a))}{\sqrt{n}}\right) ^2\right) \) is such that

$$\begin{aligned} \left| \kappa (k_c,k_g)-\frac{w_H(a)}{n}\right|\le & {} \mathbb {E}\left( \left| \left( \frac{w_H(F(x))-w_H(F(x+a))}{\sqrt{n}}\right) ^2-\left( \frac{w_H(x)-w_H(x+a)}{\sqrt{n}}\right) ^2\right| \right) \\ {}= & {} \mathbb {E}\left( \left| \left( \frac{w_H(F(x))-w_H(F(x+a))}{\sqrt{n}}-\frac{w_H(x)-w_H(x+a)}{\sqrt{n}}\right) \right. \right. \\&\quad \quad \quad \quad \left. \left. \left( \frac{w_H(F(x))-w_H(F(x+a))}{\sqrt{n}}+\frac{w_H(x)-w_H(x+a)}{\sqrt{n}}\right) \right| \right) \\ {}\le & {} \frac{2}{n} \left( \max _{x\in \mathbb {F}_2^n} w_H(F(x))+\max _{x\in \mathbb {F}_2^n}w_H(x)\right) \mathbb {E}\left( \left| w_H(F(x))-w_H(x)\right| \right) \\ {}= & {} \frac{4d_{w_H}}{2^n}. \end{aligned}$$

3.2 Relation to Cryptographic Properties

We study the cryptographic consequences of the preservation of the Hamming weight. Again, we first cover the specific case where the input and output of an S-box always have the same Hamming weight, and then the case where the output has, on average, a Hamming weight close to that of the corresponding input (see Lemma 3).

If for every x, we have \(w_H(F(x))=w_H(x)\) then the sum (mod 2) of all coordinate functions of F equals the sum (mod 2) of all coordinates of x. This means that F has nonlinearity equal to zero since one of its component functions is linear. Of course, the same happens under the much weaker hypothesis that \(w_H(F(x))\) and \(w_H(x)\) have always the same parity. Therefore, an S-box function preserving the Hamming weight is cryptographically insecure.

However, if \(\sum _x |w_H(F(x))-w_H(x)|\le d_{w_H}\), then we have \(nl \le d_{w_H}\). Indeed, this is a direct consequence of the following straightforward result, which has however much importance in our context:

Lemma 3

If the Hamming weight of the Boolean function:

$$x\mapsto (w_H(F(x))-w_H(x)) \ [mod\, 2],$$

that is, \(\sum _x ((w_H(F(x))-w_H(x)) \ [mod\, 2])\), is at most \(d_{w_H}\), then we have \(nl \le d_{w_H}\).

Indeed, the Hamming distance between the component function \(\sum _i F_i\ [mod\, 2]\) and the linear function \(\sum _i x_i\ [mod\, 2]\) is then at most \(d_{w_H}\).
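The bound of Lemma 3 can be verified exhaustively for any concrete S-box. The sketch below computes the parity-based quantity \(d_{w_H}\) and a brute-force nonlinearity; the helper names are ours, and the PRESENT S-box is used only as an example (the bound is far from tight for it).

```python
def dot(a, b):
    return bin(a & b).count("1") & 1

def nonlinearity(F, n):
    """Brute-force nl of an (n, n)-function given as a lookup table."""
    wmax = max(abs(sum((-1) ** (dot(v, F[x]) ^ dot(a, x))
                       for x in range(2 ** n)))
               for v in range(1, 2 ** n) for a in range(2 ** n))
    return 2 ** (n - 1) - wmax // 2

def parity_distance(F, n):
    """Weight of x -> (w_H(F(x)) - w_H(x)) mod 2, i.e. d_wH in Lemma 3."""
    return sum((bin(F[x]).count("1") ^ bin(x).count("1")) & 1
               for x in range(2 ** n))

PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
assert nonlinearity(PRESENT, 4) <= parity_distance(PRESENT, 4)  # Lemma 3
```

For the identity mapping, `parity_distance` is 0 and the nonlinearity is indeed 0, matching the discussion above.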

Example 2

For a (4, 4)-function F to have nonlinearity equal to 4 (the optimal nonlinearity), \(d_{w_H}\) must be at least 4. In order to construct functions with such properties, we ran a genetic algorithm as given by Picek et al. [17]. We use the same settings as there: 30 independent runs, population size equal to 50, 3-tournament selection, and mutation probability 0.3 per individual. The objective is the maximization of the following fitness function:

$$\begin{aligned} fitness = nl + \varDelta _{nl, 4}\left( n\times 2^n - \sum _x \left| w_H(F(x))-w_H(x) \right| \right) . \end{aligned}$$

Here, \(\varDelta _{nl, 4}\) represents the Kronecker delta, which equals 1 when the nonlinearity is 4 and 0 otherwise. Notice that we subtract the summed difference between the Hamming weights of the inputs and outputs of the S-box from \(n\times 2^n\), since we work with a maximization problem while that difference should be minimized. Interestingly, we observed that finding S-boxes with these properties is a relatively easy task and that the obtained S-boxes never have more than 8 fixed points. We give examples of such S-boxes in Table 1; for instance, \(S_5\), whose nonlinearity equals 4 and whose \(d_{w_H}\) equals 4.
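The fitness evaluation itself is straightforward to reproduce. The sketch below computes \(d_{w_H}\) and the fitness for a given (4, 4) S-box, assuming the nonlinearity has been computed separately and is passed in; the helper names are ours, and the PRESENT S-box (with nl = 4) is used only as a stand-in, since it is not one of the S-boxes of Table 1.

```python
def d_wh(F, n):
    """d_wH = sum over x of |w_H(F(x)) - w_H(x)|."""
    return sum(abs(bin(F[x]).count("1") - bin(x).count("1"))
               for x in range(2 ** n))

def fitness(F, n, nl):
    """fitness = nl + delta(nl, 4) * (n * 2^n - d_wH), as in the text."""
    return nl + (n * 2 ** n - d_wh(F, n) if nl == 4 else 0)

PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
# PRESENT has d_wH = 20, so with nl = 4 its fitness is 4 + (64 - 20) = 48
```

A weight-preserving S-box would maximize the second term (fitness 4 + 64), but, as shown above, it cannot simultaneously have nonzero nonlinearity.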

Next, inspired by our empirical results, we investigate whether it is theoretically possible to construct an S-box with even more fixed points while still having the maximal nonlinearity.

Lemma 4

If an (n, n)-function has k fixed points, then the maximal value of \(W_F(a,v)\) when \(v \ne 0\) is bounded below by \((k-1)/(1-2^{-n})\). If nl is the nonlinearity of an (n, n)-function, then its number k of fixed points is not larger than \(2^n-\lceil (2-2^{1-n})\, nl\rceil \).


Proof

The number of fixed points k of an (n, n)-function F equals:

$$\begin{aligned} k = 2^{-n}\sum _{v\in \mathbb {F}_2^n} W_F(v,v) = 2^{-n} \sum _{x,v\in \mathbb {F}_2^n} (-1)^{v\cdot (x + F(x))}, \end{aligned}$$

which follows from Eq. (1) with \(a = v\) and the property that \(\sum _{v\in \mathbb {F}_ 2^n}(-1)^{v\cdot a}\) equals \(2^n\) if \(a=0\) and is null otherwise. The value of \(W_F(0,0)\) involved in Eq. (17) equals \(2^n\). Removing it, we obtain:

$$\begin{aligned} k - 1 = 2^{-n}\sum _{v\in \mathbb {F}_ 2^{n*}} W_F(v,v). \end{aligned}$$

Then the arithmetic mean of \(W_F(v,v)\) when \(v \ne 0\) equals \((k-1)/(1-2^{-n})\). This implies that \(\max _v W_F(v,v)\) is at least \((k-1)/(1-2^{-n})\) and the nonlinearity cannot be larger than \(2^{n-1}-(k-1)/(2-2^{1-n})\). The inequality \(nl\le 2^{n-1}-(k-1)/(2-2^{1-n})\) is equivalent to \(k\le 2^n-\lceil (2-2^{1-n})\, nl\rceil \).
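Both the spectral identity in Eq. (17) and the resulting bound can be verified exhaustively. The sketch below does so for a 4-bit example; the helper names are ours, and the PRESENT S-box (which has no fixed points and nonlinearity 4) is used purely as an illustration.

```python
import math

def dot(a, b):
    return bin(a & b).count("1") & 1

def walsh(F, n, a, v):
    return sum((-1) ** (dot(v, F[x]) ^ dot(a, x)) for x in range(2 ** n))

def fixed_points(F, n):
    return sum(1 for x in range(2 ** n) if F[x] == x)

n = 4
PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
k = fixed_points(PRESENT, n)
# Eq. (17): k = 2^-n * sum over v of W_F(v, v)
assert k == sum(walsh(PRESENT, n, v, v) for v in range(2 ** n)) // 2 ** n
# Lemma 4 bound with nl = 4: k <= 2^n - ceil((2 - 2^(1-n)) * nl) = 8
assert k <= 2 ** n - math.ceil((2 - 2 ** (1 - n)) * 4)
```

For the identity mapping, the same identity gives \(k = 2^n\) and nl = 0, for which the bound holds with equality.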

4 S-Boxes Minimizing the Hamming Distance

4.1 Relation to the Confusion Coefficient

In real-world applications, the device may leak not only in the Hamming weight but also in the Hamming distance; therefore, we now extend our study to the case where the leakage arises from the Hamming distance between x and F(x). Again, we first study the relation to the confusion coefficient and then give the connection to cryptographic properties.

By the triangular inequality, we have \(|w_H(F(x))-w_H(x)|\le d_H(x,F(x))\). This implies that \(\sum _x |w_H(F(x))-w_H(x)|\le \sum _x d_H(x,F(x))\).

Hence, if \(\sum _x d_H(x,F(x))\le d_{d_H}\), we can use Lemma 2 and deduce that also in this scenario the confusion coefficient is bounded by \(\frac{w_H(a)}{n}+\frac{4d_{d_H}}{2^n}\).

4.2 Relation to Cryptographic Properties

Given \(\sum _x d_H(x,F(x))\le d_{d_H}\), and up to adding a linear function (which changes neither the nonlinearity nor the differential uniformity), considering S-boxes such that, for every x, F(x) lies at a small distance from x corresponds to considering functions which take too small a number of values. We show that such functions have bad nonlinearity and bad differential uniformity.

Lemma 5

Let F be any (n, m)-function such that \(|F(\mathbb {F}_2^n)| \le D\), then \(\delta _F\ge \frac{2^n}{2^m-1}\left( \frac{2^n}{D}-1\right) \) and \(nl\le 2^{n-1}-\frac{\frac{2^{n+m-1}}{D}-2^{n-1}}{2^m-1}\).


Proof

By using the Cauchy-Schwarz inequality, we obtain \(\sum _{a\in \mathbb {F}_2^{n*}}|D_aF^{-1}(0)|=\sum _{b\in \mathbb {F}_2^m}|F^{-1}(b)|^2-2^n\ge \frac{(\sum _{b\in \mathbb {F}_2^m}|F^{-1}(b)|)^2}{D}-2^n=\frac{2^{2n}}{D}-2^n\), and there exists then \(a\in \mathbb {F}_2^{n*}\) such that \(|D_aF^{-1}(0)|\ge \frac{\frac{2^{2n}}{D}-2^n}{2^m-1}\). This proves the first assertion.

We have a partition of \(\mathbb {F}_2^n\) into at most D parts by the preimages \(F^{-1}(b)\), \(b\in \mathbb {F}_2^m\), and there exists then \(b\in \mathbb {F}_2^m\) such that \(|F^{-1}(b)|\ge \frac{2^n}{D}\); for such b, we have \(\sum _{x\in \mathbb {F}_2^n,v\in \mathbb {F}_2^m} (-1)^{v\cdot (F(x)+b)}\ge \frac{2^{n+m}}{D}\), which is equivalent to \(\sum _{v\in \mathbb {F}_2^m, v\ne 0} (-1)^{v\cdot b} W_F(0,v) \ge \frac{2^{n+m}}{D}-2^n\), and then there exists \(v\ne 0\) such that \(|W_F(0,v)|\ge \frac{\frac{2^{n+m}}{D}-2^n}{2^m-1}\), which implies that \(nl\le 2^{n-1}-\frac{\frac{2^{n+m-1}}{D}-2^{n-1}}{2^m-1}\). This proves the second assertion.

If D is small with respect to \(2^m\) (so that \(2^{n-1}\) is small with respect to \(\frac{2^{n+m-1}}{D}\)) and D is small with respect to \(2^{n/2}\) (so that \(\frac{2^n}{D}\) is large with respect to \(2^{n/2}\)), the nonlinearity is bad with respect to the covering radius bound \(nl\le 2^{n-1}-2^{n/2-1}\). More precisely, if \(D\le \frac{2^m}{\lambda }\) with \(\lambda >1\), then \(nl\le 2^{n-1}-\frac{(\lambda -1)2^{n-1}}{2^m-1}< 2^{n-1}-(\lambda -1)2^{n-m-1}\), and if \((\lambda -1)2^{n-m}\) is significantly larger than \(2^{n/2}\), the nonlinearity is bad with respect to the covering radius bound. We also have that if D is small with respect to \(2^m\), then \(\delta _F\) is large with respect to \(2^{n-m}\) if \(m<n\) and with respect to 2 if \(m=n\) (which are the smallest possible values of \(\delta _F\)).

If F is an (n, n)-function and \(x+F(x)\) has low weight for every x, say at most \(t_{d_H}\), which is equivalent to saying that \(d_H(x,F(x))\le t_{d_H}\) for every x, then its number of values is at most \(D=\sum _{i=0}^{t_{d_H}} \binom{n}{i}\) and we can apply the result above to \(x+F(x)\), which has the same nonlinearity and the same \(\delta _F\) as F. As far as we know, these observations are new. Note that we also have the possibility of applying Lemma 3, and then the nonlinearity is bounded by \(t_{d_H}\).
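For intuition on how restrictive this is, the short sketch below evaluates D and the two bounds of Lemma 5 for an (8, 8)-function with \(t_{d_H}=1\), i.e., F(x) differing from x in at most one bit; the function name `lemma5_bounds` is ours.

```python
import math

def lemma5_bounds(n, m, D):
    """Lemma 5: lower bound on delta_F and upper bound on nl for |F(F_2^n)| <= D."""
    delta_lb = 2 ** n / (2 ** m - 1) * (2 ** n / D - 1)
    nl_ub = 2 ** (n - 1) - (2 ** (n + m - 1) / D - 2 ** (n - 1)) / (2 ** m - 1)
    return delta_lb, nl_ub

n = m = 8
t = 1                                            # t_dH: at most one bit flipped
D = sum(math.comb(n, i) for i in range(t + 1))   # D = 1 + 8 = 9 values of x + F(x)
delta_lb, nl_ub = lemma5_bounds(n, m, D)
# delta_F >= 27.55, i.e. delta_F >= 28, far above the value 4 achieved by
# good 8-bit S-boxes such as the AES S-box; nl can be at most 114
```

The differential uniformity bound is the damning one here: any 8-bit S-box this close to the identity is differentially at least 28-uniform.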

Remark 3

Lemma 5 applies to the case when \(d_H(x,F(x))\le t_{d_H}\) for every x, where x equals \(t \oplus k_g\). This represents a setting one would encounter when working, for instance, with software implementations. Now, if we consider a hardware setting (e.g., FPGA), then we are interested in the case \(d_H(t,F(t \oplus k_g))\le t_{d_H}\) for every key. However, this case leads to the same observation as before, but now up to adding an affine function instead of a linear function as in Lemma 5.

5 Side-Channel Evaluation

5.1 Evaluation of S-Boxes with (Almost) \(w_H\) Preservation

As cryptographically non-optimal examples of S-boxes (almost) preserving \(w_H\), we consider four different functions F: the identity mapping (\(S_1\)); an F that is not the identity but preserves \(w_H\) (\(S_2\)); the identity mapping with an exchange of the images at positions \(x=3\) and \(x=13\), i.e., \(F(3)=13\) and \(F(13)=3\), where, as \(w_H(3) = 2\) and \(w_H(13) = 3\), we have \(d_{w_H}=2\) (see Lemma 2) (\(S_3\)); and \(F(x) = 2^n-1-x\), which gives the complementary Hamming weight (\(S_4\)). Finally, we investigate four S-box functions \(S_5\) to \(S_8\) with the smallest possible distance \(d_{w_H}\), equal to 4, and the maximal possible nonlinearity, equal to 4 (see Subsect. 3.2). The S-box functions \(S_7\) and \(S_8\) furthermore have optimal differential uniformity (=4). The mappings are given in Table 1.

Table 1. Specifications of the functions F(x), \(w_H(x)\)

The confusion coefficients are illustrated in Fig. 1. Note that the distribution of \(\kappa (k_c,k_g)\) is independent of the particular choice of \(k_c\) (provided there are no weak keys); the values of \(\kappa (k_c,k_g)\) are only permuted when choosing a different \(k_c\in \mathbb F_2^n\). For our experiments we choose \(k_c=0\) and, furthermore, order \(\kappa (k_c,k_g)\) in increasing order of magnitude for illustrative purposes. The minimum value of \(\kappa (k_c,k_g)\) for \(k_g\ne k_c\) is highlighted with a red cross, as it is one indicator of the side-channel resistance. Moreover, we mark \(\kappa (k_c,k_g)=0\) or \(\kappa (k_c,k_g)=1\) with a red circle, which points out that CPA is not able to distinguish between \(k_c\) and the marked \(k_g\).

Figure 1a shows that, indeed, \(k_c\) is indistinguishable from one key hypothesis \(k_g\) if \(w_H\) is preserved. In other words, even when knowing t and observing \(w_H(F(t+k_c))+N\) with F equal to \(S_1\), the attacker cannot exclusively gain information about \(k_c\), even if the number of measurements goes to infinity. Moreover, it confirms Lemma 1. Note that in our example \(a=k_g\), thus \( \kappa (k_c,k_g) = \frac{w_H(k_g)}{4}\). Interestingly, when comparing our results to the study in [5], where the authors investigated (n, 1)-functions, we observe that the confusion coefficient takes different values, which indeed confirms that the Hamming weight model is not a straightforward extension of 1-bit models. More precisely, in the case of linear (n, 1)-functions the authors observed that the confusion coefficient only takes values in {0, 1}, whereas our examples illustrate (in line with our theoretical findings in Sect. 3) that the confusion coefficient is not restricted to {0, 1} and is equal to 1 for only one particular \(k_g\). Interestingly, for \(d_{w_H}=2\) (see Fig. 1b) we also have that \(k_c\) is indistinguishable for one \(k_g\). Moreover, apart from \( \kappa (k_c,k_g)=1\), only two different values are taken, each 7 times. This means that CPA is not able to distinguish between each of these 7 key guesses and in total only produces three different correlation values. When considering a complementary \(w_H\) preservation (e.g., \(4-w_H\)), we obtain the same results as for \(w_H\) preservation (see also Fig. 1).

Note that, while being illustrative, these first four examples of F are not cryptographically optimal and thus are not suitable in practice. We therefore constructed four S-boxes (\(S_5\) to \(S_8\)) with the smallest \(d_{w_H}\) (=4) while having optimal nonlinearity. Note that \(S_5,S_6\) have suboptimal differential uniformity, while \(S_7,S_8\) are cryptographically optimal (i.e., optimal nonlinearity and differential uniformity). Figures 1c to f show the confusion coefficients of \(S_5\) to \(S_8\). We can observe that all these S-boxes have a very low minimum confusion coefficient, even lower than for \(S_1\) to \(S_4\). Moreover, like the previously investigated S-boxes, \(S_5\) has a key guess with \(\kappa (k_c,k_g)=1\). We therefore find an S-box that almost preserves the Hamming weight and for which, even with an infinite number of traces, the secret key cannot be exclusively recovered. As the minimum value of the confusion coefficient of \(S_5\) is low (=0.125), there additionally exist other key hypotheses which are hard to distinguish from the secret key. As a conclusion, we can say that exactly preserving \(w_H\) indeed results in good side-channel resistance, since we have \( \kappa (k_c,k_g)=1\). Moreover, for the case where \(w_H\) is almost preserved, we present here S-boxes which have a very low minimum confusion coefficient.

Fig. 1. Confusion coefficients

5.2 A Closer Look at the Confusion Coefficient

To understand why some (one or more) key guesses result in a smaller confusion coefficient than others, and how this relates to F, we concentrate on the connection between \(k_c,k_g\), F, and \(\kappa (k_c,k_g)\). Loosely speaking, we iterate over key guesses, which influence the input of F, while calculating the confusion coefficient on the measured output of F; our interest lies in the properties of F itself. To better address these connections, we split the problem into two subproblems.

First, we take a deeper look at the input of F, i.e., \(t \oplus k_g\) for all \(t,k_g \in \mathbb F_2^n\) (see Eq. (8)). Clearly, the \(\oplus \) operation induces a particular permutation for each key guess \(k_g\). A 2-D representation of \(t \oplus k_g\), with \(k_g\) on the horizontal and t on the vertical axis, is given in Fig. 2, where again, for simplicity, \(t,k_g \in \mathbb F_2^4\). In this figure we furthermore partition the matrix into boxes of \(4\times 4\) values and group them by the range of values they contain: blue (\(B_0\)): \(t \oplus k_g \in [0,3]\), yellow (\(B_1\)): \(t \oplus k_g \in [4,7]\), green (\(B_2\)): \(t \oplus k_g \in [8,11]\), and red (\(B_3\)): \(t \oplus k_g \in [12,15]\). Using this color representation we can easily see 4 different permutations \( \pi _0, \pi _1, \pi _2, \pi _3\) applied on (\(B_0\) \(B_1\) \(B_2\) \(B_3\)). More precisely, when considering a column representation (see Footnote 1) among the key guesses \(k_g\), we have:

  • for \(k_g \in [0,3]\): no permutation (\( \pi _0 = \bigl ({\begin{matrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 2 & 3 \end{matrix}}\bigr )\)),

  • for \(k_g \in [4,7]\): pairwise swap of elements in each half of the matrix (\( \pi _1 = \bigl ({\begin{matrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 3 & 2 \end{matrix}}\bigr )\)),

  • for \(k_g \in [8,11]\): swap of the two halves of the matrix (\( \pi _2 = \bigl ({\begin{matrix} 0 & 1 & 2 & 3 \\ 2 & 3 & 0 & 1 \end{matrix}}\bigr )\)),

  • for \(k_g \in [12,15]\): additionally a pairwise swap of elements in each half of the matrix (\( \pi _3 = \bigl ({\begin{matrix} 0 & 1 & 2 & 3 \\ 3 & 2 & 1 & 0 \end{matrix}}\bigr )\)).

Moreover, as highlighted by the zoom-in on each box, within each box (i.e., \(B_i\), \(0\le i \le 3\)) the same permutations \( \pi _0,\ldots , \pi _3\) act on the 4 column entries. Note that the order of permutations is the same for each box; in other words, regardless of the color and position of the box, the same permutation is applied. More formally, let \(b_{ij} \in [4i,4i+3]^4\) (for \(0 \le i,j \le 3\)) denote the columns within \(B_i\); then \(b_{ij}\) equals \(\pi _j\) applied to the column vector \((4i \ 4i+1 \ 4i+2 \ 4i+3)\).
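
The four permutations are a direct consequence of the XOR structure: the two high bits of \(k_g\) determine the box order of a column, and the two low bits determine the order within each box. The short sketch below (our own illustration, with \(\pi _0,\ldots ,\pi _3\) encoded by the bottom rows of their two-line notation) verifies this for all 16 columns of Fig. 2:

```python
# pi_0..pi_3 as position -> image mappings (bottom rows of the two-line notation):
# pi_0 identity, pi_1 pairwise swap, pi_2 swap of halves, pi_3 full reversal.
PI = [(0, 1, 2, 3), (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)]

for kg in range(16):
    i, j = kg >> 2, kg & 3
    column = [t ^ kg for t in range(16)]        # one column of Fig. 2
    # Box order: the value range [4b, 4b+3] found in rows [4p, 4p+3] is b = p XOR i
    box_order = tuple(column[4 * p] >> 2 for p in range(4))
    assert box_order == PI[i], (kg, box_order)
    # Within every box the 4 entries are permuted by pi_j, regardless of the box
    for p in range(4):
        base = 4 * (p ^ i)                       # value range of this box
        inner = tuple(column[4 * p + r] - base for r in range(4))
        assert inner == PI[j], (kg, inner)
print("all 16 columns decompose as pi_i over boxes and pi_j within boxes")
```

Note that XOR with \(j \in \{0,1,2,3\}\) on the index set \(\{0,1,2,3\}\) yields exactly the four permutations listed above, which is why the same \(\pi _j\) appears in every box.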

Second, we examine the expression of the confusion coefficient in Eq. (11) itself. Recall from Eq. (8), \(y_{k_g,t} = y(k_g,t) = w_H(F(k_g + t))\). Let

$$ y_{k_g} = (y(k_g,0),y(k_g,1), \ldots , y(k_g,2^n-1))$$

denote the vector of hypotheses for one key guess \(k_g\) over all texts t. Referring to Fig. 2, \( y_{k_g}\) relates to one column prior to the application of F and \(w_H\). The confusion coefficient can be rewritten as

$$\begin{aligned} \kappa (k_c,k_g) =\frac{1}{4\cdot 2^n} \Big \Vert y_{k_c} - y_{k_g} \Big \Vert _2^2 \end{aligned}$$

with \(\Vert \cdot \Vert _2\) being the Euclidean norm. Let us recall that we are especially interested in \(\min _{k_g\ne k_c} \kappa (k_c,k_g)\). Moreover, the elements of \( y_{k_c} - y_{k_g}\) lie in \([-4,4]\). Now, as Eq. (19) considers not the plain differences but their squares, we may conjecture that the minimum value is most likely reached when the elements of \( y_{k_c} - y_{k_g}\) lie in \([-1,1]\), which is discussed in more detail and confirmed using several lightweight S-boxes in Appendix A. Roughly speaking, one difference of \(\pm 2\) contributes as much as four changes of \(\pm 1\), and so on.

Fig. 2. Illustration of permutations of \(t \oplus k_g \ \forall t,k_g \in \mathbb F_2^4\) (input of F) (Color figure online)

Now let us put the observations of both parts together. Our previous findings about the permutations can be straightforwardly applied to the Hamming weight of the output of F. Let us assume w.l.o.g. \(k_c = 0\), then for \(k_g = 4i+j\) (with \(0\le i,j \le 3\)) we have

$$\begin{aligned} y_{k_g}&= \pi _i \begin{pmatrix} \pi _j\begin{bmatrix} y_{0,0} \\ y_{0,1} \\ \vdots \\ y_{0,3} \end{bmatrix}^T & \pi _j \begin{bmatrix}y_{0,4} \\ y_{0,5} \\ \vdots \\ y_{0,7} \end{bmatrix}^T & \pi _j \begin{bmatrix}y_{0,8}\\ y_{0,9}\\ \vdots \\ y_{0,11} \end{bmatrix}^T & \pi _j \begin{bmatrix}y_{0,12} \\ y_{0,13} \\ \vdots \\ y_{0,15} \end{bmatrix}^T \end{pmatrix}^T, \end{aligned}$$

with \( y_{0} = (y_{0,0}, y_{0,1}, \ldots , y_{0,15})\) and \((\cdot )^T\) denoting the transpose. Thus, we are looking for a function F such that the distance

$$\begin{aligned} \left\| y_{0} - \pi _i \begin{pmatrix} \pi _j\begin{bmatrix} y_{0,0} \\ y_{0,1} \\ \vdots \\ y_{0,3} \end{bmatrix}^T & \pi _j \begin{bmatrix}y_{0,4} \\ y_{0,5} \\ \vdots \\ y_{0,7} \end{bmatrix}^T & \pi _j \begin{bmatrix}y_{0,8}\\ y_{0,9}\\ \vdots \\ y_{0,11} \end{bmatrix}^T & \pi _j \begin{bmatrix}y_{0,12} \\ y_{0,13} \\ \vdots \\ y_{0,15} \end{bmatrix}^T \end{pmatrix}^T \right\| _2^2 \end{aligned}$$

is as small as possible for any \(\pi _i,\pi _j \in \{\pi _0,\pi _1,\pi _2,\pi _3\}\).
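
This block decomposition can be checked mechanically for an arbitrary bijection F, since \(\pi _i\) and \(\pi _j\) amount to XOR with the high and low key bits. The sketch below is our own (it assumes \(k_c = 0\) and encodes \(\pi _0,\ldots ,\pi _3\) by the bottom rows of their two-line notation):

```python
import random

def hw(x):
    return bin(x).count("1")

# pi_0..pi_3 as position -> image mappings
PI = [(0, 1, 2, 3), (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)]

def permute(pi, block):
    # position p of the result receives element pi[p] of the block
    return [block[pi[p]] for p in range(4)]

rng = random.Random(0)
F = list(range(16))
rng.shuffle(F)                        # an arbitrary bijection on F_2^4

y0 = [hw(F[t]) for t in range(16)]    # hypothesis vector for k_c = 0
for kg in range(16):
    i, j = kg >> 2, kg & 3
    # pi_j inside each 4-element block, then pi_i on the block order
    blocks = [permute(PI[j], y0[4 * b : 4 * b + 4]) for b in range(4)]
    rebuilt = sum(permute(PI[i], blocks), [])
    direct = [hw(F[t ^ kg]) for t in range(16)]
    assert rebuilt == direct
```

Because the identity holds for every bijection F, minimizing the distance above is purely a question of how F arranges the Hamming weights \(y_{0,t}\).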

This finding indicates that the order of the Hamming weight of the output of F plays a significant role. To be more precise, the minimum confusion coefficient may depend not only on the distribution of values along the 4 boxes (Example 4), but also on the order within each box (Example 5).

Example 4

Note that the elements of \(y_{k_g}\) follow a binomial distribution due to the application of \(w_H\). Therefore, 0 and 4 occur once each, 1 and 3 occur four times each, and 2 occurs six times. In order to reach a minimum squared Euclidean distance in Eq. (21), a natural strategy seems to be to distribute the values broadly among the 4 sets \([4i,4i+3]\) and to have a small difference between the values within one set. Let us consider the S-boxes of Midori [18] and Mysterion [19]. From Table 2 one can observe that for Midori we have the following sets: 2,2,3,2 – 3,3,4,3 – 1,2,1,2 – 0,1,1,2. So, the maximal distance between values within a set is 2. Moreover, the first three sets contain only 2 different values and the last one 3. On the contrary, for Mysterion (0,1,2,3 – 2,4,3,2 – 1,3,1,2 – 2,1,3,2) the structure looks less balanced. In particular, the maximal distance is 3 and each set contains at least 3 different values. When comparing the confusion coefficients in Fig. 3, we observe that Midori has a much smaller minimum confusion coefficient and is thus more SCA resilient.
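
The observations for Midori can be reproduced from its S-box \(Sb_0\) [18]. The sketch below is our own illustration; it assumes the normalization \(\kappa (k_c,k_g) = \Vert y_{k_c}-y_{k_g}\Vert _2^2/(4\cdot 2^n)\) for Eq. (19), and the reported minimum confusion coefficient is our own computation:

```python
def hw(x):
    return bin(x).count("1")

# Midori Sb0 [18]
MIDORI = [0xC, 0xA, 0xD, 0x3, 0xE, 0xB, 0xF, 0x7,
          0x8, 0x9, 0x1, 0x5, 0x0, 0x2, 0x4, 0x6]

# Hamming-weight "sets": w_H(F(x)) for x in [4i, 4i+3]
sets = [[hw(MIDORI[x]) for x in range(4 * i, 4 * i + 4)] for i in range(4)]
print(sets)  # -> [[2, 2, 3, 2], [3, 3, 4, 3], [1, 2, 1, 2], [0, 1, 1, 2]]

# maximal spread within a set is 2; the first three sets have 2 distinct values
assert max(max(s) - min(s) for s in sets) == 2
assert [len(set(s)) for s in sets] == [2, 2, 2, 3]

def kappa(sbox, kc, kg, n=4):
    d2 = sum((hw(sbox[t ^ kc]) - hw(sbox[t ^ kg])) ** 2 for t in range(2 ** n))
    return d2 / (4 * 2 ** n)

print(min(kappa(MIDORI, 0, kg) for kg in range(1, 16)))  # -> 0.125
```

Note that \(\kappa\) depends only on \(k_c \oplus k_g\), so taking \(k_c = 0\) is without loss of generality.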

Table 2. Known S-boxes and one modification of KLEIN, \( (x) w_H\!(x)\)
Fig. 3. Confusion coefficients of Midori, Mysterion, KLEIN, and KLEIN with a small modification

Example 5

Let us consider the S-box of KLEIN [20] and a small modification (\(S_9\)) in which we swap F(1) with F(3) (see Table 2). Note that both functions consist of the same values within the sets: 3,1,2,2 – 1,4,3,0 – 2,2,1,2 – 1,3,3,2. For both, \(\min _{k_g\ne k_c} \kappa (k_c,k_g)\) is reached for \(k_g=11 = 4\cdot 2+3\), thus \(\pi _i = \pi _2\) and \(\pi _j = \pi _3\). However, as Fig. 3d shows, for KLEIN we have \(\min _{k_g\ne k_c} \kappa (k_c,k_g) = 0.125\), whereas \(\min _{k_g\ne k_c} \kappa (k_c,k_g) = 0.1875\) for \(S_9\), which relates to a squared Euclidean distance (see Eq. (21)) of 8 and 12, respectively.
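
Example 5 can likewise be reproduced numerically. The sketch below is our own; it uses the KLEIN S-box [20], builds \(S_9\) by the stated swap, assumes \(k_c=0\) (without loss of generality, since \(\kappa\) depends only on \(k_c \oplus k_g\)) and the normalization \(\kappa = \Vert y_{k_c}-y_{k_g}\Vert _2^2/(4\cdot 2^n)\) for Eq. (19):

```python
def hw(x):
    return bin(x).count("1")

# KLEIN S-box [20]
KLEIN = [0x7, 0x4, 0xA, 0x9, 0x1, 0xF, 0xB, 0x0,
         0xC, 0x3, 0x2, 0x6, 0x8, 0xE, 0xD, 0x5]
S9 = list(KLEIN)
S9[1], S9[3] = S9[3], S9[1]          # S_9: swap F(1) and F(3)

def kappa(sbox, kc, kg, n=4):
    d2 = sum((hw(sbox[t ^ kc]) - hw(sbox[t ^ kg])) ** 2 for t in range(2 ** n))
    return d2 / (4 * 2 ** n)

# both minima are reached for k_g = 11, with values 0.125 and 0.1875
for sbox, expected in ((KLEIN, 0.125), (S9, 0.1875)):
    kgs = {kg: kappa(sbox, 0, kg) for kg in range(1, 16)}
    best = min(kgs, key=kgs.get)
    assert best == 11 and kgs[best] == expected
```

The corresponding squared Euclidean distances are \(0.125 \cdot 64 = 8\) and \(0.1875 \cdot 64 = 12\), matching the values derived from Eq. (21).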

Furthermore, in Appendix A we investigate several lightweight S-boxes in terms of the minimum confusion coefficient and provide empirical evaluations. Note that a preliminary study showing the differences between some lightweight S-boxes was conducted in [21] (see Footnote 2). Our extended results in Appendix A theoretically and empirically confirm [21]. Moreover, the appendix provides details about the minimum Euclidean distance and the permutations \(\pi _i,\pi _j\). Additionally, we take a deeper look at the expression \( y_{k_c} - y_{k_g}\) for the key hypothesis \(k_g\) that results in the smallest confusion coefficient (i.e., \(\arg \min _{k_g\ne k_c} \kappa (k_c,k_g)\)). We discover that for \(S_5\) and the S-box proposed in [17], which has optimal properties of the confusion coefficient while having optimal differential properties, the difference \( \Vert y_{k_c} - y_{k_g}\Vert ^2_2\) has a particular structure that is not observed for any other investigated 4-bit S-box.

In conclusion, we derived specific criteria influencing the side-channel resistance (in particular Eq. (21) and our findings in Appendix A) that could be exploited in future work to optimize and find S-boxes in terms of side-channel resistance, especially when adapted for \(n>4\).

6 Conclusions

In this paper, we proved a number of bounds between various cryptographic properties that can also be related to the side-channel resilience of a cipher. Our results confirm the well-known intuition that making an S-box more resilient against SCA will potentially make it more vulnerable to classical cryptanalysis. However, they also show that for the usual sizes of S-boxes this weakening is moderate, and trade-offs are therefore possible.

Since our practical investigations in this work concentrated on the Hamming weight model, in the future we plan to explore possible trade-offs for the Hamming distance model and to extend our (empirical) analysis to larger S-boxes using the theoretical findings of this paper.