1 Introduction

It is well known that symmetric cryptographic primitives need to be non-linear. It is common to rely on so-called S-boxes to obtain this property. Typically these are functions S mapping \(\mathbb {F}_{2}^{n}\) to \(\mathbb {F}_{2}^{m}\) for a value of n small enough that it is possible to specify S using its lookup table. They are applied in parallel to the whole state as part of the round function of the primitive.

This common definition of S-boxes is being challenged by the recent use of larger S-boxes in some designs. First, the designers of the hash function WHIRLWIND  [5] used a 16-bit S-box based on the multiplicative inverse in the finite field \(\mathbb {F}_{2^{16}}\). In this case, the intention of the implementers was not to use the \(2^{17}\)-byte lookup table of the permutation but instead to rewrite the permutation using tower fields. More recently, large S-boxes have been proposed in Sparx  [19] and in the NIST lightweight candidate Saturnin  [15]. In the latter case, a 16-bit S-box is constructed using a classical Substitution-Permutation Network (SPN): four 4-bit S-boxes are applied to a 16-bit word in parallel, followed by an MDS matrix, and another application of the 4-bit S-box layer. While there is no closed formula for the differential and linear properties of such a structure (unlike for the multiplicative inverse used in WHIRLWIND), 16-bit remains small enough that a direct computation is possible.

This is not the case for the 32-bit S-box of Sparx. In this cipher, the S-box consists of an Addition, Rotation, XOR (ARX) network operating on two 16-bit branches, and it is key-dependent. Furthermore, while the properties of the S-box are usually sufficient to prove that the cipher meets some security criteria, this is not the case for the ARX-box of Sparx. Indeed, in order to achieve the security goals set by its designers (following the long trail security argument), it was necessary to study several “S-boxes”, namely A, \(A \circ A\), \(A \circ A \circ A\), etc.

Another significant difference between the 32-bit ARX-box of Sparx and 16-bit S-boxes is the fact that it is not possible to evaluate its cryptographic properties directly because the complexity of the algorithms involved is usually proportional to \(2^{2n}\), where n is the block size. Thus, the authors of Sparx instead considered their ARX-box like a small block cipher and used techniques borrowed from block cipher analysis  [14] to investigate their ARX-box.
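To make the \(2^{2n}\) cost concrete, the following minimal sketch computes the largest entry of the difference distribution table of a small S-box by direct enumeration. The 4-bit permutation `TOY_SBOX` and the function name are illustrative assumptions and are not part of Alzette or Sparx; the point is only that the two nested loops run over all input differences and all inputs, which is out of reach for n = 32 or n = 64.

```python
def differential_uniformity(sbox):
    """Return the largest DDT entry over all nonzero input differences."""
    size = len(sbox)
    best = 0
    for delta_in in range(1, size):        # 2^n - 1 nonzero input differences
        counts = [0] * size
        for x in range(size):              # 2^n inputs for each difference
            counts[sbox[x] ^ sbox[x ^ delta_in]] += 1
        best = max(best, max(counts))
    return best

# Arbitrary 4-bit permutation, chosen only for illustration (assumption).
TOY_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
            0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

print(differential_uniformity(TOY_SBOX))
```

For a 4-bit S-box the loops perform about \(2^{8}\) evaluations; the same enumeration on a 32-bit ARX-box would need about \(2^{64}\).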

Our Contribution. In this paper, we present a new 64-bit S-box called Alzette (pronounced ) that satisfies a similar scope statement to that of the Sparx ARX-box: it is also an ARX-based S-box, and we analyze both A and \(A \circ A\). Alzette is parameterized by a constant \(c \in \mathbb {F}_{2}^{32}\) and is defined for each such c as a permutation of \(\mathbb {F}_{2}^{32} \times \mathbb {F}_{2}^{32}\). The algorithm evaluating this permutation is given in Algorithm 1 and depicted in Fig. 1. Alzette has the following advantages:

  • it relies on 32-bit rather than 16-bit operations, meaning that (according to [18, Sect. 5]) it is suitable for a larger number of architectures;

  • it makes better use of barrel shift registers (when available) and has more efficient rotation constants (for platforms on which they have different costs);

  • its differential and linear properties are superior to those of a scaled-up Sparx ARX-box;

  • our analysis takes more attacks into account, and is confirmed experimentally whenever possible.

After providing a detailed design rationale of Alzette, we investigate its security against cryptanalytic attacks in more detail. Besides using state-of-the-art methods to conduct the analysis, we also developed new methods. In particular, to analyze the security against generalized integral attacks, we describe a new encoding of the bit-based division property  [39] for modular addition.

Note that in some attack scenarios, the security of Alzette needs to be analyzed for the precise choice of round constants c used in the actual primitive. In this work, we provide experimental analysis for the round constants employed in the permutation Sparkle, submitted to the NIST lightweight cryptography standardization process  [8]. However, our methods can easily be applied for an arbitrary choice of round constants.

Large parts of the experimental analysis have been carried out on the UL HPC cluster  [40]. The source code for our experimental analysis can be found at https://github.com/cryptolu/sparkle.

We provide software implementations of Alzette on 8-bit AVR and 32-bit ARM processors. To summarize, Alzette can be executed in only 12 cycles on a 32-bit ARM Cortex-M3 and 122 cycles on an 8-bit AVR ATmega128 processor. Besides, the code size is low: respectively 24 and 176 bytes on those platforms.

Finally, we discuss the suitability of Alzette as a building block in cryptographic primitives. Since we already know how to use Alzette to design a cryptographic permutation, i.e., Sparkle, we show in this paper how it can be applied to design (tweakable) block ciphers operating on a variety of block lengths. In a nutshell, those ciphers use Alzette in a Feistel construction and interleave it with XORing the round keys. In a tweakable block cipher, the tweak will be XORed only to half the state and only every second round. Similar to how the long-trail strategy was applied to take into account cancellations of differences within the absorption phase in a cryptographic sponge construction  [8], we use the same technique to provide security arguments against related-tweak attacks, by taking cancellations of differences through tweak injection into account.

Besides describing this more general design idea, we provide two concrete cipher instances Crax and Trax.

Crax is a 64-bit block cipher that uses a 128-bit secret key. Since its key schedule is very simple and does not have to be precomputed, it is one of the fastest 64-bit lightweight block ciphers in software, beaten only for messages longer than 72 bytes by the NSA cipher Speck  [6]. Due to this simple key schedule, it consumes less RAM than Speck. While the family of tweakable block ciphers Skinny  [9] can be considered as an academic alternative to the NSA cipher Simon  [6] in terms of hardware efficiency, Crax can be seen as an academic alternative to Speck in terms of software efficiency.

Trax is a tweakable block cipher operating on a larger state of 256-bit blocks. It uses a 256-bit key and a 128-bit tweak. To the best of our knowledge, the only other large tweakable block cipher is Threefish, which was used as a building block for the SHA-3 candidate Skein  [31]. Unlike this cipher, Trax uses 32-bit words that are better suited for vectorized implementations as well as for micro-controllers. Another improvement of Trax over Threefish is the fact that we provide strong bounds for the probability of all linear trails and all (related-tweak) differential trails. Because of its Substitution-Permutation Network structure, Trax is indeed inherently easier to analyze. Such a large tweakable block cipher can provide robust authenticated encryption, meaning that it can retain a high security level even in case of nonce misuse or in the presence of quantum adversaries, as argued in  [15]. The performance penalty of such guarantees can be minimized using vectorization and/or parallelism.

Fig. 1. The Alzette instance \(A_c\).

Outline. The design process that we used to construct Alzette is explained in Sect. 2. In particular, we show that it offers resilience against a large variety of attacks. This analysis is confirmed experimentally in Sect. 3. We also discuss the efficiency of Alzette in Sect. 4. The discussion on the usage of Alzette as a building block, together with the specification of our (tweakable) block ciphers is given in Sect. 5.

Notation. By \(\mathbb {F}_{2}\), we denote the finite field with two elements and by \(\mathbb {F}_{2}^n\) the set of bitstrings of length n. We denote the set \(\{0,1,\dots ,n-1\}\) by \(\mathbb {Z}_n\). We use \(+\) to denote the addition modulo \(2^{32}\) and \(\oplus \) to denote the XOR of two bitstrings of the same size. The symbol & denotes the bit-wise AND. Further, by \(x \ggg r\), we denote the cyclic rotation of the 32-bit word x to the right by the offset r.

Let E be a key-alternating block cipher with r rounds and round function R. In a differential attack  [12] against \(E_k\), an attacker exploits differences \(\delta \) and \(\varDelta \) such that the probability that \(E_k(x \oplus \delta ) \oplus E_k(x) = \varDelta \) is significantly higher than \(2^{-n}\) (for an n-bit block cipher). For typical values of n (64, 128 or 256) the exact computation of this probability is infeasible. Instead, the common practice is to approximate this quantity by the maximum probability of a differential trail/characteristic averaged over all round keys. A differential trail is a sequence of differences \(\{\delta _0, \delta _1,...,\delta _r\}\) that specifies not only the input and output differences to the block cipher, but also the intermediate differences between the rounds such that \(R(\delta _i \oplus x) \oplus R(x) = \delta _{i+1}\). The approximated probability (averaged over all round keys) is derived as the product of the probabilities of the transitions occurring in each round. The maximum probability (across all trails) computed in this way is denoted Maximum Expected Differential Characteristic Probability (MEDCP). An upper bound on the MEDCP is an approximation of the maximum differential probability and is called a differential bound.

By analogy to the differential case, for linear attacks  [29] the aim is to find masks \(\alpha \) and \(\beta \) such that \(\beta \cdot E_k(x) = \alpha \cdot x + f(k)\), where “\(\cdot \)” denotes the usual scalar product over \(\mathbb {F}_{2}^n\) and where f is a function of the key bits. In practice, we look for a sequence of input, output and intermediate masks \(\{\alpha _0,...,\alpha _r\}\) called a linear trail/characteristic that has high absolute correlation, where \(\alpha _{i+1} \cdot R(x) = \alpha _i \cdot x + f_i(k)\). Analogously to MEDCP and the differential bound, in the linear case we define a Maximum Expected Linear Characteristic Correlation (MELCC) and a linear bound.

2 The Design of Alzette

We now present both the design process and the main properties of Alzette. These are verified experimentally later in Sect. 3, and summarized in Sect. 3.6.

2.1 Block and Word Sizes

Our S-box should be efficient on a wide variety of platforms, while allowing a practical analysis of its relevant cryptographic properties. What would be the best word and block sizes in this context?

Word Size. In Sparx, the S-box operates on 32 bits, which are split into two 16-bit words. This word size allows a computationally cheap analysis of its cryptographic properties while facilitating efficient implementations on 8 and 16-bit micro-controllers. However, 16-bit words hamper performance on 32-bit platforms, simply because only half of their 32-bit registers and datapath can be used. The same holds when 16-bit operations are executed on a 64-bit processor. Furthermore, 16-bit operations can also incur a performance penalty on 8-bit micro-controllers; for example, rotating two 16-bit operands by n bits on an 8-bit AVR device is usually slower than rotating a single 32-bit operand by n bits (see e.g. [17, Appendix A, B, C] for details).

While 16-bit words are suboptimal because they are too small, it can also be argued that 64-bit words are too large. To establish why, we have to separately discuss the performance of 64-bit operations on 8/16/32-bit micro-controllers and on 64-bit processors. We start with three arguments for why 64-bit operations may not be a good choice on small micro-controllers.

  1. 32-bit ARM micro-controllers allow one to perform a rotation “for free” since it can be executed together with another arithmetic/logical instruction. Still, a 32-bit ARM processor can only perform rotations of 32-bit operands for free, but not rotations of 64-bit words.

  2. As discussed later, we will use word-wise modular additions. Some 32-bit architectures, most notably RISC-V and MIPS32, do not have an add-with-carry instruction. Adding two 64-bit operands on these platforms requires first adding the lower 32-bit parts of the operands and then comparing the 32-bit sum with either of the operands to find out whether an overflow happened (i.e. to obtain a carry bit). Then, the two upper 32-bit words are added together with the carry bit. A 64-bit addition thus requires at least four instructions (i.e. four cycles) on these platforms, whereas two 32-bit additions take only two instructions (i.e. two cycles).

  3. Compilers for 8 and 16-bit micro-controllers are notoriously bad at handling 64-bit words, especially rotations of 64-bit words. The reason is simple: outside of cryptography, 64-bit words are of little to no use on an 8- or 16-bit platform, and therefore compiler designers have no incentive to optimize 64-bit operations.
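The second argument can be sketched as follows. The function below emulates, with 32-bit masked arithmetic, the four-step sequence needed to add two 64-bit operands on a platform without an add-with-carry instruction. The function name and the mapping of each line to one machine instruction are illustrative assumptions; actual instruction counts depend on the compiler and the target.

```python
MASK32 = 0xFFFFFFFF

def add64_from_32bit_words(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values given as (low, high) 32-bit words, without a carry flag."""
    lo = (a_lo + b_lo) & MASK32        # step 1: add the low words (wraps modulo 2^32)
    carry = 1 if lo < a_lo else 0      # step 2: unsigned compare recovers the carry bit
    hi = (a_hi + b_hi) & MASK32        # step 3: add the high words
    hi = (hi + carry) & MASK32         # step 4: add the carry
    return lo, hi

# Two independent 32-bit additions, by contrast, need one step each.
lo, hi = add64_from_32bit_words(0xFFFFFFFF, 0x00000001, 0x00000001, 0x00000000)
print(hex(hi), hex(lo))
```

The comparison in step 2 works because an unsigned 32-bit sum wraps around exactly when the result is smaller than either operand.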

A word size of 64 bits is naturally a good choice for 64-bit processors. For example, the authors of  [21] established that SHA512 (which operates on 64-bit words) reaches much higher throughput on 64-bit Intel processors than SHA256 (operating on 32-bit words). However, this does not necessarily imply that ARX designs using 32-bit words are inferior to 64-bit variants on 64-bit processors. The reason is that the best way to implement an ARX cipher on a 64-bit Intel or a 64-bit ARM processor is to use the vector (SIMD) extensions they provide, e.g. Intel SSE, AVX or ARM NEON. Most high-end 64-bit processors have such vector instruction sets, and all of them can execute additions, rotations and XORs on 32-bit words. The fact that a 32-bit word size allows peak performance on 64-bit processors was already used for instance by the designers of Gimli  [11].

As a consequence, we chose to design an S-box that operates on 32-bit words as those offer the best performance across the board.

Block Size. Our S-box could a priori operate on any block size that is a multiple of 32. However, two criteria significantly narrow down the design space.

First, we need to be able to investigate the cryptographic properties of our S-box. We are not aware of any efficient combination of simple operations (AND, addition, rotation, XOR, etc.) on a single word that would allow us to give strong bounds on the differential and linear probabilities. On the other hand, computational techniques that find such bounds tend to be less efficient if the state size is large, as it implies a greater number of potential branches to explore in a tree. Our ability to find bounds thus imposes a number of words which is at least equal to 2 and as small as possible.

Second, in order to use vector instruction sets to their fullest extent, it is better to have a larger number of S-boxes that can be applied in parallel in each call to the round function. On smaller micro-controllers, limiting the block size makes it easier for implementers to keep one full S-box state (or maybe even several full S-box states) in the register file, thereby reducing the number of memory accesses. Finally, in order to build primitives with a small state size, it is necessary that the S-box size is at most equal to said state size. However, as mentioned before, it makes sense to aim for the smallest possible number of branches (and, consequently, a large number of S-boxes) to leverage SIMD-style parallelism.

Because of these requirements, we settled for the use of two words. Given that our discussion above set a 32-bit word size, our S-box operates on 64 bits.

2.2 Round Structure and Number of Rounds

We decided to build an ARX-box out of the operations XOR of rotation and ADD of rotation, i.e., \(x \oplus (y \ggg s)\) and \(x + (y \ggg r)\), because they can be executed in a single clock cycle on ARM processors and thus provide extremely good diffusion per cycle. As the ARX-boxes could be implemented with their rounds unrolled, we allowed the use of different rotations in every round. We observed that one can obtain much better resistance against differential and linear attacks in this case compared to having identical rounds.

In particular, we aimed for designing an ARX-box consisting of the composition of t rounds of the form

$$\begin{aligned} T_i : {\left\{ \begin{array}{ll} \mathbb {F}_{2}^{32} \times \mathbb {F}_{2}^{32} &{}\rightarrow \mathbb {F}_{2}^{32} \times \mathbb {F}_{2}^{32} \\ (x,y) &{}\mapsto \left( x + (y \ggg r_i),~ y \oplus \big (\left( x + (y \ggg r_i)\right) \ggg s_i \big ) \right) \oplus (\gamma ^L_i, \gamma ^R_i) ~, \end{array}\right. } \end{aligned}$$

where the i-th round is defined by the rotation amounts \((r_i,s_i) \in \mathbb {Z}_{32} \times \mathbb {Z}_{32}\) and the round constant \((\gamma _i^L,\gamma _i^R) \in \mathbb {F}_{2}^{32} \times \mathbb {F}_{2}^{32}\). It is computed in three steps: \(x \leftarrow x + (y \ggg r_i)\), \(y \leftarrow y \oplus (x \ggg s_i)\), and finally \((x, y) \leftarrow (x \oplus \gamma ^L_i, y \oplus \gamma ^R_i)\).
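The three steps of one round can be sketched as below. This is a minimal illustration, assuming that the rotation is the right rotation \(\ggg \) defined in the Notation paragraph; the names `rotr32` and `arx_round` are hypothetical helpers introduced here and do not come from the reference implementation.

```python
MASK32 = 0xFFFFFFFF

def rotr32(x, r):
    """Cyclic rotation of the 32-bit word x to the right by r positions."""
    r %= 32
    return ((x >> r) | (x << (32 - r))) & MASK32

def arx_round(x, y, r, s, gamma_l, gamma_r):
    """One round T_i with rotation amounts (r, s) and constants (gamma_l, gamma_r)."""
    x = (x + rotr32(y, r)) & MASK32    # step 1: x <- x + (y >>> r)  (mod 2^32)
    y ^= rotr32(x, s)                  # step 2: y <- y xor (x >>> s), using the updated x
    return x ^ gamma_l, y ^ gamma_r    # step 3: constant addition on both branches

print([hex(w) for w in arx_round(0, 1, 1, 0, 0, 0)])
```

Note that step 2 uses the already updated value of x, which is why the formula for \(T_i\) repeats the term \(x + (y \ggg r_i)\) in the second coordinate.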

In our final design, we decided to use \(t=4\) rounds. The reason is that, when it comes to designing primitives, for r-round ARX-boxes, usable bounds from the long-trail strategy can be obtained from the 2r-round bounds of the ARX structure by concatenating two ARX-boxes. The complexity of deriving upper bounds on the differential trail probability or absolute linear trail correlation depends on the number of rounds considered. For 8 rounds, i.e., 2 times a 4-round ARX-box, it is feasible to compute strong bounds in reasonable time (i.e., several days up to a few weeks on a single CPU). For 3-round ARX-boxes, the 6-round bounds of the best ARX-boxes we found do not seem strong enough to build a secure cipher with a small number of iterations. Since we cannot arbitrarily reduce the number of round (step) iterations in a cryptographic function because of structural attacks, using ARX-boxes with more than four rounds would lead to worse efficiency overall. In other words, we think that four-round ARX-boxes provide the best balance between the number of ARX-box layers needed and rounds per ARX-box in order to build a secure primitive.

2.3 Criteria for Choosing the Rotation Amounts

We aimed to choose the rotations \((r_i,s_i)\) in Alzette in a way that maximizes security and efficiency. For efficiency reasons, we want to minimize the cost of the rotations, where we use the cost metric as given in Table 1. While each rotation has the same cost on 32-bit ARM processors (i.e., 0 because rotation is for free on top of XOR, resp. ADD), we further aimed to minimize the cost with regard to 8-bit and 16-bit architectures. Therefore, we restricted ourselves to rotations from the set \(\{0,1,7,8,9,15,16,17,23,24,25,31\}\), as those are the most efficient when implemented on 8 and 16-bit micro-controllers. We define the cost of a collection of rotation amounts (that is needed to define all the rounds of an ARX-box) as the sum of the costs of its contained rotations.

Table 1. For each rotation in \(\{0,1,7,8,9,15,16,17,23,24,25,31\}\), the table shows an estimation of the number of clock cycles needed to implement the rotation on top of XOR, resp. ADD. We associate the mean of those values for the three platforms to be the cost of a rotation.

For security reasons, we aim to minimize the provable upper bound on the expected differential trail probability (resp. expected absolute linear trail correlation) of a differential (resp. linear) trail. More precisely, our target was to obtain strong bounds, preferably at least as good as those of the round structure of the 64-bit block cipher Speck, i.e., an 8-round differential bound of \(2^{-29}\) and an 8-round linear bound of \(2^{-17}\). If possible, we aimed for improving upon those bounds. Note that for \(r >4\), the term r-round bound refers to the differential (resp. linear) bound for r rounds of an iterated ARX-box. As explained above, at the same time we aimed for choosing an ARX-box with a low cost. In order to reduce the search space, we relied on the following criteria as a heuristic for selecting the final choice for Alzette:

  • The candidate ARX-box must fulfill the differential bounds (\(-\log _2\)) of 0, 1, 2, 6, and 10 for 1, 2, 3, 4 and 5 rounds respectively, for all four possible offsets. We conjecture that those bounds are optimal for up to 5 rounds.

  • The candidate must fulfill a differential bound of at least 16 for 6 rounds, also for all offsets.

  • The 8-round linear bound (\(-\log _2\)) of the candidate ARX-box should be at least 17.

By the term offset we refer to the round index of the starting round of a differential trail. Note that we are considering all offsets for the differential criteria because the bounds are computed using Matsui’s branch and bound algorithm, which needs to use the \((r-1)\)-round bound of the differential trail with starting round index 1 (second round) in order to compute the r-round bound of the trail.

We tested all rotation sets with a cost below 12 for the above conditions. None of those fulfilled the above criteria. For a cost below 15, we found the ARX-box with the following rotations:

$$ (r_0,r_1,r_2,r_3,s_0,s_1,s_2,s_3) = (31,17,0,24,24,17,31,16)\;. $$

This rotation set fulfills all the criteria. The differential and linear bounds for the respective ARX-box are summarized in Table 2.

Table 2. Differential and linear bounds for several rotation parameters. For each offset, the first line shows the differential bound and the second shows the linear one. The value set in parentheses is the maximum absolute correlation of the linear hull taking clustering into account (see Sect. 3.2). The bounds [14, 20, 27, 28] for SPECK are given for comparison.

2.4 On the Round Constants

The purpose of round constant additions, i.e., the XORs with \(\gamma _i^L,\gamma _i^R\) in the general ARX-box structure, is to ensure some independence between the rounds. They also break additive patterns that could arise on the left branch due to the chain of modular addition it would have without said constant additions. Perhaps even more importantly, they should also ensure that the Alzette instances called in parallel are different from one another to avoid symmetries.

For efficiency reasons, we decided to use the same round constant in every round of the ARX-box, i.e., \(\forall i: \gamma _i^L=c\). As the rounds themselves are different from one another, we do not rely on \(\gamma _i^L\) or \(\gamma _i^R\) to prevent slide-style patterns. Thus, using the same constant in each round is not a problem. Moreover, we chose \(\gamma _i^R=0\) for all i. It is important to note that the experimental verification of the differential probabilities and absolute linear correlations we conducted (see Sects. 3.1 and 3.2 respectively) did not lead to significant differences when changing to a more complex round constant schedule. In other words, even for random choices of all \(\gamma _i^L\) and \(\gamma _i^R\), we did not observe significantly different results that would justify the use of a more complex constant schedule (which would of course lead to worse efficiency in the implementation).

The analysis provided in the next section is dependent on the actual choice of round constants c. We conducted this analysis for the constants of Sparkle:

$$\begin{aligned} \begin{array}{llll} c_0 = \texttt {b7e15162}, & c_1 = \texttt {bf715880}, & c_2 = \texttt {38b4da56}, & c_3 = \texttt {324e7738}, \\ c_4 = \texttt {bb1185eb}, & c_5 = \texttt {4f7c7b57}, & c_6 = \texttt {cfbfa1c8}, & c_7 = \texttt {c2b3293d} ~. \end{array} \end{aligned} \tag{1}$$
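Putting the round structure of Sect. 2.2, the rotation amounts of Sect. 2.3 and the constant schedule \(\gamma _i^L = c\), \(\gamma _i^R = 0\) together gives the following minimal sketch of an Alzette instance \(A_c\). It is written from the textual description only (Algorithm 1 is not reproduced here) and assumes right rotations as in the Notation paragraph; `alzette` and `alzette_inverse` are names chosen for this illustration.

```python
MASK32 = 0xFFFFFFFF
# (r_i, s_i) for rounds i = 0..3, taken from the rotation set of Sect. 2.3.
ROTATIONS = ((31, 24), (17, 17), (0, 31), (24, 16))
# Round constants c_0..c_7 of Sparkle, Eq. (1).
SPARKLE_CONSTANTS = (0xB7E15162, 0xBF715880, 0x38B4DA56, 0x324E7738,
                     0xBB1185EB, 0x4F7C7B57, 0xCFBFA1C8, 0xC2B3293D)

def rotr32(x, r):
    """Cyclic right rotation of a 32-bit word."""
    r %= 32
    return ((x >> r) | (x << (32 - r))) & MASK32

def alzette(x, y, c):
    """Four-round ARX-box A_c on the pair of 32-bit words (x, y)."""
    for r, s in ROTATIONS:
        x = (x + rotr32(y, r)) & MASK32
        y ^= rotr32(x, s)
        x ^= c                          # gamma^L_i = c, gamma^R_i = 0
    return x, y

def alzette_inverse(x, y, c):
    """Inverse of alzette, undoing the rounds in reverse order."""
    for r, s in reversed(ROTATIONS):
        x ^= c
        y ^= rotr32(x, s)
        x = (x - rotr32(y, r)) & MASK32
    return x, y

x, y = alzette(0x01234567, 0x89ABCDEF, SPARKLE_CONSTANTS[0])
print(alzette_inverse(x, y, SPARKLE_CONSTANTS[0]) == (0x01234567, 0x89ABCDEF))
```

The existence of the explicit inverse reflects that each \(A_c\) is a permutation of \(\mathbb {F}_{2}^{32} \times \mathbb {F}_{2}^{32}\).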

3 Analysis of Alzette

In this section, we study cryptographic properties of the ARX-box Alzette. The analysis is done for the round constants used in Sparkle, except for analysis of differential/linear characteristic bounds and division property propagation, which are independent of the choice of the constants. All described methods can easily be applied to arbitrary choices of constants.

3.1 On the Differential Properties

Bounding the Maximum Expected Differential Trail Probability. We adapted Algorithm 1 of  [14] to our round structure to compute the bounds on the maximum expected differential trail probabilities of the ARX-boxes which use the constants given in Eq. (1). The algorithm is basically a refined variant of Matsui’s well-known branch and bound algorithm  [30]. While the latter was originally proposed for ciphers that have S-boxes (in particular the DES), the former is targeted at ARX-based designs that use modular addition, rather than an S-box, as a source of non-linearity.

Algorithm 1 [14] exploits the differential properties of modular addition to efficiently search for characteristics in a bitwise manner. Upon termination, it outputs a trail (characteristic) with the maximum expected differential trail probability (MEDCP). For Alzette, we obtain such trails for up to six rounds, where the 6-round bound is \(2^{-18}\). We further collected all trails corresponding to the maximum expected differential probability for 4 and 5 rounds and experimentally checked the actual probabilities of the differentials (for the constants used in Sparkle), see below.

Note that for 7 and 8 rounds, we could not get a tight bound due to the high complexity of the search. In other words, the algorithm did not terminate in reasonable time. However, the algorithm exhaustively searched the range up to \(-\log _2(p)=24\) and \(-\log _2(p)=32\) for 7 and 8 rounds respectively, which proves that there are no valid differential trails with an expected differential trail probability larger than \(2^{-24}\) and \(2^{-32}\), respectively. We evaluated similar bounds for up to 12 rounds.

Experiments on the Fixed-Key Differential Probabilities. As in virtually all block cipher designs, the security arguments against differential attacks are only average results when averaging over all keys of the primitive. When leveraging such arguments for a cryptographic permutation, i.e., a block cipher with a fixed key, it might be possible in theory that the actual fixed-key maximum differential probability is higher than the expected maximum differential probability. In particular, the variance of the distribution of the maximum fixed-key differential probabilities might be high.

For all of the 8 Alzette instances corresponding to the constants in Eq. (1), we conducted experiments in order to see if the expected maximum differential trail probabilities derived by Matsui’s search are close to the actual differential probabilities of the fixed ARX-boxes. Our results are as follows.

By Matsui’s search we found 7 differential trails for Alzette that correspond to the maximum expected differential trail probability of \(2^{-6}\) for 4 rounds, see Table 3. For any Alzette instance \(A_{c_i}\) and any such trail with input difference \(\alpha \) and output difference \(\beta \), we experimentally computed the actual differential probability of the differential \(\alpha \rightarrow \beta \) by

$$\begin{aligned} \frac{|\{x \in S | A_{c_i}(x) \oplus A_{c_i}(x \oplus \alpha ) = \beta \}|}{|S|}\;, \end{aligned}$$

where S is a set of \(2^{24}\) inputs sampled uniformly at random. Our results show that the expected differential trail probabilities approximate the actual differential probabilities very well, i.e., all of the probabilities computed experimentally are in the range \([2^{-6}-10^{-4},2^{-6}+10^{-4}]\) for a sample size of \(2^{24}\).
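The sampling experiment can be sketched as follows. This is a minimal illustration and not the authors' experimental code: `alzette64` restates the ARX-box on a single 64-bit integer (placing x in the upper half is an arbitrary convention assumed here), inputs are drawn with replacement, and the differences of Table 3 are not reproduced, so the usage line only runs the trivial differential \(0 \rightarrow 0\) as a sanity check.

```python
import random

MASK32 = 0xFFFFFFFF
ROTATIONS = ((31, 24), (17, 17), (0, 31), (24, 16))  # (r_i, s_i), Sect. 2.3

def rotr32(x, r):
    r %= 32
    return ((x >> r) | (x << (32 - r))) & MASK32

def alzette64(state, c):
    """Alzette instance A_c on a 64-bit integer (x in the upper half: assumption)."""
    x, y = state >> 32, state & MASK32
    for r, s in ROTATIONS:
        x = (x + rotr32(y, r)) & MASK32
        y ^= rotr32(x, s)
        x ^= c
    return (x << 32) | y

def estimate_differential_probability(perm, alpha, beta, samples, rng):
    """Fraction of sampled x with perm(x) xor perm(x xor alpha) == beta."""
    hits = 0
    for _ in range(samples):
        x = rng.getrandbits(64)
        if perm(x) ^ perm(x ^ alpha) == beta:
            hits += 1
    return hits / samples

rng = random.Random(2020)                      # fixed seed, only for reproducibility
a_c0 = lambda v: alzette64(v, 0xB7E15162)      # instance with constant c_0
# Sanity check only: the trivial differential 0 -> 0 always holds.
print(estimate_differential_probability(a_c0, 0, 0, 1 << 10, rng))
```

To reproduce an experiment of the kind described above, one would pass a pair \((\alpha , \beta )\) from Table 3 and a sample count of \(2^{24}\); in pure Python this is slow, so a compiled implementation is the more practical choice.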

For 5 rounds, i.e., one full Alzette instance and one additional first round of Alzette, there is only one trail with maximum expected differential trail probability \(p = 2^{-10}\). In the case of Sparkle, for all combinations of round constants that can occur in 5 rounds (one Alzette instance plus one round) that do not go into the addition of a step counter, i.e., corresponding to the twelve compositions

$$ \begin{array}{cccccc} A_{c_2} \circ A_{c_0} &{} A_{c_3} \circ A_{c_1} &{} A_{c_3} \circ A_{c_0} &{} A_{c_4} \circ A_{c_1} &{} A_{c_5} \circ A_{c_2} &{} A_{c_4} \circ A_{c_0} \\ A_{c_5} \circ A_{c_1} &{} A_{c_6} \circ A_{c_2} &{} A_{c_7} \circ A_{c_3} &{} A_{c_2} \circ A_{c_3} &{} A_{c_3} \circ A_{c_4} &{} A_{c_2} \circ A_{c_7}, \end{array} $$

we checked whether the actual differential probabilities are close to the maximum expected differential trail probability. We found that all of the so computed probabilities are in the range \([2^{-10}-10^{-5},2^{-10}+10^{-5}]\) for a sample size of \(2^{28}\).

Table 3. The input and output differences \(\alpha , \beta \) (in hex) of all differential trails over Alzette corresponding to maximum expected differential trail probability \(p = 2^{-6}\) and \(p=2^{-10}\) for four and five rounds, respectively.

3.2 On the Linear Properties

Bounding Maximum Expected Absolute Linear Trail Correlation. We used the Mixed-Integer Linear Programming approach described in  [20] and the Boolean satisfiability problem (SAT) approach in  [27] in order to get bounds on the maximum expected absolute linear trail correlation. It was feasible to get tight bounds even for 8 rounds, where the 8-round bound of our final choice for Alzette is \(2^{-17}\). We were able to collect all linear trails that correspond to the maximum expected absolute linear trail correlation for 4 up to 8 rounds and experimentally checked the actual correlations of the corresponding linear approximations for the Alzette instances using the constants in Eq. (1), see below.

Experiments on the Fixed-Key Linear Correlations. Similarly as for the case of differentials, for all of the 8 Alzette instances used in Sparkle, we conducted experiments in order to see whether the maximum expected absolute linear trail correlations derived by MILP and presented in Table 2 are close to the actual absolute correlations of the linear approximations over the fixed Alzette instances. Details of our results are presented in the full version, but they can be summarized as follows.

For a full Alzette instance, there are 4 trails with a maximum expected absolute trail correlation of \(2^{-2}\). For all of the eight Alzette instances, the actual absolute correlations are very close to the theoretical values and we did not observe any clustering. For more than four rounds (i.e., one full instance plus additional rounds), we again checked all combinations of ARX-boxes that do not get a step counter in Sparkle. For five rounds, there are 16 trails with a maximum expected absolute trail correlation of \(2^{-5}\). In our experiments, we can observe a slight clustering. In fact, we chose the round constants \(c_i\) of Sparkle such that, for all combinations of Alzette that occur over the linear layer, the linear hull effect is in our favor, i.e., the actual correlation tends to be lower than the theoretical value.

This tendency also holds for the correlations over six rounds. There are 48 trails with a maximum expected absolute linear trail correlation of \(2^{-8}\).

For seven rounds, there are 2992 trails with a maximum expected absolute linear trail correlation of \(2^{-13}\). Over all the twelve combinations that do not add a step counter in Sparkle and all of the 2992 approximations, the maximum absolute correlation we observed was \(2^{-11.64}\) using a sample size of \(2^{32}\) plaintexts chosen uniformly at random.

For eight rounds, there are 3892 trails with a maximum expected absolute linear trail correlation of \(2^{-17}\). Over all the twelve combinations that do not add a step counter and all of the 3892 approximations, the maximum absolute correlation we observed was \(2^{-15.79}\) using a sample size of \(2^{40}\) plaintexts chosen uniformly at random.

Overall, our correlation estimates based on linear trails seem to closely approximate the actual absolute correlations since our estimate is only \(2^{1.21}\) times lower than the actual absolute correlation.

3.3 On the Algebraic Properties

Integral cryptanalysis exploits low algebraic degree or a more fine-grained algebraic degeneracy of the cryptographic primitive under attack. An integral distinguisher defines an input set X such that the analyzed function sums to zero over this set (at least in some bits) for any value of the secret key involved. In the case of a keyless permutation, such as an ARX-box, such distinguishers are trivial to find and are meaningless. However, an analysis of the growth of the algebraic degree (and the evolution of the algebraic structure in general) provides useful information about the permutation. When the permutation is plugged into, for example, a block cipher, this information directly translates into information about integral distinguishers.

Division property is a technique introduced by Todo  [37] to find integral characteristics. Later, Xiang et al.   [41] discovered that the bit-based division property propagation can be efficiently encoded as a mixed-integer linear programming (MILP) instance and, surprisingly, can be solved in practice using modern optimization software (Gurobi Optimizer  [22]) for practically all known block ciphers. Sun et al.   [36] described a way to encode the modular addition operation using MILP inequalities, extending the framework to ARX-based primitives.

We briefly recall the MILP-aided bit-based division property framework.

Definition 1 (Block-Based Division Property)

Let n be an integer and let X be a set of n-bit vectors. Let k be an integer, \(0 \le k \le n\). The set X satisfies division property \(\mathcal {D}^{n}_{k}\) if and only if for all \(u \in \mathbb {F}_{2}^n\) with \(wt(u) < k\), we have \(\bigoplus _{x \in X} x^u = 0\), where \(x^u\) is a shorthand for \(x_0^{u_0}\ldots x_{n-1}^{u_{n-1}}\).

Definition 2 (Bit-Based Division Property)

Let n be an integer and let X, K be two sets of n-bit vectors, \(0 \notin K\). The set X satisfies division property \(\mathcal {D}^{}_{K}\) if and only if for all \(u \in \mathbb {F}_{2}^n\) such that \(u \prec k\) for all \(k \in K\)

$$ \bigoplus _{x \in X} x^u = 0\;, $$

where \(u \prec k\) if and only if \(u \ne k\) and \(u_i \le k_i\) for all \(i, 0 \le i < n\).
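Definition 1 can be checked by brute force on small examples. The following is a minimal sketch, not part of the MILP framework; the helper names `monomial` and `has_block_division_property` are introduced here purely for illustration, and vectors are represented as tuples of bits.

```python
from itertools import product

def monomial(x, u):
    """Evaluate x^u = prod_i x_i^{u_i} over F_2 for bit tuples x and u."""
    # The product is 1 exactly when x_i = 1 wherever u_i = 1.
    return int(all(xi >= ui for xi, ui in zip(x, u)))

def has_block_division_property(X, n, k):
    """True if the XOR of x^u over X vanishes for every u with wt(u) < k."""
    for u in product((0, 1), repeat=n):
        if sum(u) < k:
            parity = 0
            for x in X:
                parity ^= monomial(x, u)
            if parity:
                return False
    return True

n = 4
full_cube = list(product((0, 1), repeat=n))   # all n-bit vectors
single = [(1, 1, 1, 1)]                       # a set with one element

print(has_block_division_property(full_cube, n, n))  # the full cube satisfies D^n_n
print(has_block_division_property(single, n, 1))     # fails already for u = 0
```

The exhaustive loop over all u is only practical for toy sizes; this is precisely why the propagation is encoded as an MILP or SMT instance for 64-bit states.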

For further information on division property propagation and its encoding using MILP inequalities, we refer to [41]. However, we describe briefly a new technique for encoding division property propagation through the modular addition. Our technique is simpler and more compact than the one proposed by Sun et al. [36].

Addition Modulo \(2^{32}\). The method by Sun et al. is based on expressing the modular addition as a Boolean circuit and applying the standard known encoding for XOR and AND operations. As a result, for each bit of a word at least 12 bit operations are produced. We propose a new simple method which requires only 2 inequalities per bit.

Our key idea is to compute the carry bits and the output bits in pairs using a \(3\times 2\) bit look-up table. The division property propagation through this look-up table can be encoded using only 2 inequalities.

Consider an addition of two n-bit words \(a,b \in \mathbb {F}_{2}^n\) and let \(y = a \boxplus b \mod 2^n\) (recall that \(a_0\) denotes the most significant bit of a, \(a_{n-1}\) denotes the least significant bit of a, etc.). Define carry bits \(c_i, 0 \le i < n\) as follows: \(c_{n-1} = 0\) and \(c_{i} = \mathrm {Maj}(a_{i+1}, b_{i+1}, c_{i+1})\) for \(-1 \le i < n - 1\), where \(\mathrm {Maj}\) is the 3-bit majority function. Then it is easy to verify that \(y_i = a_i \oplus b_i \oplus c_i\) for all \(0 \le i < n\). Full modular addition can be computed sequentially from \(i=n-1\) to \(i=0\). Let \(f: \mathbb {F}_{2}^3 \rightarrow \mathbb {F}_{2}^2\) be such that \(f(a,b,c) = (\mathrm {Maj}(a,b,c),a\oplus b\oplus c)\), then we can write

$$ (c_{i-1}, y_{i}) = f(a_{i}, b_{i}, c_{i}), $$

for all \(0 \le i < n\). The lookup table of f is given in Table 4. Note that no bits are copied in the sequential computation process. It follows that the division property propagation can be encoded directly by encoding n sequential applications of f (using the S-Box encoding methods by Xiang et al.   [41]). Finally, an additional constraint is needed to ensure that the resulting division property is not active in the bit \(c_{-1}\).
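The sequential computation can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and integers are processed from the least significant bit upwards, so shift 0 corresponds to index \(n-1\) in the bit numbering used above.

```python
def f(a, b, c):
    """The 3-to-2 bit table: (majority, parity) of the three input bits."""
    return ((a & b) | (a & c) | (b & c), a ^ b ^ c)

def add_via_f(a, b, n):
    """Compute (a + b) mod 2^n using n sequential applications of f."""
    carry = 0                      # c_{n-1} = 0
    y = 0
    for shift in range(n):         # shift 0 is the least significant bit
        carry, bit = f((a >> shift) & 1, (b >> shift) & 1, carry)
        y |= bit << shift
    return y                       # the final carry c_{-1} is discarded

# Exhaustive comparison with ordinary modular addition for a small word size.
n = 5
all_match = all(add_via_f(a, b, n) == (a + b) % (1 << n)
                for a in range(1 << n) for b in range(1 << n))
print(all_match)
```

Each input bit and each carry is consumed by exactly one application of f, which matches the observation that no bits are copied.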

The division property propagation table is given in Table 5. This table can be characterized by the two following integer inequalities:

$$ {\left\{ \begin{array}{ll} -a -b -c + 2c' + y &{}\ge 0, \\ a + b + c -2c' -2y &{}\ge -1, \end{array}\right. } $$

where \(a,b,c \in \mathbb {Z}_2\) correspond to the values of the input division property and \(c', y \in \mathbb {Z}_2\) correspond to the values of the output division property. In our experiments, these two inequalities applied for each bit position generate precisely the correct division property propagation table of the addition modulo \(2^n\) for n up to 7. There are a few redundant transitions, but they do not affect the result.
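To see which transitions the two inequalities admit, one can simply enumerate all 32 binary assignments. The sketch below does this in plain Python (no MILP solver is involved); the variable `c_out` stands for \(c'\), and the listing shows only what the inequalities allow, since Table 5 itself is not reproduced here.

```python
from itertools import product

def satisfies(a, b, c, c_out, y):
    """Both integer inequalities characterizing the propagation through f."""
    return (-a - b - c + 2 * c_out + y >= 0) and (a + b + c - 2 * c_out - 2 * y >= -1)

# Map each input division property (a, b, c) to the admitted outputs (c', y).
transitions = {
    (a, b, c): [(c_out, y)
                for c_out, y in product((0, 1), repeat=2)
                if satisfies(a, b, c, c_out, y)]
    for a, b, c in product((0, 1), repeat=3)
}

for u, outputs in sorted(transitions.items()):
    print(u, "->", outputs)
```

The admitted outputs depend only on the Hamming weight of (a, b, c): weight 0 gives (0, 0), weight 1 gives (0, 1) or (1, 0), weight 2 gives (1, 0), and weight 3 gives (1, 1).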

Table 4. Look-up table of f.
Table 5. Division property propagation table of f.

SMT solvers are an alternative to MILP solvers for division property analysis. To facilitate this alternative method, we characterize the division property propagation table of f by four Boolean propositions (obtained by enumerating all possible outputs and constraining the respective inputs):

$$ \left\{ \begin{array}{lcll} c' \wedge y &{} ~~\Rightarrow ~~ &{} a \wedge b \wedge c , &{} ~~~~~~\triangleright ~a=b=c = 1 \\ \lnot c' \wedge \lnot y &{} ~~\Rightarrow ~~ &{} \lnot a \wedge \lnot b \wedge \lnot c , &{} ~~~~~~\triangleright ~a=b=c = 0\\ \lnot c' \wedge y &{} ~~\Rightarrow ~~ &{} (a \oplus b \oplus c) \wedge (\lnot a \vee \lnot b) , &{} ~~~~~~\triangleright ~a+b+c = 1\\ c' \wedge \lnot y &{} ~~\Rightarrow ~~ &{} (a \vee b \vee c) \wedge (\lnot a \vee \lnot b \vee \lnot c) . &{} ~~~~~~\triangleright ~1 \le a+b+c \le 2 \end{array} \right. $$

We used this representation together with the Boolector SMT-solver  [32] (version 3.1.0) to verify our results.
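As a sanity check that is independent of any solver, the four propositions can be compared with the two integer inequalities given earlier over all 32 assignments. The sketch below uses plain Python enumeration rather than Boolector; the function names are illustrative.

```python
from itertools import product

def by_inequalities(a, b, c, c_out, y):
    """The two integer inequalities of the MILP encoding."""
    return (-a - b - c + 2 * c_out + y >= 0) and (a + b + c - 2 * c_out - 2 * y >= -1)

def implies(p, q):
    return (not p) or q

def by_propositions(a, b, c, c_out, y):
    """The four Boolean propositions of the SMT encoding."""
    a, b, c, c_out, y = map(bool, (a, b, c, c_out, y))
    return (
        implies(c_out and y, a and b and c)
        and implies(not c_out and not y, not a and not b and not c)
        and implies(not c_out and y, (a ^ b ^ c) and (not a or not b))
        and implies(c_out and not y, (a or b or c) and (not a or not b or not c))
    )

encodings_agree = all(
    by_inequalities(*v) == by_propositions(*v)
    for v in product((0, 1), repeat=5)
)
print(encodings_agree)
```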

Finally, we note that subtraction modulo \(2^n\), used in the inverse of Alzette, is equivalent to the addition with respect to the division property propagation in our method. Indeed, let \(f': \mathbb {F}_{2}^{3} \rightarrow \mathbb {F}_{2}^{2}\),

$$ f'(a,b,c) = (c',y) = ([a - b - c < 0], a \oplus b \oplus c), $$

where the first coordinate of \(f'\) computes the subtraction carry bit. It is in fact equivalent to the first coordinate of f (the majority function) up to xor with constants:

$$\begin{aligned}{}[a - b - c< 0]&= [a + (1 - b) + (1 - c) < 2] \\&= 1 - [a + (1 - b) + (1 - c) \ge 2] = 1 \oplus f_0(a, 1\oplus b, 1\oplus c). \end{aligned}$$

We conclude that \(f'\) has the same division property propagation table as f and thus division property propagation using our method is the same for modular addition and subtraction.
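The identity between the subtraction carry and the complemented majority function is easy to confirm exhaustively. A minimal check, with illustrative function names:

```python
from itertools import product

def maj(a, b, c):
    """3-bit majority function, the first coordinate f_0 of f."""
    return (a & b) | (a & c) | (b & c)

def borrow(a, b, c):
    """Subtraction carry bit [a - b - c < 0]."""
    return int(a - b - c < 0)

identity_holds = all(
    borrow(a, b, c) == 1 ^ maj(a, 1 ^ b, 1 ^ c)
    for a, b, c in product((0, 1), repeat=3)
)
print(identity_holds)
```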

Division Property Propagation in Alzette. First, we evaluated the general algebraic degree of the ARX-box structure based on the division property. The \(5^{th}\) and \(6^{th}\) rounds rotation constants were chosen as the \(1^{st}\) and \(2^{nd}\) rounds rotation constants respectively, as this is what happens when two Alzette instances are chained. The inverse ARX-box structure starts with \(4^{th}\) round rotation constants, then \(3^{rd}\), \(2^{nd}\), \(1^{st}\), \(4^{th}\), etc. The minimum and maximum degree among coordinates of the ARX-box structure and its inverse are given in Table 6. Even though these are just upper bounds, we expect that they are close to the actual values, as the division property was shown to be rather precise  [39]. Thus, the Alzette structure may have full degree in all its coordinates, but the inverse of an Alzette instance has a coordinate of degree 46.

Table 6. The upper bounds on the minimum and maximum degree of the coordinates of Alzette and its inverse.

The block-based division property of Alzette is such that, for any \(1 \le k \le 62\), \(\mathcal {D}^{64}_{k}\) maps to \(\mathcal {D}^{64}_{1}\) after two rounds, and \(\mathcal {D}^{64}_{63}\) maps to \(\mathcal {D}^{64}_{2}\) after two rounds and to \(\mathcal {D}^{64}_{1}\) after three rounds. The same holds for the inverse of an Alzette instance.

The longest integral characteristic found with bit-based division property is for the 6-round ARX-box, where the input has 63 active bits and the inactive bit is at the index 44 (i.e., there are 44 active bits from the left and 19 active bits from the right), and in the output 16 bits are balanced:

$$\begin{aligned}&\text {input active bits:}\\&{\texttt {11111111111111111111111111111111,11111111111101111111111111111111}},\\&\text {balanced bits after 6-round ARX-box (denoted by }{} \texttt {B}{\text{) }:}\\&{\texttt {????????????????????????BBBBBBBB,?????????BBBBBBBB???????????????}}. \end{aligned}$$

The inactive bit can also be moved to index 45, 46, 47, or 48; the balanced property after 6 rounds stays the same. For the 7-round ARX-box we did not find any integral distinguishers.

For the inverse ARX-box, the longest integral characteristic is for 5 rounds:

$$\begin{aligned}&\text {input active bits:}\\&{\texttt {11111111111111111111111111101111,11111111111111111111111111111111}},\\&\text {balanced bits after 5-round ARX-box inverse:}\\&{\texttt {???????????????????????????????B,???????BBBBBBBBB????????????????}}. \end{aligned}$$

For the 6-round ARX-box inverse we did not find any integral characteristic.

As a conclusion, even though a single Alzette instance has integral characteristics, for two chained Alzette instances there are no integral characteristics that can be found using the state-of-the-art division property method.

Experimental Algebraic Degree Lower Bound. The modular addition is the only non-linear operation in Alzette. Its algebraic degree is 31 and thus, in each 4-round Alzette instance, there must exist some output bits of algebraic degree at least 32.

We experimentally checked that, for each instance \(A_{c_i}\) with \(c_i\) as in Eq. (1), the algebraic degree of each output bit is at least 32. In particular, for each output bit we found a monomial of degree 32 that occurs in its ANF. Note that for checking whether the monomial \(\prod _{j=0}^{m-1}x_{i_j}\) occurs in the ANF of a Boolean function f one has to evaluate f on \(2^m\) inputs.
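The monomial test rests on the standard fact that the ANF coefficient of a monomial equals the XOR of f over all inputs supported on the variables of that monomial, with every other variable set to 0. The following sketch illustrates this on a toy 3-variable function; `anf_coefficient` and `toy` are hypothetical names, and inputs are packed into integers with variable \(x_i\) at bit position i.

```python
def anf_coefficient(f, monomial_vars):
    """Coefficient of prod_j x_{i_j} in the ANF of f, using 2^m evaluations."""
    m = len(monomial_vars)
    coeff = 0
    for assignment in range(1 << m):
        x = 0
        for j, var in enumerate(monomial_vars):
            if (assignment >> j) & 1:
                x |= 1 << var          # set variable x_var, all others stay 0
        coeff ^= f(x) & 1
    return coeff

def toy(x):
    """Toy function with ANF x0*x1 + x2."""
    x0, x1, x2 = x & 1, (x >> 1) & 1, (x >> 2) & 1
    return (x0 & x1) ^ x2

print(anf_coefficient(toy, [0, 1]))   # x0*x1 occurs in the ANF
print(anf_coefficient(toy, [0, 2]))   # x0*x2 does not
```

For a degree-32 monomial of an Alzette output bit, the same loop would run over \(2^{32}\) inputs, which is feasible but calls for a compiled implementation rather than Python.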

3.4 Invariant Subspaces

Invariant subspace attacks were considered in  [26]. For the round constants used in Sparkle, using a method similar to the “to and fro” method from [13, 33], we searched for an affine subspace that is mapped by an Alzette instance \(A_{c_i}\) to a (possibly different) affine subspace of the same dimension. We could not find any such subspace of nontrivial dimension.

Note that the search is randomized, so it does not result in a proof. As evidence of the correctness of the algorithm, we found many such subspace trails for all 2-round reduced ARX-boxes, with dimensions from 56 up to 63. For example, let A denote the first two rounds of \(A_{c_0}\). Then for all \(l,r,l',r' \in \mathbb {F}_{2}^{32}\) such that \(A(l, r) = (l', r')\), it holds that

$$\begin{aligned}&(l_{29} + r_{21} + r_{30}) (l_{30} + r_{31}) (l_{31} + r_{0}) (r_{22}) (r_{23}) = \\&\qquad \quad (l'_{4} + r'_{21}) (l'_{5} + r'_{22}) (l'_{6} + r'_{23}) (l'_{28} + l'_{30} + l'_{31} + r'_{13} + 1) (l'_{29} + l'_{31} + r'_{14}). \end{aligned}$$

This equation defines a subspace trail of constant dimension 59.

3.5 Nonlinear Invariants

Nonlinear invariant attacks were considered recently in  [38] to attack lightweight primitives. For the round constants used in Sparkle, using linear algebra, we experimentally verified that for any ARX-box \(A_{c_i}\) and any non-constant Boolean function f of degree at most 2, the compositions \(f \circ A_{c_i}\) and \(f \circ A_{c_i}^{-1}\) have degree at least 10:

$$ \forall f:\mathbb {F}_{2}^{64} \rightarrow \mathbb {F}_{2}^{}, 1 \le \deg (f) \le 2, ~~ \deg (f \circ A_{c_i}) \ge 10, \deg (f \circ A_{c_i}^{-1}) \ge 10, $$

and for functions f of degree at most 3, the compositions have degree at least 4:

$$ \forall f:\mathbb {F}_{2}^{64} \rightarrow \mathbb {F}_{2}^{}, 1 \le \deg (f) \le 3, ~~ \deg (f \circ A_{c_i}) \ge 4, \deg (f \circ A_{c_i}^{-1}) \ge 4. $$

In particular, any \(A_{c_i}\) has no cubic invariants. Indeed, a cubic invariant f would imply that \(f \circ A_{c_i} + \varepsilon = f\) is cubic (for a constant \(\varepsilon \in \mathbb {F}_{2}^{}\)). The same holds for the inverse of any ARX-box \(A_{c_i}\).

By using the same method, we also verified that there are no quadratic equations relating inputs and outputs of any \(A_{c_i}\). However, there are quadratic equations relating inputs and outputs of 3-round reduced versions of each \(A_{c_i}\).

3.6 Summary of the Properties of Alzette

Our experimental results validate our theoretical analysis of the properties of Alzette: in practice, the differential and linear trail probabilities (resp., absolute correlations) are as predicted. In the case of differential probabilities, the clustering is minimal. While it is not quite negligible in the linear case, our estimates remain very close to the quantities we measured experimentally.

The diffusion is fast: all output bits depend on all input bits after a single call of Alzette, though the dependency may sometimes be weak. After a double call of Alzette, diffusion is of course complete. More formally, as evidenced by our analysis of the division property, no integral distinguisher exists in this case.

While the two components have utterly different structures, Alzette has similar properties to one round of AES and the double iteration of Alzette to the AES super-S-box (see Table 7). The bounds for the (double) ARX-box come from Table 2. For the AES, the bounds for a single round are derived from the properties of its S-box, so its maximum differential probability is \(4/256 = 2^{-6}\) and its maximum absolute linear correlation is \(2^{-3}\). For two rounds, we raise the quantities of the S-box to the power 5 because the branching number of the MixColumn operation is 5.

Table 7. A comparison of the properties of Alzette with those of the AES with a fixed key. MEDCP denotes the maximum expected differential trail probability and MELCC denotes the maximum expected absolute linear trail correlation.

These experimental verifications were enabled by our use of a key-less structure. For a block cipher, we would need to look at all possible keys to reach the same level of confidence.

4 Implementation Aspects

4.1 Software Implementations

Alzette was designed to provide good security bounds, but also efficient implementation. The rotation amounts have been carefully chosen to be a multiple of eight bits or one bit from it. On 8- or 16-bit architectures, these rotations can be efficiently implemented using move, swap, and 1-bit rotate instructions. On ARM processors, operations of the form z \(\leftarrow \) x <op>(y \(\lll \ell \)) can be executed with a single instruction in a single clock cycle, irrespective of \(\ell \).

Alzette itself operates over two 32-bit words of data, with an extra 32-bit constant value. This allows the full computation to happen in-register in AVR, MSP and ARM architectures, whereby the latter is able to hold at least 4 Alzette instances entirely in registers. This in turn reduces load-store overheads and contributes to the performance of a primitive calling Alzette.
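For orientation, the structure of one Alzette instance can be sketched as follows. This is a non-optimized illustration, not the reference implementation: the rotation amounts in `ROUNDS` are recalled from the Alzette specification and are not stated in this section (they are consistent with the four amounts 16, 17, 24 and 31 mentioned for hardware), so they should be checked against the definition of Alzette earlier in the paper. The constant used in the example is arbitrary.

```python
MASK = 0xFFFFFFFF

def rotr(v, r):
    """Rotate a 32-bit word right by r bits."""
    r %= 32
    return ((v >> r) | (v << (32 - r))) & MASK

# (rotation applied to y in the addition, rotation applied to x in the XOR);
# assumed values, to be verified against the Alzette definition.
ROUNDS = ((31, 24), (17, 17), (0, 31), (24, 16))

def alzette(x, y, c):
    """Four rounds of add-rotate-XOR on two 32-bit words with constant c."""
    for r, s in ROUNDS:
        x = (x + rotr(y, r)) & MASK
        y ^= rotr(x, s)
        x ^= c
    return x, y

def alzette_inverse(x, y, c):
    """Undo alzette by reversing the rounds and replacing addition by subtraction."""
    for r, s in reversed(ROUNDS):
        x ^= c
        y ^= rotr(x, s)
        x = (x - rotr(y, r)) & MASK
    return x, y

c = 0x01234567                        # arbitrary constant for illustration
x, y = alzette(0xDEADBEEF, 0x0BADF00D, c)
print(alzette_inverse(x, y, c) == (0xDEADBEEF, 0x0BADF00D))
```

Each round consists of three word operations, so four rounds give 12 operations in total, all acting on three 32-bit values (the two branches and the constant).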

The consistency of operations allows one either to focus on small code size (by implementing the parallel Alzette instances of a substitution layer in a loop) or, on architectures with more registers, to execute two or more instances at once to exploit instruction pipelining. This consistency of operations also allows some degree of parallelism, namely by using Single Instruction Multiple Data (SIMD) instructions. SIMD is a type of computational model that executes the same operation on multiple operands. Due to the layout of Alzette, an SIMD implementation can be created by packing \(x_0 \dots x_{n_b}\), \(y_0 \dots y_{n_b}\), and \(c_0 \dots c_{n_b}\) each in a vector register. That allows 128-bit SIMD architectures such as NEON to execute four Alzette instances in parallel, or even eight instances when using x86 AVX2 instructions.

Table 8. Execution time (in clock cycles) and code size (in bytes) of Alzette.

Table 8 summarizes the execution time and code size of Alzette on an 8-bit AVR and a 32-bit ARM Cortex-M3 micro-controller. The assembler implementation of Alzette for the latter architecture consists of 12 instructions, which take 12 clock cycles to execute. The actual code size of Alzette may be less than 48 bytes since the Cortex-M3 supports Thumb2, which means some simple instructions can be only 16 bits long. However, whether an instruction is 16 or 32 bits long depends, among other things, on the register allocation. Our ARM implementation assumes that the two 32-bit branches of Alzette and the round constant are already in registers and not in memory, which is a reasonable assumption since the register file of a Cortex-M3 is big enough to accommodate a few instances of Alzette together with a few round constants.

The situation is a bit different for 8-bit AVR. The arithmetic/logical operations of Alzette amount to 78 instructions altogether, each of which executes in a single cycle, i.e. 78 clock cycles in total. Each of the used instructions has a length of 2 bytes, yielding a code size of 156 bytes. However, in contrast to ARM, we cannot take it for granted that the whole state of a cipher fits into the register file of an AVR micro-controller, which means the load and store operations should be considered when evaluating the execution time. Loading a byte from RAM takes 2 cycles, while loading a byte from flash (e.g. for the round constants) requires 3 cycles. Storing a byte in RAM takes also 2 cycles. Consequently, when taking all loads/stores into account (including the loading of a round constant from flash), the execution time increases from 78 to 122 cycles and the code size from 156 to 196 bytes.
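The AVR figures can be reproduced with a short calculation. The breakdown below is one reading of the text rather than a statement about the actual assembly code: it assumes that the two 32-bit branches (8 bytes) are loaded from RAM and stored back, that one 32-bit constant (4 bytes) is loaded from flash, and that every load or store moves one byte with a 2-byte instruction.

```python
ALU_CYCLES = 78          # arithmetic/logical instructions, one cycle each
ALU_CODE_BYTES = 156     # 78 instructions of 2 bytes

STATE_BYTES = 8          # two 32-bit branches (assumed to live in RAM)
CONSTANT_BYTES = 4       # one 32-bit round constant (assumed to live in flash)
RAM_LOAD, FLASH_LOAD, RAM_STORE = 2, 3, 2   # cycles per byte
INSTRUCTION_BYTES = 2    # assumed length of each load/store instruction

memory_cycles = (STATE_BYTES * RAM_LOAD
                 + CONSTANT_BYTES * FLASH_LOAD
                 + STATE_BYTES * RAM_STORE)
memory_instructions = STATE_BYTES + CONSTANT_BYTES + STATE_BYTES

total_cycles = ALU_CYCLES + memory_cycles
total_code_bytes = ALU_CODE_BYTES + memory_instructions * INSTRUCTION_BYTES

print(total_cycles, total_code_bytes)
```

Under these assumptions the memory traffic accounts for 44 cycles and 20 instructions, which matches the increase from 78 to 122 cycles and from 156 to 196 bytes.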

4.2 Hardware Implementations

A hardware implementation can, for example, use a 32-bit ALU that is able to execute the following set of basic arithmetic/logical operations: 32-bit XOR, addition of 32-bit words, and rotations of a 32-bit word by four different amounts, namely 16, 17, 24, and 31 bits. Since there are only four different rotation amounts, the rotations can be simply implemented by a collection of 32 4-to-1 multiplexers. There exist a number of different design approaches for a 32-bit adder; the simplest variant is a conventional Ripple-Carry Adder (RCA) composed of 32 Full Adder (FA) cells. RCAs are very efficient in terms of area requirements, but their delay increases linearly with the bit-length of the adder. Alternatively, if an implementation requires a short critical path, the adder can also take the form of a Carry-Lookahead Adder (CLA) or Carry-Skip Adder (CSA), both of which have a delay that grows logarithmically with the word size. On the other hand, when reaching small silicon area is the main goal, one can “re-use” the adder for performing XOR operations. Namely, an RCA can output the XOR of its two inputs by simply suppressing the propagation of carries, which requires an ensemble of 32 AND gates. In summary, a minimalist ALU consists of 32 FA cells, 32 AND gates (to suppress the carries if needed), and 32 4-to-1 multiplexers (for the rotations). To minimize execution time, it makes sense to combine the addition (resp. XOR) with a rotation into a single operation that can be executed in a single clock cycle.

5 Alzette as a Building Block

Alzette is at the core of two families of lightweight algorithms that are among the second round candidates of the NIST lightweight cryptography standardization process, namely the hash functions Esch and the authenticated ciphers with associated data Schwaemm (submission Sparkle  [8]). In this section, we show that it can also be used to easily construct block ciphers. This approach is flexible: combining Alzette with simple linear layers, we can simply build step functions operating on 64-, 128- and 256-bit blocks. We explain this approach and analyze the security of its result in Sect. 5.1. Specific instances are then given in Sect. 5.2, namely the 64-bit lightweight block cipher Crax, and the 256-bit tweakable block cipher Trax.

5.1 Skeletons for a Family of (Tweakable) Block Ciphers

Our approach relies on the long trail strategy pioneered by the designers of Sparx  [19], and which was then used to build sLiSCP  [3], sLiSCP-light  [4] as well as the NIST lightweight candidates using them (SPIX  [2], SPOC  [1], Sparkle  [8]). Provided that the round function allows its use, this method provides a simple algorithm for bounding the probability of differential and linear trails. To achieve this, we loop over all possible truncated trails, and bound the probability of all differential (resp. linear) trails that conform to the truncated trail using the differential (resp. linear) bounds of the employed S-box, including those for multiple iterations when relevant. In all the algorithms listed above, variants of the Feistel structure have been applied because such round functions lend themselves well to such an analysis.

It is simple to adapt this framework to the design of Alzette-based block ciphers. Furthermore, the structure of a long trail argument allows for an efficient algorithm bounding the probability of related-tweak differentials.Footnote 6 Indeed, in our case, the S-box used is 64 bit wide. Thus, the number of bits needed to describe a truncated differential in a given internal state is very small, only 4 suffice for a block size of 256 bits. Besides, the use of a Feistel structure implies that half of these bits are mere copies of the ones in the previous round. As a consequence, the total number of truncated trails that must be considered is low.

It also implies that the impact of a tweak difference is manageable: if the tweak difference activates a previously inactive S-box then its presence does not increase the number of truncated trails. On the other hand, a possible cancellation merely multiplies the number of possible trails by 2. An algorithm enumerating all related-tweak truncated trails such that the probability of all differential trails that conform to them is below a given threshold, is therefore easy to write and is efficient. In fact, our straight-forward Python implementation returned all the results needed for this paper in a matter of seconds at worst. Large S-boxes such as Alzette are therefore very convenient building blocks to construct tweakable block ciphers with strong security arguments.

Fig. 2. The round functions of Trax-S, Trax-M, and Trax-L, respectively. \(\ell '(z_1,z_2,z_3,z_4)=(z_4,z_3\oplus z_4,z_2,z_1\oplus z_2)\), where \(z_i\) are 16-bit words. The tweak is added only in odd steps.

Below, we present three Alzette-based (tweakable) block cipher structures for which we provide upper bounds on the probability of the best differential trail in both the single-key and the related-tweak setting. Of course, we also investigate other attacks. The “S”, “M” and “L” versions operate on 64, 128 and 256 bits respectively and their round functions are depicted in Fig. 2 (pseudo-code is provided in the full version). Their properties are summarized in Table 9:

  • \(r_e(c)\) rounds are needed to prevent the existence of known single-key distinguishers with a data complexity upper bounded by \(2^c\) in total,

  • \(r_e^T(c)\) rounds are needed to prevent the existence of known related-tweak distinguishers with a data complexity upper bounded by \(2^c\) in total (possibly spread across multiple tweak values), and

  • \(r_d\) rounds are needed in order for all the bits of the state to depend on all the bits of the key.

For example, if the best single-key differential trail with a probability above \(2^{-n}\) covers r rounds, then \(r_e \ge r+1\). It is assumed that n-bit subkeys are used.

It is assumed that there is no tweak schedule, i.e. that the tweak is simply XORed into the same part of the state each time it is added. As discussed below, we found that the security level was higher when this addition occurred every second step. The motivation for this trivial tweak schedule is simple: the tweak is expected to change far more often than the key, so using a trivial tweak schedule will improve the performance of our algorithms.

Of course, we can set the tweak to a constant (e.g. 0) and obtain a tweakless “regular” block cipher.Footnote 7 For the skeleton structures, we do not specify key schedules and leave it to cipher designers to come up with appropriate ones for their use cases. Related-key and related-tweak security will of course depend on the specifics of the key schedule chosen. We present concrete ciphers using these structures in Sect. 5.2 (along with their key schedules). Our best distinguishers against the various versions of our step function are summarized in Table 10. Related-tweak integral cryptanalysis is given in the full version of this paper.

Table 9. The properties of the different (tweakable) round functions.
Table 10. The number of steps \(r_e(i)\) needed for the S, M and L step functions to prevent various distinguishers with a data complexity of at most \(2^i\). “RT” stands for “related-tweak” where the tweak is added in every odd step. As \(r_e(n/2) \le r_e(n)\), we use the latter if the former is not known and use “”. For comparison, we give \(r_e(i)\) for the AES using that it achieves 25 active S-boxes in any non-trivial 4-round (differential or linear) trail and plugging in the bounds for its S-box provided in Table 7.

The S Version. It operates on 64 bits, meaning that it simply consists in iterating Alzette, interleaving it with key additions. The tweak is XORed every second step, as this ensures that at least one double Alzette is active during 4 steps. Thus, 8 steps are sufficient to prevent related-tweak differential distinguishers with a data complexity of \(2^{64}\). If we remove the tweak then we need 4 steps to argue the absence of differential distinguishers.

We start adding the tweak at the beginning of step 1 and not step 0 as it could otherwise trivially be cancelled out with chosen plaintexts.

As we saw in Sect. 3.2, linear distinguishers are in practice less predictable than differential ones. In particular, they exhibit some key-sensitivity that we did not observe in the differential case. As our bound for 4 steps is at the edge of being exploitable (\(2^{-34}\)), a small key-dependent deviation may allow 4-step distinguishers. As a consequence, we consider that 5 steps are needed to prevent linear distinguishers. Note that allowing related tweaks does not give an advantage when looking for linear distinguishers, as established by Kranz et al.  [24].

The security against integral attacks and other attacks that would exploit a slow diffusion (like impossible differential attacks) also follows directly from our analysis of Alzette: our best integral distinguisher relies on the bit-based division property and covers only 6 rounds of Alzette, i.e. 1.5 steps. Extending it backwards, we can obtain at most an 11-round zero-sum distinguisher, i.e. one that covers 2.75 steps. Thus, 3 steps are sufficient to prevent them. Since we have full diffusion in one step, there cannot be an impossible differential found via a miss-in-the-middle that covers 2 steps.

Assuming that the key schedule uses statistically independent key bits in even and odd steps, we need only \(r_d=2\) steps to ensure that all bits depend (although possibly weakly) on all key bits. This result, along with all the distinguishers we investigated for this step function, is summarized in Table 10.

The M Version. In order to operate on 128 bits, we use a simple Feistel round as the linear layer that maps \((x, y)\) to \((y \oplus x, x)\). This structure ensures long trails. To further foster the existence of long trails, we only XOR the tweak on half of the state, namely at the input of the Alzette instance which is always doubled due to the structure of the linear layer.

We have found using our long trail argument implementation that the best frequency for adding a tweak corresponds to an addition every second round (as for the S version). A smaller or larger number of steps between tweak additions would lead to worse differential bounds. As in the S version, we start adding the tweak at the beginning of step 1.

A long trail argument shows that differential and linear distinguishers become infeasible when the number of steps is at least equal to 7. Unlike in the S version, trail clustering is less of a concern here. Indeed, we observed the clustering within one Alzette to be minimal, and unlike in the S version, the linear masks are constrained in each step by the presence of the linear layer. It is not sufficient for the input and output masks of a double Alzette to be identical: in order to leverage clustering, we now need that the mask at the end of the first Alzette call is the same in all trails as well.

In the related-tweak setting, there could exist differential trails covering more than 7 steps with usable probabilities but none covering 11 steps (or more). If we restrict ourselves to attacks with a data complexity at most equal to \(2^{64}\) then no useful related-tweak trail can cover more than 6 steps.

This step function employs a Feistel structure with a bijective Feistel function but the well known 5-round impossible differential identified by Knudsen  [23] cannot be used here. Indeed, our non-linear permutation (Alzette) is applied on both branches in each round, thus breaking the pattern used by this distinguisher. In fact, the best impossible differential we can find only covers 4 steps: the probability of the transition is equal to 0 for any non-zero 64-bit differences \(\delta \) and \(\varDelta \). It needs about \(2^{32}\) chosen plaintexts to be exploited.

Since the key is of the same size as the block, the number of rounds needed for diffusion of the key material is the number of rounds needed for all state bits to impact the whole state. In this case, it is \(r_d=2\).

The L Version. This round function operates on 256 bits using 4 Alzette instances in parallel. The round key is added to the full state. The best frequency for adding the tweak is every second step, for the same reason as for the M version: changing this frequency leads to worse differential bounds in the related-tweak setting.

This round function is similar to the one of Sparkle: a lot of the cryptanalysis performed for this algorithm directly carries over (see  [8]). In particular, the type of attacks for which we need the largest number of steps to prevent the existence of distinguishers is indeed the linear one in the single-tweak setting.

As for the M version, the number of rounds needed for diffusion of the key is the number needed for all state bits to impact the whole state. Here, it is \(r_d=3\).

5.2 Recommended Instances

Choosing the Number of Steps. In order to evaluate the number of steps needed to build a secure cipher, we observed that attacks against block ciphers are usually constructed using a specific distinguisher against a round-reduced version of the algorithm. Then, rounds are added at the top and at the bottom using key guesses. As a consequence, we used the following heuristic.

Heuristic (Number of rounds)

Suppose that a block cipher round function is such that:

  • \(r_e\) rounds are needed to prevent the existence of known (and relevant) distinguishers, and

  • \(r_d\) rounds are needed in order for all the bits of the state to depend on all the bits of the key.

Then, we suggest using a number of rounds equal to \(H_{\eta } = \left\lceil 2r_d + (1+\eta )r_e\right\rceil \), where \(\eta \) is a security factor intended to take into account possible improvements of the relevant distinguishers.

This method is heuristic as it is impossible to foresee how the best distinguishers will be improved, if at all. At the same time, we think it makes more sense than an approach based e.g. on simply doubling the number of rounds needed to prevent known distinguishers, since it takes into account the actual structure of the known attacks. Our restriction to “relevant” distinguishers allows designers, for example, to discard related-key distinguishers if those are not relevant for their design. On the other hand, in our case, we consider related-tweak distinguishers to be relevant. In our definition of \(r_d\), we assume that the diffusion is equally fast in the forward and in the backward direction.
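The heuristic is simple enough to be evaluated mechanically. The following Python sketch computes \(H_{\eta }\) exactly, using rational arithmetic to avoid floating-point rounding inside the ceiling. The function name `recommended_rounds` is ours and purely illustrative; the example values (\(r_d=3\), \(r_e=9\), \(\eta =0.2\)) are the ones quoted in the text for the L version.

```python
from fractions import Fraction
from math import ceil


def recommended_rounds(r_d, r_e, eta):
    """Return H_eta = ceil(2*r_d + (1 + eta)*r_e).

    r_d: rounds needed for full key diffusion (assumed equal in both directions).
    r_e: rounds needed to rule out the known, relevant distinguishers.
    eta: security factor, given as a Fraction (or a string such as "0.2").
    """
    eta = Fraction(eta)  # exact rational, so the ceiling is not affected by float error
    return ceil(2 * r_d + (1 + eta) * r_e)


# Values quoted for the L version: r_d = 3, r_e = 9, eta = 0.2.
print(recommended_rounds(3, 9, Fraction(1, 5)))  # 17
```

Here \(2r_d\) accounts for the rounds an attacker may add at the top and at the bottom using key guesses, and \((1+\eta )r_e\) for the distinguisher together with its safety margin.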

A Lightweight Block Cipher. We can use our round function to build Crax-S-10, a lightweight block cipher operating on 64-bit blocks using a 128-bit key, intended for the most constrained micro-controllers. We claim that it provides 128 bits of security in the single-key setting. A reference implementation is provided in the full version. We used a security factor \(\eta =0.2\), so that the total number of steps corresponds to \(10 = \lceil 2+2+r_{e}\times 1.2 \rceil \).

Our cipher uses a tweakless instance of the S step function described above. Since the step function has good diffusion and since we do not aim for related-key security, we use a very simple key schedule: the 64-bit round key \(k_{i}\) used at the beginning of step i is simply \(k_{i} = K_{i \mod 2} \oplus i\), where the master key is \((K_{0},K_{1})\). As there is no tweak, we do not need to worry about a bad interaction between tweak and key.
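As an illustration of how little work this key schedule involves, the round-key derivation can be sketched in a few lines of Python. This is not the reference implementation (which is in the full version); the function name `crax_round_key` and the representation of the master key as a pair of 64-bit integers are assumptions made for this sketch.

```python
MASK64 = (1 << 64) - 1  # round keys are 64-bit words


def crax_round_key(master_key, i):
    """Return k_i = K_{i mod 2} XOR i for the master key (K_0, K_1).

    master_key: pair of 64-bit integers (K_0, K_1); this layout is an assumption.
    i: step index, starting at 0.
    """
    k0, k1 = master_key
    half = k0 if i % 2 == 0 else k1  # alternate between the two key halves
    return (half ^ i) & MASK64       # XOR the step counter into the chosen half


# With an all-zero master key, the round key is simply the step counter.
print([crax_round_key((0, 0), i) for i in range(4)])  # [0, 1, 2, 3]
```

Since each round key is one XOR away from a master-key half, nothing needs to be precomputed or stored besides the master key itself.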

In order to prevent slide properties, we use the step counter in combination with a reduction of the number of round constants: instead of using all 8 of them, we only use 5. That way, in the first half of the cipher the steps involve \(c_{i}\) and \(K_{i \mod 2}\) while in the second half they use \(c_{i}\) and \(K_{(i+1) \mod 2}\). For other attacks, the security of Crax-S-10 follows directly from our analysis of Alzette.
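The pairing of constants and key halves described above can be checked with a short Python sketch. It assumes that step \(i\) uses the constant of index \(i \bmod 5\), which is our reading of the sentence on the two halves of the cipher and is not stated explicitly here; the function name `step_schedule` is illustrative.

```python
def step_schedule(n_steps=10, n_constants=5):
    """List (step, constant index, key-half index) for each step.

    Assumption: step i uses constant c_{i mod n_constants} and key half K_{i mod 2}.
    """
    return [(i, i % n_constants, i % 2) for i in range(n_steps)]


schedule = step_schedule()
for step, const_idx, key_idx in schedule:
    print(f"step {step}: c_{const_idx}, K_{key_idx}")

# In the second half (steps 5..9) the constant c_j is paired with K_{(j+1) mod 2},
# so no (constant, key half) pair occurs twice over the 10 steps.
pairs = {(c, k) for _, c, k in schedule}
print(len(pairs) == len(schedule))  # True
```

Because 5 is odd, the constant sequence and the key-half sequence fall out of phase after the first half, which is what makes every step distinct even before the step counter is taken into account.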

Crax-S-10 is a very lightweight block cipher, arguably one of the lightest ever reported in the literature when it comes to micro-controller implementations. The code size, RAM consumption, and execution time of Crax-S-10 on an 8-bit AVR and a 32-bit ARM Cortex-M3 micro-controller are summarized in Table 11, along with those of Speck-64/128  [7]. We obtained the results for Speck from the best implementations contained in the FELICS project  [18], namely the implementation “03” for ARM Cortex-M3 and the implementation “06” for the AVR architecture.Footnote 8 As Speck has the best performance across micro-controllers, it serves as a good benchmark for comparison.

Table 11. A comparison of our implementation of Crax-S-10 with Speck-64/128. RAM and ROM consumption are measured in bytes and the time for processing a 64-bit block is given in clock cycles.

The ARM implementation of Crax we benchmarked is the optimized C code included in the full version of this paper. Encrypting a single 64-bit block on a Cortex-M3 takes 239 cycles (including function-call overhead), and the decryption has exactly the same execution time. For comparison, Speck-64/128 encrypts and decrypts at a rate of 184 and 254 cycles per block, respectively. However, since Speck needs first to run its key schedule, Crax-S-10 encryption is faster than Speck for short messages of up to 9 blocks (i.e. 72 bytes). The Speck implementation occupies significantly more RAM than that of Crax (mostly because of the round keys) and has a much larger binary code size.
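The short-message advantage follows from a simple break-even computation, sketched below in Python. We assume that the total cost is the one-off key-schedule cost plus a fixed per-block cost, and that Crax has no key-schedule cost. The per-block figures (239 and 184 cycles) are the ones given above; the key-schedule cost of 500 cycles is a hypothetical placeholder chosen for illustration and is not taken from Table 11.

```python
def max_blocks_faster(per_block_a, per_block_b, setup_b):
    """Largest n such that n*per_block_a < setup_b + n*per_block_b.

    Cipher A has no setup cost; cipher B pays setup_b cycles once.
    Returns None if A is at least as fast per block (A then wins for every n).
    """
    gap = per_block_a - per_block_b
    if gap <= 0:
        return None
    # n*gap < setup_b  <=>  n <= (setup_b - 1) // gap for integer n
    return max(0, (setup_b - 1) // gap)


# 239 and 184 cycles per block are from the text; 500 is a HYPOTHETICAL key-schedule cost.
print(max_blocks_faster(239, 184, 500))  # 9
```

With a per-block gap of 55 cycles, any key-schedule cost above 495 and up to 550 cycles gives a break-even point of 9 blocks under this simplified model.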

The results for the 8-bit AVR platform in Table 11 were both obtained with hand-written assembly implementations. When executed on an ATmega128 micro-controller, Crax-S-10 is slower than Speck-64/128 when we leave the key schedule aside, but is actually faster on short messages (up to 5 blocks). As on ARM, the round keys make Speck significantly more RAM-demanding than Crax. In terms of code size, the decryption of Crax is smaller than that of Speck, while the encryption is slightly larger. However, when both functionalities are needed, Crax consumes less code space than Speck (including key schedule).

In summary, we can say that Crax is at least as light as Speck (lighter on ARM, comparable on AVR). Further, Crax shines for short messages, which are common in real-world applications like simple challenge-response protocols for the authentication of RFID tags and other IoT devices.

A Wide Tweakable Block Cipher. We can build an efficient software-oriented 256-bit tweakable block cipher with a 256-bit key and a 128-bit tweak using Trax-L-17 (pronounced “T-rax”). We claim related-tweak security as long as the total number of \((x, T)\) queries to the encryption (or decryption) oracle for a given key k is at most \(2^{128}\). We do not make any claim in the related-key setting. A reference implementation is provided in the full version.

The motivation for this bound on the data complexity is simple: while an attacker may have tremendous computing power, it is impossible that they obtain this many plaintext/ciphertext pairs. Furthermore, the security of many modes of operation drops when the number of queries reaches the birthday bound, which is \(2^{128}\) in our case. Combining the fact that the best distinguisher in the related-tweak setting cannot cover 9 steps with the same security factor as Crax-S-10 (namely \(\eta =0.2\)), we use \(\lceil 3+3+1.2\times 9 \rceil = 17\) steps.

For the key schedule, we use a simple generalized Feistel structure to update the key state and thus derive \(k_{i+1} = P_i(k_i)\), where \(k_0\) is the 256-bit master key and where \(P_i\) is \(\sigma \circ F_i\), with \(\sigma (x_0, ..., x_7) = (x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_0)\) and

where the constant indices are taken modulo 8. This key schedule ensures that the key material undergoes some transformation so as to break potential patterns linking subkeys and tweak.
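The word rotation \(\sigma \) and its composition with the Feistel layer can be sketched as follows in Python. The key state is modelled as a tuple of 8 words (presumably 32 bits each, given the 256-bit key). The Feistel layer \(F_i\) is deliberately left as a caller-supplied function, since this sketch makes no attempt to reproduce its definition; the names `sigma` and `next_key_state` are illustrative.

```python
def sigma(words):
    """Rotate the 8 key words by one position: (x0, ..., x7) -> (x1, ..., x7, x0)."""
    words = tuple(words)
    if len(words) != 8:
        raise ValueError("the key state consists of exactly 8 words")
    return words[1:] + words[:1]


def next_key_state(key_state, feistel_layer):
    """Apply sigma after the Feistel layer, i.e. compute (sigma o F_i)(key_state).

    feistel_layer: placeholder for F_i, a function mapping 8 words to 8 words.
    """
    return sigma(feistel_layer(key_state))


# With the identity in place of F_i, only the word rotation remains.
print(next_key_state(tuple(range(8)), lambda x: x))  # (1, 2, 3, 4, 5, 6, 7, 0)
```

As in any generalized Feistel structure, the rotation ensures that the words modified by one application of the Feistel layer take a different role in the next one.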

A tweakable block cipher lends itself well to a parallelizable mode of operation such as \(\varTheta \)CB  [25], a variant of OCB that avoids the complex overhead OCB needs to turn a regular block cipher into a tweakable one. Since our block size is equal to 256 bits, attacks relying on collisions obtained via the birthday paradox are a non-issue with Trax-L-17. Some modes such as the Synthetic Counter-in-Tweak (SCT)  [35] can retain a security level up to the birthday bound in case of nonce misuse. As suggested in  [15], using a 256-bit block cipher can also help provide post-quantum security in cases where the attacker is given a lot of power (e.g. if the primitive runs on a quantum computer).Footnote 9

In summary, Trax-L-17 can be used in SCT mode to provide 128 bits of security in case of nonce misuse, and its large block size can frustrate some quantum attacks when used in the same mode as Saturnin: it can be used to offer very robust authenticated encryption. On a Cortex-M3 micro-controller, the generation of sub-keys takes 980 cycles and the encryption has an execution time of 2506 cycles (both results are based on a standard C implementation). For comparison, Saturnin is more than two times slower on ARM (a detailed comparison can be found in the full version).

The use of 32-bit operations implies that it is possible to vectorize the computation of several parallel Trax-L-17 instances on many platforms, meaning that its speed can be multiplied whenever e.g. AVX instructions are available.

6 Conclusion

Alzette is a component of a new kind, a wide S-box operating on 64 bits that can nevertheless be argued to provide strong security against many attacks. Because of its reliance on ARX operations with carefully chosen rotations, a constant-time implementation is both easy to write and very efficient on a wide class of processors and micro-controllers.

The NIST LWC submission Sparkle  [8] provides the first application of the Alzette S-box, but we showed that Alzette can also be used to design software-efficient (tweakable) block ciphers on a variety of block lengths. A modified long-trail argument allows us to estimate the number of rounds needed to provide security with regard to (related-tweak) differential and linear attacks. We provided two concrete instances of this approach: the 64-bit block cipher Crax and the 256-bit tweakable block cipher Trax. Due to its very simple key schedule, Crax is competitive compared to the block cipher Speck: it consumes less RAM and is faster for short messages consisting of up to nine 64-bit blocks. On the other hand, the large block size of Trax can be used to obtain strong security guarantees in settings where the attacker is quite powerful (nonce-misuse, quantum computing) while its use of a tweak eases the use of parallelizable modes of operation that can better leverage vector instructions.