1 Introduction

Message Authentication Code (MAC) algorithms are one of the basic building blocks for cryptographic systems. A MAC algorithm processes a message \(m\) and a secret key \(K\) to generate a tag \(\tau \). It should be hard for an attacker to construct a forgery: that is, to generate a valid combination of \((m,\tau )\) without knowledge of the secret key \(K\). Thereby, the MAC algorithm ensures the authenticity of the message \(m\).

Over the years, a large variety of MAC algorithms have been proposed. Some of the most commonly used algorithms today are CMAC [30, 38], HMAC [6, 60], and UMAC [14]. CMAC is based on a block cipher, usually AES or Triple-DES, whereas HMAC uses a hash function such as MD5, SHA-1, or SHA-2, and UMAC is based on a universal hash function combined with a standard cryptographic primitive such as a block cipher or a hash function.

Unlike most other MAC algorithms, a nonce input is required for MAC algorithms based on universal hash functions [19, 61]. This includes MAC algorithms such as UMAC [14], Poly1305-AES [9], and GMAC [31]. The nonce should not be reused, or this would lead to a forgery attack. Furthermore, Poly1305-AES and GMAC become insecure when tags are truncated [34]. We note that currently used MAC algorithms based on universal hash functions typically make use of multiplications. On several microcontrollers, the number of cycles required to execute an integer multiplication instruction is data-dependent, which makes the implementations potentially vulnerable to timing attacks [45].

For MAC algorithms that are based on hash functions, the block size is typically very large: for MD5, SHA-1, SHA-2, and the upcoming SHA-3 [10], messages are processed in blocks of at least 512 bits. For very short messages, this will result in a large overhead. But also for longer messages, it is generally undesirable for typical microcontrollers to process such large blocks. This is because many load and store operations are required to move data back and forth between the limited number of registers and the RAM, which significantly increases the time, energy, and code size of the MAC algorithm implementation.

A similar issue appears for block-cipher-based MAC algorithms, which typically use AES or Triple-DES. On typical microcontrollers, the key schedule of these block ciphers increases the register pressure: round keys must be either precomputed and stored in RAM, or computed on the fly. Furthermore, on 32-bit platforms, the S-box operations of AES and Triple-DES require extensive use of bit masking operations, which again negatively impacts the speed of the implementation. Finally, we note that MAC algorithms based on reduced-round block ciphers such as ALPHA-MAC [22] and Pelican MAC [23] have been proposed, yet their performance gain is small for very short messages because a full-round block cipher is used for both initialization and finalization.

Chaskey

We present Chaskey, a permutation-based MAC algorithm that overcomes these issues. Chaskey takes a 128-bit key \(K\) and processes a message \(m\) in 128-bit blocks using a 128-bit permutation \(\pi \). This permutation is based on the Addition-Rotation-XOR (ARX) design methodology. Its design is inspired by the permutation of SipHash [3], but it uses 32-bit instead of 64-bit words.

Chaskey has the following features:

  • Dedicated Design. Chaskey is a dedicated design for 32-bit microcontroller architectures. The addition and XOR operations are performed on 32-bit words, and each of these operations requires only one instruction on these architectures.

  • Cross-Platform Versatility. We took into account that certain microcontrollers do not support variable-length bit rotations and bit shifts. By choosing some rotation constants to be multiples of 8, these bit rotations are efficiently implemented by swapping 8-bit or 16-bit registers.

  • Efficient Implementation. Benchmarks on an ARM Cortex-M4 show that Chaskey requires only 7.0 cycles/byte for long (\(\ge 128\) byte) messages, and 10.6 cycles/byte for short (16 byte) messages. It has been implemented in only 402 bytes of ROM. Results for the Cortex-M0 are very good as well: 16.9 cycles/byte for long messages, 21.3 cycles/byte for short ones, and 414 bytes of ROM for the implementation. There is, roughly speaking, a linear relation between the number of cycles and energy consumption [21]. We therefore expect Chaskey to be very energy efficient as well.

  • Resistance Against Timing Attacks. On all microcontroller architectures that we are aware of, every instruction of Chaskey takes a constant time to execute. The total number of cycles depends only on the message length. Therefore, Chaskey is inherently secure against timing attacks.

  • Key Agility. Chaskey does not have a key schedule, as keys are simply XORed into the state. Updating the key in Chaskey requires generating a new uniformly random 128-bit key, and only two shifts and two conditional XORs on 128-bit words to generate two subkeys.

  • Tag Truncation. Chaskey is robust under tag truncation. Unlike for example GMAC [34], the best attack on Chaskey with short tags is tag guessing. We recommend \(|\tau | \ge 64\) for typical applications. Shorter tags may be used after careful analysis of the probability of occasionally accepting an inauthentic message.

  • Nonces are Optional. Several MAC algorithms (including GMAC [31], VMAC [46], and Poly1305-AES [9]) require a nonce, and become completely insecure if this nonce is reused (see e.g. [40]). Chaskey does not require a nonce, and therefore avoids these issues altogether.

  • Provably Secure. We prove that Chaskey is secure, based on the security of an Even-Mansour [32, 33] block cipher based on \(\pi \), up to about \(D=2^{64}\) blocks of chosen plaintexts and \(T=2^{128}/D\) off-line block cipher evaluations.

  • Patent-Free. We are unaware of any patents or patent applications related to Chaskey.

The name Chaskey is derived from Chasqui, also written as Chaski. Chasquis were fast runners that delivered messages in the Inca empire. They were of short stature, and could cover large distances through mountainous areas with little nutrition available to them [52].

2 Preliminaries

Table 1 summarizes the notation used in this paper. Throughout, \(n\) is both the key size and the block size. While the Chaskey algorithm is introduced for \(n=128\), we remark that our statements on the Chaskey mode of operation are independent of this specific choice of \(n\).

We interchangeably consider an element \(a\) of \(GF(2^n)\) as an \(n\)-bit string \(a[n-1]a[n-2]\ldots a[0]\) and as the polynomial \(a(x)=a[n-1]x^{n-1} + a[n-2]x^{n-2} + \ldots + a[0]\) with binary coefficients. Let \(f(x)\) be an irreducible polynomial of degree \(n\) with binary coefficients. For \(n=128\), we choose \(f(x)=x^{128}+x^{7}+x^{2}+x+1\). Then to multiply two elements \(a\) and \(b\), we represent them as two polynomials \(a(x)\) and \(b(x)\), and calculate \(a(x)b(x)\ \mathrm{{mod}}\ f(x)\). For example, we show how to multiply an element by \(x\) in Algorithm 1. Note that \(x\) corresponds to bit string \(0^{126}10\), which is \(2\) in decimal notation.
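
The multiplication by \(x\) can be sketched in Python as follows. This is a minimal illustration rather than the paper's Algorithm 1 verbatim: the function name `times_two` and the use of a Python integer (bit \(i\) of the integer holding \(a[i]\)) are our own assumptions. A left shift multiplies by \(x\); if \(a[127]\) was set, the term \(x^{128}\) is replaced by \(x^{7}+x^{2}+x+1\), which is the constant `0x87`.

```python
N = 128
MASK = (1 << N) - 1   # keeps values within n = 128 bits
REDUCTION = 0x87      # x^7 + x^2 + x + 1, the low-order terms of f(x)


def times_two(a: int) -> int:
    """Multiply a in GF(2^128) by x; bit i of the integer a is the coefficient a[i]."""
    shifted = (a << 1) & MASK
    if a >> (N - 1):            # coefficient of x^127 was set, so x^128 appears
        shifted ^= REDUCTION    # reduce modulo f(x)
    return shifted
```

The data-dependent branch is acceptable in a sketch, but a microcontroller implementation would typically replace it by a masked XOR so that the running time does not depend on the key.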

When converting between bit strings and arrays of 32-bit words, we always use little endian byte ordering. Inside every byte, bit numbering starts with the least significant bit.

Table 1. Notation.

3 Specification of Chaskey

3.1 Mode of Operation

Chaskey uses an \(n\)-bit key \(K\) to process a message \(m\) of arbitrary size into a tag \(\tau \) of \(t \le n\) bits. For every key \(K\), two subkeys \(K_1\), \(K_2\) are generated as shown in Algorithm 2.

The message \(m\) is split into \(\ell \) blocks \(m_1,m_2,\ldots ,m_\ell \) of \(n\) bits each, except for the last block \(m_\ell \) which may be incomplete. We define that an empty message \(m = \varnothing \) consists of one empty block: \(|m_1|=0\). An \(n\)-bit permutation \(\pi \) then iterates over the message, as specified in Algorithm 3 and illustrated in Fig. 1.
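
The description above, together with the caption of Fig. 1, can be sketched in Python as follows. This is a hedged illustration, not the authors' reference code: the names `chaskey_mode`, `times_two`, and `xor_bytes` are hypothetical, the permutation is passed in as the parameter `permute` so that the sketch does not depend on any particular \(\pi \), the initial state is assumed to be \(K\), and returning the first `tag_len` bytes as the truncated tag is an assumption based on the little endian convention.

```python
from typing import Callable

BLOCK = 16  # n = 128 bits


def times_two(block: bytes) -> bytes:
    """Multiply by x in GF(2^128), little endian bytes, reduction constant 0x87."""
    a = int.from_bytes(block, "little")
    a = ((a << 1) & ((1 << 128) - 1)) ^ (0x87 if a >> 127 else 0)
    return a.to_bytes(BLOCK, "little")


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def chaskey_mode(key: bytes, msg: bytes, permute: Callable[[bytes], bytes],
                 tag_len: int = BLOCK) -> bytes:
    """Sketch of the mode of operation; `permute` stands in for pi."""
    k1 = times_two(key)   # K1 = x * K
    k2 = times_two(k1)    # K2 = x^2 * K
    # Split into n-bit blocks; the empty message is one empty block.
    blocks = [msg[i:i + BLOCK] for i in range(0, len(msg), BLOCK)] or [b""]
    h = key
    for m in blocks[:-1]:
        h = permute(xor_bytes(h, m))
    last = blocks[-1]
    if len(last) == BLOCK:
        subkey = k1
    else:
        # 10* padding: with LSB-first bit numbering the single 1 bit is byte 0x01.
        last = last + b"\x01" + bytes(BLOCK - len(last) - 1)
        subkey = k2
    # The same subkey is XORed in before and after the last permutation call.
    h = xor_bytes(permute(xor_bytes(xor_bytes(h, last), subkey)), subkey)
    return h[:tag_len]
```

Any function mapping 16 bytes to 16 bytes can be supplied as `permute` for experimentation; only the eight-round \(\pi \) gives Chaskey itself.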

An alternative description of Chaskey based on an Even-Mansour [32, 33] block cipher \(E\) with \(2n\)-bit key and \(n\)-bit block size is given in Algorithm 4. This block-cipher-based description is equivalent to Chaskey once we define \(E\) using \(\pi \) as \(E_{X\Vert Y}(m)=\pi (m\oplus X)\oplus Y\). The purpose of this block-cipher-based alternative is to reduce the security of Chaskey to the security of the underlying block cipher \(E\). A security proof will be given in Sect. 5. This security proof views \(\mathrm {Chaskey \text {-}B}\) as a variant of FCBC by Black and Rogaway [15, 16], shown in Algorithm 5.
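
The equivalence of the two descriptions can be checked by tracking the key XORs through the chain. The derivation below is a sketch under two assumptions that are not spelled out above: the permutation-based description starts from the state \(s_0=K\), and the block-cipher-based description starts, as in CBC-MAC, from the chaining value \(h_0=0^n\). The symbols \(s_i\) and \(h_i\) are our own notation for the two chaining values.

$$\begin{aligned} s_i&= \pi (s_{i-1}\oplus m_i), \qquad h_i = E_{K\Vert K}(h_{i-1}\oplus m_i) = \pi (h_{i-1}\oplus m_i\oplus K)\oplus K, \\ h_0&= s_0\oplus K \;\Longrightarrow \; h_i = s_i\oplus K \quad \text {for } 0\le i<\ell , \\ E_{K\oplus K_1\Vert K_1}(h_{\ell -1}\oplus m_\ell )&= \pi (s_{\ell -1}\oplus K\oplus m_\ell \oplus K\oplus K_1)\oplus K_1 = \pi (s_{\ell -1}\oplus m_\ell \oplus K_1)\oplus K_1. \end{aligned}$$

The last line is the final step of the permutation-based description for a complete last block; the case of an incomplete block follows in the same way with \(K_2\) and the padded block.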

Fig. 1. The Chaskey mode of operation when \(|m_\ell |=n\) (top), and when \(0\le |m_\ell |<n\) (bottom). The round function of permutation \(\pi \) is shown in Fig. 2, the subkeys \(K_1\) and \(K_2\) are generated according to Algorithm 2, and \(m_\ell \Vert 10^{*}\) is shorthand for \(m_\ell \Vert 10^{n-|m_\ell |-1}\).

From this block-cipher-based description, it can be seen that Chaskey is similar to the three-key MAC constructions proposed by Black and Rogaway [15, 16]. Their constructions are variants of CBC-MAC [1, 37] that are secure for variable-length messages and avoid padding for messages of an integer number of blocks. As in CMAC [30, 38], our algorithm requires only one \(n\)-bit key, from which two \(n\)-bit subkeys are generated. However, unlike CMAC, Chaskey does not require any block cipher calls to generate these two subkeys, only two shifts and two conditional XORs on 128-bit words.

Chaskey also differs from the CBC-MAC variants in the literature in two ways: its underlying block cipher uses an Even-Mansour construction, and it uses the same subkey twice in the last two subkey XORs, before and after the last permutation call. Therefore, this subkey (or part thereof) can remain inside the registers of the microcontroller. This reduces the number of load and store operations, which are very expensive on typical microcontrollers.

Every key \(K\) must be chosen independently and uniformly at random from the entire key space. To avoid attacks that require only a practical number of off-line permutation evaluations, as will be explained in Sect. 6.1, we restrict the total number of blocks to be authenticated under the same key \(K\) to at most \(2^{48}\). This corresponds to refreshing the key after at most 4 petabytes of data. To avoid tag guessing attacks, we recommend a tag size of \(|\tau | \ge 64\). Changing \(|\tau |\) always requires selecting a new key \(K\) uniformly at random.
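
For \(n=128\), every block holds 16 bytes, so the data limit can be verified as follows (reading a petabyte as \(2^{50}\) bytes):

$$\begin{aligned} 2^{48}\text { blocks}\times 2^{4}\text { bytes/block} = 2^{52}\text { bytes} = 4\cdot 2^{50}\text { bytes}. \end{aligned}$$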

3.2 Permutation \(\pi \)

The permutation \(\pi \) is built using three operations: addition modulo \(2^{32}\), bit rotations, and XOR (ARX). The structure is the same as that of SipHash [3], but with 32-bit instead of 64-bit words and different rotation constants. Although SipHash has been proposed only very recently, it has found its way into several widely used software packages. For example, SipHash is used inside the hash table implementations of FreeBSD, Python, Perl, and Ruby. Both Chaskey and SipHash use the 2-input MIX operation of Skein [35], one of the finalists of the SHA-3 competition [53].

Fig. 2. A round of the Chaskey permutation \(\pi \), defined as: \(v_{0} \Vert v_{1} \Vert v_{2} \Vert v_{3} \leftarrow \pi (v_{0} \Vert v_{1} \Vert v_{2} \Vert v_{3})\). We intentionally swapped \(v_{0}\) and \(v_{1}\), as this reduces the number of crossing lines in the figure.

In Chaskey, the permutation \(\pi \) consists of eight applications of a round function. This round function is specified in Fig. 2.
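
Since the round function is given only as a figure, a textual sketch may be helpful. The Python code below is an illustration under stated assumptions: the names `chaskey_round` and `chaskey_pi` are our own, and the rotation constants (5, 16, 8, 13, 7, 16) are taken from the publicly available reference implementation rather than from the text above, so they should be checked against Fig. 2 before use. The state is read as four 32-bit words in little endian byte order.

```python
import struct

MASK32 = 0xFFFFFFFF


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def chaskey_round(v0: int, v1: int, v2: int, v3: int):
    """One ARX round: additions modulo 2^32, rotations, and XORs."""
    v0 = (v0 + v1) & MASK32
    v1 = rotl32(v1, 5) ^ v0
    v0 = rotl32(v0, 16)
    v2 = (v2 + v3) & MASK32
    v3 = rotl32(v3, 8) ^ v2
    v0 = (v0 + v3) & MASK32
    v3 = rotl32(v3, 13) ^ v0
    v2 = (v2 + v1) & MASK32
    v1 = rotl32(v1, 7) ^ v2
    v2 = rotl32(v2, 16)
    return v0, v1, v2, v3


def chaskey_pi(block: bytes, rounds: int = 8) -> bytes:
    """Apply `rounds` rounds to a 16-byte state (8 for Chaskey)."""
    v = struct.unpack("<4I", block)
    for _ in range(rounds):
        v = chaskey_round(*v)
    return struct.pack("<4I", *v)
```

Because only the round count differs, calling `chaskey_pi(block, rounds=16)` gives the permutation of a 16-round variant with no additional code.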

Although we are confident that 8 rounds is enough for a secure construction, we recommend that implementers include the 16-round variant Chaskey-LTS (long term security) as a fallback in case of cryptanalytical breakthroughs. Chaskey-LTS consumes roughly twice the number of cycles and thus twice the amount of energy as Chaskey, but is still much faster than AES-CMAC. As only the number of rounds is different, it is possible to implement both Chaskey and Chaskey-LTS with negligible overhead in code size.

Note that half of the rotation constants of \(\pi \) are chosen to be multiples of eight. This is because a variety of microcontrollers do not support rotations and shifts over arbitrary amounts, e.g. the Renesas H8/300 CPU supports only one-bit rotations and shifts, the Renesas H8/2000 supports one-bit and two-bit rotations and shifts, and Microchip’s 8-bit microcontrollers (PIC10/12/16/18) support one-bit rotations. Due to our choice of constants, implementation on 8- and 16-bit microcontrollers will be more efficient than had these constants been chosen at random. They furthermore allow us to implement Chaskey efficiently on a wide range of 32-bit microcontrollers, yet we have found that they do not seem to make \(\pi \) weaker against cryptanalytical attacks.
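
The benefit of rotation amounts that are multiples of eight can be illustrated with a short Python sketch. The helper names are hypothetical; the point is only that a rotation by 8 or 16 bits moves whole bytes and therefore needs no bit-level shifting, which is what a compiler or assembly programmer exploits on 8-bit and 16-bit targets.

```python
def rotl32(x: int, r: int) -> int:
    """Generic 32-bit left rotation using shifts."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def rotl8_by_bytes(x: int) -> int:
    """Rotate left by 8 bits by moving the most significant byte to the end."""
    b = x.to_bytes(4, "big")
    return int.from_bytes(b[1:] + b[:1], "big")


def rotl16_by_halves(x: int) -> int:
    """Rotate left by 16 bits by swapping the two 16-bit halves."""
    return ((x & 0xFFFF) << 16) | (x >> 16)
```

For example, `0x11223344` becomes `0x22334411` under a rotation by 8 and `0x33441122` under a rotation by 16, in both the shift-based and the byte-based versions.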

4 Implementation Results

We implemented Chaskey on several microcontroller platforms. We provide implementation results on ARM Cortex-M0 and -M4 platforms, and compare these to AES-128-CMAC on the same platforms. All our implementations have been compiled with GNU Tools for ARM Embedded Processors version 4.7.3 20121207. The Cortex-M0 benchmarks are executed on an STM32F030R8 microcontroller of STMicroelectronics, the Cortex-M4 ones on an STM32F401RE.

We compare the results for our Chaskey implementation with what is, to the best of our knowledge, the fastest available AES implementation for the ARM Cortex-M series: SharkSSL [55, 56]. Since no AES-128-CMAC benchmarks are available for this implementation, we instead compare with AES-128-ECB, which is guaranteed to be at least as fast and small as AES-128-CMAC. Note that we list SharkSSL results for the Cortex-M3, since Cortex-M4 results are not available. However, the architecture of both microcontrollers is extremely similar, and thus results are expected to be the same.

Results for the various implementations are shown in Table 2. In all of our own benchmarks, round keys are precomputed, and time required to do so is not included in the listed numbers.

Table 2. Benchmark results for Chaskey and AES-128-CMAC on Cortex-M0/M4. AES-128-CMAC is implemented using AES code from the MAGEEC framework. AES-128-ECB on Cortex-M0/M3 is based on figures from SharkSSL [55, 56]. Note that compiling with speed optimization flags does not always result in the fastest implementation.

5 Proof of Security

We focus on the security of the Chaskey mode of operation. For this, we consider \(n,t\in \mathbb {N}\) to be arbitrary values. Denote by \(\mathsf {block}(k,n)\) the set of all block ciphers with \(k\)-bit key and \(n\)-bit block size, and let \(\mathsf {perm}(n)\) denote the set of all permutations on \(n\) bits. Note that for \(E\in \mathsf {block}(k,n)\), we have \(E_K\in \mathsf {perm}(n)\) for all \(K\in \{0,1\}^{k}\). The definitions below follow Bellare et al. [7] and Iwata and Kurosawa [38, 39].

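MAC Security. Let \(\mathcal {H}:\mathcal {K}\times \{0,1\}^{*}\rightarrow \{0,1\}^{t}\) be a MAC function. The advantage of an adversary in forging a valid message-tag pair for \(\mathcal {H}\) is defined as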

$$\begin{aligned} \mathbf {Adv}_{\mathcal {H}}^{\mathsf {mac}}(q,D,r) = \max _{\mathcal {A}} \Pr \left( \begin{array}{l} K\xleftarrow {{\scriptscriptstyle \$}}\mathcal {K}\,,\, (m,\tau )\xleftarrow {{\scriptscriptstyle \$}}\mathcal {A}^{\mathcal {H}_K}\;;\\ \mathcal {H}_K(m)=\tau \text { and } m \text { never queried}\\ \end{array} \right) , \end{aligned}$$

where the maximum is taken over all adversaries making at most \(q\) queries of total length at most \(D\) blocks and running in time \(r\).

3PRP Security. The strength of a block cipher \(E\) is conventionally expressed as the PRP (pseudorandom permutation) security. In \(\mathrm {Chaskey \text {-}B} \) (see Algorithm 4) we use a block cipher \(E\in \mathsf {block}(2k,n)\) on input of three different keys: \(E_{K\Vert K}\), \(E_{K\oplus K_1\Vert K_1}\), and \(E_{K\oplus K_2\Vert K_2}\), where \(K_1\), \(K_2\) are generated as shown in Algorithm 2. As the keys \((K,K_1,K_2)\) are dependent, so are the three different usages of \(E\). As such, a slightly more involved security notion is needed, which we call 3PRP. For ease of presentation, the definition is adapted to the specific key generation and block cipher use mode of Chaskey.

$$\begin{aligned} \mathbf {Adv}_{E}^{\mathsf {3prp}}(D,r) = \max _{\mathcal {A}} \left| \begin{aligned} \Pr \left( \begin{array}{r} K\xleftarrow {{\scriptscriptstyle \$}}\{0,1\}^{k}\, ,\, (K_1,K_2)\leftarrow \mathrm {SubKeys} (K) \;;\\ \mathcal {A}^{E_{K\Vert K},E_{K\oplus K_1\Vert K_1},E_{K\oplus K_2\Vert K_2}} = 1\\ \end{array} \right) - \qquad&\\ \Pr \left( p_1,p_2,p_3\xleftarrow {{\scriptscriptstyle \$}}\mathsf {perm}(n) \;;\; \mathcal {A}^{p_1,p_2,p_3} = 1 \right)&\end{aligned}\right| , \end{aligned}$$

where the maximum is taken over all adversaries making at most \(D\) queries and running in time \(r\).

The proof consists of two phases. Theorem 1 states the security of \(\mathrm {Chaskey \text {-}B} \) in the standard model, based on any \(E\) with \(2n\)-bit key and \(n\)-bit block size. This result is generalized in the ideal permutation model to \(\mathrm {Chaskey} \) in Theorem 2, once we use \(E_{X\Vert Y}(m)=\pi (m\oplus X)\oplus Y\) for \(\pi \in \mathsf {perm}(n)\).

Theorem 1

Let \(K\xleftarrow {{\scriptscriptstyle \$}}\{0,1\}^{n}\) and consider \(\mathrm {Chaskey \text {-}B} ^E_K:\{0,1\}^{*}\rightarrow \{0,1\}^{t}\). Then,

$$\begin{aligned} \mathbf {Adv}_{\mathrm {Chaskey \text {-}B}}^{\mathsf {mac}}(q,D,r) \le \frac{2D^2}{2^n} + \frac{1}{2^t} + \mathbf {Adv}_{E}^{\mathsf {3prp}}(D,r). \end{aligned}$$

Theorem 2

Let \(K\xleftarrow {{\scriptscriptstyle \$}}\{0,1\}^{n}\), assume that \(\pi \xleftarrow {{\scriptscriptstyle \$}}\mathsf {perm}(n)\), and let us consider \(\mathrm {Chaskey} ^\pi _K:\{0,1\}^{*}\rightarrow \{0,1\}^{t}\). Then,

$$\begin{aligned} \mathbf {Adv}_{\mathrm {Chaskey}}^{\mathsf {mac}}(q,D,r) \le \frac{2D^2}{2^n} + \frac{1}{2^t} + \frac{D^2+2DT}{2^n}, \end{aligned}$$

where \(T\) is defined as \(r/r_\pi \) for \(r_\pi \) denoting the running time of one evaluation of \(\pi \).
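
To get a feeling for the bound of Theorem 2, it can be evaluated for concrete parameters. The Python sketch below is only an illustration: the function name `chaskey_mac_bound` is hypothetical, and the example value \(T=2^{64}\) is chosen by us and does not appear in the text. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def chaskey_mac_bound(n: int, t: int, D: int, T: int) -> Fraction:
    """Right-hand side of the Theorem 2 bound for block size n and tag size t."""
    mode_term = Fraction(2 * D * D, 2 ** n)            # 2D^2 / 2^n
    guess_term = Fraction(1, 2 ** t)                   # 1 / 2^t
    em_term = Fraction(D * D + 2 * D * T, 2 ** n)      # (D^2 + 2DT) / 2^n
    return mode_term + guess_term + em_term


# Example: n = 128, 64-bit tags, D = 2^48 blocks, T = 2^64 evaluations of pi.
example = chaskey_mac_bound(128, 64, 2 ** 48, 2 ** 64)
```

For these parameters the term \(2DT/2^n=2^{-15}\) dominates. With \(D=2^{48}\) and \(T=2^{80}\) the same term equals 2, so the bound no longer gives any guarantee, in line with the trade-off \(DT\approx 2^{n}\).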

The proofs of Theorems 1 and 2 can be found in the full version of the paper.

6 Cryptanalysis

6.1 Attack Setting

In this section, we give an overview of the cryptographic properties of the Chaskey permutation \(\pi \), and the two-key Even-Mansour block cipher \(E_{X\Vert Y}(m)=\pi (m\oplus X)\oplus Y\). Note that even if \(\pi \) has structural weaknesses, Theorem 1 guarantees that Chaskey remains secure as long as \(E_{K\Vert K}\), \(E_{K\oplus K_1\Vert K_1}\), and \(E_{K\oplus K_2\Vert K_2}\) are secure Even-Mansour block ciphers that are indistinguishable from each other. In particular, attackers are restricted to the following setting:

Uniformly Random Key \(\varvec{K}\). Every implementation of Chaskey should ensure that the \(n\)-bit key \(K\) is chosen uniformly at random from the entire key space. In this way, Chaskey completely avoids all attacks on \(E_{K\Vert K}\) using weak keys [24], known keys [43] or related keys [8, 11, 12]. In a weak-key attack, the attacker knows that the key \(K\) is chosen from a smaller subset of the key space. In a known-key attack, the attacker knows the value of \(K\), which in the case of the Even-Mansour block cipher corresponds to an attack on the underlying permutation \(\pi \). In a related-key attack, the attacker obtains encryptions under different keys, and will know (or even control) the relationship among these keys.

Data Complexity \(\varvec{D}\) Below \(\varvec{2^{n/2}}\) Chosen Plaintexts. No encryption device is allowed to perform close to \(2^{n/2}\) block cipher calls under the same key. This is because after about \(2^{n/2}\) block cipher calls, an internal collision attack [54] becomes likely. The same restriction applies to all iterated MAC constructions with an \(n\)-bit state. We will now explain that the data complexity under the same key should be restricted further to avoid attacks with a practical time complexity.

Time Complexity \(\varvec{T}\) Below \(\varvec{2^{n}/D}\) Block Cipher Evaluations. Even and Mansour [32, 33] proved that any attack on their construction that uses \(D\) known plaintexts requires about \(T=2^{n}/D\) block cipher evaluations. Dunkelman et al. [29] described a key recovery attack on the Even-Mansour construction to show that this bound is tight. As they clarify, this tight bound holds for both single-key and two-key Even-Mansour. To avoid attacks with a practical time complexity, the specification restricts the total number of blocks under the same key \(K\) to at most \(2^{48}\). This limit assumes that performing about \(2^{80}\) off-line permutation evaluations is impractical for the attacker. Implementations that require a higher security level should rekey more frequently. We note that the amortized cost of rekeying is usually negligible, and rekeying does not require additional cryptographic components if Chaskey is also used as a key derivation function (KDF) [20].
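
For \(n=128\), the trade-off \(DT\approx 2^{n}\) gives the following time complexities at the specified data limit and at the birthday bound:

$$\begin{aligned} D=2^{48}:\quad T\approx \frac{2^{128}}{2^{48}} = 2^{80}, \qquad \qquad D=2^{64}:\quad T\approx \frac{2^{128}}{2^{64}} = 2^{64}. \end{aligned}$$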

No Chosen Ciphertext Attacks. The attacker cannot make any decryption queries \(E^{-1}_{K\Vert K}\), \(E^{-1}_{K\oplus K_1\Vert K_1}\), or \(E^{-1}_{K\oplus K_2\Vert K_2}\), for the simple reason that Chaskey implementations do not contain the decryption function, and the corresponding keys are secret.

Tag Guessing Has Probability \(\varvec{2^{-|\tau |}}\) . The probability of constructing a forgery by guessing the tag is \(2^{-|\tau |}\). Guessing a tag correctly for Chaskey does not make additional forgeries easier. The specification recommends that \(|\tau | \ge 64\), which ensures that the probability of guessing \(\tau \) correctly after \(2^{32}\) trials is less than one in a billion. If it is acceptable to occasionally accept an inauthentic message as authentic (e.g. in certain voice communication applications [34]), the use of shorter tags may be carefully considered.
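
The figure of one in a billion follows from a union bound over the \(2^{32}\) guesses with \(|\tau |=64\):

$$\begin{aligned} \Pr (\text {at least one of } 2^{32} \text { guesses is correct}) \le 2^{32}\cdot 2^{-64} = 2^{-32} \approx 2.3\cdot 10^{-10} < 10^{-9}. \end{aligned}$$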

Implementation Attacks. Chaskey is inherently secure against timing attacks, as its execution time depends only on the message length \(|m|\), and not on the secret key \(K\). However, a straightforward implementation of Chaskey provides no resistance against hardware side channel attacks, nor to fault attacks. Furthermore, note that if the internal state of Chaskey is recovered and \(|\tau |=n\), it is easy to recover the secret key \(K\) from any \((m,\tau )\)-pair.

6.2 Cryptanalysis of the Block Cipher

We now proceed with our cryptanalysis results for the block ciphers \(E_{K\Vert K}\), \(E_{K\oplus K_1\Vert K_1}\), and \(E_{K\oplus K_2\Vert K_2}\) using \(\pi \) as the underlying permutation.

Standard Differential Cryptanalysis. We searched for differential characteristics of \(E_{K\Vert K}\) that are linear in \(GF(2)\), which means the output difference of every addition is the XOR of the two input differences. This was done by formulating this problem as the search for low-weight codewords in a linear code [58].

The best found characteristics for \(1,2,\ldots ,8\) rounds are shown in Table 3. We show only the input and output differences; the linearity property can be used to find the internal differences. We calculated the characteristic probability in two ways: by determining the probability of every addition using the Lipmaa-Moriai formula [49] and multiplying these probabilities, and by using Leurent’s ARX Toolkit [47, 48] to obtain a more accurate estimate that takes certain dependencies between operations into account.
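
The first of these two calculations can be sketched in Python. The function below follows the Lipmaa-Moriai formula [49] for the XOR-differential probability of modular addition; the name `xdp_add` and the word-size parameter `n` are our own choices, and the code is an illustration rather than the tooling used for Table 3. For a characteristic that is linear in \(GF(2)\), every addition is evaluated with \(\gamma =\alpha \oplus \beta \), and the per-addition probabilities are multiplied under the independence assumption.

```python
def xdp_add(alpha: int, beta: int, gamma: int, n: int = 32) -> float:
    """Probability that input differences (alpha, beta) to an addition modulo 2^n
    give output difference gamma, following the Lipmaa-Moriai formula."""
    full = (1 << n) - 1

    def eq(x: int, y: int, z: int) -> int:
        # Bit i is 1 exactly when x, y and z agree in bit i.
        return (~x ^ y) & (~x ^ z) & full

    a1, b1, g1 = (alpha << 1) & full, (beta << 1) & full, (gamma << 1) & full
    # Validity check: where the previous bits agree, the carry difference is fixed.
    if eq(a1, b1, g1) & (alpha ^ beta ^ gamma ^ b1):
        return 0.0
    # Every bit position below the MSB where the differences disagree costs 2^-1.
    weight = bin(~eq(alpha, beta, gamma) & (full >> 1)).count("1")
    return 2.0 ** -weight
```

A difference in the most significant bit only, such as `xdp_add(0x80000000, 0, 0x80000000)`, passes through the addition with probability 1, whereas a difference in the least significant bit only, `xdp_add(1, 0, 1)`, has probability \(2^{-1}\). This is one reason why sparse differences in the most significant bits are attractive in the middle of a characteristic.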

In Table 4, we give the differences after every round of the best found differential characteristic for eight rounds, which corresponds to the last characteristic in Table 3. It is interesting to note that this characteristic has what can be described as an hourglass structure: the differences are sparse in the middle of the characteristics (located only in the most significant bits), and gradually become denser towards the outer rounds. The same observation also holds for all other characteristics of Table 3.

In Table 3, probabilities below \(2^{-128}\) indicate that a characteristic exists only with some probability. Although such characteristics are not usable in an attack, it is important to explore them from a design point of view. Table 3 shows that Even-Mansour block ciphers based on \(\pi \) have a very large security margin against even very advanced variants of differential cryptanalysis attacks, especially as the data complexity in any attack on Chaskey is limited to \(2^{64}\).

Note that it is possible that better (possibly non-linear) characteristics exist, or that the probability of a given characteristic is lower than the probability of the corresponding differential. However, we expect that these effects will not be significant enough to invalidate our security claim against differential cryptanalysis.

Table 3. Best found differential characteristics for \(1, 2, \ldots , 8\) rounds of the permutation \(\pi \). Only the input and output differences are shown. Each of these characteristics is linear; this property can be used to determine the internal differences. We calculate the characteristic probability in two ways: assuming independence of every operation and using the Lipmaa-Moriai formula, as well as by Leurent’s ARX toolkit for a more refined estimate.

Table 4. Best found linear differential characteristic for \(8\) rounds of \(\pi \). This is the characteristic given in the last row of Table 3. If we assume independence of every operation and use the Lipmaa-Moriai formula for every addition, we find a probability of \(2^{-293}\). Leurent’s ARX toolkit can be used to refine this probability to \(2^{-289.9}\). Note the hourglass structure: differences are sparse in the middle, and gradually become denser towards the outer rounds.

Truncated Differential Cryptanalysis. We used the same techniques that were applied to Salsa20 [4] to find truncated differentials for \(E_{K\Vert K}\). More specifically, we introduced differences in the most significant bits of the inputs, and searched for statistical biases in the output bits. We found such biases for up to four rounds of the block cipher. For example, if in the plaintext \(\varDelta ^{\oplus } v_{1}[31]\) and \(\varDelta ^{\oplus } v_{2}[31]\) are both 1, then we found experimentally that \(\varDelta ^{\oplus } v_{2}[16]\) after four rounds has a bias of about \(2^{-12.48}\) towards 0. We tried out all combinations of input differences in the most significant bits of the four input words, but did not find biases in any of the output bit differences after five rounds or more, when experimenting with sets of \(2^{30}\) samples.

Meet-in-the-Middle Attacks. The idea behind a meet-in-the-middle attack is to separate the mathematical equations that describe a block cipher into two or more groups, in such a way that some variables do not appear in at least one of the groups of equations. After three rounds of \(\pi \), full diffusion occurs: every input bit affects every output bit. Similarly, \(\pi ^{-1}\) also reaches full diffusion after three rounds. As eight rounds of \(\pi \) consist of almost three full diffusions, meet-in-the-middle attacks should not be applicable to Even-Mansour block ciphers based on \(\pi \).

Note that the attacker is not allowed to perform chosen-ciphertext attacks. This limits the power of advanced meet-in-the-middle attacks based on the splice-and-cut technique, which was introduced for hash function cryptanalysis [2, 59] and subsequently applied to block ciphers as well [18, 62].

Biclique attacks [17, 42] are a further extension of splice-and-cut meet-in-the-middle attacks. Most applications of bicliques offer only slight improvements over brute force attacks [57]. Although brute-force-like attacks provide insight into the security of ciphers in the absence of other shortcut attacks, they do not affect the practical security of the cipher.

Rotational Cryptanalysis. A randomly chosen key \(K\) ensures that the input of the permutation \(\pi \) when used in an Even-Mansour block cipher will (with very high probability) have an asymmetrical state, thereby preventing rotational attacks [41].

Slide Attacks. Because every round of \(\pi \) is identical, slide attacks [13] are applicable to \(\pi \). However, in a slide attack, about \(2^{n/2}\) plaintext-ciphertext pairs are required before a slid pair is found. Therefore, slide attacks have a data complexity that goes beyond our security bound, and do not pose a threat to \(\pi \), nor to Even-Mansour block ciphers based on \(\pi \).

Fixed Points. Because \(\pi \) contains only the modular addition, XOR, and bitwise rotation operations, the permutation has the following fixed point: \(\pi (0^n)=0^n\). Fixed-points are a type of differentiability attack [50]. When \(\pi \) is used inside the \(E_{K\Vert K}\) block cipher, this fixed point corresponds to \(E_{K\Vert K}(K)=K\). If \(K\) is chosen uniformly at random, this relationship only holds with probability \(2^{-n}\) for any plaintext chosen by the attacker. Similar observations hold for \(E_{K\oplus K_1\Vert K_1}\) and \(E_{K\oplus K_2\Vert K_2}\). Although it may seem to be a bold move from a design point of view to allow that \(E_{0^{2n}}(0^n)=0^n\), we note that this property also holds for the stream cipher Trivium [25, 27] and the block cipher KATAN [26]. However, no attacks have been found that break the full version of these ciphers.
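
The correspondence between the fixed point of \(\pi \) and the relation \(E_{K\Vert K}(K)=K\) follows directly from the definition \(E_{X\Vert Y}(m)=\pi (m\oplus X)\oplus Y\):

$$\begin{aligned} E_{K\Vert K}(K) = \pi (K\oplus K)\oplus K = \pi (0^n)\oplus K = 0^n\oplus K = K. \end{aligned}$$

An attacker who does not know \(K\) can therefore only hit this plaintext by guessing.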

Dependency Between Key and Subkeys. As shown by Algorithm 2, the subkeys \(K_1\) and \(K_2\) are generated from the key \(K\) as \(K_1=xK\) and \(K_2=xK_1=x^2K\), so that \(K\oplus K_1=(x+1)K\) and \(K\oplus K_2=(x^2+1)K\). Theorem 1 requires that an attacker cannot distinguish \(E_{K\Vert K}\), \(E_{(x+1)K\Vert xK}\), and \(E_{(x^2+1)K\Vert x^2K}\) from each other. As shown by Theorem 2, this assumption holds if the underlying permutation \(\pi \) is an ideal permutation. We now argue that even if the permutation \(\pi \) of Sect. 3.2 is used instead, an attacker cannot distinguish these three block ciphers. Because of the rotational relations between the key \(K\) and the subkeys \(K_1\) and \(K_2\), rotational cryptanalysis [41] seems to be a promising technique. However, the fact that \((x+1)K\) and \(xK\), as well as \((x^2+1)K\) and \(x^2K\), both differ by \(K\) seems to effectively preclude the use of rotational cryptanalysis to distinguish \(E_{(x+1)K\Vert xK}\) or \(E_{(x^2+1)K\Vert x^2K}\) from \(E_{K\Vert K}\), or from each other. Furthermore, the security proof assumes that individual queries to the three aforementioned block ciphers are permitted, whereas an attacker can in practice only observe \(\tau \).

Other Attacks. We do not consider zero-sum attacks [5] and cube attacks [28] to be a threat for ARX ciphers, because the addition operation ensures that for every output bit, the polynomial expression in \(GF(2)\) representing this bit in terms of its inputs will be of sufficiently high degree. Moreover, rebound attacks [51] are not known to be relevant to secret-key algorithms.

7 Conclusion

Chaskey is a permutation-based MAC algorithm built around an ARX-based permutation \(\pi \) whose design is inspired by SipHash. Alternatively, Chaskey can also be interpreted as a block-cipher-based MAC algorithm based on an underlying Even-Mansour block cipher.

Inspired by the block-cipher-based CMAC, Chaskey avoids padding for messages of an integer number of blocks. Its subkey generation is even more efficient than that of CMAC, as it does not require any block cipher calls.

We proved that Chaskey is secure, based on the 3PRP-indistinguishability of three underlying Even-Mansour block ciphers. Assuming that the permutation \(\pi \) used in these Even-Mansour block ciphers is ideal, we proved that Chaskey is secure up to about \(D=2^{n/2}\) chosen plaintexts and about \(T=2^{n}/D\) queries to \(\pi \) or \(\pi ^{-1}\).

We remark, however, that the efficient permutation \(\pi \) designed for Chaskey shows properties that allow it to be distinguished from an ideal permutation. For example, it is easy to find a fixed point: \(\pi (0^n)=0^n\). Fortunately, this observation does not extend to an attack when this permutation is used inside an Even-Mansour block cipher, as finding this fixed point implies knowledge of the secret key.

Therefore, we explored the distinguishability of the three Even-Mansour block ciphers from a cryptanalysis point of view. After investigating a wide variety of currently known cryptanalysis attacks, we found no shortcut attacks resulting from using our proposed eight-round permutation \(\pi \) instead of an ideal permutation. We recommend, however, that implementers also support a 16-round Chaskey-LTS as a fallback in case of cryptanalytical breakthroughs.

Our benchmarks showed that Chaskey performs very well on ARM Cortex-M microcontrollers. We measured that our straightforward Chaskey implementations are between 7 and 15 times faster than AES-128-CMAC in speed-optimized implementations, and about 10 times smaller in area-optimized implementations. Because of the roughly linear relation between cycle count and energy consumption, we expect Chaskey to be much more energy efficient as well. Although 32-bit microcontrollers were our main target platform, Chaskey is also expected to perform well on 8-bit and 16-bit platforms.