1 Introduction

RFID tags and very constrained computing devices are expected to become increasingly important for many applications and industries. In parallel to this general trend, the growth of ubiquitous computing and communication interconnections naturally leads to more entry points for attackers and increased potential damage. Security is crucial in many situations, but is often set aside due to cost and feasibility constraints. In order to fulfill the need for cryptographic primitives that can be implemented and executed in very constrained environments (area, energy consumption, etc.), a.k.a. lightweight cryptography, the research community has recently made significant advances, in particular in the domain of symmetric-key cryptography.

Current NIST standards for block ciphers (AES [10]) or hash functions (SHA-2 [25] or SHA-3 [3]) are not really fit for very constrained environments, and several alternatives have been proposed, such as PRESENT [5], KATAN [7], LED [13], Piccolo [23] and TWINE [24] for block ciphers, and QUARK [1], PHOTON [12] and SPONGENT [4] for hash functions. Notably, the PRESENT block cipher is now part of an ISO standard [17]. All these proposals have greatly improved our knowledge of lightweight designs, and many already achieve close-to-optimal performance for certain metrics such as area consumption.

In practice, constrained devices will communicate either with other constrained devices or, more probably, with a server. In the latter case, the server is likely to handle a large number of devices, and while cryptography might be used in a protocol to secure the communications, the server also has to perform other application operations. Therefore, it is crucial that the server does not spend too much time performing cryptographic operations, even when communicating with many clients, and thus software performance does matter for lightweight cryptography.

At CHES 2012, Matsuda and Moriai [19] studied the application of bitslice implementations to the PRESENT and Piccolo block ciphers, concluding that current lightweight ciphers can be surprisingly competitive for cloud applications. Bitslice implementations allow impressive speed results and are also valuable for their inherent protection against various side-channel cryptanalyses. However, we argue that they might not really fit all the lightweight cryptography use cases where a server has to communicate with many devices. Indeed, constrained devices usually encipher a very small amount of data at a time. For example, in its smallest form, an Electronic Product Code (EPC), which is thought to be a replacement for bar codes using low-cost passive RFID-tags, uses \(64\), \(96\) or \(125\) bits as a unique identifier for any physical item. Enciphering such small amounts of data makes the cost of transforming the data into bitsliced form and of running the key schedule comparatively very expensive (these costs are usually omitted under the assumption that a rather large number of blocks will be enciphered).

Therefore, it is interesting to explore the software efficiency profile of lightweight ciphers not only for cloud applications but also for classical lightweight applications and, surprisingly, apart from embedded 8-bit architectures, they are not really designed to perform well in software on mid-range or high-end processors. For example, the currently best non-bitslice AES implementation reaches about 14 cycles per byte (c/B) on a 32-bit Pentium III CPU [20], while the currently best non-bitslice PRESENT implementation runs at about 130 c/B on the same processor (the implementation in [21] reports a performance of 16.2 cycles per bit). Therefore, we believe a gap exists in this area, even if very recent proposals such as TWINE do report good non-bitsliced software performance.

Our Contributions. In this article, we provide three main contributions on software implementations of lightweight ciphers on x86 architectures, with a special focus on LED, PRESENT and Piccolo. First, in Sect. 2, we argue that table-based implementations are still valuable in particular situations (some servers with “legacy” CPUs, pre-Core2, might lack the SSE instruction set required by optimized bitslice or vperm implementations) and we propose new interesting trade-offs, with a theoretical cache model to better predict which trade-off will be suitable for a given target processor. Our model is backed up by our experiments, and we obtain the best known table-based implementations for the studied lightweight ciphers.

Then, in Sect. 3, we push software implementations of lightweight ciphers further by exploring the vperm implementation trick for \(4\)-bit Sboxes, which has already proven to be interesting for AES [14] or TWINE [24], and which provides cache-timing attack resistance. We propose a strategy for our portfolio of lightweight ciphers, and we conclude that the linear layer, usually not the part contributing most to the computational cost, can have a significant impact on the performance of this type of implementation. We note that these implementations are interesting because they apply to almost all lightweight ciphers and they produce very efficient code.

Thirdly, in Sect. 4, we explore bitslice implementations of lightweight ciphers and we show that for some common use cases they are less interesting than for cloud applications [19]. In fact, bitslice implementations can be slower than table-based or vperm ones in some situations, for example when only a small number of blocks is enciphered per device. Moreover, previous bitslice analyses usually neglect the key schedule part, so we provide bitsliced versions of the key schedules. However, even in a bitsliced version, the key schedule can have a significant impact on the overall performance. We therefore revisit this type of implementation by taking into account various factors that are important in practice, such as the number of distinct keys, the number of blocks to be enciphered, etc. We note that we provide the first bitslice implementation of LED, greatly improving over the best known software performance of this cipher.

For all three primitives LED, PRESENT and Piccolo, we have coded all three implementation types with various trade-offs. Then, for the various crucial use cases presented in Sect. 5, we compare the different implementation strategies and we discuss our results in Sect. 6. For readers not familiar with them, we quickly describe in Appendix A the three lightweight ciphers that we take as examples for our implementations, and refer to the specification documents [5, 13, 23] for a complete description. Like many other lightweight ciphers, LED, PRESENT and Piccolo are \(64\)-bit ciphers based on the repetition of a round function built upon a \(4\)-bit Sbox and a linear diffusion layer, while the key schedule is very simple if not nonexistent. All these design choices are justified by the reduction of area consumption, and while smaller Sboxes are possible, \(4\) bits is a sensible choice that can provide very small area in hardware as well as ease of implementation in software.

The reader will find more details and code illustrations in the extended version of this paper [2]. Furthermore, the full source codes of the implementations presented in this paper are available online at https://github.com/rb-anssi/lightweight-crypto-lib.

2 Table-Based Implementations

2.1 Core Ideas

State of the Art. Tabulating operations for efficiency purposes is an old method well known by programmers. When applied to block ciphers, the goal is to tabulate as many of the operations composing one round as possible, so that its computation consists in:

  • selecting slices of the internal state by shift and mask operations;

  • performing several table lookups to achieve the round transformation;

  • aggregating lookup table outputs to get the updated internal state;

  • performing the key addition layer.

Such an approach has for instance been proposed by Daemen and Rijmen in [10] to perform AES operations efficiently on \(32\)-bit processors. An AES round can thus be computed with \(16\) table lookups, using \(4\) tables with \(8\)-bit inputs and \(32\)-bit outputs (each table having a size of about \(1\,\mathrm {KB}\)).
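For concreteness, here is a minimal C sketch of this classic 32-bit T-table AES round (the table contents Te0..Te3 are assumed precomputed; s0..s3 are the four 32-bit state columns):

```c
#include <stdint.h>

/* Assumed precomputed 1 KB tables folding SubBytes and MixColumns. */
extern const uint32_t Te0[256], Te1[256], Te2[256], Te3[256];

/* One AES round: 16 lookups, ShiftRows folded into the column offsets. */
static void aes_round(uint32_t t[4], const uint32_t s[4],
                      const uint32_t rk[4])
{
    for (int i = 0; i < 4; i++)
        t[i] = Te0[(s[i] >> 24) & 0xFF]
             ^ Te1[(s[(i + 1) & 3] >> 16) & 0xFF]
             ^ Te2[(s[(i + 2) & 3] >> 8) & 0xFF]
             ^ Te3[ s[(i + 3) & 3] & 0xFF]
             ^ rk[i];                       /* key addition */
}
```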

Table Size Tradeoffs. Many lightweight ciphers have in common a \(64\)-bit block and \(4\)-bit Sboxes, which allows many trade-offs when they are implemented on \(32\)- or \(64\)-bit processors with a table-based approach. It can be tedious to implement all these trade-offs and check which one provides the best results. Thus, in the following, we propose a model to directly find the most efficient implementation strategy. We emphasize that such a modeling method can be applied to any \(32\)- or \(64\)-bit architecture, but here we focus specifically on Intel x86 ones. If one considers a generic Substitution-Permutation Network (SPN) based lightweight cipher (like LED or PRESENT), the round function can be performed, following the table-based approach, as follows (the internal state being divided into \(64/m\) slices of \(m\) bits each, T0, T1, \(\ldots \) being tables with \(m\)-bit input and \(64\)-bit output, and MASKm being a mask with the \(m\) least significant bits set to \(1\)):

(Pseudo-code listing: generic table-based SPN round.)
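As an illustration, here is a minimal C sketch of this generic round for the \(m=8\) trade-off; the table names and layout are ours, assuming eight precomputed tables that each fold the Sbox and linear layers for one slice:

```c
#include <stdint.h>

/* Assumed precomputed tables: 8-bit slice in, 64-bit state out. */
static uint64_t T[8][256];

static uint64_t spn_round(uint64_t state, uint64_t round_key)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++)             /* one lookup per 8-bit slice */
        out ^= T[i][(state >> (8 * i)) & 0xFF];
    return out ^ round_key;                 /* key addition layer        */
}
```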

Note that the choice of \(64\)-bit output tables exactly fits the size of the cipher internal state, and thus one can directly include the linear layer in the table lookup. One round will then roughly be computed with shift, mask, table lookup and XOR operations, whose number will depend on the size of the tables (each table \(T\) having a size \(S_T = 2^m \times 8\) bytes). The main issue is to choose the best trade-off for \(m\). Indeed, the bigger the tables, the smaller the number of operations to compute during the round function, but the bigger the latencies of the table lookups. Furthermore, depending on \(m\), the Intel x86 instruction set allows some tricks (like mask-and-move instructions) that reduce the number of operations. We now focus on the assembly pseudo-code corresponding to one line (for \(m=4, 12, 16\)) or two lines (for \(m=8\)) of the generic SPN round function above (the assembly is in Intel syntax):

(Assembly pseudo-code listings for the \(m=4\), \(m=8\), \(m=12\) and \(m=16\) cases.)

Note that these pseudo-codes are for x86 \(64\)-bit architectures; on \(32\)-bit ones, more mov and xor instructions are required because table lookups only fetch \(32\)-bit words (refer to footnote a of the tables in the “Results” paragraph for more details).

Cache Latency Model. To perform memory accesses efficiently, modern processors have different cache memory levels (\(L1\), \(L2\) and sometimes \(L3\)). According to cache policy rules, the most used data will be stored in \(L1\), then in \(L2\), etc. When considering a table \(T\) of size \(|T|\), the probability \(P_{L1}\) that an element of \(T\) is in \(L1\) (cache of size \(|L1|\)) is \(P_{L1} = \frac{|L1|}{|T|}\) if \(|T| > |L1|\) and \(P_{L1} = 1\) otherwise. Furthermore, the probability \(P_{L2}\) that an element of \(T\) is in \(L2\) and not in \(L1\) (assuming that the element is either in \(L1\) or in \(L2\), i.e. \(|T| \le |L1| + |L2|\)) is \(P_{L2} = 1 - P_{L1}\). Thus, one can simply deduce the average latency \(l_T\) to load an element of \(T\) during a random access (which is the case for block ciphers):

$$\begin{aligned} l_T = l_{L1} \times P_{L1} + l_{L2} \times P_{L2} = l_{L1} \times P_{L1} + l_{L2} \times (1 - P_{L1}) \end{aligned}$$
(1)

with \(l_{L1}\) and \(l_{L2}\) denoting the latencies of \(L1\) and \(L2\) caches respectively. Since in this work we focus on x86 architectures, and more precisely on Intel ones, we now have to consider the size and the latency of their different cache memories. These numbers for several microarchitectures are given in the table below.

| Microarchitecture | \(L1\) size (KB) | \(L1\) latency (cycles) | \(L2\) size (KB) | \(L2\) latency (cycles) |
|---|---|---|---|---|
| Intel \(P6\) | \(16\) or \(32\) | \(3\) | \(512\) | \(8\) |
| Intel Core | \(32\) | \(3\) | \(1500\) | \(15\) |
| Intel Nehalem/Westmere\(^\mathrm{a}\) | \(32\) | \(4\) | \(256\) | \(10\) |
| Intel Sandy/Ivy Bridge\(^\mathrm{a}\) | \(32\) | \(5^\mathrm{b}\) | \(256\) | \(12\) |

\(^\mathrm{a}\) Westmere is the 32 nm die shrink of the Nehalem microarchitecture, and Ivy Bridge is the 22 nm die shrink of Sandy Bridge.
\(^\mathrm{b}\) Sandy and Ivy Bridge data L1 cache latency can drop to 4 cycles when the offset from the base address is less than 2048 bytes (see [16] for more details). However, rounding this to a fixed \(5\) cycles gives a reasonable first-order approximation for our model.
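As a quick sanity check of Eq. (1), the following C fragment evaluates the average lookup latency for a given total table size, using the Nehalem/Westmere figures from the table above. This is a sketch under the stated model assumptions (everything in \(L1\) or \(L2\)); the real estimate must also account for L3/RAM once \(|T| > |L1|+|L2|\):

```c
#include <stdio.h>

/* Eq. (1): average lookup latency for tables of total size tbl_kb,
   given an L1 of size l1_kb with latency l1_lat, and L2 latency l2_lat. */
static double avg_latency(double tbl_kb, double l1_kb,
                          double l1_lat, double l2_lat)
{
    double p_l1 = (tbl_kb > l1_kb) ? l1_kb / tbl_kb : 1.0;
    return l1_lat * p_l1 + l2_lat * (1.0 - p_l1);
}

int main(void)
{
    /* Nehalem/Westmere: L1 = 32 KB / 4 cycles, L2 latency 10 cycles. */
    /* m = 8: eight tables of 2^8 * 8 = 2 KB each, 16 KB total.       */
    printf("m=8  (16 KB):  %.1f cycles\n", avg_latency(16.0, 32.0, 4.0, 10.0));
    /* m = 12: assuming six tables of 2^12 * 8 = 32 KB each, 192 KB.  */
    printf("m=12 (192 KB): %.1f cycles\n", avg_latency(192.0, 32.0, 4.0, 10.0));
    return 0;
}
```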

Results. From the generic SPN round pseudo-code given above (and its assembly decomposition), one can compute the number of instructions required for one round according to the slice width \(m\) and the cache characteristics of the microarchitecture. We weight each instruction by its latency: we assume that a register-to-register shift, move or XOR has a latency of one cycle, whereas a table lookup has the average latency \(l_T\) defined in Eq. (1). The resulting theoretical average latency of one round, for different microarchitectures, is given in the tables below (for each architecture, the average latency is obtained by summing the latencies of all operations).

 

Theoretical number of instructions for one round (for different table input sizes \(m\)):

| Instruction type | \(m=4\) bits | \(m=8\) bits | \(m=12\) bits | \(m=16\) bits |
|---|---|---|---|---|
| Shift | \(15\) | \(3\) | \(5\) | \(3\) |
| Move/xor | \(15\) | \(8\) | \(5\) | \(3\) |
| Mask | \(16\) | \(0\) | \(5\) | \(0\) |
| Table lookup\(^\mathrm{a}\) | \(16\) \((32)\) | \(8\) \((16)\) | \(6\) \((12)\) | \(4\) \((8)\) |

Theoretical average round latency (for different table input sizes \(m\)):

| Microarchitecture | \(m=4\) bits | \(m=8\) bits | \(m=12\) bits | \(m=16\) bits |
|---|---|---|---|---|
| Intel \(P6\) | \(142\) | \(59\) | \(99\) | \(93\) |
| Intel Core | \(94\) | \(35\) | \(91\) | \(264\) |
| Intel Nehalem/Westmere | \(110\) | \(43\) | \(68\) | \(186\) |
| Intel Sandy/Ivy Bridge | \(126\) | \(51\) | \(79\) | \(114\) |

\(^\mathrm{a}\) The number in brackets denotes the cost on \(32\)-bit architectures, where two table lookups must be performed to obtain an entire \(64\)-bit output. This only holds if we consider general-purpose register loads: a single table lookup suffices when using SIMD registers, at the expense of additional loads to move the state back from SSE to general-purpose registers. For the sake of simplicity, we only consider general-purpose register code latency in this section.

Note that for \(m=16\) bits, we might also have to consider the \(L3\) or RAM latency depending on the \(L2\) size, and naturally extend Eq. (1). We verified these values experimentally by implementing and running such a generic SPN round for the different values of \(m\) considered, and could confirm the results on each of the considered microarchitectures. Note however that the experimental results do not exactly match the theoretical ones due to the superscalar nature of Intel architectures. Nevertheless, we emphasize that this model is sufficient for our purpose: one can deduce that \(8\)-bit slices seem to be the best trade-off from an efficiency point of view, whatever the microarchitecture, and we will apply this trade-off to each of the three lightweight ciphers in our portfolio. One can also notice that some counter-intuitive theoretical results are experimentally verified: for instance, \(16\)-bit input tables outperform \(4\)-bit input tables on some microarchitectures even though much of the data lies outside \(L1\) and \(L2\) (the reduced number of shift/move/mask operations compensates for the poor average table access latency). Even though this is not the core subject of our paper, this theoretical model can be used for performance comparisons of table-based implementations on other architectures such as ARM, SPARC or PowerPC.

Finally, the table-based implementation specifics of each cipher are described in the following sections.

2.2 LED

We build eight tables, each taking as input two row-wise adjacent \(4\)-bit Sbox words (thus an \(8\)-bit input) and providing \(64\)-bit output words. As an example, we give the description of the first table \(T_0\), which takes as input the two least significant \(4\)-bit words of the internal state (we denote these two words \(a_0\) and \(a_1\), while \(SB\) represents the Sbox and \(\otimes \) the multiplication in \(GF(2^4)\)):

$$\begin{aligned} T_0(a_0, a_1) = \left[ \begin{array}{cccc} 4 \otimes SB[a_0] & 4 \otimes SB[a_1] & 0 & 0\\ 8 \otimes SB[a_0] & 8 \otimes SB[a_1] & 0 & 0\\ B \otimes SB[a_0] & B \otimes SB[a_1] & 0 & 0\\ 2 \otimes SB[a_0] & 2 \otimes SB[a_1] & 0 & 0 \end{array} \right] \end{aligned}$$

Note that the row-wise packing of the \(4\)-bit input words and the \(64\)-bit output words allow the ShiftRows operation to be included directly inside the table, by carefully placing the meaningful output bits at their correct positions. Thus, the three round operations SubCells, ShiftRows and MixColumnsSerial are performed with \(7\) shifts and \(8\) masks (to select the eight \(8\)-bit slices of the internal state), \(8\) table lookups (to perform the three round operations) and \(7\) XORs to combine the eight table outputs.
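As an illustration, a table generator consistent with the matrix above could look as follows in C. The nibble layout of the 64-bit state word, and the omission of the ShiftRows-aware output placement (which the real tables fold in), are simplifying assumptions of ours:

```c
#include <stdint.h>

extern const uint8_t SB[16];                    /* LED Sbox             */
extern uint8_t gf16_mul(uint8_t a, uint8_t b);  /* mult. in GF(2^4)     */

static uint64_t T0[256];
static const uint8_t coef[4] = { 0x4, 0x8, 0xB, 0x2 };

/* Build T0 from the matrix above, assuming nibble i of the state
   sits at bits 4i..4i+3 (row-major). The real code must place the
   outputs according to the cipher's packing and the ShiftRows. */
static void build_T0(void)
{
    for (int a1 = 0; a1 < 16; a1++)
        for (int a0 = 0; a0 < 16; a0++) {
            uint64_t w = 0;
            for (int row = 0; row < 4; row++) {
                /* column 0 gets coef*SB[a0], column 1 gets coef*SB[a1] */
                w |= (uint64_t)gf16_mul(coef[row], SB[a0]) << (4 * (4 * row));
                w |= (uint64_t)gf16_mul(coef[row], SB[a1]) << (4 * (4 * row + 1));
            }
            T0[(a1 << 4) | a0] = w;
        }
}
```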

Furthermore, one extra table of \(31\) or \(48\) \(64\)-bit words (for LED-\(64\) and LED-\(128\) respectively) allows the AddConstants operation to be performed with only one table lookup and one XOR (again, we manipulate \(64\)-bit words in order to directly place the \(4\)-bit constants at their correct positions).

To summarize, one round of LED with \(64\)-bit keys (resp. \(128\)-bit keys) can be implemented with \(7\) shifts, \(8\) masks, \(8\) XORs and \(9\) table lookups, with memory requirements of \(16640\) bytes (resp. \(16768\) bytes). The tables are therefore small enough to fit mostly or even entirely in the L1 cache of the processor. We provide the pseudo-code for the \(i^{th}\) round computation below (T_cst being the table computing the AddConstants operation, and T0, T1, T2, T3, T4, T5, T6 and T7 being the eight tables performing the SubCells, ShiftRows and MixColumnsSerial operations):

(Pseudo-code listing: table-based LED round.)
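A hedged C reconstruction from the operation counts above, with the eight tables T0..T7 gathered in a single array for brevity (the slice extraction order is our assumption):

```c
#include <stdint.h>

/* Assumed tables as described in the text: T_cst for AddConstants,
   T[0..7] for SubCells + ShiftRows + MixColumnsSerial. */
extern const uint64_t T_cst[48];
extern const uint64_t T[8][256];

static uint64_t led_round(uint64_t x, unsigned i)
{
    x ^= T_cst[i];                   /* AddConstants: 1 lookup, 1 XOR   */
    uint64_t y = T[0][x & 0xFF];     /* slice 0 needs no shift          */
    for (int j = 1; j < 8; j++)      /* 7 shifts, 7 masks, 7 XORs       */
        y ^= T[j][(x >> (8 * j)) & 0xFF];
    return y;
}
```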

2.3 PRESENT

Encryption. Since PRESENT has a structure very similar to LED, we use the same implementation strategy. Eight tables are built, each taking as input two adjacent \(4\)-bit Sbox words (\(8\)-bit inputs) and providing \(64\)-bit output words, such that the tables also take the permutation layer into account. The round computation pseudo-code is exactly the same as for LED, except that there is no constant addition in the round function. Therefore, one PRESENT round is performed with \(7\) shifts, \(8\) masks, \(8\) table lookups and \(7\) XORs, and requires eight tables of \(2048\) bytes each, thus \(16384\) bytes in total. The tables are therefore small enough to fit mostly or even entirely in the L1 cache. An example of how to build the tables is provided in Appendix C.1 of [2].

Key Schedule. The PRESENT key schedule is quite costly in software, due to the \(61\)-bit rotation over the full size of the master key (especially for the \(80\)-bit key version, which does not fit exactly within a multiple of the x86 general-purpose register size). Using two small tables of \(31\) and \(16\) \(64\)-bit words, one can compute the round counter addition and the key schedule Sbox lookup with only a single table lookup and a XOR (the \(128\)-bit key version performs two adjacent Sbox calls in the key schedule, so the second table contains \(256\) elements in the case of PRESENT-\(128\)). We provide the pseudo-code of the \(80\)-bit version in Appendix C.1 of [2].
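For reference, here is a minimal scalar C sketch of the 61-bit left rotation of the PRESENT-80 key register; the variable names and the hi/lo split are our assumptions, and the table-based round counter addition and Sbox lookup described above are omitted:

```c
#include <stdint.h>

/* PRESENT-80 key register: hi holds key bits 79..16, lo bits 15..0.
   Rotate the 80-bit register left by 61 bit positions. */
static void rotl61_80(uint64_t *hi, uint64_t *lo)
{
    uint64_t h = *hi, l = *lo & 0xFFFF;
    uint64_t low19 = ((h & 0x7) << 16) | l;  /* key bits 18..0          */
    *hi = (low19 << 45) | (h >> 19);         /* new key bits 79..16     */
    *lo = (h >> 3) & 0xFFFF;                 /* new key bits 15..0      */
}
```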

2.4 Piccolo

Encryption. The table-based implementation of Piccolo is slightly different from those of LED and PRESENT, since Piccolo has a Feistel network structure. In order to tabulate the internal function \(F\) as much as possible, we divide it into two parts. The first part packs the first Sbox layer of \(F\) together with the subsequent diffusion layer. It yields two tables with \(8\)-bit inputs and \(32\)-bit outputs (two Sbox inputs are handled at a time), which can be used to perform the first part of \(F\) in both branches of the Feistel. The second part computes the second Sbox layer only. It is implemented using four tables with \(8\)-bit inputs and \(64\)-bit outputs (two tables per branch), again allowing the \(16\)-bit branches to be placed at their correct positions before the byte permutation at the end of the round. We explain in Appendix C.2 of [2] how to build these tables; the total amount of memory required is \(10240\) bytes, which is small enough to fit entirely in the L1 cache of the processor. The final byte permutation of a Piccolo round can then be computed efficiently with two masks, two \(16\)-bit rotations and one XOR. We provide the pseudo-code for the \(i^{th}\) round computation of Piccolo in Appendix C.2 of [2].

Key Schedule. The \(80\)-bit and \(128\)-bit versions of the Piccolo key schedule are slightly different; nevertheless they share a similar core, which consists of selecting \(16\)-bit slices of the master key and XORing them with constant values. Hence, we build one extra small table of \(25\) \(64\)-bit words (or \(31\) words for Piccolo-\(128\)) corresponding to the constant values. Then, we prepare several \(16\)-bit slices of the master key in \(64\)-bit words, and the key schedule can be performed with only a single table lookup and one XOR operation. Note that the permutation used in the \(128\)-bit version of the key schedule can be efficiently implemented with two masks, two \(16\)-bit rotations and one XOR.

3 Implementations Using vperm Technique

3.1 Introducing the vperm Technique

Vector Permute, abbreviated vperm, is a technique that uses vector permutation instructions in order to implement table lookups, taking advantage of the SIMD engine present inside modern CPUs. The main advantages of the vperm technique are parallel table lookups and resistance against timing side-channel attacks. Applied to block cipher implementations, this technique originally comes from [14]. It has also proven to be efficient for multivariate cryptography [8].

The main idea behind the vperm technique is to use shuffling instructions to look up small tables. Although this technique can be used on different architectures where SIMD shuffling instructions are present (for instance AltiVec on PowerPC, or NEON on ARM), we will exclusively focus on their x86 flavor, namely the pshufb instruction. This instruction was introduced with the SSSE3 extension that came with the Intel Core microarchitecture.

Regarding lightweight block ciphers, the vperm technique has already been applied to TWINE [24], yielding very competitive results on the x86 platform (6.87 c/B for a 2-message encryption). However, no results are available for other lightweight block ciphers. In the following subsections, we study how the vperm technique fits LED, Piccolo and PRESENT. We will show that, although the confusion layer is quite straightforward to implement using vperm, the linear diffusion layer can be challenging.

3.2 Core Ideas for vperm Applied to Lightweight Block Ciphers

In this section, we briefly describe the main implementation ideas that are common to LED, PRESENT and Piccolo (as well as to many lightweight block ciphers).

The pshufb Instruction. pshufb is a byte shuffling instruction that takes two 128-bit operands as input, e.g. the xmm0 and xmm1 registers (see Fig. 1). The bytes of the destination operand xmm0 are shuffled according to the source operand: xmm1 is used as a permutation mask selecting bytes inside xmm0. Only the low nibbles of the bytes in xmm1 are used, and if the MSB of a byte is set the resulting byte is zeroed. The result of the shuffling is stored inside xmm0. The second operand can be a memory reference. There is an AVX variant vpshufb of pshufb that takes three operands as input, the first operand being the destination. This avoids the movdqa operation normally required to preserve the destination, and thus saves some CPU cycles.

Fig. 1. Result of “pshufb xmm0, xmm1”: bytes in xmm0 are shuffled according to the content of xmm1. Only the low nibble of each byte in xmm1 is used as an index: index 0x10 = \(16\) in the example translates to \(0\). When a byte in xmm1 is \(\ge 128\) (i.e. its high bit is set, as for \(255\) in the example), the destination byte is zeroed.

Sbox Implementation. Most lightweight block ciphers use Sboxes on GF\((2^4)\), meaning that the Sbox is a table of \(16\) possible output values selected by the input. It is straightforward to implement such an operation on 16 nibbles, in parallel, using the pshufb instruction (see Fig. 1). The piece of code in Fig. 2 performs 16 parallel Sbox lookups on the 16 low nibbles of the r register (the high nibbles of each byte \(r_i\) are supposed to be zero). The same technique can be used for any function from GF\((2^4)\) to itself (including multiplications over the field), or even any function from GF\((2^4)\) to GF\((2^8)\) without any loss of parallelism since one xmm register is 16-byte long. Hence, the MixColumns-like multiplications can be stored inside \(16\)-byte tables. More specifically, we can store the composition of the Sbox and the multiplication inside a single \(16\)-element table, as explained in Sect. 2.

Fig. 2. Using pshufb for 16 parallel Sbox lookups (the high nibbles of bytes \(r_i\) are supposed to be zero).
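In C intrinsics, the idea of Fig. 2 boils down to a single _mm_shuffle_epi8 (pshufb); this is our own minimal sketch, not the paper's listing:

```c
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* 16 parallel 4-bit Sbox lookups: 'sbox' holds the 16 Sbox outputs in
   one xmm register; the high nibble of every byte of 'r' must be zero. */
static __m128i sbox16(__m128i sbox, __m128i r)
{
    return _mm_shuffle_epi8(sbox, r);   /* one pshufb = 16 lookups */
}
```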

Extending to Bigger Tables. pshufb can be used to compute a 4-bit to 8-bit function, and the same technique works for lookups inside tables with even bigger output sizes (as long as they are multiples of 8 bits). As presented in Fig. 8 in Appendix D.1 of [2], if we consider, for instance, a 16-bit output table, we can pack this table inside two 128-bit xmm registers \(s^{0}\) and \(s^{1}\). The \(i^{th}\) entry of the table (\(0\le i \le 15\)) is then represented by the \(16\)-bit value formed by \(s_{i}^{0}\) and \(s_{i}^{1}\). Therefore, one can perform \(16\) parallel table lookups (with 16 output bits) by using two pshufb in a row, the first one fetching the first half of the word into \(t^{0}\), and the second one fetching the second half into \(t^{1}\).

Fig. 3. Packing for the 2-parallel vperm implementation of PRESENT and Piccolo. Each rectangle represents a 4-bit nibble.

Message Packing and Unpacking. Lightweight block cipher states are 64 bits long, which means that two of them can be stored inside a 128-bit xmm register. However, the natural packing that concatenates the two states side by side inside a register is not optimal. This is due to the fact that the algorithms we focus on use nibble-based permutations as part of their linear diffusion layer, and implementing such permutations with shift or rotation operations can be costly. However, if the two states are packed by interleaving their nibbles as presented in Fig. 3, any nibble permutation can be realized with pshufb, since it is now mapped to a byte permutation. The packing and unpacking are easily implemented using a few shift and pshufb operations; their cost, around ten cycles, is marginal compared to the encryption process. Using this packing, one can apply \(32\) Sbox lookups on the two states by using two pshufb, two pand masks and one 4-bit right shift psrlw to isolate the low and high nibbles, and one pxor to merge the two results. As we will explain, this packing will be applied to Piccolo and PRESENT, but not to LED.
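Putting this together, here is a hedged intrinsic sketch of the 32 parallel Sbox lookups just described; the pre-shifted high-nibble table is our assumption, chosen so the two halves merge with the single pxor mentioned above:

```c
#include <tmmintrin.h>   /* SSSE3 */

/* 32 parallel Sbox lookups on two interleaved 64-bit states:
   2 pshufb + 2 pand + 1 psrlw + 1 pxor, as in the text.
   sbox_lo holds S[x]; sbox_hi holds S[x] << 4 (assumed layout). */
static __m128i sbox_layer(__m128i st, __m128i sbox_lo, __m128i sbox_hi)
{
    const __m128i m = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_and_si128(st, m);                     /* low nibbles  */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(st, 4), m);  /* high nibbles */
    return _mm_xor_si128(_mm_shuffle_epi8(sbox_lo, lo),
                         _mm_shuffle_epi8(sbox_hi, hi));
}
```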

Using AVX Extensions. On the last two Intel CPU generations (Sandy and Ivy Bridge), a new instruction set, AVX, has been added to extend Westmere's SSE 4.2. The 128-bit xmm registers have also been extended to 256-bit ymm registers; however, most of the instructions do not operate on the full ymm registers, but only on their low 128-bit part. The full AVX extensions operating on ymm will be introduced with AVX2 on the forthcoming Haswell architecture. All the presented encryption algorithms and their key schedules can still benefit from AVX on Sandy and Ivy Bridge by using the three-operand form of the instructions, which saves register backups. For instance, table lookups can be performed with the single instruction “vpshufb t, s, r” instead of the two instructions “movdqa t, s; pshufb t, r”.

3.3 LED

The LED block cipher does not have a key schedule per se, and since the decryption process is no more complex than the encryption one (the coefficients of the inverse diffusion matrix are the same as for the original matrix), we will only focus on the latter. As explained previously, the \(4\)-bit Sbox layer can be implemented in a few cycles by using pshufb, masks and shifts. The ShiftRows is also immediate with a pshufb, using the interleaved nibble packing described above. However, the MixColumnsSerial step uses field multiplications by 11 different constants (4, 2, B, ...). Using as many as \(11\) lookup tables, one per multiplicative constant, would be too costly, as they would not leave room for the state and other operations inside the xmm registers. We could also use the fact that LED's MDS matrix is the power of a simpler sparse matrix that uses fewer constants: the drawback is that raising to the power \(4\) would mean that all the operations would have to be applied four times.

We found a better implementation strategy for LED: we can use the table-based tricks to store the Sbox and MixColumnsSerial layers inside xmm register-based tables. Each column can be stored inside a \(2^{4}\times 2=32\)-byte table (thus 2 xmm registers). Hence, \(4\) pairs of xmm registers store the \(4\) tables needed to perform a round of LED, and lookups inside each table are performed in a vectorized way for each nibble of the state using two pshufb, as described in Sect. 3.2. The drawback is that the output words end up in different xmm registers, but the repacking of this step can be combined with the ShiftRows layer that shuffles the columns. We also use por masking to force the MSB of the bytes that are not concerned by a lookup in a specific table. For each LED round, \(8\) pshufb instructions are used for the lookups, and \(6\) pshufb for the shifting layer (ShiftRows and repacking). This implementation strategy does not use the specific state packing of Fig. 3, since the shuffling for the ShiftRows and the table repacking can be expressed using pshufb. However, one should notice that there is a small message packing cost for LED due to its row-oriented message loading into the state: the input message is packed in a column-wise fashion, and the ciphertext is packed back row-wise.

3.4 PRESENT

PRESENT can benefit from the vperm technique in both encryption and key schedule, since the latter uses Sbox lookups for subkey computations.

Encryption. As explained in [21] and shown in Fig. 4, PRESENT's pLayer permutation can be seen as the composition of two permutations: one acting on bits inside groups of \(16\) bits, and one shuffling the nibbles inside the 64-bit state. As for the table-based implementations of LED, it would be possible to compute a PRESENT round by using 4-bit to 16-bit tables that merge the Sbox and the first permutation of the pLayer acting on 16-bit groups. Nevertheless, we have found that this permutation can be implemented more efficiently by moving groups of bits (see the code in Fig. 4 for more details). The second permutation, mapped to a byte permutation thanks to the message packing of Fig. 3, can be expressed as one pshufb.

Fig. 4. PRESENT's pLayer represented as two permutations.

Key Schedule. A straightforward method for implementing PRESENT's key schedules for 80-bit and 128-bit keys would be to store the master key inside one xmm register, compute all the subkeys by using rotations on the register (80-bit or 128-bit rotations depending on the key size), and extract the high part to get each subkey. However, the SSE instruction set lacks a rotation instruction. One could implement the rotation with two shifts and one XOR (plus one AND in the \(80\)-bit key case), but the shift instructions ps(r/l)ldq that operate on the entire \(128\)-bit xmm register can only handle shift values that are multiples of 8 bits. Since PRESENT's key schedules use a 61-bit rotation, such instructions are too costly. Instead, we split the high and low key parts between two xmm register halves, which allows us to perform the 61-bit rotation with the quadword shift instructions ps(r/l)lq. It is possible to vectorize the key schedule over two keys by using this trick, since the high parts of the two xmm registers can be used to store the high and low key parts of the second master key:

(Pseudo-code listing: vectorized 61-bit key rotation inside xmm registers.)

One should notice that since the PRESENT vperm encryption uses packed messages, the scheduled keys that will be XORed into the cipher state must be packed in the same way. However, the 61-bit rotation is not compatible with the nibble interleaving packing of Fig. 3, which means that the key schedule cannot easily be performed with this data packing. This implies that all the subkeys have to be packed after they have been generated, which explains the high key schedule packing cost reported in Appendix B.

3.5 Piccolo

Encryption. Piccolo's \(F\) function uses a circulant MixColumns matrix over GF(\(2^4\)), which allows using three \(16\)-byte tables, namely the Sbox, the Sbox composed with the multiplication by 2 in the field, and the Sbox composed with the multiplication by 3. Two states of Piccolo are stored in one xmm register with the nibbles interleaved as in Fig. 3. It is then possible to implement one Piccolo round with two \(F\) functions in parallel inside the xmm register, by using three pshufb for the three multiplication lookups (by 1, 2, and 3). Three more pshufb on the results and three pxor are necessary in order to perform the column mixing according to the circulant nature of the MixColumns matrix. The second layer of Sbox lookups in \(F\) can be performed with only one pshufb. Finally, Piccolo's Round Permutation is realized with a single pshufb, since it is a byte permutation. The piece of code given in Appendix D.2 of [2] illustrates these steps (it is suited for the low nibbles of the state; almost the same code is used for the high nibbles).

Key Schedule. Piccolo's key schedules for 80- and 128-bit keys do not really benefit from the vperm technique, since neither Sbox lookups nor field multiplications over GF\((2^4)\) are performed. The same implementation tricks as presented in Sect. 2.4 are used in order to minimize the number of operations extracting the master key subparts. The main difference with the table-based key schedule implementation is that in the vperm case, the process is performed inside xmm registers with SSE instructions: the main benefit is that one can vectorize the key schedule over two master keys, performing all the operations in the nibble-interleaved packing format of Fig. 3. This results in an optimized key schedule for two keys that requires almost the same number of cycles as the table-based implementation for one key (see results in Appendix B).

4 Bitslice Implementations

Bitslice implementations often lead to impressive performance results, as shown for example in [19] for PRESENT and Piccolo. However, we would like to also take into account the key schedule cost, which might not be negligible in several typical lightweight cryptography use cases, such as short data or independent keys for different data blocks (see Sect. 5.2 for specific examples). As a consequence, exploring the bitslice possibilities for the various key schedules is of interest. In particular, many distinct keys might be used for the encryption, and non-bitsliced key schedules might kill the parallelism gain if one does the packing for each round key (packing/unpacking takes a number of cycles comparable to the encryption itself in most cases). To our knowledge, bitsliced key scheduling has never been studied for lightweight block ciphers, and we provide results for the three ciphers in this section. One of our conclusions is that some key schedules can significantly slow down performance depending on the use case, which somewhat moderates the results exposed in [19].

4.1 The Packing/unpacking

The choice of an appropriate packing inside the xmm registers is important for a bitslice implementation. For the LED bitsliced version with 16 parallel blocks, we use the packing described in Fig. 9 in Appendix E.4 of [2]. The packing for 32 parallel blocks is identical (see Fig. 11 of [2]). Note that the packing used for PRESENT is the same as for LED (such a packing can be obtained with a little more than one hundred instructions).

The (un)packing for Piccolo with 16-parallel blocks depicted in Fig. 10 of [2] is very similar and requires a few more instructions. The reader can refer to [19] for details and code about this.

4.2 The Encryption

An important part of the encryption cost comes from the Sboxes, but the bitslice representation allows many of them to be computed in parallel within a few clock cycles. We recall the logical instruction sequences proposed in [19] in Appendix E.1 of [2] for the LED and PRESENT Sbox, and in Appendix E.2 of [2] for the Piccolo Sbox.

The second part of an encryption round is the linear diffusion layer. For LED, the ShiftRows is simply performed with a few pshufb operations, and the MixColumnsSerial is handled with the same method as in [18] for AES or in [19] for the Piccolo diffusion matrices. In the case of LED, one also has to consider the XORing of round-dependent constants during the AddConstants function, but this can be done easily by preparing the constants in bitsliced form in advance. For PRESENT, the bit permutation pLayer can be performed by just reorganizing the positions of the 16-bit (or 32-bit) words in the xmm registers in bitsliced form. This can be executed efficiently [19] using a few pshufd, punpck(h/l)dq and punpck(h/l)qdq instructions (see the pseudo-code in Appendix E.3 of [2] for \(8\)-parallel data blocks). For Piccolo, the nibble position permutation (performed with a few pshufd instructions) and the matrix multiplication are similar to the ones in [19].

4.3 The Key Schedule

As previously explained, the key schedule cost can be prohibitive in certain use cases when it comes to bitslicing. Thus, it seems reasonable to design bitsliced versions of the key schedules: this leverages possible parallelism when many keys are processed, and prepares the keys in their packed format so that XORing them into the bitsliced state is straightforward. As a consequence, the bitslice format for the key must be the same as for the data, or at least very similar, so that the repacking cost is small. To minimize the key schedule cost, the packing is performed only once, on the original keys, from which the subkeys are produced by shift and masking operations.

LED. No key schedule is defined for LED. Only the original secret key has to be packed in the data bitsliced format (one \(64\)-bit key for LED-64 and two \(64\)-bit keys for LED-128, other sizes use a sliding window requiring some additional shifts and masks).

Piccolo. The key schedule is very light: it basically consists of selecting \(16\)-bit chunks from the original secret key and XORing them with round constants. Similarly to LED, our implementation first prepares the \(16\)-bit chunks in bitsliced format once and for all. Thanks to the adapted packing, the two \(16\)-bit key words appear in the same registers. For instance, with \(16\)-block parallelism, \(8\) xmm registers are required to store the data and each round key; in our case, however, only \(4\) are needed to store one round key, because the other \(4\) would contain only \(0\)s and can be discarded. This halves both the storage and the key-addition operations. Another important observation is that even-indexed chunks appear only in the left parts of the round keys, and odd-indexed chunks appear only in their right parts. Hence, we can pre-position these chunks only once, and the key schedule then involves only XORing the appropriate two chunks and the constants. To reduce the number of packing operations, we first pack all the original secret keys without re-positioning, and then do the pre-positioning for the subkeys. These arrangements minimize the overall number of operations required by the key schedule.

PRESENT. The key schedule of PRESENT is not well suited for software, and even less so when the key data has to be in bitsliced format. We divide the keys into two chunks (\(64\) and \(16\) bits for PRESENT-80, and two \(64\)-bit chunks for PRESENT-128) and prepare them in bitsliced format using the same packing as the data (the first chunks of all keys are packed together, and likewise the second chunks). The subkey to be XORed into the cipher internal state every round is located in the first chunk. The constant addition of the key schedule update function is simply handled by pre-formatting the constants in bitsliced format and XORing them into the chunks (in fact, only one chunk is impacted if this step is done before the rotation). Then, the Sbox layer is performed by using the same Sbox function as for the cipher itself, simply ensuring with a mask that only a single Sbox is applied. Finally, the \(61\)-bit rotation is separated into two parts, since only rotations by multiples of \(4\) bits are easy to handle in the bitslice packing. First, a \(60\)-bit rotation is applied using several pshufb instructions (together with masking and XORs). Then, a single-bit rotation is computed by changing the ordering of the xmm registers (the xmm registers containing the third Sbox bits will now contain the second Sbox bits, etc.). An adjustment is eventually required, as some bits will go beyond the register limit and must switch to another register (this can be done with more shifts, masks and XORs). We provide the pseudo-code for the bitsliced key schedule implementation of PRESENT-80 in Appendix E.5 of [2].

4.4 Discussions

To have a fair view of the workload of the key schedules, we minimized the number of packing operations in each implementation. The table below, deduced from Table 2, shows the ratio of the key schedule cost (including the packing of the keys and the subkey generation) over one data block encryption (including plaintext packing, encryption and ciphertext unpacking) for \(16\)-block parallelism. In other words, it represents the workload increase when taking the key schedule into account. LED is affected only slightly, Piccolo by a quarter, and PRESENT by more than half.

 

| | LED-64 | LED-128 | Piccolo-80 | Piccolo-128 | PRESENT-80 | PRESENT-128 |
|---|---|---|---|---|---|---|
| Key schedule ratio | 3.3 % | 4.1 % | 20.2 % | 26.7 % | 55.2 % | 59.9 % |

5 Analyzing the Performance

5.1 Framework for Performance Evaluation

In order to compare various implementation techniques, we will consider that a server is communicating with \(D\) devices, each using a distinct key. For each device, the server has to encipher/decipher \(B\) \(64\)-bit blocks of data. Moreover, we distinguish between the cases where the enciphered data comes from a parallel operating mode (like CTR) or a serial one (like CBC).

Now, we would like to take into account the fact that some implementations can be faster when some parallelism is possible (like the bitslice technique). Let \(t_{E}\) be the time required by the implementation to perform the encryption process (without the key schedule and without the packing/unpacking of the input/output data). Let \(P_E\) denote the number of blocks that the implementation enciphers at a time in an encryption process (i.e. the number of blocks the implementation was intended to be used with). Similarly, let \(t_{KS}\) be the time required by the implementation to perform the key schedule process (without the packing of the key data), and we naturally extend the notation to \(P_{KS}\).

We remark that ciphering a lower number of blocks than \(P_E\) (resp. \(P_{KS}\)) will still require time \(t_E\) (resp. \(t_{KS}\)). However, contrary to the encryption or key schedule process, the packing/unpacking time of the input/output data will strongly depend on the number of blocks involved. Therefore, if we denote by \(t_{pack}\) the time required to pack one block of data, we get that packing \(x\) blocks simply requires \(x \cdot t_{pack}\). Similarly, we denote \(t_{unpack}\) the time required to unpack one block of data and unpacking \(x\) blocks simply requires \(x \cdot t_{unpack}\). For the key schedule, \(t_{packKS}\) denotes the time to pack the key data, and packing \(x\) keys requires \(x \cdot t_{packKS}\) (there is no need to unpack the key).

Finally, depending on \(D\) and \(B\), the average time per block required to encrypt all \(D \cdot B\) data blocks with a parallel operating mode is given by:

$$\begin{aligned} s_{parallel}(D,B)&= \frac{\text {total encryption time}}{\text {number of data blocks enciphered}}\\&= \frac{ \left\lceil \frac{D \cdot B}{P_E} \right\rceil \cdot t_{E} + \left\lceil \frac{D}{P_{KS}} \right\rceil \cdot t_{KS} + D \cdot B \cdot (t_{pack} + t_{unpack}) + D \cdot t_{packKS}}{D \cdot B}\\&= \frac{ \left\lceil \frac{D \cdot B}{P_E} \right\rceil \cdot t_{E} + \left\lceil \frac{D}{P_{KS}} \right\rceil \cdot t_{KS}}{D \cdot B} + t_{pack} + t_{unpack} + \frac{t_{packKS}}{B}. \end{aligned}$$

However, when using a serial operating mode, the average time per block required to encrypt all \(D \cdot B\) data blocks is given by:

$$\begin{aligned} s_{serial}(D,B)&= \frac{\text {total encryption time}}{\text {number of data blocks enciphered}}\\&= \frac{ \left\lceil \frac{D}{P_E} \right\rceil \cdot B \cdot t_{E} + \left\lceil \frac{D}{P_{KS}} \right\rceil \cdot t_{KS} + D \cdot B \cdot (t_{pack} + t_{unpack}) + D \cdot t_{packKS}}{D \cdot B}\\&= \frac{ \left\lceil \frac{D}{P_E} \right\rceil \cdot t_{E}}{D} + \frac{\left\lceil \frac{D}{P_{KS}} \right\rceil \cdot t_{KS}}{D \cdot B} + t_{pack} + t_{unpack} + \frac{t_{packKS}}{B}. \end{aligned}$$

Since table-based implementations are usually not faster when given the opportunity to encipher several blocks at a time, we have \(P_E=P_{KS}=1\) and \(t_{pack} = t_{unpack} = t_{packKS} = 0\). Therefore, for table-based implementations we get \(s(D,B)=t_{E} + t_{KS}/B\). At the other extreme, bitslice implementations encipher many data blocks in parallel. Note that vperm implementations stand in between table-based and bitsliced versions, since a slight parallelism might increase the speed.
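For concreteness, a direct C transcription of the two formulas (a sketch; the t_* and P_* inputs are the measured cycle counts and parallelism degrees from Table 2):

```c
#include <math.h>

/* Average cost per block, parallel operating mode (first formula). */
double s_parallel(double D, double B,
                  double t_E, double P_E, double t_KS, double P_KS,
                  double t_pack, double t_unpack, double t_packKS)
{
    return (ceil(D * B / P_E) * t_E + ceil(D / P_KS) * t_KS) / (D * B)
           + t_pack + t_unpack + t_packKS / B;
}

/* Average cost per block, serial operating mode (second formula). */
double s_serial(double D, double B,
                double t_E, double P_E, double t_KS, double P_KS,
                double t_pack, double t_unpack, double t_packKS)
{
    return ceil(D / P_E) * t_E / D
           + ceil(D / P_KS) * t_KS / (D * B)
           + t_pack + t_unpack + t_packKS / B;
}
```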

In previous bitslice implementations, since many blocks are assumed to be enciphered, the key schedule cost is usually omitted. However, in this article we are interested in use cases where, for example, \(B\) can be a small value, down to a single block. When \(B\) is small, one can see that the relative cost of the key schedule has to be taken into account.

5.2 The Use Cases

In order to have a clearer picture of the various scenarios that might be encountered in practice, we chose to study six distinct and meaningful use cases, depending on the values of \(D\) and \(B\) and the type of encryption operating mode. The six situations are given in Table 1 together with some examples. One might argue that the first use case is not really interesting, since with only a few blocks and a few devices the server would not be overloaded by the encryption/decryption work. However, latency can be an important criterion for many applications, and thus this use case checks the ability of the server to perform the cryptographic operations rapidly in software.

Table 1. Six device/server use cases for lightweight encryption. For the practical measurements given in Table 2 in Appendix B, the notation “big/small” refers to more/less than \(10\) on average. For the experiments in Table 2 we used \(1000\) and \(1\).

6 Results and Discussions

6.1 Implementation Results

We have performed measurements of our three types of implementations for our three lightweight candidates. For better precision, the encryption times have been measured with a differential method, checking consistency by verifying that the sum of the subparts is indeed equal to the entire encryption. Moreover, the measurements have been performed with the Turbo-Boost option disabled, in order to avoid any dynamic upscaling of the processor's clock rate (this technology has been implemented in certain processor versions since the Intel Nehalem CPU generation). We observe that our bitslice implementation timings for Piccolo and PRESENT are consistent with the ones provided in [19]. Moreover, we greatly improve over the previously best known LED software implementations (about 57 c/B on a Core i7-720QM [13]), since our bitsliced version can reach speeds up to 12 c/B.

We give in Table 2 in Appendix B all the implementation results on Core i3–2367M (Sandy Bridge microarchitecture), XEON X5650 (Westmere microarchitecture) and Core 2 Duo P8600 (Core microarchitecture) processors. Using the measurements for \(t_E\), \(t_{KS}\), \(t_{pack}\), \(t_{unpack}\), \(t_{packKS}\) in our framework from Sect. 5, we can infer the performances for the 6 use cases.

6.2 Comparing the Implementation Types and the Ciphers

We can extract general tendencies from our measurements (see Table 1): bitslice implementations perform better than table-based or vperm ones, except for the two use cases where only a few devices are involved and data blocks can only be handled one at a time (bitslice implementations are naturally not fit for low-latency, single-block settings). More surprisingly, even for two of the remaining use cases, the gain of bitslice over table-based or vperm implementations is only clear when more than 10 devices are involved.

For bitslice implementations, the cost of bitsliced form transposition on the server can be removed if the device also enciphers in bitsliced format. However, depending on the type of constrained device, the bitsliced algorithm might perform very poorly and the communication cost would increase if a serial mode is used or if a small amount of data is enciphered. Moreover, this solution would reduce the compatibility if other participants have to decipher in non-bitsliced form. The same compatibility issue is true for the keys in the server database, if one directly stores the keys or subkeys in bitsliced form. Finally, it is to be noted that bitsliced versions of the key schedule are especially interesting when all the keys are changed at the same time (i.e. fixed message length, messages synchronized in time).

We can see that, from a software implementation perspective, all three ciphers perform reasonably well and are in the same speed range. Their internal round functions are quite fit for x86 architectures. Table-based implementations are helped by the small \(64\)-bit internal state size. The vperm implementations are fast thanks to the small \(4\)-bit Sboxes, even though the linear diffusion layer can significantly impact the performance (which is the reason why TWINE performs so well with vperm implementations). For PRESENT, the bit permutation layer is not really suited for software; the LED diffusion matrix has complex coefficients when not in its serial form; and the Piccolo \(F\) function, with two layers of Sboxes, reduces the possibilities for improvement. Concerning the key schedule, having a byte-oriented key schedule, or none at all, is an advantage, and a bitwise rotation as in PRESENT is clearly difficult to handle in software.

Much research has been conducted on block cipher constructions, and building a good cryptographic permutation is now well understood by the community. However, this is not the case for the key schedule; usually, block cipher designers try to build a key schedule very different from the round function, in the hope of avoiding any unpredicted relation that might arise between the two components. We remark that this is in contradiction with efficient parallel implementations (like bitslice), since the packing of the key and of the block cipher internal state must be (almost) the same (otherwise the repacking cost for every generated subkey would be prohibitive).

It is also to be noted that, when analyzing cipher software performance on the server side, it is more likely that decryption will have to be performed instead of encryption. We emphasize that in the case of PRESENT, the decryption process would have the same performance as our encryption implementations. For LED and Piccolo, the inverse matrix of the diffusion layer has more complex coefficients than the encryption one (only the non-serialized matrix in the case of LED), but this should not impact table-based implementations. However, this might have an impact on our best-performing implementations for Piccolo, whose decryption counterparts are likely to be somewhat slower than the encryption mode.

6.3 Future Implementations

The forthcoming Haswell architecture will introduce the new AVX2 instruction set. As discussed in Sect. 3.2, this extension will permit most of the existing SSE instructions to operate on the full 256-bit ymm registers. Apart from the three-operand improvement already utilized in our vperm versions (Sect. 3.2), all three types of implementations studied in this paper will probably gain from AVX2.

  • Table-based implementations: with the new vpgatherqq instruction, it is possible to perform \(4\) parallel table lookups by using \(4\) indexes inside the ymm quadwords. The resulting quadwords, after the lookups, are stored inside the ymm source register. Such a technique has been applied to the Grøstl hash function in [15]. When applied to lightweight block ciphers, \(4\) internal states can be stored inside a single ymm register. One can isolate the 8-bit indexes (if we use 8-bit tables) by using the vpshufb instruction, perform the \(4\) lookups in parallel, and merge the results by XORing them into an accumulator. As we can see, this results in a \(4\)-way vectorized block cipher. According to [16], vpgatherqq will have a latency of \(16\) cycles and a throughput of \(10\) cycles when the data is in L1. A very rough estimation on the Haswell CPU is thus a \(1.5\) to \(2\) times improvement over the table-based implementation results provided in Appendix B (since the mov instruction has a latency of \(4\) cycles in L1).

  • vperm based implementations: extending the vperm technique to 256-bit ymm registers is straightforward, since one would store \(4\) states instead of \(2\) in one register. As for the table-based case, vperm implementations would be vectorized over \(4\) states, providing a twofold performance improvement for at least \(4\) parallel message blocks.

  • Bitslice implementations: as for the vperm technique, bitslicing can naturally take advantage of the AVX2 extension to 256-bit registers by performing on the high 128-bit parts of ymm exactly the same operations as on the low parts (if \(N\) message blocks are to be packed, \(N/2\) are packed as previously presented in the low parts of ymm, and \(N/2\) in the high parts). This would roughly give a twofold performance improvement (however requiring, as for vperm, twice as many parallel message blocks).

Conclusion and Future Work

In this article, we have studied the software implementation of lightweight block ciphers on x86 architectures, with a special focus on LED, Piccolo and PRESENT. We provided table-based, vperm and bitslice implementations and compared these three methods according to different common lightweight block cipher use cases. We believe our work helps to provide a more complete picture of lightweight block ciphers, and we identified several possible directions for future research.

First, we remark that our cache latency model for table-based implementations predicts that new and future processors with a large amount of L2 cache might enable new fast primitives that utilize \(16\)-bit Sboxes (which could then be implemented using big table lookups). Moreover, this remark might also improve current ciphers such as LED or PRESENT through a “Super-Sbox” type of implementation: two rounds can be seen as composed only of the application of four parallel \(16\)-bit Sboxes, and thus can be performed with only \(4\) table lookups.

Secondly, it would be interesting in the future to use this kind of modeling to compare different implementation trade-offs without tediously implementing all of them (this would also be true for hardware implementations). Table-based implementations are a simple case; we leave as an open problem whether more complex implementations can be studied in the same way.

Finally, another future work is to study other recently proposed block cipher designs such as PRINCE [6] or Zorro [11], and lightweight SPN-based hash functions such as PHOTON [12] or SPONGENT [4]. The analysis of hash functions would be quite different, since their internal state sizes (which vary with the intended output size) are bigger than \(64\) bits. Therefore, the amount of memory required to store the tables of table-based implementations is likely to be bigger, and vperm or bitslice implementations would be impacted as well, since the packing would be more complex and would use more xmm registers.