Implementing Lightweight Block Ciphers on x86 Architectures
Abstract
Lightweight block ciphers are designed to fit into very constrained environments, but usually not with software performance in mind. For classical lightweight applications where many constrained devices communicate with a server, it is also crucial that the cipher has good software performance on the server side. Recent work has shown that bitslice implementations applied to Piccolo and PRESENT led to very good software speeds, thus making lightweight ciphers interesting for cloud applications. However, we remark that bitslice implementations might not be interesting for some situations, where the amount of data to be enciphered at a time is usually small, and very little work has been done on non-bitslice implementations.
In this article, we explore general software implementations of lightweight ciphers on x86 architectures, with a special focus on LED, Piccolo and PRESENT. First, we analyze table-based implementations, and we provide a theoretical model to predict the behavior of various possible trade-offs depending on the processor cache latency profile. We obtain the fastest table-based implementations for our lightweight ciphers, which is of interest for legacy processors. Secondly, we apply to our portfolio of primitives the vperm implementation trick for \(4\)-bit S-boxes, which gives good performance, extra side-channel protection, and fits many lightweight primitives well. Finally, we investigate bitslice implementations, analyzing various costs which are usually neglected (bitsliced form (un)packing, key schedule, etc.), but that must be taken into account for many lightweight applications. We finally discuss which type of implementation seems best suited depending on the application profile.
Keywords
Lightweight cryptography · Software · vperm · Bitslice · LED · Piccolo · PRESENT

1 Introduction
RFID tags and very constrained computing devices are expected to become increasingly important for many applications and industries. In parallel to this general trend, the growth of ubiquitous computing and communication interconnections naturally leads to more entry points and increased potential damage for attackers. Security is crucial in many situations, but often left aside due to cost and feasibility constraints. In order to fulfill the need for cryptographic primitives that can be implemented and executed in very constrained environments (area, energy consumption, etc.), aka lightweight cryptography, the research community has recently made significant advances, in particular in the domain of symmetric-key cryptography.
Current NIST standards for block ciphers (AES [10]) or hash functions (SHA-2 [25] or SHA-3 [3]) are not really fit for very constrained environments, and several alternatives have been proposed, such as PRESENT [5], KATAN [7], LED [13], Piccolo [23] and TWINE [24] for block ciphers, and QUARK [1], PHOTON [12] and SPONGENT [4] for hash functions. Notably, the PRESENT block cipher is now part of an ISO standard [17]. All these proposals greatly improved our knowledge of lightweight designs, and many already achieve close to optimal performance for certain metrics such as area consumption.
In practice, the constrained devices will be communicating either with other constrained devices or, more probably, with a server. In the latter case, the server is likely to have to handle a large number of devices, and while cryptography might be used in a protocol to secure the communications, other application operations have to be performed by the server. Therefore, it is crucial that the server does not spend too much time performing cryptographic operations, even when communicating with many clients, and thus software performance does matter for lightweight cryptography.
At CHES 2012, Matsuda and Moriai [19] studied the application of bitslice implementations to the PRESENT and Piccolo block ciphers, concluding that current lightweight ciphers can be surprisingly competitive for cloud applications. Bitslice implementations allow impressive speed results and are also valuable for their inherent protection against various side-channel cryptanalyses. However, we argue that they might not really fit all the lightweight cryptography use cases, where a server has to communicate with many devices. Indeed, constrained devices usually encipher a very small amount of data at a time. For example, in its smallest form, an Electronic Product Code (EPC), which is thought to be a replacement for bar codes using low-cost passive RFID tags, uses \(64\), \(96\) or \(125\) bits as a unique identifier for any physical item. Small data enciphering makes the cost of the transformation of data into bitsliced form and of the key schedule process very expensive (these costs are usually omitted by assuming that a rather large number of blocks will be enciphered).
Therefore, it is interesting to explore the software efficiency profile of lightweight ciphers not only for cloud applications but also for classical lightweight applications. Surprisingly, apart from embedded 8-bit architectures, lightweight ciphers are not really meant to perform well in software on mid-range or high-end processors. For example, the currently best non-bitslice AES implementation reaches about 14 cycles per byte (c/B) on a 32-bit Pentium III CPU [20], while the currently best non-bitslice PRESENT implementation runs in only 130 c/B on the same processor (the implementation in [21] reports a performance of 16.2 cycles per bit). Therefore, we believe a gap exists in this area, even if very recent proposals such as TWINE do report good non-bitsliced software performance.
Our Contributions. In this article, we provide three main contributions to software implementations of lightweight ciphers on x86 architectures, with a special focus on LED, PRESENT and Piccolo. First, in Sect. 2, we argue that table-based implementations are still valuable in particular situations (some servers with “legacy” CPUs, pre-Core2, might lack the SSE instruction set extensions used in optimized bitslice or vperm implementations), and we propose new interesting trade-offs, with a theoretical cache model to better predict which trade-off will be suitable depending on the target processor. Our model is backed up by our experiments, and we obtain the best known table-based implementations for the studied lightweight ciphers.
Then, in Sect. 3, we further push software implementations of lightweight ciphers by exploring the vperm implementation trick for \(4\)-bit S-boxes, which has already proven to be interesting for AES [14] or TWINE [24], and which provides cache-timing attack resistance. We propose a strategy for our portfolio of lightweight ciphers, and we conclude that the linear layer, usually not the part contributing most to the amount of computation, can have a significant impact on performance for this type of implementation. We note that these implementations are interesting because they apply to almost all lightweight ciphers and they produce very efficient code.
Thirdly, in Sect. 4, we explore bitslice implementations of lightweight ciphers and we show that for some common use cases they are less interesting than for cloud applications [19]. In fact, bitslice implementations can be slower than table-based or vperm ones in some situations, for example when only a low number of blocks is enciphered per device. Moreover, previous bitslice analyses usually neglect the key schedule part, so we provide bitsliced versions of the key schedules. However, even in a bitsliced version, the key schedule can have a significant impact on the overall performance. We therefore revisit this type of implementation by taking into account various factors that are important in practice, such as the number of distinct keys, the number of blocks to be enciphered, etc. We note that we provide the first bitslice implementation of LED, greatly improving over the best known software performance of this cipher.
For all three primitives LED, PRESENT and Piccolo, we have coded all three implementation types with various trade-offs. Then, for various crucial use cases presented in Sect. 5, we compare the different implementation strategies and we discuss our results in Sect. 6. For readers not familiar with them, we quickly describe in Appendix A the three lightweight ciphers that we take as examples for our implementations and refer to the specification documents [5, 13, 23] for a complete description. Like many other lightweight ciphers, LED, PRESENT and Piccolo are \(64\)-bit ciphers based on the repetition of a round function built upon a \(4\)-bit S-box and a linear diffusion layer, while the key schedule is very simple, if not nonexistent. All these design choices are justified by the reduction of area consumption, and while smaller S-boxes are possible, \(4\) bits is a sensible choice which can provide very small area in hardware as well as ease of implementation in software.
The reader will find more details and code illustrations in the extended version of this paper [2]. Furthermore, the full source code of the implementations presented in this paper is available online at https://github.com/rbanssi/lightweightcryptolib.
2 TableBased Implementations
2.1 Core Ideas
A table-based round computation typically consists in:

- selecting slices of the internal state by shift and mask operations;
- performing several table lookups to achieve the round transformation;
- aggregating the lookup table outputs to get the updated internal state;
- performing the key addition layer.
It is to be noted that these pseudocodes are for x86 \(64\)-bit architectures (on \(32\)-bit ones, more mov and xor instructions are required due to the fact that table lookups only fetch \(32\)-bit words; refer to the table under the section “Results”, footnote a, for more details).
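To make the structure concrete, here is a hedged C sketch of one generic SPN round with \(m=8\)-bit table inputs. The table name `T` and function name `round_m8` are illustrative; in a real cipher the table contents embed the S-box and diffusion layers, and their construction is cipher-specific.

```c
#include <stdint.h>

/* Illustrative sketch: one SPN round with m = 8-bit table inputs.
   T[i] maps the i-th byte of the 64-bit state to a 64-bit word that
   already embeds the S-box and diffusion layers (placeholder contents;
   8 tables of 2048 bytes each, i.e. 16384 bytes in total). */
uint64_t T[8][256];

uint64_t round_m8(uint64_t state, uint64_t round_key)
{
    uint64_t out = T[0][state & 0xFF];       /* byte 0 needs no shift */
    for (int i = 1; i < 8; i++)              /* shift, mask, lookup, xor */
        out ^= T[i][(state >> (8 * i)) & 0xFF];
    return out ^ round_key;                  /* key addition layer */
}
```

Unrolled, this matches the shift/mask/lookup/XOR instruction pattern counted in the table below for \(m=8\).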
| Microarchitecture | \(L1\) size (KB) | \(L1\) latency (cycles) | \(L2\) size (KB) | \(L2\) latency (cycles) |
|---|---|---|---|---|
| Intel \(P6\) | \(16\) or \(32\) | \(3\) | \(512\) | \(8\) |
| Intel Core | \(32\) | \(3\) | \(1500\) | \(15\) |
| Intel Nehalem/Westmere\(^\mathrm{a}\) | \(32\) | \(4\) | \(256\) | \(10\) |
| Intel Sandy/Ivy Bridge\(^\mathrm{a}\) | \(32\) | \(5^\mathrm{b}\) | \(256\) | \(12\) |
Theoretical number of instructions for one round (for different table input sizes \(m\)):

| Instruction type | \(m=4\) bits | \(m=8\) bits | \(m=12\) bits | \(m=16\) bits |
|---|---|---|---|---|
| Shift | \(15\) | \(3\) | \(5\) | \(3\) |
| Move/xor | \(15\) | \(8\) | \(5\) | \(3\) |
| Mask | \(16\) | \(0\) | \(5\) | \(0\) |
| Table lookup\(^\mathrm{a}\) | \(16\) \((32)\) | \(8\) \((16)\) | \(6\) \((12)\) | \(4\) \((8)\) |
Theoretical average round latency in cycles (for different table input sizes \(m\)):

| Microarchitecture | \(m=4\) bits | \(m=8\) bits | \(m=12\) bits | \(m=16\) bits |
|---|---|---|---|---|
| Intel \(P6\) | \(142\) | \(59\) | \(99\) | \(93\) |
| Intel Core | \(94\) | \(35\) | \(91\) | \(264\) |
| Intel Nehalem/Westmere | \(110\) | \(43\) | \(68\) | \(186\) |
| Intel Sandy/Ivy Bridge | \(126\) | \(51\) | \(79\) | \(114\) |
Note that for \(m=16\) bits, we might also have to consider the \(L3\) or RAM latency depending on the \(L2\) size, and naturally extend Eq. (1). We experimentally verified these values by implementing and running such a generic SPN round for the different \(m\) values considered. We could confirm the results on each of the considered microarchitectures. Note however that the experimental results do not exactly match the theoretical ones due to the superscalar nature of the Intel architectures^{2}. Nevertheless, we emphasize that this model is sufficient for our purpose: one can deduce that \(8\)-bit slices seem to be the best trade-off from an efficiency point of view, whatever the microarchitecture, and we will apply this trade-off to each of the three lightweight ciphers from our portfolio. One can also notice that some counter-intuitive theoretical results are experimentally verified: for instance, \(16\)-bit input tables outperform \(4\)-bit input tables on some microarchitectures even though a lot of data lies outside \(L1\) and \(L2\) (this is due to the reduced number of shift/move/mask operations compensating for the bad average table access latency). Even though this is not the core subject of our paper, this theoretical model can be used for performance comparisons of table-based implementations on other architectures such as ARM, SPARC or PowerPC.
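As an illustration only, a plausible C sketch of such a latency model follows. It assumes uniformly distributed table accesses and that the hottest table data occupies L1 first, then L2, then RAM, weighting each level's latency by the fraction of table data it holds; this is a hedged reconstruction in the spirit of Eq. (1), not the paper's exact equation, and the function name is ours.

```c
/* Hedged sketch of an average table-lookup latency model (assumption:
   uniform accesses, table data filling L1 first, then L2, then RAM). */
double avg_lookup_latency(double table_kb,
                          double l1_kb, double l1_lat,
                          double l2_kb, double l2_lat,
                          double mem_lat)
{
    if (table_kb <= l1_kb)
        return l1_lat;                     /* everything hits L1 */
    if (table_kb <= l2_kb) {
        double in_l1 = l1_kb / table_kb;   /* fraction resident in L1 */
        return in_l1 * l1_lat + (1.0 - in_l1) * l2_lat;
    }
    double in_l1 = l1_kb / table_kb;
    double in_l2 = (l2_kb - l1_kb) / table_kb;
    return in_l1 * l1_lat + in_l2 * l2_lat
         + (1.0 - in_l1 - in_l2) * mem_lat;
}
```

With the Nehalem figures from the table above (L1 = 32 KB at 4 cycles, L2 = 256 KB at 10 cycles), a 16 KB table set averages the pure L1 latency, while larger table sets are increasingly dominated by L2 and RAM accesses; the per-round latency then combines this average with the shift/move/mask instruction counts.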
Finally, the table-based implementation specificities for each cipher are described in the following sections.
2.2 LED
Furthermore, one extra table of \(31\) or \(48\) \(64\)-bit words (for LED-64 and LED-128 respectively) allows the AddConstants operation to be performed with only one table lookup and one XOR (again, we manipulate \(64\)-bit words in order to directly place the \(4\)-bit constants at their correct position).
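A minimal sketch of this trick, with placeholder contents: the hypothetical table `AC` holds, for each round, the \(4\)-bit constants pre-positioned inside a \(64\)-bit word, so AddConstants reduces to one lookup and one XOR.

```c
#include <stdint.h>

/* Illustrative sketch of AddConstants as one lookup plus one XOR.
   AC[r] holds the round-r constants already placed at the nibble
   offsets where the cipher expects them (placeholder values; the
   real contents come from the LED specification). */
#define LED64_ROUNDS 32
uint64_t AC[LED64_ROUNDS];

uint64_t add_constants(uint64_t state, int round)
{
    return state ^ AC[round];   /* one table lookup, one XOR */
}
```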
2.3 PRESENT
Encryption. Since it has a very similar structure to LED, we use the same implementation strategy for PRESENT. Eight tables are built, each one taking as input two adjacent S-box \(4\)-bit words (\(8\)-bit inputs) and providing \(64\)-bit output words, such that the tables also take into account the permutation layer. The round computation pseudocode is exactly the same as for LED, except that there is no constant addition in the round function. Therefore, one PRESENT round is performed with \(7\) shifts, \(8\) masks, \(8\) table lookups and \(7\) XORs\(^3\) and requires eight tables of \(2048\) bytes each, thus \(16384\) bytes in total. The tables are therefore small enough to fit mostly or even entirely in the L1 cache. An example of how to build the tables is provided in Appendix C.1 of [2].
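Appendix C.1 of [2] gives the authors' table construction; as a hedged sketch built only from the public PRESENT specification (S-box and bit permutation \(P(i) = 16i \bmod 63\) for \(i < 63\), \(P(63) = 63\)), table \(i\) can be filled as follows (names `PTAB`, `p_layer` and `build_present_tables` are ours).

```c
#include <stdint.h>

/* Sketch of building the 8 PRESENT encryption tables: table t takes the
   two adjacent S-box inputs sitting at bit offset 8*t and returns a
   64-bit word with the 8 S-box output bits already sent through pLayer.
   Total: 8 * 256 * 8 = 16384 bytes, as stated above. */
static const uint8_t SBOX[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};
uint64_t PTAB[8][256];

static int p_layer(int i) { return (i == 63) ? 63 : (16 * i) % 63; }

void build_present_tables(void)
{
    for (int t = 0; t < 8; t++)
        for (int x = 0; x < 256; x++) {
            /* apply the S-box to the two nibbles of x */
            int y = SBOX[x & 0xF] | (SBOX[x >> 4] << 4);
            uint64_t w = 0;
            for (int b = 0; b < 8; b++)     /* permute each output bit */
                if ((y >> b) & 1)
                    w |= 1ULL << p_layer(8 * t + b);
            PTAB[t][x] = w;
        }
}
```

A round is then the XOR of the eight lookups `PTAB[t][(state >> (8*t)) & 0xFF]` followed by the round-key XOR.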
Key Schedule. The PRESENT key schedule is quite costly in software due to the \(61\)-bit rotation over the full size of the master key (especially for the \(80\)-bit key version, which does not fit exactly within a multiple of the x86 general-purpose register size). Using two small tables of \(31\) and \(16\) \(64\)-bit words, one can compute the round counter addition and the key schedule S-box lookup with only a single table lookup and a XOR (the \(128\)-bit key version performs two adjacent S-box calls in the key schedule, thus the second table will contain \(256\) elements in the case of PRESENT-128). We provide the pseudocode of the \(80\)-bit version in Appendix C.1 of [2].
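To illustrate why the \(61\)-bit rotation is awkward on \(64\)-bit registers, here is a hedged C sketch of one PRESENT-80 key-schedule update built from the public specification (the hi/lo split and all names are our own; the paper's table-based variant replaces the S-box and counter steps by a single lookup and XOR).

```c
#include <stdint.h>

static const uint8_t KS_SBOX[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};

/* Sketch of one PRESENT-80 key-schedule update. The 80-bit key register
   is split into hi (bits 79..16) and lo (bits 15..0), which is why the
   61-bit left rotation (= 19-bit right rotation) needs cross-word moves. */
void present80_ks_step(uint64_t *hi, uint16_t *lo, int round_counter)
{
    uint64_t h = *hi, l = *lo;

    /* rotate the 80-bit register left by 61 */
    uint64_t nh = (h >> 19) | ((((h & 0x7) << 16) | l) << 45);
    uint64_t nl = (h >> 3) & 0xFFFF;

    /* S-box on the top nibble (bits 79..76) */
    nh = (nh & 0x0FFFFFFFFFFFFFFFULL)
       | ((uint64_t)KS_SBOX[nh >> 60] << 60);

    /* XOR the 5-bit round counter into bits 19..15 */
    nh ^= (uint64_t)(round_counter >> 1);
    nl ^= (uint64_t)(round_counter & 1) << 15;

    *hi = nh;
    *lo = (uint16_t)nl;
}
```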
2.4 Piccolo
Encryption. The table-based implementation of Piccolo is slightly different from those of LED and PRESENT since Piccolo has a Feistel network structure. In order to tabulate the internal function \(F\) as much as possible, we divide it into two parts. The first one packs the first S-box layer of \(F\) together with the subsequent diffusion layer. It yields two tables of \(8\)-bit input and \(32\)-bit output (two S-box inputs are handled at a time), which can be used to perform the first part of \(F\) in both branches of the Feistel. The second part computes the second S-box layer only. It is therefore implemented using four tables of \(8\)-bit input and \(64\)-bit output (two tables per branch), allowing again to place the \(16\)-bit branches at their correct positions before the byte permutation at the end of the round. We explain in Appendix C.2 of [2] how to build these tables; the total amount of memory required is \(10240\) bytes, which is small enough to fit entirely in the L1 cache of the processor. The final byte permutation of a Piccolo round can then be computed efficiently with two masks, two \(16\)-bit rotations and one XOR. We provide the pseudocode for the \(i\)-th round computation of Piccolo in Appendix C.2 of [2].
Key Schedule. The \(80\)-bit and \(128\)-bit versions of the Piccolo key schedule are slightly different; nevertheless, they share a similar core which consists in selecting \(16\)-bit slices of the master key and XORing them with constant values. Hence, we build one extra small table made of \(25\) \(64\)-bit words (or \(31\) words for Piccolo-128) corresponding to the constant values. Then, we prepare several \(16\)-bit slices of the master key in \(64\)-bit words, and one can perform the key schedule with only a single table lookup and one XOR operation. Note that the permutation used in the \(128\)-bit version of the key schedule can be efficiently implemented with two masks, two \(16\)-bit rotations and one XOR.
3 Implementations Using vperm Technique
3.1 Introducing the vperm Technique
Vector Permute, abbreviated vperm, is a technique that uses vector permutation instructions in order to implement table lookups, taking advantage of the SIMD engine present inside modern CPUs. The main advantages of the vperm technique are parallel table lookups and resistance against timing side-channel attacks. Applied to block cipher implementations, this technique comes originally from [14]. It has also proven to be efficient for multivariate cryptography [8].
The main idea behind the vperm technique is to use shuffling instructions to look up small tables. Though this technique can be used on different architectures where SIMD shuffling instructions are present (for instance AltiVec on PowerPC, or NEON on ARM), we will exclusively focus on their x86 flavor, namely the pshufb instruction. This instruction was introduced with the SSSE3 extension that came with the Intel Core microarchitecture.
Regarding lightweight block ciphers, the vperm technique has already been applied to TWINE [24], yielding very competitive results on the x86 platform (6.87 c/B for a 2-message encryption). However, no results are available for other lightweight block ciphers. In the following subsections, we study how the vperm technique fits LED, Piccolo and PRESENT. We will show that, though the confusion layer is quite straightforward to implement using vperm, the linear diffusion layer can be challenging.
3.2 Core Ideas for vperm Applied to Lightweight Block Ciphers
In this section, we briefly describe the main implementation ideas that are common to LED, PRESENT and Piccolo (as well as to many lightweight block ciphers).
Message Packing and Unpacking. Lightweight block cipher states are 64 bits long, which means that two of them can be stored inside a 128-bit xmm register. However, the natural packing concatenating the two states side by side inside a register is not optimal. This is due to the fact that the algorithms we focus on use nibble-based permutations as part of their linear diffusion layer. Implementing such permutations by using shift or rotation operations can be costly. However, if the two states are packed by interleaving their nibbles as presented in Fig. 3, it is possible to realize any nibble permutation by using pshufb, since the nibble permutations are now mapped to byte permutations. The packing and unpacking are easily implemented using some shift and pshufb operations. Their cost, around ten cycles, is marginal compared to the encryption process. Using this packing, one can apply \(32\) S-box lookups on the two states by using two pshufb, two pand masks, one 4-bit right shift psrlw to isolate the low and high nibbles, and one pxor to merge the two results. As we will explain, this packing will be applied to Piccolo and PRESENT, but not to LED.
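The S-box layer just described can be sketched with SSE intrinsics as follows. This is a hedged illustration (the names `sbox_layer`, `sbox_lo` and `sbox_hi` are ours): `sbox_lo` holds the 16 S-box entries \(S[x]\) and `sbox_hi` the pre-shifted copies \(S[x] \ll 4\), so a single pxor merges the two lookup results without an extra shift.

```c
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3 intrinsics (pshufb) */

/* Illustrative sketch: 32 parallel 4-bit S-box lookups on a 128-bit
   register, using two pand, one psrlw, two pshufb and one pxor, as
   counted in the text. Table contents are cipher-specific. */
__attribute__((target("ssse3")))   /* GCC/Clang: enable SSSE3 locally */
__m128i sbox_layer(__m128i state, __m128i sbox_lo, __m128i sbox_hi)
{
    __m128i m  = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_and_si128(state, m);                    /* pand         */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(state, 4), m); /* psrlw + pand */
    return _mm_xor_si128(_mm_shuffle_epi8(sbox_lo, lo),      /* pshufb       */
                         _mm_shuffle_epi8(sbox_hi, hi));     /* pshufb+pxor  */
}
```

On AVX-capable CPUs, the same lookups can use the three-operand vpshufb form discussed next, saving the register copies.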
Using AVX Extensions. On the last two Intel CPU generations (Sandy and Ivy Bridge), a new instruction set, AVX, has been added to extend Westmere’s SSE 4.2. The 128-bit xmm registers have also been extended to 256-bit ymm registers; most of the instructions do however not operate on the full ymm registers, but only on their low 128-bit part. The full AVX extensions operating on ymm registers will be introduced with AVX2 on the forthcoming Haswell architecture. All the presented encryption algorithms and their key schedules can still benefit from AVX on Sandy and Ivy Bridge by using the three-operand form of the instructions, which saves register backups. For instance, a table lookup can be performed with the single instruction “vpshufb t, s, r” instead of the two instructions “movdqa t, s; pshufb t, r”^{4}.
3.3 LED
The LED block cipher does not have a key schedule per se, and since the decryption process is not more complex than the encryption one (the coefficients of the inverse diffusion matrix are the same as for the original matrix), we will only focus on the latter. As explained previously, the \(4\)-bit S-box layer can be implemented in a few cycles by using pshufb, masks and shifts. The ShiftRows is also immediate with a pshufb by using the interleaved-nibble packing described above. However, the MixColumnsSerial step uses field multiplications by 11 different constants (4, 2, B, ...). Using as many as \(11\) lookup tables, one per multiplicative constant, would be too costly as they would not leave room for the state and other operations inside the xmm registers. We could also use the fact that LED’s MDS matrix is a power of a simpler sparse matrix using fewer constants: the drawback is that raising to the power \(4\) would mean that all the operations have to be applied four times.
We found that there is a better implementation strategy for LED: we can use table-based tricks to store the S-box and MixColumnsSerial layers inside xmm-register-based tables. Each column can be stored inside a \(2^{4}\times 2=32\)-byte table (thus 2 xmm registers). Hence, \(4\) pairs of xmm registers will store the \(4\) tables needed to perform a round of LED, and lookups inside each table will be performed in a vectorized way for each nibble of the state using two pshufb, as described in Sect. 3.2. The drawback is that the output words will be in different xmm registers, but the repacking of this step can be combined with the ShiftRows layer that shuffles the columns. We also use por masking to force the MSB of bytes that are not concerned by a lookup in a specific table. For each LED round, \(8\) pshufb instructions are used for the lookups, and \(6\) pshufb for the shifting layer (ShiftRows and repacking). This implementation strategy does not use the specific state packing from Fig. 3 since the shuffling for the ShiftRows and table repacking can be expressed using pshufb. However, one should notice that there is a small message packing cost for LED due to its row-oriented message loading in the state: the input message is packed column-wise, and the ciphertext is packed back row-wise.
3.4 PRESENT
PRESENT can benefit from the vperm technique in both encryption and key schedule, since the latter uses Sbox lookups for subkeys computations.
One should notice that since the PRESENT vperm encryption uses packed messages, the scheduled keys that will be XORed into the cipher state must be packed in the same way. However, the 61-bit rotation is not compatible with the nibble-interleaving packing from Fig. 3, which means that the key schedule cannot easily be performed with this data packing. This implies that all the subkeys have to be packed after they have been generated, which explains the high key schedule packing cost reported in Appendix B.
3.5 Piccolo
Encryption. Piccolo’s \(F\) function uses a circulant MixColumns matrix over GF(\(2^4\)), which allows using three \(16\)-byte tables: the S-box, the S-box composed with multiplication by 2 in the field, and the S-box composed with multiplication by 3. Two states of Piccolo are stored in one xmm register with the nibbles interleaved as in Fig. 3. It is then possible to implement one Piccolo round with two \(F\) functions in parallel inside the xmm register, by using three pshufb for the three multiplication lookups (by 1, 2, and 3). Three more pshufb on the results and three pxor are necessary in order to perform the column mixing according to the circulant nature of the MixColumns matrix. The second layer of S-box lookups in \(F\) can be performed with only one pshufb. Finally, Piccolo’s Round Permutation is realized with a single pshufb, since it is a byte permutation. The piece of code given in Appendix D.2 of [2] illustrates these steps (it is suited for the low nibbles of the state; almost the same code is used for the high nibbles).
Key Schedule. Piccolo’s key schedules for 80- and 128-bit keys do not really benefit from the vperm technique, since no S-box lookup nor field multiplication over GF\((2^4)\) is performed. The same implementation tricks as presented in Sect. 2.4 are used in order to minimize the number of operations extracting the master key subparts. The main difference with the table-based key schedule is that in the vperm case, the process is performed inside xmm registers with SSE instructions: the main benefit is that one can vectorize the key schedule over two master keys, performing all the operations in the nibble-interleaved packing format from Fig. 3. This results in an optimized key schedule for two keys that requires almost the same number of cycles as the table-based implementation for one key (see results in Appendix B).
4 Bitslice Implementations
Bitslice implementations often lead to impressive performance results, as shown for example in [19] for PRESENT and Piccolo. However, we would like to also take into account the key schedule cost, which might not be negligible in several typical lightweight cryptography use cases, such as short data or independent keys for different data blocks (see Sect. 5.2 for specific examples). As a consequence, exploring the bitslice possibilities for the various key schedules is of interest. In particular, many distinct keys might be used for the encryption, and non-bitsliced key schedules might kill the parallelism gain if one does the packing for each round key (packing/unpacking takes a comparable number of cycles as the encryption in most cases). Such bitsliced key scheduling has never been studied for lightweight block ciphers to our knowledge, and we provide some results for the three ciphers in this section. One of our conclusions is that some key schedules can significantly slow down performance depending on the use case, which somewhat moderates the results exposed in [19].
4.1 The Packing/Unpacking
The choice of an appropriate packing inside the xmm registers is important for a bitslice implementation. For the LED bitsliced version with 16 parallel blocks, we use the packing described in Fig. 9 in Appendix E.4 of [2]. The packing for 32 parallel blocks is identical (see Fig. 11 of [2]). It is to be noted that the packing used for PRESENT is the same as for LED (such a packing can be obtained with a little more than one hundred instructions).
The (un)packing for Piccolo with 16 parallel blocks, depicted in Fig. 10 of [2], is very similar and requires a few more instructions. The reader can refer to [19] for details and code.
4.2 The Encryption
An important part of the encryption cost is the S-boxes, but the bitslice representation allows computing many of them in parallel within a few clock cycles. We recall the logical instruction sequences proposed by [19] in Appendix E.1 of [2] for the LED and PRESENT S-box, and in Appendix E.2 of [2] for the Piccolo S-box.
The second part of an encryption round is the linear diffusion layer. For LED, the ShiftRows is simply performed with a few pshufb operations and the MixColumnsSerial is handled with the same method as in [18] for the AES or in [19] for the Piccolo diffusion matrix. In the case of LED, one also has to consider the XORing of round-dependent constants during the AddConstants function, but this can easily be done by preparing the constants in bitsliced form in advance. For PRESENT, the bit permutation pLayer can be performed by just reorganizing the positions of the 16-bit (or 32-bit) words in the xmm registers in bitsliced form. This can be executed efficiently [19] using a few pshufd, punpck(h/l)dq and punpck(h/l)qdq instructions (see the pseudocode in Appendix E.3 of [2] for \(8\) parallel data blocks). For Piccolo, the nibble position permutation (performed with a few pshufd instructions) and the matrix multiplication are similar to those in [19].
4.3 The Key Schedule
As previously explained, the key schedule cost can be prohibitive in certain use cases when it comes to bitslicing. Thus, it seems reasonable to design bitsliced versions of the key schedules: this leverages possible parallelism when many keys are processed, and prepares these keys in their packed format so that XORing them with the bitsliced state is straightforward. As a consequence, the bitslice format for the key must be the same as for the data, or at least very similar so that the repacking cost is small. To minimize the key schedule cost, the packing is only performed once on the original keys, from which the subkeys are produced by shift and masking operations.
LED. No key schedule is defined for LED. Only the original secret key has to be packed in the data bitsliced format (one \(64\)-bit key for LED-64 and two \(64\)-bit keys for LED-128; other sizes use a sliding window requiring some additional shifts and masks).
Piccolo. The key schedule is very light: it basically consists in selecting \(16\)-bit chunks from the original secret key and XORing them with round constants. Similarly to LED, our implementation first prepares the \(16\)-bit chunks in bitsliced format once and for all. Thanks to the adapted packing, the two \(16\)-bit key words appear in the same registers. For instance, when the parallelism is \(16\) blocks, \(8\) xmm registers are required to store the data and each round key; however, only \(4\) are required for storing one round key in our case, because the other \(4\) contain only \(0\)s, which can be discarded. This halves the storage as well as the key-addition operations. Another important observation is that even-indexed chunks appear only in the left part of the round keys, and odd-indexed chunks appear only in their right parts. Hence, we can pre-position these chunks only once, and the key schedule then only involves XORing the appropriate two chunks and the constants. To reduce the number of packing operations, we first pack all the original secret keys without repositioning, and then do the pre-positioning for the subkeys. These arrangements minimize the overall number of operations required by the key schedule.
PRESENT. The key schedule of PRESENT is not well suited for software, and even less so when the key data has to be in bitsliced format. We divide the keys into two chunks (\(64\) and \(16\) bits for PRESENT-80, and two \(64\)-bit chunks for PRESENT-128) and prepare them in bitsliced format using the same packing as the data (the first chunks of the keys are packed together, and the second chunks of the keys are packed together). The subkey to be XORed every round into the cipher internal state is located in the first chunk. The constant addition of the key schedule update function is simply handled by preformatting the constants in bitsliced format and XORing them into the chunks (in fact only one chunk will be impacted if one does this step before the rotation). Then, the S-box layer is performed by using the same S-box function as for the internal cipher, just making sure with a mask that only a single S-box is applied. Finally, the \(61\)-bit rotation is separated into two parts, since only rotations by a multiple of \(4\) bits are easy to handle in bitslice packing. First, a \(60\)-bit rotation is applied using several pshufb instructions (together with masking and XORs). Then, a single-bit rotation is computed by changing the ordering of the xmm registers (the xmm registers containing the third S-box bits will now contain the second S-box bits, etc.). An adjustment is eventually required as some bits will go beyond the register limit and should switch to another one (this can be done with more shifts, masks and XORs). We provide the pseudocode for the bitsliced key schedule implementation of PRESENT-80 in Appendix E.5 of [2].
4.4 Discussions
| | LED64 | LED128 | Piccolo80 | Piccolo128 | PRESENT80 | PRESENT128 |
|---|---|---|---|---|---|---|
| Key schedule ratio | 3.3 % | 4.1 % | 20.2 % | 26.7 % | 55.2 % | 59.9 % |
5 Analyzing the Performance
5.1 Framework for Performance Evaluation
In order to compare the various implementation techniques, we will consider that a server is communicating with \(D\) devices, each using a distinct key. For each device, the server has to encipher/decipher \(B\) \(64\)-bit blocks of data. Moreover, we distinguish between the cases where the enciphered data comes from a parallel operating mode (like CTR) or a serial one (like CBC).
Now, we would like to take into account the fact that some implementations can be faster when some parallelism is possible (like the bitslice technique). Let \(t_{E}\) be the time required by the implementation to perform the encryption process (without the key schedule and without the packing/unpacking of the input/output data). Let \(P_E\) denote the number of blocks that the implementation enciphers at a time in an encryption process (i.e. the number of blocks the implementation was intended to be used with). Similarly, let \(t_{KS}\) be the time required by the implementation to perform the key schedule process (without the packing of the key data), and we naturally extend the notation to \(P_{KS}\).
We remark that ciphering a lower number of blocks than \(P_E\) (resp. \(P_{KS}\)) will still require time \(t_E\) (resp. \(t_{KS}\)). However, contrary to the encryption or key schedule process, the packing/unpacking time of the input/output data will strongly depend on the number of blocks involved. Therefore, if we denote by \(t_{pack}\) the time required to pack one block of data, we get that packing \(x\) blocks simply requires \(x \cdot t_{pack}\). Similarly, we denote \(t_{unpack}\) the time required to unpack one block of data and unpacking \(x\) blocks simply requires \(x \cdot t_{unpack}\). For the key schedule, \(t_{packKS}\) denotes the time to pack the key data, and packing \(x\) keys requires \(x \cdot t_{packKS}\) (there is no need to unpack the key).
For previous bitslice implementations, since many blocks are assumed to be enciphered, the key schedule cost is usually omitted. However, in this article, we are interested in use cases where, for example, \(B\) can be a small value, like a single block. When \(B\) is small, one can see that the relative cost of the key schedule has to be taken into account.
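As an illustration, the accounting above can be turned into a rough cost formula. The following Python sketch is our own simplification, not a formula from the paper: it computes the total server time for \(D\) devices and \(B\) blocks each in a parallel operating mode, assuming the blocks of one device can fill encryption batches of size \(P_E\), and that key schedules of different devices can be batched in groups of \(P_{KS}\).

```python
from math import ceil

def server_time_parallel(D, B, t_E, P_E, t_KS, P_KS,
                         t_pack, t_unpack, t_packKS):
    """Illustrative total cost for D devices enciphering B blocks each,
    in a parallel operating mode (Sect. 5.1 notation).

    Batches smaller than P_E (resp. P_KS) still cost the full t_E
    (resp. t_KS); packing/unpacking scales linearly with the data."""
    key_cost = D * t_packKS + ceil(D / P_KS) * t_KS
    enc_cost = D * (B * (t_pack + t_unpack) + ceil(B / P_E) * t_E)
    return key_cost + enc_cost
```

For a single block per device (\(B = 1\)), the key schedule and packing terms dominate, which is exactly why they cannot be neglected in these use cases.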
5.2 The Use Cases
| \(D\) | \(B\) | Op. mode | Example | LED | PRESENT | Piccolo |
|-------|-------|----------|---------|-----|---------|---------|
| Small | Small | –        | Authentication / access control / secure traceability (industrial assembly line) | Table/vperm | Table/vperm | Table/vperm |
| Small | Big   | Parallel | Secure streaming communication (medical device continuously sending sensitive data to a server, tracking data, etc.) | Bitslice | Bitslice | Bitslice |
| Small | Big   | Serial   | Secure serial communication | Table/vperm | Table/vperm | Table/vperm |
| Big   | Small | –        | Multi-user authentication / secure traceability (parallel industrial assembly lines) | Bitslice | Bitslice | Bitslice |
| Big   | Big   | Parallel | Multi-user secure streaming communication / cloud computing / smart meter servers / sensor networks / Internet of Things | Bitslice | Bitslice | Bitslice |
| Big   | Big   | Serial   | Multi-user secure serial communication | Bitslice | Bitslice | Bitslice |
6 Results and Discussions
6.1 Implementation Results
We have performed measurements of our three types of implementations for our three lightweight candidates. For better precision, the encryption times have been measured with a differential method, checking consistency by verifying that the sum of the sub-parts is indeed equal to the time of the entire encryption. Moreover, the measurements have been performed with the TurboBoost option disabled, in order to avoid any dynamic upscaling of the processor's clock rate (this technology has been present in certain processor versions since the Intel Nehalem CPU generation). We observe that the timings of our bitslice implementations for Piccolo and PRESENT are consistent with the ones provided in [19]. Moreover, we greatly improve over the previously best known LED software implementations (about 57 c/B on a Core i7-720QM [13]), since our bitsliced version reaches speeds up to 12 c/B.
We give in Table 2 in Appendix B all the implementation results on Core i3-2367M (Sandy Bridge microarchitecture), XEON X5650 (Westmere microarchitecture) and Core 2 Duo P8600 (Core microarchitecture) processors. Using the measurements for \(t_E\), \(t_{KS}\), \(t_{pack}\), \(t_{unpack}\), \(t_{packKS}\) in our framework from Sect. 5, we can infer the performance for the six use cases.
6.2 Comparing the Implementation Types and the Ciphers
For bitslice implementations, the cost of bitsliced form transposition on the server can be removed if the device also enciphers in bitsliced format. However, depending on the type of constrained device, the bitsliced algorithm might perform very poorly and the communication cost would increase if a serial mode is used or if a small amount of data is enciphered. Moreover, this solution would reduce the compatibility if other participants have to decipher in nonbitsliced form. The same compatibility issue is true for the keys in the server database, if one directly stores the keys or subkeys in bitsliced form. Finally, it is to be noted that bitsliced versions of the key schedule are especially interesting when all the keys are changed at the same time (i.e. fixed message length, messages synchronized in time).
We can see that, from a software implementation perspective, all three ciphers perform reasonably well and are in the same speed range. Their internal round functions are quite fit for x86 architectures. Table-based implementations are helped by the small \(64\)-bit internal state size. The vperm implementations are fast thanks to the use of small \(4\)-bit S-boxes, even though the linear diffusion layer can significantly impact the performance (which is the reason why TWINE has very good vperm implementation performance). For PRESENT, the bit permutation layer is not really suited for software; the LED diffusion matrix has complex coefficients when not in its serial form; and the Piccolo \(F\) function, with its two layers of S-boxes, reduces the possibilities for improvement. Concerning the key schedule, having a byte-oriented key schedule (or none at all) is an advantage, and a bitwise rotation as in PRESENT is clearly difficult to handle in software.
A lot of research has been conducted on block cipher constructions, and building a good cryptographic permutation is now well understood by the community. However, this is not the case for the key schedule: usually, block cipher designers try to build a key schedule very different from the round function, in the hope of avoiding any unpredicted relation that might arise between the two components. However, we remark that this is in contradiction with efficient parallel implementations (like bitslice), since the packing of the key and of the block cipher internal state must be (almost) the same (otherwise, the repacking cost for every generated subkey would be prohibitive).
It is also to be noted that, when analyzing cipher software performance on the server side, it is more likely that decryption will have to be performed rather than encryption. We emphasize that, in the case of PRESENT, the decryption process has the same performance as our encryption implementations. For LED and Piccolo, the inverse matrix of the diffusion layer has more complex coefficients than the encryption one (only for the non-serialized matrix in the case of LED), but this does not impact table-based implementations. However, we remark that this might affect our best performing implementations for Piccolo, and their decryption counterparts are likely to be somewhat slower than the encryption mode.
6.3 Future Implementations

Table-based implementations: with the new vgatherqq instruction, it is possible to perform \(4\) parallel table lookups by using \(4\) indexes inside the ymm quadwords. The resulting quadwords, after the lookups, are stored inside the ymm source register. Such a technique has been applied to the Grøstl hash function in [15]. When applied to lightweight block ciphers, \(4\) internal states can be stored inside a single ymm register. One can isolate the \(8\)-bit indexes (if we use \(8\)-bit tables) using the vpshufb instruction, perform the \(4\) lookups in parallel, and merge the results by XORing them into an accumulator. As one can see, this results in a \(4\)-way vectorized block cipher. According to [16], vgatherqq will have a latency of \(16\) cycles and a throughput of \(10\) cycles when the data is in L1. A very rough estimate of the results on a Haswell CPU is thus a \(1.5\) to \(2\) times improvement over the table-based implementation results provided in Appendix B (since the mov instruction has a latency of \(4\) cycles for L1 data).
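The gather-based step just described can be modeled in scalar Python as follows. The table contents and state values are hypothetical toy data; the point is only the index-extraction / gather / XOR-accumulate pattern (a real implementation would use one lookup table per byte position).

```python
# Hypothetical toy 8-bit -> 64-bit lookup table (255 * 0x01 = 0xFF per byte,
# so the multiplication never carries between bytes).
TABLE = [(0x0101010101010101 * b) & 0xFFFFFFFFFFFFFFFF for b in range(256)]

def gather_round_step(states, byte_pos):
    """Model one vgatherqq step: for each of the 4 states packed in a ymm
    register, extract the byte at `byte_pos` (the vpshufb step) and look it
    up in TABLE (the gather)."""
    indexes = [(s >> (8 * byte_pos)) & 0xFF for s in states]  # vpshufb
    return [TABLE[i] for i in indexes]                        # vgatherqq

# 4 internal states packed in one "ymm register" (arbitrary values).
states = [0x0011223344556677, 0x8899AABBCCDDEEFF,
          0x0123456789ABCDEF, 0xFEDCBA9876543210]
acc = [0, 0, 0, 0]
for pos in range(8):                 # 8 table lookups per 64-bit state
    looked_up = gather_round_step(states, pos)
    acc = [a ^ v for a, v in zip(acc, looked_up)]  # XOR-accumulate
```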

vperm-based implementations: extending the vperm technique to \(256\)-bit ymm registers is straightforward, since one would store \(4\) states instead of \(2\) in one register. As for table-based implementations, vperm implementations will be vectorized over \(4\) states, providing a \(2\) times performance improvement for at least \(4\) parallel message blocks.

Bitslice implementations: as for the vperm technique, bitslicing can naturally take advantage of the AVX2 extension to \(256\)-bit registers by performing on the high \(128\)-bit parts of the ymm registers the exact same operations as on the low parts (if \(N\) message blocks are to be packed, \(N/2\) are packed as previously presented in the low part of ymm, and \(N/2\) in the high part). This would roughly give a \(2\) times performance improvement (however requiring, as for vperm, twice as many parallel message blocks).
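For completeness, the bitsliced packing itself (whose per-block cost \(t_{pack}\) appears in Sect. 5) is just a bit-matrix transposition. The naive Python sketch below is our own illustration, far slower than the SSE shuffle sequences actually used: it sends bit \(j\) of block \(i\) to bit \(i\) of slice \(j\).

```python
def pack(blocks, width=64):
    """Transpose N `width`-bit blocks into `width` N-bit slices:
    bit i of slices[j] equals bit j of blocks[i]."""
    slices = [0] * width
    for i, b in enumerate(blocks):
        for j in range(width):
            slices[j] |= ((b >> j) & 1) << i
    return slices

def unpack(slices, n, width=64):
    """Inverse transposition: rebuild the n original blocks."""
    blocks = [0] * n
    for j, s in enumerate(slices):
        for i in range(n):
            blocks[i] |= ((s >> i) & 1) << j
    return blocks
```

Since every input bit is visited once, the cost is linear in the number of blocks, which is why packing \(x\) blocks costs \(x \cdot t_{pack}\) in the framework of Sect. 5.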
7 Conclusion
In this article, we have studied the software implementation of lightweight block ciphers on x86 architectures, with a special focus on LED, Piccolo and PRESENT. We provided table-based, vperm and bitslice implementations, and compared these three methods according to different common lightweight block cipher use cases. We believe our work helps to get a more complete picture of lightweight block ciphers, and we identified several possible directions for future research.
First, we remark that our cache latency model for table-based implementations predicts that new and future processors with a large amount of L2 cache might enable new fast primitives that utilize \(16\)-bit S-boxes (which could then be implemented using big table lookups). Moreover, this remark might also improve current ciphers such as LED or PRESENT, by imagining a "Super-Sbox" type of implementation: two rounds can be seen as composed only of the application of four parallel \(16\)-bit S-boxes, and can thus be performed with only \(4\) table lookups.
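The Super-Sbox idea can be sketched as follows. This is a toy illustration, not the actual LED or PRESENT rounds: we use the PRESENT \(4\)-bit S-box for concreteness, and a made-up nibble rotation as the linear layer confined to each \(16\)-bit group. If two rounds factor into four such independent \(16\)-bit mappings, each mapping can be precomputed as a 65536-entry table.

```python
# The PRESENT 4-bit S-box (used here only for concreteness).
SBOX4 = [0xC, 5, 6, 0xB, 9, 0, 0xA, 0xD, 3, 0xE, 0xF, 8, 4, 7, 1, 2]

def two_rounds_16(x):
    """A toy two-round function acting independently on one 16-bit group:
    S-box layer, hypothetical nibble rotation inside the group, S-box layer."""
    nibbles = [SBOX4[(x >> (4 * k)) & 0xF] for k in range(4)]
    nibbles = nibbles[1:] + nibbles[:1]        # made-up linear layer
    nibbles = [SBOX4[n] for n in nibbles]
    return sum(n << (4 * k) for k, n in enumerate(nibbles))

# Precompute the 16-bit "Super-Sbox": two rounds collapse into one lookup.
SUPER = [two_rounds_16(x) for x in range(1 << 16)]

def two_rounds_64(state):
    """Apply the toy two rounds to a 64-bit state with only 4 table lookups."""
    return sum(SUPER[(state >> (16 * k)) & 0xFFFF] << (16 * k)
               for k in range(4))
```

The table costs \(2^{16}\) 16-bit entries (128 KB), which is precisely why such an approach only pays off on processors with a sufficiently large and fast L2 cache.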
Secondly, it would be interesting in the future to use this kind of modeling to compare different implementation trade-offs without tediously implementing all of them (this would also be true for hardware implementations). Table-based implementations are a simple case; we leave as an open problem whether more complex implementations can be studied the same way.
Finally, another future work is to study other recently proposed block cipher designs, such as PRINCE [6] or Zorro [11], and lightweight SPN-based hash functions, such as PHOTON [12] or SPONGENT [4]. The analysis of hash functions would be quite different, since their internal state sizes (which vary with the intended output size) are bigger than \(64\) bits. Therefore, the amount of memory required to store the tables for table-based implementations is likely to be bigger, and vperm or bitslice implementations would be impacted as well, since the packing would be more complex and would use more xmm registers.
Footnotes
1. For the sake of simplicity, we consider an exclusive cache model. Considering inclusive or hybrid models would not change the equation much.
2. One should consider the throughput of the instructions instead of their latencies for accurate performance estimates.
3. These figures correspond to high-level pseudocode, but are slightly changed in assembly, as reflected in the cache model results, thanks to mask-and-move instructions.
4. The expected throughput improvement would however vary across the considered microarchitectures (mainly depending on the pipeline stage where register-to-register moves are performed, as well as on the front-end instruction decoder throughput).
Acknowledgements
The authors would like to thank the anonymous referees for their helpful comments.
References
1. Aumasson, J.-P., Henzen, L., Meier, W., Naya-Plasencia, M.: Quark: a lightweight hash. In: Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp. 1–15. Springer, Heidelberg (2010)
2. Benadjila, R., Guo, J., Lomné, V., Peyrin, T.: Implementing lightweight block ciphers on x86 architectures. Cryptology ePrint Archive, Report 2013/445, full version. http://eprint.iacr.org/2013/445.pdf (2013)
3. Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: Keccak specifications. Submission to NIST. http://keccak.noekeon.org/Keccakspecifications.pdf (2008)
4. Bogdanov, A., Knezevic, M., Leander, G., Toz, D., Varici, K., Verbauwhede, I.: SPONGENT: a lightweight hash function. In: Preneel, B., Takagi, T. (eds.) [22], pp. 312–325
5. Bogdanov, A., Knudsen, L.R., Leander, G., Paar, Ch., Poschmann, A., Robshaw, M., Seurin, Y., Vikkelsoe, C.: PRESENT: an ultra-lightweight block cipher. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727, pp. 450–466. Springer, Heidelberg (2007)
6. Borghoff, J., Canteaut, A., Güneysu, T., Kavun, E.B., Knezevic, M., Knudsen, L.R., Leander, G., Nikov, V., Paar, Ch., Rechberger, Ch., Rombouts, P., Thomsen, S.S., Yalçın, T.: PRINCE – a low-latency block cipher for pervasive computing applications. In: Wang, X., Sako, K. (eds.) ASIACRYPT 2012. LNCS, vol. 7658, pp. 208–225. Springer, Heidelberg (2012)
7. De Cannière, C., Dunkelman, O., Knezevic, M.: KATAN and KTANTAN – a family of small and efficient hardware-oriented block ciphers. In: Clavier, C., Gaj, K. (eds.) [9], pp. 272–288
8. Chen, A.I.T., Chen, M.-S., Chen, T.-R., Cheng, C.-M., Ding, J., Kuo, E.L.-H., Lee, F.Y.-S., Yang, B.-Y.: SSE implementation of multivariate PKCs on modern x86 CPUs. In: Clavier, C., Gaj, K. (eds.) [9], pp. 33–48
9. Clavier, C., Gaj, K. (eds.): CHES 2009. LNCS, vol. 5747. Springer, Heidelberg (2009)
10. Daemen, J., Rijmen, V.: The Design of Rijndael: AES – The Advanced Encryption Standard. Springer, Heidelberg (2002)
11. Gérard, B., Grosso, V., Naya-Plasencia, M., Standaert, F.-X.: Block ciphers that are easier to mask: how far can we go? Cryptology ePrint Archive, Report 2013/369. http://eprint.iacr.org/ (2013)
12. Guo, J., Peyrin, T., Poschmann, A.: The PHOTON family of lightweight hash functions. In: Rogaway, P. (ed.) CRYPTO 2011. LNCS, vol. 6841, pp. 222–239. Springer, Heidelberg (2011)
13. Guo, J., Peyrin, T., Poschmann, A., Robshaw, M.J.B.: The LED block cipher. In: Preneel, B., Takagi, T. (eds.) [22], pp. 326–341
14. Hamburg, M.: Accelerating AES with vector permute instructions. In: Clavier, C., Gaj, K. (eds.) [9], pp. 18–32
15. Holzer-Graf, S., Krinninger, T., Pernull, M., Schläffer, M., Schwabe, P., Seywald, D., Wieser, W.: Efficient vector implementations of AES-based designs: a case study and new implementations for Grøstl. In: Dawson, E. (ed.) CT-RSA 2013. LNCS, vol. 7779, pp. 145–161. Springer, Heidelberg (2013)
16. Intel: Intel 64 and IA-32 Architectures Optimization Reference Manual (2013)
17. International Organization for Standardization: ISO/IEC 29192-2:2012, Information technology – Security techniques – Lightweight cryptography – Part 2: Block ciphers (2012)
18. Käsper, E., Schwabe, P.: Faster and timing-attack resistant AES-GCM. In: Clavier, C., Gaj, K. (eds.) [9], pp. 1–17
19. Matsuda, S., Moriai, S.: Lightweight cryptography for the cloud: exploit the power of bitslice implementation. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 408–425. Springer, Heidelberg (2012)
20. Osvik, D.A.: Fast assembler implementations of the AES (2003)
21. Poschmann, A.: Lightweight cryptography – cryptographic engineering for a pervasive world. Cryptology ePrint Archive, Report 2009/516. http://eprint.iacr.org/ (2009)
22. Preneel, B., Takagi, T. (eds.): CHES 2011. LNCS, vol. 6917. Springer, Heidelberg (2011)
23. Shibutani, K., Isobe, T., Hiwatari, H., Mitsuda, A., Akishita, T., Shirai, T.: Piccolo: an ultra-lightweight blockcipher. In: Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 342–357. Springer, Heidelberg (2011)
24. Suzaki, T., Minematsu, K., Morioka, S., Kobayashi, E.: TWINE: a lightweight block cipher for multiple platforms. In: Knudsen, L.R., Wu, H. (eds.) SAC 2012. LNCS, vol. 7707, pp. 339–354. Springer, Heidelberg (2013)
25. U.S. Department of Commerce, National Institute of Standards and Technology: Secure Hash Standard (SHS) (Federal Information Processing Standards Publication 180-4). http://csrc.nist.gov/publications/fips/fips1804/fips1804.pdf (2012)