Bitsliced Implementations of the PRINCE, LED and RECTANGLE Block Ciphers on AVR 8-Bit Microcontrollers
Abstract
Driven by industry demand for low-cost cryptosystems, many lightweight block ciphers have emerged, each excelling in particular implementation features. One innovative design is the block cipher PRINCE: to meet the requirement for low-latency, instantaneous encryption, NXP Semiconductors and its academic partners cooperated to design it. Another good example is the block cipher LED, which is very compact in hardware and whose designers also aimed to maintain reasonable software performance. In this paper, we demonstrate how to achieve high software performance for these two ciphers on AVR 8-bit microcontrollers using the bitslice technique. Our bitsliced implementations speed up the execution of these two ciphers several times while using less memory than previous work. In addition to these two nibble-oriented ciphers, we also evaluate the software performance of a newly proposed lightweight block cipher, RECTANGLE, whose design takes bitslicing into consideration. Our results show that RECTANGLE ranks very high among existing block ciphers on 8-bit microcontrollers in real-world usage scenarios.
Keywords
PRINCE LED RECTANGLE Bitslice Block cipher Lightweight Cryptography Microcontroller Wireless sensor AVR ATtiny Implementation
1 Introduction
In several emerging areas, and in the era of the Internet of Things (IoT), where hundreds of billions of highly constrained devices are interconnected, work together and typically communicate wirelessly, security and privacy can be very important. Cryptographic techniques are required for those devices and will be implemented on ever more devices, now and in the future. Different application scenarios call for different efficiency measures to evaluate whether a cryptographic technique is suitable. For hardware implementations, the dominant metrics are gate count (power, cost), energy, latency, etc. The corresponding application examples are RFID and low-cost sensors, healthcare devices and battery-powered devices, memory encryption, in-vehicle devices and industrial control systems. For software implementations, memory (ROM/RAM) and execution time are the important measures. Again, application examples are in-vehicle devices, sensors and consumer electronics.
Compared with general-purpose cryptographic algorithms, lightweight cryptographic primitives, which relax implementation requirements, have more advantages in those scenarios. In this paper, we mainly discuss block ciphers, which are the most versatile of the symmetric ciphers. Lightweight block ciphers can overcome gate-count limitations in small chips. For example, a state-of-the-art AES-128 [1] hardware implementation uses 2400 GE (Gate Equivalents) [7], while PRESENT-128 [5], LED-128 [3, 4] and SIMON-128 [6] offer implementations of 1391 GE [8], 1265 GE [4] and 1234 GE [6], respectively. As the gate count is small, the energy consumption is small. In applications where an instantaneous response is highly desired, lightweight primitives such as PRINCE [2] and SIMON can achieve the same or lower latency than AES with, respectively, 1/40 and 1/52 of the gate count at 130 nm low-leakage Faraday libraries. In other applications, such as sensor nodes or RFID tags, the inexpensive microprocessors and microcontrollers typically have a limited power budget and severely constrained memory (RAM and flash). For example, the ATtiny45 [11] has just 4 KBytes of flash and 256 bytes of RAM. There are lightweight ciphers, such as PRIDE [9], that can use no RAM and achieve throughput similar to that of AES with half the flash bytes.
Over the last decade, the cryptographic community has made significant efforts in the development of lightweight block ciphers. As a result, there are now more than a dozen lightweight block ciphers for industry to choose from, including PRESENT [5] and CLEFIA [12], which have become ISO/IEC lightweight block cipher standards, and many others, such as HIGHT [13], KATAN/KTANTAN [14], KLEIN [15], LBlock [16], LED [3], TWINE [17], PRINCE [2], SIMON and SPECK [6], PRIDE [9], RECTANGLE [18], etc. Each of them excels with respect to certain features. We refer to a web page [19] for a complete list of lightweight block ciphers. This has led to work and projects that improve and evaluate the performance of those lightweight block ciphers. With respect to software implementation on microcontrollers, several survey papers and open projects provide benchmarking results and reports on the performance of lightweight block ciphers [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33].
In [25], a web page of the ECRYPT II project, compact implementations and a performance evaluation of 12 low-cost block ciphers on the AVR ATtiny45 8-bit microcontroller are presented. The set of analyzed ciphers includes the low-cost ciphers designed up to the publication of the corresponding paper [24], and thus it does not contain recent designs. The authors introduce a comparison metric that takes into account the code size (ROM and RAM), the cycle count and the energy consumption. The implementations of the 12 ciphers come from 12 different designers, and the code is written in assembly. Encryption, decryption and key management routines all have to be implemented, and no usage scenarios (message length and mode of operation) are involved. The authors of [29] implemented 21 low-cost (5 classical and 16 lightweight) block ciphers on the WSN430 sensor, which is based on the 16-bit TI MSP430 microcontroller. This is the biggest collection of low-cost cipher implementations available on a 16-bit microcontroller. However, as [32] points out, some of the implemented ciphers do not verify the test vectors. Neither of these two projects has been active for a long time.
Recently, a new benchmarking framework [32] was presented at the NIST Lightweight Cryptography Workshop 2015 [36]. Learning from the strengths and weaknesses of the previous benchmarking frameworks, the authors designed a more flexible and powerful framework for the evaluation of lightweight ciphers on different embedded devices commonly used in the IoT context. By the time of publication, they had studied the software performance of 13 lightweight block ciphers on three different devices: 8-bit AVR, 16-bit MSP and 32-bit ARM. Their evaluation considers two typical usage scenarios that resemble security-related operations commonly performed by real-world IoT devices. They also maintain a web page [34] with the most recent results. The Triathlon challenge [33] was announced to improve those results and to collect more implementations and performance evaluations of newer designs. They introduce a “Figure of Merit” (FOM), according to which an overall ranking of ciphers can be assembled.
Based on their current results in [32], the NSA designs Simon and Speck, AES and the LS-designs are among the smallest and fastest ciphers on all platforms. Unfortunately, PRINCE, which is superior in terms of low latency, and LED, which is very compact in hardware, get the lowest ranks. Some newly proposed designs are also not included in their list, such as RECTANGLE [18], which was published in 2014 and presented at the NIST Lightweight Cryptography Workshop 2015 [36].
In this paper, we aim to contribute to the performance benchmarks of PRINCE, LED and RECTANGLE on 8-bit microcontrollers. Generally speaking, we expect to improve the performance of those ciphers on 8-, 16- and 32-bit microcontrollers. It nevertheless seems reasonable to begin with 8-bit microcontrollers, since the performance of those ciphers on 16- and 32-bit microcontrollers will be much better than what they can achieve on 8-bit ones, and 8- and 16-bit microcontrollers constitute an overwhelming part of the total microcontroller market [35]. Moreover, optimizations on 8-bit microcontrollers can serve as a strong reference for those on 16- and 32-bit microcontrollers.
1.1 Related Work
For nibble-oriented and byte-oriented ciphers, implementers usually use look-up tables (LUTs), which may need a large amount of memory to achieve high throughput. Both PRINCE and LED can be seen as nibble-oriented. Specifically, the original components of the ciphers can be described as operations on nibbles (intra-nibble and inter-nibble). For PRINCE, there are two related efforts that aim to improve the software performance on 8-bit AVR microcontrollers. In the first work [37], two implementations of PRINCE are presented. The first is a T-table implementation, which combines different operations within a round into a single table look-up operation. The second is a block-parallel implementation, which stores two nibbles in one register and processes them in parallel wherever possible. The S-boxes are stored as two 256-byte tables. In these two implementations, the LUTs are too large to store in RAM. Thus they are placed in the programmable flash memory, each access to which takes 1 more clock cycle than an access to RAM.
The second work [38] presents a nibble-sliced implementation, which stems from bitslicing. Nibble-slicing is custom-made for nibble-oriented permutation layers. Similar to the block-parallel implementation in the first work, two nibbles are stored in one register. However, there is a difference between the two works: the block-parallel implementation processes nibbles within one block in parallel, while the nibble-sliced implementation processes two blocks in parallel; specifically, two nibbles in the same position of two blocks are stored in one register. Thus, it needs 16 registers to store the states of two blocks. As in the first work, the second work also uses byte-oriented LUTs, which are 256-byte tables stored in flash memory, for the S-box computation. Due to the nibble-slicing, the S-box computation and the SR operations can be merged, which helps to reduce the execution time. In the second work, cycle counts and memory consumption are derived using AVR Studio simulations. The code in both works was implemented in assembly. The performance evaluation in the first work includes the cycles and code for nibble reordering, while it is unclear whether the cycles and code for nibble slicing are taken into account in the second work. Usage scenarios are not considered in either work.
Work to improve the performance of LED on 8-bit AVR microcontrollers is quite rare. We can only refer to [34] for an open-source software implementation of LED on 8-bit microcontrollers. Unfortunately, the performance achieved by the C implementations in [32, 34] is quite unsatisfactory.
1.2 Our Contributions
Table 1. Results for Scenario 1 (encryption of 128 bytes of data using CBC mode)
Cipher  Implementation  Code [Bytes]  RAM [Bytes]  Time [Cycles]
I: Encryption + Decryption (including key schedule)
PRINCE  Triathlon [32]  5358  \(^c\)(\(374-70\)) 304  243396
PRINCE  This work \(^a\)  2454  248  84656
LED  Triathlon [32]  5156  \(^c\)(\(574-88\)) 486  2221555
LED  This work \(^a\)  2374  232  362889
RECTANGLE  This work  682  310  60298
II: Encryption (without key schedule)
PRINCE  Triathlon [32]  4210  \(^c\)(\(174-46\)) 128  121137
PRINCE  Block-Parallel [37]  \(^b\)1574  \(^b\)24  \(^b\)52048
PRINCE  This work \(^a\)  1746  0  40736
LED  Triathlon [32]  2600  \(^c\)(\(242-66\)) 176  1074961
LED  This work \(^a\)  1412  0  180384
RECTANGLE  This work  250  0  29148
III: Decryption (without key schedule)
PRINCE  Triathlon [32]  4352  \(^c\)(\(198-70\)) 128  122082
PRINCE  This work \(^a\)  1746  0  40976
LED  Triathlon [32]  3068  \(^c\)(\(280-88\)) 192  1146226
LED  This work \(^a\)  1414  0  182128
RECTANGLE  This work  252  0  29788
With respect to PRINCE, in Scenario 1, we achieve a 2.88\(\times \) boost in throughput with 1/1.23 of the RAM and 1/2.18 of the flash memory compared with [32], and a 1.28\(\times \) boost in throughput compared with [37]. In Scenario 2, we achieve a 4.67\(\times \) boost in throughput with the same RAM and 1/2.09 of the flash compared with [32], and similar throughput with 1/9.17 of the RAM and similar flash compared with [38]. For LED, compared with [32], our implementation achieves a 6.12\(\times \) boost in throughput with 1/2.09 of the RAM and 1/2.17 of the flash in Scenario 1, and an 11.27\(\times \) boost in throughput with the same RAM and 1/2.72 of the flash in Scenario 2. As shown in this work, PRINCE, which gets a low rank in [32], is actually very efficient on 8-bit AVR devices.
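The ratios quoted in this paragraph follow directly from the entries of the result tables. As a sanity check, the sketch below recomputes them from the reported code sizes, RAM figures and cycle counts. The helper names `ratio` and `gains` and the dictionary layout are illustrative choices and not part of the original work.

```python
def ratio(baseline, ours):
    """Return baseline/ours rounded to two decimals."""
    return round(baseline / ours, 2)

# Entries are (code bytes, RAM bytes, cycles), copied from the result tables.
S1 = {  # Scenario 1, part I: encryption + decryption including key schedule
    "PRINCE": {"triathlon": (5358, 304, 243396), "ours": (2454, 248, 84656)},
    "LED":    {"triathlon": (5156, 486, 2221555), "ours": (2374, 232, 362889)},
}
S2 = {  # Scenario 2, FixOrder variants of this work
    "PRINCE": {"triathlon": (4420, 24, 17271), "ours": (2118, 24, 3696)},
    "LED":    {"triathlon": (2602, 24, 143317), "ours": (956, 24, 12714)},
}

def gains(table, cipher):
    """Improvement factors of this work over the Triathlon implementation [32]."""
    base, ours = table[cipher]["triathlon"], table[cipher]["ours"]
    return {"flash": ratio(base[0], ours[0]),
            "ram": ratio(base[1], ours[1]),
            "speed": ratio(base[2], ours[2])}

for name, table in (("Scenario 1", S1), ("Scenario 2", S2)):
    for cipher in table:
        print(name, cipher, gains(table, cipher))

# Encryption-only comparison with the block-parallel PRINCE of [37] (part II),
# and RAM comparison with the nibble-sliced PRINCE of [38] (Scenario 2).
print("PRINCE speed vs [37]:", ratio(52048, 40736))
print("PRINCE RAM vs [38]:", ratio(220, 24))
```

Each printed factor matches the value stated in the text, for example a 2.88 times speed-up for PRINCE in Scenario 1.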
We also aim to contribute performance benchmarks, in real-world usage scenarios, for the newly proposed cipher RECTANGLE. Our results show that RECTANGLE ranks very high in those scenarios and is comparable to SIMON in performance, see Table 3.
Table 2. Results for Scenario 2 (encryption of 128 bits of data using CTR mode)
Cipher  Implementation  Code [Bytes]  RAM [Bytes]  Time [Cycles]
PRINCE  Triathlon [32]  4420  (\(68-44\)) 24  17271
PRINCE  Nibble-Slice [38]  \(^a\)2382  \(^a\)220  \(^a\)3606
PRINCE  This work (FixOrder)  2118  24  3696
PRINCE  This work (ReOrder)  2642  24  4236
LED  Triathlon [32]  2602  (\(91-67\)) 24  143317
LED  This work (FixOrder)  956  24  12714
LED  This work (ReOrder)  1480  24  13254
RECTANGLE  This work (LessTime)  582  24  3405
RECTANGLE  This work (LowFlash)  428  24  3995
The rest of the paper is organized as follows. Section 2 clarifies our target device, the scenarios considered and the performance measurement metrics. Sections 3, 4 and 5 respectively demonstrate how to achieve high software performance for PRINCE, LED and RECTANGLE on AVR 8-bit microcontrollers using the bitslice technique. For each cipher, after a brief description, we show how to rearrange the state bits and how to implement its main operations in a bitsliced manner with a minimal number of instructions in the two usage scenarios. Section 6 summarizes the results of this work.
Table 3. Updated results for cipher performance in Scenario 1 and Scenario 2
Scenario 1 (encryption of 128 bytes of data using CBC mode)
Cipher  Code [Bytes]  RAM [Bytes]  Time [Cycles]  \(p_{i}\)
Speck\(^b\)  560  280  44264  3.21
Simon\(^b\)  566  320  64884  3.86
REC\(^b\)  682  310  60298  3.92
PRIN\(^b\)  2454  248  84656  7.36
AES\(^{a}\)  3010  (\(408-70\)) 338  58246  8.15
PRES\(^{a}\)  2160  (\(448-46\)) 402  245232  11.13
LED\(^b\)  2374  232  362889  13.44
Scenario 2 (encryption of 128 bits of data using CTR mode)
Cipher  Code [Bytes]  RAM [Bytes]  Time [Cycles]  \(p_{i}\)
Speck\(^b\)  294  24  2563  3.00
Simon\(^b\)  364  24  4181  3.87
REC\(^b\)  428  24  3995  4.01
AES\(^{a}\)  1246  (\(81-49\)) 32  3408  6.90
LED\(^b\)  956  24  12714  9.21
PRIN\(^b\)  2118  24  3696  9.65
PRES\(^{a}\)  1294  (\(56-32\)) 24  16849  11.97
2 Our AVR Implementations, Considering Scenarios and Performance Measurement
The specific target device in this work is the AVR ATmega128 8-bit microcontroller, which has 128 KBytes of flash, 4 KBytes of RAM and 32 8-bit general-purpose registers. The ATmega128 uses an 8-bit RISC microprocessor. The arithmetic and logical instructions usually have a destructive source operand, i.e. the destination register is one of the source registers. Most of the numeric processing instructions take one clock cycle. Instructions that load data from and store data to RAM take 2 clock cycles, and those that access flash memory take 3 clock cycles. The opcodes of most instructions use 2 bytes; some special types use 4 bytes. For a detailed introduction to the 8-bit AVR instructions, please refer to [10, 32]. To achieve optimal performance, we code the ciphers in assembly; the assembly code was written and compiled using Atmel Studio 6.2. Cycle counts, code sizes and RAM usage are also determined using this tool.
For each cipher, we have implementations targeting the two scenarios introduced in [32].

Scenario 1 - Communication Protocol [32]. This scenario covers the need for secure communication in sensor networks and between IoT devices. Sensitive data is encrypted and decrypted using a lightweight block cipher in CBC mode of operation. The length of the data exchanged in a single transmission is fixed to 128 bytes. The master key is stored in the device’s RAM; the round keys are computed from it using the key schedule and then stored in RAM for later use. The key schedule does not modify the master key. The data to be sent and the initialization vector are also stored in RAM. Encryption is performed in place to reduce the RAM consumption.

Scenario 2 - Challenge-Handshake Authentication Protocol [32]. This scenario covers the need for authentication in the IoT. In the authentication protocol, the block cipher is used in CTR mode to encrypt 128 bits of data. The device has the cipher round keys stored in flash memory; no master key is stored in the device, and consequently no key schedule is required. The data to be encrypted is stored in RAM, as is the counter value. To reduce the RAM usage, the encryption is done in place.
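To make the two scenarios concrete, the following minimal sketch runs in-place CBC over a 128-byte buffer and in-place CTR over a 128-bit buffer with a 64-bit block size, as used by PRINCE, LED and RECTANGLE. The functions `toy_encrypt` and `toy_decrypt` are hypothetical stand-ins with no security value, and the counter format (a 64-bit value incremented per block) is an assumption; neither is taken from [32].

```python
MASK64 = (1 << 64) - 1

def toy_encrypt(block, key):
    """Placeholder 64-bit permutation (NOT a real cipher): XOR key, rotate left by 13."""
    x = (block ^ key) & MASK64
    return ((x << 13) | (x >> 51)) & MASK64

def toy_decrypt(block, key):
    """Inverse of toy_encrypt: rotate right by 13, XOR key."""
    x = ((block >> 13) | (block << 51)) & MASK64
    return x ^ key

def cbc_encrypt_in_place(buf, iv, key):
    """Scenario 1 style: CBC encryption that overwrites the plaintext buffer."""
    prev = iv
    for off in range(0, len(buf), 8):
        block = int.from_bytes(buf[off:off + 8], "big") ^ prev
        prev = toy_encrypt(block, key)
        buf[off:off + 8] = prev.to_bytes(8, "big")

def cbc_decrypt_in_place(buf, iv, key):
    prev = iv
    for off in range(0, len(buf), 8):
        cipher_block = int.from_bytes(buf[off:off + 8], "big")
        plain = toy_decrypt(cipher_block, key) ^ prev
        buf[off:off + 8] = plain.to_bytes(8, "big")
        prev = cipher_block

def ctr_crypt_in_place(buf, counter, key):
    """Scenario 2 style: CTR mode; the same call encrypts and decrypts."""
    for i, off in enumerate(range(0, len(buf), 8)):
        keystream = toy_encrypt((counter + i) & MASK64, key)
        block = int.from_bytes(buf[off:off + 8], "big") ^ keystream
        buf[off:off + 8] = block.to_bytes(8, "big")

message = bytearray(range(128))          # 128 bytes = sixteen 64-bit blocks
cbc_encrypt_in_place(message, iv=0x0123456789ABCDEF, key=0x0F1E2D3C4B5A6978)
challenge = bytearray(b"16-byte nonce!!!")  # 128 bits = two 64-bit blocks
ctr_crypt_in_place(challenge, counter=1, key=0x0F1E2D3C4B5A6978)
```

With 64-bit blocks, the 128 bits of Scenario 2 amount to exactly two blocks, and the two CTR keystream blocks do not depend on each other, so they can be computed side by side.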
We consider the same three metrics of cipher performance as in [32]: code size, RAM usage and execution time. Code size includes the Code and Data sections. The Code section contains the bytes used by the binary code; the Data section contains global initialized variables (such as the flash used by round constants). The measurements do not consider the code size of the main function, where all the cipher operations are put together. RAM usage includes scenario-specific RAM data, such as the data to encrypt, master keys, round keys and initialization vectors. The execution time is expressed as the number of processor cycles spent executing the procedures, such as encryption, key schedule or decryption. In addition, our measurements include the cost of the rearrangement operations on the plaintexts, ciphertexts and round keys.
3 PRINCE AVR Implementations
In this section, we present the first (to our knowledge) bitsliced implementation of the PRINCE cipher on an 8-bit AVR microcontroller.
3.1 The PRINCE Cipher
PRINCE operates on 64-bit blocks and uses a 128-bit key k, which is composed of two 64-bit elements, \(k_0\) and \(k_1\). It is based on the so-called FX-construction. The 128-bit key is extended to 192 bits by the mapping \((k_0\Vert k_1)\rightarrow (k_0\Vert k_0'\Vert k_1) = (k_0\Vert (k_0\ggg 1)\oplus (k_0\gg 63)\Vert k_1)\). \(k_0\) and \(k_0'\) are used as whitening keys, while \(k_1\) is the 64-bit key used without updates by the 12-round block cipher referred to as PRINCE\(_{core}\). PRINCE has an \(\alpha \)-reflection property: decryption can reuse exactly the same procedure as encryption by simply XORing a constant \(\alpha \) to the third element \(k_1\) of the extended key and exchanging the order in which \(k_0\) and \(k_0'\) are used. Both procedures, however, use the inverse round function as well as the round function.
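The key extension and the key change needed for decryption can be sketched in a few lines. The code below treats \(k_0\) and \(k_1\) as 64-bit integers; the function names are illustrative, and the value of \(\alpha \) is not stated in this section but taken from the PRINCE specification [2].

```python
MASK64 = (1 << 64) - 1
ALPHA = 0xC0AC29B7C97C50DD  # reflection constant from the PRINCE specification [2]

def rotr64(x, n):
    """Rotate a 64-bit word right by n positions (0 < n < 64)."""
    return ((x >> n) | (x << (64 - n))) & MASK64

def expand_key(k0, k1):
    """(k0 || k1) -> (k0 || k0' || k1) with k0' = (k0 >>> 1) xor (k0 >> 63)."""
    k0_prime = rotr64(k0, 1) ^ (k0 >> 63)
    return k0, k0_prime, k1

def decryption_key(k0, k1):
    """Extended key that turns the encryption procedure into decryption:
    swap the two whitening keys and XOR alpha into k1."""
    k0, k0_prime, k1 = expand_key(k0, k1)
    return k0_prime, k0, k1 ^ ALPHA

print([hex(part) for part in expand_key(0x8000000000000000, 0)])
```

Running the encryption procedure with the triple returned by `decryption_key` is what the \(\alpha \)-reflection property refers to; the cipher core itself is not reproduced here.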
The round function of PRINCE is AES-like and operates on a \(4\times 4\) state matrix of nibbles. It can be seen as composed of the following operations: KeyXor (corresponds to \(k_i\)-add in the cipher specification) and RCXor (corresponds to \(RC_i\)-add), S-box (corresponds to the S-Layer), MixColumns (corresponds to the matrix \(M'\)-layer) and ShiftRows (corresponds to SR, where \(SR\circ M'=M\) is called the Linear Layer), among which only MixColumns is an involution. Thus, an implementation of PRINCE has to instantiate the following operations: KeyXor and RCXor, the S-box and the inverse S-box, MixColumns, and ShiftRows and inverse ShiftRows. For more details about PRINCE, please refer to [2].
We implement all of those operations in a bitsliced manner, whereas all previous work is nibble-oriented and based on LUTs. Before presenting the bitsliced implementation of each operation, we show how we slice the state.
3.2 PRINCE AVR Implementations
To implement the operations in a bitsliced manner, we gather the bits with index \(*yz\), i.e. bits with the same row index and the same slice index are gathered together. We call the resulting bit set a lane; all bits of a lane are held in the same register in our implementations. We rearrange the state as depicted in Fig. 2a.
We then use the following conventions. Let \(\mathbf S \) denote the complete state; then \(\mathbf S [*,y,z]\) denotes a particular lane. In the implementations for Scenario 1, two lanes of one state, \(\mathbf S [*,y,z]\) and \(\mathbf S [*,y+2,z]\) (\(y\in \{0,1\}\), \(z\in \{0,\dots ,3\}\)), are stored in one register, in the lower half and the higher half respectively. In Scenario 2, two lanes of two states, \(\mathbf S [*,y,z]\) and \(\mathbf S '[*,y,z]\) (\(y\in \{0,\dots ,3\}\), \(z\in \{0,\dots ,3\}\)), are stored in one register in order to process two blocks in parallel.
Bitsliced Implementation of the KeyXor and RCXor. Since we have rearranged the encryption state, we must also rearrange the key and the round constants in the same way. The KeyXor and RCXor operations can be merged. In Scenario 1, during the key schedule, the master key is extended, and the resulting 3 subkeys are rearranged and XORed with the pre-rearranged round constants to generate the round-key-constant material, which is then stored in RAM. In Scenario 2, the round-key-constant material is extended (to encrypt two blocks in parallel) and stored in flash memory.
Bitsliced Implementation of the S-Box and the Inverse S-Box. By rearranging the state bits, we can implement the S-box and the inverse S-box using logical operations instead of LUTs. In our rearrangement, the 4 lanes within one row correspond to the 4 input (and output) bits of the S-boxes; thus 8 S-boxes can be computed in parallel using a sequence of logical instructions operating on 8-bit registers, since 2 lanes share one register.
We first found bitsliced implementations of the \(4\times 4\) S-box (resp. inverse S-box) of PRINCE using an automatic search tool [41]. The operations in the resulting bitsliced implementations use the three-operand instruction format ‘operator destination, source\(_1\), source\(_2\)’. On the AVR ATtiny, however, the destination register of an instruction is one of its source registers, i.e. the two-operand format ‘operator destination, source’ is used. Thus, we manually translate the primary instruction sequences into two-operand instruction sequences. In our translation, we try to minimize the required clock cycles and to process in place, i.e. the outputs are in the same registers as the inputs.
The primary bitsliced implementation of the S-box (resp. inverse S-box) of PRINCE needs 17 (resp. 16) terms. Translated into AVR instructions, it becomes an instruction sequence of length \(17+4=21\) (resp. \(16+6=22\)), with 4 (resp. 6) additional copy register (MOV) instructions. By taking advantage of the copy register pair (MOVW) instruction and processing 16 S-boxes together instead of 8, the S-layer (resp. inverse S-layer) of PRINCE needs \(17\times 2+4=38\) (resp. \(16\times 2+6=38\)) instructions instead of \(21\times 2=42\) (resp. \(22\times 2=44\)).
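The principle of computing 8 S-boxes at once with logical operations on 8-bit registers can be illustrated as follows. This sketch does not reproduce the 17-term sequence found with the search tool [41]; instead it derives, for each output bit, the algebraic normal form of the PRINCE S-box (table taken from the specification [2]) and evaluates it with AND and XOR on whole registers. The bit order inside a register and the helper names are assumptions made for illustration.

```python
PRINCE_SBOX = [0xB, 0xF, 0x3, 0x2, 0xA, 0xC, 0x9, 0x1,
               0x6, 0x7, 0x8, 0x0, 0xE, 0x5, 0xD, 0x4]

def anf_coefficients(sbox, out_bit):
    """Moebius transform: truth table of one output bit -> ANF coefficients."""
    coeffs = [(sbox[x] >> out_bit) & 1 for x in range(16)]
    for i in range(4):
        step = 1 << i
        for x in range(16):
            if x & step:
                coeffs[x] ^= coeffs[x ^ step]
    return coeffs

def bitsliced_sbox(sbox, lanes):
    """lanes[i] is an 8-bit register holding input bit i of 8 independent S-boxes.
    Returns four registers holding the four output bits of those 8 S-boxes."""
    out = []
    for out_bit in range(4):
        coeffs = anf_coefficients(sbox, out_bit)
        acc = 0
        for monomial in range(16):
            if not coeffs[monomial]:
                continue
            term = 0xFF                      # constant-one monomial
            for i in range(4):
                if monomial & (1 << i):
                    term &= lanes[i]         # AND of the selected input lanes
            acc ^= term                      # XOR of all monomials
        out.append(acc)
    return out

def pack(nibbles):
    """8 nibbles -> 4 registers; bit j of register i is bit i of nibble j."""
    return [sum(((n >> i) & 1) << j for j, n in enumerate(nibbles)) for i in range(4)]

def unpack(lanes):
    """Inverse of pack."""
    return [sum(((lanes[i] >> j) & 1) << i for i in range(4)) for j in range(8)]

inputs = [0x0, 0x1, 0x2, 0x3, 0xC, 0xD, 0xE, 0xF]
print(unpack(bitsliced_sbox(PRINCE_SBOX, pack(inputs))))
```

A search tool reduces the number of logical terms far below what this generic evaluation uses; the sketch only shows that no table look-up is needed once the state bits are arranged in lanes.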
Bitsliced Implementation of the MixColumns. According to the observations on the linear layer of PRINCE in [39, 40], the MixColumns of PRINCE can be seen as the composition of three operations: a mirror on the rows, the addition of a parity bit, and slice-wise rotations by 0, 1, 2 or 3 positions. Thus, it can be expressed as the parallel application of 16 independent transformations, each operating on one slice of the internal state. Figure 2c explains those transformations in a 3-dimensional way.
With our rearrangement of the state bits, the 4 bits in the same position within 4 different slices are stored in the same register. Thus, the parity bits of 8 slices can be computed in parallel. The mirror on the rows and the slice-wise rotations can be combined into an exchange of bits among different lanes.
In our implementation for Scenario 1, since the lower half and the higher half of one register hold lanes in two rows (lane \(\mathbf S [*,y,z]\) and lane \(\mathbf S [*,y+2,z]\)), the addition of the parity bit takes 7 instructions for 4 slices. Thus, the addition of the parity bit takes 28 instructions for the whole state. Since column \(\mathbf S [0,*,*]\) and column \(\mathbf S [3,*,*]\) of the PRINCE state \(\mathbf S \) go through the same MixColumns operation M\(_0\), slice \(\mathbf S [0,*,z]\) and slice \(\mathbf S [3,*,z]\) within column \(\mathbf S [0,*,*]\) and column \(\mathbf S [3,*,*]\) go through the same mirror and rotation operations for \(z\in \{0,\dots ,3\}\). Thus, bit 0 and bit 3 in a register, and bit 4 and bit 7 in a register, go through the same operations. Likewise, since column \(\mathbf S [1,*,*]\) and column \(\mathbf S [2,*,*]\) of the state go through the same MixColumns operation M\(_1\), bit 1 and bit 2 in a register, and bit 5 and bit 6 in a register, go through the same operations. As a result, we achieve a 4-way parallel implementation of the combination of the mirror on the rows and the slice-wise rotations, which takes \(2\times 9+2\times 8=34\) instructions per state.
In our implementation for Scenario 2, since the lower half and the higher half of one register hold the same lane of two states (lane \(\mathbf S [*,y,z]\) and lane \(\mathbf S '[*,y,z]\)), the addition of the parity bit takes 8 instructions for 8 slices. Thus, the addition of the parity bit takes 32 instructions for two states (thus 16 instructions per state). Similar to the implementation for Scenario 1, bits 0, 3, 4 and 7 in a register go through the same operations, and bits 1, 2, 5 and 6 in a register go through another set of operations. We also achieve 4-way and 8-way parallel implementations of the combination of the mirror on the rows and the slice-wise rotations, which take \(4\times 16=64\) instructions for 2 states (thus 32 instructions per state).
Overall, for Scenario 1, the MixColumns takes \(28+34=62\) instructions per state; for Scenario 2, it takes \(16+32=48\) instructions per state.
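The parity-addition step alone can be sketched as follows, under our reading of the register layouts described above; the mirror and rotation steps depend on Fig. 2c and are not reproduced. Adding the parity of a slice to each of its four bits replaces every bit by the XOR of the other three. The function names are hypothetical, and the instruction counts in the comments are a plausible accounting rather than the authors' listing.

```python
def swap_nibbles(r):
    """Model of the AVR SWAP instruction on an 8-bit register."""
    return ((r << 4) | (r >> 4)) & 0xFF

def add_parity_scenario1(r0, r1):
    """Scenario 1 layout: r0 holds rows 0 (low half) and 2 (high half),
    r1 holds rows 1 and 3, all for the same z. Handles 4 slices.
    Roughly MOV, EOR, MOV, SWAP, EOR, EOR, EOR, i.e. 7 two-operand instructions."""
    t = r0 ^ r1              # low half: row0^row1, high half: row2^row3
    t ^= swap_nibbles(t)     # both halves now hold the full parity
    return r0 ^ t, r1 ^ t

def add_parity_scenario2(rows):
    """Scenario 2 layout: rows[y] holds lane y of two states for the same z.
    Handles 8 slices. Roughly MOV plus 3 EOR for the parity and 4 EOR to add it."""
    p = rows[0] ^ rows[1] ^ rows[2] ^ rows[3]
    return [r ^ p for r in rows]

a = [0x3, 0x5, 0x9, 0xE]                       # four example lanes (rows 0..3)
n0, n1 = add_parity_scenario1(a[0] | (a[2] << 4), a[1] | (a[3] << 4))
print(hex(n0), hex(n1))
```

Four values of z then give \(4\times 7=28\) and \(4\times 8=32\) operations, which agrees with the counts stated for the two scenarios.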
Bitsliced Implementation of the ShiftRows and the Inverse ShiftRows. With our rearrangement of the state bits, ShiftRows and inverse ShiftRows correspond to rotations of the bits within lanes, as depicted in Fig. 2b. This requires rotating the higher half and the lower half of the 8-bit registers separately. We implement this using the logical AND (AND), logical shift left and right (LSL and LSR), bit load from the T flag in SREG to a bit in a register (BLD) and bit store from a bit in a register to the T flag in SREG (BST) instructions.
For Scenario 1, it takes \(4\times 19=76\) instructions to implement ShiftRows (or inverse ShiftRows) per state. For Scenario 2, it takes \(4\times 19+2=78\) instructions to implement ShiftRows (or inverse ShiftRows) per 2 states (thus 39 instructions per state).
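Rotating the two halves of an 8-bit register independently can be modeled as below. The first function is a plain reference; the second uses only shifts and masks, in the spirit of the AND/LSL/LSR instructions mentioned above, and applies when both halves rotate by the same amount. The rotation direction and the amounts per lane are given by Fig. 2b and are left as parameters here; the function names are illustrative.

```python
def rotl4(n, r):
    """Rotate a 4-bit value left by r positions."""
    r &= 3
    return ((n << r) | (n >> (4 - r))) & 0xF

def rotate_halves(reg, r_lo, r_hi):
    """Reference: rotate the low and high nibble of reg by separate amounts."""
    return (rotl4(reg >> 4, r_hi) << 4) | rotl4(reg & 0x0F, r_lo)

def rotate_both_halves(reg, r):
    """Shift-and-mask variant for the case where both halves rotate by r (0..3)."""
    keep = ((0xF << r) & 0xF) * 0x11     # bits that stay inside their nibble
    wrap = (0xF >> (4 - r)) * 0x11       # bits that wrap around inside the nibble
    return ((reg << r) & keep) | ((reg >> (4 - r)) & wrap)

print(hex(rotate_halves(0x81, 1, 1)), hex(rotate_both_halves(0x81, 1)))
```

When the two halves hold lanes of different rows, as in Scenario 1, they need different rotation amounts, which is where single-bit moves such as BST and BLD become useful.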
4 LED AVR Implementations
In this section, we present the first (to our knowledge) bitsliced implementation of the LED cipher on an 8-bit AVR microcontroller.
4.1 The LED Cipher
LED is a 64-bit block cipher that uses a key size from 64 to 128 bits and is based on a substitution-permutation network (SPN). The two primary instances, LED with a 64-bit key (named LED-64) and LED with a 128-bit key (named LED-128), have 32 rounds and 48 rounds, respectively. In this paper, we focus on LED-128. The key schedule of LED is very simple. In the case of LED-128, the master key k is composed of two 64-bit subparts, \(k=k_1\Vert k_2\), which are alternately XORed to the internal state every 4 rounds. The 4-round operation between two key additions is called a step. The whole encryption process of LED is described using key additions and step operations.
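The alternation of key additions and steps in LED-128 can be sketched as a simple control-flow skeleton. The `step` argument is a hypothetical placeholder for the 4-round step function, which is not implemented here; the 12 steps follow from 48 rounds with 4 rounds per step, and the final key addition after the last step follows the LED specification [4].

```python
def led128_skeleton(state, k1, k2, step):
    """Control flow of LED-128 on a 64-bit integer state.
    step(state, index) stands in for the 4-round step function."""
    keys = (k1, k2)
    for i in range(12):                 # 48 rounds / 4 rounds per step
        state ^= keys[i % 2]            # key addition, alternating k1 and k2
        state = step(state, i)
    return state ^ keys[12 % 2]         # final key addition (k1 again)

# With an identity step, the 13 key additions (7 with k1, 6 with k2) collapse to k1.
identity = lambda s, i: s
print(hex(led128_skeleton(0, 0x1111, 0x2222, identity)))
```

Because the subkeys are used unmodified, no round-key expansion is needed beyond keeping \(k_1\) and \(k_2\) available.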
Similar to the round function of PRINCE, the round function of LED is also AES-like and operates on a \(4\times 4\) state matrix of nibbles. It uses the following operations: AddConstants (round constant addition), SubNibbles (corresponds to SubCells in [4]), ShiftRows and MixColumnsSerial. None of those operations is an involution. Thus, we also have to implement their inverse operations. For more details about LED, please refer to [4].
4.2 LED AVR Implementations
State Bits Rearrangement. We follow the naming convention introduced above for the PRINCE state in Fig. 2a to name the parts of the LED state. Our rearrangement of the LED state is quite similar to that of the PRINCE state. One difference is that, in Scenario 1, each lane of the LED state is stored in a whole 8-bit register, i.e. only the higher half of the 8-bit register holds meaningful bits. Thus, the 16 lanes \(\mathbf S [*,y,z]\) are stored in 16 8-bit registers, leaving the lower half of each register empty. In Scenario 2, the lower halves of those 16 registers hold the 16 lanes of the state of another block.
Bitsliced Implementation of the S-Box and the Inverse S-Box. LED uses the PRESENT S-box. Bitsliced implementations of the PRESENT S-box have been studied in previous work. In [38], the authors aim to improve the throughput of PRESENT using the bitslice technique. Their 19-instruction AVR implementation of the PRESENT S-box is based on the 14-term representation found by Courtois in [43]. Using the automatic search tool [41], we try to find optimal bitsliced implementations of both the PRESENT S-box and its inverse. The best solutions also need 14 terms for the PRESENT S-box, and 15 terms for the inverse S-box.
Similar to our work on the PRINCE S-box, we also try to find the best translation from those general solutions to instruction sequences in the AVR instruction set. After our manual optimization, there is a penalty of 4 additional instructions (MOV) to implement the S-box on AVR based on the 14-term (resp. 15-term) sequences. When taking advantage of the MOVW instruction and dealing with two sets of 8-bit registers together, 32 instructions (resp. 34 instructions) are needed to compute 16 S-boxes (in Scenario 1, since the lower part of each register is left empty, it takes 64 instructions to execute 16 S-boxes).
As mentioned above, our implementation of MixColumnsSerial for LED does not execute the last switching operation; instead, we combine this switching operation with the S-box operation. For encryption, the inputs of the S-boxes are switched between bit 0 and bit 2, and between bit 1 and bit 3. For decryption, the outputs of the inverse S-boxes are switched similarly. Thus, our implementation actually realizes bit-permuted versions of the S-box and the inverse S-box; the final instruction count is 33 for the 16 input-switched S-boxes and 34 for the 16 output-switched inverse S-boxes.
Bitsliced Implementation of the ShiftRows and the Inverse ShiftRows. The ShiftRows and the inverse ShiftRows of LED are the same as those of PRINCE. The difference between the implementations of the two ciphers is caused by the difference in the arrangement of the state bits in Scenario 1. \(4\times 12=48\) instructions are needed in Scenario 1 to rotate the 16 lanes of one state, and \(4\times 19=76\) instructions are needed in Scenario 2 to rotate the 32 lanes of two states (thus 38 instructions to rotate 16 lanes on average).
5 RECTANGLE AVR Implementations
Presented at [36], RECTANGLE [18] is the most recent cipher discussed here and is not included in the list of ciphers analyzed in [32]. Thus, in this paper, we aim to provide a software performance benchmark for this cipher to [34].
5.1 The RECTANGLE Cipher
RECTANGLE operates on a 64-bit cipher state. The key length can be 80 or 128 bits. In this paper, we mainly focus on the 128-bit key version (named RECTANGLE-128). The encryption is also based on an SP network. The substitution layer consists of 16 \(4\times 4\) S-boxes in parallel and is called SubColumn. The permutation layer is composed of 3 rotations and is called ShiftRows. For more details about RECTANGLE, please refer to [18].
5.2 RECTANGLE AVR Implementations
State Bits Arrangement. Since the main idea of the design of RECTANGLE is to allow fast implementations using bitslicing techniques, the state arrangement in the bitsliced implementation is quite straightforward. Each 16-bit row of the cipher state is held by 2 registers; thus 8 registers are needed to hold the cipher state.
Bitsliced Implementation of the SBox and the Inverse SBox (SubColumns). Similar to our work on PRINCE and LED Sbox, we first find the optimal general bitsliced implementations of the Sbox and the inverse Sbox, both of which requires 12 terms. Then we try to get the best translation from those general solutions to instruction sequences in AVR instruction set. Our manual optimization needs 2 (resp. 3) additional instructions (mov) penalty to implement the Sbox (resp. inverse Sbox) on AVR. When take advantage of the MOVW instruction, and deal with two sets of 8bit registers together, \(26 = 12\times 2 + 2\) (resp. \(27 = 12\times 2 + 3\)) instructions are needed to finish 16 Sboxes.
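To make the notion of a 12-term bitsliced S-box concrete, the sketch below evaluates all 16 S-boxes at once on four 16-bit rows. The lookup table and the 12-operation sequence are reproduced from our recollection of the RECTANGLE specification [18]; this is not the AVR instruction sequence of this paper, and the helper `sub_column_by_table` is a hypothetical reference used only to cross-check the logic.

```python
MASK = 0xFFFF  # each row is 16 bits wide

# RECTANGLE 4-bit S-box as a lookup table (assumed from the specification).
SBOX = [0x6, 0x5, 0xC, 0xA, 0x1, 0xE, 0x7, 0x9,
        0xB, 0x0, 0x3, 0xD, 0x8, 0xF, 0x4, 0x2]


def sub_column(a0, a1, a2, a3):
    """Apply the S-box to all 16 columns using 12 bitwise operations."""
    t1 = ~a1 & MASK
    t2 = a0 & t1
    t3 = a2 ^ a3
    b0 = t2 ^ t3
    t5 = a3 | t1
    t6 = a0 ^ t5
    b1 = a2 ^ t6
    t8 = a1 ^ a2
    t9 = t3 & t6
    b3 = t8 ^ t9
    t11 = b0 | t8
    b2 = t6 ^ t11
    return b0, b1, b2, b3


def sub_column_by_table(rows):
    """Reference: look up each column separately (row i holds bit i)."""
    out = [0, 0, 0, 0]
    for col in range(16):
        x = sum(((rows[i] >> col) & 1) << i for i in range(4))
        y = SBOX[x]
        for i in range(4):
            out[i] |= ((y >> i) & 1) << col
    return tuple(out)
```

On AVR each of the 12 operations has to be issued twice, once per byte of a 16-bit row, which is where the \(12\times 2\) part of the instruction count comes from.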
Bitsliced Implementation of the ShiftRows. The 1bit rotation of the 16bit row can be carried out using AVR’s logical shift left (LSL), rotate left through carry (ROL) and add with carry (ADC) instructions, together with an all 0 register. We mainly focus on implementing the 12bit and 13bit rotation of the 16bit row with a minimized number of instructions. Thanks for our optimization on the 4bit rotation (both left and right) of 16bit row using swap nibbles (SWAP), copy register pair (MOVW), logical AND with immediate (ANDI) and exclusive OR (EOR) instructions and 2 temporary registers, it only needs 7 instructions to perform 12bit rotation (both left and right) of the 16bit row. Thus the total ShiftRows and the inverse ShiftRows only need 20 instructions per state.
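The rotations can be modeled at byte level as follows. This sketch mimics the effect of the AVR instructions named above on a 16-bit row held as two bytes; it is not the authors' exact instruction sequence (in particular, it uses one temporary instead of MOVW with two), and all function names are hypothetical. The assignment of the 1-, 12- and 13-bit rotations to rows 1, 2 and 3 in `shift_rows` is an assumption based on the cipher specification [18].

```python
def split(x):
    """Split a 16-bit row into (high_byte, low_byte)."""
    return x >> 8, x & 0xFF


def rotl16(x, n):
    """Reference 16-bit left rotation."""
    return ((x << n) | (x >> (16 - n))) & 0xFFFF


def rotl16_by_1(hi, lo):
    """Rotate hi:lo left by 1, in the style of LSL / ROL / ADC with a zero register."""
    carry = lo >> 7                  # LSL lo: bit 7 moves into the carry
    lo = (lo << 1) & 0xFF
    new_carry = hi >> 7              # ROL hi: shift in the carry, bit 7 moves out
    hi = ((hi << 1) & 0xFF) | carry
    lo = (lo + new_carry) & 0xFF     # ADC lo, zero: wrap the carry into bit 0
    return hi, lo


def swap_nibbles(b):
    """Effect of the AVR SWAP instruction on one byte."""
    return ((b << 4) | (b >> 4)) & 0xFF


def rotl16_by_4(hi, lo):
    """Rotate left by 4: swap nibbles, then exchange the low nibbles."""
    hi, lo = swap_nibbles(hi), swap_nibbles(lo)
    t = (hi ^ lo) & 0x0F
    return hi ^ t, lo ^ t


def rotl16_by_12(hi, lo):
    """Rotate left by 12 (right by 4): swap nibbles, then exchange the high nibbles."""
    hi, lo = swap_nibbles(hi), swap_nibbles(lo)
    t = (hi ^ lo) & 0xF0
    return hi ^ t, lo ^ t


def rotl16_by_13(hi, lo):
    """Rotate left by 13 as a 12-bit rotation followed by a 1-bit rotation."""
    return rotl16_by_1(*rotl16_by_12(hi, lo))


def shift_rows(rows):
    """Assumed row offsets: row 0 unchanged, rows 1, 2, 3 rotated by 1, 12, 13."""
    r0, r1, r2, r3 = rows
    return (r0, rotl16_by_1(*r1), rotl16_by_12(*r2), rotl16_by_13(*r3))
```

In this model a 12-bit rotation costs seven elementary operations (two swaps, a copy, an AND and three XORs), and one accounting consistent with the stated total is 3 + 7 + (7 + 3) = 20, although the source does not give this breakdown.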
Bitsliced Implementation of the Key Schedule and Adding Round Key. The 128bit key state are arranged as a \(4 \times 32\) matrix. The key schedule for RECTANGLE128 consists of applying the \(4 \times 4\) Sboxes to the 8 rightmost columns of the four 32bit rows of the key state, 1round generalized Feistel transformation on the 4 rows and 5bit round constant XORing on a single row. The 64bit round key rk consists of the 16 rightmost columns of the four 32bit rows.
In our implementation, applying the \(4 \times 4\) Sboxes to the 8 columns takes 14 logical instructions (28 bytes and 14 cycles), the 1round generalized Feistel transformation takes 18 instructions (36 bytes and 18 cycles), the 5bit round constant XORing takes 2 instructions (4 bytes and 4 cycles). There are two key bytes shared between every two successive round keys. According to this observation, we can use \(8 + 25\times 6\) bytes instead of \(26\times 8\) bytes to store 26 round keys. Meanwhile, by reordering the key bytes and using two additional registers, we can use 6 load instructions instead of 8 when adding the 8byte round keys during encryption and decryption.
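The figures in this paragraph can be tallied with a few lines of arithmetic. The sketch below only restates the numbers given above; the function and variable names are hypothetical.

```python
# (instructions, bytes, cycles) for each step of one key schedule round,
# as stated in the text.
KEY_SCHEDULE_STEPS = {
    "sbox_on_8_columns": (14, 28, 14),
    "generalized_feistel": (18, 36, 18),
    "round_constant_xor": (2, 4, 4),
}


def key_schedule_round_cost():
    """Sum instructions, bytes and cycles over the three steps."""
    return tuple(sum(step[i] for step in KEY_SCHEDULE_STEPS.values())
                 for i in range(3))


def round_key_storage(num_keys=26, key_bytes=8, shared_bytes=2):
    """Return (naive, packed) storage in bytes for the round keys.

    Packed storage keeps the first key in full and, for every later key,
    only the bytes that are not shared with its predecessor.
    """
    naive = num_keys * key_bytes
    packed = key_bytes + (num_keys - 1) * (key_bytes - shared_bytes)
    return naive, packed
```

With the default parameters the packed layout needs 158 bytes against 208 bytes for the naive one, a saving of 50 bytes.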
Our results are consistent with that shown in [18], while evaluation in this paper considers the two typical realworld usage scenarios. In addition, we develop a high throughput implementation (LessTime) and a low flash implementation (LowFlash) in Scenario 2. In the high throughput implementation, two blocks are processed simultaneously. Thus, it allows to load the subkeys one time every two blocks and reduce the cycles by sacrificing 146 bytes of flash than that in the low flash implementation which processes blocks one by one using a loop.
6 Results Summary and Comparisons
The vast majority of the instructions used in our bitsliced implementations take one clock cycle and occupy 2 bytes. There is no memory access except for loading inputs, round keys and round constants and storing outputs. We write the whole program in assembly code, and all operations in our implementations are executed in place in registers. Thus, no extra RAM is used to store local variables during encryption, decryption or the key schedule. However, this inevitably makes it difficult to compare with the inline assembly implementations in [32] or [34], in which stack RAM is needed and PUSH and POP instructions are used to save and restore all modified registers. Thus, in Tables 1, 2 and 3, we subtract the stack consumption from the RAM usage of the implementations in [32] or [34] to make the comparison. Besides, it would be hard to process blocks in parallel while complying with the C interface provided in [31].
All of our implementations have been verified against the test vectors provided in the cipher specifications, and the source code is available online [45].
For Scenario 1, our implementations include the key schedule and encryption procedures, as well as the inverse key schedule (when needed) and decryption procedures. State rearrangement operations on the plaintext, ciphertext and round keys are all included in our measurements. For Scenario 2, we only need to implement the encryption procedure; neither decryption nor the key schedule is needed. Since the 128-bit data is encrypted in CTR mode, it is XORed with the output of the encryption procedure, which takes the counters as input. There are two conventions for handling the two counters in our implementations of PRINCE and LED, because we rearrange the input of the encryption. In the first convention, the two counters must be rearranged before going through the encryption procedure, and the outputs must be rearranged back after the encryption procedure before being XORed with the 128-bit message. We denote this convention ReOrder. In the other convention, we encrypt the two counters directly and XOR the output with the 128-bit message without any rearrangement of the state bits. We denote this convention FixOrder. In our opinion, the FixOrder convention does not affect security and is more efficient, so it can be used as the final performance benchmark.
Our results on the performance of PRINCE, LED and RECTANGLE are summarized in Tables 1 and 2, which also include the results of previous work for comparison. In addition, we have implemented SIMON and SPECK in assembly according to the method provided in [44]. Using our results to update the results in [34], we obtain Table 3. Since AES and PRESENT are coded in assembly, we include only these two ciphers in the comparison. As shown in Table 3, RECTANGLE ranks higher than AES and slightly lower than SIMON in both Scenario 1 and Scenario 2. PRINCE and LED respectively rank higher than AES in Scenario 1 and higher than PRESENT in Scenario 2. We believe that the comparison in [34] is inevitably unfair, since only AES, PRESENT, SIMON and SPECK are coded in assembly. Several other ciphers might also perform better if coded in assembly. As pointed out above, there is also unfairness in comparing an implementation written in pure assembly with one using inline assembly. Besides, it is difficult to compare the performance of ciphers that support parallelization with that of ciphers that do not. We leave the optimization of other ciphers and a fairer comparison to future work.
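The structure of the CTR-mode processing in Scenario 2 can be sketched as follows. The block function `toy_block_encrypt` is a hypothetical placeholder that only stands in for a 64-bit block cipher; it is not PRINCE, LED or RECTANGLE and offers no security. The counter handling (two consecutive 64-bit counters, big-endian keystream bytes) is likewise an assumption made for illustration.

```python
MASK64 = (1 << 64) - 1


def toy_block_encrypt(block, key):
    """Placeholder 64-bit mixing function, NOT a real cipher."""
    x = (block ^ key) & MASK64
    x = (x * 0x9E3779B97F4A7C15) & MASK64
    return ((x << 13) | (x >> 51)) & MASK64


def ctr_xor_128(data, key, counter, encrypt=toy_block_encrypt):
    """Encrypt or decrypt 16 bytes by XORing them with two encrypted counters."""
    if len(data) != 16:
        raise ValueError("expected exactly 16 bytes (128 bits) of data")
    out = bytearray()
    for i in range(2):
        # Only the encryption direction of the block cipher is ever called.
        keystream = encrypt((counter + i) & MASK64, key).to_bytes(8, "big")
        chunk = data[8 * i: 8 * i + 8]
        out += bytes(d ^ k for d, k in zip(chunk, keystream))
    return bytes(out)
```

Applying `ctr_xor_128` twice with the same key and counter returns the original message, which illustrates why no decryption procedure is needed in this scenario.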
Acknowledgement
Many thanks go to the anonymous reviewers. The research presented in this paper is supported by the National Natural Science Foundation of China (No. 61379138) and the “Strategic Priority Research Program” of the Chinese Academy of Sciences (No. XDA06010701).
References
1. Daemen, J., Rijmen, V.: The Design of Rijndael - AES - The Advanced Encryption Standard. Springer, Heidelberg (2002)
2. Borghoff, J., Canteaut, A., Güneysu, T., Kavun, E.B., Knezevic, M., Knudsen, L.R., Leander, G., Nikov, V., Paar, C., Rechberger, C., Rombouts, P., Thomsen, S.S., Yalçın, T.: PRINCE – a low-latency block cipher for pervasive computing applications. In: Sako, K., Wang, X. (eds.) ASIACRYPT 2012. LNCS, vol. 7658, pp. 208–225. Springer, Heidelberg (2012)
3. Guo, J., Peyrin, T., Poschmann, A., Robshaw, M.: The LED block cipher. In: Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 326–341. Springer, Heidelberg (2011)
4. Guo, J., Peyrin, T., Poschmann, A., Robshaw, M.: The LED block cipher. http://eprint.iacr.org/2012/600
5. Bogdanov, A., Knudsen, L.R., Leander, G., Paar, C., Poschmann, A., Robshaw, M.J.B., Seurin, Y., Vikkelsoe, C.: PRESENT: an ultra-lightweight block cipher. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727, pp. 450–466. Springer, Heidelberg (2007)
6. Beaulieu, R., Shors, D., Smith, J., Treatman-Clark, S., Weeks, B., Wingers, L.: SIMON and SPECK: block ciphers for the internet of things. http://eprint.iacr.org/2015/585
7. Moradi, A., Poschmann, A., Ling, S., Paar, C., Wang, H.: Pushing the limits: a very compact and a threshold implementation of AES. In: Paterson, K.G. (ed.) EUROCRYPT 2011. LNCS, vol. 6632, pp. 69–88. Springer, Heidelberg (2011)
8. Poschmann, A.: Lightweight cryptography cryptographic engineering for a pervasive world. PhD Dissertation, Faculty of Electrical Engineering and Information Technology, Ruhr-University Bochum, Germany (2009)
9. Albrecht, M.R., Driessen, B., Kavun, E.B., Leander, G., Paar, C., Yalçın, T.: Block Ciphers – Focus on the linear layer (feat. PRIDE). In: Garay, J.A., Gennaro, R. (eds.) CRYPTO 2014, Part I. LNCS, vol. 8616, pp. 57–76. Springer, Heidelberg (2014)
10. Atmel Corporation. 8-bit AVR Instruction Set. http://www.atmel.com/images/doc0856.pdf
11. Atmel Corporation. AVR 8-bit Microcontrollers. http://www.atmel.com/products/microcontrollers/avr/default.aspx
12. Shirai, T., Shibutani, K., Akishita, T., Moriai, S., Iwata, T.: The 128-bit blockcipher CLEFIA (Extended Abstract). In: Biryukov, A. (ed.) FSE 2007. LNCS, vol. 4593, pp. 181–195. Springer, Heidelberg (2007)
13. Hong, D., Sung, J., Hong, S.H., Lim, J.I., Lee, S.J., Koo, B.S., Lee, C.H., Chang, D., Lee, J., Jeong, K., Kim, H., Kim, J.S., Chee, S.: HIGHT: a new block cipher suitable for low-resource device. In: Goubin, L., Matsui, M. (eds.) CHES 2006. LNCS, vol. 4249, pp. 46–59. Springer, Heidelberg (2006)
14. De Cannière, C., Dunkelman, O., Knežević, M.: KATAN and KTANTAN – a family of small and efficient hardware-oriented block ciphers. In: Clavier, C., Gaj, K. (eds.) CHES 2009. LNCS, vol. 5747, pp. 272–288. Springer, Heidelberg (2009)
15. Gong, Z., Nikova, S., Law, Y.W.: KLEIN: a new family of lightweight block ciphers. In: Juels, A., Paar, C. (eds.) RFIDSec 2011. LNCS, vol. 7055, pp. 1–18. Springer, Heidelberg (2012)
16. Wu, W., Zhang, L.: LBlock: a lightweight block cipher. In: Lopez, J., Tsudik, G. (eds.) ACNS 2011. LNCS, vol. 6715, pp. 327–344. Springer, Heidelberg (2011)
17. Suzaki, T., Minematsu, K., Morioka, S., Kobayashi, E.: TWINE: a lightweight block cipher for multiple platforms. In: Knudsen, L.R., Wu, H. (eds.) SAC 2012. LNCS, vol. 7707, pp. 339–354. Springer, Heidelberg (2013)
18. Zhang, W., Bao, Z., Lin, D., Rijmen, V., Yang, B., Verbauwhede, I.: RECTANGLE: a bit-slice ultra-lightweight block cipher suitable for multiple platforms. Sci. China Inf. Sci. 58(12), 1–15 (2015)
19. CryptoLUX: Lightweight Block Ciphers. http://www.cryptolux.org/index.php/Lightweight_Block_Ciphers
20. Law, Y.W., Doumen, J., Hartel, P.H.: Survey and benchmark of block ciphers for wireless sensor networks. ACM Trans. Sensor Networks (TOSN) 2(1), 65–93 (2006)
21. Eisenbarth, T., Kumar, S., Paar, C., Poschmann, A., Uhsadel, L.: A survey of lightweight-cryptography implementations. IEEE Design & Test of Computers 24(6), 522–533 (2007)
22. Kerckhof, S., Durvaux, F., Hocquet, C., Bol, D., Standaert, F.-X.: Towards green cryptography: a comparison of lightweight ciphers from the energy viewpoint. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 390–407. Springer, Heidelberg (2012)
23. Knežević, M., Nikov, V., Rombouts, P.: Low-latency encryption – is “Lightweight = Light + Wait”? In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 426–446. Springer, Heidelberg (2012)
24. Eisenbarth, T., Gong, Z., Güneysu, T., Heyse, S., Indesteege, S., Kerckhof, S., Koeune, F., Nad, T., Plos, T., Regazzoni, F., Standaert, F.-X., van Oldeneel tot Oldenzeel, L.: Compact implementation and performance evaluation of block ciphers in ATtiny devices. In: Vaudenay, S., Mitrokotsa, A. (eds.) AFRICACRYPT 2012. LNCS, vol. 7374, pp. 172–187. Springer, Heidelberg (2012)
25. Eisenbarth, T., Gong, Z., Güneysu, T., Heyse, S., Indesteege, S., Kerckhof, S., Koeune, F., Nad, T., Plos, T., Regazzoni, F. et al.: Implementations of Low Cost Block Ciphers in Atmel AVR Devices, February 2015. http://perso.uclouvain.be/fstandae/lightweight_ciphers/
26. Matsui, M., Murakami, Y.: Minimalism of software implementation extensive performance analysis of symmetric primitives on the RL78 microcontroller. In: Moriai, S. (ed.) FSE 2013. LNCS, vol. 8424, pp. 393–409. Springer, Heidelberg (2014)
27. Cazorla, M., Marquet, K., Minier, M.: Survey and benchmark of lightweight block ciphers for wireless sensor networks. http://eprint.iacr.org/2013/295
28. Cazorla, M., Marquet, K., Minier, M.: Survey and Benchmark of Lightweight Block Ciphers for Wireless Sensor Networks. In: Samarati, P. (ed.) SECRYPT 2013 - Proceedings of the 10th International Conference on Security and Cryptography, Reykjavík, Iceland, 29–31 July 2013, pp. 543–548. SciTePress (2013)
29. Cazorla, M., Marquet, K., Minier, M.: Implementations of lightweight block ciphers on a WSN430 sensor, February 2015. http://bloc.project.citilab.fr/library.html
30. Dinu, D., Biryukov, A., Großschädl, J., Khovratovich, D., Corre, Y.L., Perrin, L.: FELICS - fair evaluation of lightweight cryptographic systems, July 2015. http://csrc.nist.gov/groups/ST/lwcworkshop2015/papers/session7dinupaper.pdf
31. CryptoLUX: FELICS (Fair Evaluation of Lightweight Cryptographic Systems), 15 August 2015. http://www.cryptolux.org/index.php/FELICS
32. Dinu, D., Corre, Y.L., Khovratovich, D., Perrin, L., Großschädl, J., Biryukov, A.: Triathlon of lightweight block ciphers for the internet of things (2015). http://eprint.iacr.org/2015/209
33. CryptoLUX: FELICS Triathlon, 12 August 2015. http://www.cryptolux.org/index.php/FELICS_Triathlon
34. Dinu, D., Corre, Y.L., Khovratovich, D., Perrin, L., Großschädl, J., Biryukov, A.: FELICS block ciphers brief results and FELICS block ciphers detailed results, 1 October 2015. http://www.cryptolux.org/index.php/FELICS_Block_Ciphers_Brief_Results, http://www.cryptolux.org/index.php/FELICS_Block_Ciphers_Detailed_Results
35. Processor Watch, 8 January 2013. http://www.linleygroup.com
36. National Institute of Standards and Technology (NIST). Lightweight Cryptography Workshop 2015. http://www.nist.gov/itl/csd/ct/lwc_workshop2015.cfm
37. Shahverdi, A., Chen, C., Eisenbarth, T.: AVRprince - an efficient implementation of PRINCE for 8-bit microprocessors. Technical Report, Worcester Polytechnic Institute (2014). http://www.ashahverdi.com/files/papers/avrPRINCEv01.pdf
38. Papapagiannopoulos, K.: High throughput in slices: the case of PRESENT, PRINCE and KATAN64 ciphers. In: Sadeghi, A.R., Saxena, N. (eds.) RFIDSec 2014. LNCS, vol. 8651, pp. 137–155. Springer, Heidelberg (2014)
39. Canteaut, A., Fuhr, T., Gilbert, H., Naya-Plasencia, M., Reinhard, J.: Multiple differential cryptanalysis of round-reduced PRINCE. Presentation at Fast Software Encryption FSE 2014, London, 25 March 2014. http://fse2014.isg.rhul.ac.uk/slides/slides09_4.pdf
40. Canteaut, A., Fuhr, T., Gilbert, H., Naya-Plasencia, M., Reinhard, J.: Multiple Differential Cryptanalysis of Round-Reduced PRINCE (Full version). eprint.iacr.org/2014/089
41. Gladman, B.: Serpent S Boxes as Boolean Functions. http://www.gladman.me.uk/
42. Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: The Keccak Reference, January 2011. http://keccak.noekeon.org/
43. Courtois, N.T., Hulme, D., Mourouzis, T.: Solving circuit optimisation problems in cryptography and cryptanalysis. In: Electronic Proceedings of 2nd IMA Conference Mathematics in Defence, Swindon (2011)
44. Beaulieu, R., Shors, D., Smith, J., Treatman-Clark, S., Weeks, B., Wingers, L.: The Simon and Speck Block Ciphers on AVR 8-bit Microcontrollers (2014). http://eprint.iacr.org/2014/947
45. Bao, Z., Zhang, W., Luo, P., Lin, D.: Bitsliced Implementations of Block Ciphers on AVR 8-bit Microcontrollers, October 2015. http://github.com/FreeDisciplina/BlockCiphersOnAVR