# Speed and Size-Optimized Implementations of the PRESENT Cipher for Tiny AVR Devices

## Abstract

This paper presents high-speed and low-size assembly implementations of the 80-bit version of the PRESENT cipher for the (Tiny)AVR family of microcontrollers. We report new speed and size records for our implementations, with the speed-optimized version achieving a full encryption in 8721 clock cycles and the size-optimized version compressing the cipher down to 272 bytes; the previous state of the art for (Tiny)AVR achieved 10723 clock cycles for encryption with a size of 936 bytes. Along with the two implementation extrema (speed and size optimized versions), we offer insight into techniques and representations that show the speed/area tradeoffs and provide intermediate solutions for various configurations.

### Keywords

PRESENT AVR ATtiny Assembly software implementation## 1 Introduction

Modern society is constantly witnessing an extensive and large scale deployment of tiny computing devices. Information processing and wireless communication are being thoroughly integrated into everyday objects and activities, developing a large distributed mobile infrastructure and ushering in the era of ubiquitous computing. RFID tags attached to products, cardiac pacemakers, fire-detecting sensor nodes and the like need to operate *securely* under particularly *restricted conditions*, namely low battery life, small processing power and bandwidth-demanding ad-hoc network protocols. To achieve sustainable security in this new landscape, researchers have developed new cryptographic primitives and techniques, namely lightweight cryptographic ciphers such as PRESENT [5], Klein [11], LED [13] and others.

The majority of these ciphers was designed with hardware performance in mind, leaving most software implementations relatively inefficient. For instance, the AVR-Crypto-Lib [21] often resorts to C language implementations, resulting in 100.000 clock cycles for a single encryption with PRESENT. AVR microcontrollers are often encountered regarding the Internet of Things and ubiquitous computing, thus they are an interesting platform on which to enable and optimize lightweight ciphers. The University of Louvain has initiated a project to draft efficient assembly implementations of various lightweight ciphers on resource-constrained AVR devices to make fair comparisons about their relative efficiency. Resulting from this project is the state of the art in both speed and size: an implementation [8] which achieves an encryption with PRESENT in 10723 clock cycles in 936 bytes.

**Our contribution.** This paper describes the details of a speed optimized [20] and a size-optimized [22] implementation of the PRESENT cipher on the ATtiny45, attained with the aid of algorithmic improvements and efficient programming techniques. Our speed-optimized version improves the state of the art [8] by an 18 % reduction in clock cycles, while the size-optimized version is 70 % smaller.

Squared S-Box representations, i.e. S-Boxes which are custom-made for fast access in the AVR 8-bit architecture, also used by Eisenbarth [8].

Compact S-Box representations, i.e. minimal footprint S-Boxes that reduce the implementation size.

Minimal key register rotations, allowing the key update procedure to complete in fewer instructions.

Memory access optimizations, grouping memory transactions to improve speed.

Algorithm serialization, by keeping part of the state in SRAM while we operate on fewer registers.

Indirect register access to let loops drive repeated operations on CPU registers.

Use of the stack to store intermediate values to avoid using more dedicated registers.

Code restructuring and efficient procedure callings.

## 2 Background

### 2.1 PRESENT Cipher

PRESENT [5] is an ultra-lightweight, 64-bit symmetric block cipher, using 80-bit or 128-bit keys. It is based on a substitution/permutation network and it is named as a reference to Serpent [2] due to its similar constructs. As of 2012, PRESENT (among other ciphers) was adopted by ISO as a standard for a lightweight block cipher (ISO/IEC 29192-2:2012 [17]). The full algorithm has so far been resistant to attempts at cryptanalysis, although the most successful attack has shown that up to 15 of its 31 rounds can be broken with \(2^{35.6}\) plaintext-ciphertext pairs in \(2^{20}\) operations [1, 7, 19].

PRESENT uses exclusive-or as its round key operation, a 4-bit substitution layer, a 4-bit period bit position permutation network in 31 rounds and a final round key operation. Key scheduling is a combination of bit rotation, S-Box application and exclusive-or with the round counter. Constructs found in PRESENT are also encountered in SPONGENT [4], in hash function constructs based on block ciphers as proposed by Hirose [6, 14, 15] (H-PRESENT) and in the similar Maya [9] or generalized SMALLPRESENT [18]. Thus the optimizations presented here can also be of interest with respect to the implementations of these ciphers. In our approach, we have implemented PRESENT for the recommended 80-bit key size in AVR assembly in two versions, optimized for maximal speed and minimal size. Support for 128-bit keys was also added to the size-optimized implementation as the required extra registers became available through optimizations, but we will focus on the implementation of the variant with 80-bit keys.

### 2.2 PRESENT Algorithm

### 2.3 The 8-bit Family of TinyAVR Microcontrollers

Atmel offers a wide range of 8-bit microcontrollers, including high performance devices (ATxmega), mid-range devices (ATmega) and low-end devices with limited memory, storage and processing power (ATtiny). Common applications of AVR microcontrollers include smart cards, motor control systems, medical applications et cetera.

Our focus is on the resource-constrained ATtiny architecture, which typically features less than 1 KB of static RAM (SRAM) and flash storage ranging from 1 to 8 KB. The architecture uses 32 general-purpose registers, *R0*-*R31*.

Several registers have special characteristics, namely registers *R0* through *R15* can not be accessed by instructions that provide an immediate value as an operand, i.e. you can only perform memory access from these registers. In addition, register pairs *R27:R26* (denoted as *X*), *R29:R28* (denoted as*Y*) , and *R31:R30* (denoted as *Z*) can access the SRAM. The *Y* register can also be used for indirect register addressing, the *Z* register can also access the flash storage and be used for indirect jumps and calls. Instructions using these 3 pointer register pairs also allow post-increment and pre-decrement of the pointer. At all times, the 6 special registers can be utilized as general-purpose registers.

The ATtiny instruction set consists of the basic 90 single-word instructions found in all AVR architectures. However, due to the limited size of its core, it does not support the extended instruction set which includes multiplication, in contrast with the ATmega architecture.

**ATtiny configuration.** We perform all simulations on the ATtiny45, which has a maximum clock frequency of 20 MHz, 256 bytes of SRAM and 4 KB of flash storage.

**Radix-**\(2^8\)**representation.** The PRESENT cipher requires representing integers of size 64 and 80 bits. Thus, we split the number into 8-bit components, using radix-\(2^8\) and an \(m\)-bit integer is represented as \(n=\lceil {m/8}\rceil \) bytes \((x_0,x_1,\dots ,x_{n-1})\) such that \(x=\sum _{i=0}^n{x_i 2^{8*i}}\).

## 3 High-Speed Implementation

### 3.1 PRESENT S-Box and P-Layer Implementation

In this section we examine the S-Box of the PRESENT cipher from the speed perspective, using lookup tables, and we offer several variations, utilizing the speed-area trade-off. We analyze the *original S-Box*, identify its performance issues and suggest two possible lookup table representations (with 8 and 16 bytes respectively). Subsequently, we expand it to the faster, yet space-demanding (256 bytes) *Squared S-Box*. In the last section, we examine the *combination of the S-Box and the permutation layer*, resulting in a very large lookup table (1024 bytes) that substantially boosts performance.

**Original S-Box.**The original S-Box, presented by Bogdanov et al. [5] consists of 16 different substitutions, each with a 4-bit input and a 4-bit output, as shown in Table 1.

The original S-Box of the PRESENT cipher.

x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

S[x] | C | 5 | 6 | B | 9 | 0 | A | D | 3 | E | F | 8 | 4 | 7 | 1 | 2 |

**Representation.**If we aim for a particularly small size footprint, it is possible to use either a

*packed*or an

*unpacked*representation of the original S-Box. The original,

*unpacked*version stores the lookup table in 16 bytes, where every 4-bit input to 4-bit output substitution is stored using 8 bits of space (i.e. there exists redundancy of in the representation). The

*packed*version, stores

*two*4-bit input to 4-bit output substitutions using 8 bits, i.e. without any redundancy, resulting in an 8 byte lookup table.

The packed representation of the original S-Box, using 8 bytes. Each table column represents two substitutions. This would give a size optimization of 8 bytes to begin with, but considerations for unpacking code apply. (See Sect. 4.2.)

x | 01 | 23 | 45 | 67 | 89 | AB | CD | EF |
---|---|---|---|---|---|---|---|---|

S[x] | C5 | 6B | 90 | AD | 3E | F8 | 47 | 12 |

**Performance.** The core performance issue regarding the 4-bit S-Box is the penalty in accessing it, if stored in a lookup table. The AVR architecture is designed to enable fast access for 8-bits at a time. Thus, a lookup table of the original S-Box is rather small (16 bytes for an unpacked version, 8 bytes for a packed one), but we can assume it is also relatively inefficient speed-wise due to the overhead operations that need to take place before and after each table lookup. Surely the packed S-Box (Table 2) is the least speed-efficient variant, since after the 4-bit lookup we also have to extract the upper or lower half. This issue is not encountered in the unpacked version, which makes better use of the AVR 8-bit architecture. However, performance is not optimal due to the fact that we only substitute 4 bits at a time, while we use 8-bit operations, i.e. the redundant representation results in more memory accesses than needed.

**Squared S-Box.** A solution to the aforementioned performance problem is to construct a new lookup table that: (a) is custom made for the 8-bit AVR architecture, like the unpacked S-Box and (b) uses a non-redundant representation similar to the packed S-Box.

**Representation.**In Table 3, we demonstrate a

*Squared S-Box*, which uses an 8-bit input and produces an 8-bit output. Within an 8-bit space, we can contain two 4-bit substitution values and the number of possible substitution values is 16, thus the total size of the lookup table is \(16 \cdot 16=256\) bytes. As a result, there is no need for overhead computation and the substitution consists of a single lookup. This approach has also been followed before by Eisenbarth [8].

The 256-byte Squared S-Box. It substitutes one byte at a time, without any overhead or redundancy.

x | 00 | 01 | 02 | 03 | \(\dots \) | 0C | 0D | 0E | 0F |
---|---|---|---|---|---|---|---|---|---|

S[x] | CC | C5 | C6 | CB | ... | C4 | C7 | C1 | C2 |

x | 10 | 11 | 12 | 13 | \(\dots \) | 1C | 1D | 1E | 1F |

S[x] | 5C | 55 | 56 | 5B | ... | 54 | 57 | 51 | 52 |

\(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\dots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) |

x | F0 | F1 | F2 | F3 | \(\dots \) | FC | FD | FE | FF |

S[x] | 2C | 25 | 26 | 2B | ... | 24 | 27 | 21 | 22 |

**Performance.** The Squared S-Box described is an efficient and viable solution with respect to the cipher’s substitution layer. It is custom made for the 8-bit AVR architecture and allows us to perform byte substitutions with a single flash memory lookup. Furthermore, it is relatively size-efficient, consisting of 256 bytes. We consider it could almost be transfered to ATtiny45 SRAM from flash memory during the initialization process of the algorithm, but we would be left without any stack space and therefore it would certainly be a special purpose implementation. That memory transfer *is* viable for applications that do dedicated encryption or decryption (provided we are left with some stack space for the workings of the rest of the algorithm). The instruction to load from flash (lpm) takes 3 clock cycles, whereas loading from or saving to SRAM (ld) takes only 2. Given the fact that the block size of the PRESENT cipher is 64 bits, it requires \(64/8=8\) S-Box lookups per round. The cipher consists of 31 rounds so a full encryption requires \(8 \cdot 31=248\) S-Box applications. If an application would singularly encrypt or decrypt, we could save 248 cycles per encryption after an initial \((3+2) \cdot 256 = 1280\) clock cycles. That means that the processing penalty to load a 256 byte table from flash storage to SRAM would not be compensated until over 5 encryptions or decryptions were computed sequentially, so this approach is not feasible for the general case of intermixed encryption and decryption.

**Merged SP Lookup Table.** Although the Squared S-Box lookup table solution is viable for the PRESENT cipher, we need a speed optimization for the most complex part of the PRESENT algorithm: the permutation layer. In contrast to hardware implementations, where it is a trivial rewiring of outputs, most software implementations require a large number of bit rotation and move operations. The AVR architecture is not capable of performing fast shifts and rotations (compared to for instance the ARM11 processors, which can do shifts and rotations for free), i.e. an \(n\)-bit shift requires \(n\) clock cycles, and thusly we have to look for alternatives. In this direction, work by Hutter and Schwabe [16] on ATmega^{1} suggested the usage of multiplications instead of rotations/shifts, however this is not viable on ATtiny chips due to the fact that they do not possess native multiplication instructions.

**Representation.**The fastest approach that we identified for the permutation layer is the idea developed by Zheng Gong and Bo Zhu [10, 12]. Due to the adaptation of this novel approach in the AVR architecture, we are able to improve performance over the state of the art [8]. Specifically, Gong and Zhu exploited the internal structure of the permutation layer, i.e. the fact that every output of a 4-bit S-Box will contribute one bit to the cipher. The underlying pattern for the permutation is the following:

*merge*the S-Box and the permutation layer and as a result, the whole SP network is performed via table lookups.

**Performance.** The 1024 byte lookup tables eliminate the need for an independent permutation layer, providing us with the fastest available solution. On the downside, we have to perform one lookup for every two bits, resulting in 32 lookups for a 64-bit state (compared to the Squared S-Box that required only 8 lookups for a 64-bit state). Moreover, we need 1024 bytes to store the tables, thus it is not possible to transfer them to the SRAM on our test platform. The theorized 33 % speedup considered with using SRAM instead of flash storage is only possible in an AVR microcontroller with at least 1024 bytes of internal^{2} SRAM, for instance ATtiny1634.

**Memory Optimizations for Lookup Tables.**All the aforementioned solutions with respect to substitution and/or permutation rely heavily on lookup tables. In order to decrease the computational penalty of the table lookups, we performed several code-level optimizations. In Fig. 2 we demonstrate the code required to perform a single table lookup from flash memory.

The lookup operation consists of two mov and one lpm instruction. Memory is addressed using 16-bit pointers, so the first mov loads the high part, while the second mov loads the table index that will be accessed.

**Table alignment.** We aim to keep the changes required in the high part (ZH) to a minimum. Thus, we align the four 256-byte tables required for the merged SP approach in Sect. 3.1 such that they can be accessed by using only the low part (ZL) register as an index and keeping ZH unchanged. Elaborating, the four lookup tables start from 0x0600, 0x0700, 0x0800, 0x0900 respectively and thus, the 8 high bits of the address part (0x06, 0x07, 0x08, 0x09) remain the same while the 8 low bits are sufficient to act as the table index, ranging from 0 to 255 (0x00 to 0xFF).

**Memory access grouping.** Performing two lookup operations in two different tables requires a total of 10 clock cycles (2 mov to change the high part of the operation, 2 mov to change the index and 2 lpm to perform the actual lookup). However, performing two lookup operations on the *same table* requires a total of 8 clock cycles (2 mov for the index, 2 lpm for the lookup), since the high part of the memory address remains the same. Thus, we we try to perform the maximum amount of grouped table lookups, given the limited number of registers, since within each group, ZH remains the same. For instance the following sequence of operations, * lookupTable1(i), lookupTable2(k), lookupTable1(j), lookupTable2(m)* will transform to * lookupTable1(i), lookupTable1(j), lookupTable2(k), lookupTable2(m)* in order to group memory access.

### 3.2 Key Update Implementation

This section focuses on implementing the key scheduling/update process of the PRESENT cipher efficiently. The key update function of the PRESENT cipher consists of three operations, namely, key rotation, key substitution and key XOR the round counter. We present the optimizations performed in the following subsections.

**Key Rotation.** The algorithm specifies that the key must be rotated by 61 bits to the left. Given the fact that rotations/shifts are computationally expensive in the AVR architecture, we transform 61 left rotations to 19 right rotations, which can be further reduced to 16 right rotations and 3 right rotations. The 16 right rotations can be easily performed by using the mov instructions on register level, i.e. rotate all the bits inside a register by moving the contents to the previous register used in our representation, an approach which is preferable to single rotations via the bit-level instructions. Only the 3 remaining rotations are carried out with the logical instructions for right rotation and shifting (ror and shr).

**Key Substitution.** The highest 4 bits of the 80-bit key used by the PRESENT cipher, must be substituted via the S-Box. To avoid 4-bit memory access or redundancy (Sect. 3.1), we construct a special-purpose Squared S-Box that substitutes the 4 high bits of the 8-bit input, while the low 4 bits remain unchanged. The resulting table applies a substitution operation on the upper nibble which takes only a single lookup operation. Should we encounter space constraints, it is possible to replace the Squared S-Box with the original, unpacked one; the key substitution occurs only once per round, so the performance loss incurred by the unpacked S-Box is relatively small.

**Key Exclusive-OR Operation.** The algorithm specifies that the key bits 15, 16, 17, 18, 19 must be XORed with the round counter. The issue is that -under the current representation- bits 0\(\dots \)7 will be stored in *R0*, bits 8\(\dots \)15 will be stored in *R1* and bits 16\(\dots \)23 will be stored in *R2*. As a result parts of the round counter need to be XORed with different parts of two separate registers, namely the counter needs to be XORed with both *R1* and *R2*. Similarly to Eisenbarth [8], we perform the XOR operation before the key rotation, thus the bits that are operated on are bits 34,35,36,37,38 which span a single register (under the previous representation they are located in *R4*). This restructuring of the algorithm, i.e. performing the XOR operation before the key rotation, does not affect the outcome or security of the algorithm.

**Latency****vs.****Throughput.** The implementations presented focus on reducing the cost of a *single* encryption. Thus, it can also be viewed as a *low-latency* implementation of PRESENT. Should we loosen up on the latency requirement, we can achieve increased *throughput*. Future work aims to perform multiple encryptions at a time, by using the *bitslicing* technique [3], which largely reduces the cost of permutations.

## 4 Size-Optimized Implementation

Here we will list some of the size improvements we were able to apply to the PRESENT algorithm. While these modifications make the algorithm operate more slowly, the reduction in size would allow the cipher to be included in microcontrollers with smaller available code area. Our version requires 128 AVR instructions (256 bytes of code) for both the encryption and decryption routines, plus two times 8 bytes for packed tables of S-Box values at memory addresses 0x100 and 0x200. We believe this should be sufficiently small for the code to be included in an AVR machine with 1K of available Flash storage, while still allowing almost three quarters of the available area to be devoted to application-specific code.

Unfortunately, every opcode in the AVR instruction set is expressed using one or more 16-bits words, so while using only the single-word instructions of the ATtiny there is no further possibility to exchange any instructions for equivalent smaller ones (such as for example *add Rd, 1* being more concisely expressed as *inc Rd* on x86 machines). Furthermore the AVR employs a Harvard architecture, where there is a strict separation between data and code memory; this prevents us from dynamically computing new opcodes in memory to be executed later. Finally we note there are no ‘bulk’ instructions which operate on several registers at once.

However, we do have access to some instructions that are specific to the AVR architecture and are uniquely suited to making parts of the code more condensed by virtue of their expressiveness, such as the *swap* and *cbr* instructions. We also have all kinds of branching instructions at our disposal that branch based on values in the state register (SREG) which we can explicitly or implicitly modify. We can use the stack to make procedure calls or as temporary storage, and finally have the powerful option of adressing the CPU registers indirectly through the *Y* pointer.

### 4.1 Serialization and SRAM Use

The greatest size optimization we have implemented is serialization of the algorithm, keeping most of the state in SRAM while we operate on only one byte of state wherever possible. This reduces the instruction count on all parts of the algorithm. It also allows us to make use of fewer dedicated registers.keeping scheduled keys entirely in registers after setup. The only procedure where this approach was not successful is the permutation layer, for which we chose to reserve 4 output registers as we believe it allows us to apply the 4-bit period permutation in software with the most size-efficiency.

By requiring only 4 dedicated registers to keep a partial block state, and 6 to keep the rest of the algorithms’ state, we were left with exactly 16 of the 32 general purpose registers available to keep the key register (as we rely on the 6 registers for X,Y,Z to navigate the SRAM, indirectly addressed registers and flash storage respectively.) This means support for 128-bit keys was able to be added to the implementation at no extra cost.

### 4.2 S-Box Packing

The PRESENT S-Boxes work on 4-bit nibbles, but defining a table of nibbles in the code at first seemed less size-efficient than packing the nibbles into bytes to be unpacked. This would save 8 bytes per S-Box table to start with, but we need 4 to 7 extra instructions depending on whether or not we care about timing attacks to unpack the nibbles which diminishes the size benefit to 2 bytes of code. See Fig. 3 and Table 2.

### 4.3 Permutation Layer

The code to apply the bit position permutation to the state in software borrows heavily from the AVR implementation of PRESENT drafted in Louvain by Eisenbarth et al. [8]. Since the permutation follows a 4-bit period in the input, their choice to use 4 bytes of I/O when rotating bits off of registers into corresponding new positions seems quite efficient. In the Louvain implementation one bit is rolled off from 4 state registers into one output register and this block is done twice, completing one output byte.

This implementation of the permutation layer requires availability of 4 bytes of temporary storage for half of the permuted state, which we save to the stack before applying the permutation to the other 4 bytes of the state. The construct in Fig. 4 allows the implementer to let a block of code be executed twice, while allowing them to take control of the machine before, after and in between these code blocks. As you can see this takes 4 instructions, whereas a *rcall*, *ret* construction would take us only 3, but this construct doesn’t use the stack which means we can keep our intermediate values there rather than requiring 4 more dedicated registers or make more complicated use of the SRAM.

The 4 extra registers that became available through this approach allowed us to implement support for 128-bit keys while keeping the scheduled round keys entirely in CPU registers at no extra size cost. This construct does not affect the cycle count relative to the state or input.

### 4.4 Indirect Register Access

The AVR platform allows the CPU registers to be addressed *indirectly*, meaning we can use a pointer (*Y*) in memory space to interact with them. We use the feature to load all 10 or 16 of the dedicated key registers in a loop, to iterate over them while applying the round key to the state in SRAM, and to rotate the key registers. This last optimization proved to be devastating to performance compared to inlined rotation of the registers, which is why we added an option to configure either approach.

Doing these operations in a loop which iterates over registers results in smaller code, and allows us to rotate an arbitrary number of registers at a fixed instruction cost. Faster (inlined) rotation makes the implementation about 4 times faster and requires 2/8 extra instructions depending on the configured key size.

### 4.5 Round Key Application and Key Scheduling

The round key is applied to the state in the SRAM by reading, XORing and writing one byte to/from a register at a time. The use of indirect register access allows us to iterate through the key bytes in CPU registers while iterating the state bytes in SRAM.

When scheduling keys, we still apply the exclusive-or to part of the key register in the ideal position (i.e. where the bytes of the key register line up with the round counter register), as explained in Sect. 3.2.

The inverse key scheduling procedure is only needed in the decryption routine, so we were able to inline it into the decryption round.

### 4.6 Limits Encountered

Although S-Box application and round key application always happen close to each other and have the same SRAM access pattern, the varying order in which they are applied in encryption/decryption rounds, or the omission of the S-Box application in the final step makes it impossible to combine the two steps into one procedure that makes PRESENT smaller.

It is possible to save a few instructions by iterating through the state in SRAM from wherever the X pointer is located rather than rewinding it to the start of the block in every round procedure of the algorithm. Still, the varying order in which they are applied in the encryption/decryption rounds makes it impossible to do so for the general case in smaller code.

The choice to rewind to the start of the block places the SRAM pointer back to the same address as before encryption or decryption, which seems like a good default for real-life use and also allows all round procedures to be callable from external applications, which seemed more desirable than having them rely on ‘hidden’ state which affects their behaviour.

### 4.7 Using the Code for Specific Applications

While the attained size of our implementation of PRESENT should suffice for use in real-world applications, most of the procedure calls can be inlined when only encryption or decryption is required in the application. If only encryption is required, the inverse S-Box and S-Box unpacking code can be omitted as well.

As mentioned in 4.4, the key register rotation procedure can be configured to either use indirect register addressing, or be inlined at a cost of 4/16 extra bytes depending on configured key size.

## 5 Conclusion

Speed (in clock cycles) and size (in bytes) comparisons to existing implementations of PRESENT for the AVR architecture, ordered by size, speed.

Encryption | Decryption | Size | |
---|---|---|---|

Papagiannopoulos [20] | 8721 | - | 1794 |

AVR Crypto-lib [21] | 105796 | 151624 | 1514 |

Eisenbarth [8] | 10723 | 11239 | 936 |

Verstegen [22] | |||

Inlined rotation, unpacked S-Boxes (128-bit) | 64506 | 119626 | 292 |

Inlined rotation (128-bit) | 67854 | 123346 | 290 |

Inlined rotation, unpacked S-Boxes | 52622 | 73952 | 280 |

Inlined rotation | 55784 | 77300 | 278 |

Unpacked S-Boxes | 186883 | 250032 | 274 |

Unpacked S-Boxes (128-bit) | 278189 | 568010 | 274 |

Default | 190045 | 253380 | 272 |

Default (128-bit) | 281537 | 571730 | 272 |

## Footnotes

### References

- 1.Abed, F., Forler, C., List, E., Lucks, S., Wenzel, J.: Biclique cryptanalysis of the PRESENT and LED lightweight ciphers. Technical report, Cryptology ePrint Archive, Report 2012/591 (2012)Google Scholar
- 2.Anderson, R., Biham, E., Knudsen, L.: Serpent: a proposal for the advanced encryption standard. NIST AES Proposal (1998)Google Scholar
- 3.Biham, E.: A fast new DES implementation in software. Technion, Technical, Report CS0891 (1997)Google Scholar
- 4.Bogdanov, A., Knežević, M., Leander, G., Toz, D., Varıcı, K., Verbauwhede, I.: spongent: A lightweight hash function. In: Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 312–325. Springer, Heidelberg (2011)Google Scholar
- 5.Bogdanov, A., Knudsen, L.R., Leander, G., Paar, C., Poschmann, A., Robshaw, M.J.B., Seurin, Y., Vikkelsoe, C.: PRESENT: an ultra-lightweight block cipher. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727, pp. 450–466. Springer, Heidelberg (2007)Google Scholar
- 6.Bogdanov, A., Leander, G., Paar, C., Poschmann, A., Robshaw, M.J.B., Seurin, Y.: Hash functions and RFID tags: mind the gap. In: Oswald, E., Rohatgi, P. (eds.) CHES 2008. LNCS, vol. 5154, pp. 283–299. Springer, Heidelberg (2008)Google Scholar
- 7.Collard, B., Standaert, F.-X.: A statistical saturation attack against the block cipher PRESENT. In: Fischlin, M. (ed.) CT-RSA 2009. LNCS, vol. 5473, pp. 195–210. Springer, Heidelberg (2009)Google Scholar
- 8.Eisenbarth, T., et al.: Compact implementation and performance evaluation of block ciphers in a tiny devices. In: Mitrokotsa, A., Vaudenay, S. (eds.) AFRICACRYPT 2012. LNCS, vol. 7374, pp. 172–187. Springer, Heidelberg (2012)Google Scholar
- 9.Gomathisankaran, M., Lee, R.B.: Maya: a novel block encryption function. In: Proceedings of International Workshop on Coding and Cryptography, vol. 33, p. 54 (2009)Google Scholar
- 10.Gong, Z., Hartel, P., Nikova, S., Zhu, B.: Towards secure and practical MACs for body sensor networks. In: Roy, B., Sendrier, N. (eds.) INDOCRYPT 2009. LNCS, vol. 5922, pp. 182–198. Springer, Heidelberg (2009)Google Scholar
- 11.Gong, Z., Nikova, S., Law, Y.W.: KLEIN: A new family of lightweight block ciphers. In: Juels, A., Paar, C. (eds.) RFIDSec 2011. LNCS, vol. 7055, pp. 1–18. Springer, Heidelberg (2012)Google Scholar
- 12.Gong, Z., Zhu, B.: Software implementation of block cipher PRESENT for 8-Bit platforms. http://cis.sjtu.edu.cn/index.php/Software_Implementation_of_Block_Cipher_PRESENT_for_8-Bit_Platforms (2013). Accessed 19 Feb 2013. Archived at http://www.webcitation.org/1370831045860897
- 13.Guo, J., Peyrin, T., Poschmann, A., Robshaw, M.: The LED block cipher. In: Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 326–341. Springer, Heidelberg (2011)Google Scholar
- 14.Hirose, S.: Provably secure double-block-length hash functions in a black-box model. In: Park, C., Chee, S. (eds.) ICISC 2004. LNCS, vol. 3506, pp. 330–342. Springer, Heidelberg (2005)Google Scholar
- 15.Hirose, S.: Some plausible constructions of double-block-length hash functions. In: Robshaw, M.J.B. (ed.) FSE 2006. LNCS, vol. 4047, pp. 210–225. Springer, Heidelberg (2006)Google Scholar
- 16.Hutter, M., Schwabe, P.: NaCl on 8-bit AVR microcontrollersGoogle Scholar
- 17.Information technology - Security techniques - Lightweight cryptography - Part 2: Block ciphers (2011)Google Scholar
- 18.Leander, G.: Small scale variants of the block cipher PRESENT. IACR ePrint Report, 143 (2010)Google Scholar
- 19.Nakahara Jr, J., Sepehrdad, P., Zhang, B., Wang, M.: Linear (hull) and algebraic cryptanalysis of the block cipher PRESENT. In: Garay, J.A., Miyaji, A., Otsuka, A. (eds.) CANS 2009. LNCS, vol. 5888, pp. 58–75. Springer, Heidelberg (2009)Google Scholar
- 20.Papagiannopoulos, K.: Speed-optimized implementation of PRESENT in AVR assembly. https://github.com/kostaspap88/PRESENT_speed_implementation/ (2013)
- 21.The GNU project. AVR-Crypto-Lib. http://avrcryptolib.das-labor.org/ (2013). Accessed 6 April 2013
- 22.Verstegen, A.: Size-optimized implementation of PRESENT in AVR assembly. https://github.com/aczid/ru_crypto_engineering/ (2013)