1 Introduction

For years, the lightweight block cipher Present [6] has been in the spotlight as the preferred solution for providing confidentiality in constrained environments. However, recent findings [5, 11] call into question the security properties of the scheme. It is therefore necessary to study alternatives that offer resilience against birthday attacks and against linear and differential cryptanalysis.

In 2015, a possible NIST standard for lightweight cryptography was first mentioned. Over the course of two years, NIST published a report detailing the scope and state of the art in lightweight cryptography [10], and the standardization work appears to be in progress. This suggests that lightweight cryptographic primitives are key components in the development of future technologies and applications. Generating solid and reproducible implementation results and benchmarks is essential for any future standard.

In this work we evaluate hardware realizations of the cryptographic algorithms Midori and Gift, which are believed to be secure. We compare these implementations against state-of-the-art architectures for Gimli and Present. Our main contributions are:

  1. Novel architectural designs for the Midori and Gift block ciphers following area-reduction strategies.

  2. The first implementation results for Gift and the first area-oriented results for Midori in FPGA.

  3. All the proposed designs (VHDL) are available at https://www.tamps.cinvestav.mx/~hardware/.

The rest of the paper is structured as follows. Section 2 describes the different architectures for the selected lightweight algorithms, which are implemented and evaluated. Section 3 describes our experimental setup. Section 4 presents our findings. Section 5 concludes this work.

2 Methods

In this section we focus on encryption functions since these can be used to encrypt and decrypt data under the CTR mode of operation [7]. We use 128-bit key sizes for all the block ciphers.
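As a minimal sketch of why the encryption function alone suffices, the following Python snippet applies CTR mode with a stand-in 64-bit block function. The function `toy_encrypt_block` is hypothetical: it is built from SHA-256 only to make the example runnable, and it is not Present, Midori, or Gift. The counter layout (a 4-byte nonce followed by a 4-byte block counter) is also an assumption and is not taken from [7].

```python
import hashlib

BLOCK_BYTES = 8  # 64-bit block, as in Present, Midori-64, and Gift-64


def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for a block cipher encryption function (not secure)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def ctr_xcrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data in CTR mode; both directions are the same operation."""
    if len(nonce) != 4:
        raise ValueError("this sketch assumes a 4-byte nonce")
    out = bytearray()
    for offset in range(0, len(data), BLOCK_BYTES):
        # Counter block = nonce || block index (assumed layout).
        counter = (offset // BLOCK_BYTES).to_bytes(4, "big")
        keystream = toy_encrypt_block(key, nonce + counter)
        chunk = data[offset:offset + BLOCK_BYTES]
        out.extend(c ^ k for c, k in zip(chunk, keystream))
    return bytes(out)


key = bytes(16)  # 128-bit key, the key size used throughout this work
nonce = b"\x00\x01\x02\x03"
plaintext = b"lightweight block ciphers"
ciphertext = ctr_xcrypt(key, nonce, plaintext)
recovered = ctr_xcrypt(key, nonce, ciphertext)
```

Only the forward direction of the block function is ever called, which is why a hardware core that implements encryption only can serve both directions in this mode.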

For each lightweight block cipher we study its iterative and serial architectures [8]. We define two types of serial architectures. The first type (serial-1) targets a reduction in the number of 4-bit substitution boxes (SBOX) from n / 4 to two. The second type of architecture (serial-2) seeks to reduce not only the number of substitution boxes, but also the width of other transformations when possible.

2.1 Present

The Present block cipher follows a Substitution-Permutation Network (SPN) construction. It has a block size of 64-bit and supports key sizes of 80-bit and 128-bit. The specification for its encryption function is presented in [6].

We first study the basic implementation of the block cipher with 8-bit IO ports described in [8]. That hardware realization of Present requires 17 substitution boxes (SBOX), 77 XOR gates, and 192 Flip-Flops (FF). With regard to latency, 16 cycles are required to input the plaintext and the cipher key, 31 cycles to encrypt the data, and 8 cycles to produce the output. In total, this amounts to 55 cycles.

In the serial-1 architecture for Present found in [8], the main optimization involves reducing the number of substitution boxes to two. The substitution boxes used in the key generation are also removed and the number of XOR gates is reduced. The trade-off is an increase in the number of cycles required to encrypt the data. The implementation of this design requires 2 SBOX, 21 XOR gates, and 192 FF. With this design, 303 latency cycles are needed to encrypt a data block.
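The serial-1 idea can be sketched behaviorally in Python: the substitution layer gives the same result whether all n/4 SBOX are applied in one step or two nibbles are processed per cycle. The S-box table below is the 4-bit Present S-box from the specification [6]; the function names, the nibble ordering, and the cycle counter are illustrative assumptions and do not model the VHDL design in [8].

```python
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def sbox_layer_parallel(state: int, n: int = 64) -> int:
    """Apply n/4 S-boxes in a single step (iterative architecture)."""
    out = 0
    for i in range(n // 4):
        nibble = (state >> (4 * i)) & 0xF
        out |= PRESENT_SBOX[nibble] << (4 * i)
    return out


def sbox_layer_serial(state: int, n: int = 64, sboxes: int = 2):
    """Apply the layer `sboxes` nibbles per cycle; return (result, cycles)."""
    out, cycles = 0, 0
    for base in range(0, n // 4, sboxes):
        for i in range(base, base + sboxes):
            nibble = (state >> (4 * i)) & 0xF
            out |= PRESENT_SBOX[nibble] << (4 * i)
        cycles += 1  # one clock cycle per group of `sboxes` nibbles
    return out, cycles


state = 0x0123456789ABCDEF
serial_out, serial_cycles = sbox_layer_serial(state)
```

For a 64-bit state and two SBOX, the serialized layer takes eight cycles instead of one, which is where the latency trade-off described above originates.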

The serial-2 Present architecture under study is the one reported in [9]. In that design the main strategy is to reduce the whole datapath to 16-bit, which is a quarter of the block size. The hardware realization for this design uses 6 SBOX, 21 XOR gates, and 192 FF. The total latency of the design is 136 cycles.

2.2 Midori

Midori is a lightweight block cipher “that is optimized with respect to the energy consumed by the circuit per bit in encryption or decryption operation” [2]. This block cipher operates over data blocks of 64 or 128 bits. A key size of 128-bit is used in both versions of the algorithm. Midori also has an SPN structure.

The iterative architecture for Midori created in this work is presented in Fig. 1 (left). This design can describe both Midori-64 and Midori-128 realizations. It follows the algorithm specification closely but uses 8-bit IO ports. In hardware, this architecture requires 16 SBox (4-bit for Midori-64 and 8-bit for Midori-128), an \(n\)-bit transformation which can be simplified as n / 2 XOR gates (MixColumn), an n-bit XOR layer, 16 XOR gates for the key mechanism, \(r-2\) 16-bit round constants, and \(n+128\) FF. In total, the iterative architectures for Midori-64 and Midori-128 have a latency of 41 and 53 cycles, respectively.

Fig. 1. Iterative (left), serial-1 (center), and serial-2 (right) architectures for Midori-64 and Midori-128

Figure 1 (center) illustrates the serial-1 architecture created in this work for Midori-64 and Midori-128. This version focuses on reducing the SBOX count. For Midori-64, the SBox illustrated represents two 4-bit SBOX. For Midori-128, eight 8-bit permutations are also allocated inside the SBox. These permutations work together with two 4-bit SBOX to produce the output of the substitution layer. Two of the four permutations are selected depending on the position in the state of the data nibble being processed. The hardware realization of this design uses two 4-bit SBOX, the n/2 XOR gates simplification of MixColumn, an n-bit XOR layer, the 16 XOR gates used in the key generation, \(r-2\) 16-bit round constants, and \(128+n\) FF. This Midori-64 architecture has a latency of 169 cycles while the Midori-128 design requires a total of 373 cycles.

The serial-2 architecture developed in this work for Midori is shown in Fig. 1 (right). In this design the datapath width d is reduced to 16-bit for Midori-64 and 32-bit for Midori-128. In both cases the operations which can be serialized are the substitution layer, the MixColumn step, and the key addition. The n-bit permutation is performed during an extra cycle in the round. To achieve this design we modified the Midori algorithm so that the SubCell and ShuffleCell operations are swapped. This allows pushing the ShuffleCell step from iteration \(i\) back to iteration \(i-1\). As a result, the serializable steps of the algorithm are grouped at the beginning of the round and can be processed together in 4 cycles. The non-serializable part is left at the end of the round and performed in the extra cycle. This modification carries a cost only for Midori-128, because the 8-bit permutations used inside the SBox have to be shuffled. For implementing Midori-64 this design requires four 4-bit SBOX, an 8 XOR gates version of MixColumn, 32 XOR gates, 14 16-bit round constants, and 192 FF. The latency of this design amounts to 96 cycles. For implementing Midori-128 the hardware requirements are four 8-bit SBox, a 16 XOR gates version of MixColumn, 48 XOR gates, 18 16-bit round constants, and 256 FF. A latency of 112 cycles is required to encrypt the data with this architecture.
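The swap of SubCell and ShuffleCell is possible for Midori-64 because the same 4-bit S-box is applied to every cell, so substituting and then permuting cells equals permuting and then substituting. The Python check below is a sketch of that property only. The S-box table and the cell permutation are transcribed from memory of the Midori-64 specification [2] and should be verified against it; the property itself holds for any cell-wise S-box and any cell permutation. All function names are illustrative.

```python
import random

# Midori-64 4-bit S-box Sb0 (assumed transcription of the table in [2]).
SB0 = [0xC, 0xA, 0xD, 0x3, 0xE, 0xB, 0xF, 0x7,
       0x8, 0x9, 0x1, 0x5, 0x0, 0x2, 0x4, 0x6]

# ShuffleCell: output cell i takes input cell SHUFFLE[i] (assumed from [2]).
SHUFFLE = [0, 10, 5, 15, 14, 4, 11, 1, 9, 3, 12, 6, 7, 13, 2, 8]


def sub_cell(cells):
    """Apply the same S-box to each of the 16 nibbles."""
    return [SB0[c] for c in cells]


def shuffle_cell(cells):
    """Permute the 16 nibbles of the state."""
    return [cells[SHUFFLE[i]] for i in range(16)]


rng = random.Random(0)  # fixed seed so the check is reproducible
states = [[rng.randrange(16) for _ in range(16)] for _ in range(100)]
commutes = all(
    shuffle_cell(sub_cell(s)) == sub_cell(shuffle_cell(s)) for s in states
)
```

In Midori-128 the substitution depends on the cell position, so the two steps no longer commute unless the position-dependent parts are permuted as well, which is the extra cost noted above.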

2.3 Gift

The block cipher Gift is said to be a direct improvement over Present “that provides a much increased efficiency in all domains (smaller and faster)” and that also patches security weaknesses of the latter. Two specifications of the algorithm were presented in [3], for block sizes of 64 and 128-bit. A key size of 128-bit is used in both versions of the algorithm.

The iterative architecture created for Gift is presented in Fig. 2 (left). This design is a direct implementation of the specification with 8-bit IO ports. For Gift-64 or Gift-128 the design requires n / 4 4-bit SBOX, \(n/2+6\) XOR gates, a NOT gate, and \(n+134\) FF. The latency for Gift-64 is 52 cycles and the latency for Gift-128 is 72 cycles.

Fig. 2. Iterative (left), serial-1 (center), and serial-2 (right, the value d equals n / 4) architectures for Gift-64 and Gift-128

Figure 2 (center) presents our serial-1 architecture for Gift. This design uses 8-bit IO ports and has a serialized application of the substitution layer based on two 4-bit SBOX. The architecture illustrated describes both Gift-64 and Gift-128. In the case of Gift-64 the implementation requires two 4-bit SBOX, 38 XOR gates, a NOT gate, and 198 FF. For this version 276 latency cycles are required. For Gift-128, two 4-bit SBOX, 70 XOR gates, a NOT gate, and 262 FF are used. In this case the latency is 712 cycles.

Our serial-2 architecture for Gift, shown in Fig. 2 (right), was created by serializing the substitution, permutation, and key addition layers. The datapath width d was adjusted to 16-bit for Gift-64 and to 32-bit for Gift-128. The reduction of the substitution layer is straightforward for Gift. We used a regular pattern found in the original permutation to reduce the permutation layer width to a quarter of its original width. However, this reduction requires an additional transposition of the state. Let us use a 2-D representation of the state as described in [3]. The new reduced permutation yields a transposed version of the 2-D state, arranged in 16 n/16-bit nibbles. Thus, the additional permutation is a shuffling of the state in 4-bit nibbles for Gift-64 and 8-bit nibbles for Gift-128. This strategy is similar to that used in [9] for Present. The small permutation is applied in a serialized manner while the transposition is applied over the state during an additional cycle. The round key also needs to be shuffled to accommodate this intermediate result. In order to serialize the key addition step, we separated the addition of the keying materials from the addition of the round constants. The keying materials are derived from the key register, shuffled, and serialized before being applied to the state. The round constants are applied to the state during the additional cycle, while the key register is updated. Based on this architecture, the implementation of Gift-64 requires four 4-bit SBOX, 14 XOR gates, a NOT gate, and 198 FF. The implementation of Gift-128 uses eight 4-bit SBOX, 22 XOR gates, a NOT gate, and 262 FF. The total latency for Gift-64 and Gift-128 is 152 and 208 cycles, respectively.

2.4 Gimli

Gimli is a 384-bit permutation “designed to achieve high security with high performance across a broad range of platforms”. According to its creators, this permutation can be easily used to build high-security block ciphers. We have included this algorithm in our review since its authors claim it was designed for “energy-efficient hardware” and “compactness”. The specification for this function is presented in [4]. Since the implementations provided in [4] do not implement a block cipher, a secret key is not used.

In the iterative implementation for Gimli provided in [4] a block size of 384-bit is used. The application of the parallel SP-box requires two 384-bit permutations, 768 XOR gates, 256 AND gates, and 128 OR gates. The Big-Swap and the Small-Swap can be seen as 384-bit permutations. Finally, 37 XOR gates are used for the addition of the round constants. This architecture has a latency of 120 cycles.
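The gate counts quoted for the parallel SP-box can be reproduced from the structure of the SP-box itself. The Python sketch below applies the SP-box to one 96-bit column (three 32-bit words) and then counts one gate per bit for every XOR, AND, and OR operator over the four columns. The column function is transcribed from memory of the Gimli specification [4] and should be checked against it. The counting convention (one gate per bit per operator, rotations and shifts treated as wiring) is an assumption, chosen because it matches the figures above.

```python
MASK32 = 0xFFFFFFFF
WORD_BITS = 32
COLUMNS = 4  # the 384-bit state holds four 96-bit columns


def rotl32(value: int, amount: int) -> int:
    """Rotate a 32-bit word left by `amount` bits."""
    return ((value << amount) | (value >> (32 - amount))) & MASK32


def sp_box(a: int, b: int, c: int):
    """Apply the Gimli SP-box to one column (a, b, c), as recalled from [4]."""
    x = rotl32(a, 24)
    y = rotl32(b, 9)
    z = c
    new_c = (x ^ (z << 1) ^ ((y & z) << 2)) & MASK32  # two XOR, one AND
    new_b = (y ^ x ^ ((x | z) << 1)) & MASK32         # two XOR, one OR
    new_a = (z ^ y ^ ((x & y) << 3)) & MASK32         # two XOR, one AND
    return new_a, new_b, new_c


# One gate per bit per operator, over three output words and four columns.
xor_gates = 3 * 2 * WORD_BITS * COLUMNS
and_gates = 2 * 1 * WORD_BITS * COLUMNS
or_gates = 1 * 1 * WORD_BITS * COLUMNS
```

Under this convention the counts come out to 768 XOR, 256 AND, and 128 OR gates. The two rotations per column are bit reorderings, which would correspond to the two permutations mentioned above.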

A serial-1 architecture for Gimli was also retrieved from [4]. The main strategy for reducing resources consists of serializing the application of the SP-box layer. In this instance, 96 bits of the state are processed in parallel, so that four cycles are required for each application of the SP-box layer. The other transformations are applied to the state in a fifth cycle, which is present for half of the rounds. The application of the serialized SP-box requires two 96-bit permutations, 192 XOR gates, 64 AND gates, and 32 OR gates. The Big-Swap and the Small-Swap can still be represented as 384-bit permutations and 37 XOR gates are also used for the addition of the round constants. A latency of 204 cycles is required for this design.

2.5 Summary

Table 1 provides a summary of the different architectures discussed in this section.

Table 1. Summary of the different designs reviewed in this section

3 Experimental Evaluation

The different designs in Table 1 are used as configurations for our experimental evaluation. The VHDL description for the Present implementations is the one used in [8] and [9]. The hardware descriptions for the different Midori and Gift architectures were created in this work. Lastly, the VHDL description for the Gimli architectures is the one used in [4].

All the configurations were implemented for the xc6slx16-3csg324 FPGA using ISE Design Suite 14.2 and for the xc7a15t-1cpg236c FPGA using Vivado Design Suite version 2017.3. The synthesis process was configured with Area as the optimization goal in both instances. The use of RAM/ROM elements was disabled for all the implementations. We provide Post-Place & Route area results in terms of slices (SLC), Look-Up-Tables (LUT), and Flip-Flops (FF) for all the configurations on the two implementation platforms.

With regard to performance, we report the total latency (LAT), the maximum achievable frequency (Fmax) from the Post-Place & Route report, the runtime (Time), and the throughput (Thr) for each configuration. The throughput was calculated for operational frequencies of 100 KHz and Fmax as \(\text {Thr}=(\text {state size } \times \text { Freq}) / \text {LAT}\).
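As a worked instance of the throughput formula, the Python sketch below evaluates it at 100 KHz for the iterative architectures, using the state sizes and latencies stated in Section 2. The function and dictionary names are illustrative. No Fmax values are used, since those come from the Post-Place & Route reports.

```python
def throughput_bps(state_bits: int, freq_hz: float, latency_cycles: int) -> float:
    """Thr = (state size x Freq) / LAT, in bits per second."""
    return state_bits * freq_hz / latency_cycles


# (state size in bits, latency in cycles) as stated in Section 2.
ITERATIVE = {
    "Present": (64, 55),
    "Midori-64": (64, 41),
    "Midori-128": (128, 53),
    "Gift-64": (64, 52),
    "Gift-128": (128, 72),
    "Gimli": (384, 120),
}

FREQ_HZ = 100_000  # 100 KHz operating point

thr_kbps = {
    name: throughput_bps(bits, FREQ_HZ, lat) / 1e3
    for name, (bits, lat) in ITERATIVE.items()
}

for name, value in thr_kbps.items():
    print(f"{name:>10}: {value:8.2f} Kbps")
```

Because the state size appears in the numerator, a design with a long latency but a large state, such as Gimli, can still report a high throughput.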

A power analysis for the xc6slx16-3csg324 FPGA was performed using the Xilinx XPower Analyzer tool version 14.2 for operational frequencies of 100 KHz and Fmax. The power estimations were obtained after place and route using Xilinx XPower Analyzer 14.3 with HIGH overall confidence level. This analysis used the Post-Place & Route Design file (ncd), a Physical Constraints file (pcf) specific to the evaluation target, and a Simulation Activity file (saif) generated from a Post-Place & Route simulation in ISim. The Simulation Run Time was 100 ms for all the 100 KHz instances and 100 \(\mu \)s for all the Fmax instances. From this evaluation we report the quiescent and dynamic power for each design. The power dissipation and the performance at 100 KHz and Fmax were then used to calculate the energy consumption for each configuration.

We use three efficiency (EFF) metrics to evaluate the different configurations. The first figure represents the relation between performance and area and is given in Kbps per SLC. The second figure represents the relation between energy and area and is given in \(\mu \)J per SLC. Lastly, the third efficiency indicator represents the relation between the energy spent and the bits processed and is expressed in nJ per bit. These metrics are expected to indicate the suitability of the configurations for different trade-offs, which might be attractive for different application scopes.
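The three metrics can be sketched in Python as follows. The sketch assumes that energy is computed as power multiplied by execution time, with execution time equal to LAT divided by the operating frequency, and that the energy per bit divides the energy of one operation by the state size. The function name and the input values in the example (20 mW, 100 slices) are hypothetical placeholders, not results from this work.

```python
def efficiency_metrics(state_bits, latency_cycles, freq_hz, power_w, slices):
    """Return the three EFF figures for one configuration (assumed definitions)."""
    time_s = latency_cycles / freq_hz       # execution time of one operation
    thr_kbps = state_bits / time_s / 1e3    # throughput in Kbps
    energy_j = power_w * time_s             # energy = power x time
    return {
        "kbps_per_slc": thr_kbps / slices,
        "uj_per_slc": energy_j * 1e6 / slices,
        "nj_per_bit": energy_j * 1e9 / state_bits,
    }


# Hypothetical configuration: 64-bit state, 55 cycles, 100 KHz, 20 mW, 100 SLC.
example = efficiency_metrics(
    state_bits=64, latency_cycles=55, freq_hz=100_000, power_w=0.020, slices=100
)
```

Higher values are better for the first metric, whereas lower values are better for the two energy-based metrics.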

4 Results

The area and performance results for the implementations in the xc6slx16-3csg324 FPGA are presented in Table 2. The results for the power analysis and energy consumption calculations for the different configurations implemented in the xc6slx16-3csg324 FPGA are provided in Table 3. The area results in the xc7a15t-1cpg236c FPGA are shown in Fig. 3.

Table 2. Area and performance results for the xc6slx16-3csg324 FPGA using operational frequencies of 100 KHz and Fmax.
Table 3. Power and energy results for the xc6slx16-3csg324 FPGA using operational frequencies of 100 KHz and Fmax.
Fig. 3. Area results of lightweight block ciphers using the xc7a15t-1cpg236c FPGA. Results obtained after place and route

4.1 Discussion

The iterative architectures presented for Midori and Gift offer a good balance between area and performance. While iterative implementations are generally more efficient, serial architectures can be used in cases where further area reduction is needed.

The first type of serial architecture described (S1: reduction of the SBOX count) offers a reduction in hardware resources over the iterative architectures for all the block ciphers reviewed. However, its latency is the least favorable in every instance. The second type of serial architecture (S2: general reduction of the datapath) offers better performance than the S1 type, but its hardware profile varies from design to design. For Present, the serial-2 architecture (C03) appears to be ineffective compared to C01 in the xc6slx16-3csg324 FPGA. However, the improvement for this design (C03) is palpable when implemented on the xc7a15t-1cpg236c FPGA. Other instances where the serial-2 architecture is advantageous for area occur for Midori-64 and Gift-128 in the xc6slx16-3csg324 FPGA and for Midori-64 in the xc7a15t-1cpg236c FPGA.

The iterative architectures consistently achieved the lowest energy consumption figures. However, the second type of serial architecture dissipated the least power for Midori and Gift at low operational frequencies (100 KHz). While low energy consumption is a desirable trait for extending the lifetime of battery-powered applications such as WSN motes, low power dissipation is required in passive devices such as RFID tags.

Even though high operational frequencies lead to increased power dissipation, the execution times obtained from the frequency increment, and the resulting energy consumption, are greatly improved. For throughput, the variation from 100 KHz to Fmax is generally three orders of magnitude, which coincides with the reduction of the execution time. The frequency increment causes the power dissipation to double for all the configurations, but due to the delay reduction the final energy consumption is also reduced by three orders of magnitude for almost all the configurations. This experiment presents evidence that constrained devices can benefit from high operational frequencies. However, the application scope shall ultimately dictate the operational frequency to be used.

From the results it is possible to note how small IO buffers can be a burden for an implementation. It is known that most constrained devices cannot afford to implement wide interfaces. But if the IO width selected is too small, the port interfacing will take longer than the data processing itself. This is more evident for primitives with large block sizes such as Midori-128, Gift-128 and Gimli.
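The weight of the interface can be illustrated with a short Python calculation of the fraction of the latency spent on IO. The Present figures (16 input cycles, 31 encryption cycles, 8 output cycles) are those given in Section 2.1. The Gimli breakdown is an assumption: it supposes 8-bit ports (48 cycles in, 48 cycles out) and one cycle for each of the 24 rounds, which is consistent with the reported 120-cycle latency but is not stated in [4] as summarized here.

```python
def io_share(in_cycles: int, process_cycles: int, out_cycles: int):
    """Return (total latency, fraction of cycles spent on port interfacing)."""
    total = in_cycles + process_cycles + out_cycles
    return total, (in_cycles + out_cycles) / total


# Present iterative: breakdown stated in Section 2.1.
present_total, present_io = io_share(16, 31, 8)

# Gimli iterative: assumed breakdown for 8-bit ports and 24 one-cycle rounds.
gimli_total, gimli_io = io_share(384 // 8, 24, 384 // 8)

print(f"Present: {present_total} cycles, {present_io:.0%} spent on IO")
print(f"Gimli:   {gimli_total} cycles, {gimli_io:.0%} spent on IO")
```

Under these assumptions, interfacing accounts for a minority of the cycles in the iterative Present design but for most of the cycles in the iterative Gimli design.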

The efficiency results allow drawing specific comparisons among the different configurations. From the performance per slice comparison it is possible to note that the iterative architectures (C01, C04, C07, C10, C13, C16) are consistently more efficient than the serial realizations. From this set, the iterative implementations of the Gift block cipher, in both the 64-bit (C10) and 128-bit (C13) instances, proved to be the most efficient. The results are consistent for both operational frequencies used.

In terms of energy per slice, the minimal energy expenditure per slice is observed for the iterative realizations of Midori-64 (C04) and Midori-128 (C07). The maximum energy per slice was observed for the serial architectures of Gift (C14) and Present (C02); both of these designs follow the approach of reducing the number of substitution boxes. In this case the behavior for both operational frequencies is similar, even though the difference of three orders of magnitude is noticeable.

Both implementations of the Gimli permutation (C16, C17) obtained the smallest expenditures in the energy per bit efficiency results. These were followed by the iterative implementations of Gift-128 (C13) and Gift-64 (C10). The same pattern can be discerned for both operational frequencies used.

4.2 Comparison with the State of the Art

In the literature we found one work which implements the Midori block cipher in FPGA [1]. In that reference the authors propose fault-diagnosis schemes for Midori-128 and compare them with the “Original Midori128 Encryption” in an xc7vx330t FPGA. Results in SLC, maximum frequency, power, and throughput are provided for four Midori-128 implementations. Since a different FPGA platform is used and not all the information is available (latency, synthesis criteria), it is difficult to make a fair comparison. With regard to area, the implementations in [1] cost from 155 to 171 SLC while our designs for Midori-128 in the xc6slx16-3csg324 FPGA cost from 112 to 162 SLC. In performance, our fastest implementation of Midori-128 can reach up to 433 Mbps while the range in [1] is 42.52 to 47.41 Gbps. The power requirements for our designs range from 20.42 mW to 22.02 mW while the most modest design in [1] requires 340 mW. It is clear that the two sets of implementations were created following different design goals. While the results in [1] were obtained for improved security and high performance, our implementations seek to provide low implementation size and energy consumption.

No FPGA implementations for Gift were found in our review.

5 Conclusions

In this paper we have studied cryptographic algorithms which can substitute for Present and might be considered for future standardization. Even though the modern constructions are efficient, they cannot improve on the resource requirements of Present for secure state sizes.

We have provided lightweight hardware architectures for the Midori and Gift block ciphers. The proposed designs exhibit varying trade-offs which can be attractive for different applications. In order to increase the usability of our work, the hardware descriptions for these architectures are made public.

To the best of our knowledge, we have obtained the first FPGA results for the Gift block cipher and the first area-optimized implementations for Midori.