1 Introduction

In the digital era, a cryptographic protocol is a collection of rules and processes for securing communication between two or more parties. It is necessary to guarantee that data transported across a network or saved in a database is secured against unauthorized access, alteration, or theft. Without secure communication, sensitive information may be compromised, resulting in monetary loss or reputational harm. Encryption, digital signatures, and cryptographic hash functions are among the technologies that assure communication security [1, 2].

In the domain of cryptography, a hash function is a mathematical procedure that accepts input data of arbitrary size and produces an output of a fixed size (referred to as a hash). The hash function is constructed so that it is computationally infeasible to recover the original information from the hash result. This characteristic is helpful for data integrity verification, authentication, the calculation of digital signatures, and password storage. The capacity to maintain data integrity is one of the primary benefits of cryptographic hash algorithms. A hash value is, for practical purposes, unique to a single input; any change to the information will result in a new hash value. By comparing the hash values of the original and updated data, it is feasible to discover any unauthorized changes. This characteristic makes hash functions suitable for assuring data integrity in electronic transactions, such as online banking, e-commerce, and government activities [3, 4].
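The integrity check described above can be illustrated with a minimal sketch. It assumes Python's standard `hashlib` module and its `sha3_256` function; the helper name `digest` and the two example messages are hypothetical and serve only as an illustration.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA3-256 digest of data as a hexadecimal string."""
    return hashlib.sha3_256(data).hexdigest()


# Hypothetical example messages that differ in a single character.
original = b"transfer 100 EUR to account 12345"
tampered = b"transfer 900 EUR to account 12345"

reference = digest(original)

# Hashing the same input again reproduces the reference value.
unchanged = digest(original) == reference
# Any modification of the input yields a different hash value.
modified = digest(tampered) == reference
```

Storing `reference` alongside the data and recomputing the digest later is enough to detect that the data was altered.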

The Keccak [5] architecture of the Secure Hash Algorithm-3 (SHA-3) has become popular among hash techniques as a successor to the previously used SHA-1 [6] and SHA-2 [7,8,9] techniques. SHA-3 offers benefits over its predecessors, such as diversity and reusability, and methods have been explored since its selection in 2012 to improve its parameters for specific applications and hardware devices. Hardware implementations of SHA-3 are preferred over software implementations because of their superior power consumption, speed, and throughput. A field-programmable gate array (FPGA) is preferred over an application-specific integrated circuit (ASIC) as a hardware platform due to its lower price and shorter development time [10,11,12,13,14].

The advantage of SHA-3 in FPGA is its speed over previous SHA algorithms in hardware implementations, and it is designed to perform well on various hardware platforms. Implementing SHA-3 in an FPGA allows for algorithm customization and reconfiguration flexibility. FPGAs can also be designed to consume less power [15, 16] than traditional processors, which makes them ideal for implementing cryptographic functions like SHA-3. Finally, FPGAs can significantly increase the throughput of SHA-3 calculations [17]. Consequently, several strategies have been suggested to implement the Keccak algorithm effectively. These approaches concentrate on reducing the energy consumed, minimizing the area occupied, or enhancing the processing speed.

Overall, the contributions presented in this manuscript are summarised as follows:

  • We present a novel architectural design optimisation strategy based on unrolling the SHA-3 algorithm. Our approach enhances and maximises the throughput and efficiency performance metrics of FPGA devices, making it an ideal solution for many applications.

  • We propose a new simplified structure of the RC generator to achieve improved performance (throughput/efficiency) while reducing the hardware resources (area) required. The new simplified RC generator consists of only 7 bits instead of 64, thus reducing the computation in the Iota (\(\varvec{\iota }\)) step, where the number of required XORs is reduced to 7.

The remainder of the paper is organised as follows: In Sect. 2, we review studies related to our research. In Sect. 3, we describe the overall SHA-3 architecture. Section 4 describes our proposed hardware implementation of the SHA-3 algorithm on FPGA boards. In Sect. 5, we present the experimental outcomes of our study. In Sect. 6, we discuss the results of our method and the comparisons with other relevant research. Finally, Sect. 7 summarises our research’s findings and future work.

2 Related work

The cryptographic community has conducted significant research on optimising models, architectures, and strategies for SHA-3 in FPGA devices [18,19,20]. All these architectures aim to increase the throughput, efficiency, and frequency while attempting to decrease the area and power consumption in the FPGA [21,22,23,24,25,26,27]. Nevertheless, there is still a pressing need to increase throughput and efficiency while reducing area. In this section, we present research endeavours similar to ours.

The authors of [28] presented Keccak architectures for output sizes of 256 and 512. The RC is stored in a distributed ROM of 24\(\times \)64 bits. The Virtex-5 architecture for output size 256 needs 1217 slices and a 277 MHz clock and achieves 12.56 Gbps throughput, and the Virtex-7 needs 998 slices and a 300 MHz clock and reaches 13.60 Gbps throughput. The Virtex-5 architecture for output size 512 needs 1200 slices and a 270 MHz clock and achieves 6.48 Gbps throughput, and the Virtex-7 needs 983 slices and a 298.68 MHz clock and reaches 7.17 Gbps throughput. However, this architecture produced poor frequency and throughput.

Paul and Shukla [29] presented two Keccak architectures for output size 256 and an RC method with a count generator to fetch the RC with 64 bits from onboard read-only memory (ROM). The first architecture needs 4188 slices, a 390.53 MHz clock, and achieves 16.492 Gbps throughput. The second architecture needs 7139 slices, a 234.97 MHz clock, and reaches 19.99 Gbps throughput. However, these architectures produced poor frequency and increased area.

Wong et al. [30] presented a method to decrease the area required for ROM by reducing the bit length from 64 to 8 and presented five different Keccak architectures for output size 512. The first architecture needs 871 slices and a 153 MHz clock, achieving 3.68 Gbps throughput and 4.22 Mbps/Slices. The second architecture needs 1393 slices and a 335 MHz clock, reaching 8.04 Gbps throughput and 5.77 Mbps/Slices. The third architecture needs 2145 slices and a 45 MHz clock, achieving 2.16 Gbps throughput and 1.00 Mbps/Slices. The fourth architecture needs 1416 slices and an 85 MHz clock and attains 4.08 Gbps throughput and 2.88 Mbps/Slices. The fifth architecture needs 1406 slices and a 344 MHz clock, reaching 16.51 Gbps throughput and 11.47 Mbps/Slices. Although the total area occupied was not very large, the highest frequency attained was not satisfactory.

The unrolling approach, which reduces the total number of clock cycles by performing an additional round operation per cycle, is implemented on Virtex-5 in [31] for output size 256 and reaches 5.38 Gbps throughput. The authors of [32] also decreased the number of clock cycles for all output sizes by using the unrolling approach with Virtex-5 and Virtex-6. Despite this, the frequency and throughput delivered by these designs could have been better.

In [33], a basic architecture of Keccak was suggested for Virtex-7 FPGA with an output size of 512 bits. A distributed ROM with a dimension of 24\(\times \)64 bits was used to store the round constants (RC). The architecture occupied 1454 slices and ran at a clock frequency of 374.035 MHz. This design achieved a throughput of 7.979 Gbps and an efficiency rate of 5.49 Mbps/Slices. However, the area and throughput were both affected negatively by this implementation.

Assad et al. [34] suggested three Keccak implementations in Virtex-5 and Virtex-6 FPGA. The focus was on all output sizes. It is worth noting that the RC required for the Keccak implementation were stored in a ROM of 24\(\times \)64 bits. The basic implementation using Virtex-5 for output size 512 required 935 slices and operated at a clock frequency of 338.409 MHz. This design achieved a throughput of 8.12 Gbps and a rate of 8.68 Mbps/Slices. The basic implementation using Virtex-6 for output size 512 required 1019 slices and operated at a clock frequency of 376.081 MHz. This configuration achieved a higher throughput of 9.02 Gbps, with a rate of 8.85 Mbps/Slices. Nevertheless, the area and efficiency were affected negatively by this design.

In [35], a basic implementation of Keccak was proposed for Virtex-5 FPGA with an output size of 512 bits. The round constants (RC) were stored in a distributed ROM of 24\(\times \)64 bits. The implementation employed 1680 slices in the Virtex-5 FPGA and operated at a clock frequency of 387 MHz. This design achieved a throughput of 8.06 Gbps and an efficiency rate of 4.91 Mbps/Slices. Despite this, the area and efficiency delivered by this architecture could have been better.

A comprehensive examination and analysis of the above methodologies and their effects on the performance of the Keccak architecture indicates a demand for an improved architecture that yields high throughput combined with low area. Effective handling of the RC is essential to attaining high throughput. Therefore, we propose a new RC value generation technique with a minimised structural design where the number of required XORs is reduced to 7 instead of 64. This approach results in a substantial reduction in area while simultaneously delivering a sizeable increase in throughput. The presented Keccak architecture was tested and verified using the existing test vectors.

3 The SHA-3 architecture

In 2012, the National Institute of Standards and Technology (NIST) concluded a competition to establish a new standard hash function that would complement the existing SHA-1 and SHA-2 standards. The objective was to choose a function that would be secure, efficient, and resistant to attacks such as collision and preimage attacks [36]. The winner of the competition was the Keccak hash function. Unlike the previous SHA standards, SHA-3 is founded on the sponge construction (absorb/squeeze), as presented in Fig. 1.

Fig. 1 The (absorb/squeeze) sponge structure of the SHA-3 hash function

The sponge function is based on a state of “\(b = r + c\)” bits, where “b” denotes the state width, “r” indicates the bit rate of the sponge function, and “c” defines the capacity. All bits of this state are zero when it is first initialized. The Keccak hash algorithm organizes the state A as a three-dimensional array with the dimensions \(5 \times 5 \times w\), where w is the word size (64 bits for the 1600-bit state used by SHA-3).

Fig. 2 The block diagram of the SHA-3 hash function

An input message is padded by adding bits to it so that its total size becomes a multiple of a fixed number of bits, denoted as “r”. Once the message has been padded, it is separated into blocks of equal length, denoted as “\(P_i\)”. In the absorbing step, each block is XORed with the first “r” bits of the state, and the permutation function “f” is then applied. The “f” function is the central part of the processing; it runs 24 rounds, each consisting of five distinct steps: (i) “Theta (\(\varvec{\theta }\))”, (ii) “Rho (\(\varvec{\rho }\))”, (iii) “Pi (\(\varvec{\pi }\))”, (iv) “Chi (\(\varvec{\chi }\))” and (v) “Iota (\(\varvec{\iota }\))”, each of which performs a specific operation on a 1600-bit state matrix denoted as A [37]. The block diagram of the SHA-3 is illustrated in Fig. 2.

Table 1 The standard round constants \(RC_i\) generator in Iota (\(\varvec{\iota }\)) step of the SHA-3 algorithm

The steps “Theta (\(\varvec{\theta }\))”, “Rho (\(\varvec{\rho }\))”, “Pi (\(\varvec{\pi }\))”, “Chi (\(\varvec{\chi }\))” and “Iota (\(\varvec{\iota }\))” are shown in Eqs. (1)–(5). In particular, Eq. (1) refers to the computations performed in the “Theta (\(\varvec{\theta }\))” step. This step manipulates a two-dimensional array of size (\(5 \times 5\)), where C[i] and D[i] are one-dimensional arrays of lanes, and A[i, j] denotes the lane in column i and row j. The “Rho (\(\varvec{\rho }\))” and “Pi (\(\varvec{\pi }\))” steps compute the B[i, j] array from the state matrix A[i, j]. During the “Chi (\(\varvec{\chi }\))” step, the value of A[i, j] is recalculated in accordance with Eq. (4). Finally, the “Iota (\(\varvec{\iota }\))” step XORs a round constant, denoted as RC[i], into the lane A[0, 0].

The “Theta (\(\varvec{\theta }\))” step is the first step of the Keccak-f permutation. It involves (i) a parity computation, (ii) a rotation of one place, and (iii) a bitwise XOR operation. The parity computation XORs the five bits of each column, so that every 25-bit slice yields a 5-bit output, collected in the lanes C[i]. The rotation cyclically shifts the parity lane C[i+1] by one bit position. The bitwise XOR operation combines C[i−1] with the rotated C[i+1] to form D[i], which is then XORed into every lane of column i; the indices i−1 and i+1 are taken modulo 5.

Step Theta (\(\theta \)):

$$\begin{aligned} \textrm{C}[\textrm{i}]&=\textrm{A}[\textrm{i}, 0] {\text { XOR }} \textrm{A}[\textrm{i}, 1] {\text { XOR }} \textrm{A}[\textrm{i}, 2] {\text { XOR }} \textrm{A}[\textrm{i}, 3] {\text { XOR }} \textrm{A}[\textrm{i}, 4],&\quad 0 \le \textrm{i} \le 4 \\ \textrm{D}[\textrm{i}]&=\textrm{C}[\textrm{i}-1] {\text { XOR }} {\text {ROTATE}}(\textrm{C}[\textrm{i}+1], 1),&\quad 0 \le \textrm{i} \le 4 \\ \textrm{A}^{\prime }[\textrm{i}, \textrm{j}]&=\textrm{A}[\textrm{i}, \textrm{j}] {\text { XOR }} \textrm{D}[\textrm{i}],&\quad 0 \le \textrm{i}, \textrm{j} \le 4 \end{aligned}$$

The “Rho (\(\varvec{\rho }\))” step is a rotation step that rotates each lane of the state by an offset r[i, j] that depends on the position of the lane. The “Pi (\(\varvec{\pi }\))” step is a permutation step that rearranges the lanes of the state. Together, these two steps use the state array A to calculate an auxiliary \(5 \times 5\) array B, where each entry B[i, j] is a lane of w bits.

Step Rho (\(\rho \)):

$$\begin{aligned} A[i, j]={\text { ROTATE }}\left( A^{\prime }[i, j], r[i, j]\right) , \quad 0 \le i, j \le 4 \end{aligned}$$

Step Pi (\(\pi \)):

$$\begin{aligned} \textrm{B}[\textrm{j}, 2 \textrm{i}+3 \textrm{j}]=\textrm{A}[\textrm{i}, \textrm{j}], \quad 0 \le \textrm{i}, \textrm{j} \le 4 \end{aligned}$$

The “Chi (\(\varvec{\chi }\))” step is a bitwise logic operation that involves performing a bitwise XOR, NOT, and AND operation on the bits of the state.

Step Chi (\(\chi \)):

$$\begin{aligned} A[i, j]&=B[i, j] {\text { XOR }} (({\text {NOT }} B[i+1, j]) {\text { AND }} B[i+2, j]),\nonumber \\&\quad 0 \le i, j \le 4 \end{aligned}$$

The final step, “Iota (\(\varvec{\iota }\))”, XORs a round constant into a single lane of the state. The round constants are produced by the RC generator that is used in the “Iota (\(\varvec{\iota }\))” step. The \(RC_i\) values are presented in Table 1 and comprise 24 distinct 64-bit constants, one for each round of the SHA-3 operation [36].

Step Iota (\(\iota \)):

$$\begin{aligned} A[0,0]=A[0,0] {\text {XOR}} R C[i] \end{aligned}$$
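As a software cross-check of Eqs. (1)–(5), the five steps and the sponge of Fig. 1 can be sketched in Python and compared against the standard library. This is a minimal sketch, not the hardware design of this paper. The names `rol`, `keccak_f` and `sha3` are hypothetical, the index ranges are read as 0 to 4 with indices modulo 5, the round constants are produced by a byte-wide LFSR as commonly used in compact Keccak code, and the suffix byte `0x06` (domain bits "01" plus the first padding bit) follows FIPS 202 rather than anything stated in the text above.

```python
import hashlib

MASK = (1 << 64) - 1


def rol(lane, n):
    """Rotate a 64-bit lane left by n positions."""
    n %= 64
    return ((lane << n) | (lane >> (64 - n))) & MASK


def keccak_f(A):
    """Apply 24 rounds to a 5x5 list of 64-bit lanes A[x][y]."""
    lfsr = 1
    for _ in range(24):
        # Theta, Eq. (1): column parities, rotated neighbour, XOR into state
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rol(C[(x + 1) % 5], 1) for x in range(5)]
        A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
        # Rho and Pi, Eqs. (2)-(3): rotate each lane and move it to B[j, 2i+3j]
        x, y = 1, 0
        lane = A[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            lane, A[x][y] = A[x][y], rol(lane, (t + 1) * (t + 2) // 2)
        # Chi, Eq. (4): XOR each lane with (NOT next) AND (next but one)
        for y in range(5):
            row = [A[x][y] for x in range(5)]
            for x in range(5):
                A[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # Iota, Eq. (5): only bit positions 2**j - 1 of lane A[0][0] can change
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                A[0][0] ^= 1 << ((1 << j) - 1)
    return A


def sha3(message, d):
    """SHA3-d of a byte string for d in (224, 256, 384, 512); sketch only."""
    rate = (1600 - 2 * d) // 8          # bitrate r in bytes
    padded = bytearray(message) + b"\x06"
    padded += bytes(-len(padded) % rate)
    padded[-1] |= 0x80                  # final "1" bit of the padding
    state = bytearray(200)              # 1600-bit state, initially zero
    for off in range(0, len(padded), rate):
        for i in range(rate):           # absorb: XOR block into first r bits
            state[i] ^= padded[off + i]
        lanes = [[0] * 5 for _ in range(5)]
        for x in range(5):
            for y in range(5):
                o = 8 * (x + 5 * y)
                lanes[x][y] = int.from_bytes(state[o:o + 8], "little")
        lanes = keccak_f(lanes)
        for x in range(5):
            for y in range(5):
                o = 8 * (x + 5 * y)
                state[o:o + 8] = lanes[x][y].to_bytes(8, "little")
    return bytes(state[:d // 8])        # truncate to the output length d


matches_reference = sha3(b"abc", 256) == hashlib.sha3_256(b"abc").digest()
```

Comparing the result with `hashlib` for a few messages and output lengths is a quick way to confirm that the step order and index conventions have been read correctly.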

The NIST has determined four forms of the SHA-3 for generating hash values from a message M of any length and an output length size d, as presented in Table 2.

Table 2 The SHA-3 algorithm in its four different forms
Fig. 3 Proposed optimization architectural system of the SHA-3

Several hash function applications prefer smaller output sizes, specifically those that do not use the hash for security [38]. The larger the output length, the higher the security of the hash function against attacks. Nevertheless, a larger output length also means a slower hash function operation, as more processing power is required to produce the hash value. Thus, this work presents a structure that permits generating all four possible output lengths.
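The trade-off between output length and speed can be made concrete with a short sketch. It assumes the FIPS 202 relation c = 2d between capacity and output length, together with b = r + c = 1600; the function name `sha3_parameters` is hypothetical.

```python
STATE_BITS = 1600  # b = r + c for the 1600-bit Keccak state


def sha3_parameters(d):
    """Return (r, c) in bits for output length d, assuming c = 2d."""
    c = 2 * d
    r = STATE_BITS - c
    return r, c


# Bitrate and capacity for the four SHA-3 output lengths.
PARAMS = {d: sha3_parameters(d) for d in (224, 256, 384, 512)}
```

A longer output leaves a smaller bitrate r, so fewer message bits are absorbed per call of the permutation, which is why the 512-bit form is the slowest of the four.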

4 Proposed optimization architectural system

This section analyses the design components we implemented for all output lengths of the SHA-3 algorithm (224, 256, 384 and 512 bits, corresponding to the bitrates r = 1152, 1088, 832 and 576). The primary target of our work is to achieve higher throughput (Gbps) while reducing the area of our system. This target is achieved with the new simplified structure of the proposed RC generator, which eliminates the need for additional hardware resources and provides higher performance.

4.1 The architectural design of the SHA-3 (Keccak)

Our system architecture is presented in Fig. 3. The architecture comprises (i) padding, (ii) mapping, (iii) the Keccak round, (iv) truncate, (v) control and (vi) counter units. The Keccak round is at the core of the architectural design. The control unit is responsible for controlling, synchronizing, and communicating the flow of data inside our system. The input message data is 64 bits wide. The values for the select output length are shown in Table 3.

Table 3 The four different values for the select output length of the SHA-3 algorithm

4.2 Padding, mapping and truncating unit

The padding scheme of the SHA-3 (Keccak) for the input message is shown in Fig. 4. Initially, the 64-bit input message is appended with a “1” bit, followed by a run of “0” bits and a final “1” bit, so that the total message size is a multiple of “r” bits (576, 832, 1088, 1152) [39].
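The padding rule can be sketched at bit level as follows. The sketch follows the rule exactly as worded above (a “1”, a run of “0” bits, a final “1”); the two domain-separation bits that FIPS 202 places before the padding are deliberately left out because the text does not mention them. The function name `pad10star1` and the example message are hypothetical.

```python
def pad10star1(message_bits, r):
    """Pad a list of 0/1 bits to a multiple of r bits with 1, 0...0, 1."""
    # Number of zero bits needed after the first "1" and before the last "1".
    zeros = (-len(message_bits) - 2) % r
    return message_bits + [1] + [0] * zeros + [1]


# A 64-bit example message padded for the bitrate r = 1152.
message = [1, 0, 1, 1] * 16
block = pad10star1(message, 1152)
```

For a 64-bit message and r = 1152, the padding adds one “1”, 1086 zeros and a closing “1”, giving exactly one 1152-bit block.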

Fig. 4 The SHA-3 hashing algorithm’s padding scheme

Fig. 5 Padding block diagram of the SHA-3

Fig. 6 Padding unit of the SHA-3

The basic padding block diagram of the SHA-3 is illustrated in Fig. 5 and consists of 2-to-1 multiplexers for a 64-bit input message. The padding unit is shown in Fig. 6 and consists of one 4-to-1 multiplexer. If 224 bits are selected as the output length, then the padding scheme for \(r = 1152\) will be executed. The padded input message (Pad) of “r” bits enters the mapping unit, where it is XORed with the initial “r” bits of the state. The result is then concatenated with the initial “c” bits [36].

Fig. 7 Keccak round with 24 clock cycles with a new simplified structure of the RC generator

Fig. 8 Keccak round with 12 clock cycles with a new simplified structure of the RC generator

Data transformation into the three-dimensional state array is required, as shown in Eq. (6). The truncating unit cuts the bits of the state depending on the output length selected (224, 256, 384 or 512 bits) and consists of one 4-to-1 multiplexer.

$$\begin{aligned} {\text{State}}[x, y, z] = (({\text {Pad}}_{r} {\text { XOR }} r)\,||\, c)\left[ 64\times (5 y+x)+z\right] \end{aligned}$$
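The mapping can be sketched in Python under one assumption: the bracketed term 64 × (5y + x) + z in Eq. (6) is read as an index into the 1600-bit string formed by the r-bit XOR result followed by the c capacity bits, which matches the state layout of FIPS 202. The function name `map_to_state` is hypothetical, and the initial state is taken to be all zeros.

```python
W = 64  # lane width in bits


def map_to_state(padded_block, r):
    """Return state[x][y][z] for one r-bit block absorbed into a zero state."""
    flat = [0] * 1600
    for i in range(r):
        flat[i] ^= padded_block[i]      # Pad_r XOR the initial r bits
    # The remaining c = 1600 - r bits keep their initial value (zero).
    return [[[flat[W * (5 * y + x) + z] for z in range(W)]
             for y in range(5)] for x in range(5)]


# Example block with three bits set, placed at known flat positions.
block = [0] * 1152
block[0] = 1            # x = 0, y = 0, z = 0
block[64] = 1           # x = 1, y = 0, z = 0
block[5 * 64 + 3] = 1   # x = 0, y = 1, z = 3
state = map_to_state(block, 1152)
```

Consecutive groups of 64 bits fill the lanes along x first and then along y, so the first 64-bit input word lands in lane (0, 0).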

4.3 The Keccak round architecture

In this study, one of our primary goals has been to reduce the total number of clock cycles using an unrolling strategy while keeping the area reasonably low. The base architecture of the permutation rounds block is shown in Fig. 7, with the counter ranging from 0 to 23, indicating no attempt to minimize the total number of clock cycles. Starting from the architecture in Fig. 7, we halve the range of the counter to reduce the total number of clock cycles. This part of our methodology is shown in Fig. 8, which uses a total of just 12 clock cycles.

The unrolling strategy refers to a technique used to optimize the implementation of an algorithm by reducing loop overhead. In the base implementation of SHA-3, as shown in Fig. 7, the computation involves a single block of transformations (\(\varvec{\theta }\) \(\rightarrow \) \(\varvec{\rho }\) \(\rightarrow \) \(\varvec{\pi }\) \(\rightarrow \) \(\varvec{\chi }\) \(\rightarrow \) \(\varvec{\iota }\)). However, the unrolling strategy aims to further enhance the algorithm’s performance by executing multiple transformation blocks.

This work applied an unrolling factor of 2, as depicted in Fig. 8. This means that an additional block of transformations was included inside the Keccak Round module. With the unrolling factor of 2, two rounds of transformations (\(\varvec{\theta }\) \(\rightarrow \) \(\varvec{\rho }\) \(\rightarrow \) \(\varvec{\pi }\) \(\rightarrow \) \(\varvec{\chi }\) \(\rightarrow \) \(\varvec{\iota }\) \(\rightarrow \) \(\varvec{\theta }\) \(\rightarrow \) \(\varvec{\rho }\) \(\rightarrow \) \(\varvec{\pi }\) \(\rightarrow \) \(\varvec{\chi }\) \(\rightarrow \) \(\varvec{\iota }\)) are performed within a single clock cycle. By incorporating this unrolling strategy, the number of clock cycles required to complete the entire SHA-3 algorithm is halved. The standard SHA-3 algorithm comprises 24 rounds, and with the unrolling factor of 2, it will now take only 12 clock cycles to accomplish these 24 rounds.
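The effect of the unrolling factor on the cycle count can be modelled in software. The sketch below is a behavioural illustration only: each pass of the outer loop stands for one clock cycle, and `toy_round` is a hypothetical stand-in for the five-step round, not the Keccak round itself. The names `iterate_rounds` and `toy_round` are assumptions made for this example.

```python
def iterate_rounds(state, round_fn, n_rounds=24, unroll=1):
    """Apply round_fn n_rounds times; each outer pass models one clock cycle."""
    if n_rounds % unroll != 0:
        raise ValueError("unroll factor must divide the number of rounds")
    cycles = 0
    for base in range(0, n_rounds, unroll):
        for k in range(unroll):         # round blocks chained within one cycle
            state = round_fn(state, base + k)
        cycles += 1
    return state, cycles


def toy_round(state, i):
    """Stand-in round: any deterministic function of state and round index."""
    return ((state * 6364136223846793005) ^ i) & ((1 << 64) - 1)


base_out, base_cycles = iterate_rounds(1, toy_round, unroll=1)
fast_out, fast_cycles = iterate_rounds(1, toy_round, unroll=2)
```

Both calls produce the same final state, while the cycle count drops from 24 to 12. In hardware, the two chained round blocks lengthen the combinational path, so the achievable clock frequency has to be weighed against the halved cycle count.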

This acceleration by the unrolling strategy completes the computation in half the clock cycles, thus reducing overall execution time and improving the performance of the SHA-3 implementation, making it well-suited for various FPGA devices.

4.4 New simplified structure of the RC generator

Our research proposes a new simplified structure for the RC generator that significantly improves the algorithm’s performance while reducing hardware resources. The RC generator is a crucial component of the SHA-3 algorithm. Its primary function is to produce the round constants, derived from a pseudo-random bit sequence, that are XORed into the state in every round. The existing RC generator consists of 24 sets of 64 bits, which results in many computations in the “Iota (\(\varvec{\iota }\))” step of the SHA-3 algorithm. This step needs a large number of XOR operations to be executed, which can decrease performance and efficiency, especially in FPGA devices with limited resources.

To overcome this issue, we have designed a new simplified structure for the RC generator that only consists of 7-bits [36, 40]. By reducing the number of bits in the generator, we effectively reduce the computations required in the Iota step, resulting in improved performance and efficiency. Reducing the total number of bits also reduces the hardware resources required for the RC generator, resulting in a more compact design ideal for FPGA devices with limited resources.

The “Iota (\(\varvec{\iota }\))” step modifies some of the bits of the state array A, as shown in Eq. (7).

$$\begin{aligned} A^{\prime }[x, y, z]=A[x, y, z] {\text {XOR}} R C\left[ i_{r}\right] \end{aligned}$$

According to the specifications of SHA-3, the RC are given by Eq. (8),

$$\begin{aligned} {\text {RC}}\left[ i_{\textrm{r}}\right] [0][0]\left[ 2^{j}-1\right] ={\text {rc}}\left[ j+7 i_{\textrm{r}}\right] ,\, {\text {for all}} \quad 0 \le j \le \ell \end{aligned}$$

and all other values of \(RC[i_{\textrm{r}}][x][y][z]\) are zero. From Eq. (8), it follows that only 7 of the 64 bits can have the value 1. Table 4 presents the specific positions for the 7 bits where \(\ell \) = 6, by the specifications of the SHA-3.

So only those 7 of the 64 bits carry the round constant information and appear in specific places, while all other positions are zero. The specific bit positions that can carry the value “1” are 0, 1, 3, 7, 15, 31 and 63, with the rest being “0”. An example of the simplified structure for RC[3] of Table 5 is shown in Table 6. Thus, only seven specific bits need to be connected to XOR gates in state array A.
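The 7-bit representation can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the VHDL of the proposed generator: the bits rc[j + 7i] of Eq. (8) are produced with a byte-wide LFSR update commonly used in compact Keccak code, and the names `rc_compact_table`, `expand` and `iota` are hypothetical.

```python
def rc_compact_table():
    """Return 24 seven-bit values; bit j of entry i stands for rc[j + 7*i]."""
    table, lfsr = [], 1
    for _ in range(24):
        value = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                value |= 1 << j
        table.append(value)
    return table


def expand(compact):
    """Place bit j of a 7-bit value at position 2**j - 1 of a 64-bit lane."""
    lane = 0
    for j in range(7):
        if (compact >> j) & 1:
            lane |= 1 << ((1 << j) - 1)
    return lane


def iota(lane00, compact):
    """XOR at most seven bits (positions 0, 1, 3, 7, 15, 31, 63) into A[0,0]."""
    return lane00 ^ expand(compact)


COMPACT = rc_compact_table()
```

For the fourth round, `COMPACT[3]` is `0b1110000`, which expands to the 64-bit constant `0x8000000080008000` with bits 15, 31 and 63 set. Storing 24 values of 7 bits instead of 24 values of 64 bits is what removes the remaining 57 XOR positions from the Iota step.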

Table 4 Specific positions for the 7 bits with value 1
Table 5 The new simplified structure of the round constants \(RC_i\) in the Iota (\(\varvec{\iota }\)) step of the SHA-3 algorithm
Table 6 Example of the new simplified structure of the RC[3] in Iota (\(\varvec{\iota }\)) step

5 Experimental results

In our experiments, we used the Virtex-5, Virtex-6, Virtex-7, and Artix-7 FPGA boards to make a fair evaluation between the proposed design and other existing works, while providing a comprehensive comparison across multiple FPGA platforms for a broader assessment of the design’s efficiency and throughput. Xilinx ISE was used to implement the design on the Virtex-5 and Virtex-6, and Xilinx Vivado was used to implement the architecture on the Virtex-7 and Artix-7. The implementation was done in Very High Speed Integrated Circuit Hardware Description Language (VHDL). The designs were simulated and verified for full functionality with valid examples provided by the NIST [41].

5.1 Performance metrics

In order to ensure a fair comparison between the proposed design and other existing works, we used the established definitions of efficiency and throughput [18, 42,43,44] used in the literature. Maintaining consistency in performance metrics is essential for meaningful comparisons and benchmarking between different designs. Additionally, using the established definition of performance metrics as in other works enables researchers to compare our results with the existing literature and understand the performance advancements achieved by our proposed design.

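Throughput is the total number of bits processed per unit of time and is expressed in Gbps or Mbps. The throughput is computed using Eq. (9).

$$\begin{aligned} {\text {Throughput}}=\frac{{\text {Bmb}} \times {\text {Max}}_{f}}{{\text {Ccmb}}} \end{aligned}$$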

Table 7 The outcomes of the implementation in terms of throughput and comparison
Table 8 The outcomes of the implementation in terms of efficiency and comparison

In Eq. (9), Bmb (bits in a message block) is the bitrate size “r” (576, 832, 1088, 1152), \({\text {Max}}_{f}\) is the maximum clock frequency, and Ccmb (clock cycles per message block) represents the number of iterations needed for the five operations (\(\varvec{\theta }\) \(\rightarrow \) \(\varvec{\rho }\) \(\rightarrow \) \(\varvec{\pi }\) \(\rightarrow \) \(\varvec{\chi }\) \(\rightarrow \) \(\varvec{\iota }\)) to generate the hash value. The efficiency is calculated by using Eq. (10).

$$\begin{aligned} {\text {Efficiency}}=\frac{{\text {Throughput}}}{{\textrm{Area}}} \end{aligned}$$
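Both metrics can be checked numerically against figures quoted elsewhere in this paper. The sketch assumes the usual definition of throughput as Bmb × Max_f / Ccmb, which reproduces the 12.56 Gbps reported for the Virtex-5 design of [28]; the function names are hypothetical.

```python
def throughput_mbps(bmb_bits, max_f_mhz, ccmb):
    """Throughput in Mbps: block bits times clock (MHz) over cycles per block."""
    return bmb_bits * max_f_mhz / ccmb


def efficiency_mbps_per_slice(throughput, slices):
    """Eq. (10): throughput in Mbps divided by the number of occupied slices."""
    return throughput / slices


# Design [28] on Virtex-5, output size 256 (r = 1088), 277 MHz, 24 cycles.
t28 = throughput_mbps(1088, 277.0, 24)

# Proposed design: 36.358 Gbps (36358 Mbps) on 1375 slices.
eff = efficiency_mbps_per_slice(36358.0, 1375)
```

The first value is about 12557 Mbps, that is 12.56 Gbps, and the second is about 26.44 Mbps/slice, in line with the numbers given in the text.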

5.2 Results

The presented architectural design attains high throughput while reducing the hardware resources (area) for the various output lengths required to produce a hash value. The results of the implementation of this architectural design are summarized in Table 7, which shows the maximum frequency and throughput of all output lengths.

As shown in Eqs. (9) and (10), decreasing the total number of clock cycles increases the throughput, and reducing the area increases the efficiency, which were our primary objectives. Thus, our strategy concentrated on reducing the total number of iterations required to generate a hash value. Additionally, the new simplified structure of the RC generator has several advantages, such as reduced resource utilization, shorter design time, which reduces project complexity, and higher clock frequencies, which lead to higher performance.

According to the results, the proposed architectural design achieves a maximum throughput of 36.358 Gbps for 224 output length and 18.179 Gbps for 512 output length with an area of 1375 slices. Moreover, Table 7 provides a fair comparison of the achieved outcomes with the recent studies published in the literature.

In addition to the results in Table 7, the efficiency of the proposed design has been evaluated by dividing the throughput (in Mbps) by the area occupied (total number of slices). The results of this evaluation are summarized in Table 8 for all output lengths.

6 Results and discussion

Throughput and area are essential data processing metrics, especially for information security. This measure represents the algorithm’s efficiency and resistance to cryptanalysis attacks that focus on hardware flaws.

On the Virtex-5 board with 24 total clock cycles, design [35] occupied the highest area of 1680 slices, whereas the proposed design consumed the lowest area of around 868 slices. On the Virtex-6 board, the design [32] occupied the highest area of 1432 slices, whereas the design [30] consumed the lowest area of around 871 slices. Although [30] occupied slightly less area than the proposed design, it resulted in poor efficiency, frequency, and throughput, indicating that the design’s area utilization is only one of the factors to be considered in evaluating its performance.

On the Virtex-7 board, the proposed design marginally occupied the highest area of 1094 slices, while the design [28] consumed the lowest area of around 998 slices. However, it is observed that [28] produced low frequency, efficiency, and throughput. Thus, it indicates that the proposed design may have slightly higher area utilization but is still more efficient in performance compared to the design [28].

On the Artix-7 board, the design [29] occupied a significantly larger area, specifically 4188 slices, whereas the proposed design consumed a much lower area of approximately 902 slices. Although the design [29] achieved a higher throughput for the 512 output length, it performed poorly in efficiency, at 3.93 Mbps/slices. In contrast, the proposed design exhibited a higher efficiency rate of 10.57 Mbps/slices. This indicates that the proposed design is more area-efficient than [29]. The proposed design with 24 clock cycles delivers the highest throughput when implemented on Virtex-5, Virtex-6, and Virtex-7 boards. In addition to its throughput achievements, the proposed design also excels in low area utilization and efficiency when deployed on the Artix-7 board.

On the Virtex-5 board with 12 total clock cycles, the design mentioned in [32] occupied the highest area of 2144 slices, while the proposed design consumed the lowest area of around 1112 slices. It indicates that the proposed design is more area-efficient than the design mentioned in [32]. On the Virtex-6 board with 12 total clock cycles, the design mentioned in [32] occupied the highest area of 3557 slices, whereas the proposed design consumed the lowest area of around 1287 slices. In this case, the proposed design is also more area-efficient than the design mentioned in [32]. On the Artix-7 board, the design [29] occupied a much larger area, 7139 slices, but the proposed design consumed a substantially lower area, about 1184 slices. The design [29] performed poorly in terms of efficiency, with an efficiency value of 2.80 Mbps/slices, while the presented design demonstrated a higher efficiency rate of 10.31 Mbps/slices. The proposed design with 12 clock cycles achieves the highest throughput when utilized on Virtex-5, Virtex-6, and Virtex-7 FPGA boards. Furthermore, it exhibits exceptional efficiency and low area utilization when implemented on the Artix-7 board.

Consequently, the assessment of the presented architectural design concerning recent works in the literature demonstrates that it surpasses them in terms of throughput and area, making it an essential contribution to hash function design. So, applications that need speedy and efficient hash functions may benefit from this performance improvement.

7 Conclusions and future work

The importance of cryptography in ensuring the security and confidentiality of digital media cannot be overstated in today’s interconnected world. With the transmission of sensitive information in various forms, including text, image, video, and audio, it is crucial to have robust encryption algorithms that offer high-level security and resistance to attacks.

The SHA-3 (Keccak) algorithm is one such algorithm that has gained popularity due to its strong resistance to cryptanalysis attacks and its good combination of speed, performance, and security. Its adoption by NIST as a more secure replacement for SHA-1 and SHA-2 highlights its significance in ensuring the safety and integrity of digital data.

This study focuses on optimizing the performance of the SHA-3 algorithm for all output lengths (224, 256, 384, and 512 bits) on the Artix-7, Virtex-5, Virtex-6, and Virtex-7 FPGA boards. The research compares the proposed method to similar designs and shows that the presented architectural design has the highest performance in the standard evaluation criteria of area (slices), throughput (Gbps), and efficiency (Mbps/slices).

The study achieved an area of 1375 slices, a throughput of 36.358 Gbps, and an efficiency of 26.44 Mbps/slices with the Virtex-7 FPGA board, demonstrating the efficacy of the proposed architecture. Future research will aim to enhance the throughput and efficiency per round and to carry out more practical experiments involving FPGAs and entire systems-on-chip.