Exploring Parallelism to Improve the Performance of FrodoKEM in Hardware

FrodoKEM is a lattice-based key encapsulation mechanism, currently a semi-finalist in NIST’s post-quantum standardisation effort. A condition for these candidates is to use NIST standards for sources of randomness (i.e. seed-expanding), and as such most candidates utilise SHAKE, an XOF defined in the SHA-3 standard. However, for many of the candidates, this module is a significant implementation bottleneck. Trivium is a lightweight, ISO standard stream cipher which performs well in hardware and has been used in previous hardware designs for lattice-based cryptography. This research proposes optimised designs for FrodoKEM, concentrating on high throughput by parallelising the matrix multiplication operations within the cryptographic scheme. This process is eased by the use of Trivium due to its higher throughput and lower area consumption. The parallelisations proposed also complement the addition of first-order masking to the decapsulation module. Overall, we significantly increase the throughput of FrodoKEM; for encapsulation we see a 16×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$16\times $$\end{document} speed-up, achieving 825 operations per second, and for decapsulation we see a 14×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$14\times $$\end{document} speed-up, achieving 763 operations per second, compared to the previous state of the art, whilst also maintaining a similar FPGA area footprint of less than 2000 slices.


Introduction
The future development of a scalable quantum computer will allow us to solve, in polynomial time, several problems which are considered intractable for classical computers. Certain fields, such as biology and physics, would certainly benefit from this "quantum speed up"; however, this could be disastrous for security. The security of our current publickey infrastructure is based on the computational hardness of the integer factorisation problem (RSA) and the discrete logarithm problem (ECC). These problems, however, will be solved in polynomial time by a machine capable of executing Shor's algorithm [29].
To promptly react to the threat, the scientific community started to study, propose, and implement public-key algorithms, to be deployed on classical computers, but based on problems computationally difficult to solve also using a quantum or classical computer. This effort is supported by governmental and standardisation agencies, which are pushing for new and quantum resistant algorithms. The most notable example of these activities is the open contest that NIST [20] is running for the selection of the next public-key standardised algorithms. The contest started at the end of 2017 and is expected to run for 5-7 years.
Approximately seventy algorithms were submitted to the standardisation process, with the large majority of them being based on the hardness of lattice problems. Lattice-based cryptographic algorithms are a class of algorithms which base their security on the hardness of problems such as finding the shortest non-zero vector in a lattice. The reason for such a large number of candidates is because lattice-based algorithms are extremely promising: they can be implemented efficiently and they are extremely versatile, allowing to efficiently implement cryptographic primitives such as digital signatures, key encapsulation, and identity-based encryption.
As in the past case for standardising AES and SHA-3, the parameters which will be used for selection include the security of the algorithm and its efficiency when implemented in hardware and software. NIST have also stated that algorithms which can be made robust against physical attacks in an effective and efficient way will be preferred [21]. Thus, it is important, during the scrutiny of the candidates, to explore the potential of implementing these algorithms on a variety of platforms and to assess the overhead of adding countermeasures.
To this end, this paper concentrates on FrodoKEM, a key encapsulation mechanism submitted to NIST as a potential post-quantum standard. FrodoKEM is a conservative candidate due to its hardness being based on standard lattices, as opposed to Ring/Module-LWE, thus having limited practical evaluations. Thus, we explore the possibility to efficiently implement it in hardware and estimate the overhead of protection against power analysis attacks using first-order masking. To maximise the throughput, we rely on a parallelised implementations of the matrix multiplication. Although we do not utilise specialised techniques for parallelising the matrix multiplication, there exists a lot of prior art in this area of research [14,24]. We also aim to have a relatively low FPGA area consumption. To be parallelised, however, the matrix multiplication requires the use of a smaller and more performant pseudo-random number generator. We propose to achieve the performance required for the randomness generation by using Trivium, an international standard under ISO/IEC 29192-3 [13] and selected as part of the eSTREAM project, specifically selected for its hardware performance. 1 We utilise this instead of AES or SHAKE, as per the FrodoKEM specifications. We do this as a design exploration study and not (per se) as a recommendation; other alternative ciphers or hash functions with similar security arguments and performance profiles in hardware could equally be applied.
The rest of the paper is organised as follows. Section 2 discusses the background and the related works. Section 3 introduces the proposed hardware architectures and the main design decisions. Section 4 reports the results obtained while synthesising our design on re-configurable hardware and compares our performance against the state of the art. We conclude the paper in Sect. 5. 1 https://www.ecrypt.eu.org/stream/e2-trivium.html.

Background and related work
In this section, we provide some background on previous hardware implementations post-quantum cryptographic schemes, focusing on those which are candidates of NIST's standardisation effort. We will also elaborate more on FrodoKEM and its implementations as well as recalling the principles of masking.

Previous post-quantum hardware implementations
In order to provide a reference point on the state of the art in hardware designs of post-quantum candidates, we provide a brief summary here. Table 1 shows the area and throughput performances of candidates, separated by their post-quantum hardness type. Firstly, it is quite clear that SIKE is the largest and slowest of the schemes, consuming quite a large portion of the (expensive) FPGA they benchmark on. Hash-based and code-based schemes on the other hand, whilst requiring similarly large FPGA resources, make up for this and provide a high throughput. Lattice-based schemes generally enjoy the best-of-both-worlds in terms of area consumption and performance, having a relatively small FPGA area consumption and a relatively high throughput. Not only is this seen in Table  1, but this is also true for other lattice-based schemes, pre NIST's post-quantum competition. Within the lattice-based candidates, the ideal lattice schemes are, as expected, much more efficient in terms area throughput performance compared to standard lattices. This is essentially because of the complexity of their respective multiplications; in standard lattice schemes the matrix multiplications have O(n 2 ) complexity, whereas ideal and module schemes are able to use a NTT polynomial multiplier, reducing the complexity to O(n log n).

Implementations of FrodoKEM
FrodoKEM [19] is a key encapsulation mechanism (KEM) based on the original standard lattice problem learning with errors (LWE) [25]. FrodoKEM is a family of IND-CCA secure KEMs, the structure of which is based on a key exchange variant FrodoCCS [7]. FrodoKEM comes with two parameter sets FrodoKEM-640 and FrodoKEM-976, a summary of which is shown in Table 2. FrodoKEM key generation is shown in Algorithm 1, encapsulation is shown in Algorithm 2, and decapsulation is shown in Algorithm 3. The most computationally heavy operations in FrodoKEM are in Line 7 of Algorithm 1, Line 7 of Algorithm 2, and Line 11 of Algorithm 3, that is the matrix multiplication of two matrices, sampled from the error sampler and PRNG, respectively. The LWE instance is then completed by adding an 'error' value (as in Eq. 1). Some smaller operations such  Generate pseudo-random seed A ← H (z) 4: Generate A ∈ Z n×n q via A ← Frodo.Gen(seed A ) 5: Generate S ← Frodo.SampleMatrix(seed E , n,n, T χ , 1) 6: Generate E ← Frodo.SampleMatrix(seed E , n,n, T χ , 2) 7: Compute B ← AS + E 8: return public key pk ← seed A ||B and secret key sk ← (s||seed A ||B, S) 9: end procedure as message encoding is also required. The ciphertexts are the output of these calculations and are used to calculate a shared secret (ss) via SHAKE. The matrices generated heavily utilise the randomness sources, suggested by the authors via AES or SHAKE. The output of these algorithms has nice statistical properties, but the overhead required to achieve this is high.
There have been a number of software and hardware optimisations of FrodoKEM. Howe et al. [12] report both software and hardware designs for microcontroller and FPGA. The hardware design focuses on a plain implementation by using only one multiplier in order to fairly compare with previous work and the proposed software implementa- Compute μ ← Frodo.Decode(M) 6: Parse pk ← seed A ||b 7: Generate randomness seed E ||k ||d ← G( pk||μ ) 8: Generate S ← Frodo.SampleMatrix(seed E ,m, n, T χ , 4) 9: Generate E ← Frodo.SampleMatrix(seed E ,m, n, T χ , 5) 10: else return ss ← F(c 1 ||c 2 ||s||d) 16: end procedure tion. Due to their use of cSHAKE for randomness, they have to pre-store a lot of the randomness into BRAM and then constantly update these values. Due to this, the implementations do not have the ability to parallelise multipliers and incurs high memory costs.
So far there has been little investigation of side-channel analysis for FrodoKEM other than ensuring the implementations run in constant-time [12]. Bos et al. [8] have investigated FrodoKEM in terms of its resistance against power analysis. They find that the secret key is recoverable for a number of different scenarios, requiring a small amount of traces (< 1000) for any of the parameter sets. They propose a simple countermeasure to thwart their attack by changing the order during the inner product multiplication. A previous attack on Frodo by Aysu et al. [4] also suggests using random shuffling or by adding dummy instructions.
Thus, to counter this type of attack, it is important for masking to be investigated, and evaluated in terms of its practical performance. NIST have also stated many times that masking and countermeasures are an important evaluation criteria for analysing these post-quantum candidates [2,21].

SHAKE as a seed expander
The pqm4 project nicely summarises the percentage of time each post-quantum candidate spends using SHAKE in software [16,Section 5.3]. This shows that Kyber, NewHope, Round5, Saber, and ThreeBears spend upwards of 50% of their total runtimes using SHAKE in some form or another. For signature schemes, this value can reach upwards of 70% in some cases.
There has been previous investigations of using alternatives to SHAKE in software for NIST post-quantum standardisation candidates. Bos et al. [9] recently improved the throughput of software implementations of FrodoKEM by leveraging a different randomness source for generating the matrix A; xoshiro128**, increasing the throughput by 5×. Round5 has also been shown to improve its performance using an alternative randomness source [27], instead using a candidate from NIST's lightweight competition, which shows a performance improvement by 1.4×. SPHINCS+, using Haraka, has also been shown to have a 5× speed-up when considered instead of SHAKE [5]. These recent reports show there is room for further investigations (in hardware) for using SHAKE in post-quantum cryptographic schemes. Moreover, alternative random sources may be required for these NIST PQC schemes once they are integrated into the real world; e.g. in a Hardware Security Module (HSM) which require randomness from physical processes, i.e. a True Random Number Generator (TRNG). Despite these investigations into alternative randomness sources, utilising sources not specified in the scheme's specifications may break compatibility, which would be the case for FrodoKEM which only considers AES and SHAKE.

Side-channel analysis
In their call for proposals, NIST specified that algorithms which can be protected against side-channel attacks in an effective and efficient way are to be preferred [21]. To provide a whole picture about the performance of a candidate, it is thus important to evaluate also the cost of implementing "standard" countermeasures against these attacks. In FrodoKEM specifications, cache and timing attacks can be mitigated using well-known guidelines for implementing the algorithm. For timing attacks, these include to avoiding use of data derived from the secret to access the addresses and in conditional branches. To counteract cache attacks, it is necessary to ensure that all the operations depending on secrets are executed in constant-time.
Power analysis attacks can be addressed using masking and hiding. Masking is one of the most widespread and better understood techniques to protect against passive sidechannel attacks. In its most basic form, a mask is drawn uniformly from random and added to the secret. The resulting masked value, which is effectively a one-time-pad, and the mask are jointly called shares: if taken singularly they are statistically independent from the secret, and they must be combined to obtain the secret back.
Any operation that previously involved the secret has to be turned into an operation over its shares. As long as they are not combined, any leakage from them will be statistically independent of the secret too. In our context, we show how masking can easily be applied to FrodoKEM at a very low cost. We therefore argue the overhead that a masked implementation of FrodoKEM in hardware incurs is minimal, hence making it a strong candidate when side-channel analysis is a concern. In FrodoKEM the only operation using the secret matrix S is the computation of the matrix M as C − B S during decapsulation. When S is split in two (or more) shares S i using addition modulo q, the above multiplication by B can be simply applied to all shares independently. Results are then subtracted by C one by one, so that computations never depend on both shares simultaneously. More precisely, in a two-share scenario the flow of computation would be that first B S 1 is generated and subtracted from C to produce the first share of M as M 1 = C − B S 1 ; then, the actual value of M is derived via M = M 1 − B S 2 . All intermediate steps operate on values where at least one element is unknown to the adversary and thus no classical DPA style attack can succeed.
Masking can only be successful if an implementation features a low enough signal-to-noise ratio. Otherwise, single trace attacks, i.e. attacks where secrets can be extracted by analysing each trace individually, will succeed. Masking in itself cannot overcome this threat. Either an implementation has sufficient parallelism to ensure that the signal-to-noise ratio is sufficiently low, or, hiding countermeasures need to be deployed. Hiding countermeasures in hardware can take advantage of "unused" circuitry, e.g. it is possible to ensure parts of the circuit are always active; in our implementation we try to achieve this anyway to ensure high throughput. Other hiding countermeasures often increase the noise for the adversary by reordering of operations or the addition of dummy operations. For example, in the context of matrix multiplications, one can, with minimal overhead, change the ordering of the processing of rows/columns and even the ordering of the computation of the partial products. Such measures typically imply a small amount of extra circuitry when implemented in hardware, but they do not result in a different architecture for the matrix multiplication across different other choices.

Hardware design
Our main design goal is to improve the throughput of the lattice-based key encapsulation scheme FrodoKEM [19] when implemented in hardware. As described in Sect. 2, FrodoKEM is one of the leading conservative candidates submitted to the NIST post-quantum standardisation effort [20], currently a semi-finalist in the process. Moreover, it has been shown to have appealing qualities which make it an ideal candidate for hardware implementations, such as having a power-of-two modulus and significantly easier parameter selection. However, a complete exploration of the possible hardware optimisations applicable to FrodoKEM has yet to be done. For instance, previous implementations do not consider parallelisations or other design alternatives capable of significantly improving the throughput.
As described in Sect. 2, FrodoKEM requires heavy use of randomness generation and/or seed expanding. In the algo-rithm specifications, it is suggested to use either SHAKE or AES. In particular, the most computationally intensive operations, such as Line 11 of Algorithm 3, require 410 k or 953 k 16-bit pseudo-random values, depending on the parameter set used. In order for the generation of randomness not to be the bottleneck, it needs to achieve a very high throughput (ideally with relatively low area consumption) typically in the range of 16 bits per clock cycle. In a previous hardware design, proposed by Howe et al. [12], high throughput for the PRNG was achieved by pre-calculating randomness and storing it in BRAM. Random data newly calculated were then written into the memory, overwriting the random data previously stored. This is an efficient approach, however, a more efficient PRNG that would not require BRAM usage, potentially increasing the operating frequency of the design, and thus improve its throughput. Moreover, parallelisations were not possible for this design, as this would either require a faster SHAKE design, increasing the area consumption by 3-8× [6] or worse still having several SHAKE instances, incurring an even worse resource consumption overhead. The area consumption of SHAKE (or AES) was an issue with the previous hardware design. For example, cSHAKE used within FrodoKEM-640 Encaps occupies 42% of the overall hardware resources [12].
To improve the parallelism of our implementation, we further the discussions in Sect. 2.3. That is, we further the investigations that research alternative sources of randomness in post-quantum cryptographic schemes and translate this into hardware. As with other design explorations, this means we do not completely comply with the specifications (and test vectors) by not using a NIST standard. However, their security arguments that AES is an 'ideal cipher' for use as an seed expander still apply as we replace this with Trivium, as it has analogous security properties of being indistinguishable from random. Trivium does not provide the same level of classical security as AES or SHAKE; however, it is used to randomly generate a public element and suffices to eliminate the possibility of backdoors and all-for-the-price-of-one attacks [19]. Moreover, with NIST's lightweight competition happening in parallel, it is likely that there will be future NIST standards that are more efficient than SHAKE. We may also see specific use cases where an alternative PRNG is preferred to SHAKE. Thus, considering alternative PRNGs as a design exploration is an important contribution to the standardisation process.
We explored several options for the randomness source used in the Frodo.Gen operation, that is, sampling the matrix A, and we decided to integrate an unrolled ×32 Trivium [10] implementation into our design. The use of alternative PRNG sources is discussed in the FrodoKEM specifications, specifically they state that "the distribution of matrix A from a truly uniform distribution to one generated from a public random seed in a pseudorandom fashion does not affect the security of FrodoKEM or FrodoPKE, provided that the pseudorandom generator is modelled either as an ideal cipher (when using AES128) or a random oracle (when using SHAKE128)"; thus, we use Trivium as our 'ideal cipher', which also maintains good statistical pseudo-randomness properties as well as the high throughput performances we need for our designs.

Hardware optimisations
In order to fully explore the potential of FrodoKEM in hardware, we propose several architectures characterised by different design goals (in terms of throughput). We use the proposed architecture to implement key generation, encapsulation, and decapsulation, on two sets of parameters proposed in the specifications: FrodoKEM-640 and FrodoKEM-976. Our designs use 1×, 4×, 8×, and 16× parallel multiplications during the most computationally intensive parts in FrodoKEM. These operations are the LWE matrix multiplications of the form: required in key generation, encapsulation, and decapsulation. In the previous hardware implementations of FrodoKEM, the operations of the type of Eq. 1 took approximately 97.5% of the overall runtime of the designs [12]. As in the literature, we exploit DSP slices on the FPGA for the multiply-and-accumulate (MAC) operations required for matrix multiplication. Hence, each parallel multiplication of the proposed designs requires its own DSP slice. The LWE matrix multiplication component incurs a large computational overhead. Because of this, it is an ideal target for optimisations, and for our optimisations we heavily rely on parallelisations. Firstly, we describe the basic LWE multiplier that includes just one multiplication component. Then, we describe how this core is parallelised, allowing us to significantly improve the throughput. Figure 1 shows a high-level overview of the hardware architecture and the following descriptions will link to the design overview. The Arithmetic part of the LWE core is essentially made by vector-matrix multiplication (that is, S[row] × A), addition of a Gaussian error value (that is, E[row, col]), and, when needed, an addition of the Encoding of message data. Since the matrix S consists of a large number of column entries (either 640 or 976) but only 8 row entries (for both parameter sets), we decided to implement a vector-matrix multiplier, instead of (a larger) matrix-matrix one. By doing this, we can reuse the same hardware architecture for each row of S, saving significant hardware resources. Each run of the row-column MAC operation exploits a DSP slice on the FPGA, which fits within the 48-bit MAC size of the FPGA. The DSP slice is ideal for these operations, but it also ensures constant computational Fig. 1 A high-level overview of the proposed hardware designs for FrodoKEM for k parallel multipliers. The architecture is split into sections 'PRNGs' for Trivium modules, 'Error Sampling' for the Gaussian sampler, 'Arithmetic' for the LWE multiplier, and 'Outputs' for the shared-secret and ciphertexts runtime, since each multiplication requires one clock cycle. Once each row-column MAC operation is completed, an error value is added from the CDT sampler. These outputted ciphertext values are also consistently added into an instantiation of SHAKE, which is required to calculate the shared secret. This process is pipelined to ensure high throughput and constant runtime.
To avoid using BRAM (for pre-computing some of the matrix A) and while keeping the throughput needed by the MAC operations of the matrix multiplications, the designs require 16 bits of pseudo-randomness per multiplication per clock cycle. Thus, for every two parallel multiplications we require one Trivium instantiation, whose 32-bit output per clock cycle is split up to form two 16-bit pseudo-random integers. 2 This is shown in PRNGs part of Fig. 1. This pseudorandomness forms the matrix A in Eq. 1, whereas the matrix S and E require randomness taken from Gaussian sampler. The cumulative distribution table (CDT) sampler technique has been shown to be the most suitable one for hardware [11], and thus, we use it in our designs. However, compared with previous works, we replace the use of AES as a pseudo-random input with Trivium. This ensures the same high throughput, but requires significantly less area on the FPGA.
The technique we use to parallelise Eq. 1 is to vertically partition the matrix A into k equal sections, where k is the number of parallel multiplications, and DSPs, used. This is shown in Fig. 2  In order to produce enough randomness for these multiplications to have no delays, we need one instance of our PRNG, Trivium, for every two parallel multiplications. This is because each element of the matrix A is set to be a 16-bit integer and each output from Trivium is 32 bits, that is, two 16-bit integers. As the Trivium modules are relatively small in area consumption on the FPGA (169 slices), an increase in k is fairly scalable as an impact on the overall design.

Efficient first-order masking
We implement first-order masking scheme (discussed in Sect. 2.4) to the decapsulation operation M = C−B S, as this is the only instance where secret-key information is used. Our design allows us to implement this masking schema without affecting the area consumption or throughput. Essentially, this is achieved by re-using the parallelised matrix multiplier used through the proposed hardware design for FrodoKEM. The matrix S is split using the same technique from Fig. 2 and our secret shares are generated by using the Trivium modules as a PRNG source. By computing these calculations in parallel, the masked calculation of M has the same runtime as the one needed to complete the calculation when masking is not used. We ensure that the same row-column operation during the matrix multiplication is not computed in each parallel operation, to circumvent any attack that might combine the power traces and essentially remove the masking. To ensure this countermeasure operates effectively and has no implementation mistakes, one should further this by performing TVLA analysis.

A note on software implementations
The reason the performance of Trivium was investigated for use within FrodoKEM is due to Trivium's outstanding performance specifically in hardware, which was the reason it was chosen for the eSTREAM project. However, one should not expect a similar performance gain by using Trivium in FrodoKEM in software. To demonstrate, we can take the performance of the AES implementation [28] used by pqm4 [16] which operates at 101 cycles per byte on ARM Cortex M3/M4 and a Trivium implementation [1] which operates at 36 cycles per byte on ARM Cortex M0. 3 We should additionally consider that in the FrodoKEM-640 implementation using AES; key generation, encapsulation, and decapsulation, respectively, use AES for 73.6%, 77.1%, and 76.3% of its overall clock cycles. Thus, all other things being equal, when replacing AES with Trivium we might see an increase of 1.9-2× in the overall runtime of FrodoKEM-640 implementation.

Results
In this section, we present the results obtained when implementing our FrodoKEM architecture. We provide a table of results for each of the key generation, encapsulation, and decapsulation designs in Tables 3, 4, and 5, respectively. We also provide results for the PRNG and Gaussian sampler in Table 6. All tables give comparative results of the previous FrodoKEM design in hardware, which utilise 1× LWE multiplier per clock cycle and completely conform to the FrodoKEM specifications by using cSHAKE where we are using Trivium. Moreover, all results are benchmarked on the same FPGA device as previous work, Xilinx Artix-7 XC7A35T FPGA, running on Vivado 2019.1.
The first analysis is directed towards the performance of the PRNG. When compared to cSHAKE, the PRNG previously used in literature, Trivium (the PRNG we propose to use), occupies 4.5× less area on the FPGA (measured in slices). This means that when we instantiate a higher number of parallel multipliers, we consume far less FPGA area than what would be needed when using cSHAKE, as discussed in the algorithm proposal. The increase in area occupation, due to parallelising, is essentially the only reason for area increase when we move from a base design to a design of the same module with a higher number of parallel multipliers. This is because the vector being multiplied remains constant, we just require some additional registers to store these extra  random elements. There is obviously an increase when we move from parameter sets due to the matrix A increasing from 640 to 976 elements. Additionally, we are able to use a much smaller version of SHA-3 for generating the random seeds (< 400 FPGA slices) and shared secrets as the computational requirements for it have significantly decreased.
There is a significant increase in area consumption of all the decapsulation results which do not utilise BRAM. This is mainly due to the need of storing public-key and secret-key matrices. We provide results for both architectures with and without BRAM. The design without BRAM has a significantly higher throughput, due to the much higher frequency. These results are reported in Fig. 4, which shows the efficiency of each design (namely their throughput) per FPGA slice utilised. Figure 3 shows a slice count summary of all the proposed designs, showing a consistent and fairly linear increase in slice utilisation as the number of parallel multipliers increases. We note on decapsulation results in Fig. 3 where the results would lie if BRAM is used, hence the total results for without BRAM include both red areas (i.e. they overlap). In most cases, slice counts at least double for decapsulation when BRAM is removed, with only slight increases in throughout; hence, it might be not be useful in some use cases. BRAM usage, however, is not as friendly when hardware designs are considered for ASIC; thus, it is useful to consider designs both with and without BRAM.
By changing our source of randomness and parallelising the most computationally heaving components in FrodoKEM, we have shown significant improvements in FPGA area consumption and throughput performance compared to the previous works. For instance, comparing to FrodoKEM module [12] (that is using one multiplier) we reduce slice consumption by 3.6× and 5.4× for key generation and 1.6× for encapsulation, all whilst not requiring any BRAM, whereas previous results utilise BRAM. For decapsulation, we decrease the amount of slices used between 1.6× and 2.6× when BRAM is used and similarly decrease slice counts by 1.5× and 1.1× when BRAM is not used. These savings are expected since more than half of this is due to storage otherwise used in BRAM.
Tables 3, 4, and 5 also contain a metric to analyse the areatime efficiency of the proposed designs. This metric takes into account the hardware design's utilisation of slices (i.e. not BRAM) and the time taken (in this case, seconds) for a full key generation, encapsulation, or decapsulation operation to complete. There is a common trend when analysing these  We also see this in the increase in throughput performance in Fig. 4. Moreover, we can use this metric to compare with previous work by Howe et al. [12] to see that in all cases, parallelising results provide significant speed-ups; up to 35× improvement for key generation.
Since the majority of our proposed designs operate without BRAM 4 , we are able to attain a higher frequency than previous works. Overall our throughput outperforms previous comparable results, by factors between 1.13× and 1.19× [12]. Moreover, whilst maintaining less area consumption than previous research, we are able to increase the amount of parallel multipliers. As a result, we can achieve up to 840 key generations per second (a 16.5× increase), 825 encapsulations per second (a 16.2× increase), and 710 operations per second (a 15.6× increase). We also main- 4 We ensure BRAM is not inferred in our designs by setting -max_bram to zero for synthesis in Vivado. tain the constant runtime which the previous implementation attains, as well as implementing first-order masking during decapsulation. 5 The masking is also done using parallel multiplication and thus does not affect the runtime of the decapsulation module. The clock cycle counts for each module are easy to calculate; key generation requires (n 2n )/k clocks, encapsulation requires (n 2n +n 2 n)/k clocks, and decapsulation requires (n 2n + 2n 2 n)/k clocks, for dimensions n = 640 or 976,n = 8, and k referring to the number of parallel multipliers used.

Conclusions
The main contribution of this research is to evaluate the performance potential of FrodoKEM [19], a NIST post-quantum candidate for key encapsulation, when utilising a significantly more performant PRNG in hardware. We develop designs which can reach up to 825 operations per second, where most of the designs fit in under 1500 slices. Area consumption results are less than the previous state of the art and are much lower than many of the other post-quantum hardware designs shown in Table 1. We significantly improve the throughput performance compared to the state of the art, by increasing the number of parallel multipliers we use during matrix multiplication. In order to do this efficiently, we replace an inefficient PRNG previously used, cSHAKE, with a much faster and smaller PRNG, Trivium. As a result, we are able to obtain either a much lower FPGA footprint (up to 5x smaller) or a much higher throughput (up to 16× faster) compared to previous research. Our implementations run in constant computational time and the designs comply with the Round 2 version of FrodoKEM in all aspects except for this PRNG choice. To further evaluate the performance of FrodoKEM, we implemented first-order masking for decapsulation, and we showed that it can be achieved with almost no effect on performance. We expect this research would have an impact on real-world use cases such as in TLS, as shown previously from key exchange version of Frodo [7], potentially making its performance competitive with classical cryptographic schemes used today.
The results show that FrodoKEM is an ideal candidate for hardware designs, showing potential for high-throughput performances whilst still maintaining relatively small FPGA area consumption. Moreover, compared to other NIST lattice-based candidates, it has a lot more flexibility, such as increasing throughput without completely re-designing the multiplication component, compared to, for example, a NTT multiplier.