FPGA-Based Key Generator for the Niederreiter Cryptosystem Using Binary Goppa Codes
Abstract
This paper presents a post-quantum secure, efficient, and tunable FPGA implementation of the key-generation algorithm for the Niederreiter cryptosystem using binary Goppa codes. Our key-generator implementation requires as few as 896,052 cycles to produce both public and private portions of a key, and can achieve an estimated frequency Fmax of over 240 MHz when synthesized for Stratix V FPGAs. To the best of our knowledge, this work is the first hardware-based implementation that works with parameters equivalent to, or exceeding, the recommended 128-bit “post-quantum security” level. The key generator can produce a key pair for parameters \(m=13\), \(t=119\), and \(n=6960\) in only 3.7 ms when no systemization failure occurs, and in \(3.5 \cdot 3.7\) ms on average. To achieve such performance, we implemented an optimized and parameterized Gaussian systemizer for matrix systemization, which works for any large-sized matrix over any binary field \(\text {GF}(2^m)\). Our work also presents an FPGA-based implementation of the Gao-Mateer additive FFT, which takes only about 1000 clock cycles to finish the evaluation of a degree-119 polynomial at \(2^{13}\) data points. The Verilog HDL code of our key generator is parameterized and partly code-generated using Python and Sage. It can be synthesized for different parameters, not just the ones shown in this paper. We tested the design using a Sage reference implementation, iVerilog simulation, and on real FPGA hardware.
Keywords
Post-Quantum Cryptography · Code-based cryptography · Niederreiter key generation · FPGA · Hardware implementation
1 Introduction
Once sufficiently large and efficient quantum computers can be built, they will be able to break many cryptosystems used today: Shor’s algorithm [22, 23] can solve the integer-factorization problem and the discrete-logarithm problem in polynomial time, which fully breaks cryptosystems built upon the hardness of these problems, e.g., RSA, ECC, and Diffie-Hellman. In addition, Grover’s algorithm [10] gives a square-root speedup on search problems and improves brute-force attacks that check every possible key, which threatens, e.g., symmetric-key ciphers like AES. However, a “simple” doubling of the key size can be used as mitigation for attacks using Grover’s algorithm. In order to provide alternatives for the cryptographic systems that are threatened by Shor’s algorithm, the cryptographic community is investigating cryptosystems that are secure against attacks by quantum computers using both Shor’s and Grover’s algorithms in a field called Post-Quantum Cryptography (PQC).
Currently, there are five popular classes of PQC algorithms: hash-based, code-based, lattice-based, multivariate, and isogeny-based cryptography [3, 21]. Most code-based public-key encryption schemes are based on the McEliece cryptosystem [16] or its more efficient dual variant developed by Niederreiter [18]. This work focuses on the Niederreiter variant of the cryptosystem using binary Goppa codes. There is some work based on QC-MDPC codes, which have smaller key sizes compared to binary Goppa codes [12]. However, QC-MDPC codes can have decoding errors, which may be exploited by an attacker [11]. Therefore, binary Goppa codes are still considered the more mature and secure choice despite their disadvantage in key size. Until now, the best known attacks on the McEliece and Niederreiter cryptosystems using binary Goppa codes are generic decoding attacks, which can be warded off by a proper choice of parameters [5].
However, there is a tension between the algorithm’s parameters (i.e., the security level) and the practical aspects, e.g., the size of keys and computation speed, resulting from the chosen parameters. The PQCRYPTO project [20] recommends using a McEliece cryptosystem with binary Goppa codes with binary-field size \(m = 13\), adding \(t = 119\) errors, code length \(n = 6960\), and code rank \(k = 5413\) in order to achieve 128-bit post-quantum security for public-key encryption when accounting for the worst-case impact of Grover’s algorithm [1]. The classical security level for these parameters is about 266 bits [5]. This recommended parameter set results in a private key of about 13 kB and a public key of about 1022 kB. These parameters provide maximum security for a public key of at most 1 MB [5]. Our tunable design is able to achieve these parameters, and many others, depending on the user’s needs.
The Niederreiter cryptosystem consists of three operations: key generation, encryption, and decryption. In this paper, we focus on the implementation of the most expensive operation in the Niederreiter cryptosystem: key generation. The industry PKCS #11 standard defines a platform-independent API for cryptographic tokens, e.g., hardware security modules (HSMs) or smart cards, and explicitly contains functions for public/private key-pair generation [19]. Furthermore, hardware crypto accelerators, e.g., for IBM’s z Systems, have dedicated key-generation functions. These examples show that efficient hardware implementations for key generation will also be required for post-quantum schemes. We selected FPGAs as our target platform since they are ideal for hardware development and testing; most parts of the hardware code can also be reused for developing an ASIC design.
Due to the confidence in the Niederreiter cryptosystem, there are many publications on hardware implementations related to this cryptosystem, e.g., [13, 15, 24]. We are only aware of one publication [24] that presents a hardware implementation of the key-generation algorithm. The key-generation hardware design in [24], however, uses fixed, non-tunable security and design parameters, which do not meet the currently recommended post-quantum security level, and has a potential security flaw by using a non-uniform permutation, which may lead to practical attacks. The main contributions of this paper are:

- a key generator with tunable parameters, which uses code-generation to generate vendor-neutral Verilog HDL code,
- a constructive, constant-time approach for generating an irreducible Goppa polynomial,
- an improved hardware implementation of a Gaussian systemizer, which works for any large-sized matrix over any binary field,
- a new hardware implementation of the Gao-Mateer additive FFT for polynomial evaluation,
- a new hardware implementation of the Fisher-Yates shuffle for obtaining uniform permutations, and
- design testing using Sage reference code, iVerilog simulation, and output from real FPGA runs.
2 Niederreiter Cryptosystem and Key Generation
The first code-based public-key encryption system was given by McEliece in 1978 [16]. The private key of the McEliece cryptosystem is a randomly chosen irreducible binary Goppa code \(\mathcal {G}\) with a generator matrix G that corrects up to t errors. The public key is a randomly permuted generator matrix \(G^\text {pub} = SGP\) that is computed from G and the secrets P (a permutation matrix) and S (an invertible matrix). For encryption, the sender encodes the message m as a codeword and adds a secret error vector e of weight t to get the ciphertext \(c = mG^\text {pub} \oplus e\). The receiver computes \(cP^{-1} = mSG \oplus eP^{-1}\) using the secret P and decodes m using the decoding algorithm of \(\mathcal {G}\) and the secret S. Without knowledge of the code \(\mathcal {G}\), which is hidden by the secrets S and P, it is computationally hard to decrypt the ciphertext. The McEliece cryptosystem with correct parameters is believed to be secure against quantum-computer attacks.
In 1986, Niederreiter introduced a dual variant of the McEliece cryptosystem by using a parity-check matrix H for encryption instead of a generator matrix [18]. For the Niederreiter cryptosystem, the message m is encoded as a weight-t error vector e of length n; alternatively, the Niederreiter cryptosystem can be used as a key-encapsulation scheme where a random error vector is used to derive a symmetric encryption key. For encryption, e is multiplied with H and the resulting syndrome is sent to the receiver. The receiver decodes the received syndrome and obtains e. Originally, Niederreiter used Reed-Solomon codes, for which the system has been broken [25]. However, the scheme is believed to be secure when using binary Goppa codes. Niederreiter introduced a trick to compress H by computing the systemized form of the public-key matrix. This trick can be applied to some variants of the McEliece cryptosystem as well.
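The core of Niederreiter encryption described above, multiplying a weight-t error vector by the parity-check matrix, can be sketched in a few lines of Python. This is a toy model with insecure, made-up sizes (the actual parameters are \(n = 6960\), \(mt = 1547\)) and a random matrix standing in for the binary form of H; it only illustrates the syndrome computation, not the Goppa-code structure.

```python
import random

def encrypt(H, e):
    """Toy Niederreiter encryption: the ciphertext is the syndrome H*e over GF(2)."""
    return [sum(hij & ej for hij, ej in zip(row, e)) % 2 for row in H]

random.seed(1)
n, mt, t = 12, 5, 2                      # toy sizes; the paper uses n = 6960, mt = 1547
H = [[random.randint(0, 1) for _ in range(n)] for _ in range(mt)]

e = [0] * n                              # message encoded as a weight-t error vector
for pos in random.sample(range(n), t):
    e[pos] = 1

syndrome = encrypt(H, e)                 # mt-bit ciphertext
```

Note that the ciphertext has only \(mt\) bits, which is why the syndrome form is so compact compared to McEliece ciphertexts of length n.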
We focus on the Niederreiter cryptosystem due to its compact key size and the efficiency of syndrome-decoding algorithms. Key generation is the most expensive operation in the Niederreiter cryptosystem and, due to its large memory demand, is often omitted from Niederreiter implementations on FPGAs. Therefore, our paper presents a new contribution by implementing the key-generation algorithm efficiently on FPGAs.
2.1 Key Generation Algorithm
Algorithm 1 shows the key-generation algorithm for the Niederreiter cryptosystem. The system parameters are: m, the size of the binary field, t, the number of correctable errors, and n, the code length. The code rank k is determined as \(k = n - mt\). We implemented Step 2 of the key-generation algorithm by computing an irreducible Goppa polynomial g(x) of degree t as the minimal polynomial of a random element r from a polynomial ring over \(\text {GF}(2^m)\) using a power sequence \(1, r, \dots , r^{t}\) and Gaussian systemization in \(\text {GF}(2^m)\) (see Sect. 5). Step 3 requires the evaluation of g(x) at points \(\{\alpha _0, \alpha _1, \dots , \alpha _{n-1}\}\). To achieve high efficiency, we decided to follow the approach of [4], which evaluates g(x) at all elements of \(\text {GF}(2^m)\) using a highly efficient additive FFT algorithm (see Sect. 4.2). Therefore, we evaluate g(x) at all \(\alpha \in \text {GF}(2^m)\) and then choose the required \(\alpha _i\) using the Fisher-Yates shuffle by computing a random sequence \((\alpha _0, \alpha _1, \dots , \alpha _{n-1})\) from a permuted list of indices P. For Step 5, we use the efficient Gaussian systemization module for matrices over \(\text {GF}(2)\) from [26].
2.2 Structure of the Paper
The following sections introduce the building blocks for our key-generator module in a bottom-up fashion. First, we introduce the basic modules for arithmetic in \(\text {GF}(2^m)\) and for polynomials over \(\text {GF}(2^m)\) in Sect. 3. Then we introduce the modules for Gaussian systemization, additive FFT, and Fisher-Yates shuffle in Sect. 4. Finally, we describe how these modules work together to obtain an efficient design for key generation in Sect. 5. Validation of the design using Sage, iVerilog, and Stratix V FPGAs is presented in Sect. 6, and a discussion of the performance is in Sect. 7.
2.3 Reference Parameters and Reference Platform
Parameters and resulting configuration for the key generator.
Param.  Description  Size (bits)  Config.  Description  Size (bits) 

m  Size of the binary field  13  g(x)  Goppa polynomial  \(120 \times 13\) 
t  Correctable errors  119  P  Permutation indices  \(8192 \times 13\) 
n  Code length  6960  H  Parity check matrix  \(1547 \times 6960\) 
k  Code rank  5413  K  Public key  \(1547 \times 5413\) 
Throughout the paper (except for Table 9), performance results are reported from Quartus synthesis for the Altera Stratix V FPGA (5SGXEA7N), including Fmax (maximum estimated frequency) in MHz, Logic (logic usage) in Adaptive Logic Modules (ALMs), Mem. (memory usage) in Block RAMs, and Reg. (registers). Cycles are derived from iVerilog simulation. Time is calculated as the quotient of Cycles and Fmax. Time \(\times \) Area is calculated as the product of Cycles and Logic.
3 Field Arithmetic
The lowest-level building blocks in our implementation are \(\text {GF}(2^m)\) finite-field arithmetic and, on the next higher level, \(\text {GF}(2^m)[x]/f\) polynomial arithmetic.
3.1 \(\text {GF}(2^m)\) Finite Field Arithmetic
\(\text {GF}(2^m)\) represents the basic finite field in the Niederreiter cryptosystem. Our code for all the hardware implementations of \(\text {GF}(2^m)\) operations is generated by code-generation scripts, which take m as a parameter and automatically generate the corresponding Verilog HDL code.
GF \(\varvec{(2^m)}\) Addition. In \(\text {GF}(2^m)\), addition corresponds to a simple bitwise xor operation of two mbit vectors. Therefore, each addition has negligible cost and can often be combined with other logic while still finishing within one clock cycle, e.g., a series of additions or addition followed by multiplication or squaring.
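The cheapness of field addition is easy to see when elements of \(\text {GF}(2^{13})\) are modeled as 13-bit integers; a carry-less addition is a single XOR, and every element is its own additive inverse. A minimal Python sketch (the bit patterns are arbitrary examples):

```python
# GF(2^13) elements modeled as 13-bit integers; addition is carry-less, i.e. XOR.
m = 13
a, b = 0b1010010011011, 0b0110101000101
s = a ^ b                    # one XOR per m-bit vector, no carries, no reduction
assert s < (1 << m)          # the result stays inside the field
assert s ^ b == a            # every element is its own additive inverse
```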
Performance of different field multiplication algorithms for \(\text {GF}(2^{13})\).
Algorithm  Logic  Reg.  Fmax (MHz) 

Schoolbook algorithm  90  78  637 
2split Karatsuba algorithm  99  78  625 
3split Karatsuba algorithm  101  78  529 
Bernstein  87  78  621 
GF \(\varvec{(2^m)}\) Squaring. Squaring over \(\text {GF}(2^m)\) can be implemented using less logic than multiplication and therefore an optimized squaring module is valuable for many applications. However, in the case of the keygeneration algorithm, we do not require a dedicated squaring module since an idle multiplication module is available in all cases when we require squaring. Squaring using \(\text {GF}(2^m)\) multiplication takes one clock cycle.
GF \(\varvec{(2^m)}\) Inversion. Inside the \(\text {GF}(2^m)\) Gaussian systemizer, elements of \(\text {GF}(2^m)\) need to be inverted. An element \(a \in \text {GF}(2^m)\) can be inverted by computing \(a^{-1} = a^{2^m-2}\). This can be done with a logarithmic number of squarings and multiplications; for example, inversion in \(\text {GF}(2^{13})\) can be implemented using twelve squarings and four multiplications. However, this approach requires at least one multiplication circuit (repeatedly used for multiplications and squarings) plus some logic overhead, and has a latency of at least several cycles in order to achieve a high frequency. Therefore, we decided to use a precomputed lookup table for the implementation of the inversion module. For inverting an element \(\alpha \in \text {GF}(2^m)\), we interpret the bit representation of \(\alpha \) as an integer value and use this value as the address into the lookup table. For convenience, we added an additional bit to each value in the lookup table that is set high in case the input element \(\alpha \) cannot be inverted, i.e., \(\alpha = 0\). This saves additional logic that otherwise would be required to check the input value. Thus, the lookup table has a width of \(m+1\) and a depth of \(2^m\), and each entry can be read in one clock cycle. The lookup table is read-only and can therefore be stored in either RAM or logic resources.
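The \((m+1)\)-bit-wide inversion table described above can be modeled in Python as follows. The table entries are computed here via Fermat inversion \(a^{2^m-2}\), and the reduction polynomial \(x^{13} + x^4 + x^3 + x + 1\) is an assumption on our part (a standard low-weight irreducible pentanomial for \(\text {GF}(2^{13})\)); the paper does not state which modulus the generated Verilog uses.

```python
M = 13
POLY = 0b10000000011011   # x^13 + x^4 + x^3 + x + 1 (assumed irreducible pentanomial)

def gf_mul(a, b):
    """Carry-less multiplication followed by reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

def gf_inv(a):
    """Fermat inversion: a^(2^m - 2) by square-and-multiply."""
    r, base, e = 1, a, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, base)
        base = gf_mul(base, base)
        e >>= 1
    return r

# Precompute the (m+1)-bit-wide table; the extra top bit flags the input a = 0.
table = [(1 << M) if a == 0 else gf_inv(a) for a in range(1 << M)]
```

In hardware, the table is simply read at address \(\alpha\) in one cycle; the flag bit replaces an explicit zero check on the input.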
3.2 \(\text {GF}(2^m)[x]/f\) Polynomial Arithmetic
Polynomial arithmetic is required for the generation of the secret Goppa polynomial. \(\text {GF}(2^m)[x]/f\) is an extension field of \(\text {GF}(2^m)\). Elements in this extension field are represented by polynomials with coefficients in \(\text {GF}(2^m)\) modulo an irreducible polynomial f. We are using a sparse polynomial for f, e.g., the trinomial \(x^{119} + x^8 + 1\), in order to reduce the cost of polynomial reduction.
Polynomial Addition. The addition of two degreed polynomials with \(d+1\) coefficients is equivalent to pairwise addition of the coefficients in \(\text {GF}(2^m)\). Therefore, polynomial addition can be mapped to an xor operation on two \(m(d+1)\)bit vectors and finishes in one clock cycle.
Polynomial Multiplication. Due to the relatively high cost of \(\text {GF}(2^m)\) multiplication compared to \(\text {GF}(2^m)\) addition, for polynomials over \(\text {GF}(2^m)\) Karatsuba multiplication [14] is more efficient than classical schoolbook multiplication in terms of logic cost when the size of the polynomial is sufficiently large.
Given two polynomials \(A(x) = \sum _{i=0}^{5} a_ix^i\) and \(B(x) = \sum _{i=0}^{5} b_ix^i\), schoolbook polynomial multiplication can be implemented in hardware as follows: Calculate \((a_5b_0, a_4b_0, \dots , a_0b_0)\) and store the result in a register. Then similarly calculate \((a_5b_i, a_4b_i, \dots , a_0b_i)\), shift the result left by \(i \cdot m\) bits, and add the shifted result to the register contents; repeat for all \(i = 1, 2, \dots , 5\). Finally, the register contains the multiplication result (before polynomial reduction). One can see that this process requires \(6 \times 6 = 36\) \(\text {GF}(2^m)\) multiplications.
Karatsuba polynomial multiplication requires fewer finite-field multiplications than schoolbook multiplication. For the above example, Montgomery’s six-split Karatsuba multiplication [17] requires only 17 field-element multiplications over \(\text {GF}(2^m)\), at the cost of additional finite-field additions, which are cheap in binary-field arithmetic. For large polynomial multiplications, usually several levels of Karatsuba are applied recursively, with schoolbook multiplication used at some low level. The goal is to achieve a trade-off between running time and logic overhead.
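The multiplication savings can be checked with a small Python model. The sketch below uses a plain 2-split Karatsuba level (not Montgomery’s 6-split from the paper) on two 6-coefficient polynomials over \(\text {GF}(2^{13})\), counting base multiplications: 27 instead of 36. In characteristic 2, the Karatsuba subtraction step becomes an XOR. The modulus \(x^{13}+x^4+x^3+x+1\) is an assumed pentanomial.

```python
import random

M, POLY = 13, 0b10000000011011        # assumed GF(2^13) reduction pentanomial
MULS = {"n": 0}                       # counter for base field multiplications

def gf_mul(a, b):
    MULS["n"] += 1
    r = 0
    while b:
        if b & 1: r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M): a ^= POLY
    return r

def poly_add(f, g):                   # coefficient-wise XOR (char-2 addition)
    n = max(len(f), len(g))
    f, g = f + [0]*(n-len(f)), g + [0]*(n-len(g))
    return [x ^ y for x, y in zip(f, g)]

def schoolbook(f, g):
    r = [0]*(len(f)+len(g)-1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            r[i+j] ^= gf_mul(fi, gj)
    return r

def karatsuba2(f, g):
    """One 2-split Karatsuba level; the three half-size products use schoolbook."""
    h = len(f)//2
    f0, f1, g0, g1 = f[:h], f[h:], g[:h], g[h:]
    p0 = schoolbook(f0, g0)
    p2 = schoolbook(f1, g1)
    pm = schoolbook(poly_add(f0, f1), poly_add(g0, g1))
    mid = poly_add(poly_add(pm, p0), p2)   # subtraction = addition in char 2
    r = [0]*(len(f)+len(g)-1)
    for i, c in enumerate(p0):  r[i]     ^= c
    for i, c in enumerate(mid): r[i+h]   ^= c
    for i, c in enumerate(p2):  r[i+2*h] ^= c
    return r

random.seed(2)
f = [random.randrange(1 << M) for _ in range(6)]
g = [random.randrange(1 << M) for _ in range(6)]

MULS["n"] = 0; ref = schoolbook(f, g); sb = MULS["n"]   # 36 base multiplications
MULS["n"] = 0; kar = karatsuba2(f, g); ka = MULS["n"]   # 3 * (3 x 3) = 27
assert kar == ref
```

Recursing on the halves (or using the 6-split variant) reduces the count further, which is exactly the trade-off explored in the table below.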
Performance of different multiplication algorithms for degree118 polynomials.
Algorithm  Mult.  Cycles  Logic  Time \(\times \) Area  Fmax (MHz) 

1-level Karatsuba \(17 \times (20 \times 20)\)  20  377  11,860  \(4.47 \cdot 10^{6}\)  342 
2-level Karatsuba \(17 \times (4 \times 4)\)  16  632  12,706  \(8.03 \cdot 10^{6}\)  151 
2-level Karatsuba \(17 \times (4 \times 4)\)  4  1788  11,584  \(2.07 \cdot 10^{7}\)  254 
In the final design, we implemented a one-level six-split Karatsuba multiplication approach, which uses a size-\(\lceil {\frac{d+1}{6}}\rceil \) schoolbook polynomial multiplication module as its building block. It requires only 377 cycles to perform one multiplication of two degree-118 polynomials.
4 Key Generator Modules
The arithmetic modules are used as building blocks for the units inside the key generator, shown later in Fig. 2. The main components are: two Gaussian systemizers for matrix systemization over \(\text {GF}(2^m)\) and \(\text {GF}(2)\), respectively, the Gao-Mateer additive FFT for polynomial evaluation, and the Fisher-Yates shuffle for generating uniformly distributed permutations.
4.1 Gaussian Systemizer
Matrix systemization is needed for generating both the private Goppa polynomial g(x) and the public key K. Therefore, we require one module for Gaussian systemization of matrices over \(\text {GF}(2^{13})\) and one module for matrices over \(\text {GF}(2)\). We use a modified version of the highly efficient Gaussian systemizer from [26], adapted to meet the specific needs of Niederreiter key generation. As in [26], we use an \(N \times N\) square processor array that computes on column blocks of the matrix. The size of this processor array is parameterized and can be chosen to optimize either for performance or for resource usage.
The design from [26] only supports systemization of matrices over \(\text {GF}(2)\). An important modification that we applied to the design is support for arbitrary binary fields: we added a binary-field inverter to the diagonal “pivoting” elements of the processor array and binary-field multipliers to all the processors. This results in a larger resource requirement compared to the \(\text {GF}(2)\) version, but for computations on large matrices, the longest path still remains within the memory module and not within the computational logic.
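The systemization target, reducing a matrix to the form \([\mathbb{I} \mid K]\) with a fail signal when the left block is singular, can be modeled sequentially in Python. This sketch works over \(\text {GF}(2)\) (the public-key case) with each row stored as a bitmask, and is a functional model only; it does not reflect the processor-array parallelism of the hardware.

```python
def systemize_gf2(rows):
    """Reduce a binary matrix to [I | K] form; rows are bitmasks, bit i = column i.
    Returns None when the left square block is singular (systemization failure)."""
    rows = rows[:]
    r = len(rows)
    for col in range(r):                     # pivot over the left r x r block
        pivot = next((i for i in range(col, r) if rows[i] >> col & 1), None)
        if pivot is None:
            return None                      # fail signal: restart key generation
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for i in range(r):
            if i != col and rows[i] >> col & 1:
                rows[i] ^= rows[col]         # clear column col in all other rows
    return rows

# 3 x 5 example: the left 3 x 3 block is invertible, so the result is [I_3 | K].
H = [0b01001, 0b10011, 0b01110]
S = systemize_gf2(H)
assert all(S[i] & 0b111 == 1 << i for i in range(3))
```

For the \(\text {GF}(2^m)\) variant described above, the pivot row would additionally be scaled by the inverse of its pivot element, which is exactly why the inverter and multipliers were added to the processor array.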
4.2 GaoMateer Additive FFT
Key generation requires the evaluation of the Goppa polynomial g(x) at all points of \(\text {GF}(2^m)\). In order to reduce the cost of this multipoint evaluation, we use a characteristic-2 additive FFT algorithm introduced in 2010 by Gao and Mateer [9], which was used for multipoint polynomial evaluation by Chou in 2013 [4]. This algorithm evaluates a polynomial at all elements in the field \(\text {GF}(2^m)\) using a number of operations logarithmic in the length of the polynomial. Most of these operations are additions, which makes the algorithm particularly suitable for hardware implementations. The asymptotic time complexity of the additive FFT is \(\text {O}\big (2^m\! \cdot \log _2{(d)}\big )\).
In general, to transform a polynomial f(x) of \(2^k\) coefficients into the form \(f = f^{(0)}(x^2+x)+xf^{(1)}(x^2+x)\), we need \(2^i\) size-\(2^{k-i}\) radix-conversion operations for \(i = 0, 1, \dots, k\). For the later discussion, we will regard the whole process of transforming f(x) into the form \(f^{(0)}(x^2+x)+xf^{(1)}(x^2+x)\) as one complete radix-conversion operation.
Twisting. As mentioned above, the additive FFT applies Gao and Mateer’s idea recursively. Consider the problem of evaluating an 8-coefficient polynomial f(x) at all elements in \(\text {GF}(2^4)\). The field \(\text {GF}(2^4)\) can be defined as: \(\text {GF}(2^4) = \{0, a, \dots , a^3+a^2+a, 1, a+1, \dots , (a^3+a^2+a)+1\}\) with basis \(\{1, a, a^2, a^3\}\). After applying the radix-conversion process, we get \(f(x) = f^{(0)}(x^2+x)+xf^{(1)}(x^2+x)\). As described earlier, the evaluation on the second half of the elements (“\(\dots + 1\)”) can easily be computed from the evaluation results of the first half by using the \(\alpha \) and \(\alpha +1\) trick (for \(\alpha \in \{0, a, \dots , a^3+a^2+a\}\)). Now, the problem turns into the evaluation of \(f^{(0)}(x)\) and \(f^{(1)}(x)\) at the points \(\{0, a^2+a, \dots , (a^3+a^2+a)^2+(a^3+a^2+a)\}\). In order to apply Gao and Mateer’s idea again, we first need to twist the basis: By computing \(f^{(0')}(x) = f^{(0)}((a^2+a)x)\), evaluating \(f^{(0)}(x)\) at \(\{0, a^2+a, \dots , (a^3+a^2+a)^2+(a^3+a^2+a)\}\) is equivalent to evaluating \(f^{(0')}(x)\) at \(\{0, a^2+a, a^3+a, a^3+a^2, 1, a^2+a+1, a^3+a+1, a^3+a^2+1\}\). Similarly for \(f^{(1)}(x)\), we can compute \(f^{(1')}(x) = f^{(1)}((a^2+a)x)\). After the twisting operation, \(f^{(0')}\) and \(f^{(1')}\) have the element 1 in their new basis. Therefore, this step equivalently twists the basis that we are working with. Now, we can perform radix conversion and apply the \(\alpha \) and \(\alpha +1\) trick on \(f^{(0')}(x)\) and \(f^{(1')}(x)\) recursively again.
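The radix conversion and the \(\alpha\)/\(\alpha+1\) trick can be verified concretely for a tiny case. For a 4-coefficient \(f(x) = c_0 + c_1x + c_2x^2 + c_3x^3\) over a characteristic-2 field, matching coefficients in \(f^{(0)}(x^2+x) + xf^{(1)}(x^2+x)\) gives the closed-form XOR formulas used below; the specific \(\text {GF}(2^4)\) modulus \(x^4+x+1\) and the example coefficients are our assumptions.

```python
M, POLY = 4, 0b10011          # GF(2^4) with assumed modulus x^4 + x + 1

def mul(a, b):
    """Carry-less multiplication with reduction modulo POLY."""
    r = 0
    while b:
        if b & 1: r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M): a ^= POLY
    return r

def ev(coeffs, x):
    """Horner evaluation; coeffs[i] is the coefficient of x^i."""
    r = 0
    for c in reversed(coeffs):
        r = mul(r, x) ^ c
    return r

# Radix conversion of a 4-coefficient f into f(x) = f0(x^2+x) + x*f1(x^2+x).
# Matching coefficients in characteristic 2: b1 = c3, a1 = c2+c3, b0 = c1+c2+c3.
c0, c1, c2, c3 = 0x3, 0x9, 0x5, 0xC   # arbitrary example coefficients
f  = [c0, c1, c2, c3]
f0 = [c0, c2 ^ c3]
f1 = [c1 ^ c2 ^ c3, c3]

for alpha in range(0, 1 << M, 2):     # elements whose "1" basis coefficient is 0
    beta = mul(alpha, alpha) ^ alpha  # alpha and alpha+1 share beta = alpha^2 + alpha
    fa = ev(f0, beta) ^ mul(alpha, ev(f1, beta))
    assert fa == ev(f, alpha)
    assert ev(f, alpha ^ 1) == fa ^ ev(f1, beta)   # the "alpha and alpha+1" trick
```

The second assertion shows why the second half of the evaluations is almost free: \((\alpha+1)^2 + (\alpha+1) = \alpha^2 + \alpha\), so \(f(\alpha+1)\) differs from \(f(\alpha)\) only by the already-computed value \(f^{(1)}(\beta)\).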
Performance of additive FFT using different numbers of multipliers for twist.
Multipliers  

Twist  Reduction  Cycles  Logic  Time \(\times \) Area  Mem.  Reg.  Fmax (MHz) 
4  32  1188  11,781  \(1.39 \cdot 10^{7}\)  63  27,450  399 
8  32  1092  12,095  \(1.32 \cdot 10^{7}\)  63  27,470  386 
16  32  1044  12,653  \(1.32 \cdot 10^{7}\)  63  27,366  373 
32  32  1020  14,049  \(1.43\cdot 10^{7}\)  63  26,864  322 
Performance. Table 4 shows the performance and resource usage of our additive FFT implementation. For evaluating a degree-119 Goppa polynomial g(x) at all the data points in \(\text {GF}(2^{13})\), 32 finite-field multipliers are used in the reduction step of our additive FFT design in order to achieve a small cycle count while maintaining a low logic overhead. The twisting module is generated by a Sage script such that the number of multipliers can be chosen as needed. Radix conversion and twisting have only a small impact on the total cycle count; therefore, using only 4 binary-field multipliers for twisting results in good performance, with the best Fmax. The memory required for the additive FFT is only a small fraction of the overall memory consumption of the key generator.
4.3 Random Permutation: FisherYates Shuffle
Performance of the FisherYates shuffle module for \(2^{13}\) elements.
m  Size (\(=2^{m}\))  Cycles (avg.)  Logic  Time \(\times \) Area  Mem.  Reg.  Fmax (MHz) 

13  8192  23,635  149  \(3.52 \cdot 10^{6}\)  7  111  335 
We implemented a parameterized permutation module using a dual-port memory block of depth \(2^m\) and width m. First, the memory block is initialized with the contents \([0, 1, \dots , 2^m-1]\). Then, the address of port A decrements from \(2^m-1\) to 0. For each address A, a PRNG keeps generating new random numbers as long as the output is larger than address A. Therefore, our implementation produces an unbiased permutation (under the condition that the PRNG has no bias), but it is not constant-time. Once the PRNG output is smaller than address A, this output is used as the address for port B. Then the contents of the cells addressed by A and B are swapped. We improve the probability of finding a random index smaller than address A by using only \(\lceil {\log _2(A)}\rceil \) bits of the PRNG output. Therefore, the probability of finding a suitable B is always at least 50%.
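The rejection-sampling loop described above can be sketched in Python. This is a software model of the hardware module: Python's `random` stands in for the hardware PRNG, and the memory block becomes a list.

```python
import random

def fisher_yates_uniform(m, rng=random.Random(42)):
    """Unbiased Fisher-Yates shuffle of [0 .. 2^m - 1] using rejection sampling."""
    n = 1 << m
    p = list(range(n))                      # memory initialized with 0 .. 2^m - 1
    for a in range(n - 1, 0, -1):           # port-A address decrements
        bits = a.bit_length()               # draw only ceil(log2(a+1)) PRNG bits
        while True:
            b = rng.getrandbits(bits)
            if b <= a:                      # accept; probability is at least 1/2
                break
        p[a], p[b] = p[b], p[a]             # swap the cells addressed by A and B
    return p

perm = fisher_yates_uniform(13)             # a uniform permutation of 2^13 indices
```

Truncating the draw to the minimal bit width is what bounds the expected number of rejections: a naive m-bit draw would almost always be rejected near the end of the loop.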
5 Key Generator for the Niederreiter Cryptosystem
Using the two Gaussian systemizers, the Gao-Mateer additive FFT, and the Fisher-Yates shuffle, we designed the key generator as shown in Fig. 2. Note that the design uses two simple PRNGs to enable deterministic testing. For real deployment, these PRNGs must be replaced with a cryptographically secure random-number generator, e.g., [6]. We require at most m random bits per clock cycle per PRNG.
5.1 Private Key Generation
The private key consists of an irreducible Goppa polynomial g(x) of degree t and a permuted list of indices P.
Goppa Polynomial \(\varvec{g(x).}\) The common way of generating a degree-d irreducible polynomial is to pick a polynomial g of degree d uniformly at random and then check whether it is irreducible. If it is not, a new polynomial is randomly generated and checked, until an irreducible one is found. The density of irreducible polynomials of degree d is about 1/d [16]. For \(d = t = 119\), the probability that a randomly generated degree-119 polynomial is irreducible is therefore quite low: on average, 119 trials are needed to generate a degree-119 irreducible polynomial in this way. Moreover, irreducibility tests for polynomials involve highly complex operations in extension fields, e.g., raising a polynomial to a power and finding the greatest common divisor of two polynomials. In the hardware key-generator design in [24], the Goppa polynomial g(x) was generated in this way, which is inefficient in terms of both time and area.
We decided to explicitly generate an irreducible polynomial g(x) by using a deterministic, constructive approach. We compute the minimal (hence irreducible) polynomial of a random element in \(\text {GF}(2^m)[x]/h\) with \(\text {deg}(h) = \text {deg}(g) = t\): Given a random element r from the extension field \(\text {GF}(2^m)[x]/h\), the minimal polynomial g(x) of r is defined as the nonzero monic polynomial of least degree with coefficients in \(\text {GF}(2^m)\) having r as a root, i.e., \(g(r) = 0\). The minimal polynomial of a degree-\((t-1)\) element from the field \(\text {GF}(2^m)[x]/h\) is always of degree t and irreducible if it exists.
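The constructive approach, building the power sequence \(1, r, \dots, r^t\) and solving a linear system so that a monic degree-t combination vanishes at r, can be illustrated at toy scale. To keep the sketch short, we use coefficients in \(\text {GF}(2)\) and \(t = 4\) with \(h = x^4 + x + 1\) (our choices; the paper works over \(\text {GF}(2^{13})\) with \(t = 119\)), so the linear algebra is plain XOR, mirroring the Gaussian systemization step.

```python
T, H = 4, 0b10011   # toy: t = 4, h = x^4 + x + 1; coefficients here are in GF(2)

def mulmod(a, b):
    """Multiply two elements of GF(2)[x]/h, represented as bitmasks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << T):
            a ^= H
    return r

r = 0b1010                                  # a fixed "random" field element
powers = [1]
for _ in range(T):
    powers.append(mulmod(powers[-1], r))    # the power sequence 1, r, ..., r^t

# Solve sum_{i<t} g_i * r^i = r^t over GF(2); one equation per coefficient bit.
rows = []
for bit in range(T):
    lhs = sum(((powers[i] >> bit) & 1) << i for i in range(T))
    rhs = (powers[T] >> bit) & 1
    rows.append(lhs | rhs << T)             # augmented row; bit T holds the RHS

for col in range(T):                        # Gaussian elimination over GF(2)
    piv = next(i for i in range(col, T) if rows[i] >> col & 1)
    rows[col], rows[piv] = rows[piv], rows[col]
    for i in range(T):
        if i != col and rows[i] >> col & 1:
            rows[i] ^= rows[col]

g = [rows[i] >> T & 1 for i in range(T)] + [1]   # monic minimal polynomial of r

check = 0                                   # verify g(r) = 0 in GF(2)[x]/h
for i, gi in enumerate(g):
    if gi:
        check ^= powers[i]
assert check == 0
```

For \(r = x^3 + x\) this recovers \(g(x) = x^4 + x^3 + x^2 + x + 1\), which is irreducible; the hardware performs the same elimination, only over \(\text {GF}(2^{13})\) and with the matrix stored block-wise.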
In our hardware implementation, first a PRNG is used, which generates t random m-bit strings for the coefficients of \(r(x) = \sum _{i=0}^{t-1} r_ix^i\). Then the coefficient matrix R is calculated by computing the powers \(1, r, \dots , r^t\), which are stored in the memory of the \(\text {GF}(2^m)\) Gaussian systemizer. We repeatedly use the polynomial multiplier described in Sect. 3.2 to compute the powers of r. After each multiplication, the resulting polynomial of t coefficients is written to the memory of the \(\text {GF}(2^m)\) Gaussian systemizer. (Our Gaussian-systemizer module operates on column blocks of width \(N_R\). Therefore, the memory contents are actually computed block-wise.) This multiply-then-write-to-memory cycle is repeated until R is fully calculated. After this step is done, the memory of the \(\text {GF}(2^m)\) Gaussian systemizer has been initialized with the coefficient matrix R.
After the initialization, the Gaussian elimination process begins and the coefficient matrix R is transformed into its reduced echelon form \([\mathbb {I}_{t} \mid g]\). Now, the right part of the resulting matrix contains all the previously unknown coefficients of the minimal polynomial g.
The part of memory which stores the coefficients of the Goppa polynomial g(x) is shown as the “g-portion” in Fig. 2. Later, the memory contents stored in the g-portion are read out and sent to the g(x) evaluation step, which uses the additive FFT module to evaluate the Goppa polynomial g(x) at every point in the field \(\text {GF}(2^{m})\).
Performance of the \(\text {GF}(2^m)\) Gaussian systemizer for \(m=13\) and \(t=119\), i.e., for a \(119 \times 120\) matrix with elements from \(\text {GF}(2^{13})\).
\(N_{R}\)  Cycles  Logic  Time \(\times \) Area  Mem.  Reg.  Fmax (MHz) 

1  922,123  2539  \(2.34 \cdot 10^{9}\)  14  318  308 
2  238,020  5164  \(1.23 \cdot 10^{9}\)  14  548  281 
4  63,300  10,976  \(6.95 \cdot 10^{8}\)  13  1370  285 
Random Permutation \(\varvec{P}.\) In our design, a randomly permuted list of indices of size \(2^{13}\) is generated by the Fisher-Yates shuffle module, and the permutation list is stored in the memory P in Fig. 2 as part of the private key. Later, the memory P is read by the H generator, which generates a permuted binary form of the parity-check matrix. In our design, since \(n \le 2^m\), only the contents of the first n memory cells need to be fetched.
5.2 Public Key Generation
As mentioned in Sect. 2, the public key K is the systemized form of the binary version of the parity-check matrix H. In [24], the generation of the binary version of H is divided into two steps: first compute the non-permuted parity-check matrix and store it in a memory block A, then apply the permutation and write the binary form of the permuted parity-check matrix to a new memory block B of the same size as memory block A. This approach requires simple logic but needs two large memory blocks A and B.
In order to achieve better memory efficiency, we omit the first step, and instead generate a permuted binary form \(H'\) of the parity check matrix in one step. We start the generation of the public key K by evaluating the Goppa polynomial g(x) at all \(\alpha \in \text {GF}(2^m)\) using the GaoMateer additive FFT module. After the evaluation finishes, the results are stored in the data memory of the additive FFT module.
If a fail signal from the \(\text {GF}(2)\) Gaussian systemizer is detected, i.e., the matrix cannot be systemized, key generation needs to be restarted. Otherwise, the left part of the matrix has been transformed into an \(mt \times mt\) identity matrix, and the right side is the \(mt \times k\) public-key matrix K, labeled as the “K-portion” in Fig. 2.
Performance of the \(\text {GF}(2)\) Gaussian systemizer for a \(1547 \times 6960\) matrix.
\(N_{H}\)  Cycles  Logic  Time \(\times \) Area  Mem.  Reg.  Fmax (MHz) 

10  150,070,801  826  \(1.24 \cdot 10^{11}\)  663  678  257 
20  38,311,767  1325  \(5.08 \cdot 10^{10}\)  666  1402  276 
40  9,853,350  3367  \(3.32 \cdot 10^{10}\)  672  4642  297 
80  2,647,400  10,983  \(2.91 \cdot 10^{10}\)  680  14,975  296 
160  737,860  40,530  \(2.99 \cdot 10^{10}\)  720  55,675  290 
320  208,345  156,493  \(3.26 \cdot 10^{10}\)  848  213,865  253 
Performance. Table 7 shows the effect of different choices for the parameter \(N_H\) on a matrix of size \(1547 \times 6960\) over \(\text {GF}(2)\). Similar to the \(\text {GF}(2^m)\) Gaussian systemizer, \(N_H\) has an impact on the number of required memory blocks. When doubling \(N_H\), the number of required cycles should roughly be quartered (which is the case for small \(N_H\)) and the amount of logic should roughly be quadrupled (which is the case for large \(N_H\)). The best time-area product is achieved for \(N_H=80\), because for smaller values the non-computational logic overhead is significant and for larger values the computational logic is used less efficiently. Fmax is mainly limited by the paths within the memory.
6 Design Testing
We tested our hardware implementation using a Sage reference implementation, iVerilog, and an Altera Stratix V FPGA (5SGXEA7N) on a Terasic DE5Net FPGA development board.
Parameters and PRNG Inputs. First, we chose a set of parameters: the system parameters of the cryptosystem (m, t, and n, with \(k = n - mt\)). In addition, we picked two design parameters, \(N_R\) and \(N_H\), which configure the size of the processor arrays in the \(\text {GF}(2^m)\) and \(\text {GF}(2)\) Gaussian systemizers. In order to guarantee a deterministic output, we randomly picked seeds for the PRNGs and used the same seeds for corresponding tests on different platforms. Given the parameters and random seeds as input, we used Sage code to generate appropriate input data for each design module.
Sage Reference Results. For each module, we provide a reference implementation in Sage using builtin Sage functions for field arithmetic, etc. Given the parameters, seeds, and input data, we used the Sage reference implementation to generate reference results for each module.
iVerilog Simulation Results. We simulated the Verilog HDL code of each module using a “testbench” top module and the iVerilog simulator. At the end of the simulation, we stored the simulation result in a file. Finally, we compared the simulation result with the Sage reference result. If these reference and simulation results matched repeatedly for different inputs, we assumed the Verilog HDL code to be correct.
FPGA Results. After we tested the hardware design in simulation, we synthesized it for an Altera Stratix V FPGA using the Altera Quartus 16.1 tool chain. We used a PCIe interface for communication with the FPGA. After a test finished, we wrote the FPGA output to a file. Then we compared the output from the FPGA test run with the output of the iVerilog simulation and the Sage reference results. If the outputs matched, we assumed the hardware design to be correct.
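The final comparison step amounts to checking that all three result files agree. A minimal sketch of such a check, with hypothetical file names (the paper does not specify its artifact layout):

```python
import filecmp

def outputs_match(sage_out, sim_out, fpga_out):
    """Return True iff the Sage reference, iVerilog simulation, and
    FPGA result files are byte-identical."""
    return (filecmp.cmp(sage_out, sim_out, shallow=False) and
            filecmp.cmp(sim_out, fpga_out, shallow=False))
```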
7 Evaluation
Table 8. Performance of the key generator for parameters \(m = 13\), \(t = 119\), and \(n = 6960\). All numbers in the table come from the compilation reports of the Altera and Xilinx tool chains respectively. For Xilinx, logic utilization is counted in LUTs.

| Case | \(N_H\) | \(N_R\) | Cycles | Logic | Time \(\times\) Area | Mem. | Fmax | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Altera Stratix V | | | | | | | | |
| logic | 40 | 1 | 11,121,220 | 29,711 | \(3.30 \cdot 10^{11}\) | 756 | 240 MHz | 46.43 ms |
| bal. | 80 | 2 | 3,062,942 | 48,354 | \(1.48 \cdot 10^{11}\) | 764 | 248 MHz | 12.37 ms |
| time | 160 | 4 | 896,052 | 101,508 | \(9.10 \cdot 10^{10}\) | 803 | 244 MHz | 3.68 ms |
| Xilinx Virtex UltraScale+ | | | | | | | | |
| logic | 40 | 1 | 11,121,220 | 42,632 | \(4.74 \cdot 10^{11}\) | 348.5 | 200 MHz | 55.64 ms |
| bal. | 80 | 2 | 3,062,942 | 60,989 | \(1.87 \cdot 10^{11}\) | 356 | 221 MHz | 13.85 ms |
| time | 160 | 4 | 896,052 | 112,845 | \(1.01 \cdot 10^{11}\) | 375 | 225 MHz | 3.98 ms |
Table 9. Comparison with related work. Cycles and time are average values, taking failure cases into account.

| Design | m | t | n | Cycles (avg.) | Freq. | Time (avg.) | Arch. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Shoufan et al. [24] | 11 | 50 | 2048 | \(1.47 \cdot 10^{7}\) | 163 MHz\(^\mathrm{a}\) | 90 ms | Virtex V |
| this work | 11 | 50 | 2048 | \(2.72 \cdot 10^{6}\) | 168 MHz\(^\mathrm{b}\) | 16 ms | Virtex V |
| Chou [7] | 13 | 128 | 8192 | \(1.24 \cdot 10^{9}\) | 1–4 GHz\(^\mathrm{c}\) | 1236–309 ms | Haswell |
| this work | 13 | 128 | 8192 | \(4.30 \cdot 10^{6}\) | 215 MHz\(^\mathrm{a}\) | 20 ms | Stratix V |
Due to the large size of the permuted parity check matrix H, generating the public key K by systemizing the binary version of H is usually the most expensive step of the key-generation algorithm, both in logic and in cycles. In our key generator, independently of the security parameters, the design can be tuned by adjusting \(N_R\) and \(N_H\), which configure the size of the processor arrays of the \(\text {GF}(2^m)\) and \(\text {GF}(2)\) Gaussian systemizers respectively. Tables 6 and 7 show that by adjusting \(N_R\) and \(N_H\) in the two Gaussian systemizers, we can achieve a trade-off between area and performance for the key generator.
Table 8 shows performance data for three representative parameter choices: the logic case aims to minimize logic consumption at the cost of performance, the time case focuses on maximizing performance at the cost of resources, and the balanced case (bal.) attempts to balance logic usage and execution time.
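The reported times follow directly from the cycle counts and Fmax values. The small residual differences come from the rounding of Fmax in the table; a quick check for the Stratix V rows of Table 8:

```python
# (cycles, Fmax in MHz, reported time in ms) for the Stratix V rows of Table 8.
cases = {
    "logic": (11_121_220, 240, 46.43),
    "bal.":  (3_062_942, 248, 12.37),
    "time":  (896_052, 244, 3.68),
}

for name, (cycles, fmax_mhz, reported_ms) in cases.items():
    computed_ms = cycles / (fmax_mhz * 1e6) * 1e3
    # Agrees with the reported time up to the rounding of Fmax.
    print(f"{name}: {computed_ms:.2f} ms (reported {reported_ms} ms)")
```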
Comparing our results with other Niederreiter key-generator implementations on FPGAs is not easy. Table 9 attempts a comparison of our results with the performance data given in [24]. The design in [24] occupies about 84% of the target FPGA for their entire Niederreiter-cryptosystem implementation, including key generation, encryption, decryption, and IO. Our design requires only about 52% of the logic (for \(N_H = 30\) and \(N_R = 10\)), but covers only the key generation. The design in [24] practically achieves a frequency of 163 MHz, while we can only report an estimated synthesis Fmax of 168 MHz for our design. Computing a private-public key pair using the design in [24] requires about 90 ms on average (their approach for generating the Goppa polynomial is not constant-time, and the key-generation procedure needs to be repeated several times until the Gaussian systemization of the public key succeeds). Our design requires about 16 ms on average at 168 MHz.
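From the Table 9 rows for \(m=11\), \(t=50\), \(n=2048\), the improvement over [24] can be quantified both in cycles and in wall-clock time:

```python
# Table 9, rows for m=11, t=50, n=2048: Shoufan et al. [24] vs. this work.
shoufan_cycles, shoufan_ms = 1.47e7, 90
ours_cycles, ours_ms = 2.72e6, 16

print(f"cycle reduction:     {shoufan_cycles / ours_cycles:.1f}x")  # -> 5.4x
print(f"wall-clock speedup:  {shoufan_ms / ours_ms:.1f}x")          # -> 5.6x
```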
We also compare our design to a highly efficient CPU implementation from [7] in Table 9. The results show that our optimized hardware implementation competes very well with the CPU implementation. In this case, we ran our implementation on an Altera Stratix V FPGA; the actual frequency we achieved matches the estimated frequencies for Stratix V in Table 8 well.
8 Conclusion
This work presents a new FPGA-based implementation of the key-generation algorithm for the Niederreiter cryptosystem using binary Goppa codes. It is the first hardware implementation of a key generator that supports currently recommended security parameters (and many others, thanks to its tunable parameters). Our design is based on novel hardware implementations of a Gaussian systemizer, the Gao-Mateer additive FFT, and the Fisher-Yates shuffle.
Acknowledgments
We want to thank Tung Chou for his invaluable help, in particular for discussions about the additive FFT implementation.
References
1. Augot, D., Batina, L., Bernstein, D.J., Bos, J., Buchmann, J., Castryck, W., Dunkelman, O., Güneysu, T., Gueron, S., Hülsing, A., Lange, T., Mohamed, M.S.E., Rechberger, C., Schwabe, P., Sendrier, N., Vercauteren, F., Yang, B.-Y.: Initial recommendations of long-term secure post-quantum systems. Technical report, PQCRYPTO ICT-645622 (2015). https://pqcrypto.eu.org/docs/initialrecommendations.pdf. Accessed 22 June 2017
2. Bernstein, D.J.: High-speed cryptography in characteristic 2. http://binary.cr.yp.to/m.html. Accessed 17 Mar 2017
3. Bernstein, D.J., Buchmann, J., Dahmen, E. (eds.): Post-Quantum Cryptography. Springer, Heidelberg (2009)
4. Bernstein, D.J., Chou, T., Schwabe, P.: McBits: fast constant-time code-based cryptography. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 250–272. Springer, Heidelberg (2013)
5. Bernstein, D.J., Lange, T., Peters, C.: Attacking and defending the McEliece cryptosystem. In: Buchmann, J., Ding, J. (eds.) PQCrypto 2008. LNCS, vol. 5299, pp. 31–46. Springer, Heidelberg (2008)
6. Cherkaoui, A., Fischer, V., Fesquet, L., Aubert, A.: A very high speed true random number generator with entropy assessment. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 179–196. Springer, Heidelberg (2013)
7. Chou, T.: McBits revisited. In: Fischer, W., Homma, N. (eds.) Cryptographic Hardware and Embedded Systems. LNCS. Springer (2017)
8. Fisher, R.A., Yates, F.: Statistical Tables for Biological, Agricultural and Medical Research. Oliver and Boyd, London (1948)
9. Gao, S., Mateer, T.: Additive fast Fourier transforms over finite fields. IEEE Trans. Inf. Theory 56(12), 6265–6272 (2010)
10. Grover, L.K.: A fast quantum mechanical algorithm for database search. In: Symposium on the Theory of Computing - STOC 1996, pp. 212–219. ACM (1996)
11. Guo, Q., Johansson, T., Stankovski, P.: A key recovery attack on MDPC with CCA security using decoding errors. In: Cheon, J.H., Takagi, T. (eds.) ASIACRYPT 2016. LNCS, vol. 10031, pp. 789–815. Springer, Heidelberg (2016)
12. Heyse, S., Maurich, I., Güneysu, T.: Smaller keys for code-based cryptography: QC-MDPC McEliece implementations on embedded devices. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 273–292. Springer, Heidelberg (2013)
13. Hu, J., Cheung, R.C.C.: An application specific instruction set processor (ASIP) for the Niederreiter cryptosystem. Cryptology ePrint Archive, Report 2015/1172 (2015)
14. Karatsuba, A., Ofman, Y.: Multiplication of multidigit numbers on automata. Sov. Phys. Dokl. 7, 595–596 (1963)
15. Massolino, P.M.C., Barreto, P.S.L.M., Ruggiero, W.V.: Optimized and scalable coprocessor for McEliece with binary Goppa codes. ACM Trans. Embed. Comput. Syst. 14(3), 45 (2015)
16. McEliece, R.J.: A public-key cryptosystem based on algebraic coding theory. DSN Prog. Rep. 42–44, 114–116 (1978)
17. Montgomery, P.L.: Five, six, and seven-term Karatsuba-like formulae. IEEE Trans. Comput. 54(3), 362–369 (2005)
18. Niederreiter, H.: Knapsack-type cryptosystems and algebraic coding theory. Probl. Control Inf. Theory 15, 19–34 (1986)
19. PKCS #11 base functionality v2.30, p. 172. ftp://ftp.rsasecurity.com/pub/pkcs/pkcs11/v230/pkcs11v230bd6.pdf. Accessed 20 June 2017
20. Post-quantum cryptography for long-term security PQCRYPTO ICT-645622. https://pqcrypto.eu.org/. Accessed 17 Mar 2017
21. Rostovtsev, A., Stolbunov, A.: Public-key cryptosystem based on isogenies. Cryptology ePrint Archive, Report 2006/145 (2006)
22. Shor, P.W.: Algorithms for quantum computation: discrete logarithms and factoring. In: Foundations of Computer Science - FOCS 1994, pp. 124–134. IEEE (1994)
23. Shor, P.W.: Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Rev. 41(2), 303–332 (1999)
24. Shoufan, A., Wink, T., Molter, G., Huss, S., Strenzke, F.: A novel processor architecture for McEliece cryptosystem and FPGA platforms. IEEE Trans. Comput. 59(11), 1533–1546 (2010)
25. Sidelnikov, V.M., Shestakov, S.O.: On insecurity of cryptosystems based on generalized Reed-Solomon codes. Discrete Math. Appl. 2(4), 439–444 (1992)
26. Wang, W., Szefer, J., Niederhagen, R.: Solving large systems of linear equations over GF(2) on FPGAs. In: Reconfigurable Computing and FPGAs - ReConFig 2016, pp. 1–7. IEEE (2016)