1 Introduction

Quantum technology has developed rapidly in recent years. Quantum computer is in a position to attack RSA [1], ECC [2], and other signature algorithms adopted by many chips due to the algorithm by Peter Shor [3]. Therefore, chip security is facing severe threats.

Fortunately, there are a few post-quantum candidates for signature chips, in which multivariate signature is included [4]. Multivariate signature belongs to Multivariate Quadratic Equations Public Key Cryptography (MPKC), which is secure to quantum computer attacks and general computer attacks [5, 6]. MPKC is first proposed by Matsumoto and Imai in the 1980s. During the past 30 years, various schemes of MPKC have been proposed [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32], which includes Rainbow [28], Unbalanced Oil-Vinegar (UOV) [29], and Tame Transformation Signature (TTS) [30, 31]. Software and hardware implementations of multivariate signature schemes have been one of the topics of many researchers [33,34,35,36,37,38,39,40]. enTTS belongs to the triangular family, which can be viewed as extensions of Tame Transformation Method (TTM) Among the existing enTTS schemes, enTTS(20,28) is believed to be one of the fastest signature schemes, which works with 20 B hashes and 28 B signatures. Compared with the implementations of other public key cryptosystems, e.g., RSA and ECC, we need to speed up multivariate signature generations.

Some previous works of efficient implementations of multivariate signature schemes are as follows:

The work in [33] proposed a fast implementation of SFlash;

The work in [34] presented an efficient public key generation for multivariate cryptosystems;

In [35], a minimized PKC of multivariate schemes on low-resource embedded systems was proposed;

In [36], an efficient implementation of multivariate quadratic systems was presented;

A fast implementation of Rainbow signature generation was proposed in [37];

A high-speed hardware architecture based on Rainbow signature on Field-Programmable Gate Arrays (FPGAs) was proposed in [38];

For low area design, a multivariate signature FPGA processor was proposed in [39];

The work of [40] was a time-area optimized design, which showed that multivariate cryptosystems are more efficient than ECC.

Among such multivariate signature schemes, enTTS is a Tame-like multivariate public key cryptosystem [32]. Hardware implementations of TTS (including enTTS) signature are mainly proposed in [31, 35, 36, 39, 40]. Most of these implementations are focusing on area optimizations. The main computations during generations of enTTS signature are addition, multiplication, inversion, and solving Systems of Linear Equations (LSEs) in a finite field.

Thus, the main contributions of this paper are as follows. We improve the finite field multiplications via using composite field expression and design a finite field inversion based on binary trees described in [41]. Besides, we improve solving LSEs in a finite field based on the algorithm of Gauss-Jordan elimination and use the XOR gates to compute additions.

We implement the high-speed hardware architecture based on the above improvements on an Altera Stratix FPGA. The comparison shows that the hardware architecture is much faster than other implementations of public key cryptosystems.

We organize the rest of this paper as follows: the algorithm of multivariate scheme is introduced in Section 2; the high-speed hardware architecture for multivariate signature generations is given in Section 3; implementation results of the high-speed hardware architecture on FPGAs and comparisons with related cryptosystems are given in Section 4; Conclusions are summarized in Section 5.

2 Method

This study originates from a need to speed up signature generations of multivariate scheme, since the efficiency of implementations should be improved as a quantum-resistance cryptosystems. Specifically, we propose a high-speed hardware architecture for multivariate scheme through improving finite field multiplications based on composite field expression, finite field inversions based on binary trees and solving LSEs based on the algorithm of Gauss-Jordan elimination.

We implement the high-speed hardware architecture based on the above improvements on an Altera Stratix FPGA and the comparison shows that the hardware architecture is much faster than other implementations.

The multivariate scheme, enTTS is employed to the architecture for hardware implementations of signature generations in a finite field. enTTS belongs to the triangular family, which can be viewed as extensions of Tame Transformation Method (TTM). enTTS is designed with a higher security level than TTS. We illustrate enTTS parameters in Table 1.

Table 1 Multivariate scheme parameters

We use GF((24)2) for implementation of enTTS, which is a composite field of GF(256). We suppose that y denotes the message (20 B) of multivariate scheme and y0, y1, …, y19 denote each byte from the message, where y0, y1, …, y19 are elements in GF((24)2). We suppose that x denotes the signature (28 B) and x0, x1, …, x27 denote each byte of the signature, where x0, x1, …, x27 are elements in GF((24)2).

The construction of this signature scheme uses affine transformation L1, central map transformation F, and affine transformation L2.

In order to sign a message, i.e., y(y0, y1, …, y19), it is required to compute several steps for the following equation:

$$ {F}^{{}^{\circ}}{L}_2\left({x}_0,{x}_1,\dots, {x}_{27}\right)={L}_1^{-1}\left({y}_0,{y}_1,\dots, {y}_{19}\right). $$

First, the following equation is required to solve:

$$ \overline{y}\left({\overline{y}}_0,{\overline{y}}_1,\dots, {\overline{y}}_{19}\right)={L}_1^{-1}\left({y}_0,{y}_1,\dots, {y}_{19}\right). $$

\( {L}_1^{-1} \) is an invertible affine transformation with the following form:

$$ \overline{y}= Ay+B. $$

A is a matrix with the size of 20 × 20, part of private keys of enTTS;

B is a vector with the size of 20, part of private keys of enTTS.

Then, \( \overline{y}\left({\overline{y}}_0,{\overline{y}}_1,\dots, {\overline{y}}_{19}\right) \) is the result of affine transformation L1, where \( {\overline{y}}_0,{\overline{y}}_1,\dots, {\overline{y}}_{19} \) are elements in GF((24)2).

Second, the following equation is required to solve:

$$ \overline{x}\left({\overline{x}}_0,{\overline{x}}_1,\dots, {\overline{x}}_{27}\right)={F}^{-1}\left({\overline{y}}_0,{\overline{y}}_1,\dots, {\overline{y}}_{19}\right). $$

The construction of central map transformation depends on a map with the following representation.

$$ F\left({\overline{x}}_0,{\overline{x}}_1,\dots, {\overline{x}}_{27}\right)=\left({f}_0,{f}_1,\dots, {f}_{19}\right). $$

We suppose that \( x\left({\overline{x}}_0,{\overline{x}}_1,\dots, {\overline{x}}_{27}\right) \) denote the result of central map transformation F, where \( {\overline{x}}_0,{\overline{x}}_1,\dots, {\overline{x}}_{27} \) are elements in GF((24)2). MQ polynomials f(f0, f1, …, f19) are defined by the following equations:

$$ {f}_{i-8}={\overline{x}}_i+{\sum}_{j=1}^7{p}_{ij}{\overline{x}}_j{\overline{x}}_{8+\left(\left(i+j\right)\kern0.62em \operatorname{mod}\kern0.62em 9\right)},i=8,9,\dots, 16, $$
$$ {f}_9={\overline{x}}_{17}+{p}_{17,1}{\overline{x}}_1{\overline{x}}_6+{p}_{17,2}{\overline{x}}_2{\overline{x}}_5+{p}_{17,3}{\overline{x}}_3{\overline{x}}_4\kern0.5em +{p}_{17,4}{\overline{x}}_9{\overline{x}}_{16}+{p}_{17,5}{\overline{x}}_{10}{\overline{x}}_{15}+{p}_{17,6}{\overline{x}}_{11}{\overline{x}}_{14}\kern0.5em +{p}_{17,7}{\overline{x}}_{12}{\overline{x}}_{13}, $$
$$ {f}_{10}={\overline{x}}_{18}+{p}_{18,1}{\overline{x}}_2{\overline{x}}_7+{p}_{18,2}{\overline{x}}_3{\overline{x}}_6+{p}_{18,3}{\overline{x}}_4{\overline{x}}_5+{p}_{18,4}{\overline{x}}_{10}{\overline{x}}_{17}+{p}_{18,5}{\overline{x}}_{11}{\overline{x}}_{16}+{p}_{18,6}{\overline{x}}_{12}{\overline{x}}_{15}+{p}_{18,7}{\overline{x}}_{13}{\overline{x}}_{14}, $$
$$ {f}_{i-8}={\overline{x}}_i+{p}_{i,0}{\overline{x}}_{i-11}{\overline{x}}_{i-9}+{\sum}_{j=19}^i{p}_{i,j-18}{\overline{x}}_{2\left(i-j\right)}{\overline{x}}_j+{\sum}_{j=i+1}^{27}{p}_{i,j-18}{\overline{x}}_{i-j+19}{\overline{x}}_j,i=19,20,\dots, 27, $$

p ij are coefficients, part of private key.

The MQ polynomials

$$ \overline{y}\left({\overline{y}}_0,{\overline{y}}_1,\dots, {\overline{y}}_{19}\right)=f\left({f}_0,{f}_2,\dots, {f}_{19}\right) $$

can be divided into three groups:

$$ {\displaystyle \begin{array}{l}{f}_i\mid i=0,1,\dots, 8\\ {}{f}_i\mid i=9,10\\ {}{f}_i\mid i=11,12,\dots, 19.\end{array}} $$

Similarly, \( \overline{x}\left({\overline{x}}_0,{\overline{x}}_1,\dots, {\overline{x}}_{27}\right) \) are divided into four groups:

$$ {\displaystyle \begin{array}{l}{\overline{x}}_i\mid i=0,1,\dots, 7\\ {}{\overline{x}}_i\mid i=8,9,\dots, 16\\ {}{\overline{x}}_i\mid i=17,18\\ {}{\overline{x}}_i\mid i=19,20,\dots, 27.\end{array}} $$

The first group variables of \( \overline{x} \), i.e., \( {\overline{x}}_0,{\overline{x}}_1,\dots, {\overline{x}}_7 \) are randomly chosen and then the first group polynomials of f i , i.e., f0, f1, …, f8 are evaluated.

Then, the second group variables of \( \overline{x} \) are \( {\overline{x}}_8,{\overline{x}}_9,\dots, {\overline{x}}_{16} \), and we solve the LSEs on such variables.

Next, we evaluate the second group polynomials of f i , i.e., f9, f10 and solve the third group variables of \( \overline{x} \), i.e., \( {\overline{x}}_{17},{\overline{x}}_{18} \).

Then, the third group polynomials of f i , i.e., f11, f12, …, f19 are evaluated and we solve the LSEs on such variables of the fourth group variables of \( \overline{x} \), i.e., \( {\overline{x}}_{19},{\overline{x}}_{20},\dots, {\overline{x}}_{27} \).

After that, the result of central map transformation F, i.e., \( x\left({\overline{x}}_0,{\overline{x}}_1,\dots, {\overline{x}}_{27}\right) \) is computed.

Last, we solve the following equations based on the values of \( {\overline{x}}_{19},{\overline{x}}_{20},\dots, {\overline{x}}_{27} \).

$$ x\left({x}_0,{x}_1,\dots, {x}_{27}\right)={L_2}^{-1}\left({\overline{x}}_0,{\overline{x}}_1,\dots, {\overline{x}}_{27}\right). $$

\( {L}_2^{-1} \) is an invertible affine transformation

$$ x=C\overline{x}+D. $$

C is a matrix with the size of 28 × 28, part of private keys of enTTS;

D is a vector with the size of 28, part of private keys of enTTS.

Finally, we have computed the signature of y(y0, y1, …, y19), which is x(x0, x1, …, x27).

3 A high-speed hardware architecture for multivariate signature

3.1 Overview of the hardware architecture

We choose enTTS (20,28) scheme described in Section 2 for hardware implementations in a composite field GF((24)2), where the size of message (hash value) is 20 B and the signature size is 28 B.

We illustrate the generation of a multivariate signature in Fig. 1. It can be observed from Fig. 1 that the signature generations of multivariate scheme include seven steps:

Fig. 1
figure 1

The generation process of multivariate signatures

(1) Affine transformation L1.

\( {L}_1^{-1} \) is an invertible affine transformation with the following form.

$$ \overline{y}= Ay+B. $$

A is a matrix with the size of 20 × 20.

B is a vector with the size of 20.

It can be observed that \( {L}_1^{-1} \) is performed via matrix-vector multiplications and vector additions, where A and B are parts of private keys.

(2) Polynomial evaluation (first part F)

First, we randomly choose the variables of \( {\overline{x}}_0,{\overline{x}}_1,\dots, {\overline{x}}_7 \), i.e., the first group variables of \( \overline{x} \).

Second, we evaluate the polynomials of f0, f1, …, f8, i.e., the first group polynomials of f i .

After that, this part of polynomial evaluation is performed via using additions and multiplications in a finite field.

(3) Solving (LSEs) in a finite field

During the signature generations of multivariate scheme, it is required to perform solving LSEs twice with the same matrix of size 9 × 9.

First, for the second group variables of \( \overline{x} \), i.e., \( {\overline{x}}_8,{\overline{x}}_9,\dots, {\overline{x}}_{16} \), we solve the LSEs on such variables.

Second, for the fourth group variables of \( \overline{x} \), i.e., \( {\overline{x}}_{19},{\overline{x}}_{20},\dots, {\overline{x}}_{27} \), we solve the LSEs on such variables.

During this step, solving LSEs is performed via using a variant Gauss-Jordan elimination in a finite field.

(4) Polynomial evaluation (second part F)

The third group variables of \( \overline{x} \), i.e., \( {\overline{x}}_{17},{\overline{x}}_{18} \) are solved by evaluating the second group polynomials of f i , i.e., f9, f10.

This part of polynomial evaluation is performed via using additions and multiplications in a finite field;

(5) Polynomial evaluation (third part F)

We evaluate the third group polynomials of f i , i.e., f11, f12, …, f19.

This part of polynomial evaluation is performed via using additions and multiplications in a finite field.

(6) Affine transformation L2: \( {L}_2^{-1} \) is an affine transformation with the following form:

$$ x=C\overline{x}+D. $$

C: is a matrix with the size of 28 × 28.

D is a vector with the size of 28.

It can be observed that \( {L}_2^{-1} \) is performed via matrix-vector multiplications and vector additions, where C and D are parts of private keys.

Our hardware architecture for the signature generation of multivariate scheme is depicted in Fig. 2. It can be observed from Fig. 2 that the hardware architecture consists of adders, multipliers, inverter, parallel Gauss-Jordan eliminator, polynomial evaluation, matrix vector multiplication, vector addition, polynomial evaluation, and processor components in a finite field, where only the first four components are computing components and the others are logical components.

Fig. 2
figure 2

Hardware architecture of multivariate schemes

3.2 Performance evaluation of irreducible polynomial in composite fields

Irreducible polynomials in composite fields are involved in the additions, multiplications, and other operations during signature generations. Thus, the performance evaluation of the irreducible polynomial in the composite field GF((24)2) is very critical for the implementation of high-speed hardware architecture of multivariate scheme.

We suppose that q(x) denotes the irreducible polynomial in GF((24)2), and it has the following form.

$$ q(x)={x}^2+{q}_1x+{q}_0. $$

p1, p0 are elements in GF(24).

We suppose that p(x) denotes the irreducible polynomial in the subfield of GF((24)2), i.e., GF(24), and it has the following form:

$$ p(x)={x}^4+{p}_3{x}^3+{p}_2{x}^2+{p}_1x+1. $$

p3, p2, p1 are bits, i.e., 0 or 1.

The performance of the multiplications and inversions has been evaluated based on such irreducible polynomials, respectively. q(x) = x2 + x + 9 is chosen as the irreducible polynomials in GF((24)2) and p(x) = x4 + x + 1 is chosen as the irreducible polynomials in the subfield GF(24).

3.3 Finite field adder

Let a(x) = a h x + a l and b(x) = b h x + b l be the elements in GF((24)2), where a h , a l , b h , and b l are elements in GF(24).

Then the addition of a(x) and b(x) can be expressed as

$$ a(x)+b(x)= $$
$$ \left({a}_hx+{a}_l\right)+\left({b}_hx+{b}_l\right)= $$
$$ \left({a}_h+{b}_h\right)x+{a}_l+{b}_l. $$

Then, we suppose that c h , c l are elements in GF(24), and we can compute their values via the following expressions:

$$ {c}_h={a}_h+{b}_h, $$
$$ {c}_l={a}_l+{b}_l. $$

Thus, c(x) = c h x + c l is the addition result of a(x) and b(x).

3.4 Finite field multiplier

Let a(x) = a h x + a l and b(x) = b h x + b l be the elements in GF((24)2), where a h , a l , b h , and b l are elements in GF(24).

Then the multiplication of a(x) and b(x) can be expressed as

$$ a(x)\times b(x)= $$
$$ \left({a}_hx+{a}_l\right)\left({b}_hx+{b}_l\right)= $$
$$ \left({a}_h{b}_h{x}^2+\left({a}_h{b}_l+{a}_l{b}_h\right)x+{a}_l{b}_l\right)\operatorname{mod}q(x). $$

We perform the polynomial multiplication and reduction module the irreducible polynomial q(x) = x2 + x + 9. We suppose that c h and c l are elements in GF(24), and we can compute their values via the following expressions:

$$ {c}_h=\left({a}_h+{a}_l\right)\left({b}_h+{b}_l\right)+{a}_l{b}_l, $$
$$ {c}_l={a}_l{b}_l+9{a}_h{b}_h. $$

It can be observed that the critical path of multiplication of two elements in GF((24)2) includes one multiplication, one constant multiplication, and one addition in GF(24).

p(x) is the irreducible polynomial in GF(24). Let \( a(x)=\sum \limits_{i=0}^3{a}_i{x}^i \) and \( b(x)=\sum \limits_{i=0}^3{b}_i{x}^i \) be elements in GF(24), a i , b i  ∈ GF(2), and we suppose that

$$ c(x)=a(x)\times b(x)\left( \operatorname {mod}\;\left(p(x)\right)\right)=\sum \limits_{i=0}^3{c}_i{x}^i $$

is the multiplication result of two elements, where c i  ∈ GF(2).

First, we compute v ij for i = 0, 1, …, 6and j = 0, 1, 2, 3 according to the following equation:

$$ {x}^i \operatorname {mod}\;p(x)=\sum \limits_{j=0}^3{v}_{ij}{x}^j. $$

Next, we compute S i for i = 0, 1, …, 6 by the following equation:

$$ {S}_i=\sum \limits_{j+k=i}{a}_j{b}_k. $$

After that, we compute c i for i = 0, 1, 2, 3 by the following equation:

$$ {c}_i=\sum \limits_{j=0}^6{v}_{ji}{S}_j. $$

Finally, the multiplication result is

$$ c(x)=\sum \limits_{i=0}^3{c}_i{x}^i. $$

3.5 Multiplicative inverter

Let a(x) = a h x + a l and b(x) = b h x + b l be the elements in GF((24)2), where a h , a l , b h , and b l are elements in GF(24).

We suppose that b(x) is the inverse of a(x). Then,

$$ {\displaystyle \begin{array}{l}{b}_h={\left({a}_h+{a}_l\right)}^{-1}{a}_h{b}_l,\\ {}{b}_l={\left({a}_l+9{a}_h^2{\left({a}_l+{a}_h\right)}^{-1}\right)}^{-1}.\end{array}} $$

We use two binary trees for inversions in subfield GF(24), which are illustrated as follows:

Each binary tree has four layers in GF(24);

Root nodes are on the third layer;

Each node has at most two child nodes, left node represents value of zero and right node represents value of one;

Each child must either be a leaf or the root of another tree, each node has a father node when it is not a root node;

Each element in a finite field (except (0000)2 and (0001)2) has a unique traversal from root to leaf due to the fact that (0000)2 has no inverse and the inverse of (0001)2 is itself;

Each leaf is linked to another leaf.

Figure 3 depicts the architecture for inversions in GF(24).

Fig. 3
figure 3

Multiplicative inversion based on binary trees

Example 1. It can be observed from Fig. 4, if it is required to inverse the element (0100)2, we search the binary tree from root nodes to leaf nodes, the path from n1 to n4 represents (0100)2. n4 is linked with n8, thus the path from n5 to n8 represents the inverse of (0100)2, i.e., (1001)2.

Fig. 4
figure 4

An example of inversion

3.6 Parallel Gauss-Jordan eliminator

During central map transformation in signature generations, it is required to solve LSEs in a finite field twice with the same matrix size 9 × 9.

We adopt a parallel Gauss-Jordan elimination, which is depicted in Fig. 5. It can solve a LSE with matrix size of 9 × 9. The parallel Gauss-Jordan eliminator solves systems of linear equations with 9 iterations, which is enhanced in the following directions:

Fig. 5
figure 5

Parallel Gauss-Jordan eliminator with matrix size 9 × 9

First, exclusive adders are used in the parallel Gauss-Jordan elimination based on the design described in Section 3.3;

Second, exclusive multipliers are used in the parallel Gauss-Jordan elimination based on the design described in Section 3.4;

Third, exclusive inverters are used in the parallel Gauss-Jordan elimination based on the design described in Section 3.5.

It can be observed from Fig. 4, I, N l , and E kl are three kinds of cells in the architecture, where k = 1, 2, …, 9 and l = 1, 2, …, 10.

The I cell is used for multiplicative inversion in a finite field, which includes an exclusive inverter described in Section 3.5.

The N l cells are used for normalization of finite field elements, which includes exclusive multipliers described in Section 3.4.

The E kl cells are used for elimination of finite field elements, which includes exclusive adders and multipliers described in Sections 3.3 and 3.4.

In conclusion, the architecture includes one I cell, 9 N l cells, and 90 E lk cells and solves the LSEs within 9 clock cycles with the matrix size of 9 × 9.

4 Results

In this section, we investigate the performance of the high-speed hardware architecture for multivariate scheme through hardware implementations on an Altera Stratix FPGA. The implementation is programmed in the hardware programming language, Verilog.

The implementation of multivariate signature generations is illustrated in Table 2, where the executing time for a signature generation of enTTS is 0.9 μs, the time frequency is 100 MHz, and the clock cycle is 90. It should be noted that, all of the results from implementations mentioned are extracted after place and route on the Altera Stratix FPGA.

Table 2 FPGA implementation of hardware architecture for multivariate signature generation

We compare the high-speed hardware architecture with the related implementations of multivariate schemes and other public key schemes, which is depicted in Table 3. The ECC cryptosystems proposed in [2] are efficient implementations and Rainbow cryptosystems proposed in [38] are fast implementations.

Table 3 Comparison on public key cryptographic systems

It can be observed from Table 3 that the high-speed hardware architecture is much faster than the related implementations of public key cryptosystems.

5 Conclusions

We propose a high-speed cryptographic architecture for hardware implementation of multivariate signature generations in this paper. The main computations of signature generations of multivariate scheme are multiplications, inversions, and solving LSEs in a finite field. Thus, we improve the finite field multiplications via using composite field expression and design a finite field inversion via using binary trees. Besides, we improve solving LSEs in a finite field based on the variant algorithm of Gauss-Jordan elimination.

We implement the high-speed hardware architecture based on the above improvements on an Altera Stratix FPGA device. The implementation results show that the executing time for a signature generation of multivariate scheme is 0.9 μs, the time frequency is 100 MHz, and the clock cycle is 90. The comparison shows that the hardware architecture is much faster than other implementations.