1 Introduction

Digital-signature schemes are among the most important and widely used cryptographic primitives. Schemes used in practice today (RSA [30], DSA [14], ECDSA [20], and EdDSA [4]) are based on assumptions regarding the computational difficulty of solving certain mathematical problems. Due to Shor’s algorithm [32] and its variants, some of these problems, such as integer factorisation and discrete logarithms, can be efficiently solved on a quantum computer. Since the National Institute of Standards and Technology (NIST) started a project (NIST-PQC, see footnote 1) to evaluate and standardise post-quantum cryptographic algorithms, many solutions have been proposed. Hash-based signature schemes (HBS) are among the most attractive candidates for quantum-safe signature schemes. Every signature scheme requires a hash function to reduce a message to a small representation that can be easily signed. While other signature schemes rely on additional computational hardness assumptions, hash-based approaches only need a secure hash function. HBS have been intensively analysed [5, 10, 13, 16, 27] and the two schemes discussed in this work are currently undergoing a standardisation process [18, 26]. The Leighton-Micali Signature system (LMS) [26] and the eXtended Merkle Signature Scheme (XMSS) [18] have been proposed in the Internet Engineering Task Force (IETF) as quantum-secure HBS. NIST proposed [11] to approve the use of LMS and XMSS and their multi-tree variants Hierarchical Signature System (HSS) and multi-tree XMSS (\(\text {XMSS}^{MT}\)), respectively. This recommendation suggests the use of some of the parameter sets from the RFCs and defines some new parameter sets. It considers SHA-256 or SHAKE256 as underlying hash functions, with outputs of 192-bit or 256-bit length. Through the choice of parameters, HBS offer several trade-offs between time and size. Hence, the parameter selection has a major impact on how feasible it is to deploy HBS in resource-constrained environments such as embedded microcontrollers. In this work, we chose a subset of parameters from the suggested sets of the NIST recommendation which are suitable for embedded devices.

Due to the popularity and widespread use of Cortex-M4 microcontrollers in different applications, NIST recommended it to submission teams as an optimisation target for the second round of NIST-PQC. The pqm4 project [22] (footnote 2) investigates the feasibility and performance of the proposed NIST-PQC approaches on microcontrollers. It provides a framework for testing and benchmarking NIST-PQC submissions on a Cortex-M4 microcontroller. It includes reference and optimised implementations of key-encapsulation mechanisms and signature schemes. The implementations and measurements in our work were realised within the pqm4 framework.

Related Work. Many aspects regarding the implementations of HBS have been studied in the literature. Rohde, Eisenbarth, Dahmen, Buchmann, Paar [31] presented the first implementation of GMSS [8], an improvement of Merkle’s hash-based signature scheme, on an 8-bit smart-card microprocessor. Hülsing, Busold, Buchmann [17] implemented a variant of XMSS on a 16-bit smart card. A comparison between stateful and stateless HBS was given by Hülsing, Rijneveld, Schwabe [19]. For this, the authors implemented SPHINCS and \(\text {XMSS}^{MT}\) on an ARM Cortex M3. Van der Laan, Poll, Rijneveld, de Ruiter, Schwabe, Verschuren [23] presented an implementation of XMSS on the Java Card platform. Kannwischer, Rijneveld, Schwabe, Stoffelen [22] presented the pqm4 framework for testing, speed benchmarking, and measurement of stack consumption of NIST-PQC submissions on an ARM Cortex-M4 microcontroller. Kampanakis, Fluhrer [21] provided the only comparison between LMS and XMSS on an x86 architecture regarding their security assumptions, signature/public key sizes, performance, and some other aspects.

Our Contribution. This paper aims at comparing stateful HBS on microcontrollers. To achieve this, LMS and XMSS and their multi-tree variants were compared on an ARM Cortex-M4. For this, we provide an adapted implementation of LMS for the Cortex-M4, which, to the best of the authors’ knowledge, is the first such implementation to date. We evaluated suitable parameter sets for constrained devices from the NIST recommendation for stateful hash-based signature schemes [11]. Furthermore, deviating from RFC 8391 [18], we slightly modified the reference implementation of XMSS, leading to noticeable speedups. We provide a comparative performance and stack consumption analysis for several parameter sets of the instantiated versions of LMS and XMSS. Thereby we instantiate both HBS with several optimised hash functions. All software and results described in this paper are in the public domain and are publicly available at https://doi.org/10.5281/zenodo.3631571. Further, we refer to the respective projects included in our implementation for licensing information.

Organisation. The remainder of this document is structured as follows. Sect. 2 gives preliminary information on hash-based signature schemes. In Sect. 3, we outline the main structural differences between LMS and XMSS. Details about the implemented hash functions and the approaches to speed up XMSS are presented in Sect. 4. Our implementation results are given in Sect. 5. We discuss the results and draw a conclusion in Sect. 6. Finally, Appendix A contains further evaluated results.

2 Hash-Based Signature Schemes

While the security of other post-quantum cryptographic approaches like isogeny-based cryptography is still subject to further research, hash-based schemes come with well-understood security assumptions.

Both stateful schemes discussed in this work use a tree construction along with a variant of a one-time signature scheme (OTS). Unlike in stateless schemes, in LMS and XMSS the signer needs to keep track of which key pairs have already been used. Therefore, the current state (index) is stored in the secret key, indicating which key pair to use next. XMSS provides methods to decrease the worst-case runtime by keeping state information beyond the index [9]. To allow a fair comparison, these methods have not been considered in this work.

2.1 One-Time Signature Schemes

Many techniques have been proposed for constructing OTS schemes [7, 24, 27]. One of the most prominent OTS is the Winternitz OTS (WOTS) scheme [27], which is relatively efficient, has been used in practice and allows space/time trade-offs. LMS and XMSS use variants of WOTS.

Winternitz One-Time Signature Scheme. The main idea of all WOTS variants is to use a function chain to sign multiple bits starting from random inputs. The key generation is processed as shown in Algorithm 1, where n is the security parameter, w is the “Winternitz parameter” (a power of 2), and \(f: \{0,1\}^* \rightarrow \{0,1\}^n\) defines a one-way function. Thereby, \(f^{w-1}\) should be interpreted as the (\(w-1\))-th iteration of the one-way function f. Increasing the number of bits \(\log_2 w\) signed per chain shrinks the number of chains, and thus the size of a signature, roughly in inverse proportion, while the effort to perform key generation, signing and verification grows exponentially in \(\log_2 w\), since each chain has length \(w-1\). Thus, the Winternitz parameter w enables space/time trade-offs.

figure a
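To make the space/time trade-off concrete, the following minimal Python sketch computes the number of chains and the resulting signature size and key-generation effort for several values of w. It assumes the WOTS+ length formulas of RFC 8391 (\(len_1\) message chains plus \(len_2\) checksum chains); the function names `wots_lengths` and `wots_costs` are hypothetical and not taken from any reference implementation.

```python
import math

def wots_lengths(n_bytes: int, w: int) -> tuple[int, int, int]:
    """Return (len_1, len_2, len): message chains, checksum chains, total.

    Assumes the RFC 8391 formulas; w must be a power of 2.
    """
    lg_w = w.bit_length() - 1  # log2(w) for a power of 2
    len_1 = math.ceil(8 * n_bytes / lg_w)
    len_2 = math.floor(math.log2(len_1 * (w - 1)) / lg_w) + 1
    return len_1, len_2, len_1 + len_2

def wots_costs(n_bytes: int, w: int) -> tuple[int, int]:
    """Return (signature size in bytes, hash calls for key generation)."""
    _, _, length = wots_lengths(n_bytes, w)
    return length * n_bytes, length * (w - 1)

for w in (4, 16, 256):
    print(w, wots_lengths(32, w), wots_costs(32, w))
```

For n = 32 bytes, moving from w = 16 to w = 256 roughly halves the signature (67 chains versus 34) while the number of chaining steps during key generation grows from 1005 to 8670.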

In order to protect against trivial attacks, a checksum C is computed and signed along with the message, as shown in Algorithm 2 in lines 5–7. A signature is computed by mapping the i-th chunk of \(M'\) to one intermediary value of the respective function chain, by iterating the one-way function \(M'_i\) times. As shown in Algorithm 3, in WOTS the public key can be calculated directly from the signature.

figure b
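The interplay of key generation, signing, and public-key recovery can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithms of LMS or XMSS: SHA-256 plays the role of f and of the message hash, w = 16 and n = 32 bytes are fixed, no prefixes or bitmasks are used, and all function names are hypothetical.

```python
import hashlib
import os

N, W, LG_W = 32, 16, 4          # digest bytes, Winternitz parameter, log2(W)
LEN_1, LEN_2 = 64, 3            # message chains and checksum chains for N, W
LEN = LEN_1 + LEN_2

def f(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def chain(x: bytes, steps: int) -> bytes:
    """Iterate the one-way function f 'steps' times."""
    for _ in range(steps):
        x = f(x)
    return x

def message_digits(msg: bytes) -> list[int]:
    """Split the message digest into base-W digits and append the checksum."""
    digest = hashlib.sha256(msg).digest()
    m = [d for byte in digest for d in (byte >> 4, byte & 0x0F)]
    checksum = sum(W - 1 - d for d in m)  # at most 64 * 15 = 960 < 16**3
    c = [(checksum >> (LG_W * i)) & (W - 1) for i in reversed(range(LEN_2))]
    return m + c

def keygen() -> tuple[list[bytes], list[bytes]]:
    sk = [os.urandom(N) for _ in range(LEN)]
    pk = [chain(x, W - 1) for x in sk]     # end nodes of the chains
    return sk, pk

def sign(msg: bytes, sk: list[bytes]) -> list[bytes]:
    return [chain(x, d) for x, d in zip(sk, message_digits(msg))]

def pk_from_sig(msg: bytes, sig: list[bytes]) -> list[bytes]:
    """Complete each chain; the result equals pk for a valid signature."""
    return [chain(s, W - 1 - d) for s, d in zip(sig, message_digits(msg))]
```

A verifier accepts if `pk_from_sig(msg, sig)` matches the known public key. The checksum ensures that increasing any message digit forces at least one checksum digit to decrease, which an attacker cannot achieve without inverting f.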

According to [13], assuming f is a collision-resistant one-way function, this scheme is existentially unforgeable under chosen-message attacks. XMSS makes use of the variant WOTS+. WOTS+, proposed by Hülsing [16], introduced a slight modification of the chaining function by adding a random bitmask \(r_i\) for each iteration, such that \(f^{0}(x)=x\), and \(f^{i}(x)=f(f^{i-1}(x) \oplus r_i)\) for \(i>0\). This modification eliminates the requirement for a collision-resistant hash function.

figure c
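The WOTS+ chaining function \(f^{i}(x)=f(f^{i-1}(x) \oplus r_i)\) can be sketched as follows. The sketch assumes SHA-256 as f; the way the bitmasks are derived in `demo_masks` is purely illustrative and does not reflect how XMSS generates them.

```python
import hashlib

W = 16

def f(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(a, b))

def chain_plus(x: bytes, start: int, steps: int, masks: list[bytes]) -> bytes:
    """Apply iterations start+1 .. start+steps of the WOTS+ chain to x.

    masks[i] is the bitmask r_i used in iteration i (masks[0] is unused).
    """
    for i in range(start + 1, start + steps + 1):
        x = f(xor(x, masks[i]))
    return x

def demo_masks() -> list[bytes]:
    # Hypothetical derivation, for illustration only.
    return [hashlib.sha256(b"mask" + bytes([i])).digest() for i in range(W)]
```

Because the bitmask depends on the position in the chain, a verifier must resume the chain at the correct iteration: the signer computes `chain_plus(sk, 0, d, masks)` and the verifier continues with `chain_plus(sig, d, W - 1 - d, masks)`.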

2.2 Many-Time Signature Schemes

Merkle trees enable the use of a single long-term public key created from a large set of OTS public keys. In the following we will only briefly describe the methods for the construction of many-time schemes and refer to [26] and [18] for further details on the respective approach.

Merkle Trees. Based on the idea of one-time signature schemes, Merkle’s approach [27] is to construct a balanced binary tree (a so-called Merkle tree) using a given hash function to enable the use of a single public key (the root of the tree) for verifying several messages. A signer generates \(2^h\) one-time key pairs \((X_j, Y_j)\) where \(0 \le j < 2^h\) for a selected \(h \in \mathbb {N}\) and \(h \ge 2\). The leaves of the tree are represented by the public keys \(X_j\) of the OTS, which are derived from the secret keys \(Y_j\) for \(0 \le j < 2^h\). Parameter h defines the height of the resulting binary tree, whose inner nodes are represented by the value computed as \(n = f(n_l\ ||\ n_r)\), where \(n_l\) and \(n_r\) are the values of the left and right children of n. To verify a signature at the leaf with index j, one additionally needs the authentication path of j, which is a sequence of h nodes. This authentication path contains the siblings of all the nodes on the path between leaf j and the root. In summary, a signature on a message m contains the one-time signature on m produced using the secret key \(Y_j\), the authentication path, and the index j to indicate which key pair of the OTS was used.
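The tree construction and the role of the authentication path can be illustrated with a short Python sketch. It assumes SHA-256 as f and plain concatenation of the children, without the prefixes or bitmasks used by LMS and XMSS; the function names are hypothetical.

```python
import hashlib

def f(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return all levels of the tree: levels[0] are the leaves, levels[-1] is [root].

    The number of leaves must be a power of 2.
    """
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([f(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def auth_path(levels: list[list[bytes]], j: int) -> list[bytes]:
    """Collect the sibling of the node on the path from leaf j at every level."""
    path = []
    for level in levels[:-1]:
        path.append(level[j ^ 1])  # sibling index differs in the lowest bit
        j >>= 1
    return path

def root_from_path(leaf: bytes, j: int, path: list[bytes]) -> bytes:
    """Recompute the root from a leaf, its index, and its authentication path."""
    node = leaf
    for sibling in path:
        node = f(sibling + node) if j & 1 else f(node + sibling)
        j >>= 1
    return node
```

A verifier recomputes the OTS public key from the one-time signature, uses it as the leaf, and compares the output of `root_from_path` with the long-term public key. The path has exactly h entries, so the signature grows only linearly in the tree height.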

Multi-trees. Rather than scaling up a single tree, LMS and XMSS define single and multi-tree (hypertree) variants of their signature schemes. In the multi-tree variant, the trees on the lowest layer are used to sign messages and the trees on higher layers are used to sign the roots of the trees on the layer below. Considering a hypertree of total height h that has d layers of trees of height h/d, the top layer \(d - 1\) contains one tree, layer \(d - 2\) contains \(2^{(h / d)} \) trees, and so on. Finally, the lowest layer contains \(2^{(h - (h/d))}\) trees. In order to generate the public key, only the single tree at the top of the structure needs to be generated. This requires generating the OTS keys along the bottom of this tree. The lower trees are generated deterministically as required. Thus, for a given h, key generation in a hypertree is faster than in a single tree. A signature consists of all the signatures on the way to the highest tree. Hence, the signature size increases and signing and verifying take slightly longer. The root of the top-level tree is the public key. For further details on the multi-tree variants of LMS and XMSS, we refer to [26] and [18], respectively.
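As a small worked example of the layer arithmetic, the following Python sketch counts the trees per layer and the OTS key pairs needed for the top tree. The values h = 20 and d ∈ {2, 4} are chosen for illustration only, and the function names are hypothetical.

```python
def trees_per_layer(h: int, d: int) -> dict[int, int]:
    """Map each layer (d-1 is the top) to its number of trees."""
    if h % d:
        raise ValueError("total height h must be divisible by d")
    sub_h = h // d
    return {layer: 2 ** ((d - 1 - layer) * sub_h) for layer in range(d - 1, -1, -1)}

def top_tree_leaves(h: int, d: int) -> int:
    """OTS key pairs that must be generated for the top tree."""
    return 2 ** (h // d)

print(trees_per_layer(20, 2))   # top layer: 1 tree, lowest layer: 2**10 trees
print(top_tree_leaves(20, 2), "leaves instead of", 2 ** 20)
```

For h = 20 and d = 2, the top tree has only \(2^{10}\) leaves, compared with \(2^{20}\) leaves in a single tree of the same total height, which is why key generation becomes faster at the cost of a longer signature.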

Fig. 1. Overview with L-trees and WOTS chains (adopted from [34], Fig. 1). Grey nodes are the private keys and the black nodes the public keys of the WOTS chains. The black node at the top is the public key.

3 Comparison

Roughly speaking, LMS and XMSS have a very similar construction. Both schemes use Merkle trees [27] along with a variant of WOTS. For this reason, we will focus on the most relevant structural differences of the schemes.

LMS and XMSS use different notations to specify equivalent parameters. As shown in Table 1, we define a common notation for parameters used in this work. For further details on the definition of the parameters, we refer to [26] and [18].

Table 1. Notation.

3.1 Prefixes and Bitmasks

In order to move away from collision resistance and towards collision resilience, within LMS and XMSS whenever an input is hashed, a specific prefix is added to the input. In the case of XMSS as mentioned in Sect. 2.1, WOTS+ [16] requires a random bitmask for each chaining iteration as additional input. Although LMS and XMSS apply different mechanisms to strengthen the security, the underlying constructions are very similar. To describe this principle theoretically, Bernstein, Hülsing, Kölbl, Niederhagen, Rijneveld, Schwabe [5] introduced an abstraction called tweakable hash functions (\(\mathbf {Th}\)) as follows.

Definition 1

(Tweakable hash function): Let \(n, \alpha \in \mathbb {N}, \mathcal {P}\) be the public parameters space, and \(\mathcal {T}\) be the tweak space. A tweakable hash function is an efficient function

$$\mathbf {Th}:\mathcal {P} \times \mathcal {T} \times \{0,1\}^{\alpha } \rightarrow \{0,1\}^n, \ \ \ \ \mathsf {MD} \leftarrow \mathbf {Th}(P,T,M)$$

mapping an \(\alpha \)-bit message M to an n-bit hash value \(\mathsf {MD}\) using a public parameter \(P \in \mathcal {P}\), also called function key, and a tweak \(T \in \mathcal {T}\).

Thus, a tweakable hash function adds specific context information (tweak) and public parameters (function key) to the input. According to this definition, the constructions within LMS and XMSS can roughly be described as follows.

Construction 1

(Prefix construction/LMS): Given a hash function \(H: \{0,1\}^{2n+\alpha } \rightarrow \{0,1\}^n,\) we construct \(\mathbf {Th}\) with \(\mathcal {P}=\mathcal {T}=\{0,1\}^n\), as

$$\mathbf {Th}(P,T,M)=H(P||T||M).$$

Construction 2

(Prefix and bitmask construction/XMSS): Given two hash functions \(H_1: \{0,1\}^{2n} \times \{0,1\}^{\alpha } \rightarrow \{0,1\}^n\) with 2n-bit keys, and \(H_2: \{0,1\}^{2n} \rightarrow \{0,1\}^{\alpha },\) we construct \(\mathbf {Th}\) with \(\mathcal {P}=\mathcal {T}=\{0,1\}^n\), as

$$\mathbf {Th}(P,T,M)=H_1(P||T,M^{\oplus })\text {, with } M^{\oplus }=M \oplus H_2(P||T).$$

As defined in Construction 2, while XMSS additionally generates distinct random inputs for each invocation of the hash function, LMS provides inputs with predictable changes to the hash function. Construction 1 reduces the effort, but comes in return at the cost of stronger security assumptions. For further details on the security model of LMS and XMSS, we refer to [21] and for further security notions for the defined constructions, we refer to [5].
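The difference between the two constructions can be made explicit with a minimal Python sketch. It assumes SHA-256 for H and \(H_1\) (with the key simply prepended to the message) and SHAKE256 for \(H_2\), so that the bitmask can have the same length as M; these instantiations and the function names are illustrative assumptions and do not reproduce the exact input formats of LMS or XMSS.

```python
import hashlib

def th_prefix(P: bytes, T: bytes, M: bytes) -> bytes:
    """Construction 1: Th(P, T, M) = H(P || T || M)."""
    return hashlib.sha256(P + T + M).digest()

def th_prefix_bitmask(P: bytes, T: bytes, M: bytes) -> bytes:
    """Construction 2: Th(P, T, M) = H1(P || T, M xor H2(P || T))."""
    mask = hashlib.shake_256(P + T).digest(len(M))    # H2: bitmask of |M| bytes
    masked = bytes(a ^ b for a, b in zip(M, mask))    # M xor bitmask
    return hashlib.sha256(P + T + masked).digest()    # H1 keyed with P || T
```

Construction 2 requires an additional hash invocation to derive the bitmask before the actual hash call, which is the extra effort that Construction 1 avoids.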

3.2 WOTS Public Key Compression

Both schemes combine the public keys (final values) of a WOTS chain into an n-bit value. While LMS hashes them together as a single message (see Fig. 2), XMSS uses a tree (called L-tree) to compress these values (see Fig. 1). The construction in XMSS obviously leads to a higher number of hash operations.
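The following Python sketch contrasts the two compression strategies by counting calls to the hash function. It assumes SHA-256 without prefixes or bitmasks, and an L-tree in which an unpaired node is lifted to the next level unchanged; the function names are hypothetical.

```python
import hashlib

def f(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def compress_single(pk_nodes: list[bytes]) -> tuple[bytes, int]:
    """Hash all chain end nodes as one message; returns (digest, hash calls)."""
    return f(b"".join(pk_nodes)), 1

def compress_ltree(pk_nodes: list[bytes]) -> tuple[bytes, int]:
    """Compress the end nodes with an unbalanced binary tree (L-tree)."""
    nodes, calls = list(pk_nodes), 0
    while len(nodes) > 1:
        nxt = [f(nodes[i] + nodes[i + 1]) for i in range(0, len(nodes) - 1, 2)]
        calls += len(nodes) // 2
        if len(nodes) % 2:          # odd node out moves up one level unchanged
            nxt.append(nodes[-1])
        nodes = nxt
    return nodes[0], calls
```

For 67 chains, the L-tree needs 66 hash calls, whereas the single-message variant needs one call on a longer input. In XMSS, every L-tree node additionally requires the key and bitmask derivations of Construction 2, which widens the gap further.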

4 LMS and XMSS on the Cortex-M4

In the case of XMSS (footnote 3), we removed all file-based procedures and implemented an interface to the pqm4 framework. For this, we used a slightly modified version of the pqm4 framework. This modification allows updating the secret key during the signing process by not passing the secret key as a constant. Thus, we enable the signing algorithm to be stateful. For further practical considerations around statefulness in this context, we refer to [25]. To port the reference implementation of LMS (footnote 4) to the Cortex-M4, apart from smaller modifications, we integrated the single-thread version and turned floating-point operations off.

4.1 Implemented Hash Functions

Primarily for the purpose of speedup and to achieve a broader comparison range, we integrated two more lightweight hash functions in addition to those recommended by NIST [11] (SHA-256 and SHAKE256) and already available in pqm4. In particular, we additionally evaluated LMS and XMSS using different variants of Keccak and Gimli-Hash.

Keccak-f[800]. Keccak-f describes a family of permutations originally specified in [1]. The Keccak-p permutations within Keccak-f are specified by a fixed width of the permutation (b) and the respective number of rounds (\(n_r\)) required. The permutation is denoted by Keccak-p\([b, n_r]\), where \(b \in \{25, 50, 100, 200, 400, 800, 1600\}\) and \(n_r \in \{12, 14, 16, 18, 20, 22, 24\}\). Thus, according to [28], Keccak-f[800], a permutation with 800 bits of width, corresponds to Keccak-p[800, 22]. For further details on Keccak, we refer to [1] and [28].

In the case of Keccak-f[800], we additionally considered a Keccak permutation with only 12 rounds (Keccak-p[800, 12], similar to River Keyak, see footnote 5) to reduce the computational workload per hash invocation. Evidently, a reduced number of rounds provides a smaller safety margin than the full 22 rounds recommended for Keccak-f[800] [28]. Nevertheless, since the best known practical collision attack against SHA-3 exists only up to 5 rounds [15], the margin provided by 12 rounds is still comfortable. In a similar manner, Aumasson [2] proposed a general revision of the number of rounds of widely used symmetric primitives to speed up the standards without increasing the security risk. Furthermore, to achieve a certain security level, we set the capacity \(c=256\) as specified in River Keyak (see footnote 5).
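The relation between permutation width, number of rounds, and sponge rate can be checked with a few lines of Python. The sketch uses the rule from the Keccak specification that Keccak-f[b] has \(12 + 2\ell \) rounds for \(b = 25 \cdot 2^{\ell }\), and that the rate of a sponge is the width minus the capacity; the function names are hypothetical.

```python
def keccak_f_rounds(b: int) -> int:
    """Number of rounds of Keccak-f[b], i.e. n_r = 12 + 2*l for b = 25 * 2**l."""
    widths = [25 * 2 ** l for l in range(7)]   # 25, 50, ..., 1600
    if b not in widths:
        raise ValueError("unsupported permutation width")
    return 12 + 2 * widths.index(b)

def sponge_rate(b: int, capacity: int) -> int:
    """Rate in bits of a sponge with width b and the given capacity."""
    return b - capacity

print(keccak_f_rounds(800), sponge_rate(800, 256))     # Keccak-f[800], c = 256
print(keccak_f_rounds(1600), sponge_rate(1600, 512))   # SHAKE256
```

With a capacity of 256 bits, Keccak-f[800] therefore absorbs 544 bits per permutation call, and replacing the 22 rounds by 12 reduces the work per call by roughly 45%.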

Gimli-Hash. The family of hash functions Gimli-Hash is built on top of a 384-bit permutation called Gimli. The Gimli permutation [6] was designed to achieve high security with high performance. According to the authors, the proposed permutation is distinguished from other permutation-based primitives by its high cross-platform performance. Furthermore, one of the core ideas of Gimli was to define one standard that achieves high performance in lightweight as well as in non-lightweight environments. Due to the selected design, Gimli fits into 14 easily usable integer registers on 32-bit ARM microcontrollers. Gimli-Hash works on a 48-byte state with a rate of 16 bytes.

We chose Gimli-Hash as a representative of the current round-2 candidates in NIST’s Lightweight Cryptography Standardisation process (footnote 6). It is of practical importance to investigate the performance of the remaining candidates.

Fig. 2. Overview without L-trees (adopted from [34], Fig. 1). Grey nodes are the private keys and the black nodes the public keys of the WOTS chains. The black node at the top is the public key.

4.2 Speeding up XMSS

In this section, we discuss three methods for speeding up XMSS deviating from RFC 8391 [18]. The first described technique replaces the tree-based WOTS public-key compression with a single hash call. This approach was first proposed in SPHINCS+ [3]. The second one, a structure omitting the use of bitmasks (the so-called “simple” version) was proposed in the round-2 submission of SPHINCS+ [3] at NIST-PQC. Finally, we describe a technique called “hash pre-computation”. This approach was first mentioned by Kampanakis, Fluhrer [21] and first described by Wang, Jungk, Wälde, Deng, Gupta, Szefer, Niederhagen [34]. Thereby, recurring intermediate results of a certain type of hash calls are temporarily stored and reused in the subsequent hash calls.

All these methods lead to speedups during key generation, signing and verifying. However, during signature verification, the hash pre-computation method only leads to small speedups in certain parameter sets. Although the methods presented in the following can also be implemented in other cases, in this work we will mainly focus on the parameter sets from Table 3.

Other acceleration methods, such as storing some top nodes in the secret key [12], applying a more efficient tree traversal scheme [33] (already part of the XMSS reference implementation, see footnote 7, and of our implementation), or instantiating the schemes with shorter hash functions, were intentionally not considered in this work. Although these methods lead to significant speedups, they can be applied in both LMS and XMSS and therefore have no fundamental impact on our comparison.

The instantiation of the different parameter sets is managed by conditional compilation. In the case of XMSS, the modifications presented in this section are also controlled by preprocessor directives, which makes it possible to compile different versions of XMSS.

Tree-Less WOTS+ Public Key Compression. As described in SPHINCS+ [3], we compress the end nodes of the WOTS chains (black nodes in Fig. 2) with a single call to a tweakable hash function. A tree-based compression (see L-trees in Fig. 1) is slower than a single call to a tweakable hash function that takes the concatenation of all end nodes of the WOTS chains as input.

Fig. 3. Hash pre-computation within Keccak-f[800] with a rate of 512 bits.

Bitmask-Less Hashing. In this construction, no bitmasks are generated and XORed with the input of the tweakable hash functions. In this case, the tweakable hash function is defined according to Construction 1 instead of Construction 2 (see Sect. 3.1). For the security implications of applying Construction 1 in XMSS, we refer to [5].

Hash Pre-computation. Within XMSS, for a given key pair and a security parameter n, the first 2n-bit block (n-bit domain separator and n-bit hash-function key) of the input to the pseudo-random function (of type \( \mathcal {F} :\{0,1\}^{3n} \rightarrow \{0,1\}^{n}\)) is the same for all calls. Considering this fact, we store the digest of the first 2n-bit block at the first call to the pseudorandom function (PRF) and skip this effort by reusing this result in all further calls. This approach can easily be applied whenever the internal block size/rate of the used hash function is less than or equal to 2n bits. Depending on the internal block size of the used hash function, the number of saved calls to the internal compression respectively permutation function (Speedup\(_{PRF}\)) can be calculated as follows. Let \(B_{bits} \le 2n\) bits be the internal block size/rate in bits and \(\#\text {call}_{PRF}\) be the number of calls to the PRF, then

$$\text {Speedup}_{PRF} = (\#\text {call}_{PRF} - 1) \cdot \left\lfloor \frac{2n}{B_{bits}} \right\rfloor .$$

As exemplified in Fig. 3 for the case of Keccak-f[800] and \(n=256\), this method can in principle be applied in every sponge construction by reducing the rate to 2n bits whenever the rate is longer than 2n bits. Hence, even in the case \(n=256\), it can be implemented in SHAKE256 (Keccak-f[1600]) by reducing the rate from 1088 bits to 2n bits. However, in hash calls other than the PRF invocations, this would increase the number of permutations required for inputs longer than 2n bits. A “hybrid approach” (not considered in this paper) with a variable rate (512 bits for PRF calls and 1088 bits for all other hash calls) could lead to a further speedup.

In the case of SHA-256 and \(n=256\), where the 512-bit block fits into a 512-bit SHA-256 internal block, this approach reduces the number of calls to the compression function by half. According to the standard definition [28], the rate of Keccak-f[800] with a 256-bit capacity is 544 bits. In order to enable hash pre-computation, we reduced the rate to 512 bits. In other words, an instantiation of XMSS using Keccak-f[800] with hash pre-computation uses a 512-bit rate, while a version without hash pre-computation makes use of the whole 544 bits. This modified design with a longer capacity (288 bits instead of 256 bits) has no negative influence on the security of the hash function. In the case of Keccak-f[800], this approach reduces the number of required permutations by half. Since the rate in the sponge construction within Gimli-Hash is 128 bits long, it results in saving 4 permutation runs per PRF invocation.
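The idea of hash pre-computation can be demonstrated with Python's `hashlib`, whose hash objects can be copied after absorbing a common prefix. The sketch below is an illustration under stated assumptions, not our C implementation: SHA-256 with \(n=256\) is assumed, the PRF input is simply prefix, key, and address concatenated, and the names `prf_plain`, `PrecomputedPRF`, and `saved_calls` are hypothetical. The helper `saved_calls` encodes the assumption that every PRF call after the first skips \(\lfloor 2n / B_{bits} \rfloor \) compression or permutation calls.

```python
import hashlib

def prf_plain(prefix: bytes, key: bytes, addr: bytes) -> bytes:
    """PRF without pre-computation: hashes all 3n bits on every call."""
    return hashlib.sha256(prefix + key + addr).digest()

class PrecomputedPRF:
    """Absorb the constant 2n-bit block once and reuse the internal state."""

    def __init__(self, prefix: bytes, key: bytes):
        # 64 bytes fill exactly one SHA-256 block for n = 256 bits.
        self._state = hashlib.sha256(prefix + key)

    def __call__(self, addr: bytes) -> bytes:
        h = self._state.copy()   # restore the stored intermediate state
        h.update(addr)           # only the last n-bit block is processed
        return h.digest()

def saved_calls(num_prf_calls: int, n_bits: int, block_bits: int) -> int:
    """Compression/permutation calls saved, assuming block_bits <= 2 * n_bits."""
    return (num_prf_calls - 1) * (2 * n_bits // block_bits)
```

Both variants produce identical outputs, so the optimisation is transparent to the rest of the scheme. For 1000 PRF calls and \(n=256\), `saved_calls` yields 999 saved compression calls for SHA-256 (512-bit block) and 3996 saved permutation calls for Gimli-Hash (128-bit rate).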

From now on, as shown in Table 2, we call an implementation of XMSS with L-trees using Construction 2 (see Sect. 3.1) without hash pre-computation XMSS_ROBUST, the variant without L-trees using Construction 1 XMSS_SIMPLE, and the one without L-trees applying Construction 1 and hash pre-computation XMSS_SIMPLE+PRE. The multi-tree variants are called XMSS\(^{MT}\)ROBUST, XMSS\(^{MT}\)SIMPLE, and XMSS\(^{MT}\)SIMPLE+PRE, respectively. XMSS_ROBUST and XMSS\(^{MT}\)ROBUST represent the current version of XMSS from RFC 8391 (footnote 8).

Table 2. Implemented variants of XMSS.

5 Evaluation

We measured the performance of our implementations on a commercially available microcontroller. We use the widely available STM32F4DISCOVERY board featuring a 32-bit ARM Cortex-M4 core with FPU, 1-Mbyte Flash ROM, and 192-Kbyte RAM. The reference implementations of LMS (footnote 9) and XMSS (footnote 10) provided the basis for our implementation. The methods used for cycle counter reading, device communication at runtime, and hardware-based random byte generation were provided by the pqm4 framework (footnote 11). This framework in turn includes the libopencm3 library (footnote 12) for providing these methods. All test instances were compiled with GNU Tools for ARM Embedded Processors 9-2019-q4-major (footnote 13) (gcc version 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599]) using the flags:

-O3 -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16.

We additionally evaluated LMS and XMSS using optimised assembly implementations of Keccak-f[800] (KeccakP-800-u2-armv7m-le-gcc) from the eXtended Keccak Code Package (footnote 14) and of the Gimli permutation (arm-m4 version, footnote 15).

Table 3. Selected parameter sets.

In this work, LMS and XMSS share the same implementations to perform the hash computations, clock-cycle measurement, and stack analysis, hence yielding an unbiased comparison. The selection of the evaluated parameter sets is based on the recommendation of NIST [11]. The parameter sets from Table 3 were implemented in combination with Gimli-Hash, Keccak (Keccak-p[800, 22] and Keccak-p[800, 12]), SHAKE256, and SHA-256. The resulting signature size for each parameter set is also shown in Table 3.

As shown in Table 4, the implemented modifications in XMSS and \(\text {XMSS}^{MT}\) lead to significant speedups. XMSS_SIMPLE achieves a speedup of up to \(3.03\times \) for key generation and signing, and up to \(4.32\times \) for verifying. In combination with the hash pre-computation approach, key generation and signing achieve accelerations up to 3.11 times. However, when applying the hash pre-computation method, a speedup only occurs in certain parameter sets, mostly when the number of rounds of the hash function and the number of calls to the PRF are large enough to compensate for the additional effort. In the case of verification, a speedup through hash pre-computation occurred rarely (see Table 9 and Table 10).

Table 4. Speedup in XMSS and \(\text {XMSS}^{MT}\) exemplary with SHA-256.
Table 5. Number of hash operations for SHA-256, \(n=256\), and \(w=16\).

Reducing the number of rounds in Keccak-f[800] to 12 instead of 22 yields a speedup of up to roughly \(1.66 \times \) for key generation and signing, and \(1.72 \times \) for verifying in all implemented variants of XMSS, and up to roughly \(1.70 \times \) for key generation and signing, and \(1.76 \times \) for verifying in all implemented variants of LMS (see Table 9, Table 10, and Table 11).

Table 6. Performance comparison for SHA-256, \(n=256\), \(w=16\), and \(h=10\).

Structurally, XMSS_SIMPLE, the variant without L-trees using Construction 1, differs only marginally from LMS. To confirm this analysis, we measured the number of hash operations required in LMS and XMSS_SIMPLE. As Table 5 shows, the numbers of hash operations in XMSS_SIMPLE and XMSS\(^{MT}\)SIMPLE are almost equivalent to those in LMS and HSS, respectively. As shown in Table 6, although the changes in XMSS result in a slightly smaller number of hash calls than in LMS, LMS unexpectedly requires fewer clock cycles in all tested cases. We further measured the time spent performing hash operations for each scheme. The results of this measurement are given in Table 7. In both schemes, at least \(85\%\) of the time was spent on hash computations. XMSS spends up to \(15\%\) of the measured time on other operations, while LMS spends up to \(94\%\) of its time on hashing.

Table 7. Percentage of time on hashing for SHA-256, \(n=256\), \(w=16\), \(h=10\), and \(d=2\).

During key generation, the stack consumption of XMSS is on average slightly higher than that of LMS. However, as shown in Table 8, during signing and verification it is \(1.6\times \) and almost \(4\times \) as high, respectively.

Table 8. Stack memory usage (bytes) for XMSS\(^{MT}\) and HSS using Gimli-Hash.

The round-reduced version of Keccak (Keccak-p[800, 12]) achieved the best performance (see Table 9, Table 10, and Table 11), while Gimli-Hash achieved the lowest stack consumption (see Table 12).

A complete overview of our results can be found in Appendix A.

6 Conclusion

We showed that the current reference implementation of LMS, with some required modifications, achieves good performance on a Cortex-M4. Further, we showed that the implemented modifications in XMSS lead to a significant speedup. Although the XMSS_SIMPLE version of XMSS is structurally very similar to LMS, LMS still achieves significantly better performance. Therefore, these performance differences are not based on properties of the schemes but rather on properties of the reference implementations. In addition, this work touches on the ongoing discussion about the appropriate choice of safety margins for round-based symmetric cryptographic primitives. Since post-quantum approaches are more resource-intensive than those currently in use, round-reduced and lightweight hash-function designs are worth considering in an embedded environment.

Our results based on reference implementations should merely give an idea on how practical the evaluated stateful schemes could be in an embedded environment.