LMS vs XMSS: Comparison of Stateful Hash-Based Signature Schemes on ARM Cortex-M4

Stateful hash-based signature schemes are among the most efficient approaches for post-quantum signature schemes. Although not suitable for general use, they may be suitable for some use cases on constrained devices. LMS and XMSS are hash-based signature schemes that are conjectured to be quantum secure. In this work, we compared multiple instantiations of both schemes on an ARM Cortex-M4. More precisely, we compared performance, stack consumption, and other figures for key generation, signing and verifying. To achieve this, we evaluated LMS and XMSS using optimised implementations of SHA-256, SHAKE256, Gimli-Hash, and different variants of Keccak. Furthermore, we present slightly optimised implementations of XMSS achieving speedups of up to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$3.11{\times }$$\end{document} for key generation, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$3.11{\times }$$\end{document} for signing, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$4.32{\times }$$\end{document} for verifying.


Introduction
Digital-signature schemes are among the most important and widely used cryptographic primitives. Schemes used in practice today (RSA [30], DSA [14], ECDSA [20], and EdDSA [4]) are based on assumptions regarding the computational difficulty of solving certain mathematical problems. Due to Shor's algorithm [32] and its variants, some of these problems, such as integer factorisation and discrete logarithms, can be efficiently solved on a quantum computer. Since the National Institute of Standards and Technology (NIST) started a project (NIST-PQC 1 ) to evaluate and standardise postquantum cryptographic algorithms, many solutions have been proposed. Hash-based signature schemes (HBS) are among the most attractive candidates for quantum-safe Cortex-M4. For this, we provide an adapted implementation of LMS for the Cortex-M4, which represents the first implementation to date to the best of the authors' knowledge. We evaluated suitable parameter sets for constrained devices from the NIST recommendation for stateful hash-based signature schemes [11]. Furthermore, deviating from the RFC 8391 [18], we slightly modified the reference implementation of XMSS, leading to noticeable speedups. We provide a comparative performance and stack consumption analysis for several parameter sets of the instantiated versions of LMS and XMSS. Thereby we instantiate both HBS with several optimised hash functions. All software and results described in this paper are available in the public domain. It is publicly available at https://doi.org/10.5281/zenodo.3631571. Further, we refer to the respective projects included in our implementation for licensing information.
Organisation. The remainder of this document is structured as follows. First, we start by giving preliminary information on hash-based signature schemes. In Sect. 3, we reflect the main structural differences between LMS and XMSS. Details about the implemented hash functions and the approaches to speed up XMSS are presented in Sect. 4. Our implementation results are given in Sect. 5. Next, we discuss the results and draw a conclusion in Sect. 6. Finally, Appendix A contains further evaluated results.

Hash-Based Signature Schemes
While the security of other post-quantum cryptographic approaches like isogeny-based cryptography is still object to further research, hash-based schemes come with wellunderstood security assumptions.
Both discussed stateful schemes in this work use a tree construction along with a variant of a one-time signature schemes (OTS). Unlike in stateless schemes, in LMS and XMSS the signer needs to keep track of which key pairs have already been used. Therefore, the current state (index) is stored in the secret key, indicating which key pair to use next. XMSS provides methods to decrease the worst case runtime by keeping state information beyond the index [9]. To allow a fair comparison, this have not been considered in this work.

One-Time Signature Schemes
Many techniques have been proposed for constructing OTS schemes [7,24,27]. One of the most prominent OTS is the Winternitz OTS (WOTS) scheme [27], which is relatively efficient, has been used in practice and allows space/time trade-offs. LMS and XMSS use variants of WOTS.
Winternitz One-Time Signature Scheme. The main idea of all WOTS variants is to use a function chain to sign multiple bits starting from random inputs. The key generation is processed as shown in Algorithm 1, where n is the security parameter, w is (a power of 2) the "Winternitz parameter", and f : {0, 1} * → {0, 1} n defines a one-way function. Thereby, f w−1 should be interpreted as the (w − 1)-th iteration of the one-way function f . Increasing the value of the Winternitz parameter w will linearly shrink the size of a signature and increase exponentially the effort to perform key generation, signing and verification. Thus, the Winternitz parameter w enables space/time trade-offs.

Algorithm 1: Key generation.
Input : security parameter n, Winternitz paramater w. Output: one-time key pair: (secret key X, public key Y ).
In order to protect against trivial attacks, a checksum C is computed and signed along with the message, as shown in Algorithm 2 in line 5-7. A signature is computed by mapping the i-th chunk of M to one intermediary value of the respective function chain, by iterating the one-way function M i times. As shown in Algorithm 3, in WOTS the public key can be calculated directly from the signature.
According to [13], assuming f is a collision-resistant one-way function, this scheme is existentially unforgeable under chosen-message attacks. XMSS makes use of the variant WOTS + . WOTS + , proposed by Hülsing [16], introduced a slight modification of the chaining function by adding a random bitmask r i for each iteration, such that f 0 (x) = x, and f i (x) = f ( f i−1 (x) ⊕ r i ) for i > 0. This modification eliminates the requirement for a collision resistant hash function.

Algorithm 3: Verifying.
Input : signature σ , message M, public key Y , security parameter n, Winternitz parameter w. Output: valid or invalid.

Many-Time Signature Schemes
Merkle trees enable the use of a single long-term public key created from a large set of OTS public keys. In the following we will only briefly describe the methods for the construction of many-time schemes and refer to [26] and [18] for further details on the respective approach.
Merkle Trees. Based on the idea of one-time signature schemes Merkle's approach [27] is to construct a balanced binary tree (a so-called Merkle Tree) using a given hash function to enable the use of a single public key (root of the tree) for verifying several messages. A signer generates 2 h one-time key pairs (X j ,Y j ) where 0 ≤ j < 2 h for a selected h ∈ N and h ≥ 2. The leaves of the tree are represented by the public keys X j of the OTS which are derived from the secret keys Y j for 0 ≤ j < 2 h . Parameter h defines the height of the resulting binary tree whose inner nodes are represented by the value computed as n = f (n l || n r ), where n l and n r are the values of the left and right children of n. To verify a signature at leaf with index i, one additionally needs the authentication path of i which is a sequence of h nodes. This authentication path contains the siblings of all the nodes on the path between leaf i and the root. Thus summarizing, a signature on a message m contains the one-time signature on m produced using X j , the authentication path, and the index j to indicate which key pair of the OTS was used.

Multi-trees.
Rather than scaling up a single tree, LMS and XMSS define single and multi-tree (hypertree) variants of their signature schemes. In the multi-tree variant, the trees on the lowest layer are used to sign messages and the trees on higher layers are used to sign the roots of the trees on the layer below. Considering a hypertree of total height h that has d layers of trees of height h/d, the top layer d − 1 contains one tree, layer d − 2 contains 2 (h/d) trees, and so on. Finally, the lowest layer contains 2 (h−(h/d)) trees. In order to generate the public key, only the single tree at the top of the structure needs to be generated. This requires generating the OTS keys along the bottom of this tree. The lower trees are generated deterministically as required. Thus, for a given h, key generation in a hypertree is faster than in a single tree. A signature consists of all the signatures on the way to the highest tree. Hence, the signature size increases and signing and verifying takes slightly longer. The root of the top-level tree is the public key. For further details on the multi-tree variants of LMS and XMSS, we refer to [26] and [18], respectively.

Comparison
Roughly speaking, LMS and XMSS have a very similar construction. Both schemes use Merkle trees [27] along with a variant of WOTS. For this reason, we will focus on the most relevant structural differences of the schemes. LMS and XMSS use different notations to specify equivalent parameters. As shown in Table 1, we define a common notation for parameters used in this work. For further details on the definition of the parameters, we refer to [26] and [18].

Prefixes and Bitmasks
In order to move away from collision resistance and towards collision resilience, within LMS and XMSS whenever an input is hashed, a specific prefix is added to the input. In the case of XMSS as mentioned in Sect. 2.1, WOTS + [16] requires a random bitmask for each chaining iteration as additional input. Although LMS and XMSS apply different mechanisms to strengthen the security, the underlying constructions are very similar. To describe this principle theoretically, Bernstein, Hülsing, Kölbl, Niederhagen, Rijneveld, Schwabe [5] introduced an abstraction called tweakable hash functions (Th) as follows.
Definition 1 (Tweakable hash function): Let n, α ∈ N, P be the public parameters space, and T be the tweak space. A tweakable hash function is an efficient function mapping an α-bit message M to an n-bit hash value MD using a public parameter P ∈ P, also called function key, and a tweak T ∈ T .
Thus, a tweakable hash function adds specific context information (tweak) and public parameters (function key) to the input. According to this definition, the constructions within LMS and XMSS can roughly be described as follows.

Construction 2 (Prefix and bitmask construction/XMSS): Given two hash functions
As defined in Construction 2, while XMSS additionally generates distinct random inputs for each invocation of the hash function, LMS provides inputs with predictable changes to the hash function. Construction 1 reduces the effort, but comes in return at the cost of stronger security assumptions. For further details on the security model of LMS and XMSS, we refer to [21] and for further security notions for the defined constructions, we refer to [5].

WOTS Public Key Compression
Both schemes combine the public keys (final values) of a WOTS chain into an n-bit value. While LMS hashes them together as a single message (see Fig. 2), XMSS uses a tree (called L-tree) to compress these values (see Fig. 1). The construction in XMSS obviously leads to a higher number of hash operations.

LMS and XMSS on the Cortex-M4
In the case of XMSS 3 , we removed all file-based procedures and implemented an interface to the pqm4 framework. For this, we used a slightly modified version of the pqm4 framework. This modification allows updating the secret key during the signing process by not passing the secret key as a constant. Thus, we enable the signing algorithm to be stateful. For further practical considerations around statefulness in this context, we refer to [25]. To port the reference implementation of LMS 4 to Cortex-M4, apart from smaller modifications, we integrated the single-thread version, and turned floating-point operations off.

Implemented Hash Functions
Primarily for the purpose of speedup and to achieve a broader comparison range, we integrated two more lightweight hash functions in addition to those recommended by NIST [11] (SHA-256 and SHAKE256) and already available in pqm4. In particular, we additionally evaluated LMS and XMSS using different variants of KECCAK and Gimli-Hash.

KECCAK-f [800]
. KECCAK-f describes a family of permutations originally specified in [1]. The KECCAK-p permutations within KECCAK-f are specified by a fixed width of the permutation (b) and the respective number of rounds (n r ) required. Furthermore, the permutation is denoted by In the case of KECCAK-f [800], we additionally considered a KECCAK permutation with only 12 rounds (KECCAK-p[800, 12] similar to River Keyak 5 ) to reduce the computational workload per hash invocation. Evidently, a reduced number of rounds provides a smaller safety margin than the full 22 rounds recommended for KECCAKf [800] [28]. Nevertheless, since the best known practical collision attack against SHA-3 exists only up to 5 rounds [15], the margin provided by 12 rounds is still comfortable. In a similar manner, Aumasson [2] proposed a general revision of the number of rounds of widely used symmetric primitives to speed up the standards without increasing the security risk. Furthermore to achieve a certain security level, we set the capacity c = 256 as specified in River Keyak (see footenote 5).
Gimli-Hash. The family of hash functions Gimli-Hash is built on top of a 384-bit permutation called Gimli. The Gimli permutation [6] was designed to achieve high security with high performance. According to the authors, the proposed permutation is distinguished from other permutation-based primitives for its high cross-platform performance. Furthermore, one of the core idea of Gimli was to define one standard that Fig. 2. Overview without L-trees (adopted from [34], Fig. 1). Grey nodes are the private keys and the black nodes the public keys of the WOTS chains. The black node at the top is the public key.
achieves high performance in lightweight as well as in non-lightweight environments. Due to the selected design, Gimli fits into 14 easily usable integer registers on 32-bit ARM microcontrollers. Gimli-Hash works on a 48-byte state with a rate of 16-byte.
We chose Gimli-Hash as an exemplary approach for the current round-2 candidates in NIST's Lightweight Cryptography Standardisation 6 process. It is of practical importance to investigate the performance of the remaining candidates.

Speeding up XMSS
In this section, we discuss three methods for speeding up XMSS deviating from RFC 8391 [18]. The first described technique replaces the tree-based WOTS public-key compression with a single hash call. This approach was first proposed in SPHINCS+ [3]. The second one, a structure omitting the use of bitmasks (the so-called "simple" version) was proposed in the round-2 submission of SPHINCS+ [3] at NIST-PQC. Finally, we describe a technique called "hash pre-computation". This approach was first mentioned by Kampanakis, Fluhrer [21] and first described by Wang, Jungk, Wälde, Deng, Gupta, Szefer, Niederhagen [34]. Thereby, recurring intermediate results of a certain type of hash calls are temporarily stored and reused in the subsequent hash calls.
All these methods lead to speedups during key generation, signing and verifying. However, during the signature verification, the hash pre-computation method only leads to small speedups in certain parameter sets. Although the methods presented in the following can also be implemented in other cases, in this work we will mainly focus on the parameter sets from Table 3. Other approaches, which lead to possible speedups in both LMS and XMSS, were intentionally not considered in this work.
Other acceleration methods, such as storing some top nodes in the secret key [12], applying a more efficient tree traversal scheme [33] (already part of the XMSS reference implementation 7 and our implementation), or instantiating the schemes with shorter hash functions, were intentionally not considered in this work. Although these methods lead to significant speedups, they can be applied in LMS and XMSS and therefore have no fundamental impact in our comparison.
The instantiation of the different parameter sets is managed by conditional compilation. In the case of XMSS, the modifications presented in this section are also controlled by preprocessing allowing to compile different versions of XMSS. Fig. 2) with a single call to a tweakable hash function, as shown in Fig. 2. A tree-based compression (see Ltrees in Fig. 1) is slower than using a single call to a tweakable hash function with the concatenated digest of all end nodes of the WOTS chains (see black nodes in Fig. 2) as input.

Tree-Less WOTS+ Public Key Compression. As described in SPHINCS+ [3], we compress the end nodes of the WOTS chains (black nodes in
Bitmask-Less Hashing. In this construction no bitmasks are generated and XORed with the input of the tweakable hash functions. In this case, the tweakable hash function is defined according to Construction 1 instead of Construction 2 (see Sect. 3.1). For the resulting implications for security by applying Construction 1 in XMSS, we strongly refer to [5].
Hash Pre-computation. Within XMSS, for a given key pair and a security parameter n, the first 2n-bit block (n-bit domain separator and n-bit hash-function key) of the input to the pseudo-random function (of type F : {0, 1} 3n → {0, 1} n ) is the same for all calls. Considering this fact, we store the digest of the first 2n-bit block at the first call to the pseudorandom function (PRF) and skip this effort by reusing this result in all further calls. This approach can easily be applied whenever the internal block size/rate of the used hash function is less than or equal to 2n bits. Depending on the internal block size of the used hash function, the number of saved calls to the internal compression respectively permutation function (Speedup PRF ) can be calculated as follows. Let B bits ≥ 2n bits be the internal block size/rate in bits and #call PRF be the number of calls to the PRF, then Speedup PRF (B bits , #call PRF ) = 2n bits/B bits * #call PRF . Fig. 3 exemplified for the case of KECCAK-f [800] and n = 256, this method can basically be applied in every sponge construction, by reducing the rate to 2n bits whenever the rate is longer than 2n bits. Hence, even in the case n = 256, it can be implemented in SHAKE256 (KECCAK-f [1600]) by reducing the width of the rate from 1088 bits to 2n bits. However, in hash calls apart from the PRF invocations this would increase the number of permutations required for inputs longer than 2n bits. A "hybrid approach" (not considered in this paper) with variable rate width (512 bits for PRF calls and 1088 for other hashing cases) could lead to a possible acceleration.

As in
In the case of SHA-256 and n = 256, where the 512-bit block fits into a 512 bit SHA-256 internal block, this approach reduces the number of calls to the compression function by half. According to the standard definition [28] in KECCAK-f [800] with a capacity of 256-bit length, the length of the rate should be 544 bits. In order to enable hash pre-computation, we reduced the length of the rate to 512 bits. In other words, the rate within an instantiation of XMSS using KECCAK-f [800] applying hash precomputation is 512 bits long, while a version without hash pre-computation makes use of the whole 544 bits. This modified design with a longer capacity obviously has no negative influence on the security of the hash function. In the case of KECCAK-f [800], this approach reduces the number of required permutations by half. Since the rate in the sponge construction within Gimli-Hash is 128 bits long, it results in saving 4 permutation runs per PRF invocation.
From now on as shown in Table 2, we call an implementation of XMSS with L-trees using Construction 2 (see Sect. 3.1) without hash pre-computation XMSS_ROBUST, the variant without L-trees using Construction 1 XMSS_SIMPLE, and the one without L-trees applying Construction 1 and hash pre-computation XMSS_SIMPLE+PRE. The multi-tree variants are called XMSS MT ROBUST, XMSS MT SIMPLE, and XMSS MT SIMPLE+PRE, respectively. XMSS_ROBUST and XMSS MT ROBUST represent the current version of XMSS from RFC 8391 8 . Table 2. Implemented variants of XMSS.

Design
Multi-tree Tree-less WOTS+ Bitmask-less hashing Pre-computation

Evaluation
We measured the performance of our implementations on a commercially available microcontroller. We use the widely available board STM32F4DISCOVERY featuring a 32-bit ARM Cortex-M4 with FPU core, 1-Mbyte Flash ROM, and 192-Kbyte RAM. The reference implementation of LMS 9 and XMSS 10 provided the basis for our implementation. The methods used for cycle counter reading, device communication at runtime, and hardware-based random byte generation were provided by the pqm4 11 framework. This framework in turn includes the libopencm3 12 library for providing these methods. All test instances were compiled with GNU Tools for ARM Embedded Processors 9-2019-q4-major 13  In this work, LMS and XMSS share the same implementations to perform the hash computations, clock-cycle measurement, and stack analysis, hence yielding an unbiased comparison. The selection of the evaluated parameter sets is based on the recommendation of NIST [11]. The parameter sets from Table 3 were implemented in combination with Gimli-Hash, KECCAK (KECCAK-p[800, 22] and KECCAK-p[800, 12]), SHAKE256, and SHA-256. The resulting signature size for each parameter set is also shown in Table 3.
As shown in Table 4, the implemented modifications in XMSS and XMSS MT lead to significant speedups. XMSS_SIMPLE achieves a speedup of up to 3.03× for key generation and signing, and up to 4.32× for verifying. In combination with the hash precomputation approach, key generation and signing achieve accelerations up to 3.11 times. However, when applying the hash pre-computation method, a speedup only occurs in certain parameter sets, mostly when the number of rounds of the hash function and the number of calls to the PRF are large enough to compensate for the additional effort. In the case of verification, a speedup through hash pre-computation occurred rarely (see Table 9 and Table 10). Reducing the number of rounds in KECCAK-f [800] to 12 instead of 22 yields a speedup of up to roughly 1.66× for key generation and signing, and 1.72× for verifying in all implemented variants of XMSS, and up to roughly 1.70× for key generation and signing, and 1.76× for verifying in all implemented variants of LMS (see Table 9, Table 10, and Table 11). Structurally, XMSS_SIMPLE, the variant without L-trees using Construction 1, differs only marginally from LMS. To confirm this analysis, we measured the number of hash operations required in LMS and XMSS_SIMPLE. As Table 5 shows, XMSS_SIMPLE and XMSS MT SIMPLE hash operations are almost equivalent to LMS and HSS, respectively. As shown in Table 6, although the changes in XMSS result in a slightly smaller number of hash calls than in LMS, LMS unexpectedly requires fewer clock cycles for all tested cases. We further measured the time spent performing hash operations for each scheme. The results of this measurement are given in Table 7. In both schemes, at least 85% of the time was spent on performing the hash computations. XMSS spends 15% of the evaluated time on computing other operations, while LMS spends up to 94% of time on hashing. During key generation, the stack consumption of XMSS is on average slightly higher than for LMS. However, as shown in Table 8, the difference during signing and verification is 1.6× and almost 4× as high, respectively. The round-reduced version of KECCAK (KECCAK-p[800, 12]) achieved the best performance (see Table 9, Table 10, and Table 11) while Gimli-Hash the lowest stack consumption (see Table 12).
A complete overview of our results can be found in Appendix A.

Conclusion
We showed that the current reference implementation of LMS with some required modifications achieves good performance results on a Cortex-M4. Further, we presented that the implemented modifications in XMSS lead to a significant speedup. Although the XMSS_SIMPLE version of XMSS is structurally very similar to LMS, LMS still achieves significantly better performance. Therefore, these performance differences are not based on properties of the schemes but rather on properties of the reference implementation. In addition, the currently discussed correct selection of safety margins for round-based symmetric cryptographic primitives is also considered in this work. In considering the fact that post-quantum approaches are more resource intensive than those currently in use, it is worth considering round-reduced and lightweight designs and concepts of hash functions in an embedded environment.
Our results based on reference implementations should merely give an idea on how practical the evaluated stateful schemes could be in an embedded environment.

A.1 Speed and Stack Memory
Speed is measured in CPU clock cycles. Stack memory (bytes) excludes the space required to store key material, messages, and in the case of hash pre-computation the intermediate state.