This section describes our implementation of X25519 on MSP430X microcontrollers, which is based on and improves the software presented in [21]. We implemented X25519 both for MSP430X devices that feature a 16-bit hardware multiplier and for those that feature a 32-bit hardware multiplier. We present execution results measured on an MSP430FR5969 [41], which has an MSP430X CPU, 64 kB of non-volatile memory (FRAM), 2 kB of SRAM, and a 32-bit memory-mapped hardware multiplier. The result of a \(16 \times 16\)-bit multiplication is available in 3 cycles on both types of MSP430X devices (cf. [41, 42]). Thus, our measurement results can be generalized to other microcontrollers from the MSP430X family.
All cycle counts presented in this section were obtained by executing the code on an MSP-EXP430FR5969 Launchpad development board and measuring the execution time using the debugging functionality of the IAR Embedded Workbench IDE.
The MSP430X
The MSP430X has a 16-bit RISC CPU with 27 core instructions and 24 emulated instructions. The CPU has sixteen 16-bit registers. Of those, only R4 to R15 are freely usable working registers, while R0 to R3 are special-purpose registers (program counter, stack pointer, status register, and constant generator). All instructions execute in one cycle if they operate on contents that are stored in CPU registers. However, the overall execution time of an instruction depends on the instruction format and addressing mode. The CPU features 7 addressing modes. While indirect auto-increment mode leads to a shorter instruction execution time than indexed mode, only indexed mode can be used to store results in RAM.
We consider MSP430X microcontrollers that feature a memory-mapped hardware multiplier, which works in parallel to the CPU. Four types of multiplication are supported, namely signed and unsigned multiply as well as signed and unsigned multiply-and-accumulate. The multiplier registers have to be loaded with CPU instructions. The hardware multiplier stores the result in two (in case of 16-bit multipliers) or four (in case of 32-bit multipliers) 16-bit registers. Furthermore, for the multiply-and-accumulate operation, a SUMEXT register indicates whether the accumulation has produced a carry bit. However, it is not possible to accumulate carries in SUMEXT. The time required for the execution of a multiplication is determined by the time that it takes to load operands to and store results from the peripheral multiplier registers.
The MSP430FR5969 (the target under consideration) belongs to a new MSP430X series featuring FRAM technology for non-volatile memory. This technology has two benefits compared to flash memory: it reduces power consumption during memory writes and increases the number of possible write operations. As a drawback, however, while the maximum operating frequency of the MSP430FR5969 is 16 MHz, the FRAM can only be accessed at 8 MHz. Hence, wait cycles have to be introduced when operating the MSP430FR5969 at 16 MHz. For all cycle counts that we present in this section we assume a core clock frequency of 8 MHz. Increasing this frequency on the MSP430FR5969 would incur a penalty resulting from those wait cycles. Note that this is not the case for MSP430X devices that use flash technology for non-volatile memory.
Multiplication
In our MSP430X implementation we use an unsigned radix-\(2^{16}\) representation for field elements. An element f in \(\mathbb {F}_{2^{255}-19}\) is thus represented as \(f=\sum _{i=0}^{15} f_i2^{16i} \,\hat{=}\,(f_0, f_1, \ldots , f_{15})\) with \(f_i \in \{0, \ldots , 2^{16}-1\}\). To conform with other implementations of X25519, we consider inputs and outputs to and from the scalar multiplication on Curve25519 to be 32-byte arrays. Thus, conversions to and from the internal representation have to be executed at the beginning and the end of the scalar multiplication. As reduction modulo \(2^{255}-19\) requires bit shifts in the chosen representation of field elements, we reduce intermediate results modulo \(2^{256}-38\) during the entire execution of the scalar multiplication and only reduce the final result modulo \(2^{255}-19\).
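The representation and the two moduli can be illustrated with a short Python model. This is only a sketch and not the assembly or C code of our implementation: the function names are hypothetical, and the little-endian byte order of the 32-byte arrays is an assumption that follows the usual X25519 encoding.

```python
P = 2**255 - 19      # field prime
M = 2**256 - 38      # modulus used for intermediate results; M == 2 * P

def bytes_to_limbs(data: bytes) -> list:
    """Unpack a 32-byte little-endian array into 16 limbs of 16 bits each."""
    if len(data) != 32:
        raise ValueError("expected exactly 32 bytes")
    return [data[2 * i] | (data[2 * i + 1] << 8) for i in range(16)]

def limbs_to_bytes(f: list) -> bytes:
    """Pack 16 limbs of 16 bits back into a 32-byte little-endian array."""
    return b"".join(limb.to_bytes(2, "little") for limb in f)

def limbs_to_int(f: list) -> int:
    """Evaluate f = sum(f_i * 2^(16 i)) as a Python integer."""
    return sum(limb << (16 * i) for i, limb in enumerate(f))

def final_reduce(x: int) -> int:
    """Map a value that was kept modulo M to its representative in [0, P)."""
    return x % P
```

Because \(2^{256}-38 = 2 \cdot (2^{255}-19)\), a value reduced modulo \(2^{256}-38\) stays in the correct residue class modulo \(2^{255}-19\), and \(2^{256} \equiv 38\) means that overflowing words can be folded back with a word-aligned multiplication by 38.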
Hinterwälder, Moradi, Hutter, Schwabe, and Paar presented and compared implementations of various multiplication techniques on the MSP430X architecture in [21]. They considered carry-save, operand-caching, and constant-time Karatsuba multiplication, for which they used the operand-caching technique for the computation of intermediate results. Among those implementations, the Karatsuba implementation performed best. To the best of the authors' knowledge, the fastest previously reported result for 256-bit multiplication on MSP430X devices was presented by Gouvêa et al. [18]. In their work the authors used the product-scanning technique for the multi-precision multiplication. We implemented and compared the product-scanning multiplication and the constant-time Karatsuba multiplication, and this time used the product-scanning technique for the computation of intermediate results of the Karatsuba implementation. It turns out that on devices that have a 16-bit hardware multiplier, the constant-time Karatsuba multiplication performs best. On devices that have a 32-bit hardware multiplier, the product-scanning technique performs better than constant-time Karatsuba, as it makes best use of the 32-bit multiply-and-accumulate unit of the memory-mapped hardware multiplier. We thus use constant-time Karatsuba in our implementation of X25519 on MSP430X microcontrollers that have a 16-bit hardware multiplier and the product-scanning technique for our X25519 implementation on MSP430Xs that have a 32-bit hardware multiplier.
In our product-scanning multiplication implementation, where \(h=f \times g \mod 2^{256}-38\) is computed, we first compute the coefficients of the double-sized array, which results from multiplying f with g and then reduce this result modulo \(2^{256}-38\). We only have 7 general-purpose registers available to store input operands during the multiplication operation. Hence, we cannot store all input operands in working registers, but we keep as many operands in them as possible. For the computation of a coefficient of the double-sized array, which results from multiplying f by g, one has to access the contents of f in incrementing and g in decrementing order, e.g. the coefficient \(h_2\) is computed as \(h_2 = f_0 g_2 + f_1 g_1 + f_2 g_0\). As there is no indirect auto-decrement addressing mode available on the MSP430X microcontroller, we put the contents of g on the stack in reverse order at the beginning of the multiplication, which allows us to access g using indirect auto-increment addressing mode for the remaining part of the multiplication. Including function-call and reduction overhead, our 32-bit product-scanning multiplication implementation executes in 2079 cycles on the MSP430FR5969. Without function call and modular reduction, it executes in 1693 cycles.
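The access pattern of the product-scanning technique can be sketched in Python as follows. The model works on 16-bit limbs and ignores register allocation and the 32-bit multiply-and-accumulate unit; the function name and the unbounded accumulator `acc` are illustrative assumptions. The reversed copy `g_rev` mirrors placing g on the stack in reverse order, so that both operands are read with increasing indices.

```python
def product_scanning(f, g):
    """Return the 2n-limb product of two n-limb radix-2^16 operands."""
    n = len(f)
    g_rev = g[::-1]                  # models g stored on the stack in reverse order
    h = []
    acc = 0                          # models the multiply-and-accumulate result
    for k in range(2 * n - 1):       # one output coefficient (column) per iteration
        lo = max(0, k - n + 1)
        hi = min(k, n - 1)
        for i in range(lo, hi + 1):
            # f[i] and g_rev[...] are both read with increasing index;
            # g_rev[n - 1 - (k - i)] is the same limb as g[k - i].
            acc += f[i] * g_rev[n - 1 - (k - i)]
        h.append(acc & 0xFFFF)       # store the finished coefficient
        acc >>= 16                   # keep the carry for the next column
    h.append(acc)                    # most significant coefficient
    return h
```

For example, column \(k=2\) accumulates \(f_0 g_2 + f_1 g_1 + f_2 g_0\), matching the expression for \(h_2\) given above.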
For MSP430X microcontrollers that have a 16-bit hardware multiplier we implemented the constant-time one-level Karatsuba multiplication (refer to Sect. 3). We use the product-scanning technique to compute the three intermediate results L, H and M. For the computation of L, H and M we have seven working registers available to store input operands. Hence, we can store almost the full input that is accessed in decrementing order in working registers and access the eighth required operand of it using indirect addressing mode. Again we first compute the double-sized array resulting from the multiplication of f and g and then reduce this result modulo \(2^{256}-38\). Our modular multiplication implementation dedicated to devices that have a 16-bit hardware multiplier executes in 3193 cycles including function call and modular reduction, and in 2718 cycles excluding those.
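The structure of a one-level Karatsuba multiplication can be sketched in Python as follows. The sketch assumes the subtractive variant, in which M is the product of the absolute differences of the operand halves; the details of Sect. 3 are not reproduced here, so this choice, the function names, and the sign handling are assumptions. Python integers stand in for the limb arrays, and the three half-size products would be computed with product scanning on the device. Python itself gives no timing guarantees, so the branch-free sign handling only illustrates the idea.

```python
HALF = 128                       # bits per half of a 256-bit operand
MASK = (1 << HALF) - 1

def abs_and_sign(d: int):
    """Return (|d|, s) with s = 1 if d < 0 else 0, without branching on d."""
    s = (d >> HALF) & 1          # arithmetic shift yields -1 for negative d
    return (d ^ -s) + s, s       # conditional two's-complement negation

def karatsuba_one_level(f: int, g: int) -> int:
    """Multiply two 256-bit integers using three 128-bit multiplications."""
    f_lo, f_hi = f & MASK, f >> HALF
    g_lo, g_hi = g & MASK, g >> HALF
    L = f_lo * g_lo                          # low product
    H = f_hi * g_hi                          # high product
    df, sf = abs_and_sign(f_lo - f_hi)
    dg, sg = abs_and_sign(g_lo - g_hi)
    M = df * dg                              # product of absolute differences
    t = sf ^ sg                              # 1 if the signed product is negative
    signed_M = (M ^ -t) + t                  # conditional negation of M
    middle = L + H - signed_M                # equals f_lo*g_hi + f_hi*g_lo
    return L + (middle << HALF) + (H << (2 * HALF))
```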
Squaring
In order to compute \(h=f^2 \mod 2^{256}-38\), we first compute a double-sized array resulting from squaring f and then reduce this result modulo \(2^{256}-38\). Similar to our multiplication implementation, we use the product-scanning technique for our implementation targeting devices that have a 32-bit hardware multiplier. We again store the input f on the stack in reverse order, allowing us to use indirect auto-increment addressing mode to access elements of f in decrementing order. As mentioned in Sect. 3, many multiplications of cross-product terms occur twice during the execution of the squaring operation. These do not have to be computed multiple times, but can be accounted for by multiplying an intermediate result by two, i.e. shifting it to the left by one bit. As shift operations on the result registers of the memory-mapped hardware multiplier are expensive, we move results of a multiplication back to CPU registers before executing this shift operation. Including function call and modular reduction overhead our squaring implementation executes in 1563 cycles on MSP430X microcontrollers that have a 32-bit hardware multiplier. Without reduction and function call this number decreases to 1171 cycles.
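The saving from the repeated cross-product terms can be sketched in Python as follows. The function name and the unbounded integers are illustrative assumptions; doubling `cross` corresponds to the left shift by one bit that is performed in CPU registers.

```python
def square_product_scanning(f):
    """Return the 2n-limb square of an n-limb radix-2^16 operand."""
    n = len(f)
    h = []
    carry = 0
    for k in range(2 * n - 1):               # one output coefficient per iteration
        cross = 0
        lo = max(0, k - n + 1)
        for i in range(lo, (k + 1) // 2):    # pairs (i, k - i) with i < k - i
            cross += f[i] * f[k - i]         # each cross product is computed once
        col = 2 * cross                      # doubling, i.e. a shift left by one bit
        if k % 2 == 0:
            col += f[k // 2] * f[k // 2]     # diagonal term occurs only once
        col += carry
        h.append(col & 0xFFFF)
        carry = col >> 16
    h.append(carry)                          # most significant coefficient
    return h
```

For 16 limbs this needs 120 cross products plus 16 diagonal products, i.e. 136 word multiplications instead of the 256 of a general multiplication.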
Our squaring implementation for MSP430X microcontrollers that have a 16-bit hardware multiplier follows the constant-time Karatsuba approach, where intermediate results are computed using the product-scanning technique. This function executes in 2426 cycles including function call and reduction overhead and in 1935 cycles without.
Putting it together
We implemented all finite-field arithmetic in assembly language and all curve arithmetic as well as the conversion to and from the internal representation in C.
The x-coordinate-only doubling formula requires a multiplication by the constant 121666. One peculiarity of the MSP430 hardware multiplier greatly improves the performance of the computation of \(h=f \cdot 121666 \mod 2^{256}-38\): the contents of the hardware multiplier’s MAC registers do not have to be loaded again as long as the processed operand does not change. With a 32-bit hardware multiplier we proceed as follows: The number 121666 can be written as \(1 \cdot 2^{16} + 56130\). We store the value 1 in MAC32H and 56130 in MAC32L and then, during each iteration, load two consecutive coefficients of the input array f, i.e. \(f_i\) and \(f_{i+1}\), to OP2L and OP2H for the computation of two coefficients of the resulting array, namely \(h_i\) and \(h_{i+1}\). The array that results from computing \(f \cdot 121666\) is only two elements longer than the input array; we reduce it as the next step. Using this method, the multiplication by 121666 executes in 352 cycles on MSP430s that have a 32-bit hardware multiplier, including function call and reduction.
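The 32-bit procedure can be modelled in Python as follows. The sketch is hypothetical and only mirrors the data flow: `const` stands for the contents of MAC32H:MAC32L, `op2` for OP2H:OP2L, and `acc` for the accumulated result registers. The subsequent reduction is not included.

```python
def mul121666_32bit(f):
    """Multiply a 16-limb radix-2^16 array by 121666; returns 18 limbs."""
    const = (1 << 16) | 56130        # MAC32H = 1, MAC32L = 56130, i.e. 121666
    out = []
    acc = 0                          # models the accumulating result registers
    for i in range(0, 16, 2):
        op2 = f[i] | (f[i + 1] << 16)    # two consecutive coefficients per iteration
        acc += op2 * const               # one 32 x 32-bit multiply-and-accumulate
        out.append(acc & 0xFFFF)         # coefficient h_i
        out.append((acc >> 16) & 0xFFFF) # coefficient h_{i+1}
        acc >>= 32                       # keep the carry for the next iteration
    out.append(acc & 0xFFFF)         # the two extra coefficients
    out.append(acc >> 16)
    return out
```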
For the 16-bit hardware multiplier version, we follow a slightly different approach. As we cannot store the full number 121666 in the input register of the hardware multiplier, we proceed as follows: To compute \(h=f \cdot 121666 \mod 2^{256}-38\) we store the value 56130 in the hardware-multiplier register MAC. We then compute each \(h_i\) as \(h_i = f_i \cdot 56130 + f_{i-1}\) for \(i \in [1 \dots 15]\) such that we add the \((i-1)\)th input coefficient to the multiplier’s result registers RESLO and RESHI. This step takes care of the multiplication with \(1 \cdot 2^{16}\) for the \((i-1)\)th input coefficient. We further load the ith input coefficient to the register OP2, thus executing the multiply-and-accumulate instruction to compute the ith coefficient of the result. Special care has to be taken with the coefficient \(h_0\), where \(h_0 = f_0 \cdot 56130 + 38 \cdot f_{15}\). The method executes in 512 cycles including function call and reduction overhead.
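The 16-bit procedure can be sketched in Python as follows. The function name is hypothetical, and the way the final carry is folded back (two passes that multiply the carry by 38) is an assumption, since the text above does not spell out this step. The result fits in 16 limbs and is congruent to \(f \cdot 121666\) modulo \(2^{256}-38\).

```python
def mul121666_16bit(f):
    """Return 16 limbs congruent to f * 121666 modulo 2^256 - 38."""
    h = []
    acc = 38 * f[15]                 # f_15 * 2^256 wraps around as 38 * f_15
    for i in range(16):
        if i > 0:
            acc += f[i - 1]          # contribution of f_{i-1} * 2^16
        acc += f[i] * 56130          # multiply-and-accumulate with MAC = 56130
        h.append(acc & 0xFFFF)
        acc >>= 16
    for _ in range(2):               # assumed carry folding: 2^256 = 38 (mod 2^256 - 38)
        acc *= 38
        for i in range(16):
            acc += h[i]
            h[i] = acc & 0xFFFF
            acc >>= 16
    return h
```

Two folding passes suffice in this model: if the first pass overflows again, the remaining value is small, so adding 38 once more cannot produce a further carry.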
The reduction of a double-sized array modulo \(2^{256}-38\) is implemented in a similar fashion. We store the value 38 in the MAC register of the hardware multiplier. We then add the ith coefficient of the double-sized input to the result registers of the hardware multiplier and load the \((i+16)\)th coefficient to the OP2 register. In the 32-bit version of this reduction implementation the only difference is that two consecutive coefficients can be processed in each iteration, i.e. the ith and \((i+1)\)th coefficients are added to the result registers and the \((i+16)\)th and \((i+17)\)th coefficients are loaded to the OP2 registers.
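A Python model of the 16-bit variant of this reduction is given below. The function name is hypothetical, and the handling of the carry that remains after the last coefficient (two folding passes with the factor 38) is an assumption that is not taken from the description above.

```python
def reduce_double(t):
    """Reduce a 32-limb radix-2^16 array to 16 limbs modulo 2^256 - 38."""
    h = []
    acc = 0
    for i in range(16):
        acc += t[i]                  # add the i-th coefficient to the result registers
        acc += 38 * t[i + 16]        # multiply-and-accumulate with MAC = 38
        h.append(acc & 0xFFFF)
        acc >>= 16
    for _ in range(2):               # assumed folding of the remaining carry
        acc *= 38
        for i in range(16):
            acc += h[i]
            h[i] = acc & 0xFFFF
            acc >>= 16
    return h
```

The output is congruent to the input modulo \(2^{256}-38\) and fits in 16 limbs, but it is not necessarily the smallest non-negative representative.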
The modular addition \(h=f+g \mod 2^{256}-38\), which executes in 186 cycles on the MSP430, first adds the two most significant words of f and g. It then extracts the carry and the most significant bit of this result and multiplies those with 19. This is added to the least significant word of f. All other coefficients of f and g are added with carry to each other. The carry resulting from the addition of the second most significant words of f and g is added to the sum that was computed first.
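The addition can be sketched in Python as follows. The function name is hypothetical and the model ignores the order in which the hardware processes the words. The sketch checks the property that matters for the final result: the 16-limb output is congruent to \(f+g\) modulo \(2^{255}-19\).

```python
def add_limbs(f, g):
    """Return 16 limbs congruent to f + g modulo 2^255 - 19."""
    top = f[15] + g[15]              # 17-bit sum of the most significant words
    q = top >> 15                    # carry and most significant bit, a value in 0..3
    top &= 0x7FFF                    # remove the part that represents q * 2^255
    h = []
    acc = 19 * q                     # q * 2^255 is congruent to 19 * q
    for i in range(15):              # add the remaining words with carry
        acc += f[i] + g[i]
        h.append(acc & 0xFFFF)
        acc >>= 16
    h.append(top + acc)              # add the last carry to the sum computed first
    return h
```

The top word cannot overflow in this model: after masking it is at most 0x7FFF, and the incoming carry is at most 2.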
For the computation of \(h=f-g\), we first subtract g with borrow from f. If the subtraction of the most significant words produces a negative result, the carry flag is cleared, while it is set if the result is non-negative. We add this carry flag to a register tmp that was previously set to 0xffff, so that tmp contains 0xffff in case of a negative result and 0 otherwise. We AND tmp with 38, subtract this value from the lowest resulting coefficient, and ripple the borrow through. If this procedure again yields a negative result, it is corrected using the same method, but without rippling the borrow. This modular subtraction executes in the same time as the modular addition, i.e. in 199 cycles including function-call overhead.
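The masked correction can be sketched in Python as follows. The function name and the variable names are hypothetical; `1 - borrow` models the MSP430 carry flag, which is cleared when a subtraction borrows.

```python
MASK = 0xFFFF

def sub_limbs(f, g):
    """Return 16 limbs congruent to f - g modulo 2^256 - 38."""
    h = []
    borrow = 0
    for fi, gi in zip(f, g):             # subtract g from f with borrow
        d = fi - gi - borrow
        h.append(d & MASK)
        borrow = (d >> 16) & 1           # 1 if the word subtraction went negative
    # First correction: tmp is 0xffff for a negative result and 0 otherwise.
    tmp = (0xFFFF + (1 - borrow)) & MASK
    d = h[0] - (tmp & 38)
    h[0] = d & MASK
    borrow = (d >> 16) & 1
    for i in range(1, 16):               # ripple the borrow through
        d = h[i] - borrow
        h[i] = d & MASK
        borrow = (d >> 16) & 1
    # Second correction: same mask, but no rippling is needed, because a
    # second borrow implies that the lowest word is at least 0x10000 - 37.
    tmp = (0xFFFF + (1 - borrow)) & MASK
    h[0] = (h[0] - (tmp & 38)) & MASK
    return h
```

Each borrow out of the most significant word adds \(2^{256}\) to the value, and subtracting 38 turns this into an addition of \(2^{256}-38\), which leaves the residue class unchanged.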