Sandy2x: New Curve25519 Speed Records
Abstract
This paper sets speed records on well-known Intel chips for the Curve25519 elliptic-curve Diffie-Hellman scheme and the Ed25519 digital signature scheme. In particular, it takes only \(159\,128\) Sandy Bridge cycles or \(156\,995\) Ivy Bridge cycles to compute a Diffie-Hellman shared secret, while the previous records are \(194\,036\) Sandy Bridge cycles or \(182\,708\) Ivy Bridge cycles.
There have been many papers analyzing elliptic-curve speeds on Intel chips, and they all use Intel's serial \(64 \times 64 \rightarrow 128\)-bit multiplier for field arithmetic. These papers have ignored the 2-way vectorized \(32 \times 32 \rightarrow 64\)-bit multiplier on Sandy Bridge and Ivy Bridge: it seems obvious that the serial multiplier is faster. However, this paper uses the vectorized multiplier, and sets the first speed records for elliptic-curve cryptography obtained with a vectorized multiplier on Sandy Bridge and Ivy Bridge. Our work suggests that the vectorized multiplier might be a better choice for elliptic-curve computation, or even other types of computation that involve prime-field arithmetic, even in the case where the computation does not exhibit very nice internal parallelism.
Keywords
Elliptic curves · Diffie-Hellman · Signatures · Speed · Constant time · Curve25519 · Ed25519 · Vectorization

1 Introduction
In 2006, Bernstein proposed Curve25519, which uses a fast Montgomery curve for Diffie-Hellman (DH) key exchange. In 2011, Bernstein, Duif, Lange, Schwabe, and Yang proposed the Ed25519 digital signature scheme, which uses a fast twisted Edwards curve that is birationally equivalent to the same Montgomery curve. Both schemes feature a conservative 128-bit security level, very small key sizes, and consistently fast speeds on various CPUs (cf. [1, 8]), as well as microprocessors such as ARM ([3, 16]), Cell ([2]), etc.
Curve25519 and Ed25519 have gained public acceptance and are used in many applications. The IANIX site [17] maintains lists of Curve25519 and Ed25519 deployment, which include the Tor anonymity network, the QUIC transport-layer network protocol developed by Google, OpenSSH, and many more.
This paper presents Sandy2x, new software that sets speed records for Curve25519 and Ed25519 on the Intel Sandy Bridge and Ivy Bridge microarchitectures. Previous software set speed records for these CPUs using the serial multiplier; Sandy2x instead uses a vectorized multiplier. Our results show that previous elliptic-curve cryptography (ECC) papers using the serial multiplier might have made a suboptimal choice.
A part of our software (the code for Curve25519 shared-secret computation) has been submitted to the SUPERCOP benchmarking toolkit, but the speeds have not been included on the eBACS [8] site yet. We plan to submit the whole software soon for public use.
1.1 Serial Multipliers Versus Vectorized Multipliers
Prime-field elements are usually represented as big integers in software. The integers are usually divided into several small chunks called limbs, so that field operations can be carried out as sequences of operations on limbs. Algorithms involving field arithmetic are usually bottlenecked by multiplications, which are composed of limb multiplications. On Intel CPUs, each core has a powerful \(64 \times 64 \rightarrow 128\)-bit serial multiplier, which is convenient for limb multiplications. Many ECC papers use the serial multiplier for field arithmetic: for example, [1] uses the serial multipliers on Nehalem/Westmere, [6] uses the serial multipliers on Sandy Bridge, and [5] uses the serial multipliers on Ivy Bridge.
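As a concrete illustration of the limb decomposition (a hypothetical Python sketch, not taken from any of the cited implementations), an element of \(\mathbb{F}_{2^{255}-19}\) can be split into five 51-bit limbs and recombined:

```python
# Hypothetical sketch: radix-2^51 representation of an element of GF(2^255 - 19).
P = 2**255 - 19
MASK51 = (1 << 51) - 1

def to_limbs(x):
    """Split 0 <= x < p into five limbs: x = sum(limbs[i] * 2^(51*i))."""
    return [(x >> (51 * i)) & MASK51 for i in range(5)]

def from_limbs(limbs):
    """Recombine limbs into an integer modulo p."""
    return sum(l << (51 * i) for i, l in enumerate(limbs)) % P

x = 0x48C9B  # any value below p works
assert from_limbs(to_limbs(x)) == x
```

Field operations then act on the five limbs instead of on one 255-bit integer.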
On some other chips, it is better to use a vectorized multiplier. The Cell Broadband Engine has 7 Synergistic Processor Units (SPUs) which are specialized for vectorized instructions; the primary processor has no chance to compete with them. ARM has a 2-way vectorized \(32 \times 32 \rightarrow 64\)-bit multiplier, which is clearly stronger than the \(32 \times 32 \rightarrow 64\)-bit serial multiplier. A few ECC papers exploit the vectorized multipliers, including [3] for ARM and [2] for Cell. In 2014 there was finally one effort to use a vectorized multiplier on Intel chips, namely [4]. That paper uses vectorized multipliers to carry out hyperelliptic-curve cryptography (HECC) formulas that provide a natural 4-way parallelism. ECC formulas do not exhibit such nice internal parallelism, so vectorization is expected to induce much more overhead than in HECC.
Our speed records rely on a 2-way vectorized multiplier on Sandy Bridge and Ivy Bridge. The vectorized multiplier carries out only a pair of \(32 \times 32 \rightarrow 64\)-bit multiplications per instruction, which does not seem to have any chance to compete with the \(64 \times 64 \rightarrow 128\)-bit serial multiplier used to set speed records in previous Curve25519/Ed25519 implementations. In this paper we investigate how serial multipliers and vectorized multipliers work (Sect. 2), and give arguments for why the vectorized multiplier can compete.
Table 1. Performance results for Curve25519 and Ed25519 of this paper, the CHES 2011 paper [1], and the implementation by Andrew Moon "floodyberry" [7]. All implementations are benchmarked on the Sandy Bridge machine "h6sandy" and the Ivy Bridge machine "h9ivy" (abbreviated as SB and IB in the table), of which the details can be found on the eBACS website [8]. Each cycle count listed is the measurement result of running the software on one CPU core, with Turbo Boost disabled. The table sizes (in bytes) are given in two parts: read-only memory size \(+\) writable memory size.
| Operation | SB cycles | IB cycles | Table size | Reference | Implementation |
|---|---|---|---|---|---|
| Curve25519 public-key generation | \(54\,346\) | \(52\,169\) | 30720 + 0 | (new) this paper | |
| | \(61\,828\) | \(57\,612\) | 24576 + 0 | [7] | |
| | \(194\,165\) | \(182\,876\) | 0 + 0 | [1] CHES 2011 | amd64-51 |
| Curve25519 shared-secret computation | \(159\,128\) | \(156\,995\) | 0 + 0 | (new) this paper | |
| | \(194\,036\) | \(182\,708\) | 0 + 0 | [1] CHES 2011 | amd64-51 |
| Ed25519 public-key generation | \(57\,164\) | \(54\,901\) | 30720 + 0 | (new) this paper | |
| | \(63\,712\) | \(59\,332\) | 24576 + 0 | [7] | |
| | \(64\,015\) | \(61\,099\) | 30720 + 0 | [1] CHES 2011 | amd64-51-30k |
| Ed25519 sign | \(63\,526\) | \(59\,949\) | 30720 + 0 | (new) this paper | |
| | \(67\,692\) | \(62\,624\) | 24576 + 0 | [7] | |
| | \(72\,444\) | \(67\,284\) | 30720 + 0 | [1] CHES 2011 | amd64-51-30k |
| Ed25519 verification | \(205\,741\) | \(198\,406\) | 10240 + 1920 | (new) this paper | |
| | \(227\,628\) | \(204\,376\) | 5120 + 960 | [7] | |
| | \(222\,564\) | \(209\,060\) | 5120 + 960 | [1] CHES 2011 | amd64-51-30k |
1.2 Performance Results
The performance results for our software are summarized in Table 1, along with the results for [1, 7]. [1] is chosen because it holds the speed records on the eBACS site for publicly verifiable benchmarks [8]; [7] is chosen because it is the fastest constant-time public implementation of Ed25519 (and Curve25519 public-key generation) to our knowledge. The speeds of our software (like those of [1, 7]) are fully protected against simple timing attacks, cache-timing attacks, branch-prediction attacks, etc.: all load addresses, all store addresses, and all branch conditions are public.
For comparison, Longa reported \(\approx 298\,000\) Sandy Bridge cycles for the "ECDHE" operation, which is essentially 1 public-key generation plus 1 shared-secret computation, using Microsoft's 256-bit NUMS curve [19]. OpenSSL 1.0.2, after heavy optimization work from Intel, computes a NIST P-256 scalar multiplication in \(311\,434\) Sandy Bridge cycles or \(277\,994\) Ivy Bridge cycles.
For Curve25519 public-key generation, [7] and our implementation achieve much better results than [1] by performing the fixed-base scalar multiplication on the twisted Edwards curve used in Ed25519 instead of the Montgomery curve; see Sect. 3.2. Our implementation strategy for Ed25519 public-key generation and signing is the same as for Curve25519 public-key generation. Also see Sect. 3.1 for Curve25519 shared-secret computation, and Sect. 4 for Ed25519 verification.
We also include the table sizes of [1, 7] and Sandy2x in Table 1. Note that our current code uses the same window sizes as [1, 7] but larger tables for Ed25519 verification. This is because we use a data format that is not compact but more convenient for vectorization. Also note that [1] has two implementations for Ed25519: amd64-51-30k and amd64-64-24k. The table sizes of amd64-64-24k are \(20\,\%\) smaller than those of amd64-51-30k, but the speed records on eBACS are set by amd64-51-30k.
1.3 Other Fast Diffie-Hellman and Signature Schemes
On the eBACS website [8] there are a few DH schemes that achieve fewer Sandy/Ivy Bridge cycles for shared-secret computation than our software: gls254prot from [12] uses a GLS curve over a binary field; gls254 is a non-constant-time version of gls254prot; kummer from [4] is an HECC scheme; kumfp127g from [13] implements the same scheme as [4] but uses an obsolete approach to performing scalar multiplication on hyperelliptic curves, as explained in [4].
GLS curves are patented, making them much less attractive for deployment, and papers such as [14, 15] make binary-field ECC less confidence-inspiring. There are algorithms better than the rho method for high-genus curves; see, for example, [20]. Compared to these schemes, Curve25519, using an elliptic curve over a prime field, seems to be a more conservative (and patent-free) choice for deployment.
The eBACS website also lists some signature schemes that achieve better signing and/or verification speeds than our work. Compared to these schemes, Ed25519 has the smallest public-key size (32 bytes), fast signing speed (surpassed only by multivariate schemes with much larger key sizes), reasonably fast verification speed (which can be much better if batched verification is considered, as shown in [1]), and a high security level (128-bit).
2 Arithmetic in \(\mathbb{F}_{2^{255}-19}\)
Since the choice of radix is often platform-dependent, several radices have been used in existing software implementations of Curve25519 and Ed25519. This section describes and compares the radix-\(2^{51}\) representation (used by [1]) with the radix-\(2^{25.5}\) representation (used by [3] and this paper), and explains how a small-radix implementation can beat a large-radix one on Sandy Bridge and Ivy Bridge, even though the vectorized multiplier seems to be slower. The radix-\(2^{64}\) representation used by [1] appears to be slower than the radix-\(2^{51}\) representation for Curve25519 shared-secret computation, so only the latter is discussed in this section.
2.1 The Radix-\(2^{51}\) Representation
The radix-\(2^{51}\) representation is designed to fit the \(64 \times 64 \rightarrow 128\)-bit serial multiplier, which can be accessed using the mul instruction. The mul instruction is used as follows: given a 64-bit integer (either in memory or in a register) as operand, the instruction computes the 128-bit product of the integer and rax, and stores the higher 64 bits of the product in rdx and the lower 64 bits in rax.
The field multiplication function begins by computing \(f_0 g_0, f_0 g_1, \dots , f_0 g_4\). For each \(g_j\), \(f_0\) is first loaded into rax, and then a mul instruction is used to compute the product; some mov instructions are required to move rdx and rax to the registers where \(h_j\) is stored. Each monomial involving \(f_i\) where \(i > 0\) also takes a mul instruction, and an addition (add) and an addition with carry (adc) are required to accumulate the result into \(h_k\). Multiplications by 19 can be handled by the imul instruction. In total, it takes 25 mul, 4 imul, 20 add, and 20 adc instructions to compute \(h_0, h_1, \dots , h_4\)^{1}. Note that some carries are required to bring the \(h_k\) back to around 51 bits. We denote such a radix-\(2^{51}\) field multiplication including carries as \(\mathbf{m}\); \(\mathbf{m^{-}}\) represents \(\mathbf{m}\) without carries.
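The structure of this multiplication (25 limb products, 4 multiplications by 19, then a carry chain) can be modeled in Python as follows; this is an illustrative sketch, not the paper's assembly code:

```python
# Illustrative Python model of the radix-2^51 field multiplication:
# 25 limb products (mul), 4 multiplications by 19 (imul), then carries.
P = 2**255 - 19
MASK51 = (1 << 51) - 1

def fe51_mul(f, g):
    g19 = [19 * gj for gj in g[1:]]   # 19*g_1 .. 19*g_4 (the 4 imul)
    h = [0] * 5
    for i in range(5):
        for j in range(5):
            if i + j < 5:
                h[i + j] += f[i] * g[j]            # one of the 25 mul
            else:
                h[i + j - 5] += f[i] * g19[j - 1]  # wrapped term: 2^255 = 19 mod p
    carry = 0
    for k in range(5):                 # carries bring each h_k back to ~51 bits
        h[k] += carry
        carry = h[k] >> 51
        h[k] &= MASK51
    h[0] += 19 * carry                 # final wrap-around carry
    return h

def limbs_to_int(x):
    return sum(l << (51 * i) for i, l in enumerate(x)) % P

f = [(3**40 + i) & MASK51 for i in range(5)]
g = [(5**30 + i) & MASK51 for i in range(5)]
assert limbs_to_int(fe51_mul(f, g)) == limbs_to_int(f) * limbs_to_int(g) % P
```

In the real code each product is a mul instruction and each accumulation an add/adc pair; the Python model only mirrors the data flow.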
2.2 The Radix-\(2^{25.5}\) Representation
To compute \(h=fg\) and \(h'=f'g'\) at the same time, we follow the strategy of [3] but replace the vectorized addition and multiplication instructions by the corresponding ones on Sandy/Ivy Bridge. Given \((f_0, f'_0), \dots, (f_9, f'_9)\) and \((g_0, g'_0), \dots, (g_9, g'_9)\), first prepare 9 vectors \((19 g_1, 19 g'_1), \dots, (19 g_9, 19 g'_9)\) with 9 vpmuludq instructions and 5 vectors \((2f_1, 2f'_1), (2f_3, 2f'_3), \dots, (2f_9, 2f'_9)\) with 5 vectorized addition instructions vpaddq. Note that the reason to use vpaddq instead of vpmuludq is to balance the loads of the different execution units on the CPU core; see the analysis in Sect. 2.3. Each \((f_0g_j, f'_0g'_j)\) then takes 1 vpmuludq, while each \((f_ig_j, f'_ig'_j)\) where \(i>0\) takes 1 vpmuludq and 1 vpaddq. In total, it takes 109 vpmuludq and 95 vpaddq instructions to compute \((h_0, h'_0), (h_1, h'_1), \dots, (h_9, h'_9)\). We denote such a vector of two field multiplications as \(\mathbf{M^2}\), including the carries that bring each \(h_k\) (and also \(h'_k\)) back to \(26 - (k \bmod 2)\) bits; \(\mathbf{M^{2-}}\) represents \(\mathbf{M^2}\) without carries. Similarly, we use \(\mathbf{S^2}\) and \(\mathbf{S^{2-}}\) for squarings.
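A scalar Python model of one lane of \(\mathbf{M^{2-}}\) follows (the real code processes two lanes per instruction; names and structure here are illustrative):

```python
# Scalar model of one lane of the radix-2^25.5 multiplication, without carries.
# Limb i carries weight 2^ceil(25.5*i); limbs alternate between 26 and 25 bits.
def fe10_mul_nocarry(f, g):
    h = [0] * 19
    for i in range(10):
        for j in range(10):
            c = f[i] * g[j]
            # When i and j are both odd, the weights multiply to twice the
            # weight of limb i+j -- hence the precomputed 2*f_1, 2*f_3, ...
            if i % 2 == 1 and j % 2 == 1:
                c *= 2
            h[i + j] += c
    # Fold limbs 10..18 back using 2^255 = 19 (mod p) -- hence the
    # precomputed vectors 19*g_1, ..., 19*g_9.
    for k in range(18, 9, -1):
        h[k - 10] += 19 * h[k]
    return h[:10]

# Check against plain integer arithmetic.
P = 2**255 - 19
w = lambda i: (51 * i + 1) // 2        # ceil(25.5 * i)
val = lambda x: sum(l << w(i) for i, l in enumerate(x)) % P
f = [(7**20 + i) % (1 << 25) for i in range(10)]
g = [(11**15 + i) % (1 << 25) for i in range(10)]
assert val(fe10_mul_nocarry(f, g)) == val(f) * val(g) % P
```

Counting the products in the model recovers the instruction counts above: 100 products plus 9 precomputed \(19g_j\) give the 109 vpmuludq, and 90 accumulations plus 5 doublings give the 95 vpaddq.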

Each carry from \(h_k\) to \(h_{k+1}\) is performed in three steps:

- Perform a logical right shift of the 64-bit words in \(h_k\) using a vpsrlq instruction. The shift amount is \(26-(k \bmod 2)\).

- Add the result of the first step into \(h_{k+1}\) using a vpaddq instruction.

- Mask out the most significant \(38 + (k \bmod 2)\) bits of \(h_k\) using a vpand instruction.

For the carry \(h_9 \rightarrow h_0\), the result of the shift has to be multiplied by 19 before being added to \(h_0\). Note that the usage of vpsrlq implies that we are using unsigned limbs; there is no vectorized arithmetic shift instruction on Sandy Bridge and Ivy Bridge.
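The carry chain can be sketched in Python as follows (a scalar model of the two-lane vpsrlq/vpaddq/vpand sequence; it assumes non-negative input limbs):

```python
# Scalar model of the three-step carry chain of Sect. 2.2.
def carry_chain(h):
    for k in range(10):
        shift = 26 - (k % 2)          # vpsrlq shift amount
        carry = h[k] >> shift         # step 1: logical right shift
        if k < 9:
            h[k + 1] += carry         # step 2: add into h_{k+1}
        else:
            h[0] += 19 * carry        # the h_9 -> h_0 carry is multiplied by 19
        h[k] &= (1 << shift) - 1      # step 3: mask, keeping 26-(k mod 2) bits
    return h

# The chain preserves the represented value modulo p and shrinks the limbs.
P = 2**255 - 19
w = lambda i: (51 * i + 1) // 2       # ceil(25.5 * i)
val = lambda x: sum(l << w(i) for i, l in enumerate(x)) % P
h = [3 << 40] * 10                    # oversized limbs
before = val(h)
h = carry_chain(h)
assert val(h) == before
assert all(h[k] < 1 << 26 for k in range(10))
```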
2.3 Why Is Smaller Radix Better?
Table 2. Instructions for field arithmetic used in [1] and this paper. The data is mainly based on the well-known survey by Fog [10]. The survey does not specify the port utilization for mul, so we figured this out using the performance counters (accessed using perf). Throughputs are per-cycle; latencies are given in cycles.
| Instruction | Port | Throughput | Latency |
|---|---|---|---|
| vpmuludq | 0 | 1 | 5 |
| vpaddq | either 1 or 5 | 2 | 1 |
| vpsubq | either 1 or 5 | 2 | 1 |
| mul | 0 and 1 | 1 | 3 |
| imul | 1 | 1 | 3 |
| add | either 0, 1, or 5 | 3 | 1 |
| adc | either two of 0, 1, 5 | 1 | 2 |
On Intel microarchitectures, an instruction is decoded and decomposed into micro-operations (\(\mu\)ops). Each \(\mu\)op is then stored in a pool, waiting to be executed by one of the ports (when its operands are ready). On each Sandy Bridge and Ivy Bridge core there are 6 ports. In particular, Ports 0, 1, and 5 are responsible for arithmetic; the remaining ports are responsible for memory access, which is beyond the scope of this paper (Table 2).
The arithmetic ports are not identical. For example, vpmuludq is decomposed into 1 \(\mu\)op, which is handled by Port 0 each cycle with latency 5; vpaddq is decomposed into 1 \(\mu\)op, which is handled by Port 1 or 5 each cycle with latency 1. Therefore, an \(\mathbf{M^{2-}}\) would take at least 109 cycles. Our experiment shows that \(\mathbf{M^{2-}}\) takes around 112 Sandy Bridge cycles, which translates to 56 cycles per multiplication.
The situation for \(\mathbf{m}\) is more complicated: mul is decomposed into 2 \(\mu\)ops, which are handled by Port 0 and Port 1 with latency 3; imul is decomposed into 1 \(\mu\)op, which is handled by Port 1 with latency 3; add is decomposed into 1 \(\mu\)op, which is handled by one of Ports 0, 1, 5 with latency 1; adc is decomposed into 2 \(\mu\)ops, which are handled by two of Ports 0, 1, 5 with latency 2. In total, the 25 mul, 4 imul, 20 add, and 20 adc instructions account for at least \((25\cdot 2 + 4 + 20 + 20\cdot 2)/3 = 38\) cycles. Our experiment shows that \(\mathbf{m^{-}}\) takes 52 Sandy Bridge cycles. The mov instructions explain a few cycles out of the \(52-38 = 14\)-cycle gap; the performance counter also shows that the core fails to distribute the \(\mu\)ops equally over the ports.

At 56 cycles per multiplication, \(\mathbf{M^{2-}}\) still appears to be slower than \(\mathbf{m^{-}}\) at 52 cycles. However, several factors work in favor of the vectorized approach:

- \(\mathbf{m}\) spends more cycles on carries than \(\mathbf{M^2}\) does: \(\mathbf{m}\) takes 68 Sandy Bridge cycles (16 cycles for carries), while \(\mathbf{M^2}\) takes 69.5 Sandy Bridge cycles per multiplication (13.5 cycles for carries).

- The algorithm built upon \(\mathbf{M^2}\) might have additions/subtractions; some speedup can be gained by interleaving the code, see Sect. 2.5.

- The computation might have some non-field-arithmetic part which can be improved using the vector unit; see Sect. 3.2.
2.4 Importance of Using a Small Constant
For ease of reduction, the prime fields used in ECC and HECC often have the form \(2^k - c\) for a small constant c. It might seem that as long as c is not too big, the speed of field arithmetic would remain the same. However, the following example shows that using the slightly larger \(c=31\) (\(2^{255}-31\) is the largest prime below \(2^{255}-19\)) might already cause some overhead.
Consider two field elements f, g which are the results of two field multiplications. Because the limbs are reduced, the upper bound of \(f_0\) would be close to \(2^{26}\), the upper bound of \(f_1\) would be close to \(2^{25}\), and so on; the same bounds apply for g. Now suppose we need to compute \((f - g)^2\), which is batched with another squaring to form an \(\mathbf{S^2}\). To avoid possible underflow, we compute the limbs of \(h=f-g\) as \(h_i = (f_i + 2 q_i) - g_i\) instead of \(h_i = f_i - g_i\), where \(q_i\) is the corresponding limb of \(2^{255}-19\). As a result, the upper bound of \(h_6\) is around \(3 \cdot 2^{26}\). To perform the squaring, \(c \cdot h^2_6\) is required. When \(c=19\) we can simply multiply \(h_6\) by 19 using 1 vpmuludq, and then multiply the product by \(h_6\) using another vpmuludq. Unfortunately the same instructions do not work for \(c=31\), since \(31 \cdot h_6\) can take more than 32 bits.
To overcome such a problem, an easy solution is to use a smaller radix so that each (reduced) limb takes fewer bits. This method would increase the number of limbs and thus the number of vpmuludq instructions required. A better solution is to delay the multiplication by c: instead of computing \(31 f_{i_1}g_{j_1} + 31 f_{i_2}g_{j_2} + \cdots\) by first computing \(31 g_{j_1}, 31 g_{j_2}, \dots\), compute \(f_{i_1}g_{j_1} + f_{i_2}g_{j_2} + \cdots\) and then multiply the sum by 31. The sum can take more than 32 bits (and vpmuludq takes only 32-bit inputs), so the multiplication by 31 cannot be handled by vpmuludq. Let \(s = f_{i_1}g_{j_1} + f_{i_2}g_{j_2} + \cdots\); one way to handle the multiplication by 31 is to compute 32s with one shift instruction vpsllq and then compute \(32s - s = 31s\) with one subtraction instruction vpsubq. This solution does not make Port 0 busier, since vpsllq, like vpmuludq, takes only one cycle on Port 0, but it does make Ports 1 and 5 busier (because of vpsubq), which can potentially increase the cost of \(\mathbf{S^{2-}}\) by a few cycles.
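The shift-and-subtract trick can be checked with a one-line Python model (the 64-bit lane width is implicit here; vpsllq and vpsubq correspond to the shift and the subtraction):

```python
# Delayed multiplication by c = 31: 31*s = 32*s - s (vpsllq then vpsubq).
def times31(s):
    return (s << 5) - s

s = (1 << 40) + 12345        # a sum of products wider than 32 bits
assert times31(s) == 31 * s  # correct even though s does not fit vpmuludq
```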
It is easy to imagine that for some values of c the multiplication cannot be handled as cheaply as for 31. In addition, delaying the multiplication cannot handle as many values of c as using a smaller radix can; as a trivial example, it does not work if \(cf_{i_1}g_{j_1} + cf_{i_2}g_{j_2} + \cdots\) takes more than 64 bits. We note that the computation pattern in the example is actually part of an elliptic-curve operation (see lines 6–9 in Algorithm 1), meaning a bigger constant c actually can slow down elliptic-curve operations.
We comment that the usage of a larger c has a bigger impact on constrained devices. If c is too big for efficient vectorization, one can at least fall back to the \(64\times 64 \rightarrow 128\)-bit serial multiplier, which can handle a wide range of c without increasing the number of limbs. However, on ARM processors, where the serial multiplier can only perform \(32 \times 32 \rightarrow 64\)-bit multiplications, even the serial multiplier would be sensitive to the size of c. For even smaller devices the situation is expected to be worse.
2.5 Instruction Scheduling for Vectorized Field Arithmetic
The fact that \(\mu\)ops are stored in a pool before being handled by a port allows the CPU to achieve so-called out-of-order execution: a \(\mu\)op can be executed before another \(\mu\)op that comes from an earlier instruction. This feature is sometimes viewed as the CPU core being able to "look ahead" and execute a later instruction whose operands are ready. However, the ability of out-of-order execution is limited: the core is not able to look too far ahead. It is thus better to arrange the code so that each code block contains instructions for each port.
While Port 0 is quite busy in \(\mathbf{M^2}\), Ports 1 and 5 are often idle. In an elliptic-curve operation (see the following sections) an \(\mathbf{M^2}\) is often preceded by a few field additions/subtractions. Since vpaddq and the vectorized subtraction instruction vpsubq can only be handled by either Port 1 or Port 5, we try to interleave the multiplication code with the addition/subtraction code to reduce the chance of having an idle port. Experimental results show that this optimization brings a small yet visible speedup. It seems more difficult for an algorithm built upon \(\mathbf{m}\) to use the same optimization.
3 The Curve25519 Elliptic-Curve Diffie-Hellman Scheme
Given a 32-byte secret key and the 32-byte encoding of a standard base point defined in [11], the Curve25519 function outputs the corresponding public key. Similarly, given a 32-byte secret key and a 32-byte public key, the function outputs the corresponding shared secret. Although the same routine can be used for generating both public keys and shared secrets, public-key generation can be done much faster by performing the scalar multiplication on an equivalent curve. The rest of this section describes how we implement the Curve25519 function for shared-secret computation and public-key generation.
3.1 Shared-Secret Computation

In each Montgomery ladder step, the multiplication by 121666 and the multiplication by \(x_1\) are batched as follows:

- Compute the multiplications by 121666 without carries, using 5 vpmuludq instructions.

- Compute the multiplications by \(x_1\) without carries. This can be completed in 50 vpmuludq instructions, since we precompute the products of the small constants (namely 2, 19, and 38) and the limbs of \(x_1\) before the ladder begins.

- Perform batched carries for the two multiplications.

This uses far fewer cycles than handling the carries for the two multiplications separately.
Note that we often have to "transpose" data in the ladder step. More specifically, after an \(\mathbf{M^2}\) which computes \((h_0, h'_0), \dots, (h_9, h'_9)\), we might need to compute \(h + h'\) and \(h - h'\); see lines 6–8 of Algorithm 1. In this case, we compute \((h_{i}, h_{i+1})\) and \((h'_{i}, h'_{i+1})\) from \((h_i, h'_i), (h_{i+1}, h'_{i+1})\) for \(i \in \{0,2,4,6,8\}\), and then perform additions and subtractions on the vectors. The transpositions can be carried out using the "unpack" instructions vpunpcklqdq and vpunpckhqdq. Similarly, some transpositions are required to obtain the operands for \(\mathbf{M^2}\). Unpack instructions are the same as vpaddq and vpsubq in terms of port utilization, so we also try to interleave them with \(\mathbf{M^2}\) or \(\mathbf{S^2}\) as described in Sect. 2.5.
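The unpack instructions can be modeled as follows (an illustrative sketch in which each 128-bit register is a pair of 64-bit words):

```python
# vpunpcklqdq combines the low 64-bit words of two registers,
# vpunpckhqdq the high words (Python model of the "transpose").
def vpunpcklqdq(a, b):
    return (a[0], b[0])

def vpunpckhqdq(a, b):
    return (a[1], b[1])

# (h_i, h'_i), (h_{i+1}, h'_{i+1})  ->  (h_i, h_{i+1}), (h'_i, h'_{i+1})
v0 = (100, 200)        # (h_0, h'_0)
v1 = (101, 201)        # (h_1, h'_1)
assert vpunpcklqdq(v0, v1) == (100, 101)   # (h_0, h_1)
assert vpunpckhqdq(v0, v1) == (200, 201)   # (h'_0, h'_1)
```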
3.2 PublicKey Generation
We do better by vectorizing between the computations of \(P_0\) and \(P_1\): all the data related to \(P_0\) and \(P_1\) are loaded into the lower halves and upper halves of the 128-bit registers, respectively. This type of computation pattern is very friendly for vectorization, since there is no need to "transpose" the data as in Sect. 3.1.

The table-lookup function performs two steps:

- Load \(s_iP\) in constant time; this is the main bottleneck of the table lookup.

- Negate \(s_iP\) if \(s_i\) is negative.
For the first step it is convenient to use the conditional move instruction (cmov): to obtain each limb (of each coordinate) of \(s_iP\), first initialize a 64-bit register to the corresponding limb of \(\infty\), then, for each of \(P, 2P, \dots, 8P\), conditionally move the corresponding limb into the register. Computation of the conditions and the conditional negation are relatively cheap compared to the cmov instructions. [1] uses a 3-coordinate system for precomputed points, so the table-lookup function takes \(3 \cdot 8 \cdot 5 = 120\) cmov instructions. The function takes 159 Sandy Bridge cycles or 158 Ivy Bridge cycles.
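The cmov pattern can be modeled in Python with arithmetic masks (a hypothetical sketch: table[0] plays the role of \(\infty\) and table[k] the role of kP; the real implementation operates on 64-bit limbs with cmov):

```python
# Constant-time table lookup: every entry is scanned, and masked merges
# replace the cmov instructions (illustrative Python sketch).
MASK64 = (1 << 64) - 1

def ct_lookup(table, index):
    """Return table[index] while touching every entry (no secret branches)."""
    result = [0] * len(table[0])
    for k, entry in enumerate(table):
        eq = (((k ^ index) - 1) >> 63) & 1   # 1 iff k == index
        mask = (-eq) & MASK64                # all-ones iff k == index
        for j in range(len(result)):
            # the masked merge plays the role of a cmov
            result[j] = (result[j] & ~mask) | (entry[j] & mask)
    return result

table = [[10 * k + j for j in range(5)] for k in range(9)]  # dummy limbs
assert ct_lookup(table, 7) == table[7]
```

The loop structure shows where the \(3 \cdot 8 \cdot 5\) count comes from: one conditional move per limb, per coordinate, per candidate point.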
4 Vectorizing the Ed25519 Signature Scheme
This section describes how the Ed25519 verification is implemented, with a focus on the challenge of vectorization. Since public-key generation and the signing process, like Curve25519 public-key generation, are bottlenecked by a fixed-base scalar multiplication on \(E_T\), the reader can check Sect. 3.2 for the implementation strategy.
4.1 Ed25519 Verification
[1] verifies a message by computing a double-scalar multiplication of the form \(s_1P_1 + s_2P_2\). The double-scalar multiplication is implemented using a generalization of the sliding-window method such that \(s_1P_1\) and \(s_2P_2\) share the doublings. With the same window sizes, we do better by vectorizing the point doubling and point addition functions.
On average each verification takes about 252 point doublings, accounting for more than \(110\,000\) cycles. There are two doubling functions in our implementation; ge_dbl_p2, which is adapted from the "\(\mathcal{E} \leftarrow 2\mathcal{E}\)" doubling described in [9], is the most frequently used one; see [9] for the reason to use different doubling and addition functions. On average ge_dbl_p2 is called 182 times per verification, accounting for more than \(74\,000\) cycles. The function is summarized in Algorithm 2. Given (X : Y : Z) representing \((X/Z, Y/Z) \in E_T\), the function returns \((X':Y':Z') = (X:Y:Z) + (X:Y:Z)\). As in Sect. 3.1, squarings and multiplications are paired whenever convenient. However, it is not always possible to do so, as the multiplication in line 9 cannot be paired with other operations. The single multiplication slows down the function, and the same problem also appears in the addition functions.
Another problem is harder to see. \(E = X^2 + Y^2 - (X+Y)^2\) has limbs with upper bound around \(4 \cdot 2^{26}\), and \(I = X^2 - Y^2 + 2Z^2\) has limbs with upper bound around \(5 \cdot 2^{26}\). For the multiplication \(E \cdot I\), the limbs of either E or I have to be multiplied by 19 (see Sect. 2.2), and the products can take more than 32 bits. This problem is solved by performing extra carries on the limbs of E before the multiplication. The same problem appears in the other doubling function.
In general the computation pattern for verification is not so friendly for vectorization. However, even in this case our software still gains a non-negligible speedup over [1, 7]. We conclude that the power of the vector unit in recent Intel microarchitectures might have been seriously underestimated, and implementors of ECC software should consider trying vectorized multipliers instead of serial multipliers.
Footnotes
 1.
[1] uses one more imul; perhaps this is for reducing memory access.
 2.
The leading ‘v’ indicates that the instruction is the VEX extension of the pmuludq instruction. The benefit of using vpmuludq is that it is a 3-operand instruction. In this paper we show vector instructions in their VEX-extension form, even though vector instructions are sometimes used without the VEX extension.
References
1. Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.-Y.: High-speed high-security signatures. In: Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 124–142. Springer, Heidelberg (2011)
2. Costigan, N., Schwabe, P.: Fast elliptic-curve cryptography on the Cell Broadband Engine. In: AFRICACRYPT 2009, pp. 368–385 (2009)
3. Bernstein, D.J., Schwabe, P.: NEON crypto. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 320–339. Springer, Heidelberg (2012)
4. Bernstein, D.J., Chuengsatiansup, C., Lange, T., Schwabe, P.: Kummer strikes back: new DH speed records. In: ASIACRYPT 2014, pp. 317–337 (2014)
5. Costello, C., Hisil, H., Smith, B.: Faster compact Diffie-Hellman: endomorphisms on the x-line. In: EUROCRYPT 2014, pp. 183–200 (2014)
6. Longa, P., Sica, F.: Four-dimensional Gallant-Lambert-Vanstone scalar multiplication. In: ASIACRYPT 2012, pp. 718–739 (2012)
7. Moon, A. ("floodyberry"): Implementations of a fast Elliptic-curve Digital Signature Algorithm (2013). https://github.com/floodyberry/ed25519-donna
8. Bernstein, D.J., Lange, T. (eds.): eBACS: ECRYPT Benchmarking of Cryptographic Systems (2014). http://bench.cr.yp.to
9. Hisil, H., Wong, K.K.-H., Carter, G., Dawson, E.: Twisted Edwards curves revisited. In: ASIACRYPT 2008, pp. 326–343 (2008)
10. Fog, A.: Instruction tables (2014). http://www.agner.org/optimize/instruction_tables.pdf
11. Bernstein, D.J.: Curve25519: new Diffie-Hellman speed records. In: PKC 2006, pp. 207–228 (2006)
12. Oliveira, T., López, J., Aranha, D.F., Rodríguez-Henríquez, F.: Lambda coordinates for binary elliptic curves. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 311–330. Springer, Heidelberg (2013)
13. Bos, J.W., Costello, C., Hisil, H., Lauter, K.: Fast cryptography in genus 2. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013. LNCS, vol. 7881, pp. 194–210. Springer, Heidelberg (2013)
14. Petit, C., Quisquater, J.-J.: On polynomial systems arising from a Weil descent. In: ASIACRYPT 2012, pp. 451–466 (2012)
15. Semaev, I.: New algorithm for the discrete logarithm problem on elliptic curves (2015). https://eprint.iacr.org/2015/310.pdf
16. Düll, M., Haase, B., Hinterwälder, G., Hutter, M., Paar, C., Sánchez, A.H., Schwabe, P.: High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Crypt. 77(2), 493–514 (2015). http://cryptojedi.org/papers/#mu25519
17. IANIX. www.ianix.com
18. Bernstein, D.J.: qhasm software package (2007). http://cr.yp.to/qhasm.html
19. Longa, P.: NUMS Elliptic Curves and their Implementation. http://patricklonga.webs.com/NUMS_Elliptic_Curves_and_their_ImplementationUoWashington.pdf
20. Thériault, N.: Index calculus attack for hyperelliptic curves of small genus. In: ASIACRYPT 2003, pp. 75–92 (2003)