Timing attacks and local timing attacks against Barrett’s modular multiplication algorithm

Montgomery's and Barrett's modular multiplication algorithms are widely used in modular exponentiation algorithms, e.g. to compute RSA or ECC operations. While Montgomery's multiplication algorithm has been studied extensively in the literature and many side-channel attacks on it have been discovered, to the best of our knowledge no thorough analysis exists for Barrett's multiplication algorithm. This article closes this gap. For both Montgomery's and Barrett's multiplication algorithm, differences in the execution times are caused by conditional integer subtractions, so-called extra reductions. Barrett's multiplication algorithm can require up to two extra reductions, and this feature increases the mathematical difficulties significantly. We formulate and analyse a two-dimensional Markov process, from which we deduce relevant stochastic properties of Barrett's multiplication algorithm within modular exponentiation algorithms. This allows us to transfer the timing attacks and local timing attacks (where a second side-channel attack exhibits the execution times of the particular modular squarings and multiplications) on Montgomery's multiplication algorithm to attacks on Barrett's algorithm. However, there are also differences. Barrett's multiplication algorithm requires additional attack substeps, and the attack efficiency is much more sensitive to variations of the parameters. We treat timing attacks on RSA with CRT, on RSA without CRT, and on Diffie-Hellman, as well as local timing attacks against these algorithms in the presence of basis blinding. Experiments confirm our theoretical results.


Introduction
In his famous pioneer paper [19], Kocher introduced timing analysis. Two years later, [13] presented a timing attack on an early version of the Cascade chip. Both papers attacked unprotected RSA implementations which did not apply the Chinese remainder theorem (CRT). While in [19] the execution times of the particular modular multiplications and squarings are at least approximately normally distributed, this is not the case for the implementation in [13] since the Cascade chip applied the widespread Montgomery multiplication algorithm [21]. Due to conditional integer subtractions (so-called extra reductions), the execution times can only attain two values, and the probability that an extra reduction occurs depends on the preceding Montgomery operations within the modular exponentiation. This fact caused substantial additional mathematical difficulties.

[Authors: Werner Schindler (werner.schindler@bsi.bund.de), Johannes Mittmann (johannes.mittmann@bsi.bund.de), Bundesamt für Sicherheit in der Informationstechnik (BSI), Godesberger Allee 185-189, 53175 Bonn, Germany]
In [24], the random behaviour of the occurrence of extra reductions within a modular exponentiation was studied. The random extra reductions were modelled by a non-stationary time-discrete stochastic process. The analysis of this stochastic process (combined with an efficient error detection and correction strategy) allowed a drastic reduction of the sample size, i.e. the number of timing measurements, namely from 200,000-300,000 [13] down to 5000 [28].
The analysis of the above-mentioned stochastic process turned out to be very fruitful also beyond this attack scenario. First, the insights into the probabilistic nature of the occurrence of extra reductions within modular exponentiations enabled the development of a completely new timing attack against RSA with CRT and Montgomery's multiplication algorithm [22]. This attack was extended to an attack on the sliding-window-based RSA implementation in OpenSSL v.0.9.7b [8], which caused a patch. The efficiency of this attack (in terms of the sample size) was increased by a factor of ≈ 10 in [3]. Years later, it was shown that exponent blinding (cf. [19], Sect. 10) does not suffice to prevent this type of timing attack [26,27].
Moreover, in [2,14,23] local timing attacks were considered. There, a side-channel attack (e.g. a power attack or an instruction cache attack) is carried out first, which yields the execution times of the particular Montgomery operations. This additional information (compared to 'pure' timing attacks) makes it possible to overcome basis blinding (a.k.a. message blinding, cf. [19], Sect. 10), and the attack works against both RSA with CRT and RSA without CRT. We mention that [2] led to a patch of OpenSSL v.0.9.7e.
Barrett's (modular) multiplication algorithm (a.k.a. Barrett reduction) [4] is a well-known alternative to Montgomery's algorithm. It is described in several standard manuals covering RSA, Diffie-Hellman (DH) or elliptic curve cryptosystems (e.g. [10,20]). The efficiency (e.g. running time) of Barrett's algorithm compared to Montgomery's algorithm has been analysed for both software implementations [6] and hardware implementations [18]. However, to our knowledge, there do not exist thorough security evaluations of Barrett's multiplication algorithm. In this paper, we close this gap. For the sake of comparison with previous work on Montgomery's algorithm, we focus again on RSA with and without CRT. In addition, we cover static DH, which can be handled almost identically to RSA without CRT.
Similar to Montgomery's algorithm, timing differences in Barrett's multiplication algorithm are caused by conditional subtractions (so-called extra reductions), which suggests applying similar mathematical methods. However, for Barrett's algorithm, the mathematical challenges are significantly greater. One reason is that more than one extra reduction may occur. In particular, in place of a stochastic process over {0, 1}, a two-dimensional Markov process over [0, 1) × {0, 1, 2} has to be analysed and understood. Again, probabilities can be expressed by multidimensional integrals over the unit cube, but the integrands are less suitable for explicit computations than in the Montgomery case. This causes additional numerical difficulties, in particular for the local attacks, where the dimension of these integrals is usually very large. Our results show many parallels to the Montgomery case, and after suitable modifications, all the known attacks on Montgomery's algorithm can be transferred to Barrett's multiplication algorithm. However, there are also significant differences. First of all, for Barrett's multiplication algorithm the attack efficiency is very sensitive to variations of the modulus (i.e. of the RSA primes p_1 and p_2 if the CRT is applied), and attacks on RSA with CRT require additional attack steps.
The paper is organized as follows: in Sect. 2, we study the stochastic behaviour of the execution times of Barrett's multiplication algorithm in the context of the square & multiply exponentiation algorithm. We develop, prove and collect results which will be needed later to perform the attacks. In Sect. 3, properties of Montgomery's and Barrett's multiplication algorithms are compared, and furthermore, a variant of Barrett's algorithm is investigated. In Sect. 4, the particular attacks are described and analysed, while Sect. 5 provides experimental results which confirm the theoretical considerations. Interesting in its own right is also an efficient look-ahead strategy. Finally, Sect. 6 discusses countermeasures.

Stochastic modelling of modular exponentiation
In this section, we analyse the stochastic timing behaviour of modular exponentiation algorithms when Barrett's multiplication is applied. We consider a basic version of Barrett's multiplication algorithm (cf. Algorithm 1). The slightly optimized version of this algorithm due to Barrett [4] will be discussed in Sect. 3.

Barrett's modular multiplication algorithm
In this subsection, we study the basic version of Barrett's modular multiplication algorithm (cf. Algorithm 1).
A multiplication modulo M of two integers x, y ∈ Z_M can be computed by an integer multiplication, followed by a modular reduction. The resulting remainder is r := (x · y) mod M = z − qM, where z := xy and q := ⌊z/M⌋. The computation of q is the most expensive part, because it involves an integer division. The idea of Barrett's multiplication algorithm is to approximate q by q̂ := ⌊⌊z/b^{k−1}⌋ · μ / b^{k+1}⌋. If the integer reciprocal μ := ⌊b^{2k}/M⌋ of M has been precomputed and if b is a power of 2, then q̂ can be computed using only multiplications and bit shifts, which on common computer architectures are cheaper operations than divisions. From q̂, an approximation r̂ := z − q̂M of r can be obtained. Since q̂ can be smaller than q, it may be necessary to correct r̂ by some conditional subtractions of M, which we call extra reductions. This leads to Algorithm 1.
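The reduction just described can be sketched in plain Python. This is our own illustrative code, not the paper's reference implementation; the function name `barrett_mul` and the parameter choices are assumptions made for the example.

```python
# Sketch of Algorithm 1 (basic Barrett multiplication), assuming
# k = floor(log_b M) + 1 and mu = floor(b^(2k)/M) as in the text.
# Illustrative code; names are ours, not the paper's.

def barrett_mul(x, y, M, b):
    """Return (x*y mod M, number of extra reductions)."""
    k = 1
    while b**k <= M:          # smallest k with M < b^k, i.e. k = floor(log_b M) + 1
        k += 1
    mu = b**(2 * k) // M      # precomputed integer reciprocal of M
    z = x * y
    q_hat = ((z // b**(k - 1)) * mu) // b**(k + 1)   # approximation of floor(z/M)
    r = z - q_hat * M         # r is non-negative and exceeds M by at most 2M
    extra_reductions = 0
    while r >= M:             # conditional subtractions ("extra reductions")
        r -= M
        extra_reductions += 1
    return r, extra_reductions


if __name__ == "__main__":
    import random
    random.seed(1)
    b = 2**16
    for _ in range(1000):
        M = random.randrange(b**3, b**4)      # a 4-digit modulus in base b
        x, y = random.randrange(M), random.randrange(M)
        r, er = barrett_mul(x, y, M, b)
        assert r == (x * y) % M               # correct remainder
        assert er <= 2                        # at most two extra reductions
```

The loop at the end checks, for random operands, that the result agrees with the exact remainder and that at most two extra reductions occur, as stated below.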
Barrett showed that at most two extra reductions are required in Algorithm 1. The following lemma provides an exact characterization of the number of extra reductions and is at the heart of our subsequent analysis. In particular, the lemma identifies two important constants α ∈ [0, 1) and β ∈ (b^{−1}, 1] associated with M and b.

Lemma 1 On input x, y ∈ Z M , the number of extra reductions carried out in Algorithm 1 is
where α := (M²/b^{2k}) · {b^{2k}/M} ∈ [0, 1) and β := b^{k−1}/M ∈ (b^{−1}, 1].

We first study the distribution of the number of extra reductions which are needed in Algorithm 1 for random inputs. To this end, we introduce the following stochastic model. Random variables are denoted by capital letters, and realizations of these random variables (i.e. values taken on by these random variables) are denoted by the corresponding small letters.
A realization r of R(s, t) expresses the random quantity of extra reductions which is required in Algorithm 1 for normalized inputs x/M, y/M ∈ M^{−1}Z_M within a small neighbourhood of s and t in [0, 1).

Justification of Stochastic Model 1: Assume that N(s) and N(t) are small neighbourhoods of s and t in [0, 1), respectively. Let s_min, s′ ∈ N(s) ∩ M^{−1}Z_M such that s_min is minimal. Analogously, let t_min, t′ ∈ N(t) ∩ M^{−1}Z_M such that t_min is minimal. Then, there are m, n ∈ Z such that s′ = s_min + m/M and t′ = t_min + n/M, and x := Ms′, x_min := Ms_min, y := Mt′, y_min := Mt_min are integers in Z_M. The integers m and n range over large sets; for cryptographically relevant modulus sizes, these numbers are very large, so that one may assume that the admissible terms {(my_min + nx_min + mn) mod M} are essentially uniformly distributed on Z_M, justifying the model assumption that the random variable V is uniformly distributed on [0, 1). The assumptions on the uniformity of U and the independence of U and V have analogous justifications.

(i) The term (an)^{−1}(ax + b)_+^n is an antiderivative of (ax + b)_+^{n−1} for all a, b ∈ R with a ≠ 0 and all n ∈ N_{>0}.
(i) For r ∈ Z_3, we have Pr(R(S, t) ≤ r) = ∫_0^1 Pr(R(s, t) ≤ r) ds. By Lemma 2 (iii), we obtain the stated expression. Distinguishing the cases αt + β ≤ 1 and αt + β > 1, the expectation and variance of R(S, t) can be determined by careful but elementary computations.

Lemma 4
Let S be a uniformly distributed random variable on [0, 1).
(i) For r ∈ Z_3, we have Pr(R(S, S) ≤ r) = ∫_0^1 Pr(R(s, s) ≤ r) ds.
By Lemma 2 (iii), we obtain This implies (i). Distinguishing the cases α + β ≤ 1 and α + β > 1, the expectation and variance of R(S, S) can be determined by careful but elementary computations.
The number of extra reductions in Barrett's multiplication algorithm depends decisively on the parameters α and β. In Table 1, exemplary values are given; for example, we get α ≈ 0.29 and β ≈ 0.69 on average if b = 2. (ii) By (2), two extra reductions can only occur if α + β > 1, and the probability of this event increases for larger sums α + β. This sum can be bounded; in particular, α and β cannot attain their individual maxima simultaneously. For b = 2, we numerically determined Pr(A + B > 1) ≈ 0.5, which means that two extra reductions do occur for roughly one half of the moduli in this case. (iii) Although the probability of two extra reductions can be very small or even 0, this does not simplify the stochastic representations (2) and (3).
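The average α ≈ 0.29 for b = 2 quoted above can be checked numerically. The sketch below is our own Monte Carlo illustration (the helper `alpha_of` is not from the paper); it evaluates α = (M²/b^{2k})·{b^{2k}/M} exactly with rational arithmetic for random maximum-length moduli.

```python
# Monte Carlo plausibility check of the constant alpha for b = 2:
# alpha = (M^2 / b^(2k)) * {b^(2k)/M}, evaluated exactly with fractions.
# Illustrative code only; the sampling range (random 64-bit moduli) is our choice.
from fractions import Fraction
import random

def alpha_of(M, b):
    k = 1
    while b**k <= M:                           # k = floor(log_b M) + 1
        k += 1
    t = Fraction(b**(2 * k), M)
    frac_part = t - (b**(2 * k) // M)          # fractional part {b^(2k)/M}
    return Fraction(M * M, b**(2 * k)) * frac_part

random.seed(2)
samples = [alpha_of(random.randrange(2**63, 2**64), 2) for _ in range(2000)]
mean_alpha = float(sum(samples)) / len(samples)
assert all(0 <= a < 1 for a in samples)        # alpha ∈ [0, 1)
assert abs(mean_alpha - 0.29) < 0.05           # close to the reported average
```

For moduli of maximum length (M/2^k uniform on (1/2, 1)) the expectation works out to roughly 7/24 ≈ 0.29, which matches the figure in the text.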

Modular exponentiation (square and multiply algorithms)
Now we consider the left-to-right binary exponentiation algorithm (see Algorithm 2), where modular squarings and multiplications are performed using Barrett's algorithm (see Algorithm 1). Our goal is to define and analyse a stochastic process which allows to study the stochastic behaviour of the execution time of Algorithm 2. Sect. 2.2 provides a sequence of technical lemmata which will be needed later. Let y ∈ Z_M be an input basis of Algorithm 2. We denote the intermediate values computed in the course of Algorithm 2 by x_0, x_1, . . . ∈ Z_M and associate the sequence of squaring and multiplication operations with a string O_1O_2 · · · over the alphabet {S, M}. For the sake of defining an infinite stochastic process, we assume that Algorithm 2 may run forever; hence, x_0, x_1, . . . is an infinite sequence and O_1O_2 · · · ∈ {S, M}^ω. Consequently, we have x_0 = y for all n ∈ N. Note that O_1O_2 · · · does not contain the substring MM. We will refer to strings O_1O_2 · · · without substring MM as operation sequences.
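For concreteness, the exponentiation and its operation sequence can be sketched as follows. This is our own illustrative code (not the paper's), with a plain-Python stand-in for Algorithm 1.

```python
# Sketch of Algorithm 2 (left-to-right square & multiply) on top of a
# plain-Python Barrett multiplication; records the operation sequence
# O_1 O_2 ... over {S, M}. Names are ours, not the paper's.

def barrett_mul(x, y, M, b):
    k = 1
    while b**k <= M:
        k += 1
    mu = b**(2 * k) // M
    z = x * y
    q_hat = ((z // b**(k - 1)) * mu) // b**(k + 1)
    r = z - q_hat * M
    while r >= M:
        r -= M
    return r

def square_and_multiply(y, d, M, b):
    """Return (y^d mod M, operation sequence as a string over {S, M})."""
    bits = bin(d)[2:]             # d_{l-1} = 1, ..., d_0
    x, ops = y % M, []
    for bit in bits[1:]:          # the leading bit is consumed by x_0 = y
        x = barrett_mul(x, x, M, b); ops.append("S")
        if bit == "1":
            x = barrett_mul(x, y, M, b); ops.append("M")
    return x, "".join(ops)

result, ops = square_and_multiply(1234567, 0b101101, 10**9 + 7, 2**16)
assert result == pow(1234567, 0b101101, 10**9 + 7)
assert "MM" not in ops            # an operation sequence never contains MM
```

The final assertion illustrates the defining property of operation sequences: a multiplication is always preceded by a squaring, so the substring MM never occurs.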
Note that the square and multiply algorithm applied to an exponent d corresponds to a particular finite (d-specific) operation sequence.

Stochastic Model 2 Let t ∈ [0, 1) and let O_1O_2 · · · ∈ {S, M}^ω be an operation sequence. We define a stochastic process (S_n, R_n)_{n∈N} on the state space S := [0, 1) × Z_3 as follows. Let S_0, S_1, . . . be independent random variables on [0, 1).
The value t represents a normalized input y/M of Algorithm 2.

Remark 4 Remark 2 (ii) applies to Stochastic Model 2 as well. In Sect. 4.3, we will adjust the stochastic process (S_n, R_n)_{n∈N} to a table-based exponentiation algorithm (fixed-window exponentiation), where multiplications by 1 occur frequently. Multiplications by 1 will therefore be handled separately.
Lemma 5 (i) The stochastic process (S_n, R_n)_{n∈N} is a non-homogeneous Markov process on S. (ii) The random vector (S_n, R_n) has a density f_n(s_n, r_n) with respect to λ ⊗ η.

[Figure: densities of (S_n, R_n). Here, S and T denote independent random variables that are uniformly distributed on [0, 1). We consider the cases of multiplication of a random input with a fixed input (with normalized value t = 0.9 and t = 0.7), squaring of a random input, and multiplication of two random inputs.]

(iii) For B ∈ B([0, 1)), r_{n+1} ∈ Z_3, and fixed (s_n, r_n) ∈ S, the transition probability is given by the conditional density h_{n+1}(s_{n+1}, r_{n+1} | s_n), distinguishing the case O_{n+1} = S (i.e. the (n + 1)-th operation is a squaring) and the case O_{n+1} = M. This shows that the pushforward measure of (S_n, R_n) is absolutely continuous with respect to λ ⊗ η; therefore, assertion (ii) follows from the Radon-Nikodym theorem.
Lemma 6 (i) For r_{n+1} ∈ Z_3, the probability Pr(R_{n+1} = r_{n+1}) can be expressed as an integral against f_n(s_n, r_n) ds_n dη(r_n). In particular, the distribution of R_{n+1} does not depend on n but only on the operation type O_{n+1} ∈ {S, M}.
(ii) The expectation and the variance of R_{n+1} depend only on the operation type O_{n+1} for all n ∈ N. In particular, we have explicit formulae (cf. (4)-(6)). The expectation μ_M is strictly monotonically increasing in t = y/M.
Lemma 7 (i) For r_{n+1}, . . . , r_{n+u} ∈ Z_3, the joint probability Pr(R_{n+1} = r_{n+1}, . . . , R_{n+u} = r_{n+u}) equals the iterated integral of h_{n+u}(s_{n+u}, r_{n+u} | s_{n+u−1}) ds_{n+u} · · · h_{n+1}(s_{n+1}, r_{n+1} | s_n) ds_{n+1} f_n(s_n, r_n) ds_n dη(r_n). (ii) In particular, the joint distribution of R_{n+1}, . . . , R_{n+u} does not depend on n but only on the operation types O_{n+1}, . . . , O_{n+u}. (iii) We have Cov(R_n, R_{n+s}) = 0 for all s ≥ 2.
Proof Let r n+1 , . . . , r n+u ∈ Z 3 . Then, and assertion (i) follows from Lemma 5, Lemma 6 (i), and the Ionescu-Tulcea theorem. Assertion (ii) is an immediate consequence of (i). To prove (iii), let 1 ≤ u < v ≤ n such that v − u > 1, let r 1 , . . . , r u , r v , . . . , r n ∈ Z 3 , and define the events Using the Markov property of (S n , R n ) n∈N and (i), we obtain

Definition 5
The normal distribution with mean μ and variance σ² is denoted by N(μ, σ²). For strings x, y ∈ {S, M}*, we denote by #_x(y) ∈ N the number of occurrences of x in y.
Below we will use the following version of the central limit theorem for m-dependent random variables due to Hoeffding & Robbins.

Lemma 9
Let O_1O_2 · · · ∈ {S, M}^ω be an operation sequence such that the limit ρ := lim_{u→∞} #_M(O_{i+1} · · · O_{i+u})/u exists uniformly for all i ∈ N, and define the limit variance A accordingly. Then, lim_{s→∞} Var(R_{n+1} + · · · + R_{n+s})/s = A. As u → ∞, the ratio #_M(O_{i+1} · · · O_{i+u})/u converges to ρ uniformly for all i by assumption; therefore, u^{−1} Σ_{h=1}^{u} A_{i+h} converges to A uniformly for all i. Hence, Var(R_{n+1} + · · · + R_{n+s})/s converges to A as s → ∞. Finally, (R_{n+1} + · · · + R_{n+s})/√s has the limiting distribution N(0, A) by Lemma 8.
We note that for random operation sequences O 1 O 2 · · · (corresponding to random exponents with independent and unbiased bits), the convergence of (7) is not uniform with probability 1. However, for any given finite operation sequence O n+1 · · · O n+s we may construct an infinite sequence O 1 O 2 · · · with subsequence O n+1 · · · O n+s for which convergence of (7) is uniform and ρ ≈ # M (O n+1 · · · O n+s )/s. Therefore, if s is sufficiently large, it is reasonable to assume that the normal approximation is appropriate. We mention that in our experiments in Sect. 5.1 approximation (8) is applied and leads to successful attacks.

Summary of the relevant facts
In this section, we studied the random behaviour of the number of extra reductions when Barrett's modular multiplication algorithm is used within the square and multiply exponentiation algorithm. In Sect. 4.3, we generalize this approach to table-based exponentiation algorithms. We defined a stochastic process (S n , R n ) n∈N . The random variable S n represents the (random) normalized intermediate value (= intermediate value divided by the modulus M) in Algorithm 2 after the n-th Barrett operation, and the random variable R n represents the (random) number of extra reductions needed for the n-th Barrett operation.
Algorithm 1 needs 0, 1 or 2 extra reductions. The stochastic process (S_n, R_n)_{n∈N} is a non-homogeneous Markov chain on the state space S = [0, 1) × Z_3. The projection onto the first component gives independent random variables S_1, S_2, . . ., which are uniformly distributed on the unit interval [0, 1). However, we are interested in the stochastic process R_1, R_2, . . . on Z_3, which is more difficult to analyse. In particular, E(R_n) and Var(R_n) depend on the operation type O_n of the n-th Barrett operation (multiplication M or squaring S), while the covariances Cov(R_n, R_{n+1}) depend on the operation types of the n-th and the (n + 1)-th Barrett operation (SM, MS or SS). The formulae (4), (5) and (6) provide explicit expressions for the expectations and the variances, while Lemma 7 (i), (ii) explains how to compute the covariances. Further, the stochastic process (R_n)_{n≥1} is 1-dependent. In particular, a version of the central limit theorem for dependent random variables can be applied to approximate the distribution of standardized finite sums (cf. (8)).
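The 1-dependence can be checked empirically. The Monte Carlo sketch below is our own illustration (modulus, base and sample size are arbitrary choices): it simulates chains of Barrett squarings and estimates the covariance between the extra-reduction counts of non-adjacent operations, which should be close to zero.

```python
# Empirical check that Cov(R_n, R_{n+s}) ≈ 0 for s >= 2: simulate chains of
# Barrett squarings for random bases and estimate the covariance between
# the extra-reduction counts of the 1st and 3rd squaring. Illustrative only.
import random

def barrett_mul(x, y, M, b):
    k = 1
    while b**k <= M:
        k += 1
    mu = b**(2 * k) // M
    z = x * y
    q_hat = ((z // b**(k - 1)) * mu) // b**(k + 1)
    r, er = z - q_hat * M, 0
    while r >= M:
        r -= M
        er += 1
    return r, er

random.seed(3)
b = 2**8
M = random.randrange(b**7, b**8) | 1           # an 8-digit modulus in base b
r1s, r3s = [], []
for _ in range(20000):
    x = random.randrange(M)
    ers = []
    for _ in range(3):                          # three consecutive squarings
        x, er = barrett_mul(x, x, M, b)
        ers.append(er)
    r1s.append(ers[0]); r3s.append(ers[2])

n = len(r1s)
m1, m3 = sum(r1s) / n, sum(r3s) / n
cov = sum((a - c_) * (b_ - m3) for a, b_, c_ in zip(r1s, r3s, [m1] * n)) / n
cov = sum((a - m1) * (c - m3) for a, c in zip(r1s, r3s)) / n
assert abs(cov) < 0.02                          # non-adjacent counts uncorrelated
assert all(e in (0, 1, 2) for e in r1s)
```

Adjacent counts (R_n, R_{n+1}) do carry a small correlation through the shared intermediate value; only the non-adjacent covariance vanishes in the model.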

Montgomery multiplication versus Barrett multiplication
In Sect. 3.1, we briefly treat Montgomery's multiplication algorithm (MM) [21] and summarize relevant stochastic properties. This is because in Sect. 4 we consider the question whether the (known) pure and local timing attacks against Montgomery's multiplication algorithm can be transferred to implementations that apply Barrett's algorithm.

Montgomery's multiplication algorithm in a nutshell
Montgomery's multiplication algorithm is widely used to compute modular exponentiations because it transfers modulo operations and divisions to moduli and divisors, which are powers of 2.
For an odd modulus M (e.g. an RSA modulus or a prime), the integer R := 2^t > M (the Montgomery constant) is chosen, and the Montgomery multiplication MM(a, b; M) := abR^{−1} mod M is computed with a version of Algorithm 3. Here, ws denotes the word size of the arithmetic operations (typically, depending on the platform, ws ∈ {8, 16, 32, 64}), which divides the exponent t. Further, r = 2^{ws}, so that R = r^v with v = t/ws. In Algorithm 3, the operands x, y and s are expressed in the r-adic representation. Step 4, called 'extra reduction' (ER), is carried out iff s ∈ [M, 2M). This conditional integer subtraction is responsible for timing differences and thus is the source of side-channel attacks. We have Time(MM(a, b; M)) = c + w · c_ER with w ∈ {0, 1}, (9) which means that an MM operation costs time c if no ER is needed, and c_ER equals the time for an ER. (The constants c and c_ER depend on the concrete implementation.)
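A condensed one-word sketch (v = 1, rather than the word-wise loop of Algorithm 3) illustrates the extra reduction; the code and its names are ours, and the constant choices are illustrative.

```python
# Sketch of Montgomery multiplication MM(a, b; M) = a*b*R^{-1} mod M with
# the conditional final subtraction (extra reduction, ER). One-word variant
# (v = 1); Algorithm 3 computes the same value word by word.

def mont_mul(a, b, M, R):
    """Return (a*b*R^{-1} mod M, er) with er = 1 iff an ER occurred."""
    assert M % 2 == 1 and R > M and R & (R - 1) == 0   # R = 2^t > M
    M_prime = (-pow(M, -1, R)) % R        # -M^{-1} mod R
    u = (a * b * M_prime) % R
    s = (a * b + u * M) // R              # exact division; s lies in [0, 2M)
    if s >= M:                            # the extra reduction
        return s - M, 1
    return s, 0

M, R = 2**31 - 1, 2**32                   # illustrative odd modulus
a, b = 123456789, 987654321
res, er = mont_mul(a, b, M, R)
assert res == (a * b * pow(R, -1, M)) % M
assert er in (0, 1)
```

Since a·b + u·M ≡ 0 (mod R) by construction, the division by R is exact, and s < 2M, so a single conditional subtraction suffices.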

Assumption 1 (Montgomery modular multiplication) For fixed modulus M and fixed Montgomery constant R,
Justification of Assumption 1: (See [26], Remark 1, for a comprehensive analysis.) For known-input attacks (with more or less randomly chosen inputs), Assumption 1 should usually be valid. An exception is pure timing attacks on RSA with CRT implementations in old versions of OpenSSL [3,7], cf. Sect. 4.2. The reason is that OpenSSL applies different subroutines to compute the for-loop in Algorithm 3, depending on whether x and y have identical word size or not. The aforementioned timing attacks on RSA with CRT are adaptive chosen-input attacks, and during the attack certain MM operands become smaller and smaller. This feature makes the attack somewhat more complicated but does not prevent it because new sources of timing differences occur. RSA implementations on smart cards and microcontrollers usually do not need to care about word lengths (and meet Assumption 1 in any case) because in normal use operands with different word sizes rarely occur, so that an optimization for this case seems to be useless.
In the following, we summarize some well-known fundamental stochastic properties of Montgomery's multiplication algorithm, or more precisely, of the distribution of random extra reductions within a modular exponentiation algorithm. Their knowledge is needed to develop (effective and efficient) pure or local timing attacks [3,7,22-28].
We interpret the normalized intermediate values of Algorithm 4 as realizations of random variables S_0, S_1, . . .. With the same arguments as in Sect. 2.2 (for Barrett's multiplication), one concludes that for Algorithm 4 the random variables S_1, S_2, . . . are iid uniformly distributed on [0, 1). We set w_i = 1 if the i-th Montgomery operation requires an ER and w_i = 0 otherwise. We interpret the values w_1, w_2, . . . as realizations of {0, 1}-valued random variables W_1, W_2, . . .. Interestingly, whether an ER is necessary does not depend on the word size ws but only on (a, b, M, R). This allows us to consider the case ws = t (i.e. v = 1) when analysing the stochastic behaviour of the random variables W_i in modular exponentiation algorithms. In particular, the computation of MM(a, b; M) requires an ER iff MM(a, b; M)/M < (a/M) · (b/M) · (M/R). (10) This observation allows to express the random variable W_i in terms of S_{i−1} and S_i. For Algorithm 4, this implies the stochastic representation (11). The random variables W_1, W_2, . . . have interesting properties which are similar to those of R_1, R_2, . . .. In particular, they are neither stationary nor independent but 1-dependent, and under weak assumptions they fulfil a version of the central limit theorem for dependent random variables. Relation (11) allows to represent joint probabilities of the W_i.
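The characterization (10) can be verified numerically: in integer form it reads "an ER occurs iff MM(a, b; M) · R < a · b". The check below is our own sketch (modulus and sample count are arbitrary choices), reusing a one-word Montgomery multiplication.

```python
# Numerical check of the ER characterization (10): MM(a, b; M) requires an
# extra reduction iff MM(a,b;M)/M < (a/M)*(b/M)*(M/R), i.e. in exact integer
# arithmetic iff MM(a,b;M)*R < a*b. Illustrative code only.
import random

def mont_mul(a, b, M, R):
    M_prime = (-pow(M, -1, R)) % R
    u = (a * b * M_prime) % R
    s = (a * b + u * M) // R
    return (s - M, 1) if s >= M else (s, 0)

random.seed(4)
M, R = 2**61 - 1, 2**64                   # odd modulus, R = 2^t > M
for _ in range(5000):
    a, b = random.randrange(M), random.randrange(M)
    res, er = mont_mul(a, b, M, R)
    assert er == (res * R < a * b)        # the claimed characterization
```

Note that the comparison involves only integers, so the check is exact; no floating-point rounding is involved.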

A closer look at Barrett's multiplication algorithm
In this subsection, we develop and justify equivalents to Assumption 1 for two (different) Barrett multiplication algorithms (Algorithm 1 and Algorithm 5). Therefrom, we deduce stochastic representations, which describe the random timing behaviour of modular exponentiations y → y d mod M.

Modular exponentiation with Algorithm 1
At first we formulate an equivalent to Assumption 1.

Assumption 2 (Barrett modular multiplication) For fixed modulus M,
Here, the 'setup time' t_set summarizes the time needed for all operations that are not part of Algorithm 1, e.g. the time needed for input and output and maybe for the computation of the constant μ (if it is not stored). The random variable N quantifies the 'timing noise'. This includes measurement errors and possible deviations from Assumption 2. We assume that N ∼ N(0, σ²). We allow σ² = 0, which means 'no noise' (i.e. N ≡ 0), while a nonzero expectation E(N) is 'moved' to t_set. The data-dependent timing differences are quantified by the stochastic process R_1, R_2, . . . , R_{ℓ+ham(d)−2}, which was thoroughly analysed in Sect. 2. Recall that the distribution of this stochastic process depends on the secret exponent d and on the ratio t = y/M.

Modular exponentiation with Algorithm 5
Algorithm 5 is a modification of Algorithm 1 containing an optimization which was already proposed by Barrett [4]. Its Lines 3-6 substitute Line 3 of Algorithm 1. We may assume b = 2^{ws} > 2, where ws is the word size of the integer arithmetic (typically, ws ∈ {8, 16, 32, 64}). Line 3 of Algorithm 1 computes a multiplication q̂ · M of two integers which are in the order of b^k, and a subtraction of two integers which are in the order of b^{2k}. In contrast, Line 3 of Algorithm 5 only requires a modular multiplication (q̂ · M) mod b^{k+1}, the subtraction of two integers in the order of b^{k+1}, and Line 5 possibly one addition of b^{k+1}.
By Lemma 1, the value r̂ after Line 3 of Algorithm 1 and the value r̂ after Line 6 of Algorithm 5 coincide, which means that the number of extra reductions is the same for both algorithms. If c_add denotes the time needed for an addition of b^{k+1} in Line 5 of Algorithm 5, this leads to an equivalent of Assumption 2.
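The claimed equivalence of the two variants can be checked with a sketch of our own (function names and the reduction mod b^{k+1} logic mirror the description above; they are not the paper's code):

```python
# Sketch comparing Algorithm 1 with the optimized variant (Algorithm 5):
# the optimized lines compute r_hat only modulo b^(k+1) and conditionally
# add back b^(k+1); remainder and extra-reduction count must coincide.
import random

def _params(M, b):
    k = 1
    while b**k <= M:
        k += 1
    return k, b**(2 * k) // M

def barrett_mul(x, y, M, b):
    k, mu = _params(M, b)
    z = x * y
    q_hat = ((z // b**(k - 1)) * mu) // b**(k + 1)
    r, er = z - q_hat * M, 0
    while r >= M:
        r -= M; er += 1
    return r, er

def barrett_mul_opt(x, y, M, b):
    k, mu = _params(M, b)
    z = x * y
    q_hat = ((z // b**(k - 1)) * mu) // b**(k + 1)
    bk1 = b**(k + 1)
    r = (z % bk1) - ((q_hat * M) % bk1)    # only low-order digits needed
    extra_add = 0
    if r < 0:                              # the extra addition of b^(k+1)
        r += bk1; extra_add = 1
    er = 0
    while r >= M:
        r -= M; er += 1
    return r, er, extra_add

random.seed(5)
b = 2**16
for _ in range(1000):
    M = random.randrange(b**3, b**4)
    x, y = random.randrange(M), random.randrange(M)
    r1, er1 = barrett_mul(x, y, M, b)
    r2, er2, _ = barrett_mul_opt(x, y, M, b)
    assert (r1, er1) == (r2, er2)          # same remainder, same ER count
```

The reduction mod b^{k+1} is lossless here because the true value z − q̂M lies in [0, 3M) ⊂ [0, b^{k+1}) for b ≥ 3, which matches the assumption b = 2^{ws} > 2.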

Assumption 3 (Barrett modular multiplication, optimized)
For fixed modulus M, for all a, b ∈ Z M , which means that a Barrett multiplication (BM) with Algorithm 5 (without extra reductions or an extra addition by b k+1 ) costs time c, while c ER and c add equal the time for one ER or for an addition by b k+1 , respectively. (The constants c, c ER and c add depend on the concrete implementation.) When implemented on the same platform, the constant c in (15) should be smaller than in (13).

Justification of Assumption 3:
The justification is rather similar to that of Assumption 2. The relevant arguments have already been discussed at the beginning of this subsection. This also concerns the general remarks on the impact of possible optimizations of the integer multiplication algorithm.
The if-condition in Line 4 of Algorithm 5 introduces an additional source of timing variability, which has to be analysed. We have already explained that this if-condition does not affect the number of extra reductions. Next, we determine the probability of an extra addition of b^{k+1}. Let x, y ∈ Z_M and let z = x · y. We denote by r_a the number of extra additions of b^{k+1} in the computation of z mod M. Obviously, r_a = 1 iff z mod b^{k+1} < (q̂ · M) mod b^{k+1} (and r_a = 0 otherwise), or equivalently, when dividing both terms by b^{k+1}, iff the corresponding fractional parts satisfy the analogous inequality. Let (z_{2k−1}, . . . , z_0) denote the b-ary representation of z, where leading zero digits are permitted. We now assume that Algorithm 5 is applied in the modular exponentiation algorithm Algorithm 2. In analogy to the extra reductions, we interpret the number of extra additions of b^{k+1} in the (n + 1)-th Barrett operation, denoted by r_{a;n+1}, as a realization of a {0, 1}-valued random variable R_{a;n+1}. In particular, (x · y) mod M either represents a squaring (x = y) or a multiplication of the intermediate value x by the basis y, respectively. As in Stochastic Model 2, we model v_1 = (xy)/M² as a realization of S_n² (squaring) or S_n·t (multiplication by the basis y) with t = y/M, respectively, and v_2 as a realization of the random variable U_{n+1}, which is uniformly distributed on [0, 1). With the same argumentation as for v_2, we model v_{n+1} := v_3 as a realization of a random variable V_{n+1} that is uniformly distributed on [0, 1). By (18) and (19), the values u′, v_1, v_2 and v_3 essentially depend on z_k (resp., on (z_k, z_{k−1}) if ws is small), on the most significant digits of z in the b-ary representation, on z_{k−2} (resp., on (z_{k−2}, z_{k−3}) if ws is small), and on the weighted sum of the b-ary digits z_{2k−1}, . . . , z_{k−1}, respectively. This justifies the assumption that the random variables U′_{n+1}, S_n, U_{n+1}, V_{n+1} (essentially) behave as if they were independent.
We could extend the non-homogeneous Markov chain (S_n, R_n)_{n∈N} on the state space [0, 1) × Z_3 from Sect. 2 to a non-homogeneous Markov chain (S_n, R_n, R_{a;n})_{n∈N} on the state space [0, 1) × Z_3 × Z_2. Its analysis is analogous to that in Sect. 2 but more complicated in detail. Since for typical word sizes ws the impact of the extra additions on the execution time is an order of magnitude smaller than the impact of the extra reductions (due to the factor γ, cf. (20) and (21)), we do not elaborate on this issue. We only mention the formulae for Pr(R_{a;n+1} = 1) for squarings and, analogously, for multiplications (cf. (20) and (21)). For the reason mentioned above, we treat the extra additions as noise. Equation (22) is the equivalent to (14) for the modified version of Barrett's multiplication algorithm. We use the same notation as in (14).

Algorithm 2 & Algorithm 5:
The expected time for all extra additions has been moved to the setup time, and the variance became part of N*. While the formula for the expectation is exact, we used a coarse approximation for the variance which neglects any dependencies. We just mention that R_{n+1} and R_{a;n+1} are positively correlated. This follows from the fact that, apart from the factor γ, the terms αγS_n², αγS_n·t and βγU_{n+1} in (20) and (21) coincide with terms in (3), and that R_{a;n+1} and R_{n+1} are both 'large' if the corresponding terms in (20), (21) and (3) are 'large'.
Remark 6 (i) The stochastic representations (14) and (22) are essentially identical although (22) has slightly larger noise. (ii) The ratio c add /c ER depends on the implementation.

Special values for α and β
The stochastic behaviour of Barrett's multiplication algorithm depends on α and β. In particular, α has significant impact on most of the attacks discussed in Sect. 4. In this subsection, we briefly analyse the extreme cases α ≈ 0 and β ≈ 0.
The condition k = ⌊log_b M⌋ + 1 (cf. Algorithm 1 and Algorithm 5) implies b^{k−1} ≤ M < b^k. Now assume that b^k/2 < M < b^k, which is typically fulfilled if the modulus M has 'maximum length' (e.g. 1024 or 2048 bits). Then, β = b^{k−1}/M ∈ (b^{−1}, 2b^{−1}). If b = 2^{ws} with ws ≫ 1 (e.g. ws ≥ 16), then β ≈ 0. In this case, one may neglect the term 'βU' in (2), accepting a slight inaccuracy of the stochastic model.
Going the next step, cancelling 'βU_{n+1}' in (3) simplifies Stochastic Model 2, as (3) can be rewritten as R_{n+1} = 1_{S_{n+1} < αS_n²} (squaring) and R_{n+1} = 1_{S_{n+1} < αS_n t} (multiplication), respectively. This representation is equivalent to the Montgomery case (11), simplifying the computation of the variances, covariances and the probabilities in Lemma 7 (i) considerably (as for the Montgomery multiplication). The Barrett-specific features and difficulties yet remain.
More generally, intermediate cases can be treated with the same strategy as in (23) and (24). The impact of α ≈ 0 and β ≈ 0 on the attacks in Sect. 4 will be discussed in Sect. 4.4.

A short summary
The stochastic process R 1 , R 2 , . . . is the equivalent to W 1 , W 2 , . . . (Montgomery multiplication). Both stochastic processes are 1-dependent. Hence, it is reasonable to assume that attacks on Montgomery's multiplication algorithm can be transferred to implementations which use Barrett's multiplication algorithm. In Sect. 4, we will see that this is indeed the case.
However, for Barrett's multiplication algorithm additional problems arise. In particular, there is no equivalent to the characterization (10), which allows to directly analyse the stochastic process W_1, W_2, . . .. For Barrett's algorithm, a 'detour' via the two-dimensional Markov process (S_i, R_i)_{i∈N} is necessary. Moreover, for Montgomery's multiplication algorithm, the respective integrals can be computed much more easily than for Barrett's algorithm since simple closed formulae exist. If β ≈ 0, the evaluation of the integrals becomes easier (as for Montgomery's algorithm), and if α ≈ 0, the computations become very simple. For CRT implementations, the parameter estimation is more difficult for Barrett's multiplication algorithm than for Montgomery's algorithm. We return to these issues in Sect. 4.

Timing attacks against Barrett's modular multiplication
The conditional extra reduction in Montgomery's multiplication algorithm is the source of many timing attacks and local timing attacks [

Timing attacks on RSA without CRT and on DH
In this subsection, we assume that M is an RSA modulus or the modulus of a DH group (i.e. of a subgroup of F*_M) and that y^d mod M is computed with Algorithm 2, where d = (d_{ℓ−1}, …, d_0)_2 is a secret exponent. Blinding techniques are not applied. We transfer the attack from Sect. 6 in [24] to Barrett's multiplication algorithm and extend it by a look-ahead strategy.
The attacker (or evaluator) measures the execution times t_j = Time(y_j^d mod M) for j = 1, …, N for known bases y_j. The t_j may be noisy (cf. (14)). Moreover, we assume that the attacker knows (or has estimated) c and c_ER. (Sect. 6 in [29] explains a guessing procedure for Montgomery's multiplication algorithm.) In a pre-step, the sum ℓ + ham(d) can be estimated in a straightforward way. We may assume that the attacker knows ℓ and thus also ham(d). (If necessary, the attack could be restarted with different candidates for ℓ. However, apart from its end the attack is robust against small deviations from the correct value ℓ.) At the beginning, the attacker subtracts the data-independent terms t_set and (ℓ + ham(d) − 2)c from the timings t_j and divides the differences by c_ER, yielding the 'discretized' execution times

The attack strategy is to guess the exponent bits d_{ℓ−1} = 1, d_{ℓ−2}, … one after another. For the moment, we assume that the guesses d_{ℓ−1} = 1, d_{ℓ−2}, …, d_{k+1} have been correct. Now we focus on the guessing procedure for the exponent bit d_k. Currently, Algorithm 2 'halts' before the if-statement (for i = k), so that k squarings and m (calculated from ham(d) and the guesses d_{ℓ−1}, …, d_{k+1}) multiplications still have to be carried out. On the basis of the previous guesses, the attacker computes the intermediate values x_j, and the numbers of extra reductions needed for the squarings and multiplications executed so far are subtracted from the t_{d,j}, yielding the discretized remaining execution times t_{drem,j}. In terms of random variables, this reads

Recall that the distribution of those R_i which belong to multiplications depends on the basis y_j; see, for example, the stochastic representation (3).
To optimize our guessing strategy, we apply statistical decision theory. We point the interested reader to [25], Sect. 2, where statistical decision theory is introduced in a nutshell and the presented results are tailored to side-channel analysis. In the following, Θ := {0, 1} denotes the parameter space, where θ ∈ Θ corresponds to the hypothesis d_k = θ.
We may assume that the probability that the exponent bit d_k equals θ is approximately 0.5. (In the case of RSA, d_0 = 1, and for indices k close to log_2(n) the exponent bits may be biased.) More formally, if we view d_k, d_{k−1}, … as realizations of iid uniformly {0, 1}-distributed random variables Z_k, Z_{k−1}, …, we obtain the a priori distribution

To guess the next exponent bit d_k, we employ a look-ahead strategy. For look-ahead depth λ ∈ N_{≥1}, the decision for exponent bit d_k is based on information obtained from the next λ exponent bits. As the attacker knows the intermediate value x_j (of sample j), he is able to determine the number of extra reductions needed to process the λ exponent bits d_k, …, d_{k−λ+1} for each of the 2^λ admissible values ρ = (ρ_0, …, ρ_{λ−1}) ∈ {0, 1}^λ. This yields the discretized time needed to process the left-over exponent bits d_{k−λ}, …, d_0, which is fictional except for the correct vector ρ.
In this subsection, t_{ρ,j} denotes the number of extra reductions required to process the next λ exponent bits for the basis y_j if (d_k, …, d_{k−λ+1}) = ρ. For these computations, λ modular squarings and ham(ρ) modular multiplications by y_j are performed. Furthermore, t_{drem,j} − t_{ρ,j} may be viewed as a realization of

By (8), Lemma 6 (ii), and Lemma 7 (ii), (iii), this random variable is approximately N(e_{ρ,j}, v_{ρ,j})-distributed.
We define the observation space Ω := (Ω')^N consisting of vectors ω = (ω_1, …, ω_N) of timing observations ω_j for 1 ≤ j ≤ N. For the remainder of this subsection, we denote the Lebesgue density of N(e_{ρ,j}, v_{ρ,j}) by f_{ρ,j}(·). The joint distribution of all N traces is given by the product density f_{ρ,1}(·) ⋯ f_{ρ,N}(·) with the arguments t_{drem,1} − t_{ρ,1}, …, t_{drem,N} − t_{ρ,N}. For λ = 1, we have ρ = ρ_0 = θ, and for hypothesis d_k = θ the distribution of the discretized computation time needed for the left-over exponent bits d_{k−λ}, d_{k−λ−1}, … has the product density ∏_{j=1}^N f_{θ,j}(·). If λ > 1, the situation is more complicated. More precisely, for hypothesis θ ∈ Θ the distribution of the left-over time is given by a convex combination of normal distributions with density f_θ: Ω → R given by

where the coefficients μ_ρ are given by

Finally, we choose A = Θ as the set of alternatives and consider the loss function

i.e. we penalize the two wrong decisions ('0' instead of '1', '1' instead of '0') equally since all forthcoming guesses then are useless. We obtain the following optimal decision strategy for look-ahead depth λ.

Decision Strategy 1 (Guessing d k )
Let ω = (ω_1, …, ω_N) be a vector of N timing observations ω_j ∈ Ω' as in (27). Then, the indicator function is an optimal strategy (Bayes strategy) against the a priori distribution η.
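For look-ahead depth λ = 1 and the uniform prior η, the Bayes strategy reduces to a likelihood-ratio test between two products of Gaussian densities. The following sketch illustrates this special case; the interface (`means[theta][j]`, `variances[theta][j]` for e_{θ,j} and v_{θ,j}) is a hypothetical one of our choosing:

```python
import math

def guess_bit(residual_times, means, variances):
    """Sketch of Decision Strategy 1 for look-ahead depth lambda = 1 and
    uniform prior: choose theta in {0, 1} maximizing the product of the
    Gaussian likelihoods N(e_{theta,j}, v_{theta,j}) of the discretized
    remaining execution times (hypothetical interface)."""
    def log_lik(theta):
        total = 0.0
        for t, e, v in zip(residual_times, means[theta], variances[theta]):
            # log of the normal density with mean e and variance v at t
            total += -0.5 * math.log(2 * math.pi * v) - (t - e) ** 2 / (2 * v)
        return total
    return 1 if log_lik(1) > log_lik(0) else 0
```

Working in log-likelihoods avoids numerical underflow of the product density for large N.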
For look-ahead depth λ = 1, Decision Strategy 1 is essentially equivalent to Theorem 6.5 (i) in [24]. A wrong bit guess leads to wrong intermediate values x_1, …, x_N and therefore to values t_{ρ,j} that are not correlated to the correct numbers of extra reductions. However, the situation is not symmetric in 0 and 1 because for d_k = 0 one uncorrelated term and for d_k = 1 two uncorrelated terms are subtracted. In [28] (look-ahead depth λ = 1), an efficient three-option error detection and correction strategy was developed for Montgomery's multiplication algorithm, which made it possible to reduce the number of attack traces by ≈ 40%. We do not develop an equivalent strategy for Barrett's multiplication algorithm but apply a dynamic look-ahead strategy instead. This is much more efficient, as we will see in Sect. 5.1. To the best of our knowledge, this look-ahead strategy is new, apart from the fact that the idea was very roughly sketched in [24], Remark 4.1.

Timing attacks on RSA with CRT
The references [3,7,22] introduce, analyse, or improve timing attacks on RSA implementations which use the CRT and Montgomery's multiplication algorithm, including the square & multiply exponentiation algorithm and table-based exponentiation algorithms. Moreover, these attacks can be extended to implementations which are protected by exponent blinding [26,27].
Unless stated otherwise, we assume in this subsection that RSA with CRT applies Algorithm 6 with Barrett's multiplication (Algorithm 1). Let n = p_1 p_2 be an RSA modulus, let d be a secret exponent, and let y ∈ Z_n be a basis. We set y^{(i)} := y mod p_i and d^{(i)} := d mod (p_i − 1) for i = 1, 2. For y ∈ Z_n, let T(y) := Time(y^d mod n).
Let ν := ⌊log_2 n⌋ + 1 be the bit-length of n. We may assume that p_1, p_2 have bit-length ≈ ν/2 and that d^{(1)}, d^{(2)} have Hamming weight ≈ ν/4. From (4), we obtain E(T(y)) ≈ t_set + 2c, where t_i := y^{(i)}/p_i. Now assume that 0 < u_1 < u_2 < n with u_2 − u_1 ≪ p_1, p_2. Three cases are possible:

By (31), we conclude

For RSA without CRT, the parameters α and β can easily be calculated, while for RSA with CRT, the parameters α_1, α_2, β_1, β_2 are unknown and thus need to be estimated. The parameter β_i is not sensitive against small deviations of p_i and can be approximated in advance. However, this estimate can be improved at the end of attack phase 1 below because then more precise information on p_1 and p_2 is available. We mention that in the context of this timing attack the knowledge of β_1 and β_2 is only relevant for estimating Var(T(y)), which allows the attacker to determine an appropriate sample size for the attack steps. Unlike β_i, the second term {b^{2k}/p_i} of α_i = (p_i^2/b^{2k}){b^{2k}/p_i}, and thus α_i itself, is very sensitive against deviations of p_i since p_i ≪ b^{2k}.
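The formula for α_i can be evaluated exactly with integer arithmetic, since {b^{2k}/p} = (b^{2k} mod p)/p and hence α = p·(b^{2k} mod p)/b^{2k}. A sketch under the assumption that k is the number of base-b digits of p (matching k = ⌊log_b p⌋ + 1):

```python
def barrett_alpha(p, b=2):
    """Compute alpha = (p^2 / b^(2k)) * frac(b^(2k) / p) exactly with
    integers, where k = floor(log_b p) + 1 is the base-b digit length of p
    (our reading of the formula for alpha_i in the text)."""
    k = 0
    t = p
    while t > 0:        # k = number of base-b digits of p
        t //= b
        k += 1
    B = b ** (2 * k)
    # p * (B mod p) / B == (p^2 / B) * ((B mod p) / p)
    return p * (B % p) / B
```

Evaluating `barrett_alpha` for nearby candidate primes illustrates the sensitivity noted above: the fractional part {b^{2k}/p} changes erratically under small changes of p.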
In the remainder of this subsection, we assume p_1 < p_2 < 2p_1, i.e. that p_1, p_2 have bit-length ≈ ν/2. It follows that the interval I_1 := (√(n/2), √n) contains p_1 but no multiple of p_2 and that the interval I_2 := (√n, √(2n)) contains p_2 but no multiple of p_1. (In the general case, we would have to guess r ∈ N_{≥2} such that (r − 1)p_1 < p_2 < r p_1. Then, I_1 := (√(n/r), √(n/(r − 1))) contains p_1 but no multiple of p_2, and I_2 := (√((r − 1)n), √(rn)) contains p_2 but no multiple of p_1.) Let u_0 := √(n/2) < u_1 < … < u_h := √n be approximately equidistant integers in I_1 and let u'_0 := √n < u'_1 < … < u'_h := √(2n) be approximately equidistant integers in I_2, where h ∈ N is a small constant (say, h = 4). Further, define

The goal of attack phase 1 is to identify j, j' such that

The selection of j and j' follows from the quantitative interpretation of (32). If α_i is small but α_{3−i} is significantly larger, the decision (for j, resp. for j') in attack phase 1 might be incorrect, but this is of minor importance since attack phase 2 searches p_{3−i} anyway. If both α_1 and α_2 are small, the efficiency of the attack is low anyway. To be on the safe side, one may then repeat phase 1 with a larger sample size N_1. Moreover, (33) allows to check the selection of j and j'.
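The phase-1 grid construction for the case r = 2 can be sketched with integer square roots as follows (function name and the use of `isqrt` as the rounding convention are ours):

```python
from math import isqrt

def phase1_intervals(n, h=4):
    """Sketch of the phase-1 setup for p_1 < p_2 < 2*p_1: the intervals
    I_1 = (sqrt(n/2), sqrt(n)) and I_2 = (sqrt(n), sqrt(2n)), each covered
    by h + 1 approximately equidistant integers u_0 < ... < u_h."""
    def grid(lo, hi):
        return [lo + (hi - lo) * i // h for i in range(h + 1)]
    return (grid(isqrt(n // 2), isqrt(n)),
            grid(isqrt(n), isqrt(2 * n)))
```

For a toy modulus n = p_1 p_2 with p_1 < p_2 < 2p_1, the first grid brackets p_1 and the second grid brackets p_2.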
(3) While log_2(u_2 − u_1) > ν/4 − 6, do the following: set u_3 := ⌊(u_1 + u_2)/2⌋ and apply the decision rule; if it indicates Case A, then set u_2 ← u_3 (the attacker believes that Case A is correct); else set u_1 ← u_3 (the attacker believes that Case B_i is correct).
The decision rule follows from (32) (Case A vs. Case B_i). After phase 2, more than half of the upper bits of u_1 and u_2 coincide, which yields more than half of the upper bits of p_i (more precisely, ≈ ν/4 + 6 bits). This enables attack phase 3.
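The skeleton of attack phase 2 is a plain bisection. In the sketch below, `contains_p` is a hypothetical stand-in for the timing-based decision rule (in the real attack it would compare mean execution times as in (32)); the mapping of the two cases to the interval halves follows the rule stated above:

```python
def phase2_bisect(u1, u2, contains_p, stop_bits):
    """Repeatedly halve the interval (u1, u2] known to contain p_i until
    its length has at most `stop_bits` bits. `contains_p(lo, hi)` is a
    stand-in oracle for the timing decision (Case A vs. Case B_i)."""
    while (u2 - u1).bit_length() > stop_bits:
        u3 = (u1 + u2) // 2
        if contains_p(u1, u3):
            u2 = u3   # attacker believes Case A: p_i lies in the lower half
        else:
            u1 = u3   # attacker believes Case B_i: p_i lies in the upper half
    return u1, u2
```

With a perfect oracle, p_i stays inside the interval throughout, and the final interval fixes the corresponding upper bits of p_i.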
Of course, all decisions in attack phase 2 (including the initial choice of u_1 and u_2) need to be correct. However, it is very easy to verify from time to time whether all decisions in attack phase 2 have been correct so far, or equivalently, whether the current interval (u_1, u_2) indeed contains p_i. If the timing difference computed for this check is < −(1/16)·ν·α·c_ER, then (u_1, u_2) becomes a 'confirmed interval'. Otherwise, the attack goes back to the preceding confirmed interval (u_{1;c}, u_{2;c}) and restarts with values in the neighbourhood of u_{1;c} and u_{2;c} which have not been used before when the attack was at this stage.

Remark 8 (i) Similarities to Montgomery's multiplication algorithm. By (4), the expected number of extra reductions needed for a multiplication by y_i := y mod p_i is an affine function in t_i = y_i/p_i. (For Montgomery's multiplication algorithm, it is a linear function in (yR mod p_i)/p_i, cf. (12).) As for Montgomery's multiplication algorithm, (31) allows to decide whether an interval contains the prime p_1 or p_2 and finally to factorize the RSA modulus n.
(ii) Differences to Montgomery's multiplication algorithm. If y_1 < p_i < y_2, the expectation E(T(y_1) − T(y_2)) is linear in α_i, which is very sensitive to variations of p_i. Consequently, the attack efficiency may differ considerably depending on whether the attacker targets p_1 or p_2. This is unlike Montgomery's multiplication algorithm, where the corresponding expectation is linear in p_i/R ≈ √n/R. As a consequence, attack phase 1 is very different in the two cases, depending on whether the targeted implementation applies Barrett's or Montgomery's multiplication algorithm.
It should be noted that this timing attack against Barrett's multiplication algorithm can be adapted to fixed-window exponentiation and sliding-window exponentiation and also works against exponent blinding. For table-based methods, the timing difference in (32) becomes smaller, while exponent blinding causes large algorithmic noise. In both cases, the parameters N_1 and N_2 must be chosen considerably larger, which of course lowers the efficiency of the timing attack. This is rather similar to timing attacks on Montgomery's multiplication algorithm [26,27].

Local timing attacks
Unlike for the 'pure' timing attacks discussed in Sects. 4.1 and 4.2, we now assume that a potential attacker is not only able to measure the overall execution time but also the timing of each squaring and multiplication, which means that he knows the number of extra reductions. This may be achieved by power measurements. In [2], an instruction cache attack was applied against Montgomery's multiplication algorithm. The task of a spy process was to detect when a particular routine from the BIGNUM library is applied which is only used to calculate the extra reduction. This approach may not be directly applicable against Barrett's multiplication algorithm because here more than one extra reduction is possible.
In this subsection, we assume that fixed-window exponentiation is applied and that basis blinding (introduced in [19], Sect. 10, a.k.a. message blinding) is applied as a security measure. Algorithm 7 updates the blinding values to prevent an attacker from calibrating an attack to fixed blinding values. Our attack works against RSA without CRT as well as against RSA with CRT. The papers [1,2,14,23] consider several modular exponentiation algorithms with Montgomery's multiplication algorithm.

RSA without CRT and DH
In this subsection, we assume that y^d mod M is computed with Algorithm 7. The exponentiation phase starts in Step 6. Analogously to Algorithm 2 (left-to-right exponentiation), the exponentiation phase of Algorithm 7 can be described by a sequence of operations

where M_θ stands for a multiplication by the table entry u_θ. Since one multiplication by some table entry u_j follows each run of w squarings, it remains to guess the operations O_{j(w+1)} ∈ {M_0, …, M_{2^w−1}} for j = 1, 2, …
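The regular structure of this operation sequence can be sketched as follows; the handling of the leading window (here treated as the initial table lookup) varies between implementations, so this is an illustrative simplification:

```python
def operation_sequence(d, w):
    """Operation sequence of fixed-window left-to-right exponentiation:
    for each w-bit window of the exponent d (after the first one), w
    squarings 'S' followed by a multiplication 'M<theta>' by table entry
    u_theta, where theta is the window value. A multiplication by
    u_0 = 1 keeps the pattern regular even for zero windows."""
    bits = bin(d)[2:]
    bits = "0" * ((-len(bits)) % w) + bits          # pad to a multiple of w
    windows = [int(bits[i:i + w], 2) for i in range(0, len(bits), w)]
    ops = []
    for theta in windows[1:]:                        # first window: table lookup
        ops += ["S"] * w + [f"M{theta}"]
    return ops
```

With this numbering, operation w + 1 is the first multiplication, matching the operations O_{j(w+1)} the attacker has to guess.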
The conditional density h … However, the attacker does not know the ratios t_1, …, t_{2^w−1}, and in Sect. 4.3.2 he additionally does not even know the moduli p_1 and p_2. Anyway, the table initialization phase is the source of our attack since there the attacker knows the type of the operations. In analogy to the exponentiation phase, we formulate a stochastic process (S_j, R_j)_{1≤j≤2^w−1} on the state space [0, 1) × Z_3, where S_j corresponds to the normalized table entry t_j := u_j/M. This again defines a two-dimensional Markov process, and S_1, …, S_{2^w−1} are iid uniformly distributed on [0, 1). This leads to Lemma 11.
Remark 9 (i) Similarities to Montgomery's multiplication algorithm. For both Barrett's and Montgomery's multiplication algorithm, the joint probabilities can be expressed by integrals over high-dimensional unit cubes. In both cases, the type of the first multiplication (operation w + 1) has to be guessed exhaustively. (ii) Differences to Montgomery's multiplication algorithm.

RSA with CRT
In this subsection, we assume that y^d mod n is computed with Algorithm 8. Algorithm 8 calls Algorithm 7 twice, which applies basis blinding. (Alternatively, the basis blinding could also be moved to a pre-step 0 and to Step 6 of Algorithm 8, but this would not affect the attack.) Here, the situation for the attacker is even less favourable than in Sect. 4.3.1 because not only the blinded basis, and thus the table entries, are unknown but also the moduli p_1 and p_2. However, apart from additional technical difficulties, our attack still applies. We first note that it suffices to recover d^{(1)} or d^{(2)}. In fact, if x = y^d mod n, then gcd(x − y^{d^{(i)}} mod n, n) = p_i for i = 1, 2 (cf. [2], eq. (1)). If necessary, the attacker may construct such a pair, e.g. by setting y := x'^e mod n and x := x' for some x' ∈ Z_n.
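The gcd observation follows from Fermat's little theorem: y^{d^{(i)}} ≡ y^d (mod p_i), so p_i divides x − y^{d^{(i)}} mod n, while the difference is generically non-zero modulo the other prime. A toy illustration (the parameters below are ours and far too small for real RSA):

```python
from math import gcd

# Toy parameters (hypothetical, illustration only)
p1, p2 = 1009, 1013
n = p1 * p2
d = 123457
y = 7

x = pow(y, d, n)            # x = y^d mod n
d1 = d % (p1 - 1)           # d^(1) = d mod (p_1 - 1)
# y^(d^(1)) = y^d mod p_1 by Fermat, so p_1 divides the difference:
recovered = gcd((x - pow(y, d1, n)) % n, n)
```

Hence recovering a single CRT exponent d^{(i)} already factorizes n.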
The plan is to apply the attack method from the preceding subsection to the modular exponentiation mod p_1 or mod p_2. First, similarly as in Sect. 4.2, we estimate α_1, β_1, α_2, β_2. To estimate α_i and β_i, we consider all k_S squarings in the modular exponentiation mod p_i, beginning with operation w + 2. We neglect all multiplications by 1, which can be identified by the fact that no extra reductions can occur, so that k_M relevant multiplications remain. Counting the extra reductions in the relevant squarings and multiplications over all N exponentiations gives numbers n_S and n_M, respectively. We may assume that for θ > 0 the normalized table entries behave like realizations of iid random variables which are uniformly distributed on [0, 1). Thus, we may assume that the average of all normalized table entries over all N samples is ≈ 0.5. From (4), we thus conclude

Replacing ≈ by = and solving the linear equations provides estimates for α_i and β_i. The attacker focuses on the exponentiation mod p_1 iff the estimate of α_1 exceeds that of α_2.
Remark 10 (i) Similarities to Montgomery's multiplication algorithm. As for Montgomery's multiplication algorithm, the attack remains feasible if the CRT is applied. (ii) Differences to Montgomery's multiplication algorithm.
Unlike for Montgomery's multiplication algorithm, the choice whether the modular exponentiation mod p_1 or mod p_2 is attacked may have a significant impact on the attack's efficiency. This is similar to the situation in Sect. 4.2.
For β ≈ 0, all of the aforementioned attacks remain feasible. The attacks in Sect. 4.2 and Sect. 4.3 exploit differences between multiplications with different factors, which are caused by different α-values. Since β ≈ 0 reduces the part of the variance that does not depend on the particular factor, β ≈ 0 there leads to even more efficient attacks, while it has little relevance for the attack in Sect. 4.1.

Experiments
In this section, we report experimental results for the attacks presented in Sect. 4. In each experiment, we used simulations of the exponentiation algorithms, which return the exact number of extra reductions needed by the multiplications and squarings, either cumulatively (pure timing attacks) or per operation (local timing attacks). This represents the ideal (noise-free) case (accurate timing measurements, no timing noise from other operations). Of course, non-ideal scenarios would require larger sample sizes.
We performed timing attacks on RSA without CRT and on Diffie-Hellman, exemplarily on 512-bit RSA moduli and on a 3072-bit DH group, and for RSA with CRT we considered 1024-bit and 3072-bit moduli. Finally, we carried out local timing attacks on 512-bit RSA moduli and on a 3072-bit DH group. Our experiments confirmed the theoretical results. Of course, 512-bit RSA moduli have not been secure for many years, and 1024-bit RSA is no longer state-of-the-art either. Yet these choices allow direct comparisons with results from previous papers on Montgomery's multiplication algorithm, which considered these modulus sizes.
We primarily report on experiments using Barrett's multiplication algorithm with base b = 2, but in the respective subsections we also explain how the results change when a more common base b = 2^{ws} with ws ≥ 8 is used. The case b = 2 is usually less favourable for the attacker in terms of efficiency and thus represents a worst-case analysis in this sense. Mathematically, it is also the most challenging case, as it requires the most general stochastic model.
The experiments were implemented using the Julia programming language [5] with the Nemo algebra package [15].

Timing attacks on RSA without CRT and on DH
We implemented the timing attack with the look-ahead strategy as presented in Sect. 4.1.
Since our simulations are noise-free, we set Var(N_{d,j}) := 0 in (26). For simplicity, in (26) we approximated the covariances cov_{MS,j}, cov_{SM,j} and cov_{SS} by the empirical covariances of suitable samples generated from Stochastic Model 2. For instance, cov_{MS,j} can be approximated using samples from R_n = αS_{n−1}t_j + βU_n − S_n, where t_j := y_j/M is the normalized basis of the j-th timing sample. (Recall that the random variables S_{n−1}, S_n, U_n and U_{n+1} are iid uniformly distributed on [0, 1).) Here, the n-th and the (n + 1)-th Barrett operations correspond to a multiplication by the basis and a squaring, respectively. We point out that Lemma 7 (i) would theoretically allow an exact computation of these covariances.

In order to make the attack more efficient, we chose the look-ahead depth λ dynamically during the attack. Our strategy is based on the observation that when Decision Strategy 1 fails for the first time, the decisions usually are close, i.e., using the notation of Sect. 4.1, the decisive difference is small. Close decisions happen mostly at the beginning of the attack, when the number of remaining exponent bits k is large (if Var(N_{d,j}) ≈ 0, then the variance v_{ρ,j} in (26) decreases (essentially) linearly in k), or if the number N of timing samples is insufficient. For each decision, we gradually incremented the look-ahead depth λ, starting from 1, until either the heuristic condition was fulfilled or we reached some maximum look-ahead depth λ_max.
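The empirical estimation of such a covariance can be sketched in the simplified model with β ≈ 0, where a multiplication by the basis (normalized value t = y_j/M) followed by a squaring yields the indicator pair R_M = 1{S_1 < αS_0 t}, R_S = 1{S_2 < αS_1^2}, coupled through S_1 (names and parameters are ours):

```python
import random

def empirical_cov_MS(alpha, t, n=200_000, seed=3):
    """Empirical covariance of the extra-reduction indicators of a
    multiplication by a basis with normalized value t followed by a
    squaring, in the simplified model (beta ~ 0)."""
    rng = random.Random(seed)
    rm, rs = [], []
    for _ in range(n):
        s0, s1, s2 = rng.random(), rng.random(), rng.random()
        rm.append(1 if s1 < alpha * s0 * t else 0)   # multiplication by basis
        rs.append(1 if s2 < alpha * s1 * s1 else 0)  # subsequent squaring
    mean_m = sum(rm) / n
    mean_s = sum(rs) / n
    return sum((a - mean_m) * (b - mean_s) for a, b in zip(rm, rs)) / n

cov = empirical_cov_MS(0.63, 0.7)   # negative: sharing S_1 couples the two
```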
For simplicity, we assumed in our experiments that we know the bit-length and Hamming weight of the secret exponent (cf. Sect. 4.1).
For the first experiment, we used the Diffie-Hellman group dhe3072 defined in Appendix A.2 of RFC 7919 [16]. This reference recommends using this group with secret exponents of bit-length at least 275. For the base b = 2 we used, the 3072-bit modulus of this group has Barrett parameters α ≈ 0.63 and β ≈ 0.5. The results of the experiment are reported in Table 2. For each sample size N and maximum look-ahead depth λ_max given in the table, we conducted 100 trials of the timing attack. For each trial, a 275-bit secret exponent and N input bases were chosen independently at random. Table 2 shows that, in terms of the sample size, the applied (dynamic) look-ahead strategy is much more efficient than the constant look-ahead strategy λ = 1.
For the second experiment, we considered RSA with 512-bit moduli and again used Barrett's multiplication with base b = 2. Of course, factoring 512-bit RSA moduli has been an easy task for many years, but this choice allows a direct comparison with the results on Montgomery's multiplication algorithm in [28]. The results of these experiments are reported in Table 3. For each sample size N and maximum look-ahead depth λ_max given in the table, we conducted 100 trials of the timing attack. For each trial, a 512-bit RSA modulus and N input bases were chosen independently at random. Since we chose e := 65537 as the public exponent, the secret exponents were ensured to have bit-length near 512 as well.
Reference [28] treats 512-bit RSA moduli, the square & multiply algorithm, and Montgomery's multiplication algorithm. In our terminology, [28] applies the optimal decision strategy for the constant look-ahead depth λ = 1. For the sample size N = 5000 (N = 7000, N = 9000), simulations yielded success probabilities of 12% (55%, 95%). The results in Table 3 underline that the efficiency of the attacks on Barrett's and on Montgomery's multiplication algorithm is rather similar for λ = 1. Moreover, in [28] also so-called real-life attacks were conducted, where the timings were gained from emulations of the Cascade chip (see [28], Remark 4). For the above-mentioned sample sizes, the real-life attack was successful in 15%, 40% and 72% of the trials, respectively. In [28], further experiments were conducted where the optimal decision strategy was combined with an error detection and correction strategy. There, already N = 5000 led to success rates of 85% (simulation) and 74% (real-life attack). This improved the efficiency of the original attack on the Cascade chip in [13] by a factor of ≈ 50. We refer the interested reader to [28], Table 1 and Table 2, for further experimental results.
This error detection and correction strategy can be adjusted to Barrett's multiplication algorithm, and for λ = 1 this should also increase the attack efficiency (in terms of N) considerably. Of course, one might combine this error detection and correction strategy with our look-ahead strategy. We do not expect a significant improvement because the look-ahead strategy already treats suspicious ('close') decisions with particular prudence (by enlarging λ). Conversely, however, the look-ahead strategy can be applied to Montgomery's multiplication algorithm as well and should yield similar improvements there.
Decision Strategy 1 considers the remaining execution time and the following λ exponent bits. Since the variance v_{ρ,j} of the remaining execution time (26) is approximately linear in the number of remaining exponent bits, wrong decisions should essentially occur at the beginning of the attack. This observation suggests a coarse rule of thumb for extrapolating the necessary sample size to different exponent lengths: when the exponent length increases from ℓ_1 to ℓ_2, the sample size N should increase by a factor of ≈ ℓ_2/ℓ_1. The experimental results in Table 2 and Table 3 are in line with this formula; cf. the results for N = 1000 (Table 2) and N = 2000 (Table 3), for instance.
Our experiments showed the interesting feature that, unlike for the attacks in Sects. 5.2 and 5.3, the parameters α and β play only a small role for the efficiency, at least under optimal conditions when no additional noise occurs. Qualitatively, this feature can be explained as follows: Decision Strategy 1 essentially exploits the fact that the variances of the 2^λ (hypothetical) remaining execution times should be the smaller the more of the leading bits of the λ-bit window are correct. Large (resp. small) α, β imply large (resp. small) variances and differences of variances. In the presence of additional noise, large α, β should favour the attack efficiency because then the relative impact of the additional noise on the variance is smaller.

Timing attacks on RSA with CRT
The results of our experiments are reported in Table 4 and Table 5 for RSA moduli of bit-size ν := 1024 (allowing a comparison with [22]) and ν := 3072, respectively. We used Barrett's multiplication algorithm with base b = 2 and RSA primes of bit-size ν/2. Each row of the table represents an experiment consisting of 100 trials. For each trial, we used rejection sampling to generate an RSA modulus such that α_max and β are contained in the given intervals.
In attack phase 1, we divided the initial intervals containing p_1 and p_2 into h := 4 subintervals each and chose a sample size of N_1 := 32. For attack phase 2, we estimated the sample size N_2 as follows. For an interval [u_3, u_2], we consider Δ := T(u_2) − T(u_3) as a random variable. Then, we have Var(Δ) = … although p_i ∈ [u_3, u_2]. Each difference T(u_2 + j) − T(u_1 + j) may be viewed as a realization of the difference of two normally distributed random variables, and thus the error probability γ is approximately … Therefore, a desired maximum error probability γ for a single decision can approximately be achieved by setting …

We point out that (41) does not depend on c_ER because σ_{Δ,i} is a multiple of c_ER. In the simulation, we thus may assume c_ER = 1. Of course, in a noisy setting the relation between the noise and c_ER is relevant. For the experiments in Tables 4 and 5, we chose γ := 2^{−8} and γ := 2^{−9}, respectively. The median and mean of the values chosen for N_2 in successful attacks, as well as the median and mean of the total number of timing samples required for a successful attack, are reported in Tables 4 and 5. The experiments were implemented using the error detection and correction strategy as outlined in Sect. 4.2. We 'confirmed' every 64th interval using additional N_2 timing samples and aborted attacks as soon as 10 errors had been detected. The average number of errors that were corrected in successful attacks is denoted by E_total in Tables 4 and 5.
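The exact expression (41) is not reproduced here, but the generic normal-approximation calculation behind it can be sketched. Assume (our assumption for this sketch) that a decision errs when the empirical mean of N iid timing differences with expectation `gap` and standard deviation `sigma` falls below half the expected gap; then the smallest adequate N follows from the standard normal quantile:

```python
import math

def required_sample_size(gap, sigma, gamma):
    """Sample size N such that the mean of N iid differences with
    expectation `gap` and standard deviation `sigma` falls below gap/2
    with probability <= gamma (normal approximation; one plausible form
    of the estimate, the exact constant in (41) may differ)."""
    # quantile z with Phi(z) = 1 - gamma, via bisection on erf
    lo, hi = 0.0, 40.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < 1 - gamma:
            lo = mid
        else:
            hi = mid
    z = (lo + hi) / 2
    # Pr(mean < gap/2) = Phi(-gap*sqrt(N)/(2*sigma)) <= gamma iff
    # sqrt(N) >= 2*z*sigma/gap
    return math.ceil((2 * z * sigma / gap) ** 2)
```

As noted above, scaling both `gap` and `sigma` by c_ER leaves the result unchanged, so c_ER = 1 may be assumed in simulations.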
Our experiments confirm the theory developed in Sect. 4.2. In particular, the efficiency of the attack increases with α max , which is the reason why in Step 2 of the attack we focus on the prime p i with larger α i .
It may be surprising that the average sample sizes for 1024-bit moduli and for 3072-bit moduli are comparable, although the latter attack requires almost three times as many individual decisions. The reason is that the average gap E(T(u_2) − T(u_1)) (32) increases linearly in the exponent size, while the standard deviation of MeanTime(u, N) (34) (in the absence of additional noise, for fixed N) grows only like its square root.
For b = 2^{ws} with ws ≥ 8, we have β_1, β_2 ≈ 0, and the algorithmic noise is considerably reduced, whereas the gap remains constant. In this case, the required sample sizes are smaller on average than those reported in Tables 4 and 5 (except for ranges where N_2 already equals 1).

Local timing attacks
We implemented the local timing attacks presented in Sect. 4.3.
For window width w, the computation of the decision rule (39) requires the evaluation of (2^w + 1)-dimensional integrals in (38). In contrast to the corresponding attack against Montgomery's algorithm, those integrals cannot easily be determined analytically, which is why we have to resort to numerical methods. However, due to the so-called curse of dimensionality, generic numerical integration algorithms break down completely for w ≥ 3 in terms of either efficiency or accuracy. We therefore take a step back and consider the stochastic model for the table initialization phase … and for the exponentiation phase …

Let (r_2, …, r_{2^w−1}, r_{j(w+1)}) be a corresponding realization of extra reductions. Let us assume for the moment that we are able to draw samples (s_{1,i}, …, s_{2^w−1,i}) and (u_{2,i}, …, u_{2^w−1,i}) from (42) for 1 ≤ i ≤ N which give rise to the given extra reduction vector (r_2, …, r_{2^w−1}). For N sufficiently large (we used N := 10,000 in our experiments), we obtain the approximation

for all θ ∈ {1, …, 2^w − 1}, where R(·, ·) is defined as in (2). The probabilities on the right-hand side of (43) can be computed explicitly using Lemma 3 (i). Since the denominator is independent of θ (and > 0), the joint probabilities Pr_θ(R_2 = r_2, …, R_{2^w−1} = r_{2^w−1}, R_{j(w+1)} = r_{j(w+1)}) in decision rule (39) can be replaced by the conditional probabilities in (43) without affecting the decision.

A required sample (s_1, …, s_{2^w−1}) and (u_2, …, u_{2^w−1}) from (42) giving rise to an extra reduction vector (r_2, …, r_{2^w−1}) can in principle be generated using rejection sampling. For w ≥ 3, however, this is far too inefficient for the aforementioned reasons. We therefore use an approach akin to Gibbs sampling. First, instead of generating the components of (s_1, …, s_{2^w−1}) and (u_2, …, u_{2^w−1}) independently and uniformly from [0, 1), we sample them adaptively in the order s_1, s_2, u_2, …, s_{2^w−1}, u_{2^w−1}, with each choice conditioned on the previous choices (when we reach a dead end, we start over). At this point, the samples (s_1, …, s_{2^w−1}) and (u_2, …, u_{2^w−1}) give rise to (r_2, …, r_{2^w−1}) but are biased and require some correction. Therefore, we re-sample the elements s_1, s_2, u_2, …, s_{2^w−1}, u_{2^w−1} cyclically in this order, with each choice conditioned on the current values of the other variables. In our experiments, we used just 10 such rounds. The final values of (s_1, …, s_{2^w−1}) are taken as the desired sample. For the next sample, we restart the whole process from the beginning. (Our experiments have shown that continuing the process from the previous sample may lead to a biased distribution.) Although this method is somewhat heuristic, our experiments below demonstrate that it is sufficient for our attacks to work. Note that the time complexity of this approach is linear in 2^w, while the complexity of rejection sampling is in general exponential in 2^w.
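The rejection-sampling baseline (feasible only for small window widths) can be sketched in a simplified squaring-chain model with β ≈ 0 and iid uniform proposals; names and the simplified acceptance condition are ours:

```python
import random

def sample_given_reductions(alpha, r, rng):
    """Rejection sampling: draw s_1, ..., s_m iid uniform on [0,1) until
    the induced extra-reduction pattern 1{s_i < alpha*s_{i-1}^2} matches
    the observed vector r (simplified chain model with beta ~ 0; the
    acceptance probability shrinks rapidly as len(r) grows)."""
    m = len(r) + 1
    while True:
        s = [rng.random() for _ in range(m)]
        if all((1 if s[i] < alpha * s[i - 1] ** 2 else 0) == r[i - 1]
               for i in range(1, m)):
            return s

rng = random.Random(4)
# Conditioned on r_1 = 1, the value s_1 must lie below alpha*s_0^2,
# so its conditional mean is well below the unconditional mean 0.5.
samples = [sample_given_reductions(0.63, (1, 0), rng) for _ in range(2000)]
mean_s1 = sum(s[1] for s in samples) / len(samples)
```

Every accepted sample reproduces the observed pattern exactly, which is what makes the method correct but exponentially expensive in the pattern length, motivating the Gibbs-like alternative described above.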
For the first experiment, we used the Diffie–Hellman group ffdhe3072 defined in Appendix A.2 of RFC 7919 [16] with 275-bit exponents (cf. Sect. 5.1). The 3072-bit modulus of this group has Barrett parameters α ≈ 0.63 and β ≈ 0.5 for the base b = 2 we used. The results of the experiment are reported in Table 6. For each window size w and sample size N given in the table, we conducted 100 trials of the timing attack. For each trial, a 275-bit secret exponent and N (unknown) input bases were chosen independently at random. We counted attacks with at most 2 errors as successful, since one or two errors can easily be corrected by exhaustive search, beginning with the most plausible alternatives. The mean number of errors is denoted by E.
It can be observed that the attack is exceptionally efficient for window size w = 1. The reason is that in this case only the operations M_0 (multiplication by 1) and M_1 (multiplication by the unknown input basis) have to be distinguished, which is easy because extra reductions never occur for M_0.
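A minimal sketch of this distinguisher, assuming a hypothetical input format in which each exponentiation yields one 0/1 extra-reduction flag per multiplication: since M_0 never triggers an extra reduction, any operation position showing an extra reduction in at least one trace must be M_1.

```python
def classify_w1(traces):
    """traces: one 0/1 extra-reduction vector per exponentiation, with
    one entry per multiplication (hypothetical format).  M_0 never
    causes an extra reduction, so a single observed extra reduction
    certifies M_1; absence across all traces is taken as evidence
    for M_0."""
    n_ops = len(traces[0])
    return ['M1' if any(tr[j] for tr in traces) else 'M0'
            for j in range(n_ops)]

# three traces over three multiplications: only position 1 ever reduces
traces = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]
assert classify_w1(traces) == ['M0', 'M1', 'M0']
```

Note that the absence of extra reductions over many traces is strong but not absolute evidence for M_0, since an M_1 operation with a very small extra reduction probability could in principle show the same picture.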
For the second experiment, we considered RSA without CRT with 512-bit moduli. We used Barrett's multiplication algorithm with bases b = 2 and b = 2^8, and RSA primes of bit-size 256. In this experiment, we limited ourselves to window size w = 4. The results of the experiment are reported in Table 7 for b = 2 and in Table 8 for b = 2^8. Since the attack is sensitive to the value α of the modulus ('signal'), we conducted trials for several ranges of α. For each row of the table, we conducted 100 trials. For each trial, we used rejection sampling to generate an RSA modulus with α in the given interval, and we chose N (unknown) input bases independently at random. Again, we counted attacks with at most 2 errors as successful. As in Sect. 5.1, the choice of 512-bit RSA moduli allows a comparison with the results in [23] and [25], Section 6. There, Montgomery's multiplication algorithm was applied together with a slightly modified exponentiation algorithm, which omitted the multiplication by 1 (resp. by the Montgomery constant R) when a zero block of the exponent bits was processed. Even for large α, the attack on Barrett's multiplication algorithm with b = 2 is somewhat less efficient than the attack against Montgomery's algorithm, although there only one guessing error was allowed. In contrast, for word size ws = 8 (i.e., b = 2^8) and large α, the success rates are similar to those for Montgomery's multiplication algorithm. The results for ws > 8 should be alike because β ≈ 0 in all cases.

For the local timing attack against RSA with CRT, the Barrett parameters α_1, β_1, α_2, β_2 of the unknown primes p_1, p_2 have to be estimated in a pre-step, as outlined in Sect. 4.3.2. We successfully verified this procedure in experiments. Since the remaining part of the attack is equivalent to the local timing attack against RSA without CRT, we dispense with reporting additional results on the full attack against RSA with CRT.
A single decision by Decision Strategy 2 depends on the whole table initialization phase but only on one Barrett operation within the exponentiation phase. For given parameters α, β, w, b and N, its error probability thus does not depend on the length of the exponent. Since the number of individual decisions increases linearly with the length of the exponent, for longer exponents the sample size N has to be raised to some extent to keep the expected number of guessing errors roughly constant. Tables 6 to 8 underline that local timing attacks scale well when the exponent length increases. Consider w = 4 in Table 6, for instance: increasing N from 1000 to 1400 reduces the error probability of single decisions to ≈ 25% of its previous value, which in turn implies that N = 1400 should lead to a similar success probability for exponent length 1024 as N = 1000 for exponent length 275.
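The scaling argument amounts to simple arithmetic (exponent lengths taken from the experiments above; the linear model for the number of decisions is the one stated in the text):

```python
# Expected guessing errors = (number of decisions) x (per-decision
# error probability); the number of decisions grows linearly with the
# exponent length.
len_short, len_long = 275, 1024   # exponent lengths from the experiments

# To keep the expected error count constant for the longer exponent,
# the per-decision error probability must shrink by this factor:
factor = len_short / len_long
print(round(factor, 2))           # 0.27, i.e. roughly 25 %
```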

Countermeasures
In Sect. 4, we discussed several attack scenarios against Barrett's multiplication algorithm. We pointed out that these attacks are rather similar to the corresponding attacks on implementations of Montgomery's multiplication algorithm. Consequently, the same countermeasures apply.
The most rigorous countermeasure is certainly to ensure that all modular squarings and multiplications within a modular exponentiation take identical execution times. Obviously, timing attacks and (passive) local timing attacks then cannot work. Identical execution times could be achieved by inserting up to two dummy operations per Barrett multiplication where necessary. However, this should be done with care. A potential disadvantage of the dummy operation approach is that dummy operations might be identified by a power attack, and for software implementations on PCs the compiler might silently optimize the dummy operations away if they are not implemented properly. An additional difficulty for software implementations is that secret-dependent branches and memory accesses must be avoided in order to thwart microarchitectural attacks.
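One common way to avoid a secret-dependent branch at the arithmetic level is to execute the (possibly dummy) subtraction unconditionally and select the result with a mask derived from the borrow bit. The following sketch illustrates this word-level pattern on k-bit operands; note that Python itself gives no constant-time guarantees, so this only shows the arithmetic idiom a constant-time implementation would use, not a secure implementation.

```python
def ct_conditional_sub(x, m, k=512):
    """Return x - m if x >= m, else x, for 0 <= x, m < 2^k, without a
    secret-dependent branch: the subtraction is always executed, and a
    mask derived from the borrow bit selects the result."""
    full = (1 << k) - 1
    t = x - m + (1 << k)        # non-negative (k+1)-bit value
    keep = (t >> k) & 1         # 1 iff no borrow occurred (x >= m)
    mask = (0 - keep) & full    # all-ones iff the subtraction is kept
    d = t & full                # (x - m) mod 2^k
    return (d & mask) | (x & (full ^ mask))

assert ct_conditional_sub(10, 7) == 3   # reduction applied
assert ct_conditional_sub(5, 7) == 5    # dummy subtraction discarded
```

In a real implementation this pattern would operate on machine words in C or assembly, and one would additionally verify that the compiler does not reintroduce a branch.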
It should be noted that a smarter approach exists for Montgomery's multiplication algorithm: one may dispense with extra reductions entirely if not only R > M but even R > 4M holds for the modulus M [30]. A similar approach exists for Barrett's multiplication algorithm; see the variant of this algorithm presented in [12].
Basis blinding and exponent blinding work differently. While basis blinding is meant to prevent an attacker from learning and controlling the input, exponent blinding is meant to prevent an attacker from combining information on particular exponent bits from several exponentiations, because the exponent changes constantly.
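The two blinding techniques can be sketched in their usual textbook form for RSA without CRT (this is the generic construction, not this paper's implementation; all identifiers are illustrative and the parameters below are toy-sized):

```python
from math import gcd
import random

def blinded_rsa_exp(x, d, e, n, phi, rng=random):
    """Compute x^d mod n with basis blinding and exponent blinding
    (textbook sketch; requires gcd(x, n) = 1)."""
    # --- basis blinding: randomize the base, unblind the result ---
    while True:
        v = rng.randrange(2, n)
        if gcd(v, n) == 1:
            break
    x_blind = (x * pow(v, e, n)) % n
    # --- exponent blinding: add a random multiple of phi(n) ---
    r = rng.randrange(1, 1 << 64)
    d_blind = d + r * phi
    s_blind = pow(x_blind, d_blind, n)     # = x^d * v  (mod n)
    return (s_blind * pow(v, -1, n)) % n   # unblind

# toy parameters (far too small for real use)
p, q = 1009, 1013
n, phi = p * q, (p - 1) * (q - 1)
e = 17
d = pow(e, -1, phi)
x = 123456 % n
assert blinded_rsa_exp(x, d, e, n, phi) == pow(x, d, n)
```

With both countermeasures active, the attacker neither knows the effective base x·v^e mod n nor sees the same exponent twice, which is exactly what the attacks discussed above exploit.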
State-of-the-art (pure) timing attacks on RSA without CRT (or on DH) work against neither basis blinding nor exponent blinding. While basis blinding suffices to protect RSA with CRT implementations against pure timing attacks, exponent blinding does not suffice; the attack from [26,27] can easily be transferred to Barrett's multiplication algorithm. On the other hand, basis blinding alone does not prevent local timing attacks, as shown in Sects. 4.3 and 5.3, whereas exponent blinding counteracts these attacks.
In principle, state-of-the-art knowledge might be used to determine minimal countermeasures, but we recommend staying on the safe side by applying combinations of blinding techniques (at least basis blinding and exponent blinding). Since blinding also counteracts other types of attacks, we recommend using blinding techniques even if the implementation does not exhibit timing differences.

Conclusion
We have thoroughly analysed the stochastic behaviour of Barrett's multiplication algorithm when applied within modular exponentiation algorithms. Unlike Montgomery's multiplication algorithm, Barrett's multiplication algorithm allows not only one but up to two extra reductions, a feature which increases the mathematical difficulties considerably. All known timing attacks and local timing attacks against Montgomery's multiplication algorithm were adapted to Barrett's multiplication algorithm, but specific features require additional attack substeps when RSA with CRT is attacked. Moreover, for timing attacks against RSA without CRT and against DH, we developed an efficient look-ahead strategy. Extensive experiments confirmed our theoretical results. Fortunately, effective countermeasures exist.