1 Introduction

Consider a polynomial function \(f : \mathbb {K}^n \rightarrow \mathbb {K}\) over a field \(\mathbb {K}\) given through a black box capable of evaluating f at points in \(\mathbb {K}^n\). The problem of sparse interpolation is to recover the representation of \(f \in \mathbb {K} [x_1, \ldots , x_n]\) in its usual form, as a linear combination

$$\begin{aligned} f = \sum _{1 \leqslant i \leqslant t} c_i \varvec{x}^{\varvec{e}_i} \end{aligned}$$
(1)

of monomials \(\varvec{x}^{\varvec{e}_i} = x_1^{e_{i, 1}} \cdots x_n^{e_{i, n}}\). One popular approach to sparse interpolation is to evaluate f at points in a geometric progression. This approach goes back to work of Prony in the eighteenth century [15] and became well known after Ben-Or and Tiwari's seminal paper [2]. It has been widely used in computer algebra, both in theory and in practice; see [16] for a nice survey.

More precisely, if a bound T for the number of terms t is known, then we first evaluate f at \(2 T - 1\) pairwise distinct points \(\varvec{\alpha }^0, \varvec{\alpha }^1, \ldots , \varvec{\alpha }^{2 T - 2}\), where \(\varvec{\alpha }= (\alpha _1, \ldots , \alpha _n) \in \mathbb {K}^n\) and \(\varvec{\alpha }^k :=(\alpha _1^k, \ldots , \alpha _n^k)\) for all \(k \in \mathbb {N}\). The generating function of the evaluations at \(\varvec{\alpha }^k\) satisfies the identity

$$\begin{aligned} \sum _{k \in \mathbb {N}} f (\varvec{\alpha }^k) z^k = \sum _{1 \leqslant i \leqslant t} \sum _{k \in \mathbb {N}} c_i \varvec{\alpha }^{\varvec{e}_i k} z^k = \sum _{1 \leqslant i \leqslant t} \frac{c_i}{1 -\varvec{\alpha }^{\varvec{e}_i} z} = \frac{N (z)}{\varLambda (z)}, \end{aligned}$$

where \(\varLambda = (1 -\varvec{\alpha }^{\varvec{e}_1} z) \cdots (1 -\varvec{\alpha }^{\varvec{e}_t} z)\) and \(N \in \mathbb {K} [z]\) is of degree \(< t\). The rational function \(N / \varLambda \) can be recovered from \(f (\varvec{\alpha }^0), f (\varvec{\alpha }^1), \ldots , f (\varvec{\alpha }^{2 T - 2})\) using fast Padé approximation [4]. For well-chosen points \(\varvec{\alpha }\), it is often possible to recover the exponents \(\varvec{e}_i\) from the values \(\varvec{\alpha }^{\varvec{e}_i} \in \mathbb {K}\). Once the exponents \(\varvec{e}_i\) are known, the coefficients \(c_i\) can be recovered using fast structured linear algebra [5]. This leaves us with the question of how to compute the roots \(\varvec{\alpha }^{-\varvec{e}_i}\) of \(\varLambda \) in an efficient way.
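As a toy illustration of this encoding (an example of ours, not taken from the paper), take \(f = 2 x_1^3 + 7 x_1 x_2\) and \(\varvec{\alpha }= (2, 3)\), so that \(\varvec{\alpha }^{\varvec{e}_1} = 2^3 = 8\) and \(\varvec{\alpha }^{\varvec{e}_2} = 2 \cdot 3 = 6\). Then

$$\begin{aligned} \sum _{k \in \mathbb {N}} f (\varvec{\alpha }^k) z^k = \frac{2}{1 - 8 z} + \frac{7}{1 - 6 z} = \frac{N (z)}{(1 - 8 z) (1 - 6 z)}, \end{aligned}$$

and the roots 1/8 and 1/6 of \(\varLambda = (1 - 8 z) (1 - 6 z)\) reveal the values \(\varvec{\alpha }^{\varvec{e}_1} = 8\) and \(\varvec{\alpha }^{\varvec{e}_2} = 6\), from which the exponents \((3, 0)\) and \((1, 1)\) can be read off via the factorizations into powers of 2 and 3.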

For practical applications in computer algebra, we usually have \(\mathbb {K}=\mathbb {Q}\), in which case it is most efficient to use a multi-modular strategy and reduce to coefficients in a finite field \(\mathbb {K}=\mathbb {F}_p\), where p is a prime number that we are free to choose. It is well known that polynomial arithmetic over \(\mathbb {F}_p\) can be implemented most efficiently using FFTs when the order \(p - 1\) of the multiplicative group is smooth. In practice, this prompts us to choose p of the form \(\sigma 2^m + 1\) for some small \(\sigma \) and such that p fits into a machine word.

The traditional way to compute roots of polynomials over finite fields is Cantor and Zassenhaus' method [6]. In [10, 11], alternative algorithms were proposed for our case of interest, when \(p - 1\) is smooth. The fastest of these algorithms is based on the tangent Graeffe transform; it gains a factor \(\log t\) with respect to Cantor–Zassenhaus' method. The aim of the present paper is to report on a parallel implementation of this new algorithm and on a few improvements that allow for a further constant speed-up.

In Sect. 2, we recall the Graeffe transform and the heuristic root finding method based on the tangent Graeffe transform from [10]. In Sect. 3, we present our main new theoretical improvements, which all rely on optimizations in the FFT model for fast polynomial arithmetic. Our contributions are twofold. In the FFT model, one backward transform out of four can be saved for Graeffe transforms of order two (see Sect. 3.2). When composing a large number of Graeffe transforms of order two, FFT caching can be used to gain another factor of 3/2 (see Sect. 3.3). In the longer preprint version of this paper [12], we also show how to generalize our methods to Graeffe transforms of general orders and how to use them in combination with the truncated Fourier transform.

Section 4 is devoted to our new sequential and parallel implementations of the algorithm in C and Cilk C. Our sequential implementation confirms an additional speed-up by a factor of two from the new optimizations. So far, we have achieved a parallel speed-up by a factor of 4.6 on an 8-core machine. Our implementation is freely available at http://www.cecm.sfu.ca/CAG/code/TangentGraeffe.

2 Root Finding Using the Tangent Graeffe Transform

2.1 Graeffe Transforms

The traditional Graeffe transform of a monic polynomial \(P \in \mathbb {K} [z]\) of degree d is the unique monic polynomial \(G (P) \in \mathbb {K} [z]\) of degree d such that

$$\begin{aligned} G (P) (z^2) = (- 1)^d P (z) P (- z). \end{aligned}$$
(2)

If P splits over \(\mathbb {K}\) into linear factors \(P = (z - \beta _1) \cdots (z - \beta _d)\), then one has

$$\begin{aligned} G (P) = (z - \beta _1^2) \cdots (z - \beta _d^2). \end{aligned}$$

More generally, given \(r \geqslant 2\), we define the Graeffe transform of order r to be the unique monic polynomial \(G_r (P) \in \mathbb {K} [z]\) of degree d such that \(G_r (P) (z) = (- 1)^{d} {\text {Res}}_u (P (u), u^r - z)\). If \(P = (z - \beta _1) \cdots (z - \beta _d)\), then

$$\begin{aligned} G_r (P) = (z - \beta _1^r) \cdots (z - \beta _d^r). \end{aligned}$$

If \(r, s \geqslant 2\), then we have

$$\begin{aligned} G_{rs} = G_r \circ G_s = G_s \circ G_r . \end{aligned}$$
(3)
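To make the definition concrete, here is a naive quadratic-time C sketch of ours of the order-two transform, written directly from relation (2); it is only an illustration, the FFT-based algorithm of Sect. 3.2 below is what one would actually use.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t fp;                 /* elements of F_p, p < 2^63 */

static fp addmod(fp a, fp b, fp p) { fp s = a + b; return s >= p ? s - p : s; }
static fp mulmod(fp a, fp b, fp p) { return (fp)((unsigned __int128)a * b % p); }

/* g[0..d] := coefficients of G(P) for the monic P = a[0] + ... + a[d] z^d,
 * computed naively from G(P)(z^2) = (-1)^d P(z) P(-z) in O(d^2) operations;
 * a and g must not alias */
void graeffe2_naive(const fp *a, fp *g, int d, fp p) {
    memset(g, 0, (size_t)(d + 1) * sizeof(fp));
    for (int i = 0; i <= d; i++)
        for (int j = 0; j <= d; j++) {
            if ((i + j) & 1) continue;            /* odd powers of z cancel */
            fp t = mulmod(a[i], a[j], p);
            if (j & 1) t = t ? p - t : 0;         /* sign coming from P(-z) */
            g[(i + j) / 2] = addmod(g[(i + j) / 2], t, p);
        }
    if (d & 1)                                    /* multiply by (-1)^d */
        for (int k = 0; k <= d; k++) g[k] = g[k] ? p - g[k] : 0;
}
```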

2.2 Root Finding Using Tangent Graeffe Transforms

Let \(\epsilon \) be a formal indeterminate with \(\epsilon ^2 = 0\). Elements in \(\mathbb {K} [\epsilon ] / (\epsilon ^2)\) are called tangent numbers. Now let \(P \in \mathbb {K} [z]\) be of the form \(P = (z - \alpha _1) \cdots (z - \alpha _d)\), where \(\alpha _1, \ldots , \alpha _d \in \mathbb {K}\) are pairwise distinct. Then the tangent deformation \(\tilde{P} (z) :=P (z + \epsilon )\) satisfies

$$\begin{aligned} \tilde{P} = P + P' \epsilon = (z - (\alpha _1 - \epsilon )) \cdots (z - (\alpha _d - \epsilon )) . \end{aligned}$$

The definitions from the previous subsection readily extend to coefficients in \(\mathbb {K} [\epsilon ]\) instead of \(\mathbb {K}\). Given \(r \geqslant 2\), we call \(G_r (\tilde{P})\) the tangent Graeffe transform of P of order r. We have

$$\begin{aligned} G_r (\tilde{P}) = (z - (\alpha _1 - \epsilon )^r) \cdots (z - (\alpha _d - \epsilon )^r), \end{aligned}$$

where

$$\begin{aligned} (\alpha _k - \epsilon )^r = \alpha _k^r - r \alpha _k^{r - 1} \epsilon , \qquad k = 1, \ldots , d. \end{aligned}$$
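Arithmetic on tangent numbers is just pair arithmetic; the following minimal C sketch (our own illustration, with hypothetical names) shows multiplication, where the \(\epsilon ^2\) term is dropped.

```c
#include <stdint.h>

typedef uint64_t fp;                      /* elements of F_p, p < 2^63 */
typedef struct { fp a, b; } tangent;      /* represents a + b*eps, eps^2 = 0 */

static fp addmod(fp x, fp y, fp p) { fp s = x + y; return s >= p ? s - p : s; }
static fp mulmod(fp x, fp y, fp p) { return (fp)((unsigned __int128)x * y % p); }

/* (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps; eps^2 vanishes */
static tangent t_mul(tangent x, tangent y, fp p) {
    tangent r;
    r.a = mulmod(x.a, y.a, p);
    r.b = addmod(mulmod(x.a, y.b, p), mulmod(x.b, y.a, p), p);
    return r;
}
```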

Now assume that we have an efficient way to determine the roots \(\alpha _1^r, \ldots , \alpha _d^r\) of \(G_r (P)\). For some polynomial \(T \in \mathbb {K} [z]\), we may decompose \(G_r (\tilde{P}) = G_r (P) + T \epsilon \). For any root \(\alpha _k^r\) of \(G_r (P)\), we then have

$$\begin{aligned} G_r (\tilde{P}) (\alpha _k^r - r \alpha _k^{r - 1} \epsilon ) &= G_r (P) (\alpha _k^r) + (T (\alpha _k^r) - G_r (P)' (\alpha _k^r) \, r \alpha _k^{r - 1}) \epsilon \\ &= (T (\alpha _k^r) - G_r (P)' (\alpha _k^r) \, r \alpha _k^{r - 1}) \epsilon = 0. \end{aligned}$$

Whenever \(\alpha _k^r\) happens to be a single root of \(G_r (P)\), it follows that

$$\begin{aligned} r \alpha _k^{r - 1} = \frac{T (\alpha _k^r)}{G_r (P)' (\alpha _k^r)}. \end{aligned}$$

If \(\alpha _k^r \ne 0\), this finally allows us to recover \(\alpha _k\) as \( \displaystyle \alpha _k = r \frac{\alpha _k^r}{r \alpha _k^{r - 1}}\).
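In code, this recovery step is a single modular division; a small C sketch of ours, with the inverse computed via Fermat's little theorem:

```c
#include <stdint.h>

typedef uint64_t fp;

static fp mulmod(fp x, fp y, fp p) { return (fp)((unsigned __int128)x * y % p); }
static fp powmod(fp x, uint64_t e, fp p) {
    fp r = 1;
    for (; e; e >>= 1, x = mulmod(x, x, p)) if (e & 1) r = mulmod(r, x, p);
    return r;
}

/* given u = alpha^r and v = r*alpha^(r-1) != 0, return alpha = r*u/v in F_p */
static fp recover_root(fp u, fp v, uint64_t r, fp p) {
    return mulmod(mulmod((fp)(r % p), u, p), powmod(v, p - 2, p), p);
}
```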

2.3 Heuristic Root Finding over Smooth Finite Fields

Assume now that \(\mathbb {K}=\mathbb {F}_p\) is a finite field, where p is a prime number of the form \(p = \sigma 2^m + 1\) for some small \(\sigma \). Let \(\omega \in \mathbb {F}_p\) be a primitive element of order \(p - 1\) for the multiplicative group of \(\mathbb {F}_p\).

Let \(P = (z - \alpha _1) \cdots (z - \alpha _d) \in \mathbb {F}_p [z]\) be as in the previous subsection. The tangent Graeffe method can be used to efficiently compute those roots \(\alpha _k\) of P for which \(\alpha _k^r\) is a single root of \(G_r (P)\). In order to guarantee that there are sufficiently many such roots, we first replace P(z) by \(P (z + \tau )\) for a random shift \(\tau \in \mathbb {F}_p\), and use the following heuristic:

  • H: For any subset \(\{ \alpha _1, \ldots , \alpha _d \} \subseteq \mathbb {F}_p\) of cardinality d and any \(r \leqslant (p - 1) / (4 d)\), there exist at least p/2 elements \(\tau \in \mathbb {F}_p\) such that \(\{ (\alpha _1 - \tau )^r, \ldots , (\alpha _d - \tau )^r \}\) contains at least 2d/3 elements.

For a random shift \(\tau \in \mathbb {F}_p\) and any \(r \leqslant (p - 1) / (4 d)\), heuristic H therefore ensures, with probability at least 1/2, that \(G_r (P (z + \tau ))\) has at least d/3 single roots: if the d values \((\alpha _k - \tau )^r\) take at least 2d/3 distinct values, then at least \(2 (2 d / 3) - d = d / 3\) of them are taken exactly once.

Now take r to be the largest power of two such that \(r \leqslant (p - 1) / (4 d)\) and let \(s = (p - 1) / r\). By construction, \(s = O (d)\). The roots \(\alpha _1^r, \ldots , \alpha _d^r\) of \(G_r (P)\) are all s-th roots of unity, i.e. they lie in the set \(\{ 1, \omega ^r, \ldots , \omega ^{(s - 1) r} \}\). We may thus determine them by evaluating \(G_r (P)\) at \(\omega ^{ir}\) for \(i = 0, \ldots , s - 1\). Since \(s = O (d)\), this can be done efficiently using a discrete Fourier transform. Combined with the tangent Graeffe method from the previous subsection, this leads to the following probabilistic algorithm for root finding:

[Algorithm 1 (tangent Graeffe root finding); pseudocode figure not reproduced.]
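Since the pseudocode is not reproduced above, the following outline of Algorithm 1 is our reconstruction from the step numbers referenced in Remarks 1–3 and in Sect. 4.1; the precise formulation of the individual steps is an assumption on our part.

1. Let r be the largest power of two with \(r \leqslant (p - 1) / (4 d)\) and set \(s :=(p - 1) / r\).
2. Pick a random shift \(\tau \in \mathbb {F}_p\).
3. Compute the Taylor shift \(P (z + \tau )\).
4. Set \(A + B \epsilon :=P (z + \tau + \epsilon )\), so that \(A = P (z + \tau )\) and \(B = P' (z + \tau )\).
5. Replace \(A + B \epsilon \) by its Graeffe transform of order two, \(\log _2 r\) times, so that \(A + B \epsilon = G_r (P (z + \tau + \epsilon ))\).
6. Evaluate A, \(A'\) and B at \(\omega ^{i r}\) for \(i = 0, \ldots , s - 1\).
7. Select the single roots, i.e. those \(u = \omega ^{i r}\) with \(A (u) = 0\) and \(A' (u) \ne 0\).
8. For each such u, add the root \(\alpha = \tau + r u A' (u) / B (u)\) of P to the set S.
9. Compute \(Q (z) :=\prod _{\alpha \in S} (z - \alpha )\) using a product tree.
10. If \(Q \ne P\), recurse on P/Q and add the roots found there to S; return S.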

Remark 1

To compute \(G_2 (\tilde{P}) = G_2 (A + B \epsilon )\) we may use \(G_2 (\tilde{P}) (z^2) = (- 1)^d (A (z) A (- z) + (A (z) B (- z) + B (z) A (- z)) \epsilon )\), which requires three polynomial multiplications in \(\mathbb {F}_p [z]\) of degree d. In total, step 5 thus performs \(O (\log (p / s))\) such multiplications. We discuss how to perform step 5 efficiently in the FFT model in Sect. 3.

Remark 2

For practical implementations, one may vary the threshold \(r \leqslant (p - 1) / (4 d)\) for r and the resulting threshold \(s \geqslant 4 d\) for s. For larger values of s, the computation of the DFTs in step 6 gets more expensive, but the proportion of single roots goes up, so more roots are determined at each iteration. From an asymptotic complexity perspective, it would be best to take \(s \asymp d \sqrt{\log p}\). In practice, we preferred the lower threshold \(s \geqslant 2 d\), because the constant factor of our implementation of step 6 (based on Bluestein's algorithm [3]) is significant compared to that of our highly optimized implementation of the tangent Graeffe method. A second reason to prefer s of size O(d) instead of \(O (d \sqrt{\log p})\) is that the total space used by the algorithm is linear in s. In the future, it would be interesting to further speed up step 6 by investing more time in the implementation of high performance DFTs of general orders s.

3 Computing Graeffe Transforms

3.1 Reminders About Discrete Fourier Transforms

Assume \(n \in \mathbb {N}\) is invertible in \(\mathbb {K}\) and let \(\omega \in \mathbb {K}\) be a primitive n-th root of unity. Consider a polynomial \(A = a_0 + a_1 z + \cdots + a_{n - 1} z^{n - 1} \in \mathbb {K} [z]\). Then the discrete Fourier transform (DFT) of order n of the sequence \((a_i)_{0 \leqslant i < n}\) is defined by

$$\begin{aligned} {\text {DFT}}_{\omega } ((a_i)_{0 \leqslant i< n}) :=(\hat{a}_k)_{0 \leqslant k < n}, \qquad \hat{a}_k :=A (\omega ^k). \end{aligned}$$

We will write \(\mathsf {F}_{\mathbb {K}} (n)\) for the cost of one discrete Fourier transform in terms of the number of operations in \(\mathbb {K}\) and assume that \(n = o \left( \mathsf {F}_{\mathbb {K}} (n) \right) \). For any \(i \in \{ 0, \ldots , n - 1 \}\), we have

$$\begin{aligned} {\text {DFT}}_{\omega ^{- 1}} ((\hat{a}_k)_{0 \leqslant k< n})_i = \sum _{0 \leqslant k< n} \hat{a}_k \omega ^{- ik} = \sum _{0 \leqslant j< n} a_j \sum _{0 \leqslant k < n} \omega ^{(j - i) k} = na_i . \end{aligned}$$
(4)

Since n is invertible in \(\mathbb {K}\), it follows that \({\text {DFT}}_{\omega }^{- 1} = n^{- 1} {\text {DFT}}_{\omega ^{- 1}}\). The costs of direct and inverse transforms therefore coincide up to O(n) additional operations.

If \(n = n_1 n_2\) is composite, \(0 \leqslant k_1 < n_1\), and \(0 \leqslant k_2 < n_2\), then it is well known [7] that

$$\begin{aligned} \hat{a}_{k_2 n_1 + k_1} = {\text {DFT}}_{\omega ^{n_1}} \left( \left( \omega ^{i_2 k_1} {\text {DFT}}_{\omega ^{n_2}} ((a_{i_1 n_2 + i_2})_{0 \leqslant i_1< n_1})_{k_1} \right) _{0 \leqslant i_2 < n_2} \right) _{k_2}. \end{aligned}$$
(5)

This means that a DFT of length n reduces to \(n_1\) transforms of length \(n_2\) plus \(n_2\) transforms of length \(n_1\) plus n multiplications in \(\mathbb {K}\):

$$\begin{aligned} \mathsf {F}_{\mathbb {K}} (n_1 n_2) \leqslant n_1 \mathsf {F}_{\mathbb {K}} (n_2) + n_2 \mathsf {F}_{\mathbb {K}} (n_1) + O (n). \end{aligned}$$

In particular, if \(r = O (1)\), then \(\mathsf {F}_{\mathbb {K}} (rn) \sim r \mathsf {F}_{\mathbb {K}} (n)\).

It is sometimes convenient to apply DFTs directly to polynomials as well; for this reason, we also define \({\text {DFT}}_{\omega } (A) :=(\hat{a}_k)_{0 \leqslant k < n}\). Given two polynomials \(A, B \in \mathbb {K} [z]\) with \(\deg (AB) < n\), we may then compute the product AB using

$$\begin{aligned} AB = {\text {DFT}}_{\omega }^{- 1} ({\text {DFT}}_{\omega } (A) \, {\text {DFT}}_{\omega } (B)) . \end{aligned}$$

In particular, if \(\mathsf {M}_{\mathbb {K}} (n)\) denotes the cost of multiplying two polynomials of degree \(< n\), then we obtain \(\mathsf {M}_{\mathbb {K}} (n) \sim 3 \mathsf {F}_{\mathbb {K}} (2 n) \sim 6 \mathsf {F}_{\mathbb {K}} (n)\).
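For concreteness, here is a self-contained C sketch of ours of a recursive radix-2 NTT over \(\mathbb {F}_p\) and of FFT-multiplication following the formula above; it is a minimal, unoptimized illustration, not the tuned code of Sect. 4.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t fp;                 /* elements of F_p, p < 2^63 */

static fp mulmod(fp a, fp b, fp p) { return (fp)((unsigned __int128)a * b % p); }
static fp powmod(fp a, uint64_t e, fp p) {
    fp r = 1;
    for (; e; e >>= 1, a = mulmod(a, a, p)) if (e & 1) r = mulmod(r, a, p);
    return r;
}

/* in-place DFT of length n (a power of two); w must have order n in F_p */
void fft_pow2(fp *a, int n, fp w, fp p) {
    if (n == 1) return;
    int h = n / 2;
    fp *t = malloc((size_t)n * sizeof(fp));
    for (int i = 0; i < h; i++) { t[i] = a[2*i]; t[h + i] = a[2*i + 1]; }
    fft_pow2(t, h, mulmod(w, w, p), p);        /* even part, root w^2 */
    fft_pow2(t + h, h, mulmod(w, w, p), p);    /* odd part, root w^2 */
    fp wk = 1;
    for (int k = 0; k < h; k++, wk = mulmod(wk, w, p)) {
        fp u = t[k], v = mulmod(wk, t[h + k], p);
        a[k]     = (u + v) % p;                /* hat a_k     = E_k + w^k O_k */
        a[h + k] = (u + p - v) % p;            /* hat a_{k+h} = E_k - w^k O_k */
    }
    free(t);
}

/* C := A*B with deg(AB) < n, via three DFTs of length n (so M(n) ~ 3 F(n));
 * A and B are overwritten by their transforms, C must be a distinct array */
void fft_mul(fp *A, fp *B, fp *C, int n, fp w, fp p) {
    fft_pow2(A, n, w, p);
    fft_pow2(B, n, w, p);
    for (int i = 0; i < n; i++) C[i] = mulmod(A[i], B[i], p);
    fft_pow2(C, n, powmod(w, p - 2, p), p);    /* transform with w^{-1} ... */
    fp ninv = powmod((fp)n, p - 2, p);         /* ... then scale by n^{-1}, cf. (4) */
    for (int i = 0; i < n; i++) C[i] = mulmod(C[i], ninv, p);
}
```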

Remark 3

We note that step 6 of Algorithm 1 comes down to the computation of three DFTs of length s. Since r is a power of two, this length is of the form \(s = \sigma 2^k\) for some \(k \in \mathbb {N}\). In view of (5), we may therefore reduce step 6 to \(3 \sigma \) DFTs of length \(2^k\) plus \(3 \cdot 2^k\) DFTs of length \(\sigma \). If \(\sigma \) is very small, then we may use a naive implementation for the DFTs of length \(\sigma \). In general, one may use Bluestein's algorithm [3] to reduce the computation of a DFT of length \(\sigma \) to the computation of a product in \(\mathbb {K} [z] / (z^{\sigma } - 1)\), which can in turn be computed using FFT-multiplication and three DFTs whose length is a larger power of two.
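As an illustration of this reduction, here is a naive C sketch of ours of Bluestein's chirp transform for an odd length \(\sigma \) (odd, so that 2 is invertible modulo \(\sigma \)); the cyclic convolution is computed naively for clarity, whereas a real implementation would use FFT-multiplication as described above.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t fp;

static fp addmod(fp a, fp b, fp p) { fp s = a + b; return s >= p ? s - p : s; }
static fp mulmod(fp a, fp b, fp p) { return (fp)((unsigned __int128)a * b % p); }
static fp powmod(fp a, uint64_t e, fp p) {
    fp r = 1;
    for (; e; e >>= 1, a = mulmod(a, a, p)) if (e & 1) r = mulmod(r, a, p);
    return r;
}

/* DFT of odd length sigma at a root w of order sigma, via Bluestein:
 * with h = (sigma+1)/2 = 1/2 mod sigma, we have jk = h(j^2 + k^2 - (k-j)^2),
 * so ahat_k = w^{h k^2} sum_j (a_j w^{h j^2}) w^{-h (k-j)^2}, i.e. a cyclic
 * convolution of length sigma: a product in F_p[z]/(z^sigma - 1). */
void dft_bluestein_odd(const fp *a, fp *ahat, int sigma, fp w, fp p) {
    int h = (sigma + 1) / 2;
    fp *b = malloc((size_t)sigma * sizeof(fp));
    fp *c = malloc((size_t)sigma * sizeof(fp));
    for (int j = 0; j < sigma; j++) {
        int q = (int)((int64_t)j * j % sigma * h % sigma);   /* h*j^2 mod sigma */
        b[j] = mulmod(a[j], powmod(w, (uint64_t)q, p), p);   /* chirped input */
        c[j] = powmod(w, (uint64_t)((sigma - q) % sigma), p); /* w^{-h j^2} */
    }
    for (int k = 0; k < sigma; k++) {
        fp acc = 0;                      /* naive cyclic convolution (b*c)_k */
        for (int j = 0; j < sigma; j++)
            acc = addmod(acc, mulmod(b[j], c[(k - j + sigma) % sigma], p), p);
        int q = (int)((int64_t)k * k % sigma * h % sigma);
        ahat[k] = mulmod(powmod(w, (uint64_t)q, p), acc, p); /* final chirp */
    }
    free(b); free(c);
}
```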

3.2 Graeffe Transforms of Order Two

Let \(\mathbb {K}\) be a field with a primitive (2n)-th root of unity \(\omega \). Let \(P \in \mathbb {K} [z]\) be a polynomial of degree \(d = \deg P < n\). Then the relation (2) yields

$$\begin{aligned} G (P) (z^2) = (- 1)^d {\text {DFT}}_{\omega }^{- 1} ({\text {DFT}}_{\omega } (P (z)) {\text {DFT}}_{\omega } (P (- z))) . \end{aligned}$$
(6)

For any \(k \in \{ 0, \ldots , 2 n - 1 \}\), we further note that

$$\begin{aligned} {\text {DFT}}_{\omega } (P (- z))_k = P (- \omega ^k) = P (\omega ^{(k + n) {\text {rem }} 2 n }) = {\text {DFT}}_{\omega } (P (z))_{(k + n) {\text {rem }} 2 n }, \end{aligned}$$
(7)

so \({\text {DFT}}_{\omega } (P (- z))\) can be obtained from \({\text {DFT}}_{\omega } (P)\) using n transpositions of elements in \(\mathbb {K}\). Concerning the inverse transform, we also note that

$$\begin{aligned} {\text {DFT}}_{\omega } (G (P) (z^2))_k = G (P) (\omega ^{2 k}) = {\text {DFT}}_{\omega ^2} (G (P))_k, \end{aligned}$$

for \(k = 0, \ldots , n - 1\). Plugging this into (6), we conclude that

$$\begin{aligned} G (P) = (- 1)^d {\text {DFT}}_{\omega ^2}^{- 1} (({\text {DFT}}_{\omega } (P)_{k} {\text {DFT}}_{\omega } (P)_{k + n})_{0 \leqslant k < n}). \end{aligned}$$

This leads to the following algorithm for the computation of G(P):

[Algorithm 2 (Graeffe transform of order two in the FFT model); pseudocode figure not reproduced.]
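The computation just described (one forward DFT of length 2n, n pointwise products, one inverse DFT of length n) translates directly into C. The sketch below is our rendering, not the paper's code; fft_pow2, mulmod and powmod are as in the sketch of Sect. 3.1, and a final pass applies \(n^{- 1}\) and the sign \((- 1)^d\).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t fp;

static fp mulmod(fp a, fp b, fp p) { return (fp)((unsigned __int128)a * b % p); }
static fp powmod(fp a, uint64_t e, fp p) {
    fp r = 1;
    for (; e; e >>= 1, a = mulmod(a, a, p)) if (e & 1) r = mulmod(r, a, p);
    return r;
}

void fft_pow2(fp *a, int n, fp w, fp p); /* radix-2 NTT sketched in Sect. 3.1 */

/* G[0..n-1] := G(P) for monic P of degree d < n; w must have order 2n in F_p
 * and n must be a power of two; P and G must be distinct arrays */
void graeffe2_fft(const fp *P, fp *G, int n, int d, fp w, fp p) {
    fp *buf = calloc((size_t)(2 * n), sizeof(fp));
    memcpy(buf, P, (size_t)(d + 1) * sizeof(fp));
    fft_pow2(buf, 2 * n, w, p);                  /* step 1: DFT_w(P), length 2n */
    for (int k = 0; k < n; k++)                  /* step 2: hat P_k * hat P_{k+n} */
        G[k] = mulmod(buf[k], buf[k + n], p);
    fft_pow2(G, n, powmod(mulmod(w, w, p), p - 2, p), p); /* step 3: DFT_{w^2}^{-1} */
    fp ninv = powmod((fp)n, p - 2, p);
    for (int k = 0; k < n; k++) {
        G[k] = mulmod(G[k], ninv, p);
        if (d & 1) G[k] = G[k] ? p - G[k] : 0;   /* the factor (-1)^d */
    }
    free(buf);
}
```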

Proposition 1

Let \(\omega \in \mathbb {K}\) be a primitive 2n-th root of unity in \(\mathbb {K}\) and assume that 2 is invertible in \(\mathbb {K}\). Given a monic polynomial \(P \in \mathbb {K} [z]\) with \(\deg P < n\), we can compute G(P) in time \(\mathsf {G}_{2, \mathbb {K}} (n) \sim 3 \mathsf {F}_{\mathbb {K}} (n)\).

Proof

We have already explained the correctness of Algorithm 2. Step 1 requires one forward DFT of length 2n, of cost \(\mathsf {F}_{\mathbb {K}} (2 n) = 2 \mathsf {F}_{\mathbb {K}} (n) + O (n)\). Step 2 can be done in O(n) operations. Step 3 requires one inverse DFT of length n, of cost \(\mathsf {F}_{\mathbb {K}} (n) + O (n)\). The total cost of Algorithm 2 is therefore \(3 \mathsf {F}_{\mathbb {K}} (n) + O (n) \sim 3 \mathsf {F}_{\mathbb {K}} (n)\).

Remark 4

In terms of the complexity of multiplication, we obtain \(\mathsf {G}_{2, \mathbb {K}} (n) \sim (1 / 2) \mathsf {M}_{\mathbb {K}} (n)\). This is a speed-up by a factor of 4/3 (\(33.3\%\)) over the previously best known bound \(\mathsf {G}_{2, \mathbb {K}} (n) \sim {(2 / 3) \mathsf {M}_{\mathbb {K}} (n)}\) that was used in [10]. Note that the best known bound for squaring polynomials of degree \(< n\) is also \(\sim (2 / 3) \mathsf {M}_{\mathbb {K}} (n)\). It would be interesting to know whether squares can be computed in time \(\sim (1 / 2) \mathsf {M}_{\mathbb {K}} (n)\) as well.

3.3 Graeffe Transforms of Power-of-Two Orders

In view of (3), Graeffe transforms of power-of-two order \(2^m\) can be computed using

$$\begin{aligned} G_{2^m} (P) = \left( G \circ \overset{m \times }{\ldots } \circ G \right) (P) . \end{aligned}$$
(8)

Now assume that we computed the first Graeffe transform G(P) using Algorithm 2 and that we wish to apply a second Graeffe transform to the result. Then we note that

$$\begin{aligned} {\text {DFT}}_{\omega } (G (P))_{2 k} = {\text {DFT}}_{\omega ^2} (G (P))_k \end{aligned}$$
(9)

is already known for \(k = 0, \ldots , n - 1\). We can use this to accelerate step 1 of the second application of Algorithm 2. Indeed, in view of (5) for \(n_1 = 2\) and \(n_2 = n\), we have

$$\begin{aligned} {\text {DFT}}_{\omega } (G (P))_{2 k + 1} = {\text {DFT}}_{\omega ^2} ( ( \omega ^i G (P)_i \, )_{0 \leqslant i < n} )_k \end{aligned}$$
(10)

for \(k = 0, \ldots , n - 1\). In order to exploit this idea in a recursive fashion, it is useful to modify Algorithm 2 so as to include \({\text {DFT}}_{\omega ^2} (P)\) in the input and \({\text {DFT}}_{\omega ^2} (G (P))\) in the output. This leads to the following algorithm:

[Algorithm 3 (Graeffe transform of order two with cached DFTs); pseudocode figure not reproduced.]
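Concretely, one pass might look as follows in C; this is our reconstruction from (9) and (10), not the paper's code. Each pass computes the odd-indexed entries of \({\text {DFT}}_{\omega } (P)\) with one DFT of length n via (10), forms the n pointwise products, which are (up to the sign \((- 1)^d\)) the new cached transform \({\text {DFT}}_{\omega ^2} (G (P))\), and recovers the coefficients of G(P) with one inverse DFT, for a total of two length-n transforms per pass.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t fp;

static fp mulmod(fp a, fp b, fp p) { return (fp)((unsigned __int128)a * b % p); }
static fp powmod(fp a, uint64_t e, fp p) {
    fp r = 1;
    for (; e; e >>= 1, a = mulmod(a, a, p)) if (e & 1) r = mulmod(r, a, p);
    return r;
}

void fft_pow2(fp *a, int n, fp w, fp p); /* radix-2 NTT sketched in Sect. 3.1 */

/* One pass of Algorithm 3 (our reconstruction): on input the coefficients
 * P[0..n-1] of a monic P of degree d < n (higher entries zero) and
 * hatP[0..n-1] = DFT_{w^2}(P), overwrite P with G(P) and hatP with
 * DFT_{w^2}(G(P)); w has order 2n and n = 2h is a power of two. */
void graeffe2_cached(fp *P, fp *hatP, int n, int d, fp w, fp p) {
    int h = n / 2;
    fp *odd = malloc((size_t)n * sizeof(fp));
    fp *c   = malloc((size_t)n * sizeof(fp));
    fp wi = 1;
    for (int i = 0; i < n; i++, wi = mulmod(wi, w, p))
        odd[i] = mulmod(wi, P[i], p);             /* twiddle (w^i P_i), cf. (10) */
    fft_pow2(odd, n, mulmod(w, w, p), p);         /* odd entries of DFT_w(P) */
    for (int j = 0; j < h; j++) {                 /* products hat P_k hat P_{k+n}; */
        fp e = mulmod(hatP[j], hatP[j + h], p);   /* k and k+n have equal parity  */
        fp o = mulmod(odd[j], odd[j + h], p);
        c[2 * j]     = ((d & 1) && e) ? p - e : e;   /* with the sign (-1)^d,    */
        c[2 * j + 1] = ((d & 1) && o) ? p - o : o;   /* this is DFT_{w^2}(G(P))  */
    }
    memcpy(hatP, c, (size_t)n * sizeof(fp));      /* new cached transform */
    memcpy(P, c, (size_t)n * sizeof(fp));         /* one inverse DFT recovers G(P) */
    fft_pow2(P, n, powmod(mulmod(w, w, p), p - 2, p), p);
    fp ninv = powmod((fp)n, p - 2, p);
    for (int i = 0; i < n; i++) P[i] = mulmod(P[i], ninv, p);
    free(odd); free(c);
}
```

A driver would compute \({\text {DFT}}_{\omega ^2} (P)\) once and then call graeffe2_cached m times to obtain \(G_{2^m} (P)\), for a total of \(\sim (2 m + 1) \mathsf {F}_{\mathbb {K}} (n)\) operations, matching Proposition 2.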

Proposition 2

Let \(\omega \in \mathbb {K}\) be a primitive 2n-th root of unity in \(\mathbb {K}\) and assume that 2 is invertible in \(\mathbb {K}\). Given a monic polynomial \(P \in \mathbb {K} [z]\) with \(\deg P < n\) and \(m \geqslant 1\), we can compute \(G_{2^m} (P)\) in time \(\mathsf {G}_{2^m, \mathbb {K}} (n) \sim (2 m + 1) \mathsf {F}_{\mathbb {K}} (n)\).

Proof

It suffices to compute \({\text {DFT}}_{\omega ^2} (P)\) and then to apply Algorithm 3 recursively, m times. Every application of Algorithm 3 now takes \(2 \mathsf {F}_{\mathbb {K}} (n) + O (n) \sim 2 \mathsf {F}_{\mathbb {K}} (n)\) operations in \(\mathbb {K}\), whence the claimed complexity bound.

Remark 5

In [10], Graeffe transforms of order \(2^m\) were computed directly using formula (8), which takes \(\sim 4 m \mathsf {F}_{\mathbb {K}} (n)\) operations in \(\mathbb {K}\) and is thus roughly twice as slow as the new algorithm.

4 Implementation and Benchmarks

We have implemented the tangent Graeffe root finding algorithm (Algorithm 1) in C with the optimizations presented in Sect. 3. Our C implementation supports primes of size up to 63 bits. In what follows, all complexities count arithmetic operations in \(\mathbb {F}_p\).

In Tables 1 and 2, the input polynomial P(z) of degree d is constructed by choosing d distinct values \(\alpha _i \in \mathbb {F}_p\) for \(1 \leqslant i \leqslant d\) at random and creating \(P (z) = \prod _{i = 1}^d (z - \alpha _i)\). We use \(p = 3 \cdot 29 \cdot 2^{56} + 1\), a smooth 63-bit prime; for this prime, \(\mathsf {M} (d)\) is \(O (d \log d)\).

One goal is to determine how much faster the Tangent Graeffe (TG) root finding algorithm is in practice than the Cantor–Zassenhaus (CZ) algorithm, which is implemented in many computer algebra systems. In Table 1 we present timings comparing our sequential implementation of TG with Magma's implementation of CZ. For polynomials in \(\mathbb {F}_p [z]\), Magma uses Shoup's factorization algorithm from [17]. For our input P(z), with d distinct linear factors, Shoup's algorithm reduces to the Cantor–Zassenhaus equal degree factorization method. The average complexity of TG is \(O (\mathsf {M} (d) ( \log (p/s) + \log d ) )\) and that of CZ is \(O (\mathsf {M} (d) \log p \log d)\).

Table 1. Sequential timings in CPU seconds for \(p = 3 \cdot 29 \cdot 2^{56} + 1\) and using \(s \in [2d, 4d)\).

The timings in Table 1 are sequential timings obtained on a Linux server with an 8-core Intel Xeon E5-2660 CPU. The time in column “first” is for the first application of the TG algorithm (steps 1–9 of Algorithm 1), which obtains about 69% of the roots. The time in column “total” is the total time of the TG algorithm. Columns “step 5”, “step 6” and “step 9” report the time spent in steps 5, 6 and 9 of Algorithm 1, not counting the time spent in the recursive call in step 10.

The Magma timings are for Magma's Factorization command. The timings for Magma version V2.25-3 suggest that Magma's CZ implementation involves a subalgorithm with quadratic asymptotic complexity. Indeed, it turns out that the author of the code implemented all of the sub-quadratic polynomial arithmetic correctly, as demonstrated by the second set of timings for Magma in column V2.25-5, but inserted the d linear factors found into a list using linear insertion! Allan Steel of the Magma group identified and fixed the offending subroutine for Magma version V2.25-5. The timings show that TG is faster than CZ by a factor of 76.6 (= 8.43/0.11) to 146.3 (= 2809/19.2).

We also wanted to attempt a parallel implementation. For this we used the MIT Cilk C compiler from [8]; Cilk provides a simple fork-join model of parallelism. Unlike the CZ algorithm, TG involves no gcd computations, which are hard to parallelize. We present some initial parallel timing data in Table 2. The timings in parentheses are parallel timings for 8 cores.

Table 2. Real times in seconds for 1 core (8 cores) and \(p = 3 \cdot 29 \cdot 2^{56} + 1\).

4.1 Implementation Notes

To implement the Taylor shift \(P (z + \tau )\) in step 3, we used the \(O (\mathsf {M} (d))\) method from [1, Lemma 3]. For step 5 we use Algorithm 3, which has complexity \(O (\mathsf {M} (d) \log \frac{p}{s})\). To evaluate \(A (z), A' (z)\) and B(z) in step 6 in \(O (\mathsf {M} (s))\), we used the Bluestein transformation [3]. To compute the product \(Q (z) = \prod _{\alpha \in S} (z - \alpha )\) in step 9, for \(t = |S|\) roots, we used the \(O (\mathsf {M} (t) \log t)\) product tree multiplication algorithm [9]. The division in step 10 is done in \(O (\mathsf {M} (d))\) using fast division.

The sequential timings in Tables 1 and 2 show that steps 5, 6 and 9 account for about 90% of the total time. We parallelized these three steps as follows. For step 5, the two forward and two inverse FFTs are done in parallel. We also parallelized our radix-2 FFT, by parallelizing recursive calls of size \(n \geqslant 2^{17}\) and the main loop in blocks of size \(m \geqslant 2^{18}\), as done in [14]. For step 6 there are three applications of Bluestein's algorithm, to compute \(A (\omega ^{ir})\), \(A' (\omega ^{ir})\) and \(B (\omega ^{ir})\); we parallelized these (thereby doubling the overall space used by our implementation). The main computation in the Bluestein transformation is a multiplication of two polynomials of degree s. The two forward FFTs there are done in parallel, and the FFTs themselves are parallelized as in step 5. For the product in step 9, we parallelize the two recursive calls in the tree multiplication for large sizes; again, the FFTs are parallelized as in step 5.
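To illustrate the fork-join pattern just described, here is a tiny sketch of ours in MIT Cilk-5 style (the procedure names are hypothetical, and fft_cilk stands for a Cilk version of the radix-2 FFT from Sect. 3.1):

```c
/* the two forward transforms of step 5 run in parallel; sync joins them
 * before the pointwise products */
cilk void fft_cilk(fp *a, int n, fp w, fp p);

cilk void two_ffts(fp *a, fp *b, int n, fp w, fp p) {
    spawn fft_cilk(a, n, w, p);
    spawn fft_cilk(b, n, w, p);
    sync;
}
```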

To improve the parallel speedup we also parallelized the polynomial multiplication in step 3 and the computation of the roots in step 8. Although step 8 is O(|S|), it is relatively expensive because of two inverse computations in \(\mathbb {F}_p\). Because about 5% of the computation is not parallelized, the maximum parallel speedup we can obtain is a factor of \(1 / (0.05 + 0.95 / 8) \approx 5.9\). The best overall parallel speedup we obtained is a factor of 4.6 (= 1465/307.7) for \(d = 2^{25} - 1\).