1 Introduction

1.1 Background

RSA [2] and elliptic curve cryptography (ECC) [3, 4] are public-key cryptosystems that are the most widely used in practice. RSA and ECC are believed to be secure since there are no known polynomial time algorithms for solving the factorization problem and elliptic curve discrete logarithm problem (ECDLP). NIST [5] recommends elliptic curves for ECC over a prime field \({\mathbb {F}}_q\) and a binary field \({\mathbb {F}}_{2^n}\). Specifically, degrees \(n = 163, 233, 283\), and 571 are recommended for binary elliptic curves. However, Shor [6] proposed a quantum algorithm that solves the factorization problem and ECDLP in polynomial time. Then, designing post-quantum public-key cryptosystems (PQC) has been paid much attention and the timing of the transition to PQC has been actively discussed.

Despite the theoretical effectiveness, Shor’s algorithm is currently not efficient in practice. For example, there are several reports of the quantum algorithm to solve the factorization problem [7,8,9,10,11,12,13,14,15,16]; however, the target composite integers are mainly 15 and 21, while the classical factorization of 795-bit composite integers has been reported [17]. The situation stems from the fact that physical realizations of large-scale quantum computers have a lot of technical barriers. Thus, there are several papers [18,19,20,21,22,23,24,25,26,27] that estimate the concrete resource estimates of quantum factoring and its improvements in terms of the number of qubits, the number of quantum gates, and depth of circuits.

Compared with the situation of quantum factoring, the quantum resource estimates of the ECDLP were not studied until recently. Although the first attempt was given by Proos and Zalka [28], their analysis lacks the implementation of elliptic curve additions that are the most dominant step to run Shor’s quantum algorithm. Roetteler et al. [29] showed the first concrete resource estimates of ECDLP over a prime field \({\mathbb {F}}_q\) by indicating how to perform elliptic curve additions quantumly. Subsequently, Banegas et al. [30] gave the alternative results for a binary field \({\mathbb {F}}_{2^n}\) and the work was followed by Putranto et al. [31].

In this paper, we focus on binary elliptic curves. We especially study an inversion in \({\mathbb {F}}_{2^n}\), where the computation is the most dominant operation to realize elliptic curve additions. For this purpose, Banegas et al. [30] proposed two quantum methods for inversion in \({\mathbb {F}}_{2^n}\), i.e., an extended GCD-based inversion and FLT-based inversionFootnote 1 inspired by Bernstein and Yang’s inversion [32] and Itoh and Tsujii’s inversion [33], respectively. Their results indicate that the extended GCD-based inversion requires fewer qubits, while the FLT-based inversion requires fewer Toffoli gates and a smaller depth of circuits. Although Banegas et al. [30] tried to minimize the required number of qubits, Putranto et al. [31] revisited the analysis to minimize the depth of circuits. Then, Putranto et al. proposed a quantum FLT-based inversion algorithm that works with a smaller depth of circuits and larger qubits than Banegas et al.’s FLT-based inversion algorithm, while the numbers of Toffoli gates are unchanged.

1.2 Our contribution

In this paper, we propose two quantum FLT-based inversion algorithms. We concretely analyze quantum resources for the algorithms over NIST-recommended curves. Then, we show that our proposed algorithms improve previous FLT-based inversion algorithms by Banegas et al. [30] and Putranto et al. [31] for all degrees \(n = 163, 233, 283\), and 571. Briefly speaking, our first and second algorithms are based on FLT-based inversion algorithms by Putranto et al. and Banegas et al., respectively. Intuitively, our algorithms successfully overcome the disadvantages of previous FLT-based inversion algorithms. Indeed, for all degrees n, our first and second algorithms require fewer qubits and smaller depth of circuits than Putranto et al. and Banegas et al., respectively. Moreover, we want to claim two further benefits of our algorithms. At first, our algorithms do not sacrifice the advantages of previous FLT-based inversion algorithms in the sense that the number of qubits, number of Toffoli gates, and depth of circuits of our first and second algorithms do not exceed those of Putranto et al. and Banegas et al., respectively. Next, our algorithms successfully reduce the number of Toffoli gates of previous FLT-based inversion algorithms for \(n = 571\). In other words, our algorithms improve all three factors of previous FLT-based inversion algorithms for \(n = 571\). For example, our first (resp. second) algorithm for \(n = 571\) requires \(74\%\), \(93\%\), and \(97\%\) (resp. \(93\%\), \(93\%\), and \(79\%\)) of qubits, Toffoli gates, and depth of Putranto et al.’s algorithm (resp. Banegas et al.’s algorithm). We also apply windowing to our algorithms. Windowing is a way for reducing Toffoli gates by using quantum read-only memory (QROM). Both Banegas et al. [30] and Putranto et al. [31] also estimated the number of Toffoli gates when windowing is applied.

Difference from preliminary version In the preliminary version [1], we use quantum multiplication by Hoof [34] to estimate quantum resources. Recently, Kim et al. [35] proposed a new quantum multiplication that has an advantage over Hoof’s one in terms of the number of Toffoli gates and depth. Therefore, in this version, we use Kim et al.’s quantum multiplication to estimate quantum resources.

1.3 Technical overview

Both previous quantum FLT-based inversion algorithms by Banegas et al. [30] and Putranto et al. [31] are modifications of Itoh and Tsujii’s classical FLT-based inversion algorithm [33]. Given \(f \in {\mathbb {F}}_{2^n}^*\), both classical and quantum FLT-based inversion algorithms compute \(f^{-1} \in {\mathbb {F}}_{2^n}^*\) based on the fact that \(f^{2^n - 2} = f^{-1}\). Itoh and Tsujii’s inversion finally computes \(f^{-1}\) by \(\left( f^{2^{n - 1} - 1}\right) ^2 = f^{2^n - 2}\) and the main step of the algorithm is a computation of \(f^{2^{n - 1} - 1}\). Here, we describe how to compute \(f^{2^{n - 1} - 1} = f^{2^{162} - 1}\) when \(n = 163\). Observe that 162 has Hamming weight three in binary, where \(162 = 128 + 32 + 2 = 2^7 + 2^5 + 2^1\). We start from \(f = f^{2^{2^0} - 1}\) and compute each \(f^{2^{2^1} - 1}, f^{2^{2^2} - 1}, \ldots , f^{2^{2^7} - 1}\). Specifically, given \(f^{2^{2^{k - 1}} - 1}\) for \(k = 1, 2, \ldots , 7 = \lfloor \log 162 \rfloor \), we can compute \(f^{2^{2^k} - 1}\) by

$$\begin{aligned} f^{2^{2^{k - 1}} - 1} \times \left( f^{2^{2^{k - 1}} - 1}\right) ^{2^{2^{k - 1}}} =f^{2^{2^{k - 1}} - 1}\times f^{2^{2^k} - 2^{2^{k - 1}}} =f^{2^{2^k} - 1} \end{aligned}$$

with seven field multiplications. Then, we compute \(f^{2^{2^7 + 2^5} - 1}\) and \(f^{2^{2^7 + 2^5 + 2^1} - 1} = f^{2^{162} - 1}\) by

$$\begin{aligned} \left( f^{2^{2^7} - 1}\right) ^{2^{2^5}} \times f^{2^{2^5} - 1} =&~ f^{2^{2^7 + 2^5} - 2^{2^5}} \times f^{2^{2^5} - 1} = f^{2^{2^7 + 2^5} - 1},\\ \left( f^{2^{2^7 + 2^5} - 1}\right) ^{2^{2^1}} \times f^{2^{2^1} - 1} =&~ f^{2^{2^7 + 2^5 + 2^1} - 2^{2^1}} \times f^{2^{2^1} - 1} = f^{2^{2^7 + 2^5 + 2^1} - 1}, \end{aligned}$$

with two field multiplications. Thus, nine field multiplications in total are required for computing \(f^{2^{162} - 1}\). In general, Itoh and Tsujii’s inversion requires \(\lfloor \log (n - 1) \rfloor + t - 1\) field multiplications, where t denotes the Hamming weight of \(n - 1\) in binary.

Next, we explain how to perform FLT-based inversion quantumly. Putranto et al.’s algorithm [31] is simpler than Banegas et al.’s algorithm [30] since Banegas et al.’s algorithm can be viewed as a modification of Putranto et al.’s algorithm by clearing garbages and reduces the required number of qubits. Therefore, we use Putranto et al.’s algorithm to explain an overview of quantum FLT-based inversion. For simplicity, we focus on the number of qubits to perform Putranto et al.’s algorithm. At first, we describe how to compute each \(f^{2^{2^1} - 1}, f^{2^{2^2} - 1}, \ldots , f^{2^{2^7} - 1}\). A point to note is that when given \(f^{2^{2^{k - 1}} - 1}\) as a quantum superposition in i-th register, we cannot efficiently compute \(f^{2^{2^{k}} - 1}\) in the next register. In turn, we apply CNOT gates and copy \(f^{2^{2^{k - 1}} - 1}\) in an \((i + 1)\)-th register. Then, we apply CNOT gates to the i-th register and obtain \(\left( f^{2^{2^{k - 1}} - 1}\right) ^{2^{k - 1}} = f^{2^{2^k} - 2^{2^{k - 1}}}\) in the i-th register. Finally, we apply Toffoli gates to the i-th and \((i + 1)\)-th registers and obtain \(f^{2^{2^{k - 1}} - 1}\times f^{2^{2^k} - 2^{2^{k - 1}}} =f^{2^{2^k} - 1}\) in the \((i + 2)\)-th register. Thus, when given \(f = f^{2^{2^0} - 1}\) in the first register, \(2\lfloor \log 162 \rfloor + 1 = 15\) registers, i.e., 15n qubits, are required so far. Next, we explain how to compute \(f^{2^{2^7 + 2^5} - 1}\) and \(f^{2^{2^7 + 2^5 + 2^1} - 1} = f^{2^{162} - 1}\). When given \(f^{2^{2^7} - 1}\) in i-th register and \(f^{2^{2^5} - 1}\) in j-th register, we apply CNOT gates to the i-th register and obtain \(\left( f^{2^{2^7} - 1}\right) ^{2^{2^5}} = f^{2^{2^7 + 2^5} - 2^{2^5}}\) in the i-th register. Then, we apply Toffoli gates to the i-th and j-th registers and obtain \(f^{2^{2^7 + 2^5} - 2^{2^5}} \times f^{2^{2^5} - 1}= f^{2^{2^7 + 2^5} - 1}\) in the 16-th register. Similarly, we can compute \(f^{2^{2^7 + 2^5 + 2^1} - 1} = f^{2^{162} - 1}\) to the 17-th register. Finally, we apply CNOT gates to the 17-th register and obtain \( = f^{2^{163} - 2}\) in the 17-th register. Therefore, 17 registers, i.e., 17n qubits, are required in total. In general, Putranto et al.’s quantum FLT-based inversion algorithm requires \((2\lfloor \log (n - 1) \rfloor + t)n\) qubits.

Summarizing the above discussion, given \(f = f^{2^{2^0} - 1}\) and the previous FLT-based inversion algorithms for \(n = 163\) computes \(f^{2^{2^1} - 1}, f^{2^{2^2} - 1}, \ldots , f^{2^{2^7} - 1}\), \(f^{2^{2^7 + 2^5} - 1}\), and \(f^{2^{2^7 + 2^5 + 2^1} - 1} = f^{2^{162} - 1}\). The first key observation of our improvement is that the exponents of 2 during the calculation, i.e.,

$$\begin{aligned} \{2^0 = 1, 2^1, 2^2, \ldots , 2^7, 2^7 + 2^5, 2^7 + 2^5 + 2^1 = 162\}, \end{aligned}$$

is an addition chain for \(n - 1 = 162\). In general, an addition chain for N is a sequence \(p_0 = 1, p_1, \ldots , p_\ell = N\), where \(p_s = p_i + p_j\) holds for some \(0 \le i, j < s\). Here, \(\ell \) is called a length of an addition chain. We show that \(f^{2^{n - 1} - 1}\) can be computed with an arbitrary addition chain for \(n - 1\) by following the similar steps of Putranto et al.’s algorithm. For example, there is another addition chain

$$\begin{aligned} \{1, 2, 4, 8, 16, 32, 33, 65, 97, 162\} \end{aligned}$$

for 162. Keen readers may think that the observation is not interesting since the relation between FLT-based inversion and addition chain has been already discussed in the context of classical computation [36,37,38,39,40]. These papers mentioned that the computational cost of FLT-based inversion relates to the length of addition chains in the sense that the number of field multiplications \(\lfloor \log (n - 1) \rfloor + t - 1\) is the same as the length of addition chains. Similarly, the computational cost of quantum FLT-based inversion relates to the length of addition chains in the sense that the number of Toffoli gates is determined by the length of addition chains. Here, the length of an addition chain \(\{1, 2, 4, 8, 16, 32, 33, 65, 97, 162\}\) is nine which is the same as that of previous addition chain \(\{2^0 = 1, 2^1, 2^2, \ldots , 2^7, 2^7 + 2^5, 2^7 + 2^5 + 2^1 = 162\}\).

However, we show that the computational cost of quantum FLT-based inversion also depends on other properties of addition chains. Hereafter, for an addition chain \(\{p_s\}^\ell _{s = 0}\), we call \(p_s\) a doubled term if it is computed by \(p_s = p_i + p_i\) for some \(0 \le i < s\) and an added term otherwise. In the above example for \(n = 163\), \(2^1, 2^2, \ldots , 2^7\) are doubled terms and \(2^7 + 2^5, 2^7 + 2^5 + 2^1\) are added terms for \(\{2^0 = 1, 2^1, 2^2, \ldots , 2^7, 2^7 + 2^5, 2^7 + 2^5 + 2^1 = 162\}\) whereas \(\{2, 4, 8, 16, 32\}\) are doubled terms and \(\{33, 65, 97, 162\}\) are added terms for \(\{1, 2, 4, 8, 16, 32, 33, 65, 97, 162\}\). For an addition chain \(\{p_s\}^\ell _{s = 0}\), let d and m denote the number of doubled terms and added terms, where \(\ell = d + m\). Then, we show that the number of qubits \((2\lfloor \log (n - 1) \rfloor + t)n\) for Putranto et al.’s algorithm is essentially described by \((2d + m + 1)n\). In other words, even if the lengths of addition chains are the same, the computational costs of the quantum FLT-based inversion algorithm may not be the same depending on other properties of addition chains. Indeed, an addition chain \(\{2^0 = 1, 2^1, 2^2, \ldots , 2^7, 2^7 + 2^5, 2^7 + 2^5 + 2^1 = 162\}\) has seven doubled terms and two added terms whereas \(\{1, 2, 4, 8, 16, 32, 33, 65, 97, 162\}\) has five doubled terms \(\{2, 4, 8, 16, 32\}\) and four added terms \(\{33, 65, 97, 162\}\). Therefore, quantum FLT-based inversion based on the latter addition chain requires fewer qubits than that on the former. Based on the discussion and more, we find more appropriate addition chains for all \(n = 163, 233, 283, 571\) and obtain our improvements.

1.4 Organization

In Sect. 3, we review previous FLT-based inversion algorithms. In Sect. 4, we propose quantum FLT-based inversion algorithms. In Sect. 5, we compare our proposed algorithms and previous quantum algorithms. In Sect. 6, we apply windowing to our algorithms.

2 Preliminaries

In Sect. 2.1, we review binary elliptic curves and the binary elliptic curve discrete logarithm problem (ECDLP). Then, we briefly explain Shor’s algorithm for binary ECDLP in Sect. 2.2. We also describe an overview of quantum computing on the field \({\mathbb {F}}_{2^n}\) in Sect. 2.3.

2.1 Elliptic curve discrete logarithm problem

Let n be a positive integer. A binary elliptic curve of degree n is given by \(y^2+xy=x^3+ax^2+b\), where \(a \in {\mathbb {F}}_{2^n}\) and \(b \in {\mathbb {F}}_{2^n}^*\). In general, the set of rational points on an elliptic curve along with a special point O called a point at infinity forms a group under point addition, where O is a neutral element. Let \(P=(x_1,y_1)\) and \(Q=(x_2,y_2)\) denote points on a binary elliptic curve. When \(P \ne Q\), a point addition \(P + Q = (x_3,y_3)\) is given by

$$\begin{aligned} x_3 = \lambda ^2+\lambda +x_1+x_2+a,\quad y_3 = (x_2+x_3)\lambda +x_3+y_2 \end{aligned}$$

with \(\lambda = (y_1+y_2)/(x_1+x_2)\). Let [k]P denote \(P + \cdots + P\) that is a sum of k P’s under point addition. Then, \([2]P = (x_3, y_3)\) is given by

$$\begin{aligned} x_3 = \lambda ^2+\lambda +a,\quad y_3 = x_1^2 + (\lambda + 1)x_3 \end{aligned}$$

with \(\lambda = x_1 + y_1/{x_1}\). It is known that only basic arithmetic in \({\mathbb {F}}_{2^n}\) is sufficient for computing point addition on a binary elliptic curve. Then, the task of the binary ECDLP is computing k from P and [k]P.

2.2 Shor’s algorithm for binary ECDLP

Shor’s algorithm for the binary ECDLP of degree n consists of two parts, i.e., the point addition part and Quantum Fourier Transform part. The point addition part requires \(2n+2\) times point additions with \(O(n^3)\) gates, while the Quantum Fourier Transform part requires \(O(n^2)\) gates. Therefore, the point addition part is dominant in Shor’s algorithm. As we mentioned in Sect. 2.1, an inversion in \({\mathbb {F}}_{2^n}\), i.e., computation of \(\lambda \), is required for performing point addition \(P+Q\). Moreover, several works [29,30,31, 41] indicate that the inversion computation requires the largest quantum resources in point addition. Therefore, the efficiency of quantum inversion computations greatly affects the total quantum resources for Shor’s algorithm.

2.3 Quantum computation in \({\mathbb {F}}_{2^n}\)

In quantum computation, we use a “qubit” represented by \(\mathinner {|{0}\rangle }, \mathinner {|{1}\rangle }\) and their superposition. We represent an element of \({\mathbb {F}}_{2^n}\) by n qubits. Here, we use the fact that for \(m(x) \in {\mathbb {F}}_2[x]\) which is an irreducible polynomial of degree n, it holds that \({\mathbb {F}}_{2^n} \simeq {\mathbb {F}}_2[x]/(m(x))\). Thus, we can express an element of \(\mathbb F_{2^n}\) as a polynomial of degree at most \(n-1\) with its coefficients 0 or 1.

In quantum circuits, we use some quantum gates that are similar to NOT, AND, and OR in classical circuits. In this paper, we consider only CNOT gates, Toffoli (TOF) gates, and swap gates. Let ab, and c denote quantum states of one-qubit. Then, CNOT, TOF, and swap operations are given by

$$\begin{aligned} \text {CNOT}(a,b) =&~ (a,a \oplus b), \qquad \text {TOF}(a,b,c) = (a,b,c \oplus (a \cdot b)), \\ \text {swap}(a,b) =&~ (b,a), \end{aligned}$$

respectively. The swap gate consists of three CNOT gates, while the TOF gate is more expensive than a CNOT and swap gate.

We summarize known quantum algorithms which we will use for performing basic arithmetic in \({\mathbb {F}}_{2^n}\). Let ADD and \(\texttt {SQUARE}\) denote Banegas et al.’s algorithms [30] for addition and squaring, respectively, while \(\texttt {MODMULT}\) denote Kim et al.’s algorithm [35] for multiplication. Let fg, and h be quantum states of elements in \({\mathbb {F}}_{2^n}\). Then, the algorithms are described as follows:

$$\begin{aligned} \texttt {ADD}(f,g) =&~ (f,f + g), \qquad \texttt {SQUARE}(f) = f^2, \\ \texttt {MODMULT}(f,g,h) =&~ (f,g,h + fg). \end{aligned}$$

Similarly, we also use a \(\texttt {SQUARE}^{-1}\) operation given by

$$\begin{aligned} \texttt {SQUARE}^{-1}(f^2) = f. \end{aligned}$$

Here, ADD, SQUARE, and SQUARE \(^{-1}\) are based on only CNOT gates. Specifically, the number of CNOT gates are n for ADD, and at most \(n^2 - n\) for SQUARE or SQUARE \(^{-1}\). In contrast, \(\texttt {MODMULT}\) requires not only CNOT gates but also TOF gates. Throughout the paper, \(\texttt {ADD}\) and \(\texttt {MODMULT}\) may take only specific inputs. Let \(\varvec{0}\) denote a quantum state of a zero element in \({\mathbb {F}}_{2^n}\). Then, when we set \(g = \varvec{0}\) as the input of \(\texttt {ADD}\), given f and \(\texttt {ADD}(f, \varvec{0}) = (f, f)\) copy f to a new n-qubit register. Similarly, when we set \(h = \varvec{0}\) as the input of \(\texttt {MODMULT}\), given fg and \(\texttt {MODMULT}(f, g, \varvec{0}) = (f, g, fg)\) writes fg in a new n-qubit register.

3 FLT-based inversion

In this section, we review previous FLT-based inversion algorithms. In Sect. 3.1, we briefly explain Itoh and Tsujii’s classical FLT-based inversion [33]. Then, in Sects. 3.2 and 3.3, we review Putranto et al.’s [31] and Banegas et al.’s [30] quantum FLT-based inversion algorithm.

3.1 Classical FLT-based inversion

Let f be an element of \({\mathbb {F}}_{2^n}^*\). For simplicity, we use a notation

$$\begin{aligned} \langle \alpha \rangle {:=}f^{\alpha } \end{aligned}$$

hereafter. The task of inversion is computing \(\langle -1 \rangle \) from \(\langle 1 \rangle \). Based on the extended Fermat’s little theorem, the FLT-based inversion method performs inversion by computing \(\langle 2^n-2 \rangle = \langle -1 \rangle \). For this purpose, we use the following three relations:

$$\begin{aligned} \langle 2^{2^{k-1}}-1 \rangle \times \langle 2^{2^{k-1}}-1 \rangle ^{2^{2^{k-1}}}&= \langle 2^{2^k}-1 \rangle ,\end{aligned}$$
(1)
$$\begin{aligned} \langle 2^{\alpha }-1 \rangle ^{2^{\beta }}\times \langle 2^{\beta }-1 \rangle&= \langle 2^{\alpha +\beta }-1 \rangle , \end{aligned}$$
(2)
$$\begin{aligned} \langle 2^{n-1}-1 \rangle ^2&= \langle 2^{n}-2 \rangle . \end{aligned}$$
(3)

Let t denote the Hamming weight of \(n-1\) in binary. Then, we have \(n - 1 = \sum _{s=1}^{t}2^{k_s}\) with \(k_1 = \lfloor \log _2(n-1) \rfloor> k_2> \cdots > k_t \ge 0\). The FLT-based inversion consists of three steps as follows.

First Step::

The step computes \(\langle 2^{2^1} - 1 \rangle , \langle 2^{2^2} - 1 \rangle , \ldots , \langle 2^{2^{k_1}} - 1 \rangle \) from \(\langle 2^{2^0} - 1 \rangle = \langle 1 \rangle \). For this purpose, we apply (1) to \(\langle 2^{2^{i-1}}-1 \rangle \) and obtain \(\langle 2^{2^i}-1 \rangle \) for \(i = 1, 2, \ldots , k_1\) sequentially.

Second Step::

The step computes \(\langle 2^{n-1}-1 \rangle \) from \(\langle 2^{2^{k_1}}-1 \rangle , \langle 2^{2^{k_2}}-1 \rangle , \ldots , \langle 2^{2^{k_t}}-1 \rangle \) which were computed in the first step. For this purpose, we apply (2) to \(\langle 2^{2^{k_{i + 1}}}-1 \rangle \) and \(\langle 2^{\sum ^{i}_{s = 1}2^{k_s}}-1 \rangle \), and obtain \(\langle 2^{2^{k_{i + 1}}}-1 \rangle \times \langle 2^{\sum ^{i}_{s = 1}2^{k_s}}-1 \rangle ^{2^{2^{k_{i + 1}}}} = \langle 2^{\sum ^{i + 1}_{s = 1}2^{k_s}}-1 \rangle \) for \(i = 1, 2, \ldots , t - 1\) sequentially, where the last output is \(\langle 2^{\sum ^t_{s = 1}2^{k_s}}-1 \rangle = \langle 2^{n - 1}-1 \rangle \).

Third Step::

The step applies (3) to \(\langle 2^{n - 1}-1 \rangle \) and obtain \(\langle 2^n-2 \rangle = \langle -1 \rangle \).

Since the procedure may be complicated at the first glance, we describe the above procedure in a case of \(n=163\). In this case, it holds that \(n - 1 = 162 = 2^7 + 2^5 + 2^1\), where \(t = 3\) and \(k_1 = 7, k_2 = 5, k_3 = 1\). In the first step, we compute \(\langle 2^{2^1} - 1 \rangle , \langle 2^{2^2} - 1 \rangle , \ldots , \langle 2^{2^{7}} - 1 \rangle \) from \(\langle 2^{2^0} - 1 \rangle = \langle 1 \rangle \). For this purpose, we apply (1) to \(\langle 2^{2^{0}}-1 \rangle , \langle 2^{2^{1}}-1 \rangle , \ldots , \langle 2^{2^{6}}-1 \rangle \) and obtain \(\langle 2^{2^{1}}-1 \rangle , \langle 2^{2^{2}}-1 \rangle , \ldots , \langle 2^{2^{7}}-1 \rangle \), respectively. In the second step, we compute \(\langle 2^{2^7+2^5}-1 \rangle \) and \(\langle 2^{2^7+2^5+2^1}-1 \rangle = \langle 2^{162}-1 \rangle \) from \(\langle 2^{2^{7}}-1 \rangle , \langle 2^{2^{5}}-1 \rangle , \langle 2^{2^{1}}-1 \rangle \). For this purpose, we first apply (2) to \(\langle 2^{2^{7}}-1 \rangle \) and \(\langle 2^{2^{5}}-1 \rangle \), and obtain \(\langle 2^{2^{7}}-1 \rangle ^{2^{2^{5}}} \times \langle 2^{2^{5}}-1 \rangle = \langle 2^{2^7+2^5}-1 \rangle \). Then, we apply (2) to \(\langle 2^{2^{7} + 2^5}-1 \rangle \) and \(\langle 2^{2^{1}}-1 \rangle \), and obtain \(\langle 2^{2^{7} + 2^5}-1 \rangle ^{2^{2^{1}}} \times \langle 2^{2^{1}}-1 \rangle = \langle 2^{2^7+2^5 + 2^1}-1 \rangle = \langle 2^{162} - 1 \rangle \). Finally, in the third step, we apply (3) to \(\langle 2^{162}-1 \rangle \) and obtain \(\langle 2^{163}-2 \rangle = \langle -1 \rangle \).

3.2 Putranto et al.’s quantum FLT-based inversion algorithm

Algorithm 1
figure a

Putranto et al.’s quantum FLT-based inversion algorithm

We explain Putranto et al.’s quantum FLT-based inversion algorithm [31] that is a simple quantum translation of Itoh and Tsujii’s classical FLT-based inversion [33]. Putranto et al.’s algorithm is given in Algorithm 1. The algorithm saves the number of TOF gates by using SQUARE which uses only CNOT gates. Here, we explain the main parts of Algorithm 1, i.e., the loop from line 1 to 5 and from line 6 to 9.

Loop from line 1 to 5::

The loop performs the first step of Itoh and Tsujii’s FLT-based inversion. Specifically, for \(i = 1, 2, \ldots , k_1\), the i-th loop takes \(f_{2(i-1)} = \langle 2^{2^{i-1}}-1 \rangle \) as input and outputs \(\langle 2^{2^i}-1 \rangle \) by applying (1). For this purpose, we first apply ADD to copy \(f_{2(i-1)} = \langle 2^{2^{i-1}}-1 \rangle \) in a new register \(f_{2(i-1) + 1}\). Then, we apply the \(\texttt {SQUARE}\) operation \(2^{i-1}\) times to \(f_{2(i-1) + 1} = \langle 2^{2^{i-1}}-1 \rangle \) and obtain \(\langle 2^{2^{i-1}}-1 \rangle ^{2^{2^{i-1}}}\) in the same register. Finally, we apply \(\texttt {MODMULT}\) to \(f_{2(i-1)} = \langle 2^{2^{i-1}}-1 \rangle \) and \(f_{2(i-1) + 1} = \langle 2^{2^{i-1}}-1 \rangle ^{2^{2^{i-1}}}\), and obtain \(\langle 2^{2^i}-1 \rangle \) in a new register \(f_{2(i-1) + 2}\). Therefore, we use the \(\texttt {MODMULT}\) operation \(k_1\) times and new \(2k_1\) registers, i.e., \(2k_1n\) qubits.

Loop from line 6 to 9::

The loop performs the second step of Itoh and Tsujii’s FLT-based inversion. Specifically, for \(i = 1, 2, \ldots , t - 1\), the i-th loop takes \(f_{2k_{i + 1}} = \langle 2^{2^{k_{i + 1}}}-1 \rangle \) and \(f_{2k_1 + i - 1} = \langle 2^{\sum ^{i}_{s = 1}2^{k_s}}-1 \rangle \) as input, and outputs \(\langle 2^{\sum ^{i + 1}_{s = 1}2^{k_s}}-1 \rangle \) by applying (2). For this purpose, we first apply the \(\texttt {SQUARE}\) operation \(2^{k_{i + 1}}\) times to \(f_{2k_1 + i - 1} = \langle 2^{\sum ^{i}_{s = 1}2^{k_s}}-1 \rangle \) and obtain \(\langle 2^{\sum ^{i}_{s = 1}2^{k_s}}-1 \rangle ^{2^{2^{k_{i + 1}}}}\) in the same register. Then, we apply \(\texttt {MODMULT}\) to \(f_{2k_{i + 1}} = \langle 2^{2^{k_{i + 1}}}-1 \rangle \) and \(f_{2k_1 + i - 1} = \langle 2^{\sum ^{i}_{s = 1}2^{k_s}}-1 \rangle ^{2^{2^{k_{i + 1}}}}\), and obtain \(\langle 2^{\sum ^{i + 1}_{s = 1}2^{k_s}}-1 \rangle \) in a new register \(f_{2k_1 + i}\).

Therefore, we use \(\texttt {MODMULT}\) operation \(t-1\) times and new \(t-1\) registers, i.e., \((t-1)n\) qubits. We note that the last output of the loop is \(f_{k_p} = \langle 2^{\sum ^t_{s = 1}2^{k_s}}-1 \rangle = \langle 2^{n - 1}-1 \rangle \).

Although we omit the detail, the line 12 performs the third step of Itoh and Tsujii’s FLT-based inversion. To sum up, Algorithm 1 applies the \(\texttt {MODMULT}\) operation \(k_1+t-1\) times and uses new \((2k_1+t-1)n = k_pn\) qubits.

We note that we use Algorithm 1 two times for an inversion computation each. The second operation uncomputes the ancillary qubits.

3.3 Banegas et al.’s quantum FLT-based inversion algorithm

Algorithm 2
figure b

Banegas et al.’s quantum FLT-based inversion algorithm

We explain Banegas et al.’s quantum FLT-based inversion algorithm [30] that is a fewer-qubit variant of Putranto et al.’s algorithm. Banegas et al.’s algorithm is given in Algorithm 2 by clearing garbages. Algorithm 2 is similar to Algorithm 1 except the additional step in from line 6 to 8. To demonstrate the effectiveness of the step, we again focus on Algorithm 1. From line 1 to 5, for \(i = 1, 2, \ldots , k_1\), the i-th loop takes \(f_{2(i-1)} = \langle 2^{2^{i-1}}-1 \rangle \) as input and outputs \(f_{2(i-1)} = \langle 2^{2^i}-1 \rangle \). During the computation, we also use a register \(f_{2(i-1) + 1}\) that results in \(f_{2(i-1) + 1} = \langle 2^{2^{i-1}}-1 \rangle ^{2^{2^{i-1}}}\). A point to note is that the register \(f_{2(i-1) + 1}\) is used only for the computation and remains as it is. Therefore, Algorithm 2 initializes the register and successfully reduce the qubits by applying SQUARE \(^{-1}\). On the other hand, due to the additional procedure, Algorithm 2 requires larger depth and more CNOT gates than Algorithm 1. We explain the loop from line 1 to line 8 in Algorithm 2 below.

Loop from line 1 to 8::

The loop performs the same step of the loop from line 1 to 5 in Algorithm 1. In particular, \(f_{k_{i - 1}}, f_{k_b}\), and \(f_i\) in Algorithm 2 play the same role as \(f_{k_{2(i - 1)}}, f_{2(i - 1) + 1}\), and \(f_{2(i - 1) + 2}\) in Algorithm 1, respectively. Thus, the loop takes \(f_{i-1} = \langle 2^{2^{i-1}}-1 \rangle \) as input and results in \(f_{i-1} = \langle 2^{2^{i-1}}-1 \rangle \), \(f_{k_b} = \langle 2^{2^{i-1}}-1 \rangle ^{2^{2^{i-1}}}\), and \(f_i = \langle 2^{2^i}-1 \rangle \) by line 5. Then, we apply the \(\texttt {SQUARE}^{-1}\) operation \(2^{i-1}\) times to \(f_{k_b} = \langle 2^{2^{i-1}}-1 \rangle ^{2^{2^{i-1}}}\) and obtain \(\langle 2^{2^{i-1}}-1 \rangle \) in the same register. Finally, we apply \(\texttt {ADD}\) to \(f_{i-1} = \langle 2^{2^{i-1}}-1 \rangle \) and \(f_{k_b} = \langle 2^{2^{i-1}}-1 \rangle \), and initialize \(f_{k_b}\). Since \(f_{k_b}\) in Algorithm 2 plays the same role as \(f_{2(i - 1) + 1}\) in Algorithm 1 for all \(i = 1, 2, \ldots , k_1\), Algorithm 2 reduces \(k_1 - 1\) registers, i.e., \((k_1 - 1)n\) qubits. Therefore, we use the \(\texttt {MODMULT}\) operation \(k_1\) times and new \(k_1 + 1\) registers, i.e., \((k_1 + 1)n\) qubits.

Although we omit the detail, \(f_{k_b}\) is also used to store the outputs of second and third steps. Thus, Algorithm 2 reduces one more register, i.e., n qubits. To sum up, Algorithm 2 applies the \(\texttt {MODMULT}\) operation \(k_1 + t -1\) times and use new \((k_1+t-1)n = k_bn\) qubits.

We repeatedly claim that we use Algorithm 2 two times in each inversion computation.

4 Our method

In this section, we propose quantum FLT-based inversion algorithms. In Sect. 4.1, we review the notion of addition chain which is a core tool of our improvement. In Sects. 4.2 and 4.3, we propose our basic algorithm and extended algorithm that are improvements of Putranto et al.’s algorithm [31] and Banegas et al.’s algorithm [30], respectively.

4.1 Addition chain

Let N and \(\ell \) be nonnegative integers. An addition chain for N of length \(\ell \) is given by \(p_0=1, p_1, p_2, \ldots , p_\ell = N\) with the following property:

  • for all \(s = 1, 2, \ldots , \ell \), there exist i and j which satisfy \(0 \le i,j < s\) and \(p_s = p_i + p_j\).

If there are no i and j such that \(i \ne j\) satisfying \(p_s = p_i + p_j\), \(p_s\) should be computed by \(p_s = 2p_i\) for some \(0 \le i < s\). We call such \(p_s\) a doubled term. Otherwise, we call \(p_s\) including \(p_0\) an added term. For an addition chain \(\{p_s\}^\ell _{s = 0}\), we define two sets

$$\begin{aligned} D{:=}&~\left\{ s \in \{1,2,\ldots ,\ell \}\mid p_s \text { is a doubled term}\right\} ,\\ M{:=}&~ \left\{ s \in \{1,2,\ldots ,\ell \}\mid p_s \text { is an added term}\right\} , \end{aligned}$$

such that \(D \cap M = \emptyset \). We also introduce two sequences \(\{a_{s}\}_{s=1}^{\ell }\) and \(\{b_{s}\}_{s=1}^{\ell }\) that satisfy \(p_s = p_{a_{s}}+p_{b_{s}}\) for all \(1 \le s \le \ell \). Intuitively, the sequences indicate how each term \(p_s\) is computed. We note that the sequences may not be unique for an addition chain \(\{p_s\}^\ell _{s = 0}\).

Aw we explained in Sect. 1.3, there is relation between the FLT-based inversion and addition chains. In the first and second steps of Algorithms 1 and 2, we start from \(\langle 2^{2^0}-1 \rangle \) and compute \(\langle 2^{2^1}-1 \rangle ,\langle 2^{2^2}-1 \rangle ,\ldots ,\langle 2^{2^7}-1 \rangle , \langle 2^{2^7+2^5}-1 \rangle ,\) and \(\langle 2^{2^7+2^5 + 2^1}-1 \rangle = \langle 2^{2^{162}}-1 \rangle \) when \(n=163\). Here, we focus on the exponents of 2, i.e,.,

$$\begin{aligned} \{2^0=1,2^1,2^2,\ldots ,2^7,2^7+2^5,2^7+2^5+2^1=162\}. \end{aligned}$$

We find that the sequence of numbers is an addition chain for 162. Moreover, \(2^1,2^2,\ldots ,2^7\) are doubled terms and \(2^7+2^5,2^7+2^5+2^1=162\) are added terms. In general, Algorithms 1 and 2 are based on the same addition chain for \(n - 1\) following Itoh and Tsujii’s FLT-based inversion. Moreover, the first \(\lfloor \log _2(n - 1)\rfloor \) elements excluding \(2^0 = 1\) are always doubled terms and the last \(t - 1\) elements are always added terms. Hereafter, we call the sequence Itoh and Tsujii’s addition chain.

4.2 Basic algorithm

We find that previous quantum FLT-based inversion algorithms [30, 31] are based on Itoh and Tsujii’s addition chains that are automatically determined by the value \(n - 1\). Here, we show that Putranto et al.’s algorithm [31] can use arbitrary addition chains and does not necessarily have to be specific to Itoh and Tsujii’s addition chains.

At first, we introduce some properties that arbitrary addition chains inherently satisfy. These properties enable us to prove the main theorem later.

Lemma 1

For an arbitrary addition chain \(\{p_s'\}_{s=0}^\ell \) for N of length \(\ell \), there exists an addition chain \(\{p_s\}_{s=0}^\ell \) for the same N and \(\ell \) so that the latter addition chain satisfies following properties.

(i):

Both \(\{p_s\}_{s=0}^\ell \) and \(\{p'_s\}_{s=0}^\ell \) consist of the same elements although the order may not be the same. In other words, for all \(0< s < \ell \), there exists \(0< s' < \ell \) such that \(p_s = p_{s'}'\). Specifically, \(p_0 = p'_0 = 1\) and \(p_\ell = p'_\ell = N\) hold.

(ii):

A sequence consisting of only added terms of \(\{p_s\}_{s=0}^\ell \) are monotonically increasing. In other words, for all \(i, j \in M\) such that \(i < j\), it holds that \(p_i < p_j\).

(iii):

An element for computing a doubled term appear just before the doubled term. In other words, for all \(i \in D\), it holds that \(p_i = 2p_{i - 1}\).

Proof

It is clear that for an arbitrary addition chain \(\{p_s'\}_{s=0}^\ell \) for N of length \(\ell \), there is a unique sequence \(\{p_s\}_{s=0}^\ell \) that satisfy all properties (i)–(iii). What we have to show is that \(\{p_s\}_{s=0}^\ell \) is an addition chain for N of length \(\ell \). Due to the property (i), \(p_0 = 1\) and \(p_\ell = N\) hold. We complete the proof by showing that for all \(s = 1, 2, \ldots , \ell \), there exist i and j which satisfy \(0 \le i \le j < s\) and \(p_s = p_i + p_j\). If \(s \in D\), it holds that \(p_s = 2p_{s - 1} = p_{s - 1} + p_{s - 1}\) due to the property (iii).

Hereafter, we consider the case of \(s \in M\) such that \(p_s = p_i + p_j\). To prove the claim, we show that for all \(1 \le s < v \le \ell \), it holds that \(p_s < p_v\). If the statement holds, there exist i and j which satisfy \(0 \le i \le j < s\) and \(p_s = p_i + p_j\) since \(p_i < p_s\) and \(p_j < p_s\) hold. If \(v \in M\) holds, then it holds that \(p_s < p_v\) due to the property (ii). If \(v \in D\), then there exists an index \(v' \in M\) such that \(s \le v' < v\) and \(p_v = 2^{v - v'}p_{v'}\). Due to the property (ii), it holds that \(p_s \le p_{v'} < 2^{v - v'}p_{v'} = p_v\). Thus, we complete the proof. \(\square \)

We are ready for providing the existence of quantum an FLT-based inversion algorithm that uses an arbitrary addition chain.

Theorem 2

Let f be an element of \({\mathbb {F}}_{2^n}^*\) and \(\{p_s\}_{s=0}^{\ell }\) be an addition chain for \(n-1\) of length \(\ell \) satisfying the properties (i)–(iii) of Lemma 1. Let d and m denote the numbers of doubled terms and added terms in \(\{p_s\}_{s=0}^{\ell }\), respectively. There exists a quantum algorithm that takes \(f = \langle 1 \rangle \) and \(\{p_s\}_{s=0}^{\ell }\) as input and outputs \(\langle 2^{n-1}-1 \rangle \) with new \((2d+m+1)n = (\ell + d + 1)n\) qubits and \(\texttt{MODMULT}\) operations \(\ell \) times.

We note that an algorithm given in Theorem 2 is an extension of Putranto et al.’s algorithm [31] for an arbitrary addition chain. In other words, when the algorithm takes Itoh and Tsujii’s addition chain as input, then the efficiency is the same as Putranto et al.’s algorithm since it holds that \(d = \lfloor \log _2(n - 1) \rfloor \) and \(m = t - 1\) for Itoh and Tsujii’s addition chain.

Proof

In this proof, we assume \(p_{a_s} \le p_{b_s}\), where \(\{a_{s}\}_{s=1}^{\ell }\) and \(\{b_{s}\}_{s=1}^{\ell }\) are sequences that satisfy \(p_s = p_{a_{s}}+p_{b_{s}}\) for all \(1 \le s \le \ell \) as we introduced in Sect. 4.1. Hereafter, we are given \(\langle 2^{p_0}-1 \rangle = f\) and compute \(\langle 2^{p_1}-1 \rangle ,\ldots ,\langle 2^{p_\ell }-1 \rangle \) sequentially. We show the proof by mathematical induction. Specifically, we show how to compute \(\langle 2^{p_u}-1 \rangle \) for \(1 \le u \le \ell \) by assuming that \(\langle 2^{p_1}-1 \rangle ,\ldots ,\langle 2^{p_{u - 1}}-1 \rangle \) have been computed.

At first, we discuss the simplest case. In particular, we show how to compute \(\langle 2^{p_u}-1 \rangle \) by assuming that \(\langle 2^{p_{a_u}}-1 \rangle \) and \(\langle 2^{p_{b_u}}-1 \rangle \) are stored as they are. We divide the situation into two cases, i.e., \(u \in D\) and \(u \in M\), and explain separately.

Case of \(u \in D\)::

We can compute \(\langle 2^{p_u}-1 \rangle \) in essentially the same way as in the loop from line 1 to 5 in Algorithm 1. Let \(\langle 2^{p_{a_u}}-1 \rangle \) be stored in i-th register. We first apply ADD to copy \(\langle 2^{p_{a_u}}-1 \rangle \) in a new j-th register. Then, we apply the \(\texttt {SQUARE}\) operation \(2^{p_{a_u}}\) times to j-th register and obtain \(\langle 2^{2p_{a_u}} - 2^{p_{a_u}} \rangle \) in the same register. Finally, we apply \(\texttt {MODMULT}\) to \(\langle 2^{p_{a_u}}-1 \rangle \) in the i-th register and \(\langle 2^{2p_{a_u}} - 2^{p_{a_u}} \rangle \) in the j-th register, and obtain \(\langle 2^{2^{2p_{a_u}}}-1 \rangle \) in a new k-th register. Due to \(u \in D\), it holds that \(p_u = p_{a_{u}}+p_{a_{u}} = 2p_{a_{u}}\), i.e., \(\langle 2^{2^{2p_{a_u}}}-1 \rangle = \langle 2^{p_u} - 1 \rangle \). Here, we use the \(\texttt {MODMULT}\) operation once and new two registers (j-th and k-th register), i.e., 2n qubits.

Case of \(u \in M\)::

We can compute \(\langle 2^{p_u}-1 \rangle \) in essentially the same way as in the loop from line 6 to 9 in Algorithm 1. Let \(\langle 2^{p_{a_u}}-1 \rangle \) and \(\langle 2^{p_{b_u}}-1 \rangle \) be stored in i-th register and j-th register, respectively. We first apply the \(\texttt {SQUARE}\) operation \(2^{p_{b_u}}\) times to \(\langle 2^{p_{a_u}}-1 \rangle \) in i-th register and obtain \(\langle 2^{p_{a_u} + p_{b_u}} - 2^{p_{b_u}} \rangle \) in the same register. Then, we apply \(\texttt {MODMULT}\) to \(\langle 2^{p_{a_u} + p_{b_u}} - 2^{p_{b_u}} \rangle \) in the i-th register and \(\langle 2^{p_{b_u}}-1 \rangle \) in the j-th register, and obtain \(\langle 2^{p_{a_u} + p_{b_u}} - 1 \rangle = \langle 2^{p_u}-1 \rangle \) in a new k-th register. Here, we use the \(\texttt {MODMULT}\) operation once and new one register (k-th register), i.e., n qubits.

After the computation, \(\langle 2^{p_{a_u}}-1 \rangle \) is still stored as it is if \(u \in D\); however, \(\langle 2^{p_{a_u}}-1 \rangle \) becomes \(\langle 2^{p_{a_u}}-1 \rangle ^{2^{p_{b_u}}} = \langle 2^{p_{a_u} + p_{b_u}}-2^{p_{b_u}} \rangle \) if \(u \in M\). In other words, an assumption that \(\langle 2^{p_{a_u}}-1 \rangle \) and \(\langle 2^{p_{b_u}}-1 \rangle \) are stored as they are does not always hold. We note that the assumption always hold if \(u \in D\) since \(a_u = u - 1\) due to the property (iii) of Lemma 1.

Next, we show how to compute \(\langle 2^{p_u}-1 \rangle \) for \(u \in M\) in general. Let \(c_u\) and \(d_u\) be nonnegative integers. Then, we show how to compute \(\langle 2^{p_u}-1 \rangle \) from \(\langle 2^{p_{a_u} + p_{c_u}}-2^{p_{c_u}} \rangle \) and \(\langle 2^{p_{b_u} + p_{d_u}}-2^{p_{d_u}} \rangle \). We should consider three cases, i.e., the case of \((c_u,d_u) = (0,0)\), the case of \(c_u > 0 \wedge d_u = 0\), and the case of \(d_u > 0\). When \((c_u,d_u) = (0,0)\), we can compute \(\langle 2^{p_u}-1 \rangle \) as explained above since \(\langle 2^{p_{a_u}}-1 \rangle \) and \(\langle 2^{p_{b_u}}-1 \rangle \) are stored as they are. Hereafter, we show how to compute \(\langle 2^{p_u}-1 \rangle \) if \(c_u > 0 \wedge d_u = 0\) by following the same way as the case of \((c_u,d_u) = (0,0)\). Moreover, we show that the case of \(d_u > 0\) never happens.

Case of \(c_u > 0 \wedge d_u = 0\)::

Let \(\langle 2^{p_{a_u} + p_{c_u}}-2^{p_{c_u}} \rangle \) and \(\langle 2^{p_{b_u}}-1 \rangle \) be stored in i-th register and j-th register, respectively. We first apply the \(\texttt {SQUARE}\) operation \(2^{p_{b_u} - p_{c_u}}\) times to \(\langle 2^{p_{a_u} + p_{c_u}}-2^{p_{c_u}} \rangle \) in the i-th register and obtain \(\langle 2^{p_{a_u} + p_{b_u}} - 2^{p_{b_u}} \rangle \) in the same register. Then, we apply \(\texttt {MODMULT}\) to \(\langle 2^{p_{a_u} + p_{b_u}} - 2^{p_{b_u}} \rangle \) in the i-th register and \(\langle 2^{p_{b_u}}-1 \rangle \) in the j-th register, and obtain \(\langle 2^{p_{a_u} + p_{b_u}} - 1 \rangle = \langle 2^{p_u}-1 \rangle \) in a new k-th register. Here, we use the \(\texttt {MODMULT}\) operation once and new one register (k-th register), i.e., n qubits.

Here, we should check that \(p_{b_u} - p_{c_u} > 0\) holds. As we have described so far, \(\langle 2^{p_{a_u}}-1 \rangle \) becomes \(\langle 2^{p_{a_u} + p_{c_u}}-2^{p_{c_u}} \rangle \) when we compute \(\langle 2^{p_{a_u} + p_{c_u}}-1 \rangle \). If \(p_{a_u} + p_{c_u}\) is a doubled term and \(p_{a_u} = p_{c_u}\) holds, \(\langle 2^{p_{a_u}}-1 \rangle \) is still stored as they are; in other words, \(c_u = 0\) holds. Thus, \(p_{a_u} + p_{c_u}\) is an added term. In this case, since \(\langle 2^{p_{a_u} + p_{c_u}}-1 \rangle \) was already computed, it holds that \(p_{a_u} + p_{c_u} < p_{b_u} + p_{c_u}\) due to the property (ii) of Lemma 1.

Case of \(d_u > 0\)::

As we have described so far, \(\langle 2^{p_{b_u}}-1 \rangle \) becomes \(\langle 2^{p_{b_u} + p_{d_u}}-2^{p_{d_u}} \rangle \) when we compute \(\langle 2^{p_{b_u} + p_{d_u}}-1 \rangle \). Let \(u'\) be an index such that \(p_{u'} = p_{b_u} + p_{d_u}\). Then, it hold that \(a_{u'} = b_u\) and \(b_{u'} = d_u\). Since \(\langle 2^{p_{b_u} + p_{d_u}}-1 \rangle \) was already computed, it holds that \(p_{b_u} + p_{d_u}< p_{a_u} + p_{b_u} \Leftrightarrow p_{d_u} < p_{a_u}\) due to the property (ii) of Lemma 1. Moreover, as we mentioned at the beginning of this proof, \(p_{a_s} \le p_{b_s}\) holds for all s. Thus, it holds that \(p_{a_u} \le p_{b_u} = p_{a_{u'}} \le d_u = p_{b_{u'}}\). This is the contradiction. Thus, \(d_u > 0\) never happens.

To sum up, when we compute \(\langle 2^{p_u}-1 \rangle \), we always apply \(\texttt {MODMULT}\) once and use 2n and n new qubits if \(u \in D\) and \(u \in M\), respectively. Therefore, we apply \(\texttt {MODMULT}\) operation \(d+m=\ell \) times and use new \((2d+m+1)n\) qubits. \(\square \)

Algorithm 3
figure c

Basic algorithm

We describe our basic algorithm based on Theorem 2 in Algorithm 3. We note that Algorithm 3 takes not only an addition chain \(\{p_s\}_{s=0}^{\ell }\) but also \(\{a_s\}_{s=1}^{\ell }\), \(\{b_s\}_{s=1}^{\ell }\), and \(\{Q_s\}_{s=1}^{\ell }\) as input. Here, we explain the roles of the additional inputs. We proved Theorem 2 by assuming \(p_{a_s} < p_{b_s}\); however, the algorithm becomes less efficient since we apply \(\texttt {SQUARE}\) operation \(2^{p_{b_s}}\) times to \(\langle 2^{p_{a_s}} - 1 \rangle \) and obtain \(\langle 2^{p_{a_s} + p_{b_s}} - 2^{p_{b_s}} \rangle \) for computing \(\langle 2^{p_{a_s} + p_{b_s}} - 1 \rangle \) from \(\langle 2^{p_{a_s} + p_{b_s}} - 2^{p_{b_s}} \rangle \) and \(\langle 2^{p_{b_s}} - 1 \rangle \). In other words, we can save the number of \(\texttt {SQUARE}\) if we apply the operation \(2^{p_{a_s}}\) times to \(\langle 2^{p_{b_s}} - 1 \rangle \) and obtain \(\langle 2^{p_{a_s} + p_{b_s}} - 2^{p_{a_s}} \rangle \) for computing \(\langle 2^{p_{a_s} + p_{b_s}} - 1 \rangle \) from \(\langle 2^{p_{a_s} + p_{b_s}} - 2^{p_{a_s}} \rangle \) and \(\langle 2^{p_{a_s}} - 1 \rangle \). Therefore, the restriction \(p_{a_s} < p_{b_s}\) results in more CNOT gates and larger depth. However, the restriction is required for proving the existence of a quantum algorithm for arbitrary addition chains. In contrast, we focus on specific binary curves recommended by NIST. Thus, Algorithm 3 takes \(\{a_s\}_{s=1}^{\ell }\) and \(\{b_s\}_{s=1}^{\ell }\) as input, where it is interesting that \(p_{a_s} \ge p_{b_s}\) hold for most s. The last input \(\{Q_s\}_{s=1}^{\ell }\) describes the numbers of \(\texttt {SQUARE}\) to be applied in each step.

4.3 Extended algorithm

As we explained in Sect. 3.3, Banegas et al. [30] reduced the required qubits from Putranto et al.’s algorithm [31] by clearing garbages and sacrificing the number of CNOT gates and the depth. In the same way, we can reduce required qubits of our Algorithm 3 as described in Algorithm 4. What is more, we introduce a trade-off parameter L, where Algorithm 4 with the larger L requires fewer qubits, more CNOT, and larger depth. We can further save one register, i.e, n qubits, to store the output \(\langle 2^{n}-2 \rangle \) if the last element \(n - 1\) of an addition chain is an added term, where we can find such an addition chain for NIST recommended curves for all n. The performance of Algorithm 4 is described as follows.

Algorithm 4
figure d

Extended algorithm

Theorem 3

Let f be an element of \({\mathbb {F}}_{2^n}^*\) and \(\{p_s\}_{s=0}^{\ell }\) be an addition chain for \(n-1\) of length \(\ell \) satisfying the properties (i)–(iii) of Lemma 1 and \(\ell \in M\). Let d and m denote the numbers of doubled terms and added terms in \(\{p_s\}_{s=0}^{\ell }\), respectively. There exists a quantum algorithm that takes \(f = \langle 1 \rangle \), \(\{p_s\}_{s=0}^{\ell }\), and \(L \in \{0,1,\ldots ,d-1\}\) as input and outputs \(\langle 2^{n-1}-1 \rangle \) with new \((2d+m - L)n = (\ell +d - L)n\) qubits and \(\texttt{MODMULT}\) operations \(\ell \) times.

Algorithm 4 takes \(\texttt {pl}\) and \(\{c\ell _t\}_{t=0}^{d}\) as addition input. An array \(\texttt {pl}\) has \(d-L\) members, and stores indices of the polynomials g which are used for ADD to clear garbages. The sequence \(\{c\ell _t\}_{s=0}^{t}\) describe the number of times to applying SQUARE or SQUARE \(^{-1}\) for clearing garbages. More precisely, we apply SQUARE \(c\ell _t\) times if \(c\ell _t > 0\) and SQUARE \(^{-1}\) \(-c\ell _t\) times if \(c\ell _t < 0\). We set \(c\ell _0 = 0\) and \(\overline{x} {:=}x\mod (d-L)\). Garbages are stored in \(h_0,\ldots ,h_{d-L-1}\) in turn and clearing is performed by initializing them to 0 from \(h_0\) to \(h_{d-L-1}\) in this order. We describe the algorithm for clearing garbages in Algorithm 5. We note that the case of \(L=0\) is different from basic algorithm since clearing to store \(\langle 2^{n-1}-1 \rangle \) is still performed. When \(L = d-1\), we only prepare a polynomial \(h_0\) for garbages; however, initializing is performed whenever we compute \(\langle 2^{p_s}-1 \rangle \), where \(s \in D\). In general, each time L increases by 1, we apply an additional clearing that implicates the trade-off between the number of qubits and the number of CNOT gates, and the depth.

Algorithms 3 and 4 are also applied two times for an inversion computation each. We uncompute the ancillary qubits by the second operation.

Algorithm 5
figure e

GARBAGECLEAR\((c,k,\ell )\)

5 Comparison

In this section, we compare our proposed quantum FLT-based inversion algorithms with previous ones [30, 31]. In Sect. 5.1, we find addition chains for our algorithms. In Sect. 5.2, we compare the quantum resources for computing inversion. In Sect. 5.3, we show the effectiveness of the trade-off parameter L of our extended algorithm. In Sect. 5.4, we compare the quantum resources for point addition and Shor’s algorithm.

Difference from preliminary version As mentioned in Sect. 1.2, we use quantum multiplication by Hoof [34] in [1]; however, we use one by Kim et al. [35] in this version. Therefore, we update the number of quantum resources in Tables and Figures by Kim et al.’s multiplication.

5.1 Our choice of addition chains

As we showed in Theorems 2 and 3, the quantum resource of FLT-based inversion depends on \(d,m,\ell \) of addition chain. Table 1 summarizes \(d,m,\ell \) Itoh and Tsujii’s addition chain for all n recommended by NIST. We find addition chains for all n in order of priority the number of TOF and qubits. In other words, we first find addition chains with the minimum length \(\ell \), then find the one with minimum doubled terms d among them. Table 2 summarizes \(d,m,\ell \) our choice of addition chains, and Table 3 summarizes the concrete addition chains \(\{p_s\}_{s=0}^{\ell }\) with the sequences \(\{a_s\}_{s=1}^{\ell }\), \(\{b_s\}_{s=1}^{\ell }\), and \(\{Q_s\}_{s=1}^{\ell }\) which are input of our algorithms. We can find addition chains with shorter length \(\ell \) for \(n = 571\). Moreover, we can find addition chains with fewer doubled terms d for all n. Our choice of addition chains work well with our algorithms. Indeed, we can save CNOT gates since \(p_{a_s} \ge p_{b_s}\) holds for most s as we discussed at the end of Sect. 4.2. Similarly, we can save one register for Algorithm 4 since \(n - 1\) is an added term as we discussed in Sect. 4.3.

Table 1 d, m, \(\ell \) of Itoh and Tsujii’s addition chains
Table 2 d, m, \(\ell \) of our choice of addition chains

5.2 Comparison in a quantum inversion computation

Table 4 compares quantum resources among the following algorithms:

  • basic algorithm: our proposed Algorithm 3

  • extended algorithm: our proposed Algorithm 4 for \(L=d-1\)

  • PWLK22-FLT: Putranto et al.’s FLT-based algorithm

  • BBHL21-FLT: Banegas et al.’s FLT-based algorithm

  • BBHL21-GCD: Banegas et al.’s GCD-based algorithm

in terms of the number of TOF, qubits, CNOT, and depth.

We do not compare with Kim and Hong’s GCD-based inversion algorithm [42] which achieves fewer qubits and fewer TOF gates than Banegas et al.’s GCD-based inversion algorithm since it does not estimate the number of CNOT gates and the depth.

We compare the quantum resources for computing \(h+gf^{-1}\) from fgh with two inversions and one modular multiplication. Here, the depth of ADD is 1. We calculate the number of CNOT gates and the upper bound of the depth of SQUARE by using LUP decomposition which Banegas et al.’s [30] used. The number of TOF gates and CNOT gates and the upper bound of the depth of MODMULT are given by Hoof [34]. We also calculate the depth considering parallel computation by ourselves, although we do not describe it in detail. However, since paralleling is not complete, the depth is upper bound in each case.

As we described in Sects. 4.2 and 4.3, our algorithms achieve the same performance when we use Itoh and Tsujii’s addition chain. However, we find better addition chains with smaller \(\ell \) and/or d for all n as we claimed in Sect. 5.1. Thus, our basic and extended algorithms are strictly better than PWLK22-FLT and BBHL21-FLT, respectively. Indeed, Algorithms 3 and 4 successfully reduce all quantum resources of PWLK22-FLT and BBHL21-FLT, respectively. Moreover, our extended algorithm achieves smaller depth than PWLK22-FLT when \(n = 571\). Compared with BBHL21-GCD, although BBHL21-GCD achieves fewer qubits than our algorithms by two, our algorithms achieve much fewer TOF than BBHL21-GCD by ten.

Remark 1

In the preliminary version [1], addition chains given in Table 3 are different from the ones which are used for quantum resource estimation. In this version, we correctly describe addition chains used for estimation in Tables 3, 5.

Remark 2

After the publication of the preliminary version [1], Kim and Hong proposed a quantum GCD-based inversion algorithm [42] which achieves slightly fewer qubits and fewer TOF gates than Banegas et al.’s GCD-based inversion algorithm. However, we do not list the algorithm in Table 4 since Kim and Hong did not estimate the number of CNOT gates and the depth and the analysis of their GCD-based algorithm is out of scope of this paper. We note that Kim and Hong’s GCD-based algorithm does not violate the advantage of FLT-based algorithms since the number of TOF gates of the former algorithm is close to that of Banegas et al.’s GCD-based inversion algorithm and much larger than those of FLT-based ones.

Table 3 Our choice of addition chains \(\{p_s\}_{s=0}^{\ell }\) with the sequences \(\{a_s\}_{s=1}^{\ell }\), \(\{b_s\}_{s=1}^{\ell }\), and \(\{Q_s\}_{s=1}^{\ell }\)
Table 4 Comparison of the number of TOF gates, qubits, and CNOT gates and the depth in an inversion between ours and prior work

5.3 Quantum resources trade-off in extended algorithm

We describe the quantum resources of Algorithm 4 (extended algorithm) for all possible trade-off parameters L. As we discussed in Sect. 4.3, the extended algorithm for \(L=0\) is not the case of basic algorithm, but the case that only n qubits for storing the computation results are reduced. Figures 1, 2, 3, 4, 5, 6, 7, 8 illustrate the trade-off with respect to L. Throughout the comparisons, we do not consider the number of TOF since L does not affect it. In all Figs. 12, 3, 4, 5, 6, 7, 8, the round points which are placed on the rightmost represent basic algorithm, then \(L=0,1,2,\ldots \) from the right to the left. We can see that the number of qubits decreases and the number of CNOT gates and the depth increase for the larger L. However, we can see the same depth in the case of basic algorithm and \(L=0\) although the numbers of CNOT gates are not the same. The reason is that we can completely parallelize clearing garbage for storing \(\langle 2^{n-1}-1 \rangle \). Although we may be able to parallelize other clearing procedures and will get better upper bounds of the depth, we leave it as a future work.

Fig. 1
figure 1

Quantum resources trade-off in extended algorithm where \(n=163\)

Fig. 2
figure 2

Quantum resources trade-off in extended algorithm where \(n=233\)

Fig. 3
figure 3

Quantum resources trade-off in extended algorithm where \(n=283\)

Fig. 4
figure 4

Quantum resources trade-off in extended algorithm for \(n=571\)

Fig. 5
figure 5

Quantum resources trade-off in FLT-based inversion algorithms where \(n=163\)

Fig. 6
figure 6

Quantum resources trade-off in FLT-based inversion algorithms where \(n=233\)

Fig. 7
figure 7

Quantum resources trade-off in FLT-based inversion algorithms where \(n=283\)

Fig. 8
figure 8

Quantum resources trade-off in FLT-based inversion algorithms for \(n=571\)

5.4 Comparison in Shor’s algorithm

Table 4 compares quantum resources among Shor’s algorithm based on our proposed FLT-based inversion algorithms and previous inversion algorithms as in Table 4 in terms of the number of TOF, qubits, CNOT, and depth. To perform \(2n + 2\) point additions, we use Banegas et al.’s point addition algorithm [30]. A point addition computation contains two quantum inversion computations. We simply add the numbers in Table 4 for counting the quantum resources. Banegas et al.’s point addition algorithm contains some computations which we do not summarize. We refer to the paper [30] for counting the number of TOF gates and CNOT gates for those computations. We consider parallel quantum computing and calculate the depth of them by ourselves. Since we use semiclassical Fourier transform [43] in a part of Shor’s algorithm, we use only another control qubit to point additions, therefore the whole number of qubits increases by 1 from the number of qubits used in a single inversion. Table 6 shows the number of quantum resources in Shor’s algorithm. Our two algorithms still perform better like a comparison in an inversion algorithm, since inversion computations occupy the largest part of a point addition computation in a view of the number of qubits and quantum gates. However, Banegas et al.’s point addition algorithm initializes \(\lambda \), and this leads us to compute two inversions. If we prepare other n qubits for \(\lambda \) in each point addition, we can save up an inversion and the number of TOF gates and CNOT gates and the depth will be about a half of the values summarized in Table 6. Then, the number of qubits increases by \((2n+1)n\).

Table 5 Quantum resources of extended algorithm in each L
Table 6 Comparison of the number of TOF gates, qubits, and CNOT gates and the depth in Shor’s algorithm between ours and prior works

6 Windowing

We briefly explain the quantum read-only memory (QROM) in Sect. 6.1. Then we describe point addition using windowing by Häner et al. [41] and show the optimal window size and the number of TOF gates in each case in Sect. 6.2.

6.1 Quantum read-only memory

Quantum read-only memory (QROM) allows classical memory to be accessed by giving an index, which can be represented by superposition. Let A denote the number of data stored in QROM. We explain data as \(\mathinner {|{d_i}\rangle }\) for \(i=0,1,\ldots ,A-1\). Then, the QROM operation is given by

$$\begin{aligned} \textrm{QROM}\left( \sum _{i=0}^{A-1}\alpha _i\mathinner {|{i}\rangle }\mathinner {|{S_i}\rangle }\right) = \sum _{i=0}^{A-1}\alpha _i\mathinner {|{i}\rangle }\mathinner {|{S_i + d_i}\rangle }, \end{aligned}$$
(4)

where \(\mathinner {|{i}\rangle }\) is the index, \(\alpha _i \in {\mathbb {C}}\) is the amplitude of \(\mathinner {|{i}\rangle }\), and \(\mathinner {|{S_i}\rangle }\) is the arbitrary quantum state. For constructing QROM, we require some quantum resources, including TOF gates. Babbush et al. [44] gave a T-depth-less QROM construction, and they made use of \(2(A-1)\) TOF gates. We note that several ancillary qubits are also required for QROM; however, we do not count them because we only focus on the number of TOF gates in this section. Generally, QROM is used for skipping some quantum computations and saving the quantum gates. Therefore, we should carefully analyze the balance between the required TOF gates for QROM and the reduced TOF gates.

6.2 Point addition using windowing

Table 7 Optimal window size w and the number of TOF gates for Shor’s algorithm

Quantum computation using QROM has been discussed. For example, Gidney [45] explained several quantum basic arithmetics with QROM. Those ways of using QROM for looking up some data are called windowing. Häner et al. [41] indicated that point addition on elliptic curves using windowing is also possible, and Banegas et al. [30] and Putranto et al. [31] made use of that method. We describe the outline below. Let w be an nonnegative integer, and \(A=2^w\). Then, QROM stores [i]U for \(i=0,1,\ldots ,2^w-1\), where U is a point on a binary elliptic curve. Point addition algorithm which uses LOOKUP to access the above QROM is explained by Banegas et al. [30]. We can decrease the times of point addition from \(2(n+1)\) to \(2\lceil \frac{n+1}{w} \rceil +1\)Footnote 2, therefore the number of TOF gates decreases with increasing w. However, the number of TOF gates to construct a QROM is \(2(2^w-1)\).

Now we find an optimal w, which minimizes the number of TOF gates, about each n for each algorithm. Then, we calculate the total number of TOF gates and compare our algorithms to prior works. We show the result in Table 7. Our two algorithms and prior FLT-based algorithms bring the same results for \(n=163,233,283\). For \(n=571\), we can see the advantage of our algorithms over PWLK22-FLT and BBHL21-FLT. However, the optimal w of BBHL21-GCD are larger than others. That is because BBHL21-GCD uses much more TOF gates than FLT-based algorithms, then windowing performs better.

7 Conclusion

In this paper, we reconsidered quantum FLT-based inversion algorithms from the viewpoint of addition chains. In purpose of analyzing the quantum resources for quantum computation, we described the number of TOF gates, qubits, and CNOT gates and the depth change depending on the addition chain. Also, we showed the existence of a quantum FLT-based inversion algorithm whose input contains an arbitrary addition chain. Then, we constructed two algorithms, basic algorithm corresponding to Putranto et al.’s algorithm and extended algorithm corresponding to Banegas et al.’s algorithm. Moreover, we reduce the number of TOF gates and the number of qubits preferentially in this order and optimized addition chains. As a result, basic algorithm and extended algorithm purely improve Putranto et al.’s algorithm and Banegas et al.’s algorithm, respectively. That stems from the existence of better addition chains, whose length is shorter, or d is smaller than Itoh and Tsujii’s addition chains. We can say that our results gave a more precise estimation of quantum resources used to solve binary ECDLP with NIST recommending n.

We get some optimized addition chains that perform the same as addition chains in Table 3, therefore we can choose an addition chain that depth is also reduced the most. We have already chosen addition chains that achieve less depth; however, it is extremely hard to optimize the depth since that requests a complete analysis of parallel quantum computation. We leave it to future work. Also, there may be a better way to clear all qubits used in inversion algorithms.