1 Introduction

Operations in quantum computing are typically represented using quantum (reversible) circuits, and therefore, various studies have been conducted on reversible circuit synthesis and circuit optimization [1,2,3,4,5,6]. The design efficiency of a circuit determines the design cost and latency for the desired reversible operation. When making circuits, among various gate libraries, the Clifford+T gate library is commonly used when considering FTQC (fault-tolerant quantum computation) in particular [7]. It is known that a T (or \(\hbox {T}^\dag \)) gate, one of the non-Clifford gates, in this standard universal fault-tolerant gate library is more costly to design and take longer to execute than Clifford gates. [8]. In other words, T gates are dominant factors in the FTQC setting in terms of the running time and the implementation cost. T-depth (depth formed by T gates that operate non-parallelly) may determine the execution time of the circuit, and T-count (the number of T gates) may decide the design cost and the required fidelity of the gates. That is, T-depth and T-count, which are cost metrics associated with T gates, are important considerations. As an example, the quantum resources required for magic state distillation in FTQC setting are determined by these two cost metrics. T and \(\hbox {T}^\dag \) gates are primarily employed within Toffoli gate, a widely recognized composite gate. As a result, cost metrics about Toffoli gates can be utilized instead of those associated with T gates such as Toffoli-depth and Toffoli-count [9].

Meanwhile, several studies focused on specific gates such as MPMCT (mixed polarity multiple controlled Toffoli) gates and how to decompose these gates to Toffoli gates [1, 2, 6, 10]. In this work, we present an advanced method to reduce Toffoli-depth of a given MPMCT gate and thereby dependently reduce T-depth. Of course, the resulting circuit’s T-depth can be reduced further by previous works [3,4,5]. One of the existing MPMCT gate decomposition methods was announced [10], but its cost metrics are phase-depth & phase-count, which are the generalizations of T-depth & T-count in terms of phase. In their work, the gates may deal with phases smaller than T gate’s phase, so it may be expected that more complicated work is required when designing a real quantum computer. Other MPMCT gate techniques use T-depth & T-count as cost metrics [1, 2, 6]. Another previous method divided the number of CWQs (clean work qubits) into three categories based on the number of MPMCT gate control parts [2]. However, when this method is applied, the resulting circuit’s T-depth (or T-count) sometimes increases as the number of CWQs or DBQs (dirty borrowed qubits) used for implementing the given MPMCT gate increases. Descriptions for the terminologies of work qubits (or ancilla qubits) are shown in the next section. They just showed results using a graph when the number of controls is too small. In other words, using this method may result in a more inefficient circuit despite adding more quantum resources. This problem arises because the central part of the resulting circuit is not sufficiently supported by CWQs. To address this issue, we present an advanced method that employs CWQs properly at all circuit regions. This method ensures that the number of work qubits and T-depth do not increase simultaneously.

The proposed technique is demonstrated using Grover’s algorithm with the SHA-256 and SHA3-256 hash algorithms as examples in Sect. 4 [3, 11]. In 1996, Grover developed an algorithm that uses quantum computing to retrieve a pre-image or key for a given cryptosystem much more quickly than any classical algorithm [12]. Grover’s algorithm consists of Oracle operators and Diffusion operators, both of which use MPMCT gates. However, the implementation way of MPMCT gate in each operator should be different because the state of the work qubits used to help implement each gate may differ within each operator. Depending on the cryptosystem quantum circuit used, the states of the work qubits that can help implement the MPMCT gate are different. We will show the quantum resources for MPMCT gates that can be used in Grover’s algorithm according to the number of CWQs and DBQs. We will also show the quantum resources required to implement the attack algorithm circuit for each cryptosystem. In the case of SHA-256, it can be seen that the quantum resources are less required than those presented in a previous study [9]. Of course, the proposed MPMCT gate technique is not limited to Grover’s algorithm and can be used in other quantum algorithms, such as quantum random-walk-based algorithms, indicating its high usability [13,14,15].

The rest of this paper is organized as follows. Section 2 begins by introducing terminologies for kinds of qubits according to the purpose of use and the states in the quantum circuit. We briefly describe previous lemmas and techniques for MPMCT gates decomposition and discuss the limitations of these. In addition, we mention Grover’s algorithm, which is used as a specific example in our proposed method, and which MPMCT gates are needed in this algorithm. In Sect. 3, we present our advanced technique, which is divided into four categories, and explain implementation ways for each category. We note that, like previous studies, we utilize well-known lemmas to decompose MPMCT gates into Toffoli gates as mentioned in Sect. 2. In Sect. 4, we demonstrate the superiority of the proposed technique by implementing Grover’s algorithm. Specifically, we try to implement Grover’s algorithm for SHA-256 or SHA3-256 cryptosystems and compare the quantum resources required for security strength measurement with those in previous studies. In the last section, we summarize and point out the limitations of this study and suggest future research directions.

2 Background and previous work

In this section, the types of qubits, an MPMCT gate, and the techniques presented in previous studies are mentioned. Grover’s algorithm, which will be used as an application example of our proposed technique, is also briefly mentioned.

2.1 Qubit and MPMCT gate

A qubit is a fundamental unit of quantum information and can be represented as a two-dimensional vector that belongs to the (projective) Hilbert space \(\mathcal {H}\). In contrast to classical bits, a qubit can exist in states with various phases and values due to its ability to be expressed as a linear combination of \(|0\rangle \) and \(|1\rangle \). Quantum circuits represent reversible operations, which are unitary in complex space. As a result, the size of the state of qubits remains constant throughout the circuit. Depending on their states and usage purposes in the circuit, qubits are referred to by different names. We will utilize the following various types of qubits [9]:

  1. 1.

    Data qubits are the qubits that hold the input data in a quantum algorithm. For example, in the case of SHA-256, the data qubits represent the pre-image search space. If the length of the plaintext (or pre-image) is m-bit, the quantum circuit requires m data qubits. Therefore, in Grover’s algorithm, the data qubits represent the search space for the solution. All qubits that are not data qubits are work qubits or ancilla qubits, which assist in performing specific operations in the algorithm.

  2. 2.

    Clean work qubits (CWQs) are work qubits whose initial states are known and can assist in performing certain operations in quantum circuits. CWQs can be initialized through an uncomputation (restoration or clearing) step after a specific operation is finished, which restores them to their original state. The restored CWQs can then be used in subsequent operations with the same initial state. Typically, CWQs are initialized to the \(|0\rangle \) states.

  3. 3.

    Dirty borrowed qubits (DBQs) are work qubits whose initial states are unknown or entangled, so they are considered to be in arbitrary states before being provided in a specific operation. Unlike CWQs, these work qubits are less effective due to these arbitrary initial states. When a sub-circuit representing the specific operation is made, the restoration step for DBQs’ states may be included in the case of utilizing their original states in the following sub-circuit, which makes the circuit design more complicated than the case for CWQs.

Quantum circuits are commonly visualized as two-dimensional diagrams, with data qubits placed on the top and work qubits on the bottom. In the data qubits range, the least significant bit (LSB) data qubit is typically located at the top, while the most significant bit (MSB) is placed at the bottom. That is, data qubits are sequentially arranged from top to bottom according to the size of the binary digit. The flow of time in a quantum circuit is from left to right.

Quantum gates used in quantum circuits correspond to specific unitary operations. In this context, the Clifford+T gates are typically consdiered as elementary gates, as mentioned in equation (1). One commonly used composite gate is Toffoli (or \(\hbox {C}^2\)NOT, a doubly controlled-NOT) gate, which has T-depth 3 and T-count 7 [16] (Fig. 1). (This T-depth value is optimal with no work qubits and no limitation for CNOT-count.) Toffoli gate is the quantum analogue of the AND gate in classical circuits (\(Toffoli: |{x_1x_2x_3}\rangle \rightarrow |{x_1x_2 ((x_1\wedge x_2)\oplus x_3)\rangle }\)). Toffoli-count and Toffoli-depth are cost metrics associated with Toffoli gate, which frequently employs T gates. Therefore, these metrics can be used instead of T-count and T-depth, which are considered significant when considering FTQC. Toffoli-count is the number of Toffoli gates in a quantum circuit, and Toffoli-depth is the number of non-parallel processing of Toffoli gates in a quantum circuit. Our method tries to express a given MPMCT gate as a circuit with optimized Toffoli-depth. After applying the our presented method, the resulting circuit’s T-depth can be reduced using previous T-depth reduction techniques [3,4,5, 9]. That is, as in these previous studies [2, 6], FTQC computing which deals with T-depth (or T-count) is considered more rather than NISQ (noisy intermediate-scale quantum) computing which deals with CNOT-count (or CNOT-depth) [17,18,19]. In NISQ computing, various qubit placements (or the qubit connectivities) such as linear and square-grid are considered, while in this study, the circuit is designed when qubit connectivity is assumed to be all-to-all, that is, at the logical level. Therefore, swapping operations for adjacent qubit placement are not considered [19]. In one study [18], various logical equivalent circuits were created in terms of CNOT-count, while this work makes circuits with various T-depths. Then, the one with the smallest T-depth value is selected.

$$\begin{aligned} \begin{aligned}&T: |{x_1}\rangle \rightarrow e^{\frac{\pi i}{4} x_1}|{x_1}\rangle{} & {} H: |{x_1}\rangle \rightarrow \frac{|{0}\rangle +(-1)^{x_1}|{1}\rangle }{\sqrt{2}}\\&P: |{x_1}\rangle \rightarrow e^{\frac{\pi i}{2} x_1}|{x_1}\rangle{} & {} X: |{x_1}\rangle \rightarrow |{x_1\oplus 1}\rangle \\&Z: |{x_1}\rangle \rightarrow (-1)^{x_1}|{x_1}\rangle{} & {} CNOT: |{x_1x_2}\rangle \rightarrow |{x_1(x_1\oplus x_2)\rangle }&\end{aligned} \end{aligned}$$
(1)
Fig. 1
figure 1

A Toffoli gate [16]. Toffoli-depth, which is a metric for Toffoli gates, are used when presenting our MPMCT gate decomposition method

An MPMCT gate T(\(\hbox {C}_1\),\(\hbox {C}_2\),t), which is dealt with in this paper, can be defined as follows [2].

Definition 1

Given a set X = {\(\hbox {x}_1\), \(\cdots \), \(\hbox {x}_n\)}of lines in a quantum circuit, an MPMCT gate T(\(\hbox {C}_1\),\(\hbox {C}_2\),t) (or T(C,t)) consists of three line sets \(\hbox {C}_1\),\(\hbox {C}_2\),and {t}. (The set of all control lines is represented by C (= \(\hbox {C}_1\bigcup \hbox {C}_2\)).)

  • the set of on-control lines \(\hbox {C}_1\)={\(\hbox {x}_{11}\), \(\cdots \), \(\hbox {x}_{1i}\)} \(\in \) X,

  • the set of off-control lines \(\hbox {C}_2\)={\(\hbox {x}_{21}\), \(\cdots \), \(\hbox {x}_{2j}\)} \(\in \) X,

  • and a target line t \(\in \) X/(\(\hbox {C}_1\bigcup \hbox {C}_2\))

The target line undergoes a bit-flip (or inversion) if all the states of the on-control lines are true (\(|1\rangle \)) and all the states of the off-control lines are false (\(|0\rangle \)). This means that when passing through the MPMCT gate, the states of the qubits on the control lines remain unchanged, while the state of the qubit corresponding to the target line \(|t\rangle \) is mapped to \(|t\oplus (x_{11}\wedge \cdots \wedge x_{1i}\wedge \overline{x_{21}}\wedge \cdots \wedge \overline{x_{2j}})\rangle \) if conditions hold.

The MPMCT gate can be constructed by adding NOT gates or CNOT gates to a non-MPMCT gate, as shown in Fig. 2. This figure shows the simplest version, MPT (mixed polarity Toffoli) gate. The MPMCT gate is also called the MCT gate or the MP-\(\hbox {C}^c\)NOT (mixed polarity c-controlled-NOT) gate.

Fig. 2
figure 2

An MPT (A mixed polarity Toffoli) gate. For this gate to work, the state of the qubit in the first control line should be false (\(|0\rangle \)) and the state of the qubit in the second control line should be true (\(|1\rangle \))

2.2 Previous works for decomposing MPMCT gates

Previous studies have proposed methods to decompose MCT gates into multiple Toffoli gates using different numbers of CWQs and DBQs. First, the lemmas presented in different studies are briefly mentioned. Then, the algorithm in one previous study is mentioned in detail [2]. And then another study is briefly mentioned [6].

2.2.1 Lemmas for decomposing MPMCT gates

These Lemmas can be utilized to decompose the MPMCT gate into several Toffoli gates, depending on the type and number of given work qubits.

Lemma 1

For c\(\geqq \)3, a \(\hbox {C}^c\)NOT gate can be represented as a gate with Toffoli-depth (& Toffoli-count) of 4(c-2) and T-depth of 4(c-1) given c-2 DBQs [1,2,3, 20].

Lemma 1 is a result of combining MPMCT decomposition methods and T-depth (T-count) reduction techniques [3,4,5]. By utilizing this lemma, a circuit with a T-count of 12c-20 can be implemented, where Width (the total number of qubits) should be at least 2c-1. To provide an example for this Lemma 1, Fig.  3 illustrates the MP-\(\hbox {C}^4\)NOT gate. It can be observed that the MP-\(\hbox {C}^4\)NOT gate can be represented by a circuit with Toffoli-depth (& Toffoli-count) of 8 and T-depth of 12.

Fig. 3
figure 3

An MP-\(\hbox {C}^4\)NOT (a mixed polarity 4-controlled-NOT) gate. By using two DBQs, the gate can be represented with a circuit with Toffoli-depth (Toffoli-count) of 8 and T-depth of 12. [3,4,5]

Lemma 2

For c\(\geqq \)4 and a single DBQ, a \(\hbox {C}^c\)NOT gate can be implemented with T-depth of 8c-20 [1, 2, 21]. This implementation requires not only Toffoli gates but two controlled-V & controlled-\(\hbox {V}^\dag \) gate pairs, where V gate is defined by a gate satisfying \(V^2=X\). For c=4, Toffoli-depth (Toffoli-count) is 4, and for c=5, it is 12. When c\(\geqq \)6, the MCT gate can be implemented with Toffoli-depth of 8c-32.

Lemma 2 is referred to as MI (or Miller) mapping, which utilizes controlled-V & controlled-\(\hbox {V}^\dag \) gate pairs. Since the V gate is \(\sqrt{X}\), \(V: |{x}\rangle \rightarrow \frac{(1+(-1)^{x}i)|{0}\rangle +(1-(-1)^{x}i)|{1}\rangle }{2}\), where \(x\in \{0,1\}\) [16]. An example of MI mapping with six controls is depicted in Fig. 4, where the MP-\(\hbox {C}^6\)NOT gate is composed of four \(\hbox {C}^3\)NOT gates and two controlled-V/\(\hbox {V}^\dag \) gate pairs. This configuration results in a \(\hbox {C}^6\)NOT gate with T-depth of 28.

Fig. 4
figure 4

An MP-\(\hbox {C}^6\)NOT gate implementation by MI (Miller) mapping [1, 21]. Four 3-controlled sub-MCT gates can be decomposed into 16 Toffoli gates, resulting in a total T-depth of 28

Lemma 3

For c\(\geqq \)4 and a single CWQ, a \(\hbox {C}^c\)NOT gate can be expressed as a circuit with T-depth of 6(c-2) when c is even and 6(c-2)-2 when c is odd [1, 2]. When c=4, Toffoli-depth (Toffoli-count) is 6, and when c=5, it is 10. When c\(\geqq \)6, Toffoli-depth is 6c-20 when c is even and 6c-22 when c is odd.

This Lemma 3 describes the NC mapping in previous studies [1, 22]. One previous study stated that T-depth is 6(c-2) when c is even and T-depth is 6(c-2)+2 when c is odd [1]. However, according to the logic in their study, T-depth for odd c could be 6(c-2)-2 instead of 6(c-2)+2.

As an example, we have provided circuit diagrams for the \(\hbox {C}^4\)NOT and \(\hbox {C}^9\)NOT gates in Figs. 5 and 6, respectively. These circuits’ T-depth values result in T-depths of 12 and 40, respectively.

Fig. 5
figure 5

A \(\hbox {C}^4\)NOT gate with 1 CWQ. A \(\hbox {C}^4\)NOT gate can be expressed using a circuit that has Toffoli-depth of 6 and T-depth of 11

Fig. 6
figure 6

A \(\hbox {C}^9\)NOT gate decomposition with 1 CWQ. Appropriate pairing for the \(\hbox {C}^2\)(iZ) & \(\hbox {C}^2\)(-iZ) gates painted in black with each other can reduce T-depth (or T-count) by 4, according to previous studies [1, 3]. Based on the figure, the total T-depth is 40. However, if the \(\hbox {C}^6\)NOT gate in the central part is decomposed properly, T-depth may be shared between the \(\hbox {C}^4\)NOT gates and the \(\hbox {C}^6\)NOT gate, resulting in further T-depth reduction. Therefore, it may be possible that the total T-depth is less than 40

In Fig. 5, it can be observed that T-depth of the \(\hbox {C}^4\)NOT gate can be reduced up to 11 instead of 12. First, the \(\hbox {C}^4\)NOT gate is decomposed into 6 Toffoli gates, which are then converted into 6 \(\hbox {C}^2\)(iZ) or \(\hbox {C}^2\)(-iZ) gates using T-depth reduction methods. At this point, the leftmost two \(\hbox {C}^2\)(iZ) & \(\hbox {C}^2\)(-iZ) gates can share one T-depth, resulting in an additional reduction of T-depth by one. Thus, T-depth of the \(\hbox {C}^4\)NOT gate can become 11.

For the \(\hbox {C}^9\)NOT gate, it is decomposed into 2 \(\hbox {C}^4\)NOT gates and 1 \(\hbox {C}^6\)NOT gate, and then further decomposed into 32 Toffoli gates. T-depth can be reduced from 44 to 40 by pairing the \(\hbox {C}^2\)(iZ)/\(\hbox {C}^2\)(-iZ) gates that are painted in black in the circuit diagram. (That is, further T-depth reduction can be achieved by pairing the first and fourth gates, and the second and third gates, as described in [3].) This figure shows that by decomposing the \(\hbox {C}^6\)NOT gate appropriately, T-depth can be shared between the \(\hbox {C}^4\)NOT gates and the \(\hbox {C}^6\)NOT gate, potentially leading to a reduction in T-depth beyond 6(c-2) or 6(c-2)-2 for c\(\geqq \)4. In other words, Toffoli gates forming the sub-MCT gate in the center and Toffoli gates composing the front & back sub-MCT gates may be designed to share the time slice (stage).

Given these observations, some readers may wonder to what extent T-depth can be reduced related to this Lemma. In next section, Lemma 5 provides a lower bound on T-depth of the \(\hbox {C}^c\)NOT gate when there is one CWQ but this lemma also considers the number of DBQs in the circuit. If there are enough DBQs in addition to one CWQ, a more efficient design for the \(\hbox {C}^c\)NOT gate can be achieved (Lemma 5).

Now let us consider the scenario where there are sufficiently many CWQs. If there are CWQs enough, that is, if the number of CWQs is close to the number of controls of a given \(\hbox {C}^c\)NOT gate, then this gate can be implemented as a circuit with Toffoli-depth \(\mathcal {O}\)(log c) (Lemma 4).

Lemma 4

For c\(\geqq \)3, a \(\hbox {C}^c\)NOT gate can be expressed as a circuit with Toffoli-depth 2\(\lceil log_2c\rceil \)-1 and T-depth 2\(\lceil log_2c\rceil \)+2 when there are c-2 CWQs [2, 5]. If there are c-1 CWQs, the gate can be expressed with T-depth 2\(\lceil log_2c\rceil \)+2 when c is even, or T-depth 2\(\lceil log_2c\rceil \) when c is odd. If the number of CWQs is c, the \(\hbox {C}^c\)NOT gate can be implemented with T-depth 2\(\lceil log_2c\rceil \). If the number of CWQs is equal to or more than c+1, it can be implemented with T-depth 2\(\lceil log_2c\rceil \)-1. Therefore, when the number of CWQs is c+1 or greater, T-depth of the \(\hbox {C}^c\)NOT gate can be reduced to the same value as its Toffoli-depth.

Furthermore, when the number of CWQs for a \(\hbox {C}^c\)NOT gate is c-2, the gate can be implemented with Toffoli-count 2c-3 and T-count 8c-9. A proof of this lemma is provided in “Appendix A.” Fig. 7 provides an example implementation of the \(\hbox {C}^{12}\)NOT gate, which has Toffoli-depth 7, and Toffoli-count 21, and utilizes a total of 10 CWQs.

Fig. 7
figure 7

A \(\hbox {C}^{12}\)NOT gate with 10 CWQs. This gate can be implemented using Toffoli-depth 7 and T-depth 10 gates. This is achieved by reducing the number of controls to be handled by about half each time passing through a time slice formed by Toffoli gates

The lemmas discussed are summarized in Table 1, which reveals that when the number of available CWQs is sufficient, Toffoli-depth (T-depth) can be expressed in a logarithmic form, not linear form. This means that in order to design an efficient MPMCT gate, it is important to manage the number of available CWQs effectively, so that Lemma 4, which enables circuit designs with Toffoli-depth \(\mathcal {O}\)(log n), can be utilized. In our proposed method, one of these lemmas is selected and used in the central step. If there are no CWQs at all and there are not many DBQs, the DBQs-dedicated algorithm from the previous study is used [6]. As a side note, when decomposing an MPMCT gate for c\(\geqq \)3 using a quantum circuit based on the Clifford+T (or NCT, which is NOT, CNOT, and Toffoli) gate library, it is known that it may be necessary to utilize at least one work qubit [23]. However, work qubits may not be required when decomposing MCT gates if elementary gates that can handle smaller (global) phases are employed, as suggested in [10, 24].

Table 1 Lemmas for \(\hbox {C}^c\)NOT gate decomposition (c\(\geqq \)3)

2.2.2 Previous methods for decomposing MPMCT gates

The lemmas discussed previously only considered cases where the number of available CWQs was either 0, 1, or close to the number of control lines of a given MPMCT gate. There have been previous studies on the decomposition of MPMCT gates when given various numbers of CWQs and DBQs [2, 6]. One study [2] categorizes the number of CWQs into three ranges based on the number of control lines. The number of controls for a \(\hbox {C}^c\)NOT gate is denoted as c, the number of CWQs as k, and the number of DBQs as d. The ith CWQ is denoted as \(a_i\), where i ranges from 1 to k.

The first case is where k is more than or equal to 1 and less than or equal to approximately c/2. In this range, the \(\hbox {C}^c\)NOT gate can be decomposed as follows.

  1. 1.

    At the front step, the control lines of the \(\hbox {C}^c\)NOT gate T(C,t) are divided into \(k+1\) groups \(C_1,\cdots ,C_{k+1}\). The first k groups correspond to the control parts of sub-MCT gates. The target parts consist of the CWQs (\(a_i\)). These k sub-MCT gates T(\(C_i\), \(a_i\)) are placed in the front part of the quantum circuit to be executed at the same time. These gates do not share any control lines or target lines. As further explained below, this step may consist of several stages (time slices).

  2. 2.

    At the central step, we install a sub-MCT gate T(\(C_{k+1}\cup \{a_1,\cdots ,a_k\}\),t) that takes control lines previously used as target parts in the front step. This is done using an original target line t and including control lines in the last group \(C_{k+1}\).

  3. 3.

    At the back step, we place the same k MCT gates created in the front step to initialize the values of the CWQs as an uncomputation step. If the front step consists of multiple stages, the back step is organized in the reverse order of the design of the front step.

In this decomposition method, the gate efficiency (Toffoli-depth or T-depth) is determined depending on how the control lines are classified into \(C_1\), \(\cdots \), \(C_{k+1}\). The sum of their sizes is equal to the total number of control lines c (= \(|C_1|+\cdots +|C_k|+|C_{k+1}|\)). The paper provides the condition expression for this grouping (2). The reason for this conditional expression is to secure the number of DBQs so that Lemma 1 can be used in every step.

$$\begin{aligned} \frac{c-2k-d-1}{2} \le |C_{k+1}| \le \frac{c+2-k+d}{2} \end{aligned}$$
(2)

They suggest a way to determine the sizes of the control groups \(C_1,\cdots ,C_{k+1}\) for the above decomposition process.

  1. 1.

    First, \(|C_{k+1}|\) is set to the maximum possible value, considering equation (2). We distribute the remaining \(n - |C_{k+1}|\) control lines in \(C_1,\cdots ,C_k\) as evenly as possible, with a maximum difference of 1 between each group.

  2. 2.

    We move the control lines from \(C_{k+1}\) to \(C_1,\cdots ,C_k\) while the existing maximum value of \(|C_1|,\cdots ,|C_k|\) maintains. Equation (2) should still be satisfied.

  3. 3.

    If k is more than 2 and there are more than 3 control lines that can be moved from \(C_{k+1}\) to \(C_1,\cdots ,C_k\), then 2 control lines are moved, one to \(C_1\) and the other to \(C_2\), while still satisfying equation (2). After completing this step 3, the algorithm returns to step 2 and repeats it with the updated sets \(C_1, \cdots , C_{k+1}\). If the number of control lines that can be moved from \(C_{k+1}\) to \(C_1, \cdots , C_k\) is not more than 3, this algorithm terminates.

In step 3, when going back to step 2, the control lines from \(C_{k+1}\) may be further distributed to \(C_3,\cdots ,C_k\) since the maximum value among \(|C_1|, \cdots , |C_k|\) has increased by 1. A detailed explanation of this can be found in a previous paper [2].

The second case is when k is more than or equal to approximately c/2 and less than \(c-2\). In this case, \(\lfloor c/2\rfloor \) CWQs are used for the first time slice, which is composed of \(\lfloor c/2\rfloor \) Toffoli gates that operate simultaneously.

In the second time slice, which is within the front step, the remaining \(c-\lfloor c/2\rfloor \) control lines should be processed using the remaining \(k-\lfloor c/2\rfloor \) CWQs. The number of DBQs available at this stage is increased by \(2\lfloor c/2\rfloor \). From this stage, one of the above two strategies is selected and applied recursively. This means that if the number of remaining CWQs is more than half of the number of remaining controls, time slices are constructed only with Toffoli gates, as mentioned earlier. On the other hand, if the number of remaining CWQs is insufficient, the strategy mentioned in the first case is followed. For example, if \(k=c-3\), the front step will consist of several time slices consisting only of Toffoli gates, and consequently, the central step may consist of a \(\hbox {C}^3\)NOT gate.

The last case is when k is more than or equal to \(c-2\). According to Lemma 4, the \(\hbox {C}^c\)NOT gate can be decomposed into Toffoli-depth 2\(\lceil log_2c\rceil \)-1 circuit.

In fact, there is a logical error in this previous method [2]. In other words, even when the number of work qubits increases, Toffoli-depth of the resulting circuit also sometimes increases. A detailed description of this error is introduced in Sect. 4. In that section, a figure is shown as an example of such an error (Fig. 12).

Another technique for optimizing Toffoli-depth of MPMCT gates utilizes dynamic programming to derive optimal solutions for each stage in the circuit [6]. Some characteristics of the method presented in this study are briefly mentioned. In this study, the case where CWQs and DBQs coexist was not considered, and only cases with one type of work qubits were considered and presented, respectively. In the case of a sub-method considering the case where there are only DBQs, it can be seen as a method that solves the limitation of Lemma 1 because it can handle the case where the number of DBQs is not large enough. In the case of another sub-method where CWQs are considered only, it is designed to ensure that Lemma 4 can be applied by retaining a sufficient number of CWQs in every step. In other words, it is a kind of greedy algorithm designed to use circuits with optimized Toffoli-depth at each step. This is a difference from the former previous technique [2] mentioned above that this former study tried to retain a sufficient number of DBQs in all steps. A more detailed explanation is given in a previous paper [6].

Fig. 8
figure 8

Grover’s algorithm for a given cryptosystem f. A key or pre-image can be found using this quantum circuit with time complexity of \(\mathcal {O}(\sqrt{2^m})\). MPMCT gates should be installed in each operator and the type and number of work qubits vary depending on the type of cryptosystem circuit

The method presented in this study complements the shortcomings of these two previous studies. The proposed algorithm first solves the logical error of the former previous technique and returns a circuit with a lower Toffoli-depth. In this previous technique, as mentioned above, it is divided into three ranges according to the value of k, but in the proposed technique, it is divided into five ranges. Also, unlike another previous technique [6], it considers both kinds of work qubits in every step. It is explained in detail in the next section.

2.3 Grover’s algorithm

Figure 8 illustrates the quantum circuit for Grover’s algorithm [12]. To find a m-bit key or pre-image of a given cryptosystem, the algorithm requires \(\mathcal {O}(\sqrt{2^m})\) iterations of the Grover iteration, which is composed of the Oracle operator and the Diffusion operator.

The Oracle operator in Grover’s algorithm is composed of the quantum circuit \(U_f\) for the cryptosystem f, the inverse circuit \(U_f^\dag \), and a comparator. The search space consists of m data qubits, while n output qubits represent output values after passing through the quantum gate \(U_f\) that represents the cryptosystem f. (As a side note, if the cryptosystem is an in-place version circuit, these output qubits may not be needed.) k CWQs can help execute the cryptosystem circuits and are initialized through the uncomputation step via the \(U_f^\dag \) gate. When implementing the inverse circuit, the order of the gates in the original cryptosystem’s circuit is arranged in reverse order. Finally, one work qubit serves as the Oracle qubit, which is used in the target part of the comparator (a \(\hbox {C}^n\)NOT gate) in the Oracle operator and is crucial for the phase kick-back technique. To implement the comparator, m+k DBQs can be used, and additional help from CWQs not shown in the figure can also be used. In fact, although the oracle qubit is conventionally employed for the phase kick-back technique in Grover’s algorithm, it can be deleted. As shown in [25], a \(\hbox {C}^{n-1}\)NOT gate is sufficient for an n-bit ciphertext when implementing the comparator. A \(\hbox {C}^{n-1}\)Z gate can be constructed by adding two H gates, and the phase kick-back can be achieved through this resulting \(\hbox {C}^{n-1}\)Z gate, as described in Eq. 3. Thus, the desired operation can be accomplished using a \(\hbox {C}^{n-1}\)NOT gate instead of a \(\hbox {C}^n\)NOT gate.

$$\begin{aligned} \begin{aligned}&C^{n-1}Z:|x_1\ldots x_n\rangle \rightarrow (-1)^{x_1\ldots x_n}|x_1\ldots x_n\rangle \\ \end{aligned} \end{aligned}$$
(3)

In the Diffusion operator, a \(\hbox {C}^{m-1}\)Z gate can be used when the size of the key or pre-image M is m bits. The required MPMCT gate varies depending on the size of the pre-image M in the Diffusion operator. This operator is composed of the \(\hbox {C}^{m-1}\)Z gate and H and X gates surrounding this MCT gate. To implement the \(\hbox {C}^{m-1}\)Z gate, n+k CWQs can be provided.

We used Grover’s algorithm for our technique as a concrete example, while the cryptosystems used are SHA-256 or SHA3-256 cryptosystems [3, 11]. As explained now, a \(\hbox {C}^{n-1}\)NOT gate is required in the Oracle operator, and a \(\hbox {C}^{m-1}\)NOT gate is used in the Diffusion operator. These two MPMCT gates need to be implemented differently since the states of the work qubits are different. Quantum resources of MPMCT gates according to a given number and type of work qubits are summarized in Sect. 4. The implementations of the MPMCT gates are critical to the efficiency of the entire Grover’s algorithm because the efficiency of implementing the cryptosystem and MPMCT gates determine the quantum resources required to run the algorithm. A previous study found that the optimal number of Grover iterations is about 0.690 \(\cdots \sqrt{2^m}\) [9]. In this work, we will use this formula to calculate the quantum resources required to implement the algorithm.

3 Proposed method

3.1 New lemma

Before presenting the advanced method, we have an observation that may be utilized in our proposed method. This lemma shows the lower bound for T-depth in Lemma 3. The difference from the lemmas presented in the previous section is that it considers CWQ and DBQ simultaneously (Lemma 5).

Lemma 5

For c\(\geqq \)4, if there is 1 CWQ and c- 5 DBQs available, A \(\hbox {C}^c\)NOT gate can be expressed with T-depth 4c-5 and Toffoli-depth 4c-10. Obviously, the \(\hbox {C}^4\)NOT or \(\hbox {C}^5\)NOT gate does not require any DBQs in this lemma.

This lemma can be described by decomposing the \(\hbox {C}^c\)NOT gate into two Toffoli (\(\hbox {C}^2\)NOT) gates and a \(\hbox {C}^{c-1}\)NOT gate. By decomposing the \(\hbox {C}^c\)NOT gate in this way, c-3 DBQs needed to decompose the \(\hbox {C}^{c-1}\)NOT gate can be given for using Lemma 1. Of course, after decomposition with Toffoli gates, T-depth can be reduced to 4c-5 through existing T-depth reduction techniques. Figure 9 provides an example for this lemma by implementing a \(\hbox {C}^9\)NOT gate.

Fig. 9
figure 9

A \(\hbox {C}^9\)NOT gate with 1 CWQ and 4 DBQs. It can be expressed as a circuit with Toffoli-depth 26 and T-depth 31. T-depth of the gate obtained using this lemma may be one less than that of Lemma 1. Moreover, compared to the result of Lemma 3, T-depth is reduced by about 23%

It is noted that T-depth of 4c-5 presented in Lemma 4 is smaller than T-depth of 4c-4 in Lemma 1. While Lemma 1 uses three additional DBQs to implement the gate, no CWQs are required. This shows the significant advantage that CWQs can provide in gate design compared to DBQs.

3.2 Our method sketch

We now present our technique. Our work considers the number of CWQs and DBQs to be provided in all parts of the implemented circuit simultaneously, which is an improvement over the previous studies. One of the previous method [2] aimed to maximize the use of CWQs in the front & back steps and DBQs in the central step. In contrast, our proposed method utilizes both CWQs and DBQs in all steps of the circuit. The key idea is to adjust the number of CWQs available to match as much as possible the number of control lines handled in the central step and to choose the number of the sub-MCT gates used in the front & back steps by considering the number of their control lines. With this technique, we can use Lemma 4 instead of Lemma 1 in the central step.

Assume we want to design a \(\hbox {C}^c\)NOT gate with k\(> \lfloor c/m \rfloor \)(+1). Let us suppose that the maximum number of control lines among the sub-MCT gates used in the first stage (of the front step) is m (\(\geqq \) 3). We divide the control lines of the \(\hbox {C}^c\)NOT gate into \(\lfloor c/m \rfloor \)+1 groups of control lines, denoted as \(C_1,\cdots ,C_{\lfloor c/m \rfloor +1}\). Unlike the previous method [2], the number of groups is not determined by the number of CWQs, but by the number of control lines of the sub-MCT gate to be used. We set \(|C_1|\) = \(\ldots \)=\(|C_{\lfloor c/m\rfloor }|\)=m, and \(|C_{\lfloor c/m\rfloor +1}|\) = c-m\(\lfloor c/m\rfloor \). At this stage, we use \(\lfloor c/m \rfloor \)(+1) CWQs as target parts of sub-MCT gates. If \(|C_{\lfloor c/m\rfloor +1}|\) is 2 or more, one CWQ is required more (Table 2).

Table 2 A \(\hbox {C}^c\)NOT gate decomposition in the central step in our proposed method

In the central step, we try to decompose the \(\hbox {C}^{\lceil c/m \rceil }\)NOT gate to Toffoli gates using the remaining \(k-\lfloor c/m\rfloor (-1)\) CWQs. If \(k - \lfloor c/m \rfloor (-1) \geqq \lfloor c/m \rfloor (+1)-2\), we can use Lemma 4 in this step. In the front & back step, we can apply Lemma 1 if we have \(\lfloor c/m \rfloor (m-2) + max\{c-m\lfloor c/m\rfloor -2,0\}\) DBQs. Therefore, we can implement the given MPMCT gate as a circuit with Toffoli-depth approximately equal to \(4(m-2)\times 2+2\lceil \log _2 \lceil c/m\rceil \rceil -1\).

To determine the number of control lines m for the main sub-MCT gates used in the front & back steps, we can refer to Table 2. In this table, the case with the smallest number of remaining CWQs and the largest MCT gates handled in the central step is when \(|C_{\lfloor c/m\rfloor +1}|\ge 2\). To ensure optimized T-depth in the central step, we need to satisfy the inequality k-\(\lfloor c/m\rfloor \)-1 \(\gtrapprox \) \(\lfloor c/m\rfloor \)+1+1. Solving for m, we obtain an expected value for m as m\(\approx \)2c/(k-3). We can expect that the optimal Toffoli-depth (T-depth) is obtained in the central step when the value of m is approximately 2c/(k-3).

If the number of DBQs is not equal or more than \(\lfloor c/m \rfloor (m-2) + max\{c-m\lfloor c/m\rfloor -2,0\}\), then \(\lfloor c/m \rfloor \) \(\hbox {C}^m\)NOT gates cannot be installed in the front step when considering lemma 1. At this time, it is tried to install \(\lfloor d/(m-2) \rfloor \) \(\hbox {C}^m\)NOT gates instead. Lemma 1 can be applied to these \(\lfloor d/(m-2) \rfloor \) sub-MCT gates. The sub-MCT gate installation strategy according to the various number of work qubits is described in detail in “Appendix B.”

3.3 Our method’s process

In our method, the design for a given \(\hbox {C}^c\)NOT gate involves four steps: front, central, back, and additional reduction steps. Steps may consist of multiple stages for achieving lower Toffoli-depth. Only the central step is composed of one stage because it is a step to which one of the lemmas is applied. Different values of m may be selected for each stage. As explained earlier, one of the lemmas is selected and used in the central step. Lemma 1 is used for sub-MCT gates by default in the front & back steps, but further reductions may be possible. Our method goes through the following process. We repeat the process steps for values of m from c-1 to 2.

  1. 1.

    Front step. If \(d \ge \lfloor c/m \rfloor (m-2) + max\{c-m\lfloor c/m\rfloor -2,0\}\) then set l = min{\(\lfloor c/m\rfloor \) + H(c-m\(\lfloor c/m\rfloor \)-2), k}.Footnote 1 Otherwise, set l = min{\(\lfloor d/(m-2)\rfloor \)+1,k}. To divide the control lines into l groups, we set \(|C_1|\) = \(\ldots \) = \(|C_{l-1}|\) = m and \(|C_{l}|\) = c-m(l-1) if c-m\(\lfloor c/m\rfloor \) \(\ge \) 2. In the front step, a stage may be constructed using \(l-1\) \(\hbox {C}^m\)NOT gates and one single sub-MCT gate of which the number of controls is less than m by applying Lemma 1. If the work qubits required to implement these sub-MCT gates are insufficient, the process stops for this m and proceeds for the following m. l CWQs are used as target lines for these sub-MCT gates. Toffoli-depth for this stage may be 4(m-2). The number of available CWQs in the next stage (or step) is reduced by the number of sub-MCT gates in the previous stage right before. d increases by the number of control lines of sub-MCT gates. The number of control lines dealing with in the next stage (or step) decreases by the number of controls handled in the previous stage and increases by the number of installed sub-MCT gates. If the remaining number of CWQs is small or large enough to use one of the lemmas, go to the next step. Otherwise, another stage is formed recursively.

  2. 2.

    Central step. In the central step, one of the lemmas is used and applied. The control lines of the sub-MCT gate handled in this step are the target lines of the previous stage right before. The target line is that of the original MCT gate. The number of DBQs is increased by the number of controls handled in the front step, and the number of CWQs is decreased by the number of sub-MCT gates installed in the front step. If no CWQs are available and the number of DBQs is not good enough, the technique dedicated to DBQs [6] is considered. If enough CWQs are left to use Lemma 4, Toffoli-depth can be expressed in logarithm form in this step.

  3. 3.

    Back step. The back step is composed of the reverse order for the gates used in the front step. If the recursive way is used in the front step, the back step also consists of several stages.

  4. 4.

    Further reduction step. If there are enough CWQs or DBQs, then further reductions could be attempted in the front & back steps. If there are more than 2\(\lfloor c/3 \rfloor \) CWQs in the first place, it is better to apply Lemma 4 instead of Lemma 1. Also, if the DBQs are large enough that some do not need to be used in other stages, Toffoli-depth for these two steps might be lowered from (4m − 8)\(\times \)2 to (2m − 3)\(\times \)2 by canceling some Toffoli gates.

  5. 5.

    Comparison of T-depth values. Toffoli-depth and T-depth of the created circuit are recorded for each m. Since the recursive way is used, more than c-2 results may be recorded. If we encounter a situation where both Toffoli-depth and T-depth values increase as value m decreases, the process terminates before m equals 2. We compare T-depth values recorded to select the best T-depth decomposition way from the list. We select the way among approximately c-2 decompositions recorded in the list that makes the lowest T-depth.

The proposed technique examines the cases as many as possible until m becomes 2 or the algorithm is stopped to ensure completeness and to include the results of the previous studies in the list. Toffoli-depth (T-depth) value of the resulting circuit for each m is recorded in the list. It will be observed that these values gradually decrease and then rise again. Toffoli-depth/T-depth recording is performed each time a resulting circuit is made, and as m decreases, the values also may decrease. However, at some point, both values may rise. The decomposition design way for achieving the best T-depth before reaching that point is already included in the list so the process may be stopped before m becomes 2.

When designing a technique based on the core ideas mentioned, it is critical to make an acceptable difference between the remaining number of CWQs and the control lines of the sub-MCT gate being handled in the central step. Lemma 4 cannot be directly applied when the number of CWQs remaining is less than the control lines by three or more in the central step. In that case, the front step tends to perform recursively again, so the front step might consist of multiple stages.

When executing the process presented above, it is necessary to check whether it is possible to additionally reduce Toffoli-depth (T-depth) in front & back steps. Let us consider a scenario where l sub-m-controlled-NOT gates are installed in a stage (in the front or back step) and l CWQs are utilized in target parts. In such a case, if there are additional l(m-2) CWQs, then each sub-gate can be implemented as a circuit with Toffoli-depth 2\(\lceil \log _2m\rceil \)-1. Some readers may wonder when this Lemma 4 can be applied in the front & back steps. As mentioned later, when \(2\lfloor c/3\rfloor \lessapprox k\leqq c- 3\), Lemma 4 can be applied in the front & back steps. Using Lemma 4 in these steps is consistent with the approach taken in [6] that considers CWQs only.

Meanwhile, if there are enough DBQs available, Toffoli-depth can be further reduced by canceling gates between the front and back steps where Lemma 1 is applied. When there are approximately l(m-2) DBQs, corresponding stages in these steps may be implemented in the same form using Lemma 1. If these DBQs used in these stages do not have to be used in other stages, some Toffoli gates made in the front & back step can cancel each other, resulting in a reduction of Toffoli-depth from \(2x4(m-2) to 2x(2m-3)\). An example of reducing Toffoli-depth by 1 each in the front & back steps due to the help of 1 DBQ is shown in Fig. 10. Since two Toffoli gates are placed consecutively, they can be deleted.

Fig. 10
figure 10

A \(\hbox {C}^5\)NOT gate with 2 CWQs and 1 DBQ. In the front & back steps, Toffoli-depth can be additionally reduced by 2 and T-depth by 4

3.4 Extreme case) #CWQs k is too small

We tried to predict which value m will lead to the optimal Toffoli-depth (T-depth) when the number of given CWQs is too small. This is because equation 2c/(k-3) cannot be used when k is less than 4. For the case where k = 0 or 1, MPMCT gate decomposition can be attempted using the lemmas mentioned earlier. When k = 2, all the available CWQs are attempted to be used in the front step, so the expected value of m is approximately equal to or less than c/2. In the case of k = 3, note the possibility of having optimal Toffoli-depth in the central step, even if optimal T-depth cannot be achieved. So we use the alternate formula k - \(\lfloor c/m \rfloor \)-1 \(\lessapprox \) \(\lfloor c/m \rfloor \)+1-2 to obtain optimal Toffoli-depth. Therefore, we have the formula m = \(\lceil 2c/3 \rceil \). Of course, since these values do not consider the number of DBQs at all, so if the number of DBQs is too small, the optimal T-depth value may not be obtained from the expected values for m.

Consider Fig. 11 as an example where we attempt to decompose the \(\hbox {C}^5\)NOT gate using only 2 CWQs. Since \(\lceil c/2 \rceil \) = 3, we try to decompose it for m = 3. When m = 3, it can be implemented with Toffoli-depth 7 and T-depth 12 circuit. Note that an existing target qubit can also be utilized as a DBQ in front & back steps.

Fig. 11
figure 11

A \(\hbox {C}^5\)NOT gate with 2 CWQs. The circuit subfigure in the middle corresponds to the case where \(m=2\), and the circuit subfigure on the right where \(m=3\). An existing target qubit can be used as a DBQ in front and back steps

As another example, let us decompose a \(\hbox {C}^6\)NOT gate with the help of 3 CWQs and 2 DBQs. Since k=3, it can be expected that a \(\hbox {C}^4\)NOT gate will lead to an optimal circuit based on the equation \(\lceil 2c/3 \rceil \)=4. However, As a result, we can implement a circuit with Toffoli-depth 6, and T-depth 11 when m = 2. The specific examples discussed in this paper are summarized in Table 3.

Table 3 Various \(\hbox {C}^c\)NOT gate decompositions

3.5 Our strategy depending on #CWQs k

In the previous study [2], the technique was presented by dividing the range of the number of CWQs k into three categories, based on the number of controls of the desired MPMCT gate. However, in this work, the range for k is tried to divide into five categories. When k is near the boundary value of each interval, both cases of nearby categories are recommended to consider. Again, it is important to consider not only the number k of CWQs, but also whether there are enough DBQs available to apply Lemma 1 in the front & back steps.

  • \(k = 0\). If no CWQs are available, Lemmas 1 and 2 are considered for circuit implementation. For cases where the number of DBQs is less than c-2 and greater than 1, the method using only DBQs from the previous study [6] is considered.

  • \(1\leqq k \lessapprox 4\). For the case when k=1, either Lemma 3 or 5 can be utilized. Because the number of CWQs is too small, the formula m \(\lessapprox \) 2c/(k-3) cannot be used, so the alternative formulas mentioned above are used to expect the lowest point for T-depth when carrying out the proposed process. If there are enough DBQs, additional reductions in Toffoli-depth may be possible in the front & back steps. If DBQs exist insufficiently, the previous method [6] that solely considers DBQs may have to be used in the central step.

  • \(4\lessapprox k \lessapprox 2\lfloor c/3\rfloor \). This is the range used to apply the process presented above as it is. In general, an optimal circuit can be expected to obtain when m is approximately equal to 2c/(k-3). However, a circuit with optimized T-depth is sometimes obtained when the process is applied recursively. In addition, in some cases, the best results could be achieved even when Lemma 1 is used in the central step. Also, since Lemma 1 is highly likely to be used in front & back steps, we check whether an additional reduction based on DBQs is possible.

  • \(2\lfloor c/3\rfloor \lessapprox k \leqq c-3\). An optimized decomposition can be obtained when the number of control lines m for main sub-MCT gates in the front & back steps is 3. Additionally, since the number of CWQs is sufficiently large, Lemma 4 can be applied in all steps like the method considering CWQs only in [6]. It is observed that the optimized Toffoli-depth (T-depth) remains constant even as the value of k increases in this range.

  • \(c-2\leqq k\). This is the interval where Lemma 4 can be applied, which corresponds to the last interval in the previous study [2].

Some readers may wonder whether the presented categories can be further subdivided. For instance, they may suggest that the range \(4 \lessapprox k \lessapprox 2\lfloor c/3\rfloor \) could be split into smaller intervals, such as \(3\lfloor c/4\rfloor \lessapprox k \lessapprox 2\lfloor c/3\rfloor \), \(4\lfloor c/5\rfloor \lessapprox k \lessapprox 3\lfloor c/4\rfloor \), and so on. However, such cases are rare to in practice, as shown in Lemma 6. In other words, the scenario that we use \(\hbox {C}^4\)NOT, \(\hbox {C}^5\)NOT gates, etc., at front & back steps as main sub-MCT gates and apply Lemma 4 is frequently unlikely to occur.

Lemma 6

For c\(\ge \)0 and c is not equal to 3, 6, 7, or 15, we have \(3\lfloor c/4\rfloor \ge 2\lfloor c/3\rfloor \).

This lemma shows a limitation on the possibility of further classification as mentioned earlier. The proof of this lemma is presented in “Appendix C.” Meanwhile, in some instances, applying the proposed method recursively or utilizing Lemma 1 in the central step may lead to a circuit with the lowest T-depth, as mentioned above. Conducting further research to identify such cases could help refine the existing classification.

4 Experimental evaluations

4.1 Comparison to the previous method

The first subfigure in Fig. 12 shows the contrast between the previous method [2] and our proposed method. The first subfigure displays T-depth of the circuit that can be implemented based on the number of CWQs provided when implementing the \(\hbox {C}^{255}\)NOT gate utilizing 512 DBQs. This graph clearly shows that the previous method exhibits a section in which T-depth increases as the number of CWQs increases, indicating a flaw in the technique [2]. The previous technique had a limitation in that it focused primarily on the number of CWQs and less on the number of DBQs. In particular, the problem arises because the previous method employs an excessive number of CWQs in the front & back part in the circuit, leaving the central part of the circuit with no choice but to use Lemma 1. DBQs are not fully utilized in the front & back part of the circuit and were only heavily used in the central part. Consider the number of CWQs used that can be approximated as c/2, \(c/2+c/2^2\), \(c/2+c/2^2+c/2^3\), and so on, for the number of control lines c in the MPMCT gate. When this previous technique is used, T-depth (Toffoli-depth) of the implemented circuit tends to increase and then decrease before and after the neighborhoods of the numbers of CWQs mentioned. That is, these numbers lead to local extreme values for T-depth. This phenomenon is illustrated well in the first subfigure.

In contrast, the proposed technique achieves a rapid reduction in T-depth as the number of CWQs provided increases. In the proposed technique, if enough CWQs available remain in the central part, then Lemma 4 can be used instead of Lemma 1. This leads to a logarithmic form for Toffoli-depth (T-depth) of the central part of the circuit, instead of linear. The reason for this difference is that the previous method forms the algorithm based on the number of CWQs available in each stage (time slice), while the proposed technique forms the algorithm based on the number of control lines of sub-MCT gates to be used.

The second subfigure shows that there is an overwhelming utility of CWQs over DBQs. T-depth of the created circuit is more correlated with the number of CWQs than with the number of DBQs. It can be experimentally seen that there are no logical errors in our method. Since an MPMCT gate construction is impossible in our setting when (k,d) = (0,0), T-depth value is arbitrarily set to 100,000 in this condition.

In Fig. 13, each subfigure’s x-axis represents the number of DBQs. In the first figure, the results of the previous method and our method are shown together. Even though we added our further reduction process to the previous method, there is a difference from the results. When the number of CWQs is small, the number of DBQs may affect T-depth value more. Second graph shows that even if the number of CWQs does not change and is small, further reduction in T-depth may be possible when the number of DBQs is sufficiently large in our further reduction step.

Fig. 12
figure 12

A \(\hbox {C}^{255}\)NOT gate construction with the different number of work qubits. The first subfigure compares the results between the previous method [2] and our method. As the number of CWQs increases, we observe a radical decrease in T-depth, indicating the efficiency of our approach. In contrast, the previous method was found to have some logical errors. The second subfigure experimentally shows that our method is much more affected by CWQs than DBQs. It can also be confirmed that no logical errors occur in our method

Fig. 13
figure 13

Other graph versions for a \(\hbox {C}^{255}\)NOT gate construction with the various number of work qubits. They show that when the number of CWQs is fixed and small, T-depth may be reduced through a further reduction process as the number of DBQs increases

4.2 Application to Grover’s algorithm

As mentioned in the previous Sect. 2, we utilized our proposed technique to implement the MPMCT gates necessary for Grover’s algorithm while employing the SHA-256 and SHA3-256 cryptosystems. An (n-1)-controlled-NOT gate is placed for an n-bit ciphertext in the Oracle operator, and a (m-1)-controlled-NOT gate is used in the Diffusion operator when the size of the key (or pre-image) M is m bits. Therefore the required MPMCT gate varies depending on the size of the pre-image M in the Diffusion operator. When constructing Grover’s algorithm based on the SHA-256 circuit, we set the message space to 2\(^{266}\) and 2\(^{447}\), respectively. 266-bit is the message length selected in a previous study [9], while 447-bit is the maximum size of the original message that can be processed in one message block in SHA-256. The reason they chose 266-bit was that they thought it was an appropriate value for the aforementioned optimal number of Grover iterations.

We also calculated quantum resources for the MPMCT gates required to implement Grover’s algorithm for SHA3-256. We cover the message sizes of 266 bits and 1086 bits, one of which is the maximum length of an original message that can be processed within a message block in SHA3-256, so we need to implement either the \(\hbox {C}^{265}\)NOT or \(\hbox {C}^{1085}\)NOT gate in the Diffusion operator. Since both SHA-256 and SHA3-256 have a bit length of 256 bits in the ciphertext, the Oracle operator requires the \(\hbox {C}^{255}\)NOT gate in both cases.

Table 4 shows the location of the MPMCT gate used in Grover’s algorithm and compares the quantum resources (Toffoli-depth & T-depth) required for MPMCT gates between the previous method [2] and our proposed method. The number of DBQs (#DBQs) and CWQs (#CWQs) available can vary depending on the version of the hash function circuit and the length of the message. As an example, let us consider the SHA-256-Z1 circuit with Width of 768 qubits. After passing through this cryptosystem circuit in Grover’s algorithm, 256 qubits form an ciphertext space, while the remaining 512 qubits exist as superposed values formed by the message schedule algorithm. That is, 512 DBQs can assist in building the \(\hbox {C}^{255}\)NOT gate, while CWQs do not exist at the point. Since there are more than 253 DBQs, Lemma 1 can be applied, and thus the \(\hbox {C}^{255}\)NOT gate can be implemented as a circuit with Toffoli-depth 1012 (and T-depth 1016). Since Lemma 1 is used in both methods, there is no difference in the results.

Table 4 Quantum resource comparisons for MPMCT gate between results of the previous method and those of the proposed method [2, 3]

At this time, we use SHA-256-Z3 and Z4 circuits for comparison. After passing these circuits in the Oracle operator, the number of CWQs that can be provided to the MPMCT gate design is 159 and 194, respectively. Note that 159 is less than \(c/2 + c/2^2 \approx \)191 while 194 exceeds this value. When the previous technique is applied with 159 CWQs, the \(\hbox {C}^{255}\)NOT gate is implemented as Toffoli-depth 138 circuit. On the other hand, when 194 CWQs are used, the circuit is implemented with Toffoli-depth of 164. As mentioned earlier, the result using 159 CWQs is better than using 194 CWQs in the previous technique [2]. Because too many CWQs were used in the front & back steps, insufficient CWQs are provided in the central part of the circuit. Only Lemma 1 can be used in the central part, resulting in a not-efficient circuit. On the other hand, when using our method, the results have no logical error. Also, better results than those of previous studies are presented.

A \(\hbox {C}^{265}\)NOT or \(\hbox {C}^{446}\)NOT gate is implemented in the diffusion operator. After passing through the Oracle operator, at least 256 CWQs can be provided because the qubits forming the ciphertext have been initialized. For a message length of 447 bits, 65 CWQs can be used more, and if the message M is 266 bits long, 246 CWQs can be provided more. When the message length |M| is 266, the number of available CWQs is sufficient to implement the \(\hbox {C}^{265}\)NOT gate with the same using Lemma 4, as the previous method. For \(|M|=447\), the cases for SHA-256-Z1 and Z2 show that using the proposed method can make better implementations for MPMCT gates in the Diffusion operator. In the case of SHA-256-Z1, the number of CWQs available is 321, which is about 125 less than the number of control lines. The table shows that the proposed method reduces T-depth by about 94%, compared to the previous scheme.

We also calculated the quantum resources required for a pre-image attack algorithm on four versions of the SHA3-256 cryptosystem. In most cases, the number of CWQs or DBQs is sufficiently large that the values are the same as presented in the previous method. The superiority of the proposed method can be confirmed when using the SHA3-256-v1 circuit, which uses a relatively small number of work qubits.

Additionally, we calculated the total quantum resources required to implement Grover’s algorithm for Secure Hash Algorithms (Table 5). For comparison with the previous study [9], we calculated Width and Toffoli-depth required when the message length is 266. The number of Grover iterations followed Proposition 3 in [9], which repeats 0.690...\(\cdot \sqrt{2^n}\) times when the size of the ciphertext is n bits. Since we do not add the Oracle qubit, Width required for Grover’s algorithm is the same as that for the cryptosystem implementation. Compared to Table 7 of the previous study [9], we observe that the quantum resources required for the quantum pre-image attack on SHA-256 are reduced significantly in terms of space-time complexity.

Table 5 Quantum resources for Secure Hash Algorithms and Grover’s algorithm where the length |M| of the message is 266

5 Conclusion

We present an advanced method than the previous methods for MPMCT gate decomposition [2, 6]. The key idea is to manage the number of available CWQs so that Lemma 4 can be applied in the central part of the circuit. The circuit with the lowest T-depth is selected by trying all possible cases for the main sub-MCT gates used in the front & back parts. Additional Toffoli-depth reduction between front & back parts of the circuit can also be considered if DBQs remain sufficient when Lemma 1 is applied. If CWQs exist enough, Lemma 4 may be applied throughout the circuit.

Compared to the previous method [2], T-depth value of the resulting circuit decreases more rapidly as the number of CWQs increases. Also, the logical fallacy in this previous method has been eliminated. As a concrete example, quantum resources for MPMCT gates used in Grover’s algorithm are presented. Since the number of work qubits (CWQs, and DBQs) that can be provided varies depending on the used cryptosystem circuit, MPMCT gate decomposition is also performed using different sub-MCT gates. Additionally, we calculated the total quantum resources (Width, Toffoli-depth) for Grover’s algorithm. Quantum resources required for the pre-image attack are reduced significantly compared to the previous study [9].

As mentioned above, there are some cases where an optimal circuit comes out when a different m value is used, which is not the expected value for the control lines of the main sub-MCT gates. Also, there are cases where the optimal circuit returns when Lemma 1 is used in the central part. That is why a list for every possible m is written in our method. Investigating when these exceptions occur could be a future area of research. Through this investigation, a more advanced MPMCT gate decomposition technique considering FTQC at the logical level may be developed.