Introduction

The advent of Noisy Intermediate Scale Quantum (NISQ) technologies [2] makes multiqubit processors with modest but increasing numbers of qubits available. Google, IBM, and Intel have recently announced quantum computers with 72, 65, and 49 qubits, respectively [3,4,5]; and new systems with 50–200 qubits are expected to be commercially available in the next few years. However, our ability to use the hardware to solve interesting problems is lagging. Solving practical computational problems typically requires evaluating quantum circuits with many hundreds or even thousands of qubits, exceeding the size of the devices. In addition, large gate errors and short qubit coherence times prevent accurate evaluations of deep circuits.

Despite the remarkable progress in manufacturing and controlling these small multiqubit systems, building hardware with a sufficiently large number of high-fidelity qubits remains an extremely challenging task. Engineering challenges worsen as the systems scale and are inherent to all major qubit technologies, including superconducting qubits (errors due to Josephson junction defects and spurious microwave resonances [6]), ion traps (susceptibility to noise and difficulty of addressing individual ions [7]), neutral atoms (motion of the atoms inside the lattice [8]), and quantum dots (difficulty of entangling multiple qubits [9, 10]).

Successfully solving practical computational problems can be achieved only by developing techniques that can simultaneously map large problems onto small qubit systems and mitigate the effects of noise. The Quantum Divide and Compute (QDC) approach is one such technique. In this approach, we divide a large and potentially deep quantum circuit to suit the number of qubits and coherence times available in current quantum hardware. We then perform the computations on the subcircuits obtained by this division on a quantum processor, and we finally recombine our output results to obtain the output of the original circuit. This allows us to compute the outputs of quantum circuits that are too deep or too wide to be run on existing small-scale quantum processors.

There has been some previous work related to this approach. Bravyi et al. [11] showed that a quantum circuit on \(n + k\) qubits can be simulated by sparse circuits on \(n\) qubits and exponential classical processing that takes time \(2^{O(k)}\,\mathrm {poly}(n)\). A more general approach that allows fragmenting larger quantum circuits into smaller subcircuits was introduced in [12]. In that work, tensor-network techniques were used to show how to decompose a circuit with a large quantum volume [13] into smaller subcircuits with quantum volumes compatible with NISQ devices. The classical computing overhead of the circuit fragmenting techniques was reduced in [14], and maximum likelihood tomography was applied on top of the circuit fragmentation to ensure that the reconstructed probability distributions are strictly non-negative and normalized. That work also showed, with the help of classical simulations, that the QDC strategy, when combined with maximum likelihood tomography, can estimate the output of a clustered circuit with higher fidelity than the full circuit execution. In [15], a method was introduced to find the optimal location of the cut (the location where the circuit should be fragmented). The QDC strategy was applied to commonly known circuits in quantum computing such as supremacy circuits, Grover and Bernstein–Vazirani circuits, and was shown to achieve a high quantum circuit evaluation fidelity.

The ultimate test for the quantum computing field—the ability to use controlled quantum systems to perform tasks surpassing what can be done using classical computers, also called quantum supremacy [16]—has received considerable attention from both the scientific community and the general public. The largest classical supercomputers are capable of reliably simulating quantum systems with approximately 50 qubits [17, 18], and there is evidence that devices with more than 50 qubits may be able to demonstrate quantum supremacy even in the presence of noise [19]. While quantum supremacy is not one of the goals of this work, the developed techniques will allow increasing the size of circuits that can be evaluated on quantum hardware as well as on quantum simulators run on classical hardware [20,21,22,23] by a constant factor. Consequently, it will be possible to evaluate quantum circuits with hundreds of qubits and use quantum algorithms to solve problems larger than ever before.

Circuit cutting naturally complements variational quantum-classical algorithms such as the Variational Quantum Eigensolver (VQE) [24, 25] and the Quantum Approximate Optimization Algorithm (QAOA) [26]. These approaches have successfully produced suitable quantum circuits for optimization problems by combining shallow quantum circuits with classical processing, and they allow some control over the width, depth, and connectivity of the circuits. However, the quality of the approximate solutions produced by VQE and QAOA decreases as the width and depth of their circuits decrease, and solving most interesting problems still requires hundreds of qubits [27, 28].

Circuit cutting offers numerous benefits. First, the technique does not compromise the quality of the solution as the size of the subcircuits decreases (overhead may scale exponentially with the number of cuts, however). Second, the technique can be applied to any sparsely connected quantum circuit, irrespective of the structure of the problem. Third, circuit cutting has a close relationship with tensor network quantum simulation techniques that are used to address scalability limitations due to memory requirements that grow exponentially with the size of the simulated systems. Fourth, circuit cutting can enhance the performance of existing quantum-classical variational approaches because it can increase the size of the subproblems tackled by the variational quantum eigensolver.

In this article, we follow up on our previous work on the topic [1]: we start by giving a detailed derivation of the formula for the output reconstruction of the original circuit from the outputs of its fragments, and a description of the noise models we chose to reproduce the experimental results (“Methods”). We quantify the performance of the QDC method by recalling our previous results [1] on a 20-qubit IBM processor for different qubit counts and fragment sizes (“Summary of Previous Results”). Then in “Results”, based on noisy simulations, we quantify the differential influence of various noise sources such as readout error, gate error and decoherence on the success probability of the algorithm for different qubit counts and fragment sizes. Finally, we discuss the classical complexity of the method, its relation to tensor-network simulation approaches, and its implications for homogeneous and heterogeneous quantum computing.

Methods

Circuit Cutting

An algorithm that allows circuit cutting was first described in [12]. In this section, we provide a self-contained derivation of a formula for computing the probability distribution of a circuit that has been fragmented into several smaller disconnected pieces. We first derive a formula that uses the probability distributions of two fragments to obtain the probability distribution of the original circuit. We then generalize the formula to cases with more than two fragments.

Fig. 1 Cutting sketch in the two-fragment case. (a) Original circuit, (b) upper fragment of the circuit, (c) lower fragment of the circuit, (d) lower fragment in its Bell-state variant

Two-Fragment Case: Definitions

Let us consider an m-qubit circuit described as the following composition of operations:

$$\begin{aligned} \mathcal {O}=\mathcal {O}_{A}^{\mathrm {a}}\circ \mathcal {O}_{B}^{\mathrm {a}}\circ \mathcal {O}_{A}^{\mathrm {b}}\circ \mathcal {O}_{B}^{\mathrm {b}}, \end{aligned}$$

where the support of superoperators \(\mathcal {O}_{A}^{\mathrm {b}}\) and \(\mathcal {O}_{B}^{\mathrm {b}}\) is a bipartition of the qubits; similarly, the support of \(\mathcal {O}_{A}^{\mathrm {a}}\) and \(\mathcal {O}_{B}^{\mathrm {a}}\) is a bipartition such that the two “a” (for “after”) sets differ from the “b” (for “before”) sets by one qubit. Without loss of generality, one can assume that up to a relabeling, the support of \(\mathcal {O}_{A}^{\mathrm {b}}\) is \(q_{0},\ldots q_{n}\) and that of \(\mathcal {O}_{B}^{\mathrm {b}}\) is \(q_{n+1},\ldots q_{m-1}\), and the “a” supports, \(q_{0},\dots q_{n-1}\) and \(q_{n},\ldots q_{m-1}\) (see Fig. 1a).

The final state of the circuit is given by the density matrix:

$$\begin{aligned} \rho&=\mathrm {\mathcal {O}}(\rho _{0})=\mathcal {O}_{A}^{\mathrm {a}}\circ \mathcal {O}_{B}^{\mathrm {a}}\circ \mathcal {O}_{A}^{\mathrm {b}}\circ \mathcal {O}_{B}^{\mathrm {b}}(\rho _{0}) \end{aligned}$$

where \(\rho _0\) is the initial density matrix. The probability of measuring a state i with binary representation \(i=(\hat{b}_{0}(i),\ldots \hat{b}_{m-1}(i))\) is given by

$$\begin{aligned} p(i)=\mathrm {Tr}\left[ \varPi _{i}\cdot \rho \right] , \end{aligned}$$
(1)

where \(\varPi _{i}\) is the projector on state i (\(i=0\dots 2^{m}-1\)). It can be expressed as \(\varPi _{i}=|i\rangle \langle i|=\otimes _{k=0}^{m-1}|\hat{b}_{k}(i)\rangle \langle \hat{b}_{k}(i)|\), where \(\hat{b}_{k}(i)\) is the value of the kth bit of i. We note that \(\varPi _{i}^{\dagger }=\varPi _{i}\), and \(\sum _{i}\varPi _{i}=\otimes _{k}\sum _{\hat{b}_{k}=0}^{1}|\hat{b}_{k}\rangle \langle \hat{b}_{k}|=I\). Thus:

$$\begin{aligned} p(i)=\mathrm {Tr}\left[ \varPi _{i}^{\dagger }\cdot \mathcal {O}_{A}^{\mathrm {a}}\circ \mathcal {O}_{B}^{\mathrm {a}}\circ \mathcal {O}_{A}^{\mathrm {b}}\circ \mathcal {O}_{B}^{\mathrm {b}}(\rho _{0})\right] . \end{aligned}$$
(2)

We now switch to a Pauli-basis representation (see “Appendix A” for a reminder). Using Eq. (16), we get

$$\begin{aligned} p(i)=2^{m}\langle \langle \varPi _{i}|\mathcal {R}_{A}^{\mathrm {a}}\mathcal {R}_{B}^{\mathrm {a}}\mathcal {R}_{A}^{\mathrm {b}}\mathcal {R}_{B}^{\mathrm {b}}|\rho _{0}\rangle \rangle \end{aligned}$$
(3)

where \(\mathcal {R}_{A/B}^{\mathrm {a/b}}\) is the Pauli transfer matrix (PTM) representation of superoperator \(\mathcal {O}_{A/B}^{\mathrm {a/b}}\).

Bipartite Splitting Formula

Basic formula We now derive the splitting formula. Let us decompose the one-qubit PTM representation of the identity superoperator as

$$\begin{aligned} \mathcal {R}_{I}=\sum _{\alpha =X,Y,Z}\sum _{bb'\in \{0,1\}}\tilde{\gamma }_{\alpha }^{bb'}|\sigma _{\alpha }^{b}\rangle \rangle \langle \langle \sigma _{\alpha }^{b'}|, \end{aligned}$$
(4)

where \(|\sigma _{\alpha }^{b}\rangle \rangle\) are the (real) coordinates in the Pauli basis of the density matrix corresponding to the bth eigenvector \(|\psi _{\alpha }^{b}\rangle\) of Pauli matrix \(\sigma _{\alpha }\). The \(\tilde{\gamma }\) tensor is given by \(\tilde{\gamma }_{X}^{bb'}=\tilde{\gamma }_{Y}^{bb'}=2\delta _{bb'}-1\) and \(\tilde{\gamma }_{Z}^{bb'}=2\delta _{bb'}\).

Inserting \(\mathcal {R}_{I}\) (acting on qubit \(q_{n}\)) in the expression for the probability, Eq. (3), we obtain

$$\begin{aligned} p(i)&=2^{m}\langle \langle \varPi _{i}|\underbrace{\mathcal {R}_{A}^{\mathrm {a}}}_{q_{0},\dots q_{n-1}}\underbrace{\mathcal {R}_{B}^{\mathrm {a}}}_{q_{n},\dots q_{m-1}}\underbrace{\mathcal {R}_{I}}_{q_{n}}\underbrace{\mathcal {R}_{A}^{\mathrm {b}}}_{q_{0},\dots q_{n}}\underbrace{\mathcal {R}_{B}^{\mathrm {b}}}_{q_{n+1}\dots q_{m-1}}|\rho _{0}\rangle \rangle \\&=2^{m}\sum _{\alpha =X,Y,Z}\sum _{bb'\in \{0,1\}}\tilde{\gamma }_{\alpha }^{bb'} \;\;\\&\quad \times \langle \langle \varPi _{i}|_{q_{0}\dots q_{n-1}}\langle \langle \varPi _{i}|_{q_{n}\dots q_{m-1}}\underbrace{\mathcal {R}_{A}^{\mathrm {a}}}_{q_{0}\dots q_{n-1}}\underbrace{\mathcal {R}_{B}^{\mathrm {a}}}_{q_{n}\dots q_{m-1}}|\sigma _{\alpha }^{b}\rangle \rangle _{q_{n}}\\&\quad \times \langle \langle \sigma _{\alpha }^{b'}|_{q_{n}}\underbrace{\mathcal {R}_{A}^{\mathrm {b}}}_{q_{0}\dots q_{n}}\underbrace{\mathcal {R}_{B}^{\mathrm {b}}}_{q_{n+1}\dots q_{m-1}}|\rho _{0}\rangle \rangle _{q_{0}\dots q_{n}}|\rho _{0}\rangle \rangle _{q_{n+1}\dots q_{m-1}}\\&=2^{m}\sum _{\alpha =X,Y,Z}\sum _{bb'\in \{0,1\}}\tilde{\gamma }_{\alpha }^{bb'}2^{-n-1}p_{A}^{\alpha }(i_{|0\dots n-1};b')\;2^{-m+n} \;\;\; \\&\quad \times p_{B}^{\alpha b}(i_{|n\dots m-1}). \end{aligned}$$

We thus obtain the final formula (with \(i=(\hat{b}_{0}\dots \hat{b}_{m-1})\)):

$$\begin{aligned} p(\hat{b}_{0}\dots \hat{b}_{m-1})&= \frac{1}{2}\sum _{\alpha =X,Y,Z}\sum _{bb'\in \{0,1\}}\tilde{\gamma }_{\alpha }^{bb'}p_{A}^{\alpha }(\hat{b}_{0}\dots \hat{b}_{n-1};b')\nonumber \\&\quad \times p_{B}^{\alpha b}(\hat{b}_{n}\dots \hat{b}_{m-1}) \end{aligned}$$
(5)

with

$$\begin{aligned}&p_{A}^{\alpha }(\hat{b}_{0}\dots \hat{b}_{n-1};b')\equiv 2^{n+1}\langle \langle \varPi _{\hat{b}_{0}\dots \hat{b}_{n-1}}|\langle \langle \sigma _{\alpha }^{b'}|_{q_{n}}\mathcal {R}_{A}|\rho _{0}\rangle \rangle _{q_{0}\dots q_{n}} \end{aligned}$$
(6)
$$\begin{aligned}&p_{B}^{\alpha b}(\hat{b}_{n}\dots \hat{b}_{m-1})\equiv 2^{m-n}\langle \langle \varPi _{\hat{b}_{n}\dots \hat{b}_{m-1}}|\mathcal {R}_{B}|\sigma _{\alpha }^{b}\rangle \rangle _{q_{n}}|\rho _{0}\rangle \rangle _{q_{n+1}\dots q_{m-1}}, \end{aligned}$$
(7)

where we have regrouped \(\mathcal {R}_{A}\equiv \mathcal {R}_{A}^{\mathrm {a}}\mathcal {R}_{A}^{\mathrm {b}}\) and \(\mathcal {R}_{B}\equiv \mathcal {R}_{B}^{\mathrm {a}}\mathcal {R}_{B}^{\mathrm {b}}\). In other words, \(p_{A}^{\alpha }(\hat{b}_{0}\dots \hat{b}_{n-1};b')\) is the probability of measuring bitstring \(\hat{b}_{0}\dots \hat{b}_{n-1},b'\) when measuring the final state of fragment A with a measurement on axis \(\alpha\) for qubit \(q_{n}\) (see Fig. 1b), and \(p_{B}^{\alpha b}(\hat{b}_{n}\dots \hat{b}_{m-1})\) is the probability of measuring bitstring \(\hat{b}_{n}\dots \hat{b}_{m-1}\) when measuring the final state of fragment B with qubit \(q_{n}\) initially prepared in the bth eigenstate of Pauli matrix \(\sigma _{\alpha }\) (see Fig. 1c).
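As an illustration of Eq. (5), the following minimal sketch checks the two-fragment formula numerically on a three-qubit example (\(m=3\), \(n=1\)) using plain NumPy state vectors. The function names (`p_A`, `p_B`, `reconstruct`, `exact`) and the choice of a GHZ circuit cut on the wire of \(q_{1}\) are assumptions made for this example only; they are not taken from the implementation used for the results of this article.

```python
import itertools
import numpy as np

S = 1 / np.sqrt(2)
# EIG[alpha][b]: b-th eigenvector of Pauli alpha (b=0: eigenvalue +1, b=1: eigenvalue -1)
EIG = {
    "X": [np.array([S, S], dtype=complex), np.array([S, -S], dtype=complex)],
    "Y": [np.array([S, 1j * S]), np.array([S, -1j * S])],
    "Z": [np.array([1, 0], dtype=complex), np.array([0, 1], dtype=complex)],
}
KET0 = np.array([1, 0], dtype=complex)
BASIS = np.eye(2, dtype=complex)


def gamma_tilde(alpha, b, bp):
    """Coefficients of the identity decomposition, Eq. (4)."""
    delta = 1 if b == bp else 0
    return 2 * delta if alpha == "Z" else 2 * delta - 1


def p_A(U_A, alpha, b0, bp):
    """Fragment A (q0, q1): q0 measured along Z, q1 measured along axis alpha."""
    psi = U_A @ np.kron(KET0, KET0)
    bra = np.kron(BASIS[b0], EIG[alpha][bp])
    return abs(np.vdot(bra, psi)) ** 2  # vdot conjugates its first argument


def p_B(U_B, alpha, b, b1, b2):
    """Fragment B (q1, q2): q1 prepared in the b-th eigenstate of sigma_alpha."""
    psi = U_B @ np.kron(EIG[alpha][b], KET0)
    return abs(psi[2 * b1 + b2]) ** 2


def reconstruct(U_A, U_B):
    """Recombine the fragment distributions according to Eq. (5)."""
    p = np.zeros(8)
    for b0, b1, b2 in itertools.product((0, 1), repeat=3):
        total = 0.0
        for alpha in "XYZ":
            for b, bp in itertools.product((0, 1), repeat=2):
                total += (gamma_tilde(alpha, b, bp)
                          * p_A(U_A, alpha, b0, bp)
                          * p_B(U_B, alpha, b, b1, b2))
        p[4 * b0 + 2 * b1 + b2] = 0.5 * total
    return p


def exact(U_A, U_B):
    """Distribution of the uncut circuit: U_A on (q0, q1), then U_B on (q1, q2)."""
    psi0 = np.kron(KET0, np.kron(KET0, KET0))
    psi = np.kron(np.eye(2), U_B) @ np.kron(U_A, np.eye(2)) @ psi0
    return np.abs(psi) ** 2


H = np.array([[S, S], [S, -S]], dtype=complex)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]],
                dtype=complex)  # control = first qubit of the pair
U_A = CNOT @ np.kron(H, np.eye(2))  # H on q0, then CNOT(q0, q1)
U_B = CNOT                          # CNOT(q1, q2)
p_cut = reconstruct(U_A, U_B)
p_ref = exact(U_A, U_B)
print(np.round(p_cut, 6))
```

For this GHZ example only the \(\alpha =Z\) terms contribute, but the same check can be repeated with arbitrary two-qubit unitaries in place of `U_A` and `U_B`, in which case all three axes are needed.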

Variant using Bell pair We now derive a different expression based on the following idea: instead of preparing both eigenstates of \(\sigma _{\alpha }\), one can use an ancilla qubit, prepare a Bell state, and measure the value of the ancilla along measurement axis \(\alpha\) and obtain an equivalent result, with a slightly different expression.

Switching from the Pauli-basis expression back to the original representation, Eq. (7) is equivalent to

$$\begin{aligned} p_{B}^{\alpha b}(i)&=\mathrm {Tr}\left[ \varPi _{i}\mathcal {O}_{B}(\sigma _{\alpha }^{b}\otimes \rho _{0})\right] \end{aligned}$$

where \(\sigma _{\alpha }^{b}=|\psi _{\alpha }^{b}\rangle \langle \psi _{\alpha }^{b}|\). Let us decompose

$$\begin{aligned} |\psi _{\alpha }^{b}\rangle&=\sum _{k\in \{0,1\}}\langle k|\psi _{\alpha }^{b}\rangle |k\rangle \end{aligned}$$

then

$$\begin{aligned} p_{B}^{\alpha b}(i)&=\sum _{kk'}\langle k|\psi _{\alpha }^{b}\rangle \langle \psi _{\alpha }^{b}|k'\rangle \mathrm {Tr}\left[ \varPi _{i}\cdot \mathcal {O}_{B}(|k\rangle \langle k'|\otimes \rho _{0})\right] \\&=\sum _{kk'}\langle \psi _{\alpha }^{b*}|k\rangle \langle k'|\psi _{\alpha }^{b*}\rangle \\&\quad \times \mathrm {Tr}\left[ \left( I\otimes \varPi _{i}\right) \cdot \left( \mathcal {I}\otimes \mathcal {O}_{B}\right) (I\otimes |k\rangle \langle k'|\otimes \rho _{0})\right] \\&=\mathrm {Tr}\Bigg [\left( |\psi _{\alpha }^{b*}\rangle \langle \psi _{\alpha }^{b*}|\otimes \varPi _{i}\right) \\&\quad \times \left( \mathcal {I}\otimes \mathcal {O}_{B}\right) \left( \sum _{kk'}|k\rangle \langle k'|\otimes |k\rangle \langle k'|\otimes \rho _{0}\right) \Bigg ]\\&=2\mathrm {Tr}\left[ \left( \varPi _{\alpha }^{b*}\otimes \varPi _{i}\right) \cdot \left( \mathcal {I}\otimes \mathcal {O}_{B}\right) \left( \rho _{\varPhi ^{+}}\otimes \rho _{0}\right) \right] \end{aligned}$$

where \(\varPi _{\alpha }^{b*}=|\psi _{\alpha }^{b*}\rangle \langle \psi _{\alpha }^{b*}|\) is the projector onto the complex conjugate of the bth eigenstate of the \(\sigma _{\alpha }\) Pauli matrix, and \(\rho _{\varPhi ^{+}}\) is the density matrix of the Bell state

$$\begin{aligned} |\varPhi ^{+}\rangle \equiv \frac{1}{\sqrt{2}}\sum _{k=0,1}|kk\rangle . \end{aligned}$$
(8)

In the second line, we have added an ancilla qubit. Now, let us note that for \(\alpha =X,Z\), \(|\psi _{\alpha }^{b}\rangle =|\psi _{\alpha }^{b*}\rangle\) (the eigenvector is real-valued), while \(|\psi _{Y}^{b*}\rangle =|\psi _{Y}^{1-b}\rangle\), and let us define

$$\begin{aligned} \hat{p}_{B}^{\alpha }(b;i)\equiv \mathrm {Tr}\left[ \varPi _{\alpha }^{b}\otimes \varPi _{i}\left( \mathcal {I}\otimes \mathcal {O}_{B}\right) (\rho _{\varPhi ^{+}}\otimes \rho _{0})\right] . \end{aligned}$$
(9)

Then

$$\begin{aligned} p_{B}^{\alpha b}(i)&={\left\{ \begin{array}{ll} 2\hat{p}_{B}^{\alpha }(b;i) &{} \alpha =X,Z\\ 2\hat{p}_{B}^{\alpha }(1-b;i) &{} \alpha =Y. \end{array}\right. } \end{aligned}$$

Thus, after relabeling \(b\rightarrow 1-b\) for \(\alpha =Y\) in Eq. (5), we obtain the final expression:

$$\begin{aligned} \boxed {p(\hat{b}_{0}\dots \hat{b}_{m-1})= \sum _{\alpha =X,Y,Z}\sum _{bb'\in \{0,1\}^{2}}\gamma _{\alpha }^{bb'}p_{A}^{\alpha }(\hat{b}_{0}\dots \hat{b}_{n-1};b')\hat{p}_{B}^{\alpha }(b;\hat{b}_{n}\dots \hat{b}_{m-1}).} \end{aligned}$$
(10)

where \(\gamma _{X}^{bb'}=2\delta _{bb'}-1\), \(\gamma _{Y}^{bb'}=-\gamma _{X}^{bb'}\) and \(\gamma _{Z}^{bb'}=2\delta _{bb'}\).
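The relation between the eigenstate-preparation probabilities of Eq. (7) and the Bell-pair probabilities of Eq. (9) can be checked numerically. The sketch below does so for a two-qubit fragment B; the helper names (`p_B_prepared`, `p_B_bell`, `bell_identity_holds`) and the particular unitary `U_B` are hypothetical choices made for illustration.

```python
import numpy as np

S = 1 / np.sqrt(2)
# EIG[alpha][b]: b-th eigenvector of Pauli alpha
EIG = {
    "X": [np.array([S, S], dtype=complex), np.array([S, -S], dtype=complex)],
    "Y": [np.array([S, 1j * S]), np.array([S, -1j * S])],
    "Z": [np.array([1, 0], dtype=complex), np.array([0, 1], dtype=complex)],
}
KET0 = np.array([1, 0], dtype=complex)
BELL = np.array([S, 0, 0, S], dtype=complex)  # |Phi+> on (ancilla, q_n), Eq. (8)


def p_B_prepared(U_B, alpha, b):
    """Eq. (7): q_n starts in the b-th eigenstate of sigma_alpha; Z-basis readout."""
    psi = U_B @ np.kron(EIG[alpha][b], KET0)
    return np.abs(psi) ** 2


def p_B_bell(U_B, alpha, b):
    """Eq. (9): ancilla measured along alpha (outcome b), fragment read out along Z."""
    psi = np.kron(np.eye(2), U_B) @ np.kron(BELL, KET0)  # order: ancilla, q_n, q_{n+1}
    psi = psi.reshape(2, 4)                # first index: ancilla
    amp = EIG[alpha][b].conj() @ psi       # project ancilla onto <psi_alpha^b|
    return np.abs(amp) ** 2


def bell_identity_holds(U_B):
    """Check p_B^{alpha b} = 2 * p_hat_B^alpha(b; .), with b -> 1-b for alpha = Y."""
    ok = True
    for alpha in "XYZ":
        for b in (0, 1):
            b_anc = 1 - b if alpha == "Y" else b
            ok &= np.allclose(p_B_prepared(U_B, alpha, b),
                              2 * p_B_bell(U_B, alpha, b_anc))
    return bool(ok)


H = np.array([[S, S], [S, -S]], dtype=complex)
SDG = np.diag([1, -1j])
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]],
                dtype=complex)
# A fragment whose output distinguishes the two Y eigenstates, so that the
# b -> 1-b relabeling for alpha = Y is actually exercised.
U_B = CNOT @ np.kron(H @ SDG, np.eye(2))
print(bell_identity_holds(U_B))
```

The factor 2 reflects the fact that each ancilla outcome occurs with probability 1/2, so that the Bell-pair variant obtains both values of \(b\) from a single circuit per axis \(\alpha\).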

The graphical representation for such a contraction is shown in Fig. 2a.

Fig. 2 Graphical representation of the contraction formula. (a) Two-fragment case. (b) Multi-fragment case for the GHZ circuit shown in Fig. 1a

Multi-fragment Case

Fig. 3 Graphical representation of the contraction formula for a generic case (here with three fragments). (a) Sketch of the fragmentation of a four-qubit circuit into three fragments. (b) Corresponding tensor network to contract to get the final distribution

The formula for the multi-fragment case can be inferred from that of the two-fragment case: the procedure sketched for the two-fragment case can be recast in more generic terms, as described in [12]. This is done by considering the directed acyclic graph \(G=(V,E)\) corresponding to the quantum circuit at hand (see Fig. 3 for an illustration of the procedure). Its vertices V are quantum operations such as qubit initialization, measurement and gates. The cutting procedure amounts to finding a subset \(E'\subset E\) of M (directed) edges in this graph whose removal leads to K disconnected directed acyclic graphs \(\{G^{(i)}=\left( V_{i},E_{i}\right) \}_{i=1\ldots K}\). In each disconnected graph, \(n_{i}+m_{i}\) vertices have a dangling edge corresponding to the original \(n_{i}\) incoming and \(m_{i}\) outgoing edges connecting it to the rest of the original graph, with \(\sum _{i}n_{i}=\sum _{i}m_{i}=M\). One then adds a measurement along axis \(\alpha '_{k}\) (\(\alpha '_{k}=X,Y,Z\), \(k=1\dots m_{i}\)) as a termination to each outgoing dangling edge, and a Bell-state gadget (as described in the previous section), whose ancilla line is terminated by an \(\alpha _{k}\)-measurement (\(k=1\dots n_{i}\)), to each incoming dangling edge. Translating the family of graphs \(G_{\alpha _{1\dots }\alpha _{n_{i}},\alpha '_{1\dots }\alpha '_{m_{i}}}^{(i)}\) back to quantum circuits \(\mathcal {C}_{\alpha _{1\dots }\alpha _{n_{i}},\alpha '_{1\dots }\alpha '_{m_{i}}}^{(i)}\), we can sample (using a quantum computer) the corresponding probability distributions. We denote as

$$\begin{aligned} p_{i}^{\alpha _{1}\dots \alpha _{n_{i}},\alpha '_{1}\dots \alpha '_{m_{i}}}\left( b_{1},\dots b_{n_{i}};s;b'_{1},\dots b'_{m_{i}}\right) \end{aligned}$$

the probability of measuring bitstring \(b_{1},\dots b_{n_{i}};s;b'_{1},\dots b'_{m_{i}}\), with \(s=(\hat{b}_1 \dots \hat{b}_{p_i})\) a bitstring corresponding to the state of the “final” qubits of circuit \(\mathcal {C}^{(i)}\), and \((b_{1},\dots b_{n_{i}})\) (resp. \((b'_{1},\dots b'_{m_{i}})\)) the bitstrings corresponding to the measured values for the measurements on the incoming (resp. outgoing) edges of sub-graph \(G^{(i)}\) after pre-measurement rotations along axes \(\alpha _{1}\dots \alpha _{n_{i}},\alpha '_{1}\dots \alpha '_{m_{i}}\).

The final probability distribution is obtained by contracting the tensor network defined by the graph \(\hat{G}=\left( \hat{V},\hat{E}\right)\), with \(|\hat{V}|=K+M\) and \(|\hat{E}|=2M\). Here, K “fragment” vertices correspond to the K disconnected components \(\{G^{(i)}\}\), and M “connecting” vertices to the M removed edges. Each “connecting” vertex is linked by two edges to the two “fragment” vertices that the corresponding removed edge used to connect, which gives 2M edges in total. To each “fragment” vertex, we associate a distribution \(p_{i}\), while to each “connecting” vertex, we associate a \(\gamma\) tensor [as defined below Eq. (10)].

In Fig. 2b, we give an example of such a tensor network for the Greenberger–Horne–Zeilinger (GHZ) circuit that we also considered in our previous work: in this case, the underlying graph turns out to be linear. We also show, in Fig. 3, an example with a more complex circuit and the resulting, more complex tensor network. Here, \(K=3\) and \(M=3\).

The contraction of these networks yields the sought-after distribution. The classical complexity of carrying out this contraction will be discussed in “Contraction Complexity and Relation to Tensor-Network Simulation”.
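To make the contraction concrete, here is a minimal sketch for a linear chain of three fragments (\(K=3\), \(M=2\)) obtained by cutting a four-qubit GHZ circuit on the wires of \(q_{1}\) and \(q_{2}\). For brevity, it uses the eigenstate-preparation form of Eq. (5) (tensor \(\tilde{\gamma }/2\)) rather than the Bell-pair gadget, and the fragment tensors are computed exactly instead of being sampled. The tensor names `T1`, `T2`, `T3`, `G` and the index layout are assumptions made for this example.

```python
import numpy as np

S = 1 / np.sqrt(2)
# EIG[a, b, :] = b-th eigenvector of Pauli a, with a = 0, 1, 2 for X, Y, Z
EIG = np.array([
    [[S, S], [S, -S]],
    [[S, 1j * S], [S, -1j * S]],
    [[1, 0], [0, 1]],
], dtype=complex)
KET0 = np.array([1, 0], dtype=complex)
H = np.array([[S, S], [S, -S]], dtype=complex)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]],
                dtype=complex)
CNOT_T = CNOT.reshape(2, 2, 2, 2)  # indices: out_ctrl, out_tgt, in_ctrl, in_tgt

# "Connecting" tensor G[a, b, c] = gamma_tilde_a^{bc} / 2
# (b: prepared eigenstate on the incoming side, c: measured outcome on the outgoing side)
G = np.empty((3, 2, 2))
G[0] = G[1] = np.eye(2) - 0.5  # X and Y: (2*delta - 1) / 2
G[2] = np.eye(2)               # Z: delta

# Fragment 1: H(q0), CNOT(q0, q1); q0 is final, q1 is an outgoing edge.
psi1 = (CNOT @ np.kron(H @ KET0, KET0)).reshape(2, 2)
T1 = np.abs(np.einsum("acq,iq->aic", EIG.conj(), psi1)) ** 2  # T1[axis, q0, outcome]

# Fragments 2 and 3 both apply one CNOT to (prepared eigenstate) x |0>.
init = np.einsum("abq,s->abqs", EIG, KET0)
phi = np.einsum("jrqs,abqs->abjr", CNOT_T, init)  # phi[axis_in, b_in, ctrl, tgt]

# Fragment 2: q1 incoming, q1 final (Z), q2 outgoing (measured along axis y).
T2 = np.abs(np.einsum("ydr,abjr->abjyd", EIG.conj(), phi)) ** 2
# Fragment 3: q2 incoming, q2 and q3 final (Z).
T3 = np.abs(phi) ** 2

# Contract fragment tensors with one connecting tensor per cut.
p = np.einsum("xbc,xic,xbjyd,yed,yekl->ijkl", G, T1, T2, G, T3)
print(np.round(p[0, 0, 0, 0], 6), np.round(p[1, 1, 1, 1], 6))
```

In this chain each cut contributes a sum over three axes and two pairs of binary indices, which is the origin of the exponential growth of the classical post-processing cost with the number of cuts.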

Noisy Simulation

NISQ processors are characterized by a substantial level of noise. In this section, we describe which noise processes we took into account in our simulation of the IBM Johannesburg quantum processor.

In this study, we focus on the noise processes whose quantitative levels are reported by the hardware manufacturer, IBM (see Table 1 for a summary of the numerical values used in the noisy simulations below). This pragmatic approach is justified a posteriori by the reasonable agreement of our numerical simulations with the experimental data (see Ref. [1], and “Results”). It should nevertheless be emphasized that (i) it uses rather simple noise models, that should be compared to noise models extracted from a full process tomography of the processor, and that (ii) it excludes some noise processes that are suspected to affect the final quantum state distribution in a non-negligible way, e.g., crosstalk (spatially correlated noise) and temporally correlated noise (like 1/f noise).

Table 1 Johannesburg processor metrics, as retrieved from IBM Quantum Experience on March 5th, 2020

The most prominent source of error in today’s superconducting processors is the readout error. The duration of the dispersive readout conducted in transmon processors, of the order of a few microseconds, makes for a higher probability of error, most notably of the relaxation (or amplitude damping) type. We thus model the readout process as a two-outcome POVM corresponding to an amplitude-damping quantum channel of duration \(\tau\) followed by a perfect Z-axis measurement: \(\lbrace \varvec{E}, \varvec{I} - \varvec{E} \rbrace\), with \(\varvec{E}=\left( \begin{array}{cc} 0 &{} 0\\ 0 &{} 1-\gamma \end{array}\right) .\) The duration \(\tau\) is adjusted so as to obtain a readout error rate \(\gamma = 1 - e^{-\tau /T_1}\) that matches the readout error rate reported by IBM. With \(\gamma = 4.1\%\) and \(T_1=65\,\upmu \mathrm {s}\), we find \(\tau =2.75\,\upmu \mathrm {s}\), a duration that is consistent with the usual measurement durations of dispersive readout processes. Note that this noise model does not include measurement crosstalk effects [29].
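The readout model above can be sketched in a few lines. The snippet inverts \(\gamma = 1 - e^{-\tau /T_1}\) for \(\tau\) and builds the two POVM elements; the function names are hypothetical, and we assume here that \(\varvec{E}\) is the element associated with outcome “1”. With the rounded inputs quoted in the text (\(\gamma =4.1\%\), \(T_1=65\,\upmu \mathrm {s}\)), the inversion gives roughly \(2.7\,\upmu \mathrm {s}\), close to the quoted duration; the small difference presumably comes from rounding of the inputs.

```python
import numpy as np


def readout_duration(gamma, t1):
    """Invert gamma = 1 - exp(-tau / T1) for the readout duration tau."""
    return -t1 * np.log(1.0 - gamma)


def readout_povm(gamma):
    """POVM {E, I - E}; E is assumed here to correspond to outcome '1'."""
    E = np.array([[0.0, 0.0], [0.0, 1.0 - gamma]])
    return E, np.eye(2) - E


GAMMA = 0.041  # readout error rate quoted in the text
T1_US = 65.0   # T1 in microseconds, as quoted in the text

tau_us = readout_duration(GAMMA, T1_US)
E_one, E_zero = readout_povm(GAMMA)

# A qubit in |1> is read as '0' with probability gamma, while |0> is never misread.
rho_one = np.diag([0.0, 1.0])
rho_zero = np.diag([1.0, 0.0])
p_misread_one = np.trace(E_zero @ rho_one)
p_misread_zero = np.trace(E_one @ rho_zero)
print(round(tau_us, 2), p_misread_one, p_misread_zero)
```

The asymmetry between the two misreading probabilities is a direct consequence of modeling the readout error as pure relaxation.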

Another source of error is gate noise, i.e. gate imperfections. Here, since the hardware manufacturer only reports average 1- and 2-qubit gate error rates, we picked the simplest noise process to model gate noise, namely depolarizing noise with a depolarization probability adjusted so that the average process fidelity \(\mathcal {F}_\mathrm {avg}\) matches the qubit-averaged average error rates \(\epsilon _\mathrm {avg} = 1 - \mathcal {F}_\mathrm {avg}\) reported by the hardware maker. We recall that the one-qubit depolarizing noise process is characterized by the following Kraus operators:

$$\begin{aligned} \varvec{K}_{0}^{D}&=\sqrt{1-p_{(1)}^{D}}\,\varvec{I},\\ \varvec{K}_{i}^{D}&=\sqrt{p_{(1)}^{D}/3}\,\varvec{\sigma }_{i},\;\;i=1,2,3, \end{aligned}$$

where \(\varvec{\sigma }_{i}\) denote the Pauli spin matrices. We model two-qubit depolarization processes as a tensor product of the one-qubit depolarizing noise. Let us stress that more structured, and therefore more accurate, noise models could be extracted from quantum process tomography methods, at the cost of a larger characterization overhead. Furthermore, this noise model does not include any crosstalk effects (see, e.g. [30]), despite evidence that they play some role in today’s NISQ processors.
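As a sanity check of this gate-noise model, the sketch below builds one-qubit depolarizing Kraus operators, forms the two-qubit channel as a tensor product, and verifies the completeness relation \(\sum _{k}\varvec{K}_{k}^{\dagger }\varvec{K}_{k}=\varvec{I}\). It assumes the common convention in which each Pauli error occurs with probability \(p/3\); the function names and the value of `P_DEPOL` are illustrative and are not the calibrated values of Table 1.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def depolarizing_kraus(p):
    """One-qubit depolarizing channel; each Pauli error has probability p / 3."""
    return [np.sqrt(1 - p) * I2] + [np.sqrt(p / 3) * P for P in (X, Y, Z)]


def two_qubit_depolarizing_kraus(p):
    """Tensor product of two independent one-qubit depolarizing channels."""
    one = depolarizing_kraus(p)
    return [np.kron(a, b) for a in one for b in one]


def apply_channel(kraus, rho):
    """Apply a channel given by its Kraus operators to a density matrix."""
    return sum(K @ rho @ K.conj().T for K in kraus)


def is_trace_preserving(kraus):
    """Check the completeness relation sum_k K_k^dagger K_k = I."""
    dim = kraus[0].shape[0]
    return bool(np.allclose(sum(K.conj().T @ K for K in kraus), np.eye(dim)))


P_DEPOL = 0.03  # hypothetical depolarization probability
rho0 = np.diag([1.0, 0.0]).astype(complex)
rho_noisy = apply_channel(depolarizing_kraus(P_DEPOL), rho0)
print(is_trace_preserving(depolarizing_kraus(P_DEPOL)),
      is_trace_preserving(two_qubit_depolarizing_kraus(P_DEPOL)))
```

With this convention, a qubit prepared in \(|0\rangle\) is flipped with probability \(2p/3\), since only the \(\sigma _{x}\) and \(\sigma _{y}\) errors change its population.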

Finally, we include the effect of decoherence on idle qubits, i.e. qubits that are not being acted upon by a quantum gate, but are nevertheless coupled to the outside environment. This decoherence can be decomposed into two main types, namely relaxation and dephasing. Relaxation (also known as amplitude damping or, in other contexts, spontaneous emission) causes excited qubits (i.e. in state \(|1\rangle\)) to relax to their ground state (\(|0\rangle\)) with a probability that is characterized by a time \(T_1\): \(p_{\tau _{\mathrm {idle}}}^{\mathrm {A.D}}=1-e^{-\tau _{\mathrm {idle}}/T_{1}}\), namely, the longer the idling duration \(\tau _{\mathrm {idle}}\), the higher the probability of a relaxation event. Similarly, dephasing events cause the two components \(|0\rangle\) and \(|1\rangle\) of a superposed state to acquire an unwanted relative phase with a certain probability. Under simplifying assumptions about the power spectral density (PSD) of the qubit-environment system, namely the assumption of a white noise PSD, this probability is given by \(p_{\tau _{\mathrm {idle}}}^{\mathrm {P.D}}=1-e^{-2\tau _{\mathrm {idle}}/T_{\varphi }}\), with \(\frac{1}{T_{\varphi }}=\frac{1}{T_{2}}-\frac{1}{2T_{1}}\). We note that this is quite a strong simplification, as actual transmon processors are known to have a PSD that deviates from white noise, with, most notably, a sizable pink (1/f) noise component (see, e.g., [31] for a review) that leads to a deviation from the exponential decay of the formula we used. Let us also stress that such a noise model does not take into account temporally correlated noise. As a reminder, here are the Kraus operators associated with amplitude damping and (pure) dephasing:

$$\begin{aligned} \varvec{K}_{0}^{\mathrm {A.D}}&=\left[ \begin{array}{cc} 1 &{} 0\\ 0 &{} \sqrt{1-p_{\tau _{\mathrm {idle}}}^{\mathrm {A.D}}} \end{array}\right] ,\varvec{K}_{1}^{\mathrm {A.D}}=\left[ \begin{array}{cc} 0 &{} \sqrt{p_{\tau _{\mathrm {idle}}}^{\mathrm {A.D}}}\\ 0 &{} 0 \end{array}\right] ,\\ \varvec{K}_{0}^{\mathrm {P.D}}&=\left[ \begin{array}{cc} 1 &{} 0\\ 0 &{} \sqrt{1-p_{\tau _{\mathrm {idle}}}^{\mathrm {P.D}}} \end{array}\right] ,\varvec{K}_{1}^{\mathrm {P.D}}=\left[ \begin{array}{cc} 0 &{} 0\\ 0 &{} \sqrt{p_{\tau _{\mathrm {idle}}}^{\mathrm {P.D}}} \end{array}\right] . \end{aligned}$$

The values we used for \(T_1\) and \(T_2\) are reported in Table 1.
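The idle-noise model can be checked with a short sketch: applying amplitude damping followed by pure dephasing to the state \(|+\rangle\) should make the excited-state population decay as \(e^{-\tau _{\mathrm {idle}}/T_{1}}\) and the coherence as \(e^{-\tau _{\mathrm {idle}}/T_{2}}\). The function names, the value of \(T_2\) and the idling duration below are hypothetical placeholders (they are not the values of Table 1), and the sketch assumes \(T_2<2T_1\) so that \(T_{\varphi }\) is finite.

```python
import numpy as np


def idle_kraus(tau_idle, t1, t2):
    """Kraus operators for amplitude damping and pure dephasing during idling."""
    p_ad = 1.0 - np.exp(-tau_idle / t1)
    t_phi = 1.0 / (1.0 / t2 - 1.0 / (2.0 * t1))  # requires t2 < 2 * t1
    p_pd = 1.0 - np.exp(-2.0 * tau_idle / t_phi)
    amplitude_damping = [
        np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - p_ad)]]),
        np.array([[0.0, np.sqrt(p_ad)], [0.0, 0.0]]),
    ]
    pure_dephasing = [
        np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - p_pd)]]),
        np.array([[0.0, 0.0], [0.0, np.sqrt(p_pd)]]),
    ]
    return amplitude_damping, pure_dephasing


def apply_channel(kraus, rho):
    """Apply a channel given by its Kraus operators to a density matrix."""
    return sum(K @ rho @ K.conj().T for K in kraus)


T1_US = 65.0   # microseconds, value quoted in the readout discussion
T2_US = 70.0   # hypothetical value, for illustration only
TAU_US = 1.0   # hypothetical idling duration

ad, pd = idle_kraus(TAU_US, T1_US, T2_US)
rho_plus = 0.5 * np.ones((2, 2))  # density matrix of |+>
rho = apply_channel(pd, apply_channel(ad, rho_plus))

population_ratio = rho[1, 1] / rho_plus[1, 1]  # expected exp(-tau / T1)
coherence_ratio = rho[0, 1] / rho_plus[0, 1]   # expected exp(-tau / T2)
print(population_ratio, coherence_ratio)
```

The factor 2 in the exponent of \(p^{\mathrm {P.D}}\) is what makes the coherence decay at the rate \(1/T_{2}=1/(2T_{1})+1/T_{\varphi }\) once both channels are combined.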

The noisy simulations are conducted on the Atos Quantum Learning Machine (QLM), a classical supercomputing platform dedicated to writing, simulating and optimizing quantum algorithms [22].

Fig. 4 Qubit connectivity map of the Johannesburg processor. Edges are shown between qubit pairs coupled via a resonator that allows application of the two-qubit CNOT gate

Before simulating the circuits resulting from the fragmentation procedure described in the previous section, we use the QLM’s Nnizer plugin to compile the circuits, i.e. most notably to adapt them to the Johannesburg processor’s restricted qubit topology (shown in Fig. 4). Then, we perform noisy simulations using a density-matrix-based noise simulator that uses a dense representation of the density matrix \(\rho\) of the qubit register.

Results

Summary of Previous Results

Fig. 5 Success probability as a function of circuit size (number of qubits) for various numbers of fragments using IBM’s Johannesburg processor

Fig. 6 Success probability as a function of circuit size (number of qubits) for various numbers of fragments using IBM’s Johannesburg processor (solid black lines) and Atos QLM noisy simulation (dashed blue lines). The black integers next to each black disk indicate the maximum fragment size (in number of qubits)

In [1], we investigated the performance of the circuit-cutting procedure for a simple GHZ-type circuit shown in Fig. 1a. As a proxy for the quality of the procedure, we chose the quantity

$$\begin{aligned} P_{\mathrm {success}}\equiv p\left( |0\rangle ^{\otimes m/2}|1\rangle ^{\otimes m/2}\right) +p\left( |1\rangle ^{\otimes m/2}|0\rangle ^{\otimes m/2}\right) , \end{aligned}$$
(11)

which, given the GHZ circuit at hand, is unity in the absence of any noise.
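For concreteness, Eq. (11) can be evaluated from a reconstructed distribution as in the following sketch. The function name, the dictionary representation of the distribution, the bit ordering (character k of the bitstring is qubit \(q_{k}\)) and the numerical values are all assumptions made for illustration; they are not measured data.

```python
def success_probability(distribution, m):
    """Eq. (11): weight of the two ideal outcomes 0..01..1 and 1..10..0."""
    if m % 2 != 0:
        raise ValueError("m must be even")
    half = m // 2
    first = "0" * half + "1" * half
    second = "1" * half + "0" * half
    return distribution.get(first, 0.0) + distribution.get(second, 0.0)


# Hypothetical noisy four-qubit distribution (bitstring -> probability).
example = {"0011": 0.45, "1100": 0.43, "0111": 0.07, "0001": 0.05}
p_success = success_probability(example, m=4)
print(p_success)
```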

We carried out the procedure both using an actual 20-qubit processor, IBM Johannesburg, and using the Atos Quantum Learning Machine’s noisy simulator.

The experimental success probabilities, shown in Fig. 5, display two clear trends. On the one hand, increasing the number of qubits leads to a decreasing success probability. This trend can be accounted for by the fact that increasing the number of qubits increases the number of gates of the circuit, and thus the sensitivity to gate errors and environmental decoherence. On the other hand, increasing the number of fragments in general leads to an improved success probability: the 6–8 fragment success probabilities are larger than the success probabilities obtained for lower numbers of fragments (with some exceptions to this observation: the one-fragment success probability often exceeds that of the 2 and 4–5 fragment cases, possibly due to compiler optimizations on the hardware side for circuits with larger numbers of qubits; we also note a point at \(n_\mathrm {qbits}=10\) where the 4–5 fragment success probability exceeds that of the 6–8 fragment case). This trend can be ascribed to the smaller gate count of each individual fragment, and thus a reduced sensitivity to errors. This smaller gate count comes not only from the cutting procedure itself, but also from the fact that smaller circuits better match the limited connectivity (Fig. 4) of the Johannesburg chip. Conversely, larger circuits need to be compiled to fulfill the connectivity constraints, leading to larger gate counts.

To substantiate these interpretations, we performed noisy simulations with noise models established using the manufacturer’s calibration data (Table 1). We show the results in Fig. 6: the noisy simulations agree with the experimental data to within 20%. In particular, the drops in success probability, which can be traced back to connectivity-related insertions of SWAP gates, are reproduced. We note that the error bars coming from the finite number of shots (8192) used for each fragment are contained within the data symbols.

Analysis of the Influence of the Different Noise Types

Fig. 7 Effect of better readout: Same as Fig. 5, but with a readout duration divided by 5

Fig. 8 Effect of better gates: Same as Fig. 5, but with a depolarizing error per gate divided by 5

Fig. 9 Effect of better coherence: Same as Fig. 5, but with \(T_1\) and \(T_2\) coherence times multiplied by 5

Fig. 10 Increase in success probability averaged over the number of qubits as a function of the number of fragments, \(\varDelta P (\mathcal {S}, n_\mathrm {f})\), for a faster readout (blue, parameters of Fig. 7), better gates (orange, parameters of Fig. 8), better coherence (green, parameters of Fig. 9)

In this section, we study and compare the differential impact of all the noise types we have previously taken into account: gate imperfections, idling and readout errors. Our goal is to understand which types of noise have a particularly severe influence on the fidelity of the fragmenting procedure and to formulate recommendations as to which noise types should be addressed first if one wants to make the most of the fragmenting procedure. Hence, we study the influence of the three noise types by simulating better readout measurements (Fig. 7), better gates (Fig. 8) and a better coherence time (Fig. 9).

Faster readout. First, we analyze the impact of readout errors by decreasing the duration \(\tau\) of the measurements on all the subcircuits generated by the splitting procedure. Readout error is at present the largest source of errors in superconducting processors, with error rates as high as a few percent. It is thus reasonable to assume that large experimental efforts are going to be made to reduce this error rate. Here, we suppose that the reduction in readout error rate originates from a reduction of the readout duration (in practice by a factor of 5). In this noise model, which assumes that the errors come only from amplitude damping noise, it would be equivalent to keep the readout duration fixed and to increase the \(T_1\) coherence time by the same factor of 5. In reality, progress is being made on both fronts (see, e.g., [32, Fig. 2.c] for the increasing \(T_1\) trend, and [33] for recent efforts towards faster measurements).
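The equivalence between a shorter readout and a longer \(T_1\) can be illustrated with a minimal sketch. It assumes the standard amplitude damping form, in which an excited qubit relaxes during a time \(\tau\) with probability \(1 - e^{-\tau /T_1}\); the function name and the numerical values are illustrative assumptions and are not taken from the calibration data of Table 1.

```python
import math

def damping_probability(duration, t1):
    """Probability that an excited qubit relaxes during `duration`
    under amplitude damping (assumed form: 1 - exp(-duration / t1))."""
    return 1.0 - math.exp(-duration / t1)

# Illustrative values only (not the Johannesburg calibration data).
tau = 1.0e-6   # readout duration, in seconds
t1 = 70.0e-6   # relaxation time, in seconds

p_baseline = damping_probability(tau, t1)
p_faster_readout = damping_probability(tau / 5, t1)  # readout 5 times shorter
p_longer_t1 = damping_probability(tau, 5 * t1)       # T1 5 times longer
```

Under this assumption the two improved scenarios depend only on the ratio \(\tau /T_1\), so `p_faster_readout` and `p_longer_t1` coincide up to floating-point rounding.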

We see in Fig. 7 that the improvement in overall success probability brought by a better readout grows with the number of fragments. The difference between the solid and the dashed lines qualitatively increases with the number of readout measurements used, and consequently with the number of fragments. Indeed, more fragments necessitate more measurements to characterize the quantum state of each fragment. Nevertheless, we still see drops in the evolution of the success probability with the number of qubits. These drops can be explained by the topology constraints, which require the use of several SWAP gates when we try to perform gates between physical qubits that are not adjacent. This calls for the study of the next paragraph.

Better gates. To model the use of better gates, we lower the amplitude of the depolarizing channel by dividing the depolarizing error rate by a factor of 5. The limited gate fidelity is the second major source of errors in superconducting processors. It comes from calibration errors as well as decoherence. A factor of 5 is realistic in view of the improvements in gate quality of superconducting processors in recent years, and of the variability in the error rates reported by the hardware providers (the two-qubit error rates reported for IBM Johannesburg [34], Google Sycamore [35, Fig. 2, Table II] and Rigetti Aspen 7 [36] are, respectively, 0.2%, 0.62% and 4.8%).

The results of this change in the noise model can be seen in Fig. 8. We notice that the success probability decreases more regularly as the number of qubits increases: the “drops” in success probability are smoothed out. These drops were the consequence of performing a gate between qubits that are not adjacent in the connectivity map (Fig. 4), which requires several SWAP gates. Thus, better gates help mitigate the effect of topology. The insertion of additional SWAP gates because of topology constraints becomes less detrimental to the overall success probability when the inserted gates are of good fidelity.

Better coherence. Finally, to understand the impact of coherence on the splitting procedure, we multiply the relaxation time \(T_1\) and the dephasing time \(T_2\) by a factor of 5 (see “Noisy Simulation” for a definition of the corresponding Kraus operators). Doing this delays both spontaneous emission (amplitude damping) and phase flip (dephasing) events. Decoherence errors indeed account for another portion of the errors incurred by a quantum processor. They not only lead to a decrease in gate fidelity, but also affect idle qubits. The factor of 5 we chose is compatible with the improvements of recent years (see [32, Fig. 2.c] for the increasing \(T_1\) trend).

As shown in Fig. 9, better coherence has only a limited impact on the fragmenting procedure: it improves the success probability of the runs with fewer fragments more than that of the runs with more fragments, for which the solid and dashed lines are closer to each other. This behavior is expected. Using a larger number of fragments implies that the fragments contain fewer qubits, and such small fragments are less sensitive to decoherence.

All these observations are summarized in Fig. 10, which shows the increased success probability using the new parameters compared to the success probabilities \(P_\mathrm {success}^{(0)}\) computed with the Johannesburg noise parameters. For each of the scenarios \(\mathcal {S}\) introduced above, we compute the increase in probability defined as:

$$\begin{aligned} \varDelta P (\mathcal {S}, n_\mathrm {f}) = \langle P_\mathrm {success}(\mathcal {S}, n_\mathrm {f}, n_\mathrm {q}) - P_\mathrm {success}^{(0)}(n_\mathrm {f}, n_\mathrm {q})\rangle _{n_\mathrm {q}}. \end{aligned} \tag{12}$$
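As a minimal sketch of how Eq. (12) can be evaluated, the snippet below averages the gain in success probability over the number of qubits for each fragment count. The dictionary layout, the function name `delta_p` and all numerical values are illustrative assumptions; the numbers are made up and are not read from the figures.

```python
from statistics import mean

def delta_p(p_scenario, p_baseline):
    """Average gain in success probability per fragment count (Eq. 12).

    Both arguments map (n_f, n_q) -> success probability.
    """
    gains = {}
    for (n_f, n_q), p in p_scenario.items():
        # Difference with the baseline for the same (n_f, n_q) pair.
        gains.setdefault(n_f, []).append(p - p_baseline[(n_f, n_q)])
    # Average over the number of qubits n_q for each n_f.
    return {n_f: mean(diffs) for n_f, diffs in gains.items()}

# Made-up numbers for illustration only.
baseline = {(1, 6): 0.50, (1, 8): 0.40, (2, 6): 0.55, (2, 8): 0.45}
scenario = {(1, 6): 0.60, (1, 8): 0.50, (2, 6): 0.70, (2, 8): 0.65}
result = delta_p(scenario, baseline)
```

Plotting `result` against the fragment count would give one curve of the kind shown in Fig. 10; only the trend of such a curve with \(n_\mathrm {f}\) carries meaning.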

We see that, as discussed above, better readout is all the more helpful as the number of fragments is large, while, conversely, better coherence is more beneficial for smaller numbers of fragments. Achieving better gate fidelities, on the other hand, is equally beneficial with and without fragmentation, since the slope of the orange line is close to 0. (We stress that, because the level of improvement chosen for each of the three scenarios is arbitrary, no quantitative conclusion can be drawn from the value of the improvement; our conclusions are qualitative and based only on the slope with respect to the number of fragments.) Consequently, to make the most of the fragmenting procedure when fragments are numerous, the major error source to address is the measurement error, by designing faster readouts.

Contraction Complexity and Relation to Tensor-Network Simulation

Fig. 11
figure 11

First three contraction steps for the fragmentation of the GHZ-type circuit of Fig. 1

In this section, we elaborate on the complexity of the fragmentation algorithm. As presented in “Circuit Cutting”, the fragmentation method consists of a quantum and a classical step. In the quantum step, a batch of quantum circuits is executed on a Quantum Processing Unit (QPU). The number of such circuits scales as the number K of disconnected subgraphs of the original directed acyclic graph with some edges removed. The outcome of this step is a list of probability distributions \(p_i\). In the classical step, a tensor network with nodes corresponding either to the probability distributions or to the \(\gamma\) tensors defined in “Circuit Cutting” needs to be contracted.

Here, we shall be interested in the contraction complexity of such a tensor network, assuming one wants to recover the probability of a single bitstring \((\hat{b}_0, \dots , \hat{b}_{m-1})\), i.e., for a fixed assignment of the external legs of the tensor network shown in Fig. 2b. A naive contraction of the tensor network at hand, namely a simultaneous summation over all internal indices \({(\alpha _i, b_i, b'_i)}_{i=1\dots K-1}\), would entail a contraction complexity of \(12^{K-1}\), i.e., a classical computation that is exponential in the number K of fragments. In our case, however, the linear structure of the graph underlying the tensor network allows for a much more efficient sequential contraction strategy. Such a strategy, which is also widely exploited for contracting so-called Matrix Product States (see, e.g., [37, 38] for a review), consists in sequentially contracting the nodes of the network starting from one end of the linear graph. This is illustrated in Fig. 11, where we show the first three steps. The contraction complexity of the successive steps is 12, 36, 12, 36, ..., 12, 36, 12, 6. For K fragments, this yields an overall contraction complexity of \(48 (K-2) + 18 = 48 K - 78\), i.e., a complexity that is linear in the fragment number K.
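The cost counting above can be checked with a short sketch. It only tallies the step costs quoted in the text (the pattern 12, 36 repeated \(K-2\) times, followed by 12 and 6) and compares their sum with the naive \(12^{K-1}\); it does not contract an actual tensor network, and the function names are illustrative.

```python
def sequential_step_costs(num_fragments):
    """Per-step costs of the sequential contraction for K >= 2 fragments."""
    if num_fragments < 2:
        raise ValueError("at least two fragments are needed for a cut")
    # The pair (12, 36) repeats K - 2 times; the last two steps cost 12 and 6.
    return [12, 36] * (num_fragments - 2) + [12, 6]

def sequential_cost(num_fragments):
    """Total cost of the sequential strategy, equal to 48 K - 78."""
    return sum(sequential_step_costs(num_fragments))

def naive_cost(num_fragments):
    """Simultaneous summation over all (alpha_i, b_i, b'_i), i = 1..K-1."""
    return 12 ** (num_fragments - 1)

for k in (2, 3, 5, 10):
    print(k, sequential_cost(k), naive_cost(k))
```

For K = 10, for instance, the sequential strategy costs 402 operations in this counting, against \(12^{9} = 5159780352\) for the naive summation.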

In the case of a general tensor network, the optimal contraction complexity can be shown to be at least of the order of \(O(e^T)\), where T is the so-called treewidth of the network graph [39]. The treewidth of a graph can be understood as a combinatorial measure of how close the graph is to a tree. It can be defined more formally in several equivalent ways, for instance as the minimum k for which the graph is a partial k-tree, or as the elimination width.

Tensor-network theory can also be leveraged to simulate quantum circuits classically. A number of tensor-network-based simulators have been developed for such simulations: QFlex [40], AC-QDP [41], Quimb [42], and QTensor [43]. These simulators are typically much faster and more efficient than state vector simulators for shallow circuits [44] such as the circuits in this work. In these tensor simulators, the circuits are not directly represented by tensors, but rather by line graphs, an approach proposed by Boixo et al. [45]. This approach has multiple benefits. Its only disadvantage is its limited usability for simulating sub-tensors of amplitudes, an issue that was resolved in the work by Schutski et al. [46].

The method studied in our work, circuit cutting, has a counterpart in tensor-network-based simulation, called tensor slicing. One way to understand slicing a tensor over an index is to view the tensor as a function of many variables and to evaluate it at a fixed value of one variable:

$$\begin{aligned} f(x_1, x_2, \ldots , x_n)|_{x_1 = a} = \tilde{f}(x_2,\ldots , x_n), \end{aligned}$$

where the variables take integer values \(x_i \in [0,d-1]\). Thus, in this technique, each slicing operation removes one index from the tensor. Since all the indices we use have size 2, removing n vertices splits the expression into \(2^n\) separate parts. This operation is equivalent to a decomposition of the full tensor expression. Each separate tensor is represented by a graph with lower connectivity than the original one. As a result, slicing dramatically reduces the complexity of finding the optimal elimination order and lowers the contraction cost. Like the circuit-cutting technique described in this work, it is a powerful technique that makes it possible to simulate large circuits.
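The following minimal sketch illustrates slicing on a toy contraction of three tensors with binary indices arranged in a loop. The tensors and their values are arbitrary illustrations and are unrelated to the circuits of this work. Fixing the index \(x_1\) yields \(2^1 = 2\) independent parts whose sum reproduces the full contraction.

```python
from itertools import product

# Three small tensors with binary indices, filled with arbitrary values.
A = [[1.0, 2.0], [3.0, 4.0]]   # A[x1][x2]
B = [[0.5, 1.5], [2.5, 3.5]]   # B[x2][x3]
C = [[1.0, 0.0], [2.0, 1.0]]   # C[x3][x1]

def full_contraction():
    """Sum over all indices of A[x1][x2] * B[x2][x3] * C[x3][x1]."""
    return sum(A[x1][x2] * B[x2][x3] * C[x3][x1]
               for x1, x2, x3 in product(range(2), repeat=3))

def sliced_contraction(x1):
    """Same expression with the index x1 fixed: one of the 2 slices."""
    return sum(A[x1][x2] * B[x2][x3] * C[x3][x1]
               for x2, x3 in product(range(2), repeat=2))

# The slices are independent and can be evaluated separately, then summed.
total_from_slices = sum(sliced_contraction(a) for a in range(2))
```

In this toy case, fixing \(x_1\) opens the loop formed by the three indices, so each slice is a simpler chain contraction over \(x_2\) and \(x_3\) only.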

Homogeneous and Heterogeneous Quantum Computing

One exciting application of the circuit-cutting technique is the execution of much larger circuits. This can be done in two ways: by splitting the circuit and running the fragments sequentially on a single quantum device (as we demonstrated in [1]), or by running them at the same time on multiple quantum devices. The latter way can lead to a new paradigm for quantum computation: distributed quantum computing. It can potentially allow not only for the execution of larger circuits, but also for a much faster execution. It is arguably a more realistic approach in the near future than “true” distributed quantum computing, which requires a quantum network connecting the quantum devices. In our approach, indeed, only a classical network is needed.

Conclusions

In this work, we further investigated the Quantum Divide and Compute approach, whose first implementation was demonstrated in a recent work of ours [1].

After giving more details as to the mathematical framework and physical models used for this implementation, we analyzed the influence of different noise sources on the success probability of a simple, GHZ-type circuit using classical noisy simulations on the Atos Quantum Learning Machine. We focused on the three main noise sources of today’s superconducting processors, namely readout errors, gate errors and decoherence (relaxation and dephasing) on idle qubits. We showed that readout errors are the most detrimental to the QDC procedure, because QDC requires additional measurements as the number of fragments increases. Conversely, the effect of idling noise is mitigated by QDC, as QDC results in smaller circuits that are less susceptible to this source of noise.

We also analyzed the computational complexity of QDC using tensor-network methods. While for a general circuit the contraction complexity increases exponentially with the number of cuts, for the GHZ-like circuit we studied, the complexity increases linearly with the number of cuts.

Finding more complex circuits in which the contraction complexity is still manageable is an interesting future direction. Circuits that have a “clustered” structure [14], such as those required in methods like the Dynamic Quantum Variational Ansatz [47], are promising candidates. In these methods, indeed, the ansatz has a mixer unitary made up of partial mixers that can have limited connectivity between each other, and can therefore form clusters.