Classical splitting of parametrized quantum circuits

Barren plateaus appear to be a major obstacle for using variational quantum algorithms to simulate large-scale quantum systems or to replace traditional machine learning algorithms. They can be caused by multiple factors such as the expressivity of the ansatz, excessive entanglement, the locality of observables under consideration, or even hardware noise. We propose classical splitting of parametric ansatz circuits to avoid barren plateaus. Classical splitting is realized by subdividing an N-qubit ansatz into multiple ansätze that consist of O(log N) qubits. We show that such an approach avoids barren plateaus, support this with numerical experiments, and perform binary classification on classical and quantum datasets. Moreover, we propose an extension of the ansatz that is compatible with variational quantum simulations. Finally, we discuss a speed-up for gradient-based optimization and hardware implementation, robustness against noise and parallelization, making classical splitting an ideal tool for noisy intermediate scale quantum (NISQ) applications.


I. INTRODUCTION
Variational quantum algorithms (VQAs) [1] are promising tools to solve a wide range of problems, such as finding the ground state of a given Hamiltonian via the variational quantum eigensolver (VQE) [2], solving combinatorial optimization problems with the quantum approximate optimization algorithm (QAOA) [3], or solving classification problems using quantum neural networks [4].
VQAs are suitable for noisy intermediate scale quantum (NISQ) [5] hardware as they can be implemented with a small number of layers and gates for simple tasks. However, a scalability problem arises with an increasing number of qubits, hindering a possible advantage. VQAs rely on a classical optimization loop that updates the parameters of the ansatz iteratively until a condition on the cost function is satisfied. Classical optimizers use the information on the parametrized cost landscape to find the minimum. The updates on the parameters move the ansatz to a lower point on the cost surface. In 2018, McClean et al. showed that the cost landscape flattens with an increasing number of qubits, making it exponentially harder for the optimizer to find the solution [6]. The flattening was first observed by looking at the distribution of gradients across the parameter space, and the problem was named barren plateaus (BPs). A VQA is said to have a BP if its gradients decay exponentially with respect to one of its hyper-parameters, such as the number of qubits or layers.
Since the discovery of the BP problem, there has been significant progress that improved our understanding of what causes BPs, and several methods to avoid them have been proposed. It has been shown that noise [7], entanglement [8], and the locality of the observable [9] play an essential role in determining whether an ansatz will exhibit BPs. It has also been shown that the choice of ansatz (e.g. its expressivity) is one of the decisive factors that impact BPs [10]. For instance, the absence of BPs has been shown for quantum convolutional neural networks (QCNN) [11,12] and tree tensor networks (TTN) [13,14]. On the other hand, the hardware efficient ansatz (HEA) [6,14,15] and matrix product states (MPS) [14] have been shown to have BPs.
One of the essential discoveries showed that BPs are equivalent to cost concentration and narrow gorges [16]. This implies that BPs are not only a result of the exponentially decaying gradient but also of the cost function itself, and they can be identified by analyzing random points on the cost surface. As a result, gradient-free optimizers are also prone to BPs and do not offer a way to circumvent this problem [17].
Many methods have been suggested to mitigate BPs in the literature. Some of these methods suggest using different ansätze or cost functions [18,19], determining a better initial point to start the optimization [20-23], determining the step size during the optimization based on the ansatz [24], correlating parameters of the ansatz (e.g., restricting the directions of rotation) [10,25], or combining multiple methods [26,27].
In this work, we propose a novel idea: we claim that if any ansatz of N qubits is classically separated into a set of ansätze with O(log N) qubits, the new ansatz will not exhibit barren plateaus. This work is not the first proposal in the literature that considers partitioning an ansatz. However, our proposal is significantly different. Most work in the literature first considers an ansatz and then emulates the result of that ansatz through many ansätze (exponentially many in general) with fewer qubits (which increases the effective size of quantum simulations) using gate decompositions, entanglement forging, divide and conquer or other methods [28-35]. In contrast, this work proposes using ansätze that are classically split, meaning that there are no two-qubit gate operations between the subcircuits before splitting. This way, there is no need for gate decompositions or other computational steps. Our results show that this approach provides many benefits such as better trainability, robustness against noise and faster implementation on NISQ devices.
In the remainder of the paper, we start by giving an analytical illustration of the method in Section II. Then, we provide numerical evidence for our claim in Section III and extend our results to practical use cases by comparing the binary classification performance of classical splitting for classical and quantum data. Next, we propose an extension of the classical splitting ansatz and perform experiments to simulate the ground state of the transverse-field Ising Hamiltonian. Finally, we discuss the advantages of employing classical splitting, comment on future directions in Section IV and give an outlook in Section V.

II. AVOIDING BARREN PLATEAUS
Barren plateaus (BPs) can be identified by investigating how the gradients of an ansatz scale with respect to a parameter. Here, we start with the notation of McClean et al. and extend it to classical splitting [6]. The ansatz is composed of consecutive parametrized (V) and non-parametrized entangling (W) layers. We define $U_l(\theta_l) = \exp(-i\theta_l V_l)$, where $V_l$ is a Hermitian operator and $W_l$ is a generic unitary operator. Then the ansatz can be expressed as a product of layers,

$$U(\boldsymbol{\theta}) = \prod_{l=1}^{L} U_l(\theta_l) W_l. \tag{1}$$

For an observable $O$ and an input state $\rho$, the cost is given as

$$C(\boldsymbol{\theta}) = \mathrm{Tr}\left[O\, U(\boldsymbol{\theta})\, \rho\, U^\dagger(\boldsymbol{\theta})\right]. \tag{2}$$

The ansatz can be separated into two parts to investigate a certain layer, such that $U_- \equiv \prod_{l=1}^{j-1} U_l(\theta_l) W_l$ and $U_+ \equiv \prod_{l=j}^{L} U_l(\theta_l) W_l$. Then, the gradient with respect to the j-th parameter can be expressed as

$$\partial_j C(\boldsymbol{\theta}) = i\, \mathrm{Tr}\left[\left[V_j,\, U_+^\dagger O\, U_+\right] U_- \rho\, U_-^\dagger\right]. \tag{3}$$

The expected value of the gradients can be computed using the Haar measure. Please see Appendix A for more details on the Haar measure, unitary t-designs and details of the proofs in this section. If we assume the ansatz $U(\boldsymbol{\theta})$ forms a unitary 2-design, then this implies that $\langle \partial_j C(\boldsymbol{\theta}) \rangle = 0$ [6]. Since the gradients are centered around zero, the variance of the distribution, which is defined as

$$\mathrm{Var}\left[\partial_j C(\boldsymbol{\theta})\right] = \left\langle \left(\partial_j C(\boldsymbol{\theta})\right)^2 \right\rangle - \left\langle \partial_j C(\boldsymbol{\theta}) \right\rangle^2, \tag{4}$$

can inform us about the size of the gradients. The variance of the gradient with respect to the j-th parameter of the ansatz, where $U_-$ and $U_+$ are both assumed to be unitary 2-designs and the number of qubits is N, is given by Eq. (5) [6,10]. This means that for a unitary 2-design the gradients of the ansatz vanish exponentially with respect to the number of qubits N. Details of this proof are provided in Appendix A.

Now, let us consider the classical splitting (CS) case. We split the ansatz $U(\boldsymbol{\theta})$ into k m-qubit ansätze, where we assume without loss of generality that N = k × m. Then, we introduce a new notation, $U_l^i(\boldsymbol{\theta}_l^i)$, for each classically split layer, where the index l determines the layer and the index i determines which sub-circuit it belongs to. This notation combines the parametrized and entangling gates under $U_l^i$.
Then, the overall CS ansatz can be expressed as

$$U(\boldsymbol{\theta}) = \prod_{l=1}^{L} \bigotimes_{i=1}^{k} U_l^i(\boldsymbol{\theta}_l^i). \tag{7}$$

The CS ansatz can be seen in Fig. 1a.

FIG. 1. All types of ansätze used in this work. (a) An N-qubit generic ansatz consisting of L layers of the parametrized unitary U is separated into k = N/m m-qubit ansätze. This ansatz will be referred to as the classically split (CS) ansatz. The standard ansatz can be recovered by setting m = N. (b) Extended classically split (ECS) ansatz. This is an extension to the CS ansatz. The first L layers of the ansatz consist of k = N/m m-qubit U blocks. Then, T layers of N-qubit V layers are applied. (c) A simple ansatz that consists of RY rotation gates and CX gates connected in a "ladder" layout. (d) Hardware Efficient Ansatz (HEA) that is used to produce the quantum dataset. Parameters of the first column of U3 gates are sampled from a uniform distribution over [−1, 1], while the rest of the parameters are provided by the dataset [36]. (e) EfficientSU2 ansatz with "full" entangler layers [37].

Next, we assume the observable and the input state to be classically split, such that both can be expressed in terms of m-qubit observables or states. This assumption restricts our proof to be valid only for m-local quantum states and m-local observables. It is important to note that the definition we use throughout the paper differs from the one common in the literature. For this proof, an m-local observable is an observable such that there are no operators that act on overlapping groups of m qubits. A generic m-local observable can be expressed as

$$O = \sum_{i=1}^{k} O_i \otimes \mathbb{1}_{\bar{i}}, \tag{8}$$

where $O_i$ is an observable over the qubits {(i − 1)m + 1, (i − 1)m + 2, ..., im}, and $\bar{i}$ represents the remaining N − m qubits. With the input state $\rho = \bigotimes_{i=1}^{k} \rho_i$, the cost function becomes

$$C(\boldsymbol{\theta}) = \mathrm{Tr}\left[\sum_{i=1}^{k} \left(O_i \otimes \mathbb{1}_{\bar{i}}\right) U(\boldsymbol{\theta})\, \rho\, U^\dagger(\boldsymbol{\theta})\right]. \tag{9}$$

This can be written as a simple sum,

$$C(\boldsymbol{\theta}) = \sum_{i=1}^{k} C_i(\boldsymbol{\theta}^i), \tag{10}$$

where

$$C_i(\boldsymbol{\theta}^i) = \mathrm{Tr}\left[O_i\, U^i(\boldsymbol{\theta}^i)\, \rho_i\, U^{i\dagger}(\boldsymbol{\theta}^i)\right], \qquad U^i(\boldsymbol{\theta}^i) = \prod_{l=1}^{L} U_l^i(\boldsymbol{\theta}_l^i). \tag{11}$$

Then, the costs of the classically separated circuits are independent of each other. The gradient with respect to the j-th parameter of the i-th ansatz can be written as

$$\frac{\partial C(\boldsymbol{\theta})}{\partial \theta_j^i} = \frac{\partial C_i(\boldsymbol{\theta}^i)}{\partial \theta_j^i}, \tag{12}$$

which has the form of Eq. (3) restricted to the m qubits of the i-th sub-circuit.

Now, let us consider each ansatz $U^i(\boldsymbol{\theta}^i)$ to be a unitary 2-design. We want to choose the integer m such that it scales logarithmically in N. Hence, we choose β and γ appropriately, such that $m = \beta \log_\gamma N$ holds. Then, if we combine Eq. (5) with Eq. (12), the variance of the gradient with respect to the j-th parameter can be expressed as in Eq. (13). Here, the dependence on i or j becomes irrelevant (a simpler choice for ansatz design would be to choose every new ansatz to be the same), so it can be dropped for a simpler notation. Similar to Eq. (5), the variance scales with the dimension of the Hilbert space, here $\mathcal{O}(2^m)$. Then, the overall expression scales as $\mathcal{O}(N^{-6\beta \log_\gamma 2})$, where β and γ are constants (e.g. β = 1 and γ = 2 results in $m = \log_2 N$). As a result, the variance of the classical splitting ansatz scales with $\mathcal{O}(\mathrm{poly}(N)^{-1})$ instead of $\mathcal{O}(\exp(N)^{-1})$. Therefore, a CS ansatz, irrespective of its choice of gates or layout, can be used without leading to BPs.
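The effect of the logarithmic block size can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original derivation: the helper names `block_size` and `dimension_exponent` are hypothetical, and rounding m up to an integer is an assumption made here for illustration. It shows that with m = β log_γ N the block dimension 2^m grows only polynomially in N, as N^(β log_γ 2).

```python
import math


def block_size(n_qubits, beta=1.0, gamma=2.0):
    """Sub-circuit size m = beta * log_gamma(N), rounded up to an integer >= 1."""
    m = beta * math.log2(n_qubits) / math.log2(gamma)
    # Subtract a small tolerance so that float round-off does not bump exact values up.
    return max(1, math.ceil(m - 1e-9))


def dimension_exponent(beta=1.0, gamma=2.0):
    """Exponent p with 2**m == N**p when m = beta * log_gamma(N), i.e. p = beta * log_gamma(2)."""
    return beta / math.log2(gamma)


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        m = block_size(n)
        digits = len(str(2 ** n))
        print(f"N={n:5d}  m={m:2d}  block dimension 2^m={2 ** m:5d}  full dimension 2^N has {digits} digits")
```

For β = 1 and γ = 2 the block dimension equals N itself, while the dimension of the unsplit circuit grows as 2^N.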

III. NUMERICAL EXPERIMENTS
In this section, we report results of four numerical experiments. We investigate the scaling of gradients under classical splitting by computing variances over many samples in Section III A. Then, we perform three experiments to observe how classical splitting affects performance of an ansatz. This task by itself leads to many questions as there are multitudes of metrics that one needs to compare and as many different problems one can consider. For this purpose, we consider problems well known in the literature, where trainability of ansätze plays a significant role.
First, we perform binary classification on a synthetic classical dataset in Section III B. The dataset contains two distributions that are referred to as classes. The goal is to predict the class of each sample. We perform the same task for distributions of quantum states in Section III C. Then, we give practical remarks in Section III D. Finally, we propose an extension to the CS ansatz and employ it to simulate the ground state of the transverse-field Ising Hamiltonian in Section III E.
For the first three experiments (Sections III A to III C), we consider the CS ansatz with layers that consist of R_Y rotation gates and CX entangling gates applied in a ladder formation for each layer. This layer can be seen in Fig. 1c. As the observable, we construct the 1-local observable defined in Eq. (14), where Z_i represents the Pauli-Z operator applied on the i-th qubit and 1_ī represents the identity operator applied on the rest of the qubits.
A. Barren Plateaus
Barren plateaus are typically identified by looking at the variance of the gradient with respect to the first parameter over a set of random samples [6]. Recently, it has been shown that this is equivalent to looking at the variance of samples from the difference of two cost values evaluated at different random points of the parameter space [16]. Since gradient-free optimization methods are also affected by BPs, the values of the cost become a more inclusive indicator [17]. For this reason, we report our findings with respect to the cost, rather than the gradients, to draw a broader picture. Results with respect to the gradient of the first parameter are presented in Appendix B for the sake of completeness. The experiments were performed using analytical gradients and expectation values, assuming a perfect quantum computer and an infinite number of measurements, using PennyLane [38] and PyTorch [39]. Variances are computed over 2000 samples, where the values of the parameters are randomly drawn from a uniform distribution over [0, 2π].
We start by presenting the variances over different values of m and N in Fig. 2. We fix the number of layers (L) to N, so that the ansatz exhibits BPs in the setting without classical splitting (m = N). The results indicate that a constant value of m resolves the exponential behaviour, as expected from Eq. (13). Furthermore, it is evident that larger values of m can allow the ansatz to escape BPs, given that m grows slowly enough (e.g. O(log N)).
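The kind of variance estimate described here can be illustrated with a small, self-contained statevector sketch. It is not the PennyLane/PyTorch code used for the reported results. It assumes the RY + CX ladder layer of Fig. 1c, takes the observable to be the normalized sum (1/N) Σ Z_i (an assumption, since the exact normalization of Eq. (14) is not reproduced here), and estimates the variance of the cost itself rather than of cost differences. All function names are hypothetical. Because the blocks are independent, each m-qubit block is simulated on its own.

```python
import math
import random


def apply_ry(state, qubit, theta):
    """Apply RY(theta) to `qubit` (bit `qubit` of the basis index) of a real statevector."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    bit = 1 << qubit
    new = state[:]
    for idx in range(len(state)):
        if not idx & bit:
            a0, a1 = state[idx], state[idx | bit]
            new[idx] = c * a0 - s * a1
            new[idx | bit] = s * a0 + c * a1
    return new


def apply_cx(state, control, target):
    """Apply a CX gate: flip `target` on every basis state where `control` is 1."""
    new = state[:]
    cbit, tbit = 1 << control, 1 << target
    for idx in range(len(state)):
        if idx & cbit:
            new[idx ^ tbit] = state[idx]
    return new


def block_cost(m, params):
    """Mean <Z_q> over one m-qubit block; params[l][q] is the RY angle of qubit q in layer l."""
    state = [0.0] * (2 ** m)
    state[0] = 1.0
    for layer in params:
        for q in range(m):
            state = apply_ry(state, q, layer[q])
        for q in range(m - 1):  # "ladder" of CX gates
            state = apply_cx(state, q, q + 1)
    total = 0.0
    for idx, amp in enumerate(state):
        z_sum = sum(1 - 2 * ((idx >> q) & 1) for q in range(m))
        total += amp * amp * z_sum
    return total / m


def cs_cost(n_qubits, m, n_layers, rng):
    """Cost of the classically split ansatz at one random parameter point."""
    if n_qubits % m:
        raise ValueError("m must divide the number of qubits")
    k = n_qubits // m
    costs = []
    for _ in range(k):
        params = [[rng.uniform(0.0, 2.0 * math.pi) for _ in range(m)] for _ in range(n_layers)]
        costs.append(block_cost(m, params))
    return sum(costs) / k


def cost_variance(n_qubits, m, n_layers, n_samples, seed=0):
    """Sample variance of the cost over uniformly random parameters in [0, 2*pi]."""
    rng = random.Random(seed)
    samples = [cs_cost(n_qubits, m, n_layers, rng) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    return sum((c - mean) ** 2 for c in samples) / n_samples


if __name__ == "__main__":
    n = 8
    for m in (2, 4, 8):  # m = 8 corresponds to no classical splitting
        print(m, cost_variance(n, m, n_layers=n, n_samples=50))
```

Only real amplitudes are tracked, which is sufficient here because RY and CX are real matrices; a general gate set would need complex amplitudes.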
Our theoretical findings illustrate that classical splitting can be used to avoid BPs irrespective of the number of layers. In our first experiment, we numerically showed that this holds when we set L = N. Recent findings showed that a transition to BPs happens at a depth of O(log N) for an ansatz with a local cost function [9]. Therefore, it is important to investigate the behaviour for larger values of L. For considerably low values of N (e.g. N < 32), we can assume a constant value for m (e.g. m = 4), such that m is approximately O(log N). We present variances of two ansätze (m = 4, m = N) for up to 200 layers and 16 qubits in Fig. 3. For the standard ansatz, we see a clear transition to BPs with an increasing number of layers, as expected [9]. On the other hand, the CS ansatz (m = 4) shows a robust behavior from small to large numbers of layers.
These two experiments show the potential of classical splitting in avoiding BPs. However, the question of whether this potential can be transferred into practice (e.g. binary classification performance or quantum simulation) still lacks an answer. Next, we address this question.

B. Binary classification using a classical dataset
In this experiment, we continue using the same ansatz with the same assumptions to perform binary classification using a classical dataset. Our goal here is to compare the performance of the CS ansatz to the standard case for an increasing number of qubits. We need a dataset that can be scaled for this purpose. However, datasets are typically constant in dimension and do not offer an easy way to test the scalability in this sense. Therefore, we employ an ad-hoc dataset that can be produced with different numbers of features.
Three datasets (N = 4, 8 and 16) were produced using the make_classification function of scikit-learn [40]. This tool allows us to draw samples from an N-dimensional hypercube, where samples of each class are clustered around the vertices. Each dataset contains 420 training and 180 testing samples. Each data sample was encoded using one R_Y gate per qubit, such that each ansatz uses the same number of features of the given dataset. Please see Appendix C for more details on the production of the dataset and distributions of samples.
The binary classification was performed using the expectation value of the observable defined in Eq. (14), and the binary cross entropy was used as the loss function during training, such that

$$\mathcal{L} = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}), \tag{15}$$

where y ∈ {0, 1} is the class label of the given data sample x and ŷ is the corresponding prediction. The ADAM optimizer [41] with a learning rate of 0.1 was used and all models were trained for 100 epochs using the full batch size (bs = 420). We report our results based on 50 runs for each setting. The classification performance of the ansätze for changing values of m on the three datasets is presented in Fig. 4. Here, the results show the distribution of accuracies over the test set. For the N = 4 case, we see that the standard (m = N) ansatz performs the best. However, this is not the case as we go to more qubits. For the 8 and 16 qubit cases, it is evident that m < N ansätze can match the performance of the standard ansatz. We can also see that the constant choice of m = 4 can provide a robust performance with an increasing number of qubits (at least up to N = 16), matching our expectations. Training curves of all settings are presented in Appendix D.

FIG. 4. Classification accuracy on the test set for changing values of m. Each column represents a problem with a different sample size (4, 8, 16). Each marker is placed on the median, boxes cover the range from the first to the third quartile and the error bars extend the quartiles by 3 times the range. Each m value is plotted with a different marker and color.
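The loss used for training can be sketched in a few lines. This is an illustrative snippet, not the training code of the experiments: the linear map from an expectation value in [−1, 1] to a probability in [0, 1] is an assumption made here, since the exact form of the prediction ŷ is not reproduced in the text, and both function names are hypothetical.

```python
import math


def expectation_to_probability(expval):
    """Map an expectation value in [-1, 1] to a probability in [0, 1] (assumed linear map)."""
    return (1.0 + expval) / 2.0


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross entropy for a label y in {0, 1} and a prediction y_hat in [0, 1]."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


if __name__ == "__main__":
    y_hat = expectation_to_probability(0.6)
    print(y_hat, binary_cross_entropy(1, y_hat), binary_cross_entropy(0, y_hat))
```

The clipping constant `eps` only guards against taking the logarithm of zero and does not change the loss for predictions away from 0 and 1.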

C. Binary classification using a quantum dataset
The binary classification performance of classical splitting on the classical datasets provides the first numerical evidence for its advantage over the standard ansätze. It is also important to investigate whether this extends to problems where the data consists of quantum states. Our proof in Section II assumed the input states to be tensor product states. Now, we remove this constraint and use a quantum dataset.
For this experiment, we use the NTangled dataset [36]. The NTangled dataset provides parameters to produce distributions of quantum states that are centered around different Concentrable Entanglement (CE) [42] values. CE is a measure of entanglement, which is defined as

$$\mathrm{CE}(|\Psi\rangle) = 1 - \frac{1}{2^N} \sum_{\alpha \in Q} \mathrm{Tr}\left[\rho_\alpha^2\right], \tag{16}$$

where Q is the power set of the set {1, 2, ..., N}, and ρ_α is the reduced state of the subsystems labeled by the elements of α associated to |Ψ⟩. The NTangled dataset provides three ansätze trained for different CE values for N = 3, 4 and 8. We choose the Hardware Efficient Ansatz (Fig. 1d) with depth = 5, such that the parameters of the first layer of U3 gates are sampled from a uniform distribution over [−1, 1] and the others are provided by the dataset. Then, we apply the same CS ansatz used in Section III B and perform binary classification such that the CE values are the labels of the classes. The CE distributions of the produced quantum states are presented in Appendix E.

For the binary classification task, the same training settings are used as in Section III B, except that this time models are trained for 50 epochs, as most models were able to reach 100% test accuracy. We report our results using different pairs of distributions in Table I. In the case of N = 4, we observed that classical splitting can perform at similar accuracy, even if the ansatz does not have any entangling gates (m = 1). We see that entangling gates are needed for better performance if the problem gets harder (e.g. the 0.25 vs. 0.35 case). If we go to a problem with more qubits, we can safely say that the CS ansatz can match the performance of the standard ansatz and converge faster.
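To make the CE measure concrete, the following sketch evaluates it for small pure states by brute force, summing the purities of all reduced states over the power set of qubits. It is a plain-Python illustration written for this text, not code from the NTangled dataset; the function names are hypothetical, qubits are labeled from 0, and the cost grows exponentially with the number of qubits.

```python
from itertools import combinations


def subsystem_purity(state, subset):
    """Tr[rho_alpha^2] for the reduced state on the qubits in `subset` of a pure state."""
    keep_mask = sum(1 << q for q in subset)
    # Group amplitudes by the configuration of the traced-out qubits.
    groups = {}
    for idx, amp in enumerate(state):
        groups.setdefault(idx & ~keep_mask, {})[idx & keep_mask] = amp
    # Reduced density matrix: rho[a, a'] = sum_b psi(a, b) * conj(psi(a', b)).
    rho = {}
    for amps in groups.values():
        for a, x in amps.items():
            for a2, y in amps.items():
                rho[(a, a2)] = rho.get((a, a2), 0) + x * y.conjugate()
    return sum(abs(v) ** 2 for v in rho.values())


def concentrable_entanglement(state, n_qubits):
    """CE = 1 - 2^{-N} * sum over all subsets alpha of Tr[rho_alpha^2]."""
    total = 0.0
    for r in range(n_qubits + 1):
        for subset in combinations(range(n_qubits), r):
            total += subsystem_purity(state, subset)
    return 1.0 - total / 2 ** n_qubits


if __name__ == "__main__":
    r = 2 ** -0.5
    print(concentrable_entanglement([1.0, 0.0, 0.0, 0.0], 2))  # product state |00>
    print(concentrable_entanglement([r, 0.0, 0.0, r], 2))      # Bell state
```

A product state gives CE = 0, while entangled states such as the Bell state give a strictly positive value.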

D. Practical remarks on classical splitting
The efficacy of classical splitting relies on the parts of the circuit before and after the set of gates that undergo classical splitting. This can be seen most clearly if we set m = 1 and apply classical splitting to the entire circuit after a possible initialization. In this case, we only perform single qubit operations after initialization. Hence, if the initialization produces a tensor product state, then the circuit subject to classical splitting with m = 1 can no longer generate any entanglement. Similarly, if we initialize with the HEA (Fig. 1d) and apply classical splitting with m = 1 to the remaining circuit, then no tensor product state can be found.
More generally, m = 1 produces a circuit that cannot change the amount of entanglement. For other choices of m, the picture becomes more complicated but, generally, the set of states that can be generated by the quantum circuit before classical splitting will be reduced to a subset based on the characteristics of the remaining initialization.
A naïve implementation of classical splitting therefore requires knowledge of the correct initialization such that the final solution can still be reached with the classically split circuit. In generic applications, this knowledge is likely not available. Hence, an adaptive approach to classical splitting should be considered.
One adaptive approach would be to increase m to check for improvements. After we observe no further training improvement with m = 1, we could move to m = 2. This enlarges the set of states the quantum circuit can reach, and thus may lead to further training improvements, at the cost of possibly stronger BP effects. However, if m = 1 has already converged fairly well, then the state is already fairly close to the m = 2 solution and it is unlikely to find a BP. With m = 2 converged, we can then move to m = 4 and continue the process by doubling m one step at a time.
If, for example, we consider the N = 4 "0.25 vs. 0.3" case of Table I, we may start training with m = 1. This training converges to about 90% accuracy. Increasing m to m = 2 will lead to further improvements that converge to about 98% accuracy. Finally, we can further improve the 98% to 100% accuracy by going to m = 4.
In this way, we utilize the efficiency of classical splitting to obtain an approximate solution which we then refine by trading efficiency for circuit expressivity through increasing m. At this point, the efficiency reduction should no longer lead to insurmountable complications as we already are close to the optimal solution for the current m value.
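The doubling schedule described above can be written as a short driver loop. This is a minimal sketch under stated assumptions: `train_fn` is a hypothetical user-supplied callable that trains the classically split circuit with block size m to convergence (optionally warm-started from the previous parameters) and returns the parameters and a score, and the early-stopping `target` is an addition made here for convenience. The stub accuracies only mirror the approximate values quoted for the N = 4 example and are not measured results.

```python
def adaptive_splitting(train_fn, n_qubits, m_start=1, target=None):
    """Train with block size m, doubling m until m exceeds N or `target` is reached.

    `train_fn(m, warm_start)` must return a tuple `(params, score)`, where a
    larger score is better (e.g. accuracy).
    """
    history, params, m = [], None, m_start
    while m <= n_qubits:
        params, score = train_fn(m, params)
        history.append((m, score))
        if target is not None and score >= target:
            break
        m *= 2  # enlarge the set of reachable states
    return params, history


if __name__ == "__main__":
    def stub_train(m, warm_start):
        # Illustrative accuracies only; a real train_fn would run the optimizer.
        return {"m": m}, {1: 0.90, 2: 0.98, 4: 1.00}[m]

    print(adaptive_splitting(stub_train, n_qubits=4)[1])
```

How parameters trained at block size m are mapped onto the circuit with block size 2m is left to `train_fn`, since it depends on the ansatz.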
Another adaptive approach would be to use classical splitting to check and bypass plateaus. For example, if a VQE appears to be converged, it may also just be stuck in a plateau. Applying classical splitting at this point would reduce the effect of the plateau. Thus, if the VQE continues optimizing after classically splitting a seemingly converged circuit, we can conclude that this was in fact a plateau. After a suitable number of updates using the classically split circuit, we can then return to the full circuit in the hopes of having passed the plateau.
Unfortunately, this approach cannot be used to positively distinguish between true local optima and plateaus since the classical splitting reduces expressivity and thus introduces artificial constraints. Hence, if the set of states expressible by the classically split circuit is orthogonal to the gradient in the cost function landscape, then a plateau will be replaced with a local optimum and, thus, no improvements will be obtained. In this case, we therefore cannot conclude that the VQE has converged simply because classical splitting shows no improvements. However, experimenting with different implementations of classical splitting may result in cases that do not replace the plateau with an artificial local optimum.

E. Extending classical splitting to VQE
Until now, we have investigated using classical splitting for binary classification problems. It showed an overall better training performance in Section III B and a competitive performance with faster convergence in Section III C. In this section, we consider simulating the ground state of the transverse-field Ising Hamiltonian (TFIH) on a 1D chain. The TFIH with periodic boundary conditions is defined in Eq. (17) for N lattice sites, where J determines the strength of the interactions and h determines the strength of the external field. Simulating the TFIH on a 1D chain requires connectivity of the qubits along the chain. This contradicts the assumption we made when we proved the absence of BPs for classically split ansätze in Section II, since the TFIH does not fit the definition of an m-local observable in Eq. (8). Therefore, we need to rely on numerical experiments to assess BPs under the new constraints.
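The mismatch between the TFIH and the m-local observable of Section II can be made explicit by listing the Pauli terms and checking which of them straddle two blocks. The sketch below assumes the common convention H = −J Σ Z_i Z_{i+1} − h Σ X_i with periodic boundaries; the sign and axis conventions of Eq. (17) may differ, and the helper names are hypothetical.

```python
def tfih_terms(n_sites, j=1.0, h=1.0):
    """Pauli terms (coefficient, string) of the assumed H = -J sum Z_i Z_{i+1} - h sum X_i."""
    terms = []
    for i in range(n_sites):
        zz = ["I"] * n_sites
        zz[i] = "Z"
        zz[(i + 1) % n_sites] = "Z"  # periodic boundary: site N couples back to site 1
        terms.append((-j, "".join(zz)))
    for i in range(n_sites):
        x = ["I"] * n_sites
        x[i] = "X"
        terms.append((-h, "".join(x)))
    return terms


def crosses_block_boundary(pauli, m):
    """True if the non-identity sites of `pauli` fall into more than one m-qubit block."""
    blocks = {site // m for site, p in enumerate(pauli) if p != "I"}
    return len(blocks) > 1


if __name__ == "__main__":
    for coeff, pauli in tfih_terms(8):
        if crosses_block_boundary(pauli, m=4):
            print(coeff, pauli)
```

For N = 8 and m = 4, only the two interaction terms that connect neighbouring blocks cross a boundary, but those terms are exactly what a purely classically split ansatz cannot correlate.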
The CS ansätze can only produce locally entangled states. For this reason, we need an extension of the ansatz in Fig. 1a. We propose to extend the classically split ansatz by adding standard layers at the end. The reason for adding them at the end is to keep the base of the light cones produced by the classically split layers constant. (A light cone, or causal cone, of an ansatz is an abstract concept that illustrates how information spreads as more gates are applied. The types of gates and their connectivity determine the opening angle of the cone. Evidence from the literature suggests that there is a correspondence between the opening angle of the cone, BPs and quantum circuit complexity [9,43].) Then, when we add the standard layers, the light cones grow at a pace that is determined by the newly added part. This way, the overall ansatz can still escape BPs as long as the newly added part does not exhibit BPs.

We define the extended classically split (ECS) ansatz with two types of layers. The first L layers consist of classically split m-qubit gate blocks. Then, there are T layers of any no-BP ansatz (see Fig. 1b). Since the first L layers can only produce m-local product states (i.e. m < O(log N)), the existence of BPs depends only on the remaining T layers. This way we can choose a very large L, but need to keep T small, as standard ansätze reach BPs rather rapidly (e.g. O(log N) depth for a ladder connected ansatz [9]). We provide numerical evidence for avoiding BPs with the ECS ansatz in Appendix F.
For the experiment, we consider the Hamiltonian defined in Eq. (17) with J = 1, h = 1. Then, we implement the ECS ansatz with m = 4 for total depths of 2, 4, 6 and 8. Each side of the ansatz consists of EfficientSU2 layers [37] (see Fig. 1e). The first L layers are classically split into subcircuits of m qubits, while the next T layers do not have any splitting. The total depth (D) corresponds to L + T, where T = 0 is equivalent to the CS ansatz, T = D is equivalent to the standard EfficientSU2 ansatz and other values explore hybrid use cases of the ECS ansatz. We report the energy error, which is the absolute difference between the final energy measurement and the exact ground state energy, in Fig. 5. Results of 10 runs are averaged and plotted with their minimum and maximum values as the error bars. Experiments are performed assuming no noise, using 10k shots. The SPSA optimizer [44] is used with 10k iterations. Results with m = 2 and training curves of all runs are presented in Appendices G and H.
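The relation between D, L and T can be spelled out by listing, layer by layer, which qubits share a block. This is only a bookkeeping sketch with a hypothetical helper name; it does not build EfficientSU2 circuits, it merely records the qubit groups on which each layer may contain entangling gates.

```python
def ecs_layout(n_qubits, m, depth, t):
    """Qubit groups per layer: (depth - t) classically split layers, then t full layers."""
    if n_qubits % m:
        raise ValueError("m must divide the number of qubits")
    if not 0 <= t <= depth:
        raise ValueError("t must lie between 0 and the total depth")
    split_layer = [list(range(start, start + m)) for start in range(0, n_qubits, m)]
    full_layer = [list(range(n_qubits))]
    return [split_layer] * (depth - t) + [full_layer] * t


if __name__ == "__main__":
    for t in (0, 1, 4):
        layout = ecs_layout(n_qubits=8, m=4, depth=4, t=t)
        print(t, [len(layer) for layer in layout])  # number of independent blocks per layer
```

Setting t = 0 recovers the CS ansatz (every layer is split), and t = depth recovers the standard ansatz (no layer is split).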
The upper panel shows that the mean error increases with increasing total depth in the setting without classical splitting (T = D). This is mainly due to the flattening of the cost landscape, which makes the optimization process harder. On the other hand, setting T to a low number (e.g. T = 1) provides a better error, since it preserves trainability despite the increasing total depth. This is a clear indication that classical splitting allows deeper ansätze.
The lower panel shows the best error obtained in all the runs for two settings. Here, we observe that both settings initially achieve better errors with increasing depth. Then, the no-CS setting shows rapidly increasing errors, as it loses trainability rather quickly compared to the ECS ansatz.
In this experiment, the best error was achieved with the fully classically split ansatz (T = 0). This is mainly due to the employed EfficientSU2 ansatz not being a very good choice for this particular problem. This means that by employing other ansätze, the observed behaviour might change, making a larger value of T perform the best. Nevertheless, the results are still a good indication of how the trainability of the ansatz is affected by the choice of L and T . We plan to draw a more detailed picture of the tradeoff between values of L and T in a future work.
Simulating larger systems generally requires a deep ansatz (linear or larger in system size) [1]. Although a problem-agnostic ansatz can perform well at small sizes, BPs prevent it from scaling. Our results show that the ECS ansatz can help circumvent this issue and allow deeper ansätze. Here, we have not investigated the potential of classical splitting to obtain the exact ground state energy of the model, but focused on the trainability aspect. Such a study is left as future work. Our goal here is to show that classical splitting can allow one to build wide and deep ansätze without exhibiting BPs. Faster convergence or a better final energy might be achieved with a different ansatz or optimizer, but this is out of the scope of this work.

IV. DISCUSSION
In this work, we showed both analytically and numerically that classical splitting of ansätze can be used to escape BPs. Then, we investigated whether classical splitting hinders the learning capacity of the ansatz. Our experiments showed that this is not the case: the classically split ansatz can match the performance of the standard ansatz at a low number of qubits and is potentially superior at a larger number of qubits.
In general, the benefits of classical splitting come from reducing the effective Hilbert space that the CS ansatz can explore. Classical splitting only allows the ansatz to produce m-qubit tensor product states, if the input state is also a tensor product state following our assumptions in Section II. This, as a result, reduces the expressivity of the ansatz. Nevertheless, this also allows the ansatz to avoid BPs [10] by limiting the scaling behavior to the more favorable case of m-qubit systems. With classical splitting, the exponential increase of the Hilbert space dimension is prevented and a polynomial scaling is enforced instead. For the m-local CS ansatz, each local Hilbert space has $\dim(\mathcal{H}_k) = 2^m = N^{\beta \log_\gamma 2}$. Although the advantage of using classical splitting may look trivial, there are many benefits of employing such an ansatz besides those shown in the numerical experiments of Section III.
In our binary classification experiments using a classical dataset, we relied on a data encoding with a single rotation gate per qubit. This meant that any classically split ansatz had less information in each group. This could be improved with embedding methods such as data re-uploading, where one can encode all the data points to each single qubit independently, such that there are alternating layers of rotation gates that encode the data and parametrized gates that are to be optimized [45]. Data re-uploading ansätze showed great classification performance even for a low number of qubits. Since classical splitting does not limit the number of layers, data re-uploading could be a good way to increase performance.
Classical splitting can provide faster training when used with gradient-based optimizers. In general, the exact gradients of ansätze are computed with the well-known parameter shift rule [46,47]. However, this requires two executions of the same circuit per parameter. This quickly results in a bottleneck for the optimization procedure. An ansatz with L = N layers, where each layer has N parameters, requires O(N^2) circuit executions to compute gradients for a single data sample. On the other hand, classical splitting provides cost functions that are independent of each other, as shown in Eq. (11). This allows gradients to be computed simultaneously across different instances of the classically split ansatz. As a result, optimizing the classically split ansatz requires O(N log N) circuit executions for m = O(log N).
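The counting argument can be reproduced with a toy example. The sketch below applies the parameter shift rule to a single RY rotation, for which ⟨Z⟩ = cos θ is known in closed form, and then counts circuit executions under the assumptions stated in the text: one parameter per qubit per layer, two executions per parameter, and all N/m blocks shifted simultaneously in the classically split case. The function names are hypothetical.

```python
import math


def parameter_shift(f, theta, shift=math.pi / 2):
    """Gradient of f at theta from two shifted evaluations (two circuit executions)."""
    return (f(theta + shift) - f(theta - shift)) / 2.0


def expval_z_after_ry(theta):
    """<Z> for the state RY(theta)|0>, which equals cos(theta)."""
    return math.cos(theta)


def standard_executions(n_qubits, n_layers):
    """Two executions for each of the N * L parameters of the unsplit ansatz."""
    return 2 * n_qubits * n_layers


def split_executions(m, n_layers):
    """Sequential rounds when the same parameter slot is shifted in all blocks at once."""
    return 2 * m * n_layers


if __name__ == "__main__":
    theta = 0.3
    print(parameter_shift(expval_z_after_ry, theta), -math.sin(theta))
    n = 16
    m = int(math.log2(n))
    print(standard_executions(n, n), split_executions(m, n))
```

With L = N the two counts are 2N^2 and 2N·m, which for m = log2 N matches the O(N^2) versus O(N log N) scaling stated above.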
The bottleneck in optimization is only one of the challenges of implementing scalable VQAs. Another problem worth mentioning here is the number of two-qubit gates. NISQ hardware offers limited qubit connectivity, and the topology of the devices plays an essential role in the efficient implementation of quantum circuits [48]. Typically, a quantum circuit compilation (or transpilation) procedure is required to make a given circuit compatible with the capabilities of the device (e.g., converting gates to native gates, or applying SWAP gates to connect qubits that are not physically connected) [49].
Classical splitting provides a significant reduction in the number of two-qubit gates, as it divides a large circuit into many circuits with fewer qubits. To show the scale of the reduction, we construct a set of hypothetical devices that have a 2D grid topology (square lattice with no diagonal connections). We start by considering the CS ansatz that consists of the ansätze in Fig. 1c and extend it to a fully entangled architecture. A linearly entangled ansatz has O(N) two-qubit gates per layer, while a fully entangled one has O(N^2). Then, we use Qiskit's transpiler [37] to fit these ansätze to the hypothetical devices and report the two-qubit gate counts in Table II.
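The logical gate counts behind these scalings can be sketched as follows. The sketch counts two-qubit gates per layer before transpilation, so it ignores the SWAP overhead of a grid topology and does not reproduce the transpiled counts of Table II. The function name and the equal-block assumption are hypothetical.

```python
def two_qubit_gates_per_layer(n_qubits, entanglement="linear", block_size=None):
    """Logical two-qubit gate count of one entangling layer.

    block_size=None models the original ansatz; otherwise the qubits are
    split into equal blocks of block_size qubits with no gates between blocks.
    """
    m = n_qubits if block_size is None else block_size
    if n_qubits % m != 0:
        raise ValueError("block_size must divide n_qubits in this sketch")
    n_blocks = n_qubits // m
    if entanglement == "linear":
        per_block = m - 1                 # nearest-neighbour chain
    elif entanglement == "full":
        per_block = m * (m - 1) // 2      # every pair of qubits once
    else:
        raise ValueError("entanglement must be 'linear' or 'full'")
    return n_blocks * per_block


if __name__ == "__main__":
    n, m = 16, 4
    for kind in ("linear", "full"):
        original = two_qubit_gates_per_layer(n, kind)
        split = two_qubit_gates_per_layer(n, kind, block_size=m)
        print(f"{kind:6s}  original={original:3d}  classically split={split:3d}")
```

The reduction is modest for linear entanglement and much larger for full entanglement, since the latter scales quadratically in the block size rather than in N.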
The number of gates matters not only for an efficient implementation but also for the precision of the results, since NISQ devices come with noisy gates. We consider the CX gate errors reported by IBM for their devices, which can be taken as O(10^-2) on average. Then, as a figure of merit, we can assume 50% to be the limit (…). A similar reduction in noise is also possible for other types of circuit partitioning methods [50].
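One way to turn a gate error rate and a 50% limit into a gate budget is sketched below. It assumes independent gate errors, so that the probability of an error-free circuit with n two-qubit gates is roughly (1 - ε)^n. This error model and the function name are assumptions introduced here for illustration, not part of the analysis above.

```python
import math


def max_gates_within_budget(gate_error=1e-2, min_success=0.5):
    """Largest n such that (1 - gate_error)**n >= min_success.

    Assumes independent errors; gate_error and min_success are illustrative.
    """
    if not 0.0 < gate_error < 1.0:
        raise ValueError("gate_error must be in (0, 1)")
    if not 0.0 < min_success <= 1.0:
        raise ValueError("min_success must be in (0, 1]")
    keep = 1.0 - gate_error
    n = math.floor(math.log(min_success) / math.log(keep))
    # Guard against floating-point rounding at the boundary.
    while keep ** (n + 1) >= min_success:
        n += 1
    while n > 0 and keep ** n < min_success:
        n -= 1
    return n


if __name__ == "__main__":
    print(max_gates_within_budget(1e-2, 0.5))
```

Under this simple model, a budget of this kind is exhausted much later by a classically split circuit, because each block contains far fewer two-qubit gates than the original ansatz.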
Classically splitting an ansatz further allows faster implementation on hardware. A generic ansatz consists of two-qubit gates that follow one another, matching a certain layout. We mentioned some of these layouts as ladder/linear or full. However, this means that the hardware implementation of such an ansatz requires these gates to be executed sequentially, which takes a significant amount of time. To overcome such obstacles, ansätze such as the HEA (see Fig. 1d) are widely used in the literature [15]. Classically splitting an ansatz can reduce the implementation time significantly, since it allows simultaneous two-qubit gates across different local circuits. This can mean a speed-up ranging from O(N/log N) to O((N/log N)^2), depending on the connectivity of the original ansatz.
Finally, the formulation we used in Section III B allows the CS ansatz to be implemented on several smaller quantum computers instead of a single large one. This means that, for similar problems, there are many implementation options available. These include using one large device, using many small devices (e.g., O(N/log N) devices with O(log N) qubits each) and parallelizing the task, or using one small device and performing all computations sequentially. All of these features make classical splitting an ideal approach for Quantum Machine Learning (QML) applications using NISQ devices.

V. CONCLUSION
In this work, we presented some foundational ideas of applying classical splitting to generic ansätze. Our results indicate many benefits of using classical splitting, such as better trainability, faster hardware implementation, faster convergence, robustness against noise, and parallelization under certain conditions. These suggest that classical splitting, or variations of this idea, might play an essential role in how we design ansätze for QML problems. We also presented an extension of the initial classical splitting idea so that these types of ansätze can be used in VQE. The initial results that we presented in this work suggest that classical splitting can help improve trainability and reach better error values. However, it is still an open question to what extent VQE can benefit from classical splitting. Our results encourage employing approaches that are based on classically splitting or partitioning parametrized quantum circuits [28-35], as they are in general more robust against hardware noise. We consider in-depth analysis and applications with VQE and QAOA as future directions for this work.

Appendix A:
When analyzing the size of the gradients of an ansatz, we need tools that allow integration over all states reachable by the ansatz in the d-dimensional Hilbert space. This can be achieved by using the Haar measure, which is an invariant measure over the SU(d) group. An ensemble of unitary operators U is called a unitary t-design if it matches the Haar measure µ(U) up to polynomial order t. The defining relation, Eq. (A1), equates the expectation over the ensemble U, where unitary V_i is sampled with probability p_i, with the corresponding integral over the Haar measure.
To perform symbolic integration over the Haar measure, we need some properties of the measure [51]: the first moment, Eq. (A2), where d = 2^N is the dimension of the unitary and N is the number of qubits, and the second moment, Eq. (A3). From these, one can derive the identities in Eqs. (A4)-(A6) for integrals over the Haar measure [6,9,10].
Now, we can use these identities to compute the average value of the gradients. Let us start by recalling the definitions we used before. The ansatz is composed of consecutive parametrized (V) and non-parametrized entangling (W) layers. We define U_l(θ_l) = exp(−iθ_l V_l), where V_l is a Hermitian operator and W_l is a generic unitary operator. Then, the circuit ansatz can be expressed as a product of layers, Eq. (A7). For an observable O and an input state ρ, the cost function is given by Eq. (A8). The ansatz can be separated into two parts to investigate a certain layer, such that U_- ≡ ∏_{l=1}^{j−1} U_l(θ_l) W_l and U_+ ≡ ∏_{l=j}^{L} U_l(θ_l) W_l. Then, the gradient with respect to the j-th parameter can be expressed as in Eq. (A9) [6].
The expected value of the gradient can then be computed with the Haar integral, Eqs. (A10) and (A11), where we use Eq. (A4) to obtain Eq. (A10) and the fact that the trace of a commutator is zero to obtain Eq. (A11). This proves that the gradients are centered around zero. The variance of the gradient can therefore inform us about the size of the gradients.
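For reference, the Haar identities and the gradient expressions used in this argument can be written out as follows. This is a reconstruction in the form in which these relations are commonly stated in [6,9,10]. The labels (A4)-(A6) and (A9)-(A11) are inferred from how the text refers to them, and U_- is assumed to form at least a 1-design for the average.

```latex
% Haar-integral identities (cf. [6,9,10]); equation labels inferred from the text.
\begin{align}
\int d\mu(U)\, \mathrm{Tr}\!\left[U A U^\dagger B\right]
  &= \frac{\mathrm{Tr}[A]\,\mathrm{Tr}[B]}{d}, \tag{A4} \\
\int d\mu(U)\, \mathrm{Tr}\!\left[U A U^\dagger B\, U C U^\dagger D\right]
  &= \frac{\mathrm{Tr}[A]\mathrm{Tr}[C]\mathrm{Tr}[BD]
         + \mathrm{Tr}[AC]\mathrm{Tr}[B]\mathrm{Tr}[D]}{d^2-1} \nonumber \\
  &\quad - \frac{\mathrm{Tr}[AC]\mathrm{Tr}[BD]
         + \mathrm{Tr}[A]\mathrm{Tr}[B]\mathrm{Tr}[C]\mathrm{Tr}[D]}{d(d^2-1)}, \tag{A5} \\
\int d\mu(U)\, \mathrm{Tr}\!\left[U A U^\dagger B\right]
               \mathrm{Tr}\!\left[U C U^\dagger D\right]
  &= \frac{\mathrm{Tr}[A]\mathrm{Tr}[B]\mathrm{Tr}[C]\mathrm{Tr}[D]
         + \mathrm{Tr}[AC]\mathrm{Tr}[BD]}{d^2-1} \nonumber \\
  &\quad - \frac{\mathrm{Tr}[AC]\mathrm{Tr}[B]\mathrm{Tr}[D]
         + \mathrm{Tr}[A]\mathrm{Tr}[C]\mathrm{Tr}[BD]}{d(d^2-1)}. \tag{A6}
\end{align}

% Gradient of C(\theta) = Tr[O U(\theta) \rho U^\dagger(\theta)] and its Haar average over U_-.
\begin{align}
\partial_j C
  &= i\, \mathrm{Tr}\!\left[\left[V_j,\, U_+^\dagger O U_+\right]
        U_- \rho\, U_-^\dagger\right], \tag{A9} \\
\langle \partial_j C \rangle
  &= \int d\mu(U_-)\, \partial_j C
   = \frac{i}{d}\, \mathrm{Tr}[\rho]\,
     \mathrm{Tr}\!\left[\left[V_j,\, U_+^\dagger O U_+\right]\right] \tag{A10} \\
  &= 0. \tag{A11}
\end{align}
```

Setting A = B = C = D equal to the identity gives d for (A5) and d^2 for (A6), which is a quick consistency check on the prefactors.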
The variance is defined in Eq. (A12). We can compute its expected value using the same logic, which gives Eqs. (A13)-(A16): we use Eq. (A6) to obtain Eq. (A14), and then the fact that the commutator is traceless to obtain Eq. (A16). To compute the integral in Eq. (A16), we need another identity from [10]. The variance then splits into two integrals: the first can be computed using Eq. (A5) and the second using Eq. (A4). Finally, the asymptotic behaviour of the variance can be expressed in terms of d = 2^N. Thus, the variance vanishes exponentially with respect to N.
[Figure legend: N=20, m=20, L=2; N=20, m=20, L=20; N=20, m=4, L=20]
FIG. 14. The variance of the change in cost as a function of the number of qubits for varying values of L and T with L + T = D = 6 (legend: L=0, T=6; L=1, T=5; L=2, T=4; L=3, T=3; L=4, T=2; L=5, T=1). The cost function is the TFIH Hamiltonian defined in Eq. (17) and the ansatz is the ECS ansatz with EfficientSU2 sub-blocks (see …).
FIG. 16. Energy error curves for the TFIH with N = 12 using the extended classical splitting (ECS) ansatz with EfficientSU2 sub-blocks (see Fig. 1b). Columns correspond to ansätze with m = 2 and m = 4, respectively (panels: L+T=8, m=2 and L+T=8, m=4; legend: L=0, T=8 through L=8, T=0). Each row shows results with increasing total depth D, such that L + T = D. Energy errors of 10 runs are averaged and their mean is presented. The energy error is the absolute difference between the measured energy and the exact ground state energy. It becomes harder to optimize an ansatz with no classical splitting (L = 0) as the depth increases. However, the optimization does not become as hard if we set T to a small value, e.g., 1, and employ classical splitting. We reach similar conclusions with m = 2 and m = 4.