Shuffle-QUDIO: accelerate distributed VQE with trainability enhancement and measurement reduction

The variational quantum eigensolver (VQE) is a leading strategy that exploits noisy intermediate-scale quantum (NISQ) machines to tackle chemical problems, potentially outperforming classical approaches. To gain such computational advantages on large-scale problems, a feasible solution is the QUantum DIstributed Optimization (QUDIO) scheme, which partitions the original problem into $K$ subproblems and allocates them to $K$ quantum machines, followed by parallel optimization. Despite the provable acceleration ratio, the efficiency of QUDIO may be heavily degraded by the synchronization operation. To conquer this issue, here we propose Shuffle-QUDIO, which introduces shuffle operations over the local Hamiltonians during the quantum distributed optimization. Compared with QUDIO, Shuffle-QUDIO significantly reduces the communication frequency among quantum processors and simultaneously achieves better trainability. In particular, we prove that Shuffle-QUDIO enables a faster convergence rate than QUDIO. Extensive numerical experiments verify that Shuffle-QUDIO allows both a wall-clock time speedup and a low approximation error in the task of estimating the ground-state energy of molecules. We empirically demonstrate that our proposal can be seamlessly integrated with other acceleration techniques, such as operator grouping, to further improve the efficacy of VQE.

Although VQAs promise practical applications of NISQ machines, they are challenged by the scalability issue. The required number of measurements for VQAs scales with O(poly(n, 1/ε)), with n being the problem size and ε being the tolerable error, which implies an expensive runtime for large-scale problems. One canonical instance is the variational quantum eigensolver (VQE) [42], which is developed to estimate the low-lying energies and corresponding eigenstates of molecular systems. VQE contains two key steps. First, the electronic Hamiltonian is reformulated into the qubit Hamiltonian H = Σ_{i=1}^{M} α_i H_i through Jordan-Wigner, Bravyi-Kitaev, or parity transformations [43][44][45], where H_i ∈ {σ_X, σ_Y, σ_Z, σ_I}^{⊗n} and α_i ∈ R for ∀i ∈ [M], and M is the number of Pauli operators. The property of H is then estimated by a variational quantum circuit whose parameters are updated by a classical optimizer. In principle, it requires O(poly(M, 1/ε)) queries to the quantum circuit in each iteration to collect the updating information [46]. In this regard, VQEs for large-scale molecules request an intractable time expense on measurements. This scalability issue impedes the journey of VQEs toward quantum advantages.
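As a concrete toy illustration of this decomposition, the following sketch evaluates Tr(ρH) for a qubit Hamiltonian written as a weighted sum of Pauli strings. The two-qubit Hamiltonian and the state below are hypothetical examples for illustration, not a real molecular Hamiltonian.

```python
import numpy as np
from functools import reduce

# Single-qubit Pauli matrices sigma_I, sigma_X, sigma_Y, sigma_Z
PAULI = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_string(label):
    """Tensor product of single-qubit Paulis, e.g. 'XZ' -> sigma_X (x) sigma_Z."""
    return reduce(np.kron, [PAULI[c] for c in label])

def energy(rho, terms):
    """<H> = sum_i alpha_i Tr(rho H_i) for a Hamiltonian given as
    a list of (coefficient, Pauli-string) pairs."""
    return sum(a * np.trace(rho @ pauli_string(s)).real for a, s in terms)

# Toy 2-qubit Hamiltonian H = 0.5 ZI - 0.3 XX, evaluated on |00><00|
terms = [(0.5, "ZI"), (-0.3, "XX")]
rho = np.zeros((4, 4), dtype=complex)
rho[0, 0] = 1.0
print(energy(rho, terms))   # 0.5
```

Each Pauli string here corresponds to one measurement setting, which is exactly why the number of terms M drives the measurement cost discussed above.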
Approaches for reducing the computational overhead of quantum measurements in VQE can be roughly classified into five categories: operator grouping [47][48][49][50], ansatz adjustment [51,52], shot allocation [53][54][55], classical shadows [56,57], and distributed optimization [58][59][60][61]. Specifically, the operator grouping strategy focuses on finding the commutativity between the local Hamiltonian terms {H_i} in H. The commuting Hamiltonians can be evaluated by the same measurements, which enables the measurement reduction [19,[47][48][49][50]]. Ansatz adjustment aims to tailor the layout of the ansatz to reduce the circuit depth [17,51,62] or the number of qubits [52]. For example, Ref. [51] attempts to assign two qubits with stronger mutual information to adjacent locations with direct connectivity on the physical quantum chip, leading to shallower circuits than the original VQEs for reaching a desired accuracy. Shot allocation aims to assign the number of shots among {H_i} in a more intelligent way. A typical solution is to allocate more shots to the terms with a larger coefficient |α_i| and a larger variance of H_i. Another measurement reduction method, classical shadows, constructs an approximate classical representation of a quantum state based on a few measurements of the state [56]. With this representation, O(log(M)) measurements are enough to estimate the expectation value of the whole observable with high precision.
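To make the grouping idea concrete, the sketch below greedily packs Pauli strings into qubit-wise commuting groups, the simplest commutativity criterion used in the grouping literature. The Pauli strings are toy examples, and the greedy packing is only one of several possible heuristics.

```python
def qubitwise_commute(p, q):
    """Two Pauli strings qubit-wise commute if, on every qubit,
    the operators are equal or at least one of them is the identity."""
    return all(a == b or a == "I" or b == "I" for a, b in zip(p, q))

def greedy_group(strings):
    """Greedily pack Pauli strings into qubit-wise commuting groups;
    each group can then be estimated from a single measurement setting."""
    groups = []
    for s in strings:
        for g in groups:
            if all(qubitwise_commute(s, t) for t in g):
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

# Five toy terms collapse into two measurement settings
groups = greedy_group(["ZZ", "ZI", "IZ", "XX", "XI"])
print(groups)
```

Here the five strings split into {ZZ, ZI, IZ} and {XX, XI}, so the number of distinct measurement settings drops from five to two.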
On par with engineering the quantum part, we can accelerate the optimization of VQE by using multiple quantum processors (workers), inspired by the success of distributed optimization in deep learning and the growing number of available quantum chips. There are generally two types of distributed VQAs. The first paradigm decomposes the primal quantum system into multiple smaller circuits and runs them in parallel [63,64]. The second paradigm utilizes a quantum cloud server in which the problem Hamiltonian is pre-divided into several partitions that are distributed to Q local quantum workers. Each worker estimates the expectation value of its partial local Hamiltonians with no more than O(poly(M/Q)) queries and delivers the result to the remaining workers after a single iteration. Notably, such a methodology inevitably encounters the communication bottleneck, quantum circuit noise, and the risk of privacy leakage. As such, Ref. [60] devises the QUantum DIstributed Optimization (QUDIO), a novel distributed-VQA scheme with lazy communication, to address this issue. Unfortunately, the naive allocation method is not suitable for VQEs, since the coefficients {α_i} of the local Hamiltonian terms {H_i} vary, leading to unbalanced contributions to the overall variance of the Hamiltonian estimation. Such an estimation error can be exacerbated by an increased communication interval, which renders a trade-off between the acceleration ratio and the approximation error of VQEs.
To maximally suppress the negative effects of a large communication interval on the convergence rate, here we propose a new quantum distributed optimization framework, called Shuffle-QUDIO. Different from QUDIO, for every local worker, the local Hamiltonian terms are randomly shuffled and sampled without replacement according to the worker's rank before each iteration. From the statistical view, this operation alleviates the issue that every local worker only observes an incomplete set of local Hamiltonians during the optimization. Moreover, the dynamic allocation of Hamiltonian terms alleviates the accumulated deviation with respect to the target Hamiltonian H after a large number of local updates. In this way, Shuffle-QUDIO achieves performance improvements while keeping the communication cost low. Another advantage of our proposal is its compatibility with all types of quantum hardware. This assures its potential for unifying existing quantum devices to accelerate the training of VQEs.
To theoretically exhibit the advantage of our proposal, we prove that Shuffle-QUDIO allows a faster convergence rate than QUDIO. By leveraging non-convex optimization theory, we show that the dominant factors affecting the convergence rate are the number of distributed quantum machines K, the number of local updates (communication interval) W, and the number of global iterations T, i.e., O(poly(W, K, 1/T)). To benchmark the performance of Shuffle-QUDIO, we conduct systematic numerical experiments on VQEs under both fault-tolerant and noisy scenarios. The achieved results confirm that Shuffle-QUDIO achieves a smaller approximation error than QUDIO, as well as a lower communication overhead between clients and server, and a sub-linear speedup ratio. In addition, we demonstrate that the performance of Shuffle-QUDIO under the noisy setting can be further boosted by combining it with the advanced operator grouping strategy.
The remaining parts of this paper are organized as follows. Section II briefly introduces the preliminary knowledge about the optimization of variational quantum circuits. Section III presents the pipeline of the proposed algorithm and the convergence analysis. Section IV exhibits the numerical simulation results. Section V gives a summary and discusses the outlook.

II. PRELIMINARY
The essence of VQE is tuning an n-qubit parameterized quantum state ρ(θ) = |ψ(θ)⟩⟨ψ(θ)| with θ ∈ R^P to minimize the energy of a problem Hamiltonian H = Σ_{i=1}^{M} α_i H_i (Eq. (1)), where H_i refers to the i-th local Hamiltonian term with the weight α_i. The energy minimization is formulated by the loss function L(θ, H) = Tr(ρ(θ)H) = Σ_{i=1}^{M} α_i Tr(ρ(θ)H_i). With a slight abuse of notation, we absorb the weight α_i into H_i and simplify the above loss function as L(θ, H) = Σ_{i=1}^{M} Tr(ρ(θ)H_i).
The parameterized quantum state is prepared by an ansatz with |ψ(θ)⟩ = U(θ)|φ⟩ and |φ⟩ being an initial quantum state. A generic form of U(θ) is U(θ) = Π_{i=1}^{P} e^{−iθ_i O_i} U_e (Eq. (2)), where O_i is a Hermitian matrix and U_e denotes a fixed unitary composed of multi-qubit gates. By iteratively updating the circuit parameters θ to minimize the loss, the quantum state ρ(θ) is expected to approach the eigenstate of H with the minimum eigenvalue.
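A minimal numerical sketch of such an ansatz might look as follows: a single hardware-efficient layer on two qubits, with RY rotations playing the role of the parameterized gates e^{−iθ_i O_i} and a CNOT as the fixed entangler U_e. The circuit layout is illustrative only, not the exact ansatz of Fig. 3.

```python
import numpy as np

def ry(theta):
    """Single-qubit rotation e^{-i (theta/2) sigma_Y}."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# Fixed entangling unitary U_e: CNOT with qubit 0 as control
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def ansatz(thetas):
    """One hardware-efficient layer on two qubits applied to |00>:
    RY(theta_0) (x) RY(theta_1), followed by the fixed CNOT."""
    U = CNOT @ np.kron(ry(thetas[0]), ry(thetas[1]))
    psi0 = np.array([1.0, 0.0, 0.0, 0.0])
    return U @ psi0

psi = ansatz([0.0, 0.0])   # with zero angles the state stays |00>
```

Stacking several such layers (rotation sub-layer plus entangler) and feeding the resulting state into the loss L(θ, H) reproduces the generic structure of Eq. (2).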

A. Optimization of VQE
Gradient descent (GD) based optimizers are widely used in the previous literature on VQE. The parameters θ_{t+1} at the (t+1)-th iteration are updated along the steepest descent direction with learning rate η, i.e., θ_{t+1} = θ_t − η∇L(θ_t, H) (Eq. (3)). Unlike classical neural networks, which utilize gradient back-propagation to update parameters [65], VQE adopts the parameter-shift rule [66,67] to obtain an unbiased estimation of the gradient. The gradient with respect to the i-th parameter is ∂L(θ, H)/∂θ_i = [L(θ + (π/2)e_i, H) − L(θ − (π/2)e_i, H)]/2, where e_i denotes the indicator vector for the i-th element of the parameter vector θ. When the number of trainable parameters is P, the required number of measurements to complete the gradient computation scales with O(poly(P M)) without applying any measurement reduction strategy.
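The rule can be checked numerically on a toy circuit: a single RY rotation measured in the Z basis, for which the loss is L(θ) = cos(θ) and the exact derivative is −sin(θ). The circuit is an assumption chosen for illustration.

```python
import numpy as np

def expectation(theta):
    """<psi(theta)| Z |psi(theta)> for |psi(theta)> = RY(theta)|0>,
    which equals cos(theta)."""
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return psi[0] ** 2 - psi[1] ** 2

def parameter_shift_grad(f, theta, shift=np.pi / 2):
    """Gradient from two shifted circuit evaluations:
    dL/dtheta = [f(theta + s) - f(theta - s)] / (2 sin s)."""
    return (f(theta + shift) - f(theta - shift)) / (2 * np.sin(shift))

theta = 0.7
g = parameter_shift_grad(expectation, theta)
# g matches the analytic derivative -sin(0.7)
```

On hardware, each of the two shifted evaluations is itself a finite-shot estimate, which is why P parameters and M Pauli terms together cost O(poly(PM)) measurements per gradient.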

B. Optimization of the distributed VQE
To accelerate the training of VQAs, Ref. [60] proposed the QUantum DIstributed Optimization (QUDIO) scheme. The key idea of QUDIO is to partition the problem Hamiltonian H in Eq. (1) into several groups and distribute them to multiple quantum processors to be manipulated in parallel. Mathematically, suppose that there are K available quantum processors {Q_i}_{i=1}^{K}; the Hamiltonian terms {H_j}_{j=1}^{M} are evenly partitioned into K subgroups {S_i}_{i=1}^{K}. In the initialization process, the i-th subgroup S_i is allocated to the i-th quantum processor Q_i. All local processors share the same initial parameters θ_0 with θ_i^{(0,0)} = θ_0 for ∀i ∈ [K]. The subsequent training process alternately switches between local updates and global synchronization. During the phase of local updates, each quantum processor follows the gradient descent rule to update the parameters to minimize the local loss function L(θ_i, H_{S_i}) = Σ_{j∈S_i} Tr(ρ(θ_i)H_j), i.e., the parameters of the i-th processor at the (t, w)-th step are updated as in Eq. (3). After fulfilling W local updates, the parameters from all distributed quantum processors are synchronized by averaging, i.e., θ^{(t+1,0)} = (1/K) Σ_{i=1}^{K} θ_i^{(t,W)}. The above two phases are repeated until the termination conditions (e.g., the maximum number of iterations) are met, and the synchronized parameters are returned as the final parameters.
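The local-update/synchronization loop of QUDIO can be sketched on a toy quadratic loss. The `grad_fn`, the term coefficients, and the hyper-parameters below are illustrative placeholders for the quantum gradient estimation, not the actual VQE pipeline.

```python
import numpy as np

def qudio(grad_fn, partitions, theta0, eta=0.1, W=4, T=10):
    """Sketch of QUDIO: K workers hold static Hamiltonian partitions,
    run W local gradient steps each, then synchronize by averaging.
    grad_fn(theta, partition) returns the local-loss gradient."""
    K = len(partitions)
    theta = np.array(theta0, dtype=float)
    for t in range(T):
        locals_ = [theta.copy() for _ in range(K)]
        for k in range(K):                    # runs in parallel on real hardware
            for w in range(W):
                locals_[k] -= eta * grad_fn(locals_[k], partitions[k])
        theta = np.mean(locals_, axis=0)      # global synchronization
    return theta

# Toy problem: L(theta) = sum_i a_i theta^2, coefficients split over 2 workers
coeffs = [[1.0], [0.5]]
grad_fn = lambda th, part: 2 * sum(part) * th
theta_star = qudio(grad_fn, coeffs, [1.0])    # converges toward the minimum 0
```

Note how the static partitions give the two workers differently scaled gradients, which is exactly the source of the unbalanced-contribution issue discussed above.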
Ignoring the communication overhead among quantum processors, QUDIO with W = 1 is expected to linearly accelerate the optimization of VQE. However, the communication bottleneck could degrade the acceleration efficiency. An optional solution is to increase W to reduce the communication frequency. Yet, as indicated in [60], the performance of VQA witnesses a rapid drop with increased W.

III. SHUFFLE-QUDIO FOR VQE
The performance of QUDIO suffers from a high sensitivity to the communication interval. Intuitively, this issue originates from the fact that each quantum processor in QUDIO only perceives a static subset of the whole observable set during the entire training process. The i-th processor updates its local parameters based on these partial observations before communicating with the other processors. Meanwhile, the coefficients {α_i} and the variances of the Pauli operators {H_i} differ from each other, leading to different contributions to the expectation estimation of H. As a result, a local processor fails to characterize the full property of the problem Hamiltonian H. With multiple local updates, the accumulation of bias further degrades the performance of QUDIO. To tackle this issue, here we devise a novel quantum distributed optimization scheme, called Shuffle-QUDIO, which avoids the performance drop when synchronizing at a low frequency.

A. Algorithm descriptions
The paradigm of Shuffle-QUDIO is depicted in Fig. 1 and consists of three steps.

1. Initialization. The variational quantum circuit U(θ) in Eq. (2) of each quantum processor is initialized with the same parameters θ_i^{(0,0)} = θ_0 for i ∈ {1, ..., K}, and all local Hamiltonian terms {H_i} are distributed among the processors.

2. Local updates. Each processor independently updates the parameters θ_i^{(t,w)} following the gradient descent principle. First, Shuffle-QUDIO randomly shuffles the sequence of local Hamiltonian terms. Note that the random numbers of all processors are generated from the same random seed. Denoting the permutation vector by π^{(t,w)}, the visible Hamiltonians for the i-th processor at the (t,w)-th step are the i-th subgroup of the permuted sequence {H_{π^{(t,w)}(j)}}_{j=1}^{M}. Then each processor estimates the gradient g_i^{(t,w)}, where g denotes the gradient estimated on the quantum device with a finite number of measurements, while ∇L refers to the corresponding accurate gradient. The parameters are updated as θ_i^{(t,w+1)} = θ_i^{(t,w)} − η g_i^{(t,w)}, where η is the learning rate. The above local updates are repeated for W local steps.
3. Global synchronization. Once the local updates are completed, the central server synchronizes the parameters among all quantum processors in an averaged manner, i.e., θ^{(t+1,0)} = (1/K) Σ_{i=1}^{K} θ_i^{(t,W)}. If the number of global iterations reaches T, the parameters θ_T are returned as the output; otherwise, return to step 2.
The pseudo code of Shuffle-QUDIO is summarized in Fig. 2. Compared with conventional VQE, which sequentially measures the expectation value of every single observable, the strategy of distributed parallel optimization accelerates the estimation of the complete set of observables by K times. Furthermore, the shuffling operation alleviates the deviation of the optimization direction during the local updates and thus warrants a more stable performance when the communication interval increases. This is because, in a statistical view, each processor can leverage the information of all local Hamiltonian terms to update its local parameters during the training process.
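Steps 1-3 can be sketched end-to-end on the same kind of toy quadratic loss used before. The gradient function and the term coefficients are illustrative stand-ins for quantum measurements, and the slicing `perm[k::K]` is one possible rank-based assignment of the shuffled terms.

```python
import numpy as np

def shuffle_qudio(grad_fn, terms, K, theta0, eta=0.1, W=4, T=20, seed=0):
    """Sketch of Shuffle-QUDIO: before every local step, the Hamiltonian
    terms are reshuffled with a seed shared by all K workers, so each
    worker sees a fresh, non-overlapping slice of H at every step."""
    rng = np.random.default_rng(seed)          # same seed on every processor
    theta = np.array(theta0, dtype=float)
    M = len(terms)
    for t in range(T):
        locals_ = [theta.copy() for _ in range(K)]
        for w in range(W):
            perm = rng.permutation(M)          # shared permutation pi^{(t,w)}
            for k in range(K):                 # parallel on real hardware
                part = [terms[j] for j in perm[k::K]]   # rank-based slice
                locals_[k] -= eta * grad_fn(locals_[k], part)
        theta = np.mean(locals_, axis=0)       # global synchronization
    return theta

# Toy quadratic loss: L(theta) = sum_i a_i theta^2 with four "terms"
terms = [1.0, 0.5, 0.3, 0.2]
theta_final = shuffle_qudio(lambda th, part: 2 * sum(part) * th,
                            terms, K=2, theta0=[1.0])
```

Because the K slices of each permutation partition the full term list, every worker's gradient averages to the full gradient over the shuffles, which is the statistical intuition behind Lemma 1.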
Lemma 1. Let ρ(θ) be an n-qubit quantum state parameterized by θ. For any k ∈ {1, ..., M}, let H_{π(1)}, ..., H_{π(k)} be uniformly sampled without replacement from {H_i}_{i=1}^{M}. Then the gradient of the corresponding partial loss satisfies E_π[∇L(θ, {H_{π(j)}}_{j=1}^{k})] = (k/M) ∇L(θ, H), i.e., it is proportional to the exact gradient in expectation.
Refer to Appendix B for proof details. Lemma 1 implies that the direction of the expected gradient of each local quantum processor in Shuffle-QUDIO is unbiased. This guarantees that the local quantum circuits are individually optimized along the right direction even when they do not communicate frequently with each other, which narrows the performance gap between a single processor and the synchronized model.


B. Convergence analysis
We next show the convergence guarantee of Shuffle-QUDIO. When running VQE on NISQ devices, the system imperfection introduces noise into the optimization. To this end, we consider the worst scenario in the convergence analysis, where the system noise is modeled by the depolarizing channel. Mathematically, the depolarizing channel N_p transforms the quantum state ρ ∈ C^{2^n × 2^n} to N_p(ρ) = (1 − p)ρ + pI/2^n; with increasing noise strength p, the quantum state eventually evolves to the maximally mixed state. As proved in [68], the depolarizing channels applied at each circuit depth can be merged into a single channel at the end of the quantum circuit. Therefore, without loss of generality, the estimated gradient with respect to the i-th parameter is the ideal gradient attenuated by the factor (1 − p) (see Lemma 6 in the Appendix). The convergence rate of Shuffle-QUDIO is summarized by the following theorem, whose proof is provided in Appendix D.
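The effect of the merged depolarizing channel on expectation values (and hence, via the parameter-shift rule, on gradients) can be verified numerically. The single-qubit state and observable below are toy choices for illustration.

```python
import numpy as np

def depolarize(rho, p):
    """Global depolarizing channel N_p(rho) = (1 - p) rho + p I / 2^n."""
    d = rho.shape[0]
    return (1 - p) * rho + p * np.eye(d) / d

Z = np.diag([1.0, -1.0])                         # traceless observable
rho = np.array([[1.0, 0.0], [0.0, 0.0]])         # |0><0|, ideal <Z> = 1
p = 0.2
noisy = np.trace(depolarize(rho, p) @ Z).real    # (1 - p) * 1 = 0.8
```

For any traceless observable, Tr(N_p(ρ)H) = (1 − p) Tr(ρH), so the finite differences in the parameter-shift rule, and therefore the gradients, are rescaled by the same factor (1 − p).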
Theorem 1. Let the gradient of the loss function L be F-Lipschitz continuous, G be the upper bound of the gradient norm, η be the learning rate of the optimizer, p be the strength of the depolarizing noise, and K and W be the number of distributed quantum processors and local iterations, respectively. Then the convergence rate of Shuffle-QUDIO in the noisy scenario scales as O(poly(W, K, 1/T)), where the constants depend on F, G, η, and p.
Theorem 1 reveals that an increased quantum noise rate p leads to poor convergence of Shuffle-QUDIO, which emphasizes the significance of error mitigation [69][70][71][72] in quantum optimization. Meanwhile, a shorter communication interval W among distributed quantum processors guarantees a better performance of Shuffle-QUDIO. Note that although a large W still hinders the distributed optimization, Shuffle-QUDIO is considerably less sensitive to W than QUDIO. From the technical view, although the proof of Theorem 1 is derived from the classical results on local SGD [73], there are some key differences. First, in classical local SGD, each worker independently samples a mini-batch from the whole dataset without other limitations. By contrast, the distributed quantum processors randomly sample the local Hamiltonian terms without replacement in each local iteration, which means that the Hamiltonian terms of different processors do not overlap and their union exactly constitutes the complete molecular Hamiltonian. This special sampling method guarantees the integrity of the problem Hamiltonian but poses a challenge for theoretical analysis. Second, our analysis does not rely on strong assumptions, such as convexity or the Polyak-Lojasiewicz (PL) condition [74]. Furthermore, the quantum noise in NISQ devices inevitably shifts the quantum state and biases the estimated gradients, which differentiates VQE from classical machine learning.

IV. NUMERICAL RESULTS
To verify the effectiveness of Shuffle-QUDIO, we apply it to estimate the ground state with the lowest energy for several molecules. The Jordan-Wigner transformation [45] is employed to transform these electronic Hamiltonians into qubit Hamiltonians represented by Pauli operators. For example, the LiH system is fully described by 12 qubits and 631 local Pauli terms in {σ_I, σ_X, σ_Y, σ_Z}^{⊗12}. The ansatz is designed in a hardware-efficient style inspired by [19], whose layout is shown in Fig. 3. We conduct the numerical experiments on a classical device with an Intel(R) Xeon(R) Gold 6267C CPU @ 2.60GHz and 128 GB memory. For each setting, the experiment is repeated five times with different random seeds to mitigate the effect of randomness. Stochastic gradient descent is used to update the trainable parameters, where the learning rate is set to η = 0.4. The label 'W = a' means that the number of local iterations is a. The label 'linear speedup' represents the reference line of linear speedup.

A. Acceleration ratio
We first explore the speedup of Shuffle-QUDIO. Specifically, under the setting of K quantum processors and W local iterations, we record the time spent per iteration as t_1^{K,W} and the time spent training the model to reach a specific precision as t_2^{K,W}. The speedup in time, s_1^{K,W}, and the speedup to accuracy, s_2^{K,W}, are defined as the ratios to the single-processor setting, i.e., s_1^{K,W} = t_1^{1,W}/t_1^{K,W} and s_2^{K,W} = t_2^{1,W}/t_2^{K,W}, respectively. Fig. 4 demonstrates the results when solving the ground state of LiH. As shown in the left panel of Fig. 4, with a growing number of distributed quantum processors, the metric s_1 first witnesses a linear growth (from K = 1 to K = 4, s_1 rises from 1 to 3) and then gradually flattens (from K = 4 to K = 16, s_1 rises from 3 to around 4) because of the communication bottleneck for a large number of nodes. To alleviate the communication bottleneck, we can increase W to reduce the communication frequency and hence further improve the metric s_1. However, an overwhelmingly large W may lead to poor convergence and thus deteriorate the speedup to accuracy s_2. As indicated by the right panel of Fig. 4, when W ≥ 16, the speedup to accuracy s_2 suffers a rapid drop, which results from the worse convergence brought by the small number of global synchronizations.

B. Sensitivity to communication frequency
We compare QUDIO with Shuffle-QUDIO to show how the increased number of local iterations W affects their performance under the ideal scenario. The LiH molecule with varied inter-atomic length, i.e., from 0.3 Å to 1.9 Å with step size 0.2 Å, is explored. For QUDIO, the entire set of Pauli terms constituting the problem Hamiltonian is uniformly partitioned into 32 subsets and distributed to 32 local quantum processors. The accessible Hamiltonian terms for each local processor remain fixed during the whole training process. The number of local iterations W varies in {1, 2, 4, 8, 16, 32}.
The simulation results of VQE for the LiH molecule at 0.5 Å are illustrated in Fig. 5. Because the number of local iterations W varies among different settings, we uniformly collect data points after every 32 iterations (i.e., the least common multiple of all W) to guarantee that the loss is obtained exactly after synchronization. The first row of Fig. 5 records the loss curves of QUDIO with respect to the training steps under different local iterations W. QUDIO experiences a severe drop in performance, and an evident gap between the estimated and the exact results appears when W ≥ 8. By contrast, as shown in the bottom row of Fig. 5, Shuffle-QUDIO well estimates the exact ground energy even when W = 32. Comparing the subplots of the same column, Shuffle-QUDIO shows a distinct advantage in improving the convergence of the distributed VQE while requiring a lower communication overhead. For example, Shuffle-QUDIO achieves −6.8 Ha at the 96-th iteration with W = 8, while QUDIO only reaches −4.1 Ha.
The potential energy surface of LiH solved by the distributed VQE schemes is shown in Fig. 6, where the left panel describes the results of QUDIO with a varied number of local iterations W and the right panel records the results of Shuffle-QUDIO. There exists a distinct boundary among the potential energy surfaces estimated at the different levels of W in QUDIO. More precisely, the estimated potential energy surface gradually moves away from the exact potential energy surface (black line) with increased W, which reveals the vulnerability of QUDIO when reducing the communication frequency among distributed workers. By contrast, Shuffle-QUDIO exhibits a fairly stable performance even when increasing W from 1 to 32, as drawn from the nearly coincident potential energy surface curves at each setting of W. Note that the slight gap between the exact potential energy surface and the optimal estimated results originates from the restricted expressive power of the employed ansatz, which does not guarantee that the prepared state covers the ground state of LiH.
To further quantify the stability of Shuffle-QUDIO, we statistically compute the mean and standard deviation of the approximation error Err = |E_VQE − E_ideal| over various bond distances and random seeds. As illustrated in the top subplot of Fig. 7, the average approximation error Err of QUDIO scales exponentially with increased W. When W ≥ 8, the approximation error of QUDIO exceeds 2 Ha, which fails to capture the ground state of LiH. Instead, Shuffle-QUDIO incurs only an imperceptible increment (0.093) of the approximation error when W grows from 1 to 32, making it possible to largely reduce the communication overhead with little performance drop. The bottom subplot of Fig. 7 depicts the standard deviation of the approximation error of both methods, showing that Shuffle-QUDIO enjoys a smaller variance and stronger stability than QUDIO. These observations provide convincing empirical evidence that Shuffle-QUDIO efficiently reduces the susceptibility to W in quantum distributed optimization.

C. Sensitivity to quantum noise
To better characterize the ability of Shuffle-QUDIO to run on NISQ devices, we benchmark its performance under depolarizing noise and realistic quantum noise modeled by PennyLane [75]. The noise strength p of the global depolarizing channel ranges from 0 to 0.3 with step size 0.1. The realistic noise model is extracted from the 5-qubit IBM ibmq_quito device. Note that the measurement error introduced by a finite number of shots is also considered.
FIG. 8: Speedup of operator grouping for VQE on H2. The label 'base' refers to the case where no operator grouping is applied. The label 'linear speedup' represents the reference line of linear speedup.
We first benchmark the performance of Shuffle-QUDIO with operator grouping when the shot noise is considered. The results are shown in Fig. 8. After applying operator grouping to the molecular Hamiltonian, the trainable quantum state converges to the ground state of the molecule faster than with the original measurement strategy. In light of the speedup provided by operator grouping, we can integrate this technique into the framework of Shuffle-QUDIO to gain better performance. On the other hand, with a growing number of quantum processors, the acceleration rate with the operator grouping strategy gradually decays. This phenomenon partially results from the fact that a small number of Hamiltonian terms leads to a small proportion of operators that can be grouped together.
We next apply QUDIO, QUDIO with operator grouping, Shuffle-QUDIO, and Shuffle-QUDIO with operator grouping to estimate the ground energy of the H2 molecule under both the system and shot noise. For each method, the hyper-parameters are set as K = 4 and W = 32, and the number of measurements is 100. The simulation results are shown in Fig. 9. When the depolarizing noise is not too strong (p ≤ 0.2), Shuffle-QUDIO achieves a much smaller approximation error than QUDIO. When p = 0.3, the overwhelming noise disables both Shuffle-QUDIO and QUDIO. Under the realistic noise setting, Shuffle-QUDIO still works well with a tolerable approximation error of 0.063, whereas QUDIO is incapable of estimating the accurate ground state energy. Moreover, operator grouping can further widen the performance gap between QUDIO and Shuffle-QUDIO by inhibiting the negative effect of quantum noise on Shuffle-QUDIO.

D. Aggregation strategy
Shuffle-QUDIO narrows the discrepancy among distributed processors by randomly changing the observables of each processor in every local iteration, which partially guarantees the rationality of averaging all models for synchronization. To further explore the effect of various aggregation strategies on the performance of Shuffle-QUDIO, we devise three additional model aggregation algorithms, named random aggregation, median aggregation, and weighted aggregation.
• Random aggregation: randomly select a local processor and distribute its parameters of quantum circuit to other processors.
• Median aggregation: rank all local processors by their loss value and select the median as the synchronized quantum circuit.
• Weighted aggregation: combine the quantum circuits of all local processors by a loss-induced weighted summation. The smaller the value of the loss function of a local processor, the bigger the contribution the processor makes to the synchronized quantum model.
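The four synchronization rules above can be sketched compactly as follows. The exponential weighting used in the `weighted` rule is one plausible loss-induced scheme chosen for illustration, not necessarily the exact scheme of Appendix E.

```python
import numpy as np

def aggregate(params, losses, rule="average", rng=None):
    """Illustrative synchronization rules for K local parameter vectors.
    `params` is a (K, P) array; `losses` holds each worker's local loss."""
    if rule == "average":                      # plain parameter averaging
        return params.mean(axis=0)
    if rule == "random":                       # broadcast one random worker
        rng = rng if rng is not None else np.random.default_rng()
        return params[rng.integers(len(params))]
    if rule == "median":                       # worker with the median loss
        order = np.argsort(losses)
        return params[order[len(order) // 2]]
    if rule == "weighted":                     # smaller loss -> larger weight
        w = np.exp(-np.asarray(losses))
        return (w[:, None] * params).sum(axis=0) / w.sum()
    raise ValueError(rule)

# Three toy workers with one parameter each
params = np.array([[1.0], [2.0], [4.0]])
losses = [0.3, 0.1, 0.9]
synced = aggregate(params, losses)             # average: 7/3
```

For instance, median aggregation here returns the parameters of worker 0 (loss 0.3 is the median), while weighted aggregation pulls the result toward worker 1, which has the smallest loss.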
Refer to Appendix E for more details. We implement the four quantum model aggregation methods in the framework of Shuffle-QUDIO to solve the ground state energy of the molecules H2 and LiH. The hyper-parameters are set as W ∈ {1, 2, 4, 8, 16, 32} and K ∈ {1, 2, 4, 8, 16}. Fig. 10 demonstrates the cumulative distribution function (CDF) of the approximation error of the ground state energy. There appears to be no large gap in the approximation error among the four aggregation strategies, indicating the strong robustness of Shuffle-QUDIO to the choice of quantum model aggregation. This stability may be credited to the introduction of the shuffle operation during the distributed quantum computation, which diminishes the bias among different local quantum models. On the other hand, it is worth noting that average aggregation always achieves a smaller approximation error with higher probability than random aggregation in the statistical sense. This difference in CDF implies that a superior aggregation algorithm for quantum distributed optimization could further enhance the efficiency of Shuffle-QUDIO. We leave the design of an optimal aggregation method as future work.

V. DISCUSSION
In this paper, we propose Shuffle-QUDIO, a novel distributed optimization scheme for VQE with faster convergence and strong noise robustness. By introducing the shuffle operation into each iteration, the Hamiltonian terms received by each local processor are not fixed during the optimization. From the statistical view, the shuffle operation warrants that the gradients manipulated by all local processors are unbiased. In this way, Shuffle-QUDIO allows an improved convergence and a lower sensitivity to the communication frequency as well as to quantum noise. Meanwhile, the operator grouping strategy can be seamlessly embedded into Shuffle-QUDIO to reduce the number of measurements in each iteration. Theoretical analysis and extensive numerical experiments on VQE further verify its effectiveness and advantages in accelerating VQE and guaranteeing small approximation errors in both ideal and noisy scenarios.
Although the random shuffle operation performs well on the H2 and LiH molecules, the performance can be further improved by developing more advanced shuffling strategies. First, instead of a random shuffle, we can design a problem-specific and hardware-oriented Hamiltonian allocation tactic, which can eliminate the deviation of the optimization paths of the local models and better adapt to the limited quantum resources of various local processors. Second, due to the existence of barren plateaus [76][77][78][79][80] in the optimization of the ansatz, the training of local quantum models may get stuck. Inspired by the study showing that local observables enjoy a polynomially vanishing gradient [79], a promising direction is to group Hamiltonian terms with similar locality in QUDIO to avoid the barren plateau on some processors. Finally, a more fine-grained partition of the quantum circuit structure, besides the observables, can be employed to reduce the number of parameters to be optimized by each local processor, as implemented in [52].
Another feasible attempt to enhance the performance of distributed VQE in practice is to unify Shuffle-QUDIO with other measurement reduction techniques. One successful example is operator grouping, as discussed in Section IV C. Specifically, when optimizing the circuit run on each distributed quantum processor, we can utilize the operator grouping strategy to reduce the required number of measurements of the allocated Hamiltonian. In this way, the measurement noise in the framework of Shuffle-QUDIO is suppressed under a finite shot budget. The other two methods, shot allocation and classical shadows, can also be integrated into Shuffle-QUDIO in a similar manner.
Besides the potential improvements in convergence and speedup for Shuffle-QUDIO, the data privacy leakage during the transmission of gradient information among local nodes should be avoided. On the one hand, the shuffle operation in Shuffle-QUDIO naturally adds randomness to the system, hindering the recovery of intact data. On the other hand, previous studies proposed differential privacy [81,82] and blind quantum computing [83] to protect data security. When combining these techniques with Shuffle-QUDIO, it remains open to explore the consequent influence on the convergence of the optimization.
Apart from utilizing quantum-specific properties to enhance Shuffle-QUDIO, we can also leverage the experience from classical distributed optimization, such as Elastic Averaging SGD [84] and decentralized SGD [85]. It is worth noting that the flexibility of Shuffle-QUDIO makes it easy to replace some components with advanced classical techniques, as discussed in Sec. IV D. Taken together, it is expected that Shuffle-QUDIO and its variants can speed up the computation of variational quantum circuits and tackle real-world problems with NISQ devices.
The appendix is organized as follows. Appendix A introduces the basic notations and properties of the loss function. Appendices B, C, and D present the proofs of Lemma 1, Lemma 2, and Theorem 1, respectively. Appendix E explains the aggregation methods discussed in Section IV D. Appendix F demonstrates additional experimental results and analysis of Shuffle-QUDIO.
Lemma 6 (Gradient of the noisy loss function). Let N_p be the global depolarizing channel with strength p. For a quantum state ρ(θ) parameterized by θ, the gradient after applying the depolarizing channel is ∇L_p(θ) = (1 − p)∇L(θ).

Proof of Lemma 6. Recalling the depolarizing channel N_p(ρ) = (1 − p)ρ + pI/2^n, we have the noisy loss function L_p(θ) = Tr(H N_p(ρ(θ))) = (1 − p)L(θ) + (p/2^n)Tr(H). Following the parameter shift rule in Eq. (A2), the estimated gradient under the depolarizing noise is ∇L_p(θ) = (1 − p)∇L(θ), since the constant shift (p/2^n)Tr(H) vanishes under differentiation; the norm of the estimated gradient is then bounded with the help of the first inequality (a − b)² ≤ 2(a² + b²).
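As a sanity check, the rescaling of the gradient by (1 − p) under a global depolarizing channel can be verified numerically on a single qubit. The snippet below is purely illustrative (a one-qubit example with H = σ_Z and ρ(θ) prepared by an Rx rotation, not taken from the paper), and uses the parameter shift rule to compute both gradients.

```python
import numpy as np

# Single-qubit check of Lemma 6: for L(θ) = Tr(H ρ(θ)) with H = Z and
# ρ(θ) = Rx(θ)|0><0|Rx(θ)†, a global depolarizing channel N_p rescales
# the parameter-shift gradient by exactly (1 - p).

Z = np.diag([1.0, -1.0])

def rho(theta):
    Rx = np.array([[np.cos(theta / 2), -1j * np.sin(theta / 2)],
                   [-1j * np.sin(theta / 2), np.cos(theta / 2)]])
    psi = Rx @ np.array([1.0, 0.0])
    return np.outer(psi, psi.conj())

def noisy_loss(theta, p):
    dep = (1 - p) * rho(theta) + p * np.eye(2) / 2  # N_p(ρ)
    return np.real(np.trace(Z @ dep))

theta, p = 0.7, 0.3
# Parameter-shift gradients of the noisy and noiseless losses.
grad_noisy = (noisy_loss(theta + np.pi / 2, p) - noisy_loss(theta - np.pi / 2, p)) / 2
grad_ideal = (noisy_loss(theta + np.pi / 2, 0) - noisy_loss(theta - np.pi / 2, 0)) / 2
assert np.isclose(grad_noisy, (1 - p) * grad_ideal)
```

Here the noiseless loss is cos θ, so the ideal gradient is −sin θ and the noisy one is (1 − p)(−sin θ), in agreement with the lemma.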
For Shuffle-QUDIO, the discrepancy E_{H_k|θ_k^t}[||∇L(θ_k^t, H) − g_k^t(θ_k^t, H_k)||²] is quantified by taking the expectation over the randomly shuffled Hamiltonians H_k given the parameters θ_k^t, where the second equality is derived by utilizing Lemma 1, the first inequality comes from the bound of the gradient norm in Lemma 3, and the last inequality uses the induced bound of ||∇L(θ, H)||² and holds when K ≥ 2, as required by the condition 1 − 2/K ≥ 0.
Appendix D: Proof of Theorem 1

Introduce the ancillary variable θ̄^t = (1/K)∑_{k=1}^K θ_k^t, the average of the K local models. According to the F-Lipschitz continuity of the loss function in Lemma 4, we have

L(θ̄^{t+1}) − L(θ̄^t) ≤ ⟨∇L(θ̄^t), θ̄^{t+1} − θ̄^t⟩ + (F/2)||θ̄^{t+1} − θ̄^t||²,

where H_k is constant throughout, which is why its superscript t is discarded. The first equality is derived by utilizing ⟨a, b⟩ = (||a||² + ||b||² − ||a − b||²)/2. Next, we bound the term T₁ in Eq. (D2). We first calculate the upper bound of ||θ̄^t − θ_k^t||². Assume the latest model synchronization happened at iteration t_c with t − t_c < W; then θ_k^{t_c+1} = θ̄^{t_c}. According to the gradient descent rule, the parameter θ_k^t is derived by unrolling the local updates,

θ_k^t = θ^{t_c+1} − η ∑_{j=t_c+1}^{t−1} g_k^j(θ_k^j, H_k),

where the subscript of the parameters θ^{t_c+1} in the second equality is discarded without ambiguity because θ_k^{t_c+1} = θ_l^{t_c+1} for all k, l ∈ {1, ..., K}. Therefore, the deviation between the local quantum model θ_l^t (note that we use the notation θ_l^t instead of θ_k^t to avoid confusion between a specified quantum worker l and the general representation of the k-th worker in the following derivation) and the average model θ̄^t is measured by ||θ̄^t − θ_l^t||², where the first and second inequalities follow from Jensen's inequality and the triangle inequality, and the last inequality is deduced from the bound of the gradient.
We now derive the upper bound of the second term T₂ in the last inequality. Recall Lemma 3, which guarantees that the norm of the gradient of each worker is bounded by ||g_k^t|| ≤ G.

Convergence of QUDIO. For the case of QUDIO, where each quantum processor is assigned fixed partial Hamiltonian terms at the beginning, T₂ is bounded by Eq. (D7). Combining Eqs. (D3), (D7), and (D6), we can quantify the progress of one global iteration under the fixed Hamiltonian partition strategy as Eq. (D11). The convergence of the gradient is then quantified by Eq. (D12). Comparing Eqs. (D9) and (D12), especially the terms highlighted in bold face, it is obvious that Shuffle-QUDIO achieves faster convergence than the original QUDIO.

Appendix E: Aggregation methods for quantum distributed optimization
The framework of QUDIO can be roughly deconstructed into two alternating operations: local updates and global synchronization. While the former operation is upgraded by introducing the shuffle operation, the latter can also be modified by more advanced techniques. In the original Shuffle-QUDIO in Alg. 2, the average aggregation method is simply employed to merge the information from distributed nodes, which may be sub-optimal since it does not account for the differences among these nodes. In this section, we give detailed descriptions of three other aggregation algorithms.
Random aggregation. For each distributed quantum model with parameters θ_i^{(t,W)} after completing the local updates, the synchronized parameter is θ^{t+1} = θ_j^{(t,W)}, where j ∈ [K] is randomly generated.

Median aggregation. Different from average aggregation and random aggregation, median aggregation utilizes the loss value of every local quantum model as a reference to determine the synchronized quantum model. To be concrete, assuming the loss of the i-th quantum processor at the t-th synchronization is denoted by L_i^{(t)}, the quantum model run on processor j whose loss value L_j^{(t)} is the median of {L_i^{(t)}}_{i=1}^K is selected as the synchronized model θ^{t+1} = θ_j^{(t,W)}.

Weighted aggregation. Instead of utilizing a single local quantum model with median loss as the aggregated model, weighted aggregation merges all local quantum models in a weighted-summation fashion. The direct motivation is that a local quantum model achieving a lower loss should contribute more to the synchronized model. Similar to the median aggregation method, we first collect the loss value of every local quantum model {L_i^{(t)}}_{i=1}^K to compute the weights. Specifically, a monotone decreasing function f is first applied to each loss and then a softmax function is adopted to obtain a normalized weight vector,

w_i^{(t)} = exp(f(L_i^{(t)})) / ∑_{j=1}^K exp(f(L_j^{(t)})). (E1)

Based on the loss-induced weights, the synchronized quantum model is obtained as θ^{(t+1)} = ∑_{i=1}^K w_i^{(t)} θ_i^{(t,W)}.
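The four aggregation rules can be sketched compactly in code. The function names below are illustrative; for the weighted rule we pick f(x) = −x as one possible monotone decreasing function, which the paper leaves unspecified.

```python
import numpy as np

# Sketch of the aggregation rules for the synchronized parameters.
# `params` has shape (K, d): one row of circuit parameters per local
# worker; `losses` holds the K local loss values at synchronization.

def average_agg(params, losses=None):
    """Original Shuffle-QUDIO rule: plain parameter averaging."""
    return params.mean(axis=0)

def random_agg(params, losses=None, rng=np.random):
    """Pick one local model uniformly at random."""
    return params[rng.randint(len(params))]

def median_agg(params, losses):
    """Pick the worker whose loss is the median of the K losses."""
    j = np.argsort(losses)[len(losses) // 2]
    return params[j]

def weighted_agg(params, losses):
    """Softmax over a monotone decreasing map of the losses (here f(x) = -x),
    as in Eq. (E1); lower-loss workers contribute more."""
    w = np.exp(-np.asarray(losses))
    w /= w.sum()
    return w @ params
```

In all four cases the result is broadcast back to every processor as the common starting point for the next round of local updates.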

FIG. 1: The scheme of Shuffle-QUDIO. Shuffle-QUDIO consists of three subroutines: initialization, local updates, and global synchronization. During the initialization phase, multiple copies of the original ansatz and the corresponding problem Hamiltonian H are dispatched to each local processor. Note that each processor shares the same seed of the random number generator. For each iteration of the local updates, the set of observables {H_i}_{i=1}^M is randomly shuffled and the i-th local processor picks a subset of the whole observables according to the assigned random numbers. In this way, the observables of the processors do not overlap with each other and the union of their observables exactly constitutes the problem Hamiltonian H. After W local updates, the parameters of each local ansatz are aggregated and then reassigned to all local processors, which is called global synchronization. When the maximal number T of iterations is reached, Shuffle-QUDIO executes the final synchronization and outputs the trained parameters.
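The shared-seed shuffle-and-partition step can be sketched as follows. Because every processor seeds its random number generator identically, all K workers compute the same permutation locally and need no communication to agree on the partition; the function name `local_terms` is our own illustrative choice.

```python
import numpy as np

# Sketch of the shuffle-and-partition step in Shuffle-QUDIO: worker k
# shuffles the indices of the M Hamiltonian terms with a shared seed
# and keeps only the k-th chunk of the shuffled list.

def local_terms(num_terms, K, k, seed):
    rng = np.random.RandomState(seed)      # same seed on every processor
    perm = rng.permutation(num_terms)      # hence an identical permutation
    chunks = np.array_split(perm, K)       # disjoint chunks covering all terms
    return chunks[k]

M, K, seed = 8, 4, 42
parts = [local_terms(M, K, k, seed) for k in range(K)]
# The subsets are disjoint and their union recovers the whole Hamiltonian.
assert sorted(np.concatenate(parts).tolist()) == list(range(M))
```

Changing the seed at every local iteration reproduces the per-iteration reshuffling of the observables, while keeping the seed schedule shared guarantees that the union of the local Hamiltonians always equals H.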

FIG. 5: Training process of VQE optimized by QUDIO and Shuffle-QUDIO, respectively. Each data point is collected after a synchronization. The dashed black line denotes the exact ground state energy (GSE) in the same setting. First row: the loss curve with respect to iterations in QUDIO. With exponentially increasing W, the convergence of training is severely degraded, as depicted in the subplot in the first row, sixth column. Second row: the loss curve with respect to iterations in Shuffle-QUDIO. The speed of the loss decrease decays relatively slowly as W grows. When W = 32 (second row, sixth column), the loss still converges to the same level as W = 1 within 200 iterations.

FIG. 7: Mean value Err and standard deviation δ(Err) of the approximation error. Each data point is collected over various bond distances and random seeds. Shuffle-QUDIO outperforms QUDIO by achieving a smaller approximation error and a lower sensitivity to the communication frequency W.

FIG. 9: Performance comparison in the NISQ era. "ideal" represents the fault-tolerant case without noise; "p = a" represents the case in which a depolarizing channel with strength a acts in the circuit; "NISQ" represents the case of running on a real NISQ device.

FIG. 10: Performance comparison of various model aggregation algorithms. The closer the curve is to the upper left, the more accurate the estimated ground state energy is.
(D3)
where the first inequality follows Jensen's inequality ||(1/n)∑_{i=1}^n a_i||² ≤ (1/n)∑_{i=1}^n ||a_i||², the second inequality holds because of the triangle inequality ||a + b|| ≤ ||a|| + ||b||, and the last inequality is derived from the F-Lipschitz continuity condition of the gradient function in Lemma 5.
where the last term in the last inequality follows ||∑_{i=1}^n a_i||² ≤ n ∑_{i=1}^n ||a_i||². Rearranging the inequality above and summing over t, we achieve the bound on (1/T)∑_{t=1}^T ||∇L(θ̄^t)||².

Convergence of Shuffle-QUDIO. For the case of Shuffle-QUDIO, where the whole set of Hamiltonian terms is shuffled and reassigned to the local processors during each iteration, we can obtain a tighter bound for T₂ by taking the expectation over the random local Hamiltonians H_k given the parameters θ_k^t,

E_{H_k|θ_k^t}[||∇L(θ_k^t, H) − g_k^t(θ_k^t, H_k)||²]
= E_{H_k|θ_k^t}[||∇L(θ_k^t, H)||²] − 2E_{H_k|θ_k^t}[∇^⊤L(θ_k^t, H) g_k^t(θ_k^t, H_k)] + E_{H_k|θ_k^t}[||g_k^t(θ_k^t, H_k)||²]
≤ (1 − 2(1 − p)/K)||∇L(θ_k^t, H)||² + (1 − p)²G²
≤ (K − 1 + p)²G², (D10)

where the first inequality is based on Lemma 1 and the property of bounded gradients of each worker, and the second inequality is derived by applying the induced bound of ||∇L(θ_k^t, H)||². Note that we introduce the implicit constraint K > 2(1 − p) to assure that the coefficient of ||∇L(θ_k^t, H)||² in the first inequality is non-negative. On the other hand, when shuffling the Hamiltonian terms in every iteration, the loss reduction is bounded by

E_{h|θ̄^t}[L(θ̄^{t+1}) − L(θ̄^t)] ≤ −(η/2)||∇L(θ̄^t)||² + (η/2) E_{h|θ̄^t}[||∇L(θ̄^t) − (1/K)∑_{k=1}^K g_k^t(θ_k^t, H_k)||²]
≤ −(η/2)||∇L(θ̄^t)||² + (η/K)∑_{k=1}^K (F²||θ̄^t − θ_k^t||² + E_{h|θ̄^t}[||∇L(θ_k^t, H) − g_k^t(θ_k^t, H_k)||²]).