Generative model for learning quantum ensemble via optimal transport loss

Generative modeling is an unsupervised machine learning framework that exhibits strong performance in various machine learning tasks. Recently, several quantum versions of generative models have been proposed, some of which are even proven to have a quantum advantage. However, those methods are not directly applicable to constructing a generative model for learning a set of quantum states, i.e., a quantum ensemble. In this paper, we propose a quantum generative model that can learn a quantum ensemble in an unsupervised machine learning framework. The key idea is to introduce a new loss function based on the optimal transport loss, which has been widely used in classical machine learning due to its several good properties, e.g., no need to ensure the common support of two ensembles. We then give an in-depth analysis of this measure, such as the scaling property of the approximation error. We also demonstrate the generative modeling with an application to a quantum anomaly detection problem, which cannot be handled via existing methods. The proposed model paves the way for a wide range of applications such as the health check of quantum devices and efficient initialization of quantum computation.


Introduction
Amid the recent rapid progress of quantum algorithms for both noisy near-term and future fault-tolerant quantum devices, quantum machine learning (QML) in particular attracts huge attention. QML is largely categorized into two regimes in view of the type of data, which can be roughly called classical data and quantum data. The former has the conventional meaning used in the classical case; in the supervised learning scenario, e.g., a quantum system is trained to give a prediction for given classical data such as an image. As for the latter, on the other hand, the task is to predict some property of a given quantum state drawn from a set of states, e.g., the phase of a many-body quantum state, again in the supervised learning scenario. Owing to the difficulty of directly representing a huge quantum state classically, some quantum advantages have been proven in QML for quantum data [1][2][3].
In the above paragraph we used the supervised learning setting to explain the difference between classical and quantum data. But the success of unsupervised learning in classical machine learning, particularly generative modeling, is of course notable; a variety of algorithms have demonstrated strong performance in several applications, such as image generation [4][5][6], molecular design [7], and anomaly detection [8]. Hence, it is quite natural that several quantum unsupervised learning algorithms have been actively developed, such as the quantum circuit Born machine (QCBM) [9,10], the quantum generative adversarial network (QGAN) [11,12], and the quantum autoencoder (QAE) [13,14]. Also, Ref. [12] studied the generative modeling problem for quantum data; the task is to construct a model quantum system producing a set of quantum states, i.e., a quantum ensemble, that approximates a given quantum ensemble. The model quantum system contains latent variables, the change of which corresponds to a change of the output quantum state of the system. In the classical case, such a generative model governed by latent variables is called an implicit model. It is known that, to efficiently train an implicit model, we are often encouraged to minimize a distance between the model dataset and the training dataset, rather than minimizing, e.g., the divergence between two probability distributions. The optimal transport loss (OTL), which typically leads to the Wasserstein distance, is suitable for the purpose of measuring the distance between two datasets; indeed, a quantum version of the Wasserstein distance was proposed in [15,16] and was applied to construct a generative model for quantum ensembles in the QGAN framework [17,18].
Along this line of research, in this paper we also focus on the generative modeling problem for quantum ensembles. We are motivated by the fact that the above-mentioned existing works employed the Wasserstein distance defined for two mixed quantum states corresponding to the training and model quantum ensembles, where each mixed state is obtained by compressing all elements of the quantum ensemble into a single mixed state. This is clearly problematic, because the compression process loses much of the information of the ensemble; for instance, single-qubit pure states uniformly distributed on the equator of the Bloch sphere are compressed to a maximally mixed state, from which the original ensemble clearly cannot be recovered. Accordingly, learning a single mixed state produced from the training ensemble does not lead to a model system that can approximate the original training ensemble.
In this paper, hence, we propose a new quantum OTL, which directly measures the difference between two quantum ensembles. The generative model can then be obtained by minimizing this quantum OTL between a training quantum ensemble and the ensemble of pure quantum states produced by the model. As the generative model, we use a parameterized quantum circuit (PQC) that contains tunable parameters and latent variables, both of which are encoded in the angles of single-qubit rotation gates. A notable feature of the proposed OTL is that it has the form of a sum of local functions, each operating on a few neighboring qubits. This condition (i.e., the locality of the cost) is indeed necessary to train the model without suffering from the so-called vanishing gradient issue [19], meaning that the gradient vector with respect to the parameters decreases exponentially fast as the number of qubits increases.
Using the proposed quantum OTL, which will be formally defined in Section 3, we show the following results. The first result is given in Section 4, which provides a performance analysis of the OTL and its gradient from several aspects, e.g., scaling properties of the OTL as a function of the number of training data and the number of measurements. We also numerically confirm that the gradient of the OTL is indeed free from the vanishing gradient issue. The second result is provided in Section 5, showing examples of constructing a generative model for a quantum ensemble by minimizing the OTL. This demonstration includes the application of the quantum generative model to an anomaly detection problem for quantum data; that is, once a generative model is constructed by learning a given quantum ensemble, it can be used to detect an anomalous quantum state by measuring the distance of this state to the output ensemble of the model. Section 6 gives concluding remarks, and some supporting materials including the proofs of theorems are given in the Appendix.

Preliminaries
In this section, we first review the implicit generative model for classical machine learning problems in Sec. 2.1. Next, Sec. 2.2 is devoted to describing the general OTL, which can be effectively used as a cost function to train a generative model.

Implicit Generative Model
A generative model is used to approximate an unknown probability distribution that produces a given training dataset. The basic strategy to construct a generative model is as follows: assuming a probability distribution α(x) behind the given training dataset {x_i}_{i=1}^M ∈ X^M, where X denotes the space of random variables, we prepare a parameterized probability distribution β_θ(x) and learn the parameters θ that minimize an appropriate loss function defined on the training dataset.
In general, generative models are categorized into two types: prescribed models and implicit models. A prescribed generative model explicitly defines β_θ(x) with the parameters θ. Then we can calculate the log-likelihood of β_θ(x_i), and the parameters are determined by the maximum likelihood method, which corresponds to minimizing the Kullback-Leibler divergence KL(α‖β_θ). On the other hand, an implicit generative model does not give us an explicit form of β_θ(x). An important feature of the implicit generative model is that it can easily describe a probability distribution whose random variables are confined to a hidden low-dimensional manifold; also, the data-generation process can be interpreted as a physical process from a latent variable to the data [20]. Examples of implicit generative models include Variational Auto-Encoders [21], Generative Adversarial Networks [22], and flow-based methods [23]. This paper focuses on the implicit generative model.
In the implicit generative model we usually assume that the distribution behind the training data resides on a relatively low-dimensional manifold. That is, an implicit generative model is expressed as a map of a random latent variable z onto X; z resides in a latent space Z whose dimension N_z is significantly smaller than that of the sample space, N_x. The latent random variable z follows a known distribution γ(z) such as a uniform distribution or a Gaussian distribution. That is, the implicit model distribution is given by β_θ = G_θ#γ, where # is called the push-forward operator [24], which moves the distribution γ on Z to a probability distribution on X through the map G_θ. This implicit generative model is trained so that the set of samples generated from the model distribution is close to the set of training data, by adjusting the parameters θ to minimize an appropriate cost function L:

θ* = argmin_θ L(α̂_M, β̂_{θ,M_g}),    (2.1)

where α̂_M(x) and β̂_{θ,M_g} = G_θ#γ̂_{M_g}(z) denote empirical distributions defined with the sampled data {x_i}_{i=1}^M and {z_j}_{j=1}^{M_g}, which follow the probability distributions α(x) and γ(z), respectively:

α̂_M = (1/M) Σ_{i=1}^M δ_{x_i},   γ̂_{M_g} = (1/M_g) Σ_{j=1}^{M_g} δ_{z_j},    (2.2)

where δ_x denotes the Dirac measure at x.
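The push-forward construction β_θ = G_θ#γ can be illustrated with a minimal classical sketch; the affine map G_theta below is a hypothetical stand-in for the trained generator (in our setting, a quantum circuit plays this role), and all names and toy dimensions are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def G_theta(z, theta):
    """Toy push-forward map G_theta: latent space Z -> sample space X.

    Here G_theta is just an affine map followed by a nonlinearity;
    any parameterized map (a neural network, or a quantum circuit)
    could take its place.
    """
    W, b = theta
    return np.tanh(z @ W + b)

# Latent dimension N_z is much smaller than sample dimension N_x.
N_z, N_x, M_g = 2, 10, 100
theta = (rng.normal(size=(N_z, N_x)), rng.normal(size=N_x))

# Sample from the known latent distribution gamma (uniform here) and
# push the samples forward: this set forms the empirical model distribution.
z = rng.uniform(0.0, 1.0, size=(M_g, N_z))
model_samples = G_theta(z, theta)
print(model_samples.shape)  # (100, 10)
```

Sampling from the model thus amounts to sampling z from the known latent distribution and applying the map.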

Optimal Transport Loss
The OTL is used in various fields such as image analysis, natural language processing, and finance [24][25][26][27]. In particular, the OTL is widely used as a loss function in generative modeling, mainly because it is applicable even when the supports of the probability distributions do not match, and it can naturally incorporate the distance in the sample space X [28][29][30][31][32][33]. The OTL is defined as the minimum cost of moving a probability distribution α to another distribution β:

Definition 1 (Optimal Transport Loss [34]).

L_c(α, β) = min_{π ∈ Π(α,β)} ∫_{X×X} c(x, y) dπ(x, y),    (2.3)

where Π(α, β) denotes the set of couplings, i.e., joint probability distributions on X × X whose marginals are α and β, and c(x, y) ≥ 0 is a non-negative function on X × X that represents the transport cost from x to y, called the ground cost. We call a coupling π that minimizes L_c(α, β) an optimal transport plan.
In general, the OTL does not satisfy the axioms of a metric between probability distributions; but it does when the ground cost is expressed in terms of a metric function, as follows:

Definition 2 (p-Wasserstein distance [35]). When the ground cost c(x, y) is expressed as c(x, y) = d(x, y)^p with a metric function d(x, y) and a real positive constant p, the p-Wasserstein distance is defined as

W_p(α, β) = L_{d^p}(α, β)^{1/p}.    (2.4)

The p-Wasserstein distance satisfies the conditions of a metric between probability distributions. That is, for arbitrary probability distributions α, β, γ, the p-Wasserstein distance W_p satisfies W_p(α, β) ≥ 0 with equality if and only if α = β, the symmetry W_p(α, β) = W_p(β, α), and the triangle inequality W_p(α, γ) ≤ W_p(α, β) + W_p(β, γ).

In general it is difficult to directly handle the probability distributions α and β_θ to minimize the OTL L_c(α, β_θ). Instead, as mentioned in Eq. (2.1), we minimize an approximation of the OTL via the empirical distributions (2.2):

Definition 3 (Empirical estimator for optimal transport loss [35]).

L_c(α̂_M, β̂_{θ,M_g}) = min_π Σ_{i=1}^M Σ_{j=1}^{M_g} c(x_i, G_θ(z_j)) π_{i,j},   s.t.  Σ_j π_{i,j} = 1/M,  Σ_i π_{i,j} = 1/M_g,  π_{i,j} ≥ 0.    (2.5)

The empirical estimator converges as L_c(α̂_M, β̂_{θ,M_g}) → L_c(α, β_θ) in the limit M = M_g → ∞. In general, the speed of this convergence is of the order O(M^{-1/N_x}) with N_x the dimension of the sample space X [36], but the p-Wasserstein distance enjoys the following convergence law [37].
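Since the empirical estimator (2.5) is a finite linear program over the coupling π, it can be computed with an off-the-shelf LP solver. The following is a minimal sketch using SciPy's linprog; the function name and the toy one-dimensional data are illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def empirical_otl(X, Y, ground_cost):
    """Empirical OT loss between two sample sets via linear programming.

    X: (M, d) training samples, Y: (Mg, d) model samples, both with
    uniform weights. Returns the minimal transport cost and the plan.
    """
    M, Mg = len(X), len(Y)
    C = np.array([[ground_cost(x, y) for y in Y] for x in X])  # (M, Mg)

    # Marginal constraints: rows of pi sum to 1/M, columns to 1/Mg.
    A_eq = np.zeros((M + Mg, M * Mg))
    for i in range(M):
        A_eq[i, i * Mg:(i + 1) * Mg] = 1.0
    for j in range(Mg):
        A_eq[M + j, j::Mg] = 1.0
    b_eq = np.concatenate([np.full(M, 1.0 / M), np.full(Mg, 1.0 / Mg)])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun, res.x.reshape(M, Mg)

# Euclidean ground cost on toy 1D data; identical empirical
# distributions give zero transport cost.
X = np.array([[0.0], [1.0]])
Y = np.array([[0.0], [1.0]])
loss, plan = empirical_otl(X, Y, lambda x, y: np.linalg.norm(x - y))
print(round(loss, 6))  # 0.0
```

In practice, dedicated optimal transport solvers are far more efficient than a generic LP solver, but the formulation is the same.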
Theorem 4 (Convergence rate of the p-Wasserstein distance). For the upper Wasserstein dimension d*_p(α) (given in Definition 4 of [37]) of the probability distribution α, the following holds for s > d*_p(α):

E[W_p(α, α̂_M)] = O(M^{-1/s}),    (2.6)

where the expectation E is taken with respect to the samples drawn to form the empirical distribution α̂_M.
Intuitively, the upper Wasserstein dimension d*_p(α) can be interpreted as the support dimension of the probability distribution α, which corresponds to the dimension of the latent space, N_z, in the implicit generative model. Exploiting the metric properties of the p-Wasserstein distance, the following corollaries are immediately derived from Theorem 4:

Corollary 5 (Convergence rate of the p-Wasserstein distance between empirical distributions sampled from a common distribution). Let α̂_{1,M} and α̂_{2,M} be two different empirical distributions sampled from a common distribution α, each with M samples. Then the following holds for s > d*_p(α):

E[W_p(α̂_{1,M}, α̂_{2,M})] = O(M^{-1/s}),    (2.7)

where the expectation E is taken with respect to the samples drawn to form the empirical distributions α̂_{1,M} and α̂_{2,M}.
Corollary 6 (Convergence rate of the p-Wasserstein distance between different empirical distributions). Suppose that the upper Wasserstein dimensions of the probability distributions α and β_θ are at most d*_p. Then the following holds for s > d*_p:

E[|W_p(α̂_M, β̂_{θ,M}) − W_p(α, β_θ)|] = O(M^{-1/s}),    (2.8)

where the expectation E is taken with respect to the samples drawn to form the empirical distributions α̂_M and β̂_{θ,M}.
These corollaries indicate that the empirical estimator (2.5) is a good estimator if the intrinsic dimension of the training data and the dimension of the latent space N_z are sufficiently small, because the Wasserstein dimension d*_p is almost the same as the intrinsic dimension of the training data and the latent dimension. In Sec. 4.1, we numerically observe that similar convergence laws hold even when the OTL is not a p-Wasserstein distance.

Learning algorithm of generative model for quantum ensemble
In Sec. 3.1, we define the new quantum OTL that can suitably be used in the learning algorithm of the generative model for quantum ensembles. The learning algorithm is provided in Sec. 3.2.

Optimal transport loss with local ground cost
Our idea is to directly use Eq. (2.5), yet with a ground cost for quantum states, c(|ψ⟩, |φ⟩), rather than one for classical data vectors, c(x, y). This enables us to define the OTL between quantum ensembles {|ψ_i⟩} and {|φ_j⟩} as follows:

L_c({|ψ_i⟩}, {|φ_j⟩}) = min_π Σ_{i,j} c(|ψ_i⟩, |φ_j⟩) π_{i,j},   s.t.  Σ_j π_{i,j} = p_i,  Σ_i π_{i,j} = q_j,  π_{i,j} ≥ 0,    (3.1)

where p_i and q_j are the probabilities that |ψ_i⟩ and |φ_j⟩ appear, respectively. Note that we could define a transport loss between the corresponding mixed states Σ_i p_i |ψ_i⟩⟨ψ_i| and Σ_j q_j |φ_j⟩⟨φ_j| or some modification of them, as discussed in [17]; but as mentioned in Sec. 1, such a mixed state loses the original configuration of the ensemble (e.g., single-qubit pure states uniformly distributed on the equator of the Bloch sphere) and is thus not applicable for our purpose. The question then is how to define the ground cost c(|ψ⟩, |φ⟩). An immediate choice might be the trace distance:

Definition 7 (Trace distance for pure states [38]).

d(|ψ⟩, |φ⟩) = √(1 − |⟨ψ|φ⟩|²).    (3.2)
Because the trace distance satisfies the axioms of a metric, we can define the p-Wasserstein distance for quantum ensembles, W_p({|ψ_i⟩}, {|φ_j⟩}) = L_{d^p}({|ψ_i⟩}, {|φ_j⟩})^{1/p}, which inherits the useful properties described in Corollary 5. It is also notable that the trace distance is relatively easy to compute on a quantum computer, using e.g. the swap test [39] or the inversion test [40].
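For pure states the trace distance reduces to a simple function of the overlap, so it can also be evaluated classically from statevectors. A small numpy sketch (our own helper, assuming the form d = √(1 − |⟨ψ|φ⟩|²)):

```python
import numpy as np

def trace_distance(psi, phi):
    """Trace distance between two pure statevectors:
    d(psi, phi) = sqrt(1 - |<psi|phi>|^2)."""
    fidelity = np.abs(np.vdot(psi, phi)) ** 2
    # Clip tiny negative floating-point residue before the square root.
    return np.sqrt(max(0.0, 1.0 - fidelity))

# |0> and |+> = (|0> + |1>)/sqrt(2): fidelity 1/2, distance 1/sqrt(2).
ket0 = np.array([1.0, 0.0])
ket_plus = np.array([1.0, 1.0]) / np.sqrt(2)
print(round(trace_distance(ket0, ket_plus), 4))  # 0.7071
```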
We now give an important remark. As will be formally described, our goal is to find a quantum circuit that produces a quantum ensemble (via changing latent variables) which best approximates a given quantum ensemble. This task can be executed by the gradient descent method for a parameterized quantum circuit, but a naive setting leads to the vanishing gradient issue, meaning that the gradient vector decays to zero exponentially fast with respect to the number of qubits [19]. There have been several proposals to avoid this issue in the literature [41,42], but a common prerequisite is that the cost should be a local one. To explain the meaning, let us consider the case where |φ⟩ is given by |φ⟩ = U|0⟩^{⊗n}, where U is a unitary matrix (which will be the parameterized unitary matrix U(θ) defining the generative model) and n is the number of qubits. Then the trace distance is based on the fidelity |⟨ψ|φ⟩|² = |⟨ψ|U|0⟩^{⊗n}|². This is the probability of getting all zeros via a global measurement on the state U†|ψ⟩ in the computational basis, which means that the trace distance is a global cost; accordingly, the fidelity-based learning method suffers from the vanishing gradient issue. On the other hand, the following cost function is based on localized fidelity measurements.

Definition 8 (Ground cost for quantum states only with local measurements [43,44]).

c_local(|ψ⟩, |φ⟩) = 1 − (1/n) Σ_{k=1}^n p^{(k)},   p^{(k)} = ⟨ψ| U (|0⟩⟨0|_k ⊗_{i≠k} I_i) U† |ψ⟩,    (3.3)

where n is the number of qubits. Also, I_i and |0⟩⟨0|_i denote the identity operator and the projection operator acting on the i-th qubit, respectively; thus p^{(k)} represents the probability of getting 0 when observing the k-th qubit of U†|ψ⟩.
Equation (3.3) is certainly a local cost, and thus it may be used for effective learning free from the vanishing gradient issue, provided that some additional conditions (described in Section 5) are satisfied. However, importantly, c_local(|ψ⟩, |φ⟩) is not a distance between the two quantum states, because it is not symmetric and does not satisfy the triangle inequality, while the trace distance (3.2) satisfies the axioms of a distance. Yet c_local(|ψ⟩, |φ⟩) is always non-negative and becomes zero only when |ψ⟩ = |φ⟩, meaning that c_local(|ψ⟩, |φ⟩) functions as a divergence. We can then prove that, in general, the OTL defined with a divergence ground cost also functions as a divergence, as follows. The proof is given in Appendix A.
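To make Definition 8 concrete, the following statevector-level sketch evaluates a local cost of the assumed form c_local = 1 − (1/n) Σ_k p^(k), where p^(k) is the probability of reading 0 on the k-th qubit of U†|ψ⟩; the bit-ordering convention and helper names are our own assumptions:

```python
import numpy as np

def local_ground_cost(psi, U):
    """Local ground cost sketched for statevectors:
    c_local = 1 - (1/n) * sum_k p^(k), with p^(k) the probability
    of outcome 0 on the k-th qubit of U^dagger |psi>."""
    s = U.conj().T @ psi
    n = int(np.log2(len(s)))
    probs = np.abs(s) ** 2
    total = 0.0
    for k in range(n):
        # Basis indices whose k-th bit (qubit) equals 0.
        mask = np.array([(idx >> k) & 1 == 0 for idx in range(len(s))])
        total += probs[mask].sum()
    # Clip tiny negative floating-point residue: the cost is >= 0.
    return max(0.0, 1.0 - total / n)

# If |psi> = U|0...0>, every p^(k) is 1 and the cost vanishes.
n = 2
U = np.linalg.qr(np.random.default_rng(1).normal(size=(2**n, 2**n))
                 + 1j * np.random.default_rng(2).normal(size=(2**n, 2**n)))[0]
psi = U[:, 0]  # U applied to |00>
print(round(local_ground_cost(psi, U), 6))  # 0.0
```

When |ψ⟩ = U|0…0⟩ every local probability equals one and the cost vanishes, consistent with the divergence property discussed above.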
Theorem 9 (OTL with a divergence ground cost). If the ground cost c(x, y) is a divergence, i.e., c(x, y) ≥ 0 for all x, y and c(x, y) = 0 if and only if x = y, then the OTL L_c(α, β) with c(x, y) is also a divergence. That is, L_c(α, β) satisfies the following properties for arbitrary probability distributions α and β: L_c(α, β) ≥ 0, and L_c(α, β) = 0 if and only if α = β.

Therefore, the OTL L_c({|ψ_i⟩}, {|φ_j⟩}) given in Eq. (3.1) with the local ground cost c_local(|ψ⟩, |φ⟩) given in Eq. (3.3) functions as a divergence. This means that L_c({|ψ_i⟩}, {|φ_j⟩}) can be suitably used for evaluating the difference between a given quantum ensemble and the set of output states of the generative model. At the same time, recall that, for the purpose of avoiding the vanishing gradient issue, we had to give up the fidelity measure and accordingly the distance property of the OTL. Hence we cannot directly use the desirable properties described in Theorem 4, Corollary 5, and Corollary 6; nonetheless, in Section 4, we will discuss whether similar properties hold even for the divergence measure.

Learning Algorithm
The goal of our task is to train an implicit generative model so that it outputs a quantum ensemble approximating a given ensemble {|ψ_i⟩}_{i=1}^M; that is, our generative model contains tunable parameters and latent variables, as in the classical case described in Section 2.1. In this paper, we employ the following implicit generative model:

Definition 10 (Implicit generative model on a quantum circuit). Using the initial state |0⟩^{⊗n} and the parameterized quantum circuit U(z, θ), the implicit generative model on a quantum circuit is defined as

|φ_θ(z)⟩ = U(z, θ)|0⟩^{⊗n}.    (3.6)

Here θ is the vector of tunable parameters and z is the vector of latent variables that follow a known probability distribution; both θ and z are encoded in the rotation angles of rotation gates in U(z, θ).
A similar circuit model is also found in meta-VQE [45], which uses physical parameters such as interatomic distances instead of random latent variables z. Also, the model proposed in [12] introduces the latent variables z as the computational basis of an initial state, in the form |φ_θ(z)⟩ = U(θ)|z⟩; however, in this model, states with different latent variables are always orthogonal to each other, and thus the model cannot capture a small change of state in the Hilbert space via changing the latent variables. In contrast, the model (3.6) fulfills this purpose as long as the expressivity of the state with respect to z is sufficient. In addition, our model is advantageous in that the analytical derivative is available via the parameter shift rule [46,47] not only for the tunable parameters θ but also for the latent variables z. This feature will be effectively utilized in the anomaly detection problem in Sec. 5.
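The parameter shift rule mentioned above can be illustrated on a single-qubit toy model, where the expectation ⟨0|R_x(θ)† Z R_x(θ)|0⟩ = cos θ admits the exact gradient −sin θ; the helper names are ours:

```python
import numpy as np

def expectation(theta):
    """<0| R_x(theta)^dag Z R_x(theta) |0> = cos(theta)."""
    psi = np.array([np.cos(theta / 2), -1j * np.sin(theta / 2)])
    Z = np.diag([1.0, -1.0])
    return np.real(np.vdot(psi, Z @ psi))

def parameter_shift_grad(f, theta):
    """Analytic gradient via the parameter shift rule:
    df/dtheta = (f(theta + pi/2) - f(theta - pi/2)) / 2."""
    return (f(theta + np.pi / 2) - f(theta - np.pi / 2)) / 2

theta = 0.3
# The shifted-evaluation gradient matches the analytic value -sin(theta).
print(np.isclose(parameter_shift_grad(expectation, theta), -np.sin(theta)))  # True
```

The same two-point shift applies to every rotation angle, whether it encodes a trainable parameter θ or a latent variable z.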
Next, as for the learning cost, we take the following empirical estimator of the OTL, calculated from the training data {|ψ_i⟩}_{i=1}^M and the samples of latent variables {z_j}_{j=1}^{M_g}:

L_{c_local} = min_π Σ_{i=1}^M Σ_{j=1}^{M_g} c_{local,i,j} π_{i,j},   s.t.  Σ_j π_{i,j} = 1/M,  Σ_i π_{i,j} = 1/M_g,  π_{i,j} ≥ 0,    (3.7)

where c_{local,i,j} is the ground cost given by

c_{local,i,j} = c_local(|ψ_i⟩, U(z_j, θ)|0⟩^{⊗n}).

Note that in practice c_{local,i,j} is estimated with a finite number of measurements (shots); we denote by ĉ^{(N_s)}_{local,i,j} the estimator with N_s shots of the ideal c_{local,i,j}, and in this case the OTL is denoted as L_{ĉ^{(N_s)}_local}.
Based on the OTL (3.7), the pseudo-code of the proposed algorithm is shown in Algorithm 1. The total number of training quantum states |ψ_i⟩ required for a parameter update is of the order O(M M_g N_s) in step 3, and O(max(M, M_g) N_s N_p) in step 5, since the parameter shift rule [46,47] requires two circuit evaluations per parameter.

Algorithm 1: Learning of the generative model for a quantum ensemble
1: repeat
2:   Generate latent variables {z_j}_{j=1}^{M_g} sampled from the latent distribution.
3:   Estimate the ground costs ĉ^{(N_s)}_{local,i,j} with N_s shots each.
4:   Calculate the optimal transport plan π_{i,j} by solving the linear programming (3.7).
5:   Calculate the gradients from π_{i,j} using the parameter shift rule.
6:   Update {θ_k}_{k=1}^{N_p} by using the gradients with learning rate ε.
7: until convergence

Performance analysis of the cost and its gradient

In this section, we analyze the performance of the proposed OTL (3.7) and its gradient vector. First, in Sec. 4.1, we numerically study the approximation error of the loss with a focus on its dependence on the intrinsic dimension of the data and the number of qubits, to see whether results similar to Theorem 4 and Corollaries 5 and 6 hold despite the OTL now being a divergence rather than a distance. Then, in Sec. 4.2, we provide numerical and theoretical analyses of the approximation error as a function of the number of measurements (shots). Finally, in Sec. 4.3, we numerically show that the OTL certainly avoids the vanishing gradient issue; i.e., thanks to the locality of the cost, its gradient does not decay exponentially fast. All the analysis in this section focuses on the property of the cost at a certain point of the learning process (say, at the initial time); the performance of the training process will be discussed in the next section. We employ the parameterized unitary matrix U(z, θ) shown in Fig. 1 to construct the implicit generative model (3.6), which is similar to that given in Ref.
[19] except that our model contains the latent variables z. That is, the model is composed of N_L repeated unitaries (we call each unitary the ℓ-th layer):

U(z, θ) = Π_{ℓ=1}^{N_L} W V_{ξ_ℓ,η_ℓ}(z, θ_ℓ),

where θ_ℓ = {θ_{ℓ,j}}_{j=1}^n, ξ_ℓ = {ξ_{ℓ,j}}_{j=1}^n, and η_ℓ = {η_{ℓ,j}}_{j=1}^n are n-dimensional parameter vectors in the ℓ-th layer. We summarize these vectors as θ = {θ_ℓ}_{ℓ=1}^{N_L}, ξ = {ξ_ℓ}_{ℓ=1}^{N_L}, and η = {η_ℓ}_{ℓ=1}^{N_L}. Here θ are trainable parameters and z are latent variables. W is a fixed entangling unitary composed of ladder-structured controlled-Z gates; that is, W operates the two-qubit controlled-Z gate on all adjacent qubits:

W = Π_{i=1}^{n-1} CZ_{i,i+1},

where CZ_{i,i+1} is the controlled-Z gate acting on the i-th and (i+1)-th qubits. The operator V_{ξ_ℓ,η_ℓ}(z, θ_ℓ) consists of single-qubit rotation operators:

V_{ξ_ℓ,η_ℓ}(z, θ_ℓ) = ⊗_{i=1}^n R_{ξ_{ℓ,i}}(θ_{ℓ,i} z_{η_{ℓ,i}}),

where R_{ξ_{ℓ,i}}(θ_{ℓ,i} z_{η_{ℓ,i}}) is the single-qubit Pauli rotation operator with angle θ_{ℓ,i} z_{η_{ℓ,i}} and direction ξ_{ℓ,i} ∈ {X, Y, Z} in the ℓ-th layer, such as R_X(θ_{ℓ,3} z_5) = exp(−i θ_{ℓ,3} z_5 σ_x). The latent-variable indices η_{ℓ,i} ∈ {0, 1, 2, ..., N_z} and the directions ξ_{ℓ,i} ∈ {X, Y, Z} are randomly chosen at the beginning of learning and never changed during learning. Also, we have introduced a constant bias term z_0 = 1 so that the ansatz |φ_θ(z)⟩ = U(z, θ)|0⟩^{⊗n} can take a variety of states (for instance, if z_1, ..., z_{N_z} ≈ 0 in the absence of the bias, then |φ_θ(z)⟩ is limited to states around |0⟩^{⊗n}). We used Qiskit [48] in all the simulation studies.
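The layered ansatz above can be sketched directly with dense matrices for a few qubits. The qubit-ordering convention and helper names below are our own, and the rotation convention R_σ(a) = exp(−i a σ/2) is used for concreteness (it differs from the text's convention only by a rescaling of the angle):

```python
import numpy as np

def pauli_rotation(axis, angle):
    """Single-qubit Pauli rotation exp(-i * angle * sigma / 2)."""
    X = np.array([[0, 1], [1, 0]], dtype=complex)
    Y = np.array([[0, -1j], [1j, 0]])
    Z = np.diag([1.0 + 0j, -1.0])
    sigma = {"X": X, "Y": Y, "Z": Z}[axis]
    return np.cos(angle / 2) * np.eye(2) - 1j * np.sin(angle / 2) * sigma

def cz_ladder(n):
    """Ladder of controlled-Z gates on all adjacent qubit pairs."""
    W = np.eye(2 ** n, dtype=complex)
    for i in range(n - 1):
        CZ = np.eye(2 ** n, dtype=complex)
        for idx in range(2 ** n):
            if (idx >> i) & 1 and (idx >> (i + 1)) & 1:
                CZ[idx, idx] = -1.0
        W = CZ @ W
    return W

def ansatz_state(n, theta, xi, eta, z):
    """|phi_theta(z)> = U(z, theta)|0^n>, with layers W V(z, theta_l);
    rotation angles are theta_{l,i} * z_{eta_{l,i}}, and z_0 = 1 is the
    fixed bias term."""
    z = np.concatenate([[1.0], z])  # prepend bias z_0 = 1
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0
    W = cz_ladder(n)
    for th_l, xi_l, eta_l in zip(theta, xi, eta):
        V = np.array([[1.0 + 0j]])
        for i in range(n):
            V = np.kron(V, pauli_rotation(xi_l[i], th_l[i] * z[eta_l[i]]))
        state = W @ (V @ state)
    return state

rng = np.random.default_rng(0)
n, N_L, N_z = 3, 2, 2
theta = rng.uniform(0, 2 * np.pi, size=(N_L, n))
xi = rng.choice(["X", "Y", "Z"], size=(N_L, n))      # fixed directions
eta = rng.integers(0, N_z + 1, size=(N_L, n))        # fixed latent indices
phi = ansatz_state(n, theta, xi, eta, rng.uniform(0, 1, size=N_z))
print(round(np.linalg.norm(phi), 6))  # 1.0
```

As in the text, the directions ξ and latent indices η are drawn once and then frozen; only θ (and the sampled z) vary.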

Approximation error as a function of the number of training data
We have seen in Sec. 2.2 that, for the p-Wasserstein distance, the convergence rate of its approximation error is of the order O(M^{-1/s}) rather than O(M^{-1/N_x}), where M is the number of training samples, N_x is the dimension of the sample space X, and s can be interpreted as the intrinsic dimension of the data or the latent space dimension. This is desirable, because in general s ≪ N_x. However, the OTL (3.7) is not a distance but a divergence. Hence it is not theoretically clear whether it enjoys a similar error scaling; in this subsection we give numerical evidence to positively support this conjecture.
[Figure 1: The structure of the parameterized quantum circuit (ansatz) used in the performance analysis in Sections 4 and 5. This ansatz consists of repeated layers with a similar structure. In the ℓ-th layer, the single-qubit Pauli rotation operators with angles {θ_{ℓ,j} × z_{η_{ℓ,j}}}_{j=1}^n and directions {ξ_{ℓ,j}}_{j=1}^n are applied to each qubit, followed by a ladder-structured controlled-Z gate. The rotation directions ξ = {ξ_{ℓ,j}} and the latent-variable indices η = {η_{ℓ,j}} are randomly chosen at the beginning of the learning process and never changed during the learning.]

We present the following two types of numerical simulations. The aim of the first one (Experiment A) is to see whether our OTL satisfies a property similar to Eq. (2.7), which describes the difference of two empirical distributions
sampled from the common hidden distribution. The second one (Experiment B) studies the case of Eq. (2.8), which describes the difference between the ideal OTL, assuming infinitely many samples are available, and an empirical one sampled from it. We used the statevector simulator [48] to calculate the OTL assuming an infinite number of measurements; in the next subsection, we will study the influence of a finite number of shots on the performance. Throughout all the numerical experiments, we randomly chose the parameters ξ, η, θ, ξ̄, η̄, θ̄ and did not change these values.

Experiment A. As a quantity analogous to that appearing in Eq. (2.7), we here focus on the expected value of the empirical OTL defined in Eq. (3.7), denoted J^{c_local}_{ξ,η;ξ̄,η̄}(z̄, θ̄ : z, θ; M) in Eq. (4.4), with the estimator defined in Eq. (4.5). In Eq. (4.4), we set ξ̄ = ξ and η̄ = η for the two unitary operators that appear in the argument of L_{c_local} in Eq. (4.5). This indicates that J^{c_local}_{ξ,η;ξ̄,η̄}(z̄, θ̄ : z, θ; M) in Eq. (4.4) would become zero in the limit of an infinite number of training data (M → ∞). The expectation in Eq. (4.4) is taken with respect to the latent variables z̄_i and z_j drawn from the N_z-dimensional uniform distribution U(0,1)^{N_z}, but we numerically approximate it by N_Monte Monte Carlo samplings. Other conditions of the numerical simulation are shown in Table 1.
Figure 2 plots the values of Eq. (4.4) for several N_z (the dimension of the latent variables) and n (the number of qubits). In the figures, the dotted lines show the scaling curve M^{-1/N_z}. Notably, for a large number of training data, the points and dotted lines are almost consistent, regardless of the number of qubits. This implies that the OTL between two different ensembles given by Eq. (4.4) is almost independent of the number of qubits n and depends mainly on the latent dimension N_z, as in the case of the distance-based loss function proven in Corollary 5.
Experiment B. We next turn to the second experiment to confirm that the approximation error of the proposed OTL scales similarly to Eq. (2.8). Specifically, we numerically show the dependence on the number of training data M of the expectation value in Eq. (4.6), where J^{c_local} is defined in Eq. (4.5). In this case, we set different fixed parameters ξ̄ ≠ ξ and η̄ ≠ η for the two unitary operators in Eq. (4.5). The parameters used in the numerical simulation are shown in Table 1.
[Figure 2: For reference, the scaling curves M^{-1/N_z} are added as dotted lines. These graphs show that Eq. (4.4) scales mainly as M^{-1/N_z} and is almost independent of the number of qubits, n.]

The first term of Eq. (4.6) is an ideal quantity assuming that an infinite number of training data is available. Since the first term is independent of the number of training data, the second term is expected to take the following form,
as suggested from Eq. (2.8): a M^{-1/b} + c (Eq. (4.7)). To identify the parameter b, we use the Monte Carlo method to calculate the left-hand side of Eq. (4.7) as a function of M and then execute the curve fitting via a M^{-1/b} + c. We repeat this procedure for several values of the number of qubits n and the latent dimension N_z; see Appendix B for a more detailed discussion. The result of the parameter identification is depicted in Fig. 3, which shows that the fitting parameter b is almost independent of the number of qubits n and scales linearly with the latent dimension N_z. This result is indeed consistent with Eq. (2.8).
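The fitting procedure can be sketched as follows, here on synthetic data with a known decay exponent; curve_fit then recovers an exponent b close to the intrinsic dimension used to generate the data (all data below are synthetic, not the paper's results):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(M, a, b, c):
    """Fitting form a * M^(-1/b) + c."""
    return a * M ** (-1.0 / b) + c

# Synthetic data mimicking an O(M^(-1/N_z)) decay with N_z = 4.
rng = np.random.default_rng(0)
M = np.array([16, 32, 64, 128, 256, 512, 1024], dtype=float)
y = 2.0 * M ** (-1.0 / 4.0) + 0.05 + rng.normal(0, 1e-3, size=M.size)

(a, b, c), _ = curve_fit(model, M, y, p0=[1.0, 2.0, 0.0])
print(round(b, 2))  # close to the true intrinsic dimension 4
```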
Conjecture 11. The approximation error of the proposed OTL (3.7) scales as O(M^{-1/s}), where s is determined by the latent dimension N_z, irrespective of the number of qubits.
This conjecture means that the proposed OTL can be efficiently computed via sampling when the intrinsic dimension of the data and the latent dimension are sufficiently small. Proving this conjecture would be challenging and is left as future work.

Approximation error as a function of the number of shots
The error analysis in Sec. 4.1 assumes that the number of shots is infinite, so that the ground cost between quantum states can be perfectly determined. Here, we analyze the effect of a finite number of shots on the approximation error. The following proposition serves as a basis for the analysis.

Proposition 12. Let ĉ^{(N_s)}_local be an estimator of the ground cost c_local of Eq. (3.3) using N_s samples. Suppose that the supports of the two probability distributions are strictly separated, so that there exists a lower bound g > 0 on the ground cost for any i, j ∈ {1, 2, ..., M}, i.e., c_local(|ψ_i⟩, U(z_j, θ)|0⟩^{⊗n}) > g for all i, j. Then, for any positive constant δ, the inequality (4.9) holds.

We then provide a numerical simulation to show the averaged approximation error as a function of M as well as N_s. Again, we employ the hardware-efficient ansatz shown in Fig. 1 of Sec. 4.1. The purpose of the numerical simulation is to see the dependence on M and N_s of the expectation value in Eq. (4.10), where J^{ĉ^{(N_s)}}_local can be computed via Eq. (4.5). As in the numerical simulation in Sec. 4.1, we approximate the expectation E_{z,z̄∼U(0,1)^{N_z}}[·] by Monte Carlo sampling of z and z̄ from the uniform distribution U(0,1)^{N_z}. Also, we randomly choose the fixed parameters ξ, η, θ, ξ̄, η̄, θ̄ prior to the simulation. The other simulation parameters are given in Table 2. Simulation results are depicted in Fig. 4, and the notable points are summarized as follows.
• In the range of a small number of training data M, the approximation error is roughly proportional to M^{-1/2}.
• In the range of a large number of training data M, the approximation error behaves as √(c_1 ln M) + c_2 with constants c_1 and c_2.
• The dependence of the error on the number of shots N_s is roughly proportional to N_s^{-1/2}.

Appendix D provides an intuitive explanation of these results. In particular, it is important to note that we need to choose a proper number of samples to reduce the approximation error.
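The N_s^{-1/2} dependence is the usual shot-noise scaling of a binomial estimator, which can be checked with a quick simulation (synthetic probabilities, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_p(p_true, n_shots):
    """Finite-shot estimator of a measurement probability p^(k):
    the fraction of 0-outcomes among n_shots Bernoulli trials."""
    return rng.binomial(n_shots, p_true) / n_shots

# Root-mean-square error of the estimator for several shot counts.
# Rescaled by sqrt(N_s), it should stay roughly constant at
# sqrt(p(1-p)) ~ 0.458 for p = 0.3.
p_true, trials = 0.3, 2000
scaled_rmse = {}
for n_shots in [100, 400, 1600]:
    errs = [estimate_p(p_true, n_shots) - p_true for _ in range(trials)]
    scaled_rmse[n_shots] = np.sqrt(np.mean(np.square(errs)) * n_shots)
print(scaled_rmse)
```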

Avoidance of the vanishing gradient issue
In Sec. 3.1 we chose the cost function composed of local measurements as a minimal requirement to avoid the vanishing gradient issue. Note, however, that employing a local cost alone is not enough to avoid this issue; for instance, Ref. [42] proposed using a special type of parameterized quantum circuit called the alternating layered ansatz (ALA) in addition to the local cost, which is proven to avoid the issue. Nevertheless, we here numerically demonstrate that our method can certainly mitigate the decay of the gradient even without such an additional condition, compared to the case with a global cost.
More specifically, we calculated the expectation of the variance of the partial derivative of the OTL (3.7), based on the training ensemble {|ψ_i⟩}_{i=1}^M and the data sampled from the generative model; we refer to this quantity as Eq. (4.11). The partial derivative is calculated using the parameter shift rule [46]. The expectation of the variance is approximated by Monte Carlo calculations with respect to z, ξ, η, and θ, where z is sampled from the uniform distribution U(0,1) and ξ, η, θ are randomly chosen from ξ ∈ {X, Y, Z}^{nN_L}, η ∈ {0, 1, 2, ..., N_z}^{nN_L}, and θ ∈ [0, 2π]^{nN_L}. The structure of the generative model U(z, θ)|0⟩^{⊗n} is the same as that shown in Fig. 1. The derivative is taken with respect to θ_{1,1}. The training ensemble {|ψ_i⟩}_{i=1}^M is prepared via fixed random circuits composed of the operators W, V_1, and V_2, where ζ^i_1 = {ζ^i_{1,j}}_{j=1}^n and ζ^i_2 = {ζ^i_{2,j}}_{j=1}^n are randomly chosen from the uniform distribution on [0, 2π] and fixed during the Monte Carlo calculation. Here CX_{j,k} denotes the controlled-X gate, which operates the X gate on the k-th qubit with the j-th qubit as control, and R_{j,Y} and R_{j,Z} denote single-qubit Pauli rotations on the j-th qubit around the y and z axes, respectively. The other simulation parameters are given in Table 3.
Figure 5 shows the numerical simulation results of the variance (4.11), for the cases where (a) the global cost (3.2) is used and (b) the local cost (3.3) is used. The number of training data is fixed to M = 8. A clear exponential decay of the gradient variance is observed for the global cost, regardless of N_L. In contrast, for the local cost, the relatively shallow circuits with N_L = 10, 25 exhibit approximately constant scaling with respect to n ≥ 10, while the deep circuits with N_L ≥ 50 also exhibit slower decay than the global one and keep a larger variance even when n ≥ 8. This result implies that the OTL with the local ground cost can avoid the gradient vanishing issue, even though the circuit is not specifically designed for this purpose. Note also that the result is consistent with that reported in [49], which studied a cost function composed of a single ground-cost term of our setting.
In addition, Fig. 5(c) shows the variance (4.11) as a function of the number of training data M, in the case of n = 14. In the figure, the points represent the Monte Carlo numerical results and the dotted lines represent the scaling curves M^{-x}, where the exponent x is determined via fitting. This fitting result implies that the gradient obeys the simple statistical scaling with respect to M, and thus the proposed algorithm would enjoy efficient learning even for a large training ensemble.

Demonstration of the generative model training and application to anomaly detection
In this section, we present anomaly detection based on the cost function defined in Eq. (2.3), as a proof of concept of the proposed loss function.

Quantum Anomaly Detection
Anomaly detection is the task of judging whether a given test data x^(t) is anomalous (rare) or not, based on the knowledge learned from the past training data x_i (i = 1, 2, . . ., M), i.e., the generative model. Unlike typical classification tasks, this problem deals with a large imbalance between the number of normal data and that of anomalous data; the former is usually much larger than the latter. Therefore, typical classification methods are not suitable for this task, and some specialized schemes have been widely developed [50].
The anomaly detection problem is important in the field of quantum technology. That is, to realize accurate state preparation and control, we are required to detect contaminated quantum states and remove them as quickly as possible. Previous quantum anomaly detection schemes rely on measurement-based data processing [51, 52], which however requires a large number of measurements, as in the case of quantum state tomography. In contrast, our anomaly detection scheme directly inputs quantum states to the constructed generative model and then diagnoses the abnormality with much fewer measurements.
The following is the procedure for constructing the conventional anomaly detector based on the generative model [53], which we apply to our quantum case. Of these steps, the model probability distribution in Step 1 is constructed by the learning algorithm presented in Sec. 3.2. To design the AS in Step 2, we refer to AnoGAN [54] in classical machine learning. Namely, we define a loss function L(U(z, θ)|0⟩^⊗n, |ψ^(t)⟩) for a test data |ψ^(t)⟩ and the generative model U(z, θ) constructed from the training dataset with the learned parameter θ; we then take the minimum with respect to the latent variable z to calculate the AS: As the loss function L, we use the local ground cost c_local(|ψ^(t)⟩, U(z, θ)|0⟩^⊗n) defined in Eq. (3.3). The above minimization is executed via gradient descent with respect to z, where the gradient is obtained via the parameter shift rule, similarly to the derivative in θ. Algorithm 2 summarizes the procedure.

Algorithm 2 Algorithm to calculate Anomaly Score
1: Initialize the latent variable z.
2: repeat
3:   Calculate the ground cost L from U(z, θ)|0⟩^⊗n and |ψ^(t)⟩ according to Eq. (3.3).
4:   Calculate the gradients {∂L/∂z_k}_{k=1}^{N_z} using the parameter shift rule.
5:   Update z by using the gradient descent method via {∂L/∂z_k}_{k=1}^{N_z}.
6: until convergence
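The minimization over z can be sketched on a toy single-qubit model as follows; the "generator" U(z)|0⟩ = Ry(2πz)|0⟩, the infidelity stand-in for the local ground cost, the learning rate, the iteration count, and the initial z are all illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

# Toy single-qubit sketch of the anomaly-score minimization over the
# latent variable z (all model choices here are illustrative).
def generator(z):
    half = np.pi * z                        # half-angle of Ry(2*pi*z)
    return np.array([np.cos(half), np.sin(half)], dtype=complex)

def loss(z, psi_test):
    # infidelity between the generated state and the test state,
    # used here as a stand-in for the local ground cost
    return 1.0 - abs(np.vdot(generator(z), psi_test)) ** 2

def anomaly_score(psi_test, lr=0.05, iters=300, z0=0.3):
    z = z0
    for _ in range(iters):
        # parameter-shift-style gradient: the gate angle is 2*pi*z, so a
        # shift of pi/2 in the angle is a shift of 1/4 in z; the resulting
        # value is proportional to dL/dz (the 2*pi chain-rule factor is
        # absorbed into the learning rate)
        g = 0.5 * (loss(z + 0.25, psi_test) - loss(z - 0.25, psi_test))
        z -= lr * g
    return loss(z, psi_test)

normal = generator(0.11)                               # on the generator's manifold
anomalous = np.array([np.cos(0.8), 1j * np.sin(0.8)])  # off-manifold state
```

A state reachable by the generator attains a near-zero score after the minimization over z, while the off-manifold state retains a large residual loss for every z; thresholding this score separates the two cases.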

Distributed dataset
The first demonstration is to construct a generative model that learns a quantum ensemble distributed on the equator of the generalized Bloch sphere. That is, the training ensemble (i.e., the normal dataset) {|ψ_j⟩}_{j=1}^M to be learned is set as follows: where φ_j is randomly generated from the uniform distribution on [0, 1] and |x⟩ denotes the x-th basis state in the 2^n-dimensional Hilbert space. Note that the configuration of this ensemble cannot be learned by the existing mixed-state-based quantum anomaly detection schemes [51, 52], because the mixed state corresponding to this ensemble is nearly the maximally mixed state, the learning of which thus does not give us a generative model recovering the original ensemble.
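Since the defining equation is not reproduced above, the following is a hypothetical reconstruction of such an equator ensemble, assuming GHZ-like superpositions of |0…0⟩ and |1…1⟩ with a uniformly random relative phase; the phase convention and the values of n and M are assumptions for illustration only.

```python
import numpy as np

# Hypothetical reconstruction of the equator ensemble: superpositions of
# |0...0> and |1...1> with random relative phases (the exact equation in
# the paper is not reproduced here; n, M, and the phase convention are
# illustrative assumptions).
rng = np.random.default_rng(42)
n, M = 4, 8
dim = 2 ** n

def equator_state(phi):
    psi = np.zeros(dim, dtype=complex)
    psi[0] = 1 / np.sqrt(2)                           # |0...0> component
    psi[-1] = np.exp(2j * np.pi * phi) / np.sqrt(2)   # e^{2 pi i phi}|1...1>
    return psi

ensemble = [equator_state(p) for p in rng.uniform(0, 1, size=M)]

# the ensemble-averaged density matrix is close to
# (|0..0><0..0| + |1..1><1..1|)/2, i.e., highly mixed on the relevant
# two-dimensional subspace: its purity stays well below 1
rho = sum(np.outer(s, s.conj()) for s in ensemble) / M
purity = float(np.trace(rho @ rho).real)
```

Such a low-purity average state is why a mixed-state-based detector cannot distinguish the individual phases φ_j, whereas a generative model over pure states can.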
We employ the same ansatz as that given in Sec. 4 with the parameters shown in Table 4 and construct the generative model according to Algorithm 1.As the optimizer, we take Adam [55] with learning rate 0.01.The number of learning iterations (i.e., the number of the updates of the parameters) is set to 1500 for n = 2 and 10000 for n = 10.
Once the model for the normal ensemble is constructed, it is then used for anomaly detection. Here the set of test data {|ψ^(t)⟩} is given by where θ^(t), φ^(t) ∈ {0, 0.1, 0.2, . . ., 2} (in units of π). We calculate the AS using Algorithm 2. The other simulation parameters are shown in Table 4.
The numerical simulation results in the cases of n = 2 and 10 are presented in Figs. 6 and 7, respectively. Firstly, we see a clear correlation between the AS and the theoretical curve in Fig. 6, implying that the AS is appropriately calculated via the proposed method. In a practical use case, a user defines a threshold of AS depending on the task and then compares the calculated AS with the threshold to identify the anomalous quantum states. For instance, if we set the threshold as AS = 0.3, the test states satisfying 0.3 ≤ θ^(t)/π ≤ 0.7 in Fig. 6 are judged as normal while the others are judged as anomalous. Moreover, the output states of the learned generative model and the result of anomaly detection in the case of n = 10 are shown in Fig. 7. Although the result displayed in (b) might suggest that the learning fails, the output states show correlation with the training states displayed in (a); actually, the output states live on the xy-plane of the generalized Bloch sphere spanned by |0⟩^⊗10 and |1⟩^⊗10. In addition, it is notable that only N_s = 100 shots are enough, even for the case of n = 10, to perform anomaly detection, provided that we obtain an appropriate generative model from the training normal states. This is an advantage in practical situations, as it indicates that the proposed method may scale up with respect to the number of qubits.
Localized dataset

where n is the number of qubits. ∆θ_j and ∆φ_j are sampled from the normal distribution N(µ, σ) and the uniform distribution U(a, b), respectively (µ and σ represent the mean and the variance, respectively). We consider the two cases (µ, σ, a, b) = (0, 0.02, 0, 0.1) for n = 6 and (µ, σ, a, b) = (0, 0.02, 0, 0.2) for n = 10. Note that, with this choice of parameters, the ensemble {|ψ_j⟩}_{j=1}^M is nearly two-dimensionally distributed on the generalized Bloch sphere, as illustrated in Fig. 9 (a, d). The other simulation parameters are shown in Table 5.
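A minimal sketch of the parameter sampling for this ensemble (the n = 6 case) is given below; the number of training states M and the random seed are assumptions, and how the sampled angles enter the state-preparation circuit is not reproduced here.

```python
import numpy as np

# Sampling the perturbation angles for the localized ensemble, n = 6 case:
# Delta theta_j ~ N(mu, sigma), Delta phi_j ~ U(a, b) with
# (mu, sigma, a, b) = (0, 0.02, 0, 0.1); sigma is passed as the scale
# parameter of the normal distribution. M is an illustrative choice.
rng = np.random.default_rng(7)
M = 32
mu, sigma, a, b = 0.0, 0.02, 0.0, 0.1
dtheta = rng.normal(mu, sigma, size=M)
dphi = rng.uniform(a, b, size=M)

# the ensemble is (nearly) two-dimensionally parametrized by these two
# angle families, which is what motivates a latent dimension of N_z = 2
params = np.stack([dtheta, dphi], axis=1)   # shape (M, 2)
```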
To construct a generative model via learning this two-dimensional distribution, we set the dimension of the latent variable as N_z = 2, from the above-mentioned observation on the dimensionality of {|ψ_j⟩}_{j=1}^M. Also, we here take the so-called alternating layered ansatz (ALA), which, together with the use of the local cost, is guaranteed to mitigate the gradient vanishing issue [41, 42]. This ansatz is more favorable than the previous one, which we here call the hardware efficient ansatz (HEA), in view of the possibility to avoid the gradient vanishing issue. Therefore it is worth comparing their learning curves. Typical learning curves are shown in Fig. 8. The blue plots, labeled "local", represent the case where the cost is the local one (3.3) and the ansatz is ALA; we denote this case as L-ALA. On the other hand, the orange plots, labeled "global", represent the case where the cost is the global one (3.2) and the ansatz is HEA; we denote this case as G-HEA. Note that both displayed costs are calculated as the global one, to directly compare them; that is, "local" represents the cost calculated at each iteration based on the global cost, with the parameters that optimize the local cost. We observe that L-ALA has a clear advantage over G-HEA in terms of the convergence speed. This result coincides with that of Sec. 4.3, indicating the advantage of the local cost. In addition to the convergence speed, the final cost of L-ALA is lower than that of G-HEA. Note that the learning performance heavily depends on the initial random seed, yet it was indeed difficult to find a successful setting of G-HEA; actually, in all cases we tried, the trajectory seemed to be trapped in a local minimum, presumably because the gradient variance of G-HEA is much smaller than that of L-ALA. We apply the constructed generative model to the anomaly detection problem. In Fig.
9 (b) and (e), the test quantum states are displayed for the cases of n = 6 and n = 10, respectively. The resultant anomaly scores for each test data are shown in Fig. 9 (c, f). In both cases, we can say that the models are trained appropriately. In particular, the variance of the distribution of {|ψ_j⟩}_{j=1}^M, i.e., the distribution of the blue points in (a, d), is well captured by the width of the dip of the red lines in (c, f). Finally, note that, in this section, we use the QASM simulator for the numerical simulations; the number of shots is 1000 for each measurement in the learning process, and 50 for the anomaly detection task, even for the case of n = 10. Compared to quantum state tomography, these numbers of shots are clearly small. Nonetheless, the proposed method enabled the model to learn the training ensemble with such a small number of shots.
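As a structural illustration of the ALA used above, the following sketch enumerates the qubit pairs acted on by each layer's two-qubit blocks in the generic brick-wall pattern; the block size of two is an assumption, and the contents of each block (rotations and entanglers) are omitted.

```python
# Brick-wall layout of an alternating layered ansatz (ALA): two-qubit
# blocks whose boundaries shift by one qubit on alternate layers, so that
# blocks in successive layers overlap. Block contents are omitted; only
# the qubit pairings are enumerated.
def ala_blocks(n_qubits, n_layers):
    layers = []
    for layer in range(n_layers):
        offset = layer % 2                   # shift odd layers by one qubit
        blocks = [(q, q + 1) for q in range(offset, n_qubits - 1, 2)]
        layers.append(blocks)
    return layers
```

For four qubits and two layers this yields [[(0, 1), (2, 3)], [(1, 2)]], i.e., the second layer's block straddles the boundary between the first layer's blocks, which is what distinguishes the ALA from a single fixed partition.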

Conclusion
In classical machine learning, many generative models have been vigorously studied, but there are only a few studies on quantum generative models for quantum data. This paper offers a new approach for building such a quantum generative model, i.e., a method for learning a quantum ensemble in the unsupervised machine learning setting. For that purpose, we proposed a loss function based on the optimal transport loss (OTL), which can be effectively used even when both the target and model probability distributions are hard to characterize. We also adopted the previously proposed local cost as the ground cost of the OTL, to avoid the vanishing gradient problem and thereby increase the learnability, although this makes the OTL no longer a distance. Hence we have shown that the OTL with the local cost satisfies the properties of a divergence between probability distributions, and confirmed that the proposed OTL is suitable as a cost for a generative model. We then theoretically and numerically analyzed the properties of the proposed OTL. Our analysis indicates that the proposed OTL is a good cost function when the quantum data ensemble has a certain structure, i.e., it is confined in a relatively low-dimensional manifold. In addition, we numerically showed that the OTL can avoid the vanishing gradient issue thanks to the locality of the cost. Finally, we demonstrated the anomaly detection of quantum states, which cannot be handled via existing methods, to show the validity of our method.
There is still much work to be done in the direction of this paper. First, this paper assumed that quantum processing is performed individually for each quantum state, but it would be interesting to consider other settings, such as allowing coherent access [1] or quantum memory [3]. In this scenario, the anomaly detection technique could be used for processing quantum states that come from an external quantum device, such as a quantum sensor. Also, it would be interesting to extend the cost to the entropy-regularized Wasserstein distance [56-60], which is effective for dealing with higher-dimensional data in classical generative models. In this case, however, the localized cost proposed in this paper does not satisfy the axioms of a distance, just as in the case demonstrated in this paper; this is nevertheless surely an interesting direction for future work. Lastly, note that classical data can also be within the scope of our method, provided it can be effectively embedded into quantum states; then the anomaly detection of financial data, which usually requires high dimensionality for precise detection, might be a suitable target.
Proof. Here, we only show the proof of L_{c_1}(α_M, β_M) − L_{c_2}(α_M, β_M) < t. The other inequality, L_{c_2}(α_M, β_M) − L_{c_1}(α_M, β_M) < t, can be proved in the same way.

for any components i, j ∈ {1, 2, . . ., M}, where the notations are the same as in Lemma 14. Then, the following inequality holds for the corresponding optimal transport losses:

Proof. For simplicity, we give the proof only for the solutions of the optimal transport plan in which M components take the value 1/M and the other components take the value 0; the proofs for the other solutions can be performed in the same manner. For such a solution, the number of non-zero components is at most |A_1 ∪ A_2| ≤ 2M, and using the assumption Eq. (C.5), the probability that those non-zero components have an error within t is lower bounded as P(∩_{(i,j)∈A_1∪A_2} |c_1(x_i, y_j) − c_2(x_i, y_j)| < t) > 1 − 2Mδ.

Proof. Under the same setting as Lemma 13, the Chebyshev inequality for the random variable √X leads, with some positive constant k,

L_{ξ,η}(ẑ, θ̂ : z, θ; M) (Empirical Loss)

Figure 2 :
Results of numerical simulations on the relationship between the number of training data and the OTL given by Eq. (4.4), for several qubit numbers n. Each subgraph shows the results for various latent dimensions N_z. For reference, the scaling curves M^{-1/N_z} are added as dotted lines. These graphs show that Eq. (4.4) mainly scales as M^{-1/N_z} and is almost independent of the number of qubits n.

Figure 3 :
The simulation results of the fitting parameter b as a function of (a) the number of qubits n and (b) the latent dimension N_z. The fitting parameter b is obtained by fitting the second term of Eq. (4.6) using Eq. (4.7). Subfigure (a) shows that b is almost independent of the number of qubits n, while subfigure (b) shows that b scales linearly with the latent dimension N_z. The results obtained in Experiments A and B suggest the following conjecture: Conjecture 11. The scaling of the approximation error of the optimal transport loss (3.7) with respect to the number of training data M is determined by the latent dimension N_z as follows:

Figure 4 :
Simulation results of the approximation error of the OTL due to the finite number of shots, defined in Eq. (4.10). The left and right panels show the dependence of the error on the number of training data M and the number of shots N_s, respectively. The results for different latent dimensions N_z are shown in figures (a), (b), and (c). For reference, we added the curve M^{-1/2} as the dotted line in the left panels. The fitting result of the points N_s = 128 with the curve c_1 ln(M) + c_2 is added as the dashed line in the left panels. Also, we added the curve N_s^{-1/2} as the dotted line in the right panels.

Figure 5 :
The variance of the gradient of the cost function as a function of the number of qubits n, for the case of (a) the global cost (3.2) and (b) the local cost (3.3). Figure (c) shows the dependence on the number of training data M, for the case of the local cost. Several curves are depicted for different numbers of layers N_L. A clear exponential decay is observed in (a), but is avoided in (b). A polynomial decay (∝ M^{-1}) is observed in (c), implying the simple statistical scaling.

1. (Distribution estimation): Construct a model probability distribution from the normal dataset.
2. (Anomaly score design): Define an Anomaly Score (AS) based on the model distribution of normal data.
3. (Threshold determination): Set a threshold of AS for diagnosing the abnormality.
In each of Figs. 6 and 7, the training ensemble is shown in panel (a), where each blue point corresponds to a generalized Bloch vector. Some output states of the constructed generative model, corresponding to different values of z ∈ [0, 1], are shown in panel (b). Both the red and blue points in panel (c) represent the test data states (5.3). Panel (d) shows the calculated AS, where the blue and red plots correspond to the blue and red points in (c), respectively. The dotted line in (d) illustrates the theoretically expected values assuming that the model completely learns the training data.

Figure 6 :
The simulation results in the case of n = 2, for the distributed training ensemble. (a) Generalized Bloch vector representation of the training ensemble. (b) Output states from the generative model with latent variables z ∈ [0, 1]. (c) Test data (both red and blue points). (d) Anomaly scores for different test data. The blue and red plots correspond to the blue and red points in (c), respectively. The values of the angles θ^(t) and φ^(t) are given on the bottom and top horizontal axes, respectively. The dotted line represents the theoretical value of the AS.

Figure 7 :
The simulation results in the case of n = 10, for the distributed training ensemble. (a) Generalized Bloch vector representation of the training states. (b) Output states from the generative model with latent variables z ∈ [0, 1]. (c) Test data (both red and blue points). (d) Anomaly scores for different test data. The blue and red plots correspond to the blue and red points in (c), respectively. The values of the angles θ^(t) and φ^(t) are given on the bottom and top horizontal axes, respectively.

Figure 8 :
Learning curves for the cases (a) n = 6 and (b) n = 10. The values of the OTL are calculated with the parameters at each iteration, trained with the local cost (3.3) or the global cost (3.2). Note that the displayed values are the costs calculated with the global one (3.2) for both the blue and orange cases, for fair comparison. The error bars represent 1 standard deviation.

Figure 9 :
The simulation results for the localized training ensemble, in the cases of (a, b, c) n = 6 and (d, e, f) n = 10. (a, d) Generalized Bloch vector representation of the training states. (b, e) Test data (both red and blue points). (c, f) Anomaly scores for different test data. The blue and red plots correspond to the blue and red points in (b, e), respectively. The values of the angles θ^(t) and φ^(t) are given on the bottom and top horizontal axes, respectively. The dotted line represents the theoretical value of the AS.
is applicable.

Algorithm 1 Learning Algorithm with Quantum Optimal Transport Loss (3.7)
Input: Quantum circuit model U(z, θ) with initial parameters θ, learning rate ε, ensemble {|ψ_i⟩}
Output: A quantum circuit that outputs an ensemble approximating the input ensemble
1: repeat
2:

Table 1 :
List of parameters for the numerical simulation of Sec. 4.1

Table 2 :
List of parameters for the numerical simulation of Sec. 4.2

The proof is shown in Appendix C. Proposition 12 states that the approximation error of the OTL is upper bounded by a constant of the order O(√(M/N_s)), under the condition M ≫ 1 and N_s ≫ 1. Therefore, if Conjecture 11 is true, the approximation error due to the finiteness of N_s and M is upper bounded by O(M^{-1/N_z}) + O(√(M/N_s)), where N_z is the latent dimension.

Table 3 :
List of parameters for the numerical simulation of Sec. 4.3

Table 5 :
List of parameters for numerical simulation of Sec.
L_c and √p = L_c, and comparing Eqs. (C.5) and (C.9), we obtain Proposition 12 by setting