1 Introduction

Among the recent rapid progress of quantum algorithms for both noisy near-term and future fault-tolerant quantum devices, quantum machine learning (QML) has attracted particular attention. QML is largely categorized into two regimes according to the type of data, which can roughly be called classical data and quantum data. The former has the conventional meaning used in the classical case; in the supervised learning scenario, for instance, a quantum system is trained to give a prediction for given classical data such as an image. As for the latter, on the other hand, the task is to predict some properties of a given quantum state drawn from a set of states, e.g., the phase of a many-body quantum state, again in the supervised learning scenario. Owing to the obvious difficulty of directly representing a large quantum state classically, some quantum advantages have been proven in QML for quantum data (Aharonov et al. 2022; Wu et al. 2021; Huang et al. 2022).

While the above paragraph focused on the supervised learning setting, the success of unsupervised learning in classical machine learning, particularly generative modeling, is also notable; a variety of algorithms have demonstrated strong performance in several applications, such as image generation (Bao et al. 2017; Brock et al. 2018; Kulkarni et al. 2015), molecular design (Gómez-Bombarelli et al. 2018), and anomaly detection (Zhou and Paffenroth 2017). Hence, it is quite reasonable that several quantum unsupervised learning algorithms have been actively developed, such as the quantum circuit Born machine (QCBM) (Benedetti et al. 2019; Coyle et al. 2020), the quantum generative adversarial network (QGAN) (Lloyd and Weedbrook 2018; Dallaire-Demers and Killoran 2018), and the quantum autoencoder (QAE) (Romero et al. 2017; Wan et al. 2017). Also, Ref. Dallaire-Demers and Killoran (2018) studied the generative modeling problem for quantum data; the task is to construct a model quantum system producing a set of quantum states, i.e., a quantum ensemble, that approximates a given quantum ensemble. The model quantum system contains latent variables, the change of which corresponds to a change of the output quantum state of the system. In the classical case, such a generative model governed by latent variables is called an implicit model. It is known that, to efficiently train an implicit model, it is often preferable to minimize a distance between the model dataset and the training dataset, rather than minimizing, e.g., the divergence between two probability distributions. The optimal transport loss (OTL), which typically leads to the Wasserstein distance, is suitable for measuring the distance between two datasets; indeed, a quantum version of the Wasserstein distance was proposed in Zhou et al. (2022) and De Palma et al. (2021) and was applied to construct a generative model for quantum ensembles in the QGAN framework (Chakrabarti et al. 2019; Kiani et al. 2022).

Along this line of research, in this paper we also focus on the generative modeling problem for quantum ensembles. We are motivated by the fact that the above-mentioned existing works employ the Wasserstein distance defined between two mixed quantum states corresponding to the training and model quantum ensembles, where each mixed state is obtained by compressing all elements of the quantum ensemble into a single density matrix. This is problematic, because the compression loses much of the information of the ensemble. For instance, single-qubit pure states uniformly distributed on the equator of the Bloch sphere are compressed to the maximally mixed state, from which the original ensemble clearly cannot be recovered. Hence, learning a single mixed state produced from the training ensemble does not give us a model that approximates the original training ensemble.

In this paper, we therefore propose a new quantum OTL that directly measures the difference between two quantum ensembles. The generative model is then obtained by minimizing this quantum OTL between a training quantum ensemble and the ensemble of pure quantum states produced by the model. As the generative model, we use a parameterized quantum circuit (PQC) that contains tunable parameters and latent variables, both of which are supplied as the angles of single-qubit rotation gates. A notable feature of the proposed OTL is that it takes the form of a sum of local functions, each operating on a few neighboring qubits. This condition (i.e., the locality of the cost) is indeed necessary to train the model without suffering from the so-called vanishing gradient issue (McClean et al. 2018), meaning that the gradient vector with respect to the parameters decreases exponentially fast as the number of qubits increases.

Using the proposed quantum OTL, which will be formally defined in Sect. 3, we show the following results. The first result is given in Sect. 4, which provides a performance analysis of the OTL and its gradient from several aspects, e.g., scaling properties of the OTL as a function of the number of training data and the number of measurements. We also numerically confirm that the gradient of the OTL is free from the vanishing gradient issue. The second result is provided in Sect. 5, showing some examples of constructing a generative model for a quantum ensemble by minimizing the OTL. This demonstration includes an application to an anomaly detection problem for quantum data; that is, once a generative model is constructed by learning a given quantum ensemble, it can be used to detect an anomalous quantum state by measuring the distance of this state to the output ensemble of the model. Section 6 gives concluding remarks. Some supporting materials, including the proofs of theorems, are given in the Appendix.

2 Preliminaries

In this section, we first review the implicit generative model for classical machine learning problems in Sect. 2.1. Next, Sect. 2.2 is devoted to describing the general OTL, which can be effectively used as a cost function to train a generative model.

2.1 Implicit generative model

A generative model is designed so that it approximates the unknown probability distribution that produces a given training dataset. The basic design strategy is as follows: assuming a probability distribution \(\alpha (\varvec{x})\) behind the given training dataset \(\{\varvec{x}_i\}_{i=1}^{M}\in \mathcal {X}^{M}\), where \(\mathcal {X}\) denotes the space of random variables, we prepare a parameterized probability distribution \(\beta _{\varvec{\theta }}(\varvec{x})\) and learn the parameters \(\varvec{\theta }\) that minimize an appropriate loss function defined on the training dataset.

The implicit generative modeling approach does not give us an explicit form of \(\beta _{\varvec{\theta }}(\varvec{x})\). An important feature of the implicit generative model is that it can easily describe a probability distribution whose random variables follow a relatively simple distribution but are confined to a hidden low-dimensional manifold; also, the data-generation process can be interpreted as a physical process from a latent variable to the data (Bottou et al. 2018). Examples of the implicit generative model include Generative Adversarial Networks (Goodfellow et al. 2014). This paper focuses on the implicit generative model.

A more specific description of the implicit generative model is as follows. An implicit generative model is expressed as a map of a latent random variable \(\varvec{z}\) onto \(\mathcal {X}\); \(\varvec{z}\) resides in a latent space \(\mathcal {Z}\) whose dimension \(N_z\) is significantly smaller than that of the sample space, \(N_x\). The latent random variable \(\varvec{z}\) follows a known distribution \(\gamma (\varvec{z})\) such as a uniform distribution or a Gaussian distribution. That is, the implicit model distribution is given by \(\beta _{\varvec{\theta }}=G_{\varvec{\theta }}{\#}\gamma \), where \(\#\) is called the push-forward operator (Peyre and Cuturi 2019a), which moves the distribution \(\gamma \) on \(\mathcal {Z}\) to a probability distribution on \(\mathcal {X}\) through the map \(G_{\varvec{\theta }}\). This implicit generative model is trained so that the set of samples generated from the model distribution is close to the set of training data, by adjusting the parameters \(\varvec{\theta }\) to minimize an appropriate cost function \(\mathcal {L}\) as follows:

$$\begin{aligned} \varvec{\theta }^\star = \underset{\varvec{\theta }}{\text {arg min}}\,\mathcal {L}(\hat{\alpha }_{M},\hat{\beta }_{\varvec{\theta },{M_g}}). \end{aligned}$$
(2.1)

\(\hat{\alpha }_{M}(\varvec{x})\) and \(\hat{\beta }_{\varvec{\theta },{M_g}}=G_{\varvec{\theta }}\#\hat{\gamma }_{M_g}(\varvec{z})\) denote empirical distributions defined with the sampled data \(\{\varvec{x}_i\}_{i=1}^{M}\) and \(\{\varvec{z}_i\}_{i=1}^{M_g}\), which follow the probability distributions \(\alpha (\varvec{x})\) and \(\gamma (\varvec{z})\), respectively:

$$\begin{aligned} \hat{\alpha }_{M}(\varvec{x}) = \frac{1}{{M}}\sum _{i=1}^{M} \delta (\varvec{x}-\varvec{x}_i), ~~~ \hat{\gamma }_{M_g}(\varvec{z}) = \frac{1}{{M_g}}\sum _{i=1}^{M_g} \delta (\varvec{z}-\varvec{z}_i). \end{aligned}$$
(2.2)
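As a concrete illustration of the push-forward construction \(\beta _{\varvec{\theta }}=G_{\varvec{\theta }}\#\gamma \) and of the empirical distributions (2.2), the following minimal Python sketch draws latent samples from a uniform \(\gamma \) and pushes them through a map \(G_{\varvec{\theta }}\); the function name and the toy linear map are illustrative assumptions, not part of the original method.

```python
import numpy as np

def sample_implicit_model(G_theta, nz, M_g, seed=0):
    """Sketch: sample from an implicit model beta_theta = G_theta # gamma by
    drawing latent variables z ~ gamma (here uniform on [0,1]^nz, an assumption)
    and pushing them through the map G_theta into the data space X."""
    rng = np.random.default_rng(seed)
    z_samples = rng.uniform(0.0, 1.0, size=(M_g, nz))    # samples defining gamma_hat
    return np.array([G_theta(z) for z in z_samples])     # samples defining beta_hat

# Toy usage: a linear map embedding a 2-dimensional latent space into a 5-dimensional X
A = np.arange(10).reshape(5, 2) / 10.0
model_samples = sample_implicit_model(lambda z: A @ z, nz=2, M_g=100)
```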

2.2 Optimal transport loss

The OTL is used in various fields such as image analysis, natural language processing, and finance (Ollivier et al. 2014; Santambrogio 2015; Peyre and Cuturi 2019a, b). In particular, the OTL is widely used as a loss function in generative modeling, mainly because it is applicable even when the supports of the probability distributions do not match, and because it naturally incorporates the distance in the sample space \(\mathcal {X}\) (Montavon et al. 2016; Bernton et al. 2017; Arjovsky et al. 2017; Tolstikhin et al. 2017; Genevay et al. 2017; Bousquet et al. 2017). The OTL is defined as the minimum cost of moving a probability distribution \(\alpha \) to another distribution \(\beta \):

Definition 1

(Optimal Transport Loss; Kantorovich (1942))

$$\begin{aligned} \mathcal {L}_c(\alpha ,\beta ) =&\min _{\pi } \int c(\varvec{x},\varvec{y}) d\pi (\varvec{x},\varvec{y}), \nonumber \\ \mathrm {subject\ to} \quad&\int \pi (\varvec{x},\varvec{y}) d\varvec{x}=\beta (\varvec{y}), \nonumber \\&\int \pi (\varvec{x},\varvec{y}) d\varvec{y}=\alpha (\varvec{x}), \pi (\varvec{x},\varvec{y}) \ge 0, \end{aligned}$$
(2.3)

where \(c(\varvec{x},\varvec{y}) \ge 0\) is a non-negative function on \(\mathcal {X}\times \mathcal {X}\) that represents the transport cost from \(\varvec{x}\) to \(\varvec{y}\) and is called the ground cost. Also, we call the \(\pi \) that minimizes \(\mathcal {L}_c(\alpha ,\beta )\) the optimal transport plan.

In general, the OTL does not satisfy the axioms of a metric between probability distributions; but it does when the ground cost is expressed in terms of a metric function as follows:

Definition 2

(p-Wasserstein distance; Villani (2009)) When the ground cost \(c(\varvec{x},\varvec{y})\) is expressed as \(c(\varvec{x},\varvec{y})=d(\varvec{x},\varvec{y})^p\) with a metric function \(d(\varvec{x},\varvec{y})\) and a real positive constant p, then the OTL satisfies the axioms of metric and

$$\begin{aligned} \mathcal {W}_p(\alpha ,\beta ) = \mathcal {L}_{d^p}(\alpha ,\beta )^{1/p} \end{aligned}$$
(2.4)

is called the p-Wasserstein distance.

The p-Wasserstein distance satisfies the conditions of a metric between probability distributions. That is, for arbitrary probability distributions \(\alpha , \beta , \gamma \), the following properties hold: \(\mathcal {W}_p(\alpha ,\beta )\ge 0\), \(\mathcal {W}_p(\alpha ,\beta )=\mathcal {W}_p(\beta ,\alpha )\), and \(\mathcal {W}_p(\alpha ,\beta )=0\Leftrightarrow \alpha = \beta \); it also satisfies the triangle inequality \(\mathcal {W}_p(\alpha ,\gamma )\le \mathcal {W}_p(\alpha ,\beta )+\mathcal {W}_p(\beta ,\gamma )\).

In general it is difficult to minimize the OTL \(\mathcal {L}_c(\alpha ,\beta _{\varvec{\theta }})\) for the probability distributions \(\alpha \) and \(\beta _{\varvec{\theta }}\) themselves. Instead, as mentioned in Eq. 2.1, we minimize an approximation of the OTL via the empirical distributions (2.2):

Definition 3

(Empirical estimator for OTL; Villani (2009))

$$\begin{aligned}&\mathcal {L}_c\left( \hat{\alpha }_{M}, \hat{\beta }_{\varvec{\theta },{M_g}} \right) = \min _{\{\pi _{i,j}\}_{i,j=1}^{{M},{M_g}} } \sum _{i=1}^{{M}} \sum _{j=1}^{{M_g}} c(\varvec{x}_i,G_{\varvec{\theta }}(\varvec{z}_j))\pi _{i,j}, \nonumber \\&\mathrm {subject\ to} \quad \sum _{i=1}^{M}\pi _{i,j} = \frac{1}{{M_g}}, \sum _{j=1}^{M_g}\pi _{i,j} = \frac{1}{{M}}, \pi _{i,j} \ge 0. \end{aligned}$$
(2.5)
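The minimization in Eq. 2.5 is a finite linear program over the transport plan \(\pi _{i,j}\). As a minimal sketch (assuming the ground-cost matrix \(c(\varvec{x}_i,G_{\varvec{\theta }}(\varvec{z}_j))\) has already been evaluated), it can be solved with a generic LP solver; the function below is illustrative and uses SciPy's linprog rather than a dedicated optimal-transport library.

```python
import numpy as np
from scipy.optimize import linprog

def empirical_otl(cost_matrix):
    """Solve the linear program of Eq. (2.5) for a given ground-cost matrix.

    cost_matrix: array of shape (M, Mg) with entries c(x_i, G_theta(z_j)).
    Returns the value of the empirical optimal transport loss."""
    M, Mg = cost_matrix.shape
    c = cost_matrix.ravel()                       # variables pi_{i,j}, index i*Mg + j
    A_col = np.zeros((Mg, M * Mg))                # sum_i pi_{i,j} = 1/Mg for each j
    for j in range(Mg):
        A_col[j, j::Mg] = 1.0
    A_row = np.zeros((M, M * Mg))                 # sum_j pi_{i,j} = 1/M for each i
    for i in range(M):
        A_row[i, i * Mg:(i + 1) * Mg] = 1.0
    A_eq = np.vstack([A_col, A_row])
    b_eq = np.concatenate([np.full(Mg, 1.0 / Mg), np.full(M, 1.0 / M)])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Example with a random 4 x 6 ground-cost matrix
rng = np.random.default_rng(1)
print(empirical_otl(rng.uniform(size=(4, 6))))
```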

The empirical estimator \(\mathcal {L}_c( \hat{\alpha }_{M},\hat{\beta }_{\varvec{\theta },{M}})\) converges to \(\mathcal {L}_c(\alpha ,\beta _{\varvec{\theta }})\) in the limit \({M=M_g}\rightarrow \infty \). In general, the speed of this convergence is of the order of \(O(M^{-1/N_x})\) with \(N_x\) the dimension of the sample space \(\mathcal {X}\) (Dudley 1969), but the p-Wasserstein distance enjoys the following convergence law (Weed and Bach 2019).

Theorem 1

(Convergence rate of p-Wasserstein distance) For the upper Wasserstein dimension \(d_p^*(\alpha )\) (which is given in Definition 4 of Weed and Bach (2019)) of the probability distribution \(\alpha \), the following expression holds when s is larger than \(d_p^*(\alpha )\):

$$\begin{aligned} \mathbb {E}\left[ \mathcal {W}_p(\alpha ,\hat{\alpha }_{M})\right] \lesssim O({M}^{-1/s}), \end{aligned}$$
(2.6)

where the expectation \(\mathbb {E}\) is taken with respect to the samples defining the empirical distribution \(\hat{\alpha }_{M}\).

Intuitively, the upper Wasserstein dimension \(d_p^*(\alpha )\) can be interpreted as the support dimension of the probability distribution \(\alpha \), which corresponds to the dimension of the latent space, \(N_z\), in the implicit generative model. Exploiting the metric properties of the p-Wasserstein distance, the following corollaries are immediately derived from Theorem 1:

Corollary 1

(Convergence rate of p-Wasserstein distance between empirical distributions sampled from a common distribution) Let \(\hat{\alpha }_{1,M}\) and \(\hat{\alpha }_{2,M}\) be two different empirical distributions sampled from a common distribution \(\alpha \). The number of samples is M in both empirical distributions. Then the following expression holds for \(s > d_p^*(\alpha )\):

$$\begin{aligned} \mathbb {E}\left[ \mathcal {W}_p(\hat{\alpha }_{1,M},\hat{\alpha }_{2,{M}})\right] \lesssim O({M}^{-1/s}), \end{aligned}$$
(2.7)

where the expectation \(\mathbb {E}\) is taken with respect to the samples defining the empirical distributions \(\hat{\alpha }_{1,M}\) and \(\hat{\alpha }_{2,M}\).

Corollary 2

(Convergence rate of p-Wasserstein distance between different empirical distributions) Suppose that the upper Wasserstein dimension of the probability distributions \(\alpha \) and \(\beta _{\varvec{\theta }}\) is at most \(d_p^*\), then the following expression holds for \(s>d_p^*\):

$$\begin{aligned} \mathbb {E} \left[ |\mathcal {W}_p(\alpha ,\beta _{\varvec{\theta }})-\mathcal {W}_p(\hat{\alpha }_{M},\hat{\beta }_{\varvec{\theta },{M}}) |\right] \lesssim O({M}^{-1/s}), \end{aligned}$$
(2.8)

where the expectation \(\mathbb {E}\) is taken with respect to the samples defining the empirical distributions \(\hat{\alpha }_{M}\) and \(\hat{\beta }_{\varvec{\theta },{M}}\).

These corollaries indicate that the empirical estimator (2.5) is a good estimator if the intrinsic dimension of the training data and the dimension of the latent space \(N_z\) are sufficiently small, because the Wasserstein dimension \(d_p^*\) is essentially the intrinsic dimension of the training data or the latent dimension. In Sect. 4.1, we numerically observe that similar convergence laws hold even when the OTL is not the p-Wasserstein distance.

3 Learning algorithm of generative model for quantum ensemble

In Sect. 3.1, we define the new quantum OTL that can be suitably used in the learning algorithm of the generative model for quantum ensemble. The learning algorithm is provided in Sect. 3.2.

3.1 Optimal transport loss with local ground cost

Our idea is to directly use Eq. 2.5, but with a ground cost for quantum states, \(c(| \psi \rangle , | \phi \rangle )\), rather than one for classical data vectors, \(c(\varvec{x}, \varvec{y})\). This enables us to define the OTL between quantum ensembles \(\{| \psi _i \rangle \}\) and \(\{| \phi _j \rangle \}\) as follows:

$$\begin{aligned} \mathcal {L}_c\left( \{| \psi _i \rangle \}, \{ | \phi _j \rangle \} \right) =&\min _{\{\pi _{i,j}\} } \sum _{i,j} c\left( | \psi _i \rangle , | \phi _j \rangle \right) \pi _{i,j}, \nonumber \\ \mathrm {subject\ to} \quad&\sum _{i}\pi _{i,j} = q_j, ~~ \sum _{j}\pi _{i,j} = p_i, ~~ \pi _{i,j} \ge 0, \end{aligned}$$
(3.1)

where \(p_i\) and \(q_j\) are the probabilities with which \(| \psi _i \rangle \) and \(| \phi _j \rangle \) appear, respectively. Note that we could instead define an OTL between the corresponding mixed states \(\sum _i p_i | \psi _i \rangle \langle \psi _i |\) and \(\sum _j q_j | \phi _j \rangle \langle \phi _j |\) or some modification of them, as discussed in Chakrabarti et al. (2019); but, as mentioned in Sect. 1, such mixed states lose the original configuration of the ensemble (e.g., single-qubit pure states uniformly distributed on the equator of the Bloch sphere) and thus are not applicable to our purpose.

Then our question is how to define the ground cost \(c(| \psi \rangle , | \phi \rangle )\). An immediate choice might be the trace distance:

Definition 4

(Trace distance for pure states; Nielsen and Chuang (2000))

$$\begin{aligned} c_{\textrm{tr}}(|{\psi }\rangle ,|{\phi }\rangle )&= \sqrt{1-|\langle \psi |{\phi }\rangle |^2}. \end{aligned}$$
(3.2)

Because the trace distance satisfies the axioms of a metric, we can define the p-Wasserstein distance for quantum ensembles,

$$\begin{aligned} \mathcal {W}_p( \{| \psi _i \rangle \}, \{ | \phi _i \rangle \} ) = \mathcal {L}_{d^p}( \{| \psi _i \rangle \}, \{ | \phi _i \rangle \} )^{1/p}, \end{aligned}$$

which allows us to have some useful properties described in Corollary 1. It is also notable that the trace distance is relatively easy to compute on a quantum computer, using, e.g., the swap test (Buhrman et al. 2001) or the inversion test (Havlíček et al. 2019).

We now give an important remark. As will be formally described, our goal is to find a quantum circuit that produces a quantum ensemble (by changing latent variables) which best approximates a given quantum ensemble. This task can be executed by gradient descent for a parameterized quantum circuit, but a naive setting leads to the vanishing gradient issue, meaning that the gradient vector decays to zero exponentially fast with respect to the number of qubits (McClean et al. 2018). There have been several proposals in the literature to avoid this issue, e.g., Cerezo et al. (2021), but a common prerequisite is that the cost should be a local one. To explain what this means, let us consider the case where \(| \phi \rangle \) is given by \(| \phi \rangle =U| 0 \rangle ^{\otimes n}\), where U is a unitary matrix (which will be a parameterized unitary matrix \(U(\varvec{\theta })\) defining the generative model) and n is the number of qubits. Then the trace distance is based on the fidelity \(|\langle \psi |U|{0}\rangle ^{\otimes n}|^2\). This is the probability of obtaining all zeros via a global measurement of the state \(U^\dagger | \psi \rangle \) in the computational basis, which means that the trace distance is a global cost; accordingly, the fidelity-based learning method suffers from the vanishing gradient issue. On the other hand, we find that the following cost function is based on localized fidelity measurements.

Definition 5

(Ground cost for quantum states only with local measurements; Khatri et al. (2019); Sharma et al. (2020))

$$\begin{aligned}&c_\textrm{local}(| \psi \rangle , | \phi \rangle ) = c_\textrm{local}(| \psi \rangle , U| 0 \rangle ^{\otimes n}) \nonumber \\&\qquad \qquad \qquad \quad = \sqrt{\frac{1}{n}\sum _{k=1}^n(1-p^{(k)})}, \nonumber \\&p^{(k)} = \textrm{Tr} \left[ P_0^k U^\dagger | \psi \rangle \langle \psi |U\right] , \nonumber \\&P_0^k = \mathbb {I}_1\otimes \mathbb {I}_2\otimes \cdots \otimes \overbrace{| 0 \rangle \langle 0 |_k}^{k\text {-}\mathrm {th\ qubit}}\otimes \cdots \otimes \mathbb {I}_n, \end{aligned}$$
(3.3)

where n is the number of qubits. Also, \(\mathbb {I}_i\) and \(| 0 \rangle \langle 0 |_i\) denote the identity operator and the projection operator that act on the i-th qubit, respectively; thus \(p^{(k)}\) represents the probability of getting 0 when observing the k-th qubit.
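For reference, the local cost (3.3) can be evaluated classically from statevectors as in the following minimal numpy sketch; the function name and the qubit-ordering convention (qubit k is bit k of the basis index) are our own assumptions, and on hardware \(p^{(k)}\) would instead be estimated from single-qubit measurement statistics.

```python
import numpy as np

def local_cost(psi, U, n):
    """Statevector sketch of c_local(|psi>, U|0...0>) in Eq. (3.3).

    psi: statevector of |psi>, shape (2**n,)
    U:   unitary matrix defining |phi> = U|0...0>, shape (2**n, 2**n)
    Convention (assumption): qubit k corresponds to bit k of the basis index."""
    rotated = U.conj().T @ psi                     # U^dagger |psi>
    probs = np.abs(rotated) ** 2                   # computational-basis probabilities
    idx = np.arange(2 ** n)
    cost_sq = 0.0
    for k in range(n):
        p_k = probs[((idx >> k) & 1) == 0].sum()   # p^(k): qubit k measured as 0
        cost_sq += (1.0 - p_k) / n
    return np.sqrt(cost_sq)
```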

Equation 3.3 is certainly a local cost, and thus it can be used to realize effective learning free from the vanishing gradient issue, provided that some additional conditions (described in Sect. 5) are satisfied. Importantly, however, \(c_\textrm{local}(| \psi \rangle ,| \phi \rangle )\) is not a distance between the two quantum states, because it is not symmetric and does not satisfy the triangle inequality, whereas the trace distance (3.2) satisfies the axioms of a distance. Yet \(c_\textrm{local}(| \psi \rangle ,| \phi \rangle )\) is always non-negative and becomes zero only when \(| \psi \rangle =| \phi \rangle \), meaning that \(c_\textrm{local}(| \psi \rangle ,| \phi \rangle )\) functions as a divergence. We can then prove that, in general, an OTL defined with a divergence ground cost also functions as a divergence, as follows. The proof is given in Appendix A.

Proposition 1

When the ground cost \(c(\varvec{x},\varvec{y})\) is a divergence satisfying

$$\begin{aligned} c(\varvec{x},\varvec{y})&\ge 0, \nonumber \\ c(\varvec{x},\varvec{y})&= 0 ~\text {iff}~ \varvec{x} = \varvec{y}, \end{aligned}$$
(3.4)

then the OTL \(\mathcal {L}_c(\alpha ,\beta )\) with \(c(\varvec{x},\varvec{y})\) is also a divergence. That is, \(\mathcal {L}_c(\alpha ,\beta )\) satisfies the following properties for arbitrary probability distributions \(\alpha \) and \(\beta \):

$$\begin{aligned} \mathcal {L}_c(\alpha ,\beta )&\ge 0, \nonumber \\ \mathcal {L}_c(\alpha ,\beta )&= 0 ~\text {iff}~ \alpha =\beta . \end{aligned}$$
(3.5)

Hence, the OTL \(\mathcal {L}_c\left( \{| \psi _i \rangle \}, \{ | \phi _i \rangle \} \right) \) in Eq. 3.1, defined with the local ground cost \(c_\textrm{local}(| \psi \rangle , | \phi \rangle )\) of Eq. 3.3, functions as a divergence. This means that \(\mathcal {L}_c\left( \{| \psi _i \rangle \}, \{ | \phi _i \rangle \} \right) \) can be suitably used for evaluating the difference between a given quantum ensemble and the set of output states of the generative model. At the same time, recall that, for the purpose of avoiding the vanishing gradient issue, we had to give up the global fidelity measure and, accordingly, the distance property of the OTL. Hence we cannot directly use the desirable properties described in Theorem 1 and Corollaries 1 and 2; nonetheless, in Sect. 4, we will see that similar properties do hold even for the divergence measure.

3.2 Learning algorithm

The goal of our task is to train an implicit generative model so that it outputs a quantum ensemble approximating a given ensemble \(\{| \psi _i \rangle \}_{i=1}^{M}\). Our generative model contains tunable parameters and latent variables, as in the classical case described in Sect. 2.1. More specifically, we employ the following implicit generative model:

Definition 6

(Implicit generative model on quantum circuit) Using the initial state \(| 0 \rangle ^{\otimes n}\) and the parameterized quantum circuit \(U(\varvec{z},\varvec{\theta })\), the implicit generative model on a quantum circuit is defined as

$$\begin{aligned} | \phi _{\varvec{\theta }}(\varvec{z}) \rangle =U(\varvec{z},\varvec{\theta })| 0 \rangle ^{\otimes n}. \end{aligned}$$
(3.6)

Here \(\varvec{\theta }\) is the vector of tunable parameters and \(\varvec{z}\) is the vector of latent variables that follow a known probability distribution; both \(\varvec{\theta }\) and \(\varvec{z}\) are encoded in the rotation angles of rotation gates in \(U(\varvec{z},\varvec{\theta })\).

A similar circuit model is also found in meta-VQE (Cervera-Lierta et al. 2021), which uses physical parameters such as interatomic distances instead of random latent variables \(\varvec{z}\). Also, the model proposed in Dallaire-Demers and Killoran (2018) introduces the latent variables \(\varvec{z}\) as the computational basis of an initial state, in the form \(| \phi _{\varvec{\theta }}(\varvec{z}) \rangle =U(\varvec{\theta })| \varvec{z} \rangle \); however, in this model, states with different latent variables are always orthogonal to each other, and thus the model cannot capture a small change of the state in the Hilbert space via a change of the latent variables. In contrast, the model (3.6) fulfills this purpose as long as the expressivity of the state with respect to \(\varvec{z}\) is sufficient. In addition, our model is advantageous in that the analytical derivative is available by the parameter shift rule (Mitarai et al. 2018; Schuld et al. 2019), not only for the tunable parameters \(\varvec{\theta }\) but also for the latent variables \(\varvec{z}\). This feature will be effectively utilized in the anomaly detection problem in Sect. 5.
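For concreteness, the following is a minimal sketch of the parameter-shift rule applied to an expectation value such as \(p^{(k)}\) in Eq. 3.3, viewed as a function of the rotation angles (whether \(\varvec{\theta }\) or \(\varvec{z}\)); the gradient of \(c_\textrm{local}\) then follows by the chain rule. The function name and the gate convention assumed in the shift are illustrative.

```python
import numpy as np

def parameter_shift_gradient(expval_fn, angles, shift=np.pi / 2):
    """Sketch of the parameter-shift rule for an expectation value expval_fn(angles).

    Assumes each angle enters through a gate of the form exp(-i * angle * P / 2)
    with P a Pauli operator; other gate conventions rescale the shift and prefactor."""
    grad = np.zeros_like(angles)
    for k in range(len(angles)):
        plus, minus = angles.copy(), angles.copy()
        plus[k] += shift
        minus[k] -= shift
        grad[k] = 0.5 * (expval_fn(plus) - expval_fn(minus))
    return grad
```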

Next, as for the learning cost, we take the following empirical estimator of OTL, calculated from the training data \(\{| \psi _i \rangle \}_{i=1}^{M}\) and the samples of latent variables \(\{\varvec{z}_j\}_{j=1}^{M_g}\):

$$\begin{aligned}&\mathcal {L}_{c_{\text {local}}}\left( \{| \psi _i \rangle \}_{i=1}^{M},\{ | \phi _{\varvec{\theta }}(\varvec{z}_j) \rangle \}_{j=1}^{M_g}\right) = \min _{\{\pi _{i,j}\}_{i,j=1}^{{M},{M_g}} } \sum _{i=1}^{M}\sum _{j=1}^{M_g} c_{\text {local},i,j}\,\pi _{i,j}, \nonumber \\&\mathrm {subject\ to} \quad \sum _{i=1}^{M}\pi _{i,j} = \frac{1}{{M_g}}, ~~ \sum _{j=1}^{M_g}\pi _{i,j} = \frac{1}{{M}}, ~~ \pi _{i,j} \ge 0, \end{aligned}$$
(3.7)

where \(c_{\text {local},i,j}\) is the ground cost given by

$$\begin{aligned}&c_{\text {local},i,j} = \sqrt{\frac{1}{n}\sum _{k=1}^n (1-p^{(k)}_{i,j})}, \nonumber \\&p^{(k)}_{i,j} =\textrm{Tr} \left[ P_0^k U^\dagger (\varvec{z}_j,\varvec{\theta })| \psi _i \rangle \langle \psi _i |U(\varvec{z}_j,\varvec{\theta })\right] , \nonumber \\&P_0^k = \mathbb {I}_1\otimes \mathbb {I}_2\otimes \cdots \otimes \overbrace{| 0 \rangle \langle 0 |_k}^{k\text {-}\mathrm {th\ qubit}}\otimes \cdots \otimes \mathbb {I}_n. \end{aligned}$$
(3.8)

Note that in practice \(c_{\text {local},i,j}\) is estimated with a finite number of measurements (shots); we denote by \(\tilde{c}_{\text {local},i,j}^{({N_s})}\) the estimator with \(N_s\) shots of the ideal quantity \(c_{\text {local},i,j}\), and in this case the OTL is denoted as \(\mathcal {L}_{\tilde{c}_{\text {local}}^{({N_s})}}\).

Based on the OTL (3.7), the pseudo-code of the proposed algorithm is shown in Algorithm 1. The total number of training quantum states \(| \psi _i \rangle \) required for a parameter update is of the order \(O({M}M_g N_s)\) in step 3, and \(O(\max (M,M_g) N_s N_p)\) in step 5, with \(N_p\) the number of trainable parameters, since the parameter shift rule (Mitarai et al. 2018; Schuld et al. 2019) is applicable.

Algorithm 1
figure a

Learning algorithm with quantum optimal transport loss Eq. (3.7).

4 Performance analysis of the cost and its gradient

In this section, we analyze the performance of the proposed OTL (3.7) and its gradient vector. First, in Sect. 4.1, we numerically study the approximation error of the loss, focusing on its dependence on the intrinsic dimension of the data and the number of qubits, to see whether results similar to Theorem 1 and Corollaries 1 and 2 hold even though the OTL is now a divergence rather than a distance. Then, in Sect. 4.2, we provide numerical and theoretical analyses of the approximation error as a function of the number of measurements (shots). Finally, in Sect. 4.3, we numerically show that the OTL avoids the vanishing gradient issue; i.e., thanks to the locality of the cost, its gradient does not decay exponentially fast. All the analysis in this section focuses on the properties of the cost at a fixed point of the learning process (say, at the initial time); the performance during the training process is discussed in the next section.

Fig. 1
figure 1

Structure of the parameterized quantum circuit (ansatz) used in the analysis in Sects. 4 and 5. This ansatz consists of repeated layers with a similar structure. In the \(\ell \)-th layer, single-qubit rotation operators with angles \(\{\theta _{\ell ,j} \cdot z_{\eta _{\ell ,j}}\}_{j=1}^n\) and directions \(\{\xi _{\ell ,j}\}_{j=1}^n\) are applied to each qubit, followed by a ladder of controlled-Z gates. The rotation directions \(\varvec{\xi }=\{\xi _{\ell ,j}\}_{\ell ,j=1}^{N_L,n}\) and the index parameters of the latent variables, \(\varvec{\eta }=\{\eta _{\ell ,j}\}_{\ell ,j=1}^{N_L,n}\), are randomly chosen at the beginning of the learning process and never changed during learning

We employ the parameterized unitary matrix \(U(\varvec{z},\varvec{\theta })\) shown in Fig. 1 to construct the implicit generative model (3.6), which is similar to that given in Ref. McClean et al. (2018) except that our model contains the latent variables \(\varvec{z}\). That is, the model is composed of the following \(N_L\) repeated unitaries (we call the \(\ell \)-th unitary the \(\ell \)-th layer):

$$\begin{aligned} U_{N_L,\varvec{\xi },\varvec{\eta }} (\varvec{z},\varvec{\theta }) = \prod _{\ell =1}^{N_L} W V_{\varvec{\xi }_\ell ,\varvec{\eta }_\ell }(\varvec{z},\varvec{\theta }_\ell ), \end{aligned}$$
(4.1)

where \(\varvec{\theta }_\ell =\{\theta _{\ell ,j}\}_{j=1}^n\), \(\varvec{\xi }_\ell =\{\xi _{\ell ,j}\}_{j=1}^n\), and \(\varvec{\eta }_\ell =\{\eta _{\ell ,j}\}_{j=1}^n\) are n-dimensional parameter vectors in the \(\ell \)-th layer. We summarize these vectors to \(\varvec{\theta }=\{\varvec{\theta }_\ell \}_{\ell =1}^{N_L}\), \(\varvec{\xi }=\{\varvec{\xi }_\ell \}_{\ell =1}^{N_L}\), and \(\varvec{\eta }=\{\varvec{\eta }_\ell \}_{\ell =1}^{N_L}\). Here \(\varvec{\theta }\) are trainable parameters and \(\varvec{z}\) are latent variables. W is a fixed entangling unitary gate composed of the ladder-structured controlled-Z gates; that is, W operates the two-qubit controlled-Z gate on all adjacent qubits;

$$\begin{aligned} W =\prod _{i=1}^{n-1} CZ_{i,i+1}, \end{aligned}$$
(4.2)

where \(CZ_{i,i+1}\) is the controlled-Z gate acting on the i-th and \((i+1)\)-th qubits. The operator \(V_{\varvec{\xi }_\ell ,\varvec{\eta }_\ell }(\varvec{z},\varvec{\theta }_\ell )\) consists of the single-qubit rotation operators:

$$\begin{aligned} V_{\varvec{\xi }_\ell ,\varvec{\eta }_\ell }(\varvec{z},\varvec{\theta }_\ell ) =\prod _{i=1}^n R_{\xi _{\ell ,i}}(\theta _{\ell ,i}z_{\eta _{\ell ,i}}), \end{aligned}$$
(4.3)

where \(R_{\xi _{\ell ,i}}(\theta _{\ell ,i}z_{\eta _{\ell ,i}})\) is the single-qubit rotation operator with angle \(\theta _{\ell ,i}z_{\eta _{\ell ,i}}\) and direction \({\xi _{\ell ,i}}\in \{X,Y,Z\}\) in the \(\ell \)-th layer, such as \(R_X(\theta _{\ell ,3} z_5)=\textrm{exp}(-i \theta _{\ell ,3} z_5 \sigma _x)\). The index parameters \(\eta _{\ell ,i}\in \{0,1,\cdots ,N_z\}\) and the directions \({\xi _{\ell ,i}}\in \{X,Y,Z\}\) are randomly chosen at the beginning of learning and never changed during learning. Also, we have introduced a constant bias term \(z_0=1\) so that the ansatz \(| \phi _{\varvec{\theta }}(\varvec{z}) \rangle =U(\varvec{z},\varvec{\theta })| 0 \rangle ^{\otimes n}\) can take a variety of states (for instance, if \(z_1, \ldots , z_{N_z}\approx 0\) in the absence of the bias, then \(| \phi _{\varvec{\theta }}(\varvec{z}) \rangle \) would be limited to the vicinity of \(| 0 \rangle ^{\otimes n}\)). We used Qiskit (ANIS et al. 2021) in all the simulation studies.
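As an illustration of this ansatz, the following Qiskit sketch builds the circuit of Eqs. 4.1, 4.2 and 4.3 for given \(\varvec{\theta }\), \(\varvec{\xi }\), \(\varvec{\eta }\), and \(\varvec{z}\); the function name is ours, and note that Qiskit's rotation gates implement \(\exp (-i\,\text {angle}\,\sigma /2)\), so a factor-of-two rescaling of the angles may be needed relative to the convention used in the text.

```python
import numpy as np
from qiskit import QuantumCircuit

def build_ansatz(n, n_layers, theta, xi, eta, z):
    """Sketch of U_{N_L, xi, eta}(z, theta) of Eqs. (4.1)-(4.3).

    theta: array (n_layers, n) of trainable angles
    xi:    array (n_layers, n) of rotation axes in {'x', 'y', 'z'}
    eta:   array (n_layers, n) of latent-variable indices in {0, ..., N_z}
    z:     latent-variable vector with the bias convention z[0] = 1"""
    qc = QuantumCircuit(n)
    for l in range(n_layers):
        for j in range(n):
            angle = theta[l, j] * z[eta[l, j]]
            getattr(qc, 'r' + xi[l, j])(angle, j)   # single-qubit rx/ry/rz rotation
        for j in range(n - 1):                      # ladder of controlled-Z gates (W)
            qc.cz(j, j + 1)
    return qc

# Example: random fixed (xi, eta) and random (theta, z) for a 4-qubit, 3-layer model
rng = np.random.default_rng(0)
n, n_layers, nz = 4, 3, 2
theta = rng.uniform(0, 2 * np.pi, size=(n_layers, n))
xi = rng.choice(['x', 'y', 'z'], size=(n_layers, n))
eta = rng.integers(0, nz + 1, size=(n_layers, n))
z = np.concatenate([[1.0], rng.uniform(0, 1, size=nz)])   # z_0 = 1 is the bias term
circuit = build_ansatz(n, n_layers, theta, xi, eta, z)
```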

4.1 Approximation error with respect to the number of training data

We have seen in Sect. 2.2 that, for the p-Wasserstein distance, the convergence rate of the approximation error is of the order \(\mathcal {O}(M^{-1/s})\) rather than \(\mathcal {O}(M^{-1/N_x})\), where M is the number of training samples, \(N_x\) is the dimension of the sample space \(\mathcal {X}\), and s can be interpreted as the intrinsic dimension of the data or the latent space dimension. This is desirable, because in general \(s \ll N_x\). However, the OTL (3.7) is not a distance but a divergence. Hence it is not clear whether it theoretically enjoys a similar error scaling; in this subsection we give numerical evidence supporting this conjecture.

We present the following two types of numerical simulations. The aim of the first one (Experiment A) is to see whether our OTL satisfies a property similar to Eq. 2.7, which describes the difference of two empirical distributions sampled from a common hidden distribution. The second one (Experiment B) addresses the case of Eq. 2.8, which describes the difference between the ideal OTL, assuming infinitely many samples are available, and the OTL of the empirical distributions. We used the statevector simulator (ANIS et al. 2021) to calculate the ideal OTL assuming an infinite number of measurements; in the next subsection, we study the influence of a finite number of shots on the performance. Throughout all the numerical experiments, we randomly chose the parameters \(\varvec{\xi },\varvec{\eta },\varvec{\theta },\tilde{\varvec{\xi }},\tilde{\varvec{\eta }}, \tilde{\varvec{\theta }}\) and did not change these values.

Table 1 List of parameters for numerical simulation of Sect. 4.1

Experiment A. As an analogue of the quantity appearing in Eq. 2.7, we focus on the following expected value of the empirical OTL defined in Eq. 3.7:

$$\begin{aligned} \mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}}\left[ J^{c_{\text {local}}}_{\varvec{{\xi }},{\varvec{\eta }};\varvec{\xi },\varvec{\eta }} (\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \right] , \end{aligned}$$
(4.4)

where

$$\begin{aligned}&J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }} (\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \nonumber \\&=\mathcal {L}_{c_{\text {local}}}\left( \{ U_{N_L,\tilde{\varvec{\xi }},\tilde{\varvec{\eta }}}(\tilde{\varvec{z}}_i,\tilde{\varvec{\theta }})| 0 \rangle ^{\otimes n}\}_{i=1}^{M}, ~ \{ U_{N_L,\varvec{\xi },\varvec{\eta }}(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{j=1}^{M} \right) . \end{aligned}$$
(4.5)

In Eq. 4.4, we set \(\varvec{\xi }=\tilde{\varvec{\xi }}\) and \(\varvec{\eta }=\tilde{\varvec{\eta }}\) for the two unitary operators that appear in the argument of \(\mathcal {L}_{c_{\text {local}}}\) in Eq. 4.5. This means that \(J^{c_{\text {local}}}_{\varvec{{\xi }},{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},{\varvec{\theta }}:\varvec{z},\varvec{\theta };M)\) in Eq. 4.4 converges to zero in the limit of an infinite number of training data (\(M\rightarrow \infty \)). The expectation in Eq. 4.4 is taken with respect to the latent variables \(\tilde{\varvec{z}}_i\) and \(\varvec{z}_j\), which follow the \(N_z\)-dimensional uniform distribution \(U(0,1)^{N_z}\); we numerically approximate it by \(N_{\text {Monte}}\) Monte Carlo samplings. The other conditions of the numerical simulation are shown in Table 1.

Figure 2 plots the values of Eq. 4.4 for several \(N_z\) (the dimension of the latent variables) and n (the number of qubits). In the figures, the dotted lines show the scaling curve \(M^{-1/N_z}\). Notably, for a large number of training data, the points and dotted lines are almost consistent, regardless of the number of qubits. This implies that the OTL between two different ensembles given by Eq. 4.4 is almost independent of the number of qubits n and depends mainly on the latent dimension \(N_z\), just as in the case of the distance-based loss function proven in Corollary 1.

Fig. 2
figure 2

Results of numerical simulations on the relationship between the number of training data and the OTL given by Eq. 4.4, for several qubit numbers n. Each subgraph shows the results for various latent dimensions \(N_z\). For reference, the scaling curves \(M^{-1/N_z}\) are added as dotted lines. These graphs show that Eq. (4.4) mainly scales as \(M^{-1/N_z}\) and is almost independent of the number of qubits, n

Fig. 3
figure 3

Simulation results of the fitting parameter b with respect to (a) the number of qubits n and (b) the latent dimension \(N_z\). The fitting parameter b is obtained by fitting the second term of Eq. 4.6 using Eq. 4.7. Subfigure (a) shows that b is almost independent of n, while subfigure (b) shows that b scales linearly with \(N_z\)

Experiment B. We next turn to the second experiment, which confirms that the approximation error of the proposed OTL scales similarly to Eq. 2.8. Specifically, we numerically show the dependence of the following expectation value on the number of training data M:

$$\begin{aligned}&\mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}} \left[ \lim _{K\rightarrow \infty } J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };K) \right. \nonumber \\&\qquad \qquad \qquad \qquad \qquad \quad -\left. J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \right] , \end{aligned}$$
(4.6)

where \(J^{c_{\text {local}}}\) is defined in Eq. 4.5. In this case, we set different fixed parameters \(\varvec{\xi }\ne \tilde{\varvec{\xi }}\) and \(\varvec{\eta }\ne \tilde{\varvec{\eta }}\) for the two unitary operators in Eq. 4.5. The parameters used in the numerical simulation are shown in Table 1.

The first term of Eq. 4.6 is the ideal quantity assuming an infinite number of training data. The second term is expected to take the following form, as suggested by Eq. 2.8:

$$\begin{aligned} \mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}} \left[ J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M)\right] = a M^{-1/b} + c. \end{aligned}$$
(4.7)

To identify the parameter b, we use the Monte Carlo method to calculate the left-hand side of Eq. 4.7 as a function of M and then fit the curve \(a M^{-1/b} + c\). We repeat this procedure for several values of the number of qubits n and the latent dimension \(N_z\); see Appendix B for a more detailed discussion. The result of the parameter identification is depicted in Fig. 3, which shows that the fitting parameter b is almost independent of n and scales linearly with \(N_z\). This result is indeed consistent with Eq. 2.8.
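The curve fitting of Eq. 4.7 itself is a standard nonlinear least-squares problem; the following sketch shows how the exponent parameter b can be extracted with SciPy, where the data points are placeholders rather than the actual simulation values.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_model(M, a, b, c):
    # Fitting form of Eq. (4.7): a * M^{-1/b} + c
    return a * M ** (-1.0 / b) + c

# Placeholder Monte Carlo averages of the OTL for several M (illustrative numbers)
M_values = np.array([8, 16, 32, 64, 128, 256])
J_values = np.array([0.52, 0.44, 0.38, 0.34, 0.31, 0.29])

(a_fit, b_fit, c_fit), _ = curve_fit(fit_model, M_values, J_values, p0=(1.0, 2.0, 0.1))
print(f"fitted exponent parameter b = {b_fit:.2f}")
```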

Table 2 List of parameters for numerical simulation of Sect. 4.2

The results obtained in Experiments A and B suggest the following conjecture:

Conjecture 1

The scaling of the approximation error of OTL (3.7) with respect to the number of training data M is determined via the latent dimension \(N_z\) as follows:

$$\begin{aligned}&\mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}} \left[ \mathcal {L}_{c_\mathrm{{local}}}\left( \{ U(\tilde{\varvec{z}}_i,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{i=1}^{\infty }, \{ U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{j=1}^{M}\right) \right] \lesssim O({M}^{-1/N_z}), \end{aligned}$$
(4.8)

irrespective of the number of qubits.

This conjecture means that the proposed OTL can be efficiently computed via sampling, when the intrinsic dimension of the data and the latent dimension are sufficiently small. Proving this conjecture would be challenging and is the subject of future work.

4.2 Approximation error with respect to the number of shots

Fig. 4
figure 4

Simulation results of the approximation error of the OTL due to the finite number of shots, defined in Eq. (4.10). The left and right panels show the dependence of the error on the number of training data M and the number of shots \(N_s\), respectively. The results for different latent dimensions \(N_z\) are shown in panels (a), (b), and (c). For reference, we added the curve \(M^{-1/2}\) as a dotted line in the left panels. The fit of the points at \(N_s =128\) with the curve \(\sqrt{c_1\ln (M)+c_2}\) is added as a dashed line in the left panels. Also, we added the curve \(N_s^{-1/2}\) as a dotted line in the right panels

The error analysis in Sect. 4.1 assumes that the number of shots is infinite and thereby that the ground cost between quantum states can be determined exactly. Here, we analyze the effect of a finite number of shots on the approximation error. The following proposition serves as the basis of the analysis.

Proposition 2

Let \(\tilde{c}^{(N_s)}_\mathrm{{local}}\) be an estimator of the ground cost \(c_\mathrm{{local}}\) of Eq. 3.3 using \(N_s\) samples. Suppose that the supports of the two probability distributions are strictly separated; as a result, there exists a lower bound \(g>0\) on the ground cost for any \(i,j\in \{1,2,\ldots ,M\}\), i.e., \(c_\mathrm{{local}}(| \psi _i \rangle ,U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n})>g, ~\forall i,j\). Then, for any positive constant \(\delta \), the following inequality holds:

$$\begin{aligned} P\left( \Big | \mathcal {L}_{c_\mathrm{{local}}}\left( \{| \psi \rangle _i\}_{i=1}^{M},\{ U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{j=1}^{M}\right) \right. \nonumber \\ - \mathcal {L}_{\tilde{c}^{(N_s)}_\mathrm{{local}}}\left( \{| \psi \rangle _i\}_{i=1}^{M},\{ U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{j=1}^{M}\right) \Big | \nonumber \\ \left. \ge \sqrt{\frac{2M}{\delta }}\sqrt{ \frac{1-g}{N_s} + \frac{(1-g)^2}{4N_s^2 g}}+\frac{1-g}{2 N_s \sqrt{g}}\right) \le \delta . \end{aligned}$$
(4.9)

The proof is given in Appendix C. Proposition 2 states that the approximation error of the OTL is upper bounded by a quantity of the order \(O(\sqrt{M/N_s})\), under the condition \(M\gg 1\) and \(N_s \gg 1\). Therefore, if Conjecture 1 is true, the approximation error due to the finiteness of \(N_s\) and M is upper bounded by \(O(M^{-1/N_z})+O\left( \sqrt{{M}/N_s}\right) \), where \(N_z\) is the latent dimension.
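Since the bound in Eq. 4.9 is explicit, it can be evaluated numerically to choose a shot budget \(N_s\) for a given ensemble size M; the following sketch simply evaluates the deviation threshold appearing in Eq. 4.9, with illustrative input values.

```python
import numpy as np

def otl_shot_error_bound(M, N_s, g, delta):
    """Evaluate the deviation threshold appearing in Proposition 2 (Eq. 4.9)."""
    first = np.sqrt(2 * M / delta) * np.sqrt((1 - g) / N_s + (1 - g) ** 2 / (4 * N_s ** 2 * g))
    return first + (1 - g) / (2 * N_s * np.sqrt(g))

# Example: M = 64 training states, N_s = 1024 shots, g = 0.1, delta = 0.05
print(otl_shot_error_bound(M=64, N_s=1024, g=0.1, delta=0.05))
```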

We then provide a numerical simulation to show the averaged approximation error as a function of M as well as \(N_s\). Again, we employ the hardware efficient ansatz shown in Fig. 1 of Sect. 4.1. The purpose of the numerical simulation is to see the dependence of the following expectation value on M and \(N_s\):

$$\begin{aligned}&\mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}} \left[ \left| J^{\tilde{c}^{(N_s)}_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \right. \right. \nonumber \\&\qquad \qquad \qquad \qquad \qquad \left. \left. - J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \right| \right] , \end{aligned}$$
(4.10)

where \(J^{\tilde{c}^{(N_s)}_{\text {local}}}\) is computed via Eq. (4.5) with \(c_{\text {local}}\) replaced by its \(N_s\)-shot estimator. As in the numerical simulation in Sect. 4.1, we approximate the expectation \(\mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}}[\bullet ]\) by Monte Carlo sampling of \(\tilde{\varvec{z}}\) and \(\varvec{z}\) from the uniform distribution \(U(0,1)^{N_z}\). Also, we randomly choose the fixed parameters \(\varvec{\xi },\varvec{\eta },\varvec{\theta },\tilde{\varvec{\xi }},\tilde{\varvec{\eta }},\tilde{\varvec{\theta }}\) prior to the simulation. The other simulation parameters are given in Table 2. Simulation results are depicted in Fig. 4, where the notable points are summarized as follows.

  • For a small number of training data M, the approximation error is roughly proportional to \(M^{-1/2}\).

  • For a large number of training data M, the approximation error behaves as \(\sqrt{c_1 \ln M+c_2}\) with constants \(c_1\) and \(c_2\).

  • The dependence of the error on the number of shots \(N_s\) is roughly proportional to \(N_s^{-1/2}\).

Appendix D provides an intuitive explanation of these results. In particular, they show that the number of shots must be chosen appropriately, relative to the number of training data, to keep the approximation error small.

Table 3 List of parameters for numerical simulation of Sec. 4.3
Fig. 5
figure 5

The gradient of the cost function as a function of the number of qubits, n, for the case of (a) the global cost (3.2) and (b) the local cost (3.3). Panel (c) shows the dependence on the number of training data, M, for the case of the local cost. Several curves are depicted for different numbers of layers \(N_L\). A clear exponential decay is observed in (a), but it is avoided in (b). The polynomial decay (\(\simeq M^{-1}\)) observed in (c) implies simple statistical scaling

4.3 Avoidance of the vanishing gradient issue

In Sect. 3.1 we chose a cost function composed of local measurements, as a minimal requirement for avoiding the vanishing gradient issue. Note, however, that employing a local cost alone is not enough to avoid this issue; for instance, Ref. Cerezo et al. (2021) proposed using a special type of parameterized quantum circuit called the alternating layered ansatz (ALA) in addition to the local cost, which is proven to avoid the issue. Nevertheless, we here numerically demonstrate that our method can mitigate the decay of the gradient even without such an additional condition, compared to the case with the global cost.

More specifically, we calculated the variance of the partial derivative of the OTL (3.7) over random parameter choices, based on the training ensemble \(\{| \psi _i \rangle \}^{M}_{i=1}\) and the sampled data from the generative model \(\{U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}^{M}_{j=1}\);

$$\begin{aligned} \mathbb {V}_{\varvec{\xi },\varvec{\eta },\varvec{\theta },\varvec{z}}\left[ \frac{\partial }{\partial \theta } \mathcal {L}_{c_{\text {local}}}\left( \{ | \psi \rangle _i\}, \{ U_{N_L,\varvec{\xi },\varvec{\eta }}(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\} \right) \right] . \end{aligned}$$
(4.11)

The partial derivative is calculated using the parameter shift rule (Mitarai et al. 2018). The variance is approximated by Monte Carlo calculations with respect to \(\varvec{z}\), \(\varvec{\xi }\), \(\varvec{\eta }\), and \(\varvec{\theta }\), where \(\varvec{z}\) is sampled from the uniform distribution U(0, 1) and \(\varvec{\xi },\varvec{\eta },\varvec{\theta }\) are sampled from \({\varvec{\xi }}\in \{X,Y,Z\}^{n N_L}\), \(\varvec{\eta }\in \{0,1,\ldots ,N_z\}^{n N_L}\), and \(\varvec{\theta } \in [0,2\pi ]^{n N_L}\). The structure of the generative model \(U(\varvec{z},\varvec{\theta })| 0 \rangle ^{\otimes n}\) is the same as that shown in Fig. 1. The derivative is taken with respect to \(\theta _{1,1}\). The training ensemble \(\{| \psi _i \rangle \}_{i=1}^{M}\) is prepared as follows:

$$\begin{aligned} | \psi \rangle _i = W' V'_2(\varvec{\zeta }^i_2)W' V'_1(\varvec{\zeta }^i_1)| 0 \rangle ^{\otimes n}, \end{aligned}$$

where \(\varvec{\zeta }^i_1=\{\zeta ^i_{1,j}\}_{j=1}^n\) and \(\varvec{\zeta }^i_2=\{\zeta ^i_{2,j}\}_{j=1}^n\) are randomly chosen from the uniform distribution on \([0,2\pi ]\) and fixed during Monte Carlo calculation. The operators \(W'\), \(V'_1\), and \(V'_2\) are defined as follows:

$$\begin{aligned}&W' =\prod _{j=1}^{n-1} CX_{j,j+1}, \nonumber \\&V'_1(\varvec{\zeta }^i_1) =\prod _{j=1}^n R_{j,Y}(\zeta ^i_{1,j}), \nonumber \\&V'_2(\varvec{\zeta }^i_2) =\prod _{j=1}^n R_{j,Z}(\zeta ^i_{2,j}), \end{aligned}$$
(4.12)

where \(CX_{j,k}\) denotes the controlled-X gate, which applies an X gate to the k-th qubit controlled on the j-th qubit. \(R_{j,Y}\) and \(R_{j,Z}\) denote single-qubit Pauli rotations of the j-th qubit around the y and z axes, respectively. The other simulation parameters are given in Table 3.

Figure 5 shows the numerical simulation results for the variance (4.11), for the cases where (a) the global cost (3.2) is used and (b) the local cost (3.3) is used. The number of training data is fixed to \(M=8\). A clear exponential decay of the gradient variance is observed for the global cost, regardless of \(N_L\). In contrast, for the local cost, the relatively shallow circuits with \(N_L=10, 25\) exhibit approximately constant scaling for \(n \ge 10\), while the deep circuits with \(N_L\ge 50\) also exhibit slower decay than the global case and keep a larger variance even when \(n \ge 8\). This result implies that the OTL with the local ground cost can avoid the vanishing gradient issue, even though the circuit is not specifically designed for this purpose. Note also that the result is consistent with that reported in Holmes et al. (2022), which studied a cost function composed of a single ground-cost term of the type used in our setting.

In addition, Fig. 5(c) shows the variance (4.11) as a function of the number of training data, M, in the case of \(n=14\). In the figure, the points represent the Monte Carlo numerical results and the dotted lines represent the scaling curves \(M^{-x}\), where the value x is determined by fitting. This fitting result implies that the gradient obeys simple statistical scaling with respect to M, and thus the proposed algorithm would enjoy efficient learning even for a large training ensemble.

5 Demonstration of the generative model training and application to anomaly detection

5.1 Quantum anomaly detection

Anomaly detection is the task of judging whether given test data \(\varvec{x}^{(t)}\) are anomalous (rare) or not, based on the knowledge learned from past training data \(\varvec{x}_i,(i=1,2,\ldots ,M)\), i.e., the generative model. Unlike typical classification tasks, this problem involves a large imbalance between the number of normal data and the number of anomalous data; the former is usually much larger than the latter. Therefore, typical classification methods are not suitable for this task, and specialized schemes have been widely developed (Chandola et al. 2009).

The anomaly detection problem is important in the field of quantum technology. That is, to realize accurate state preparation and control, we are required to detect contaminated quantum states and remove them as quickly as possible. Previous quantum anomaly detection schemes rely on measurement-based data processing (Hara et al. 2014, 2016), which, however, requires a large number of measurements, as in quantum state tomography. In contrast, our anomaly detection scheme directly inputs quantum states to the constructed generative model and then diagnoses the abnormality with far fewer measurements.

The following is the procedure for constructing the conventional anomaly detector based on the generative model (Ide 2015), which we apply to our quantum case.

  1. (Distribution estimation) Construct a model probability distribution from the normal dataset.

  2. (Anomaly score design) Define an Anomaly Score (AS), based on the model distribution of normal data.

  3. (Threshold determination) Set a threshold of the AS for diagnosing abnormality.

Table 4 List of parameters for numerical simulation in Sect. 5.2

Of these steps, the model probability distribution in Step 1 is constructed by the learning algorithm presented in Sect. 3.2. To design the AS in Step 2, we refer to AnoGAN (Schlegl et al. 2017) in classical machine learning. Namely, we define a loss function

$$\begin{aligned} \mathcal {L}\left( U(\varvec{z},\varvec{\theta })| 0 \rangle ^{\otimes n},| \psi ^{(t)} \rangle \right) \end{aligned}$$

for a test state \(| \psi ^{(t)} \rangle \) and the generative model \(U(\varvec{z},\bar{\varvec{\theta }})\) constructed from the training dataset with the learned parameters \(\bar{\varvec{\theta }}\). Then we take the minimum with respect to the latent variables \(\varvec{z}\) to calculate the AS:

$$\begin{aligned} (\text {Anomaly Score}) =\min _{\varvec{z}} \mathcal {L} \left( U(\varvec{z},\bar{\varvec{\theta }})| 0 \rangle ^{\otimes n},| \psi ^{(t)} \rangle \right) . \end{aligned}$$
(5.1)

As the loss function \(\mathcal {L}\), we use the local ground cost \(c_\textrm{local}(| \psi ^{(t)} \rangle ,U(\varvec{z},\bar{\varvec{\theta }})| 0 \rangle ^{\otimes n})\) given in Eq. 3.3. The above minimization is executed via gradient descent with respect to \(\varvec{z}\), where the gradient is obtained via the parameter shift rule, similarly to the derivative with respect to \(\varvec{\theta }\). Algorithm 2 summarizes the procedure.

Algorithm 2
figure b

Algorithm to calculate Anomaly Score.
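A minimal sketch of this procedure is the following: starting from a random \(\varvec{z}\), the loss is minimized over the latent variables by gradient descent, and the final loss value is returned as the AS. The function names are ours; a simple finite-difference gradient is used here as a stand-in for the parameter-shift derivative described in the text, and a plain gradient-descent update replaces whatever optimizer is actually employed.

```python
import numpy as np

def anomaly_score(loss_fn, nz, n_steps=200, lr=0.05, seed=0):
    """Sketch of Eq. (5.1): minimize loss_fn(z) over the latent variables z.

    loss_fn: callable z -> estimate of c_local(|psi_t>, U(z, theta_bar)|0...0>)
    nz:      dimension of the latent space"""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, size=nz)
    for _ in range(n_steps):
        grad = np.zeros(nz)
        for k in range(nz):
            zp, zm = z.copy(), z.copy()
            zp[k] += 1e-3
            zm[k] -= 1e-3
            grad[k] = (loss_fn(zp) - loss_fn(zm)) / 2e-3   # central finite difference
        z -= lr * grad                                      # gradient-descent update
    return loss_fn(z)
```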

5.2 Distributed dataset

The first demonstration is to construct a generative model that learns a quantum ensemble distributed on the equator of the generalized Bloch sphere. That is, the training ensemble (i.e., the normal dataset) \(\{| \psi _j \rangle \}_{j=1}^{M}\) to be learned is set as follows:

$$\begin{aligned} | \psi _j \rangle = \textrm{cos}(\pi /4)| 0 \rangle +e^{2\pi i \phi ^j}\textrm{sin}(\pi /4)| 2^n -1 \rangle , \end{aligned}$$
(5.2)

where \(\phi ^j\) is randomly generated from the uniform distribution on [0, 1] and \(| x \rangle \) denotes the x-th computational basis state in the \(2^n\)-dimensional Hilbert space. Note that the configuration of this ensemble cannot be learned by the existing mixed-state-based quantum anomaly detection schemes (Hara et al. 2014, 2016), because the mixed state corresponding to this ensemble is the maximally mixed state on the subspace spanned by \(| 0 \rangle \) and \(| 2^n-1 \rangle \), the learning of which does not give us a generative model recovering the original ensemble.
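For completeness, the training ensemble (5.2) can be generated classically as statevectors with the following sketch; the function name and the random-seed handling are illustrative.

```python
import numpy as np

def equator_ensemble(n, M, seed=0):
    """Generate M states of Eq. (5.2) on the equator spanned by |0...0> and |1...1>."""
    rng = np.random.default_rng(seed)
    states = []
    for _ in range(M):
        phi = rng.uniform(0.0, 1.0)
        psi = np.zeros(2 ** n, dtype=complex)
        psi[0] = np.cos(np.pi / 4)                                # amplitude on |0...0>
        psi[-1] = np.exp(2j * np.pi * phi) * np.sin(np.pi / 4)    # amplitude on |2^n - 1>
        states.append(psi)
    return states
```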

We employ the same ansatz as that given in Sect. 4 with the parameters shown in Table 4 and construct the generative model according to Algorithm 1. As the optimizer, we take Adam (Kingma and Ba 2014) with learning rate 0.01. The number of learning iterations (i.e., the number of parameter updates) is set to 1500 for \(n=2\) and 10000 for \(n=10\).

Once the model for the normal ensemble is constructed, it is used for anomaly detection. Here the set of test states \(\{| \psi ^{(t)} \rangle \}\) is given by

$$\begin{aligned} \Big |{\psi ^{(t)}}\Big \rangle = \textrm{cos}\left( \frac{\pi }{2}\theta ^{(t)}\right) | 0 \rangle +e^{2\pi i \phi ^{(t)}}\textrm{sin}\left( \frac{\pi }{2}\theta ^{(t)}\right) | 2^n -1 \rangle , \end{aligned}$$
(5.3)

where \(\theta ^{(t)}, \phi ^{(t)} \in \{0, 0.1, 0.2, \ldots ,2\}\). We calculate the AS using Algorithm 2. The other simulation parameters are shown in Table 4.

The numerical simulation results in the cases of \(n=2\) and \(n=10\) are presented in Figs. 6 and 7, respectively. The training ensemble is shown in panels (a), where each blue point corresponds to a generalized Bloch vector. Some output states of the constructed generative model, corresponding to different values of \(z \in [0,1]\), are shown in panels (b). Both the red and blue points in panels (c) represent the test states (5.3). Panels (d) show the calculated AS, where the blue and red plots correspond to the blue and red points in (c), respectively. The dotted lines in (d) illustrate the theoretical values expected if the model perfectly learned the training data.

Fig. 6
figure 6

Simulation results in the case of \(n=2\), for the distributed training ensemble. (a) Generalized Bloch vector representation of the training ensemble. (b) Output states from the generative model with latent variables \(\varvec{z} \in [0,1]\). (c) Test data (both red and blue points). (d) Anomaly scores for different test data. The blue and red plots correspond to the blue and red points in (c), respectively. The values of the angles \(\theta ^{(t)}\) and \(\phi ^{(t)}\) are given on the bottom and top horizontal axes, respectively. The dotted line represents the theoretical value of the AS

Fig. 7
figure 7

Simulation results in the case of \(n=10\), for the distributed training ensemble. (a) Generalized Bloch vector representation of the training states. (b) Output states from the generative model with latent variables \(\varvec{z} \in [0,1]\). (c) Test data (both red and blue points). (d) Anomaly scores for different test data. The blue and red plots correspond to the blue and red points in (c), respectively. The values of the angles \(\theta ^{(t)}\) and \(\phi ^{(t)}\) are given on the bottom and top horizontal axes, respectively

Firstly, we see a clear correlation between the AS and the theoretical curve in Fig. 6, implying that the AS is appropriately calculated via the proposed method. In a practical use case, a user defines a threshold on the AS depending on the task and then compares the calculated AS with the threshold to identify anomalous quantum states. For instance, if we set the threshold to \(\textrm{AS}=0.3\), the test states satisfying \(0.3 \le \theta ^{(t)}/\pi \le 0.7\) in Fig. 6 are judged as normal while the others are judged as anomalous. Moreover, the output states of the learned generative model and the result of anomaly detection in the case of \(n=10\) are shown in Fig. 7. Although the result displayed in (b) might suggest that the learning failed, the output states are in fact correlated with the training states displayed in (a); indeed, the output states lie on the xy-plane of the generalized Bloch sphere spanned by \(| 0 \rangle ^{\otimes 10}\) and \(| 1 \rangle ^{\otimes 10}\). In addition, it is notable that \(N_s=100\) is sufficient to perform anomaly detection even in the case of \(n=10\), provided that an appropriate generative model is obtained from the training normal states. This is a practical advantage, as it indicates that the proposed method may scale well with respect to the number of qubits.

5.3 Localized dataset

Next, let us consider a localized quantum ensemble. That is, the training ensemble \(\{| \psi _j \rangle \}_{j=1}^{M}\) corresponding to the normal dataset consists of the states

$$\begin{aligned} | \psi _j \rangle = \textrm{cos}\left( \frac{\pi }{2}{\varDelta }\theta _j\right) | 0 \rangle +e^{2\pi i {\varDelta }\phi _j}\textrm{sin}\left( \frac{\pi }{2}{\varDelta }\theta _j\right) | 2^n -1 \rangle , \end{aligned}$$
(5.4)

where n is the number of qubits, and \({\varDelta }\theta _j\) and \({\varDelta }\phi _j\) are sampled from the normal distribution \(N(\mu , \sigma )\) and the uniform distribution \(U(a, b)\), respectively (\(\mu \) and \(\sigma \) represent the mean and the variance, respectively). We consider the two cases \((\mu , \sigma , a, b) = (0, 0.02, 0, 0.1)\) for \(n=6\) and \((\mu , \sigma , a, b) = (0, 0.02, 0, 0.2)\) for \(n=10\). Note that, with this choice of parameters, the ensemble \(\{| \psi _j \rangle \}_{j=1}^{M}\) is nearly two-dimensionally distributed on the generalized Bloch sphere, as illustrated in Fig. 9(a, d). The other simulation parameters are shown in Table 5.

Table 5 List of parameters for numerical simulation of Sect. 5.3

Fig. 8

Learning curves for the cases (a) \(n=6\) and (b) \(n=10\). The OTL values are calculated with the parameters obtained at each iteration, trained with either the local cost (3.3) or the global cost (3.2). Note that, for a fair comparison, the displayed values in both the blue and orange cases are the costs evaluated with the global cost (3.2). The error bars represent one standard deviation

To construct a generative model via learning this two-dimensional distribution, we set the dimension of the latent variable to \(N_z=2\), based on the above observation on the dimensionality of \(\{| \psi _j \rangle \}_{j=1}^{M}\). Also, we here take the so-called alternating layered ansatz (ALA), which, together with the use of the local cost, is guaranteed to mitigate the gradient vanishing issue (Cerezo et al. 2021; Nakaji and Yamamoto 2021). This ansatz is more favorable than the previous one, which we here call the hardware efficient ansatz (HEA), in view of its ability to avoid the gradient vanishing issue; it is therefore worth comparing their learning curves. Typical learning curves are shown in Fig. 8. The blue plots, labeled "local", represent the case where the cost is the local one (3.3) and the ansatz is ALA; we denote this case as L-ALA. The orange plots, labeled "global", represent the case where the cost is the global one (3.2) and the ansatz is HEA; we denote this case as G-HEA. Note that both displayed costs are evaluated as the global one, to compare them directly; that is, "local" shows, at each iteration, the global cost evaluated with the parameters obtained by optimizing the local cost. We observe that L-ALA has a clear advantage over G-HEA in terms of convergence speed. This result coincides with that of Sect. 4.3, indicating the advantage of the local cost. In addition to the convergence speed, the final cost of L-ALA is lower than that of G-HEA. Note that the learning performance heavily depends on the initial random seed, yet it was difficult to find a successful setting for G-HEA; in all cases we tried, the trajectory appeared to be trapped in a local minimum, presumably because the gradient variance of G-HEA is much smaller than that of L-ALA.
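The precise circuit structures are given in the cited references; purely as a structural illustration of what an alternating layered ansatz looks like, the following sketch builds a brickwork circuit of parameterized single-qubit rotations followed by entangling gates on alternating neighbouring pairs. The specific gate choice and depth here are our assumptions, not the circuit actually used in the simulations.

```python
import numpy as np
from qiskit import QuantumCircuit

def alternating_layered_ansatz(n_qubits, n_layers, params):
    """Brickwork-style ALA sketch: layers of Ry rotations followed by CZ gates
    on alternating (even/odd) neighbouring pairs."""
    qc = QuantumCircuit(n_qubits)
    p = iter(params)
    for layer in range(n_layers):
        for q in range(n_qubits):
            qc.ry(next(p), q)
        start = layer % 2                      # alternate the pairing between layers
        for q in range(start, n_qubits - 1, 2):
            qc.cz(q, q + 1)
    return qc

n_qubits, n_layers = 6, 4
params = np.random.uniform(0, 2 * np.pi, size=n_qubits * n_layers)
print(alternating_layered_ansatz(n_qubits, n_layers, params))
```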

We apply the constructed generative model to the anomaly detection problem. In Fig. 9(b) and (e), the test quantum states are displayed for the cases of \(n=6\) and \(n=10\), respectively. The resulting anomaly scores for the test data are shown in Fig. 9(c, f). In both cases, we can say that the models are trained appropriately. In particular, the variance of the distribution of \(\{| \psi _j \rangle \}_{j=1}^{M}\), i.e., the distribution of the blue points in (a, d), is well captured by the width of the dip of the red lines in (c, f). Finally, note that in this section we use the QASM simulator for the numerical simulation; the number of shots is 1000 for each measurement in the learning process, and 50 for the anomaly detection task, even in the case of \(n=10\). Compared to state tomography, these shot counts are very small. Nonetheless, the proposed method enabled the model to learn the training ensemble with such a small number of shots.
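To give a feel for such small shot counts, the following sketch classically simulates a compute-uncompute overlap test, in which the all-zero outcome occurs with probability equal to the fidelity \(|\langle \phi | \psi \rangle |^2\); with only 50 shots the estimate is correspondingly coarse. This illustrates shot noise only and is not the measurement scheme prescribed by Algorithm 2.

```python
import numpy as np

def estimate_fidelity_with_shots(psi, phi, shots, seed=0):
    """Simulate a compute-uncompute overlap test: each shot returns the
    all-zero outcome with probability |<phi|psi>|^2."""
    rng = np.random.default_rng(seed)
    p = abs(np.vdot(phi, psi)) ** 2
    hits = rng.binomial(shots, p)
    return hits / shots

# two nearby states supported on |0...0> and |1...1> in a 2^n-dimensional space
n = 10
psi = np.zeros(2 ** n, dtype=complex); psi[0] = np.cos(np.pi / 8); psi[-1] = np.sin(np.pi / 8)
phi = np.zeros(2 ** n, dtype=complex); phi[0] = np.cos(np.pi / 8); phi[-1] = np.exp(0.2j) * np.sin(np.pi / 8)

print(estimate_fidelity_with_shots(psi, phi, shots=50))   # coarse, 50-shot estimate
print(abs(np.vdot(phi, psi)) ** 2)                        # exact value for comparison
```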

Fig. 9

Simulation results for the localized training ensemble, in the cases of (a, b, c) \(n=6\) and (d, e, f) \(n=10\). (a, d) Generalized Bloch vector representation of the training states. (b, e) Test data (both red and blue points). (c, f) Anomaly scores for different test data. The blue and red plots correspond to the blue and red points in (b, e), respectively. The values of the angles \(\theta ^{(t)}\) and \(\phi ^{(t)}\) are given on the bottom and top horizontal axes, respectively. The dotted line represents the theoretical value of AS

6 Conclusion

In classical machine learning, many generative models have been vigorously studied, but there are only a few studies on quantum generative models for quantum data. This paper offers a new approach for building such a quantum generative model, i.e., a learning method for quantum ensembles in an unsupervised machine learning setting. For that purpose, we proposed a loss function based on the optimal transport loss (OTL) that measures the distance between the set of model pure states and the set of training pure states, rather than the distance between the corresponding two mixed states. We then modified the proposed OTL to a sum of local costs, in order to avoid the vanishing gradient problem and thereby improve trainability, although this modification makes the cost no longer a distance. Hence, we showed that the localized OTL satisfies the properties of a divergence between probability distributions and confirmed that it is suitable as a cost for a generative model. We then theoretically and numerically analyzed the properties of the OTL. Our analysis indicates that the proposed OTL is a good cost function when the quantum data ensemble has a certain structure, i.e., when it is confined to a relatively low-dimensional manifold. In addition, we numerically showed that the OTL can avoid the vanishing gradient issue thanks to the locality of the cost. Finally, we demonstrated anomaly detection problems for quantum states that cannot be handled via existing methods.

Fig. 10

Simulation results for the relationship between the approximation error of the proposed loss, Eq. (4.6), and the number of training data M. The points represent the results of the proposed loss, and the dotted lines represent the fitting results with Eq. (4.7). The values of the parameter B obtained by the fitting are shown in the legend of the dotted lines

There remain many directions to explore along the lines of this paper. First, this paper assumed that quantum processing is performed individually for each quantum state, but it would be interesting to consider other settings, such as the case allowing coherent access (Aharonov et al. 2022) or quantum memory (Huang et al. 2022). In this scenario, the anomaly detection technique could be used to process quantum states coming from an external quantum processor, such as a quantum sensor. Also, it would be interesting to extend the cost to the entropy-regularized Wasserstein distance (Cuturi 2013; Feydy et al. 2018; Genevay et al. 2019; Amari et al. 2016, 2018), which is effective for dealing with higher-dimensional data in classical generative models. In this case, however, the localized cost does not satisfy the axioms of a distance, just as in the case demonstrated in this paper; this nevertheless remains an interesting topic for future work. Lastly, note that classical data can also be within the scope of our method, provided it can be effectively embedded into quantum data; then, for instance, anomaly detection of financial data might be a suitable target.