1 Introduction

Among the recent rapid progress of quantum algorithms for both noisy near-term and future fault-tolerant quantum devices, quantum machine learning (QML) has attracted particular attention. QML is largely categorized into two regimes according to the type of data, which can roughly be called classical data and quantum data. The former has the conventional meaning used in the classical case; in the supervised learning scenario, for instance, a quantum system is trained to give a prediction for given classical data such as an image. As for the latter, on the other hand, the task is to predict some properties of a given quantum state drawn from a set of states, e.g., the phase of a many-body quantum state, again in the supervised learning scenario. Owing to the obvious difficulty of directly representing a large quantum state classically, some quantum advantages have been proven in QML for quantum data (Aharonov et al. 2022; Wu et al. 2021; Huang et al. 2022).

While the above paragraph focused on the supervised learning setting, the success of unsupervised learning in classical machine learning, particularly generative modeling, is also notable; a variety of algorithms have demonstrated strong performance in several applications, such as image generation (Bao et al. 2017; Brock et al. 2018; Kulkarni et al. 2015), molecular design (Gómez-Bombarelli et al. 2018), and anomaly detection (Zhou and Paffenroth 2017). Hence, it is quite reasonable that several quantum unsupervised learning algorithms have been actively developed, such as the quantum circuit Born machine (QCBM) (Benedetti et al. 2019; Coyle et al. 2020), the quantum generative adversarial network (QGAN) (Lloyd and Weedbrook 2018; Dallaire-Demers and Killoran 2018), and the quantum autoencoder (QAE) (Romero et al. 2017; Wan et al. 2017). Also, Ref. Dallaire-Demers and Killoran (2018) studied the generative modeling problem for quantum data; the task is to construct a model quantum system producing a set of quantum states, i.e., a quantum ensemble, that approximates a given quantum ensemble. The model quantum system contains latent variables, the change of which corresponds to a change of the output quantum state of the system. In the classical case, such a generative model governed by latent variables is called an implicit model. It is known that, to efficiently train an implicit model, it is often preferable to minimize a distance between the model dataset and the training dataset, rather than minimizing, e.g., the divergence between two probability distributions. The optimal transport loss (OTL), which typically leads to the Wasserstein distance, is suitable for measuring the distance between two datasets; indeed, a quantum version of the Wasserstein distance was proposed in Zhou et al. (2022) and De Palma et al. (2021) and was applied to construct a generative model for quantum ensembles in the QGAN framework (Chakrabarti et al. 2019; Kiani et al. 2022).

Along this line of research, in this paper we also focus on the generative modeling problem for quantum ensembles. We are motivated by the fact that the above-mentioned existing works employ the Wasserstein distance defined between two mixed quantum states corresponding to the training and model quantum ensembles, where each mixed state is obtained by compressing all elements of the quantum ensemble into a single density matrix. This is problematic, because the compression loses much of the information of the ensemble. For instance, single-qubit pure states uniformly distributed on the equator of the Bloch sphere are compressed to the maximally mixed state, from which the original ensemble clearly cannot be recovered. Hence, learning a single mixed state produced from the training ensemble does not give us a model that approximates the original training ensemble.

In this paper, we therefore propose a new quantum OTL that directly measures the difference between two quantum ensembles. The generative model is then obtained by minimizing this quantum OTL between a training quantum ensemble and the ensemble of pure quantum states produced by the model. As the generative model, we use a parameterized quantum circuit (PQC) that contains tunable parameters and latent variables, both of which are supplied as the angles of single-qubit rotation gates. A notable feature of the proposed OTL is that it takes the form of a sum of local functions, each operating on a few neighboring qubits. This condition (i.e., the locality of the cost) is indeed necessary to train the model without suffering from the so-called vanishing gradient issue (McClean et al. 2018), meaning that the gradient vector with respect to the parameters decreases exponentially fast as the number of qubits increases.

Using the proposed quantum OTL, which will be formally defined in Sect. 3, we show the following results. The first result is given in Sect. 4, which provides a performance analysis of the OTL and its gradient from several aspects, e.g., scaling properties of the OTL as a function of the number of training data and the number of measurements. We also numerically confirm that the gradient of the OTL is free from the vanishing gradient issue. The second result is provided in Sect. 5, showing some examples of constructing a generative model for a quantum ensemble by minimizing the OTL. This demonstration includes an application to an anomaly detection problem for quantum data; that is, once a generative model is constructed by learning a given quantum ensemble, it can be used to detect an anomalous quantum state by measuring the distance of this state to the output ensemble of the model. Section 6 gives concluding remarks. Some supporting materials, including the proofs of theorems, are given in the Appendix.

2 Preliminaries

In this section, we first review the implicit generative model for classical machine learning problems in Sect. 2.1. Next, Sect. 2.2 is devoted to describing the general OTL, which can be effectively used as a cost function to train a generative model.

2.1 Implicit generative model

A generative model is designed so that it approximates the unknown probability distribution that produces a given training dataset. The basic design strategy is as follows: assuming a probability distribution \(\alpha (\varvec{x})\) behind the given training dataset \(\{\varvec{x}_i\}_{i=1}^{M}\in \mathcal {X}^{M}\), where \(\mathcal {X}\) denotes the space of random variables, we prepare a parameterized probability distribution \(\beta _{\varvec{\theta }}(\varvec{x})\) and learn the parameters \(\varvec{\theta }\) that minimize an appropriate loss function defined on the training dataset.

The implicit generative modeling approach does not give us an explicit form of \(\beta _{\varvec{\theta }}(\varvec{x})\). An important feature of the implicit generative model is that it can easily describe a probability distribution whose random variables follow a relatively simple distribution but are confined to a hidden low-dimensional manifold; also, the data-generation process can be interpreted as a physical process from a latent variable to the data (Bottou et al. 2018). Examples of the implicit generative model include Generative Adversarial Networks (Goodfellow et al. 2014). This paper focuses on the implicit generative model.

A more specific description of the implicit generative model is as follows. An implicit generative model is expressed as a map of a latent random variable \(\varvec{z}\) onto \(\mathcal {X}\); \(\varvec{z}\) resides in a latent space \(\mathcal {Z}\) whose dimension \(N_z\) is significantly smaller than that of the sample space, \(N_x\). The latent random variable \(\varvec{z}\) follows a known distribution \(\gamma (\varvec{z})\) such as a uniform distribution or a Gaussian distribution. That is, the implicit model distribution is given by \(\beta _{\varvec{\theta }}=G_{\varvec{\theta }}{\#}\gamma \), where \(\#\) is called the push-forward operator (Peyre and Cuturi 2019a), which moves the distribution \(\gamma \) on \(\mathcal {Z}\) to a probability distribution on \(\mathcal {X}\) through the map \(G_{\varvec{\theta }}\). This implicit generative model is trained so that the set of samples generated from the model distribution is close to the set of training data, by adjusting the parameters \(\varvec{\theta }\) to minimize an appropriate cost function \(\mathcal {L}\) as follows:

$$\begin{aligned} \varvec{\theta }^\star = \underset{\varvec{\theta }}{\text {arg min}}\,\mathcal {L}(\hat{\alpha }_{M},\hat{\beta }_{\varvec{\theta },{M_g}}). \end{aligned}$$
(2.1)

\(\hat{\alpha }_{M}(\varvec{x})\) and \(\hat{\beta }_{\varvec{\theta },{M_g}}=G_{\varvec{\theta }}\#\hat{\gamma }_{M_g}(\varvec{z})\) denote empirical distributions defined with the sampled data \(\{\varvec{x}_i\}_{i=1}^{M}\) and \(\{\varvec{z}_i\}_{i=1}^{M_g}\), which follow the probability distributions \(\alpha (\varvec{x})\) and \(\gamma (\varvec{z})\), respectively:

$$\begin{aligned} \hat{\alpha }_{M}(\varvec{x}) = \frac{1}{{M}}\sum _{i=1}^{M} \delta (\varvec{x}-\varvec{x}_i), ~~~ \hat{\gamma }_{M_g}(\varvec{z}) = \frac{1}{{M_g}}\sum _{i=1}^{M_g} \delta (\varvec{z}-\varvec{z}_i). \end{aligned}$$
(2.2)
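As a concrete illustration of the push-forward construction \(\beta _{\varvec{\theta }}=G_{\varvec{\theta }}\#\gamma \) and of the empirical distributions (2.2), the following minimal Python sketch draws latent samples from a uniform \(\gamma \) and pushes them through a map \(G_{\varvec{\theta }}\); the function name and the toy linear map are illustrative assumptions, not part of the original method.

```python
import numpy as np

def sample_implicit_model(G_theta, nz, M_g, seed=0):
    """Sketch: sample from an implicit model beta_theta = G_theta # gamma by
    drawing latent variables z ~ gamma (here uniform on [0,1]^nz, an assumption)
    and pushing them through the map G_theta into the data space X."""
    rng = np.random.default_rng(seed)
    z_samples = rng.uniform(0.0, 1.0, size=(M_g, nz))    # samples defining gamma_hat
    return np.array([G_theta(z) for z in z_samples])     # samples defining beta_hat

# Toy usage: a linear map embedding a 2-dimensional latent space into a 5-dimensional X
A = np.arange(10).reshape(5, 2) / 10.0
model_samples = sample_implicit_model(lambda z: A @ z, nz=2, M_g=100)
```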

2.2 Optimal transport loss

The OTL is used in various fields such as image analysis, natural language processing, and finance (Ollivier et al. 2014; Santambrogio 2015; Peyre and Cuturi 2019a, b). In particular, the OTL is widely used as a loss function in generative modeling, mainly because it is applicable even when the supports of the probability distributions do not match, and because it naturally incorporates the distance in the sample space \(\mathcal {X}\) (Montavon et al. 2016; Bernton et al. 2017; Arjovsky et al. 2017; Tolstikhin et al. 2017; Genevay et al. 2017; Bousquet et al. 2017). The OTL is defined as the minimum cost of moving a probability distribution \(\alpha \) to another distribution \(\beta \):

Definition 1

(Optimal Transport Loss; Kantorovich (1942))

$$\begin{aligned} \mathcal {L}_c(\alpha ,\beta ) =&\min _{\pi } \int c(\varvec{x},\varvec{y}) d\pi (\varvec{x},\varvec{y}), \nonumber \\ \mathrm {subject\ to} \quad&\int \pi (\varvec{x},\varvec{y}) d\varvec{x}=\beta (\varvec{y}), \nonumber \\&\int \pi (\varvec{x},\varvec{y}) d\varvec{y}=\alpha (\varvec{x}), \pi (\varvec{x},\varvec{y}) \ge 0, \end{aligned}$$
(2.3)

where \(c(\varvec{x},\varvec{y}) \ge 0\) is a non-negative function on \(\mathcal {X}\times \mathcal {X}\) that represents the transport cost from \(\varvec{x}\) to \(\varvec{y}\) and is called the ground cost. Also, we call the \(\pi \) that minimizes \(\mathcal {L}_c(\alpha ,\beta )\) the optimal transport plan.

In general, the OTL does not satisfy the axioms of a metric between probability distributions; but it does when the ground cost is expressed in terms of a metric function as follows:

Definition 2

(p-Wasserstein distance; Villani (2009)) When the ground cost \(c(\varvec{x},\varvec{y})\) is expressed as \(c(\varvec{x},\varvec{y})=d(\varvec{x},\varvec{y})^p\) with a metric function \(d(\varvec{x},\varvec{y})\) and a real positive constant p, then the OTL satisfies the axioms of metric and

$$\begin{aligned} \mathcal {W}_p(\alpha ,\beta ) = \mathcal {L}_{d^p}(\alpha ,\beta )^{1/p} \end{aligned}$$
(2.4)

is called the p-Wasserstein distance.

The p-Wasserstein distance satisfies the conditions of a metric between probability distributions. That is, for arbitrary probability distributions \(\alpha , \beta , \gamma \), the following properties hold: \(\mathcal {W}_p(\alpha ,\beta )\ge 0\), \(\mathcal {W}_p(\alpha ,\beta )=\mathcal {W}_p(\beta ,\alpha )\), and \(\mathcal {W}_p(\alpha ,\beta )=0\Leftrightarrow \alpha = \beta \); it also satisfies the triangle inequality \(\mathcal {W}_p(\alpha ,\gamma )\le \mathcal {W}_p(\alpha ,\beta )+\mathcal {W}_p(\beta ,\gamma )\).

In general it is difficult to minimize the OTL \(\mathcal {L}_c(\alpha ,\beta _{\varvec{\theta }})\) for the probability distributions \(\alpha \) and \(\beta _{\varvec{\theta }}\) themselves. Instead, as mentioned in Eq. 2.1, we minimize an approximation of the OTL via the empirical distributions (2.2):

Definition 3

(Empirical estimator for OTL; Villani (2009))

$$\begin{aligned}&\mathcal {L}_c\left( \hat{\alpha }_{M}, \hat{\beta }_{\varvec{\theta },{M_g}} \right) = \min _{\{\pi _{i,j}\}_{i,j=1}^{{M},{M_g}} } \sum _{i=1}^{{M}} \sum _{j=1}^{{M_g}} c(\varvec{x}_i,G_{\varvec{\theta }}(\varvec{z}_j))\pi _{i,j}, \nonumber \\&\mathrm {subject\ to} \quad \sum _{i=1}^{M}\pi _{i,j} = \frac{1}{{M_g}}, \sum _{j=1}^{M_g}\pi _{i,j} = \frac{1}{{M}}, \pi _{i,j} \ge 0. \end{aligned}$$
(2.5)
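The minimization in Eq. 2.5 is a finite linear program over the transport plan \(\pi _{i,j}\). As a minimal sketch (assuming the ground-cost matrix \(c(\varvec{x}_i,G_{\varvec{\theta }}(\varvec{z}_j))\) has already been evaluated), it can be solved with a generic LP solver; the function below is illustrative and uses SciPy's linprog rather than a dedicated optimal-transport library.

```python
import numpy as np
from scipy.optimize import linprog

def empirical_otl(cost_matrix):
    """Solve the linear program of Eq. (2.5) for a given ground-cost matrix.

    cost_matrix: array of shape (M, Mg) with entries c(x_i, G_theta(z_j)).
    Returns the value of the empirical optimal transport loss."""
    M, Mg = cost_matrix.shape
    c = cost_matrix.ravel()                       # variables pi_{i,j}, index i*Mg + j
    A_col = np.zeros((Mg, M * Mg))                # sum_i pi_{i,j} = 1/Mg for each j
    for j in range(Mg):
        A_col[j, j::Mg] = 1.0
    A_row = np.zeros((M, M * Mg))                 # sum_j pi_{i,j} = 1/M for each i
    for i in range(M):
        A_row[i, i * Mg:(i + 1) * Mg] = 1.0
    A_eq = np.vstack([A_col, A_row])
    b_eq = np.concatenate([np.full(Mg, 1.0 / Mg), np.full(M, 1.0 / M)])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Example with a random 4 x 6 ground-cost matrix
rng = np.random.default_rng(1)
print(empirical_otl(rng.uniform(size=(4, 6))))
```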

The empirical estimator \(\mathcal {L}_c( \hat{\alpha }_{M},\hat{\beta }_{\varvec{\theta },{M}})\) converges to \(\mathcal {L}_c(\alpha ,\beta _{\varvec{\theta }})\) in the limit \({M=M_g}\rightarrow \infty \). In general, the speed of this convergence is of the order of \(O(M^{-1/N_x})\) with \(N_x\) the dimension of the sample space \(\mathcal {X}\) (Dudley 1969), but the p-Wasserstein distance enjoys the following convergence law (Weed and Bach 2019).

Theorem 1

(Convergence rate of p-Wasserstein distance) For the upper Wasserstein dimension \(d_p^*(\alpha )\) (which is given in Definition 4 of Weed and Bach (2019)) of the probability distribution \(\alpha \), the following expression holds when s is larger than \(d_p^*(\alpha )\):

$$\begin{aligned} \mathbb {E}\left[ \mathcal {W}_p(\alpha ,\hat{\alpha }_{M})\right] \lesssim O({M}^{-1/s}), \end{aligned}$$
(2.6)

where the expectation \(\mathbb {E}\) is taken with respect to the samples defining the empirical distribution \(\hat{\alpha }_{M}\).

Intuitively, the upper Wasserstein dimension \(d_p^*(\alpha )\) can be interpreted as the support dimension of the probability distribution \(\alpha \), which corresponds to the dimension of the latent space, \(N_z\), in the implicit generative model. Exploiting the metric properties of the p-Wasserstein distance, the following corollaries are immediately derived from Theorem 1:

Corollary 1

(Convergence rate of p-Wasserstein distance between empirical distributions sampled from a common distribution) Let \(\hat{\alpha }_{1,M}\) and \(\hat{\alpha }_{2,M}\) be two different empirical distributions sampled from a common distribution \(\alpha \). The number of samples is M in both empirical distributions. Then the following expression holds for \(s > d_p^*(\alpha )\):

$$\begin{aligned} \mathbb {E}\left[ \mathcal {W}_p(\hat{\alpha }_{1,M},\hat{\alpha }_{2,{M}})\right] \lesssim O({M}^{-1/s}), \end{aligned}$$
(2.7)

where the expectation \(\mathbb {E}\) is taken with respect to the samples defining the empirical distributions \(\hat{\alpha }_{1,M}\) and \(\hat{\alpha }_{2,M}\).

Corollary 2

(Convergence rate of p-Wasserstein distance between different empirical distributions) Suppose that the upper Wasserstein dimension of the probability distributions \(\alpha \) and \(\beta _{\varvec{\theta }}\) is at most \(d_p^*\), then the following expression holds for \(s>d_p^*\):

$$\begin{aligned} \mathbb {E} \left[ |\mathcal {W}_p(\alpha ,\beta _{\varvec{\theta }})-\mathcal {W}_p(\hat{\alpha }_{M},\hat{\beta }_{\varvec{\theta },{M}}) |\right] \lesssim O({M}^{-1/s}), \end{aligned}$$
(2.8)

where the expectation \(\mathbb {E}\) is taken with respect to the samples defining the empirical distributions \(\hat{\alpha }_{M}\) and \(\hat{\beta }_{\varvec{\theta },{M}}\).

These corollaries indicate that the empirical estimator (2.5) is a good estimator if the intrinsic dimension of the training data and the dimension of the latent space \(N_z\) are sufficiently small, because the Wasserstein dimension \(d_p^*\) is essentially the intrinsic dimension of the training data or the latent dimension. In Sect. 4.1, we numerically observe that similar convergence laws hold even when the OTL is not the p-Wasserstein distance.

3 Learning algorithm of generative model for quantum ensemble

In Sect. 3.1, we define the new quantum OTL that can be suitably used in the learning algorithm of the generative model for quantum ensemble. The learning algorithm is provided in Sect. 3.2.

3.1 Optimal transport loss with local ground cost

Our idea is to directly use Eq. 2.5, but with a ground cost for quantum states, \(c(| \psi \rangle , | \phi \rangle )\), rather than one for classical data vectors, \(c(\varvec{x}, \varvec{y})\). This enables us to define the OTL between quantum ensembles \(\{| \psi _i \rangle \}\) and \(\{| \phi _j \rangle \}\) as follows:

$$\begin{aligned} \mathcal {L}_c\left( \{| \psi _i \rangle \}, \{ | \phi _j \rangle \} \right) =&\min _{\{\pi _{i,j}\} } \sum _{i,j} c\left( | \psi _i \rangle , | \phi _j \rangle \right) \pi _{i,j}, \nonumber \\ \mathrm {subject\ to} \quad&\sum _{i}\pi _{i,j} = q_j, ~~ \sum _{j}\pi _{i,j} = p_i, ~~ \pi _{i,j} \ge 0, \end{aligned}$$
(3.1)

where \(p_i\) and \(q_j\) are the probabilities with which \(| \psi _i \rangle \) and \(| \phi _j \rangle \) appear, respectively. Note that we could instead define an OTL between the corresponding mixed states \(\sum _i p_i | \psi _i \rangle \langle \psi _i |\) and \(\sum _j q_j | \phi _j \rangle \langle \phi _j |\) or some modification of them, as discussed in Chakrabarti et al. (2019); but, as mentioned in Sect. 1, such mixed states lose the original configuration of the ensemble (e.g., single-qubit pure states uniformly distributed on the equator of the Bloch sphere) and thus are not applicable to our purpose.

Then our question is how to define the ground cost \(c(| \psi \rangle , | \phi \rangle )\). An immediate choice might be the trace distance:

Definition 4

(Trace distance for pure states; Nielsen and Chuang (2000))

$$\begin{aligned} c_{\textrm{tr}}(|{\psi }\rangle ,|{\phi }\rangle )&= \sqrt{1-|\langle \psi |{\phi }\rangle |^2}. \end{aligned}$$
(3.2)

Because the trace distance satisfies the axioms of a metric, we can define the p-Wasserstein distance for quantum ensembles,

$$\begin{aligned} \mathcal {W}_p( \{| \psi _i \rangle \}, \{ | \phi _i \rangle \} ) = \mathcal {L}_{d^p}( \{| \psi _i \rangle \}, \{ | \phi _i \rangle \} )^{1/p}, \end{aligned}$$

which allows us to have some useful properties described in Corollary 1. It is also notable that the trace distance is relatively easy to compute on a quantum computer, using, e.g., the swap test (Buhrman et al. 2001) or the inversion test (Havlíček et al. 2019).

We now give an important remark. As will be formally described, our goal is to find a quantum circuit that produces a quantum ensemble (by changing latent variables) which best approximates a given quantum ensemble. This task can be executed by gradient descent for a parameterized quantum circuit, but a naive setting leads to the vanishing gradient issue, meaning that the gradient vector decays to zero exponentially fast with respect to the number of qubits (McClean et al. 2018). There have been several proposals in the literature to avoid this issue, e.g., Cerezo et al. (2021), but a common prerequisite is that the cost should be a local one. To explain what this means, let us consider the case where \(| \phi \rangle \) is given by \(| \phi \rangle =U| 0 \rangle ^{\otimes n}\), where U is a unitary matrix (which will be a parameterized unitary matrix \(U(\varvec{\theta })\) defining the generative model) and n is the number of qubits. Then the trace distance is based on the fidelity \(|\langle \psi |U|{0}\rangle ^{\otimes n}|^2\). This is the probability of obtaining all zeros via a global measurement of the state \(U^\dagger | \psi \rangle \) in the computational basis, which means that the trace distance is a global cost; accordingly, the fidelity-based learning method suffers from the vanishing gradient issue. On the other hand, we find that the following cost function is based on localized fidelity measurements.

Definition 5

(Ground cost for quantum states only with local measurements; Khatri et al. (2019); Sharma et al. (2020))

$$\begin{aligned}&c_\textrm{local}(| \psi \rangle , | \phi \rangle ) = c_\textrm{local}(| \psi \rangle , U| 0 \rangle ^{\otimes n}) \nonumber \\&\qquad \qquad \qquad \quad = \sqrt{\frac{1}{n}\sum _{k=1}^n(1-p^{(k)})}, \nonumber \\&p^{(k)} = \textrm{Tr} \left[ P_0^k U^\dagger | \psi \rangle \langle \psi |U\right] , \nonumber \\&P_0^k = \mathbb {I}_1\otimes \mathbb {I}_2\otimes \cdots \otimes \overbrace{| 0 \rangle \langle 0 |_k}^{k\text {-}\mathrm {th\ qubit}}\otimes \cdots \otimes \mathbb {I}_n, \end{aligned}$$
(3.3)

where n is the number of qubits. Also, \(\mathbb {I}_i\) and \(| 0 \rangle \langle 0 |_i\) denote the identity operator and the projection operator that act on the i-th qubit, respectively; thus \(p^{(k)}\) represents the probability of getting 0 when observing the k-th qubit.
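For reference, the local cost (3.3) can be evaluated classically from statevectors as in the following minimal numpy sketch; the function name and the qubit-ordering convention (qubit k is bit k of the basis index) are our own assumptions, and on hardware \(p^{(k)}\) would instead be estimated from single-qubit measurement statistics.

```python
import numpy as np

def local_cost(psi, U, n):
    """Statevector sketch of c_local(|psi>, U|0...0>) in Eq. (3.3).

    psi: statevector of |psi>, shape (2**n,)
    U:   unitary matrix defining |phi> = U|0...0>, shape (2**n, 2**n)
    Convention (assumption): qubit k corresponds to bit k of the basis index."""
    rotated = U.conj().T @ psi                     # U^dagger |psi>
    probs = np.abs(rotated) ** 2                   # computational-basis probabilities
    idx = np.arange(2 ** n)
    cost_sq = 0.0
    for k in range(n):
        p_k = probs[((idx >> k) & 1) == 0].sum()   # p^(k): qubit k measured as 0
        cost_sq += (1.0 - p_k) / n
    return np.sqrt(cost_sq)
```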

Equation 3.3 is certainly a local cost, and thus it can be used to realize effective learning free from the vanishing gradient issue, provided that some additional conditions (described in Sect. 5) are satisfied. Importantly, however, \(c_\textrm{local}(| \psi \rangle ,| \phi \rangle )\) is not a distance between the two quantum states, because it is not symmetric and does not satisfy the triangle inequality, whereas the trace distance (3.2) satisfies the axioms of a distance. Yet \(c_\textrm{local}(| \psi \rangle ,| \phi \rangle )\) is always non-negative and becomes zero only when \(| \psi \rangle =| \phi \rangle \), meaning that \(c_\textrm{local}(| \psi \rangle ,| \phi \rangle )\) functions as a divergence. We can then prove that, in general, an OTL defined with a divergence ground cost also functions as a divergence, as follows. The proof is given in Appendix A.

Proposition 1

When the ground cost \(c(\varvec{x},\varvec{y})\) is a divergence satisfying

$$\begin{aligned} c(\varvec{x},\varvec{y})&\ge 0, \nonumber \\ c(\varvec{x},\varvec{y})&= 0 ~\text {iff}~ \varvec{x} = \varvec{y}, \end{aligned}$$
(3.4)

then the OTL \(\mathcal {L}_c(\alpha ,\beta )\) with \(c(\varvec{x},\varvec{y})\) is also a divergence. That is, \(\mathcal {L}_c(\alpha ,\beta )\) satisfies the following properties for arbitrary probability distributions \(\alpha \) and \(\beta \):

$$\begin{aligned} \mathcal {L}_c(\alpha ,\beta )&\ge 0, \nonumber \\ \mathcal {L}_c(\alpha ,\beta )&= 0 ~\text {iff}~ \alpha =\beta . \end{aligned}$$
(3.5)

Hence, the OTL \(\mathcal {L}_c\left( \{| \psi _i \rangle \}, \{ | \phi _i \rangle \} \right) \) in Eq. 3.1, defined with the local ground cost \(c_\textrm{local}(| \psi \rangle , | \phi \rangle )\) of Eq. 3.3, functions as a divergence. This means that \(\mathcal {L}_c\left( \{| \psi _i \rangle \}, \{ | \phi _i \rangle \} \right) \) can be suitably used for evaluating the difference between a given quantum ensemble and the set of output states of the generative model. At the same time, recall that, for the purpose of avoiding the vanishing gradient issue, we had to give up the global fidelity measure and, accordingly, the distance property of the OTL. Hence we cannot directly use the desirable properties described in Theorem 1 and Corollaries 1 and 2; nonetheless, in Sect. 4, we will see that similar properties do hold even for the divergence measure.

3.2 Learning algorithm

The goal of our task is to train an implicit generative model so that it outputs a quantum ensemble approximating a given ensemble \(\{| \psi _i \rangle \}_{i=1}^{M}\). Our generative model contains tunable parameters and latent variables, as in the classical case described in Sect. 2.1. More specifically, we employ the following implicit generative model:

Definition 6

(Implicit generative model on quantum circuit) Using the initial state \(| 0 \rangle ^{\otimes n}\) and the parameterized quantum circuit \(U(\varvec{z},\varvec{\theta })\), the implicit generative model on a quantum circuit is defined as

$$\begin{aligned} | \phi _{\varvec{\theta }}(\varvec{z}) \rangle =U(\varvec{z},\varvec{\theta })| 0 \rangle ^{\otimes n}. \end{aligned}$$
(3.6)

Here \(\varvec{\theta }\) is the vector of tunable parameters and \(\varvec{z}\) is the vector of latent variables that follow a known probability distribution; both \(\varvec{\theta }\) and \(\varvec{z}\) are encoded in the rotation angles of rotation gates in \(U(\varvec{z},\varvec{\theta })\).

A similar circuit model is also found in meta-VQE (Cervera-Lierta et al. 2021), which uses physical parameters such as interatomic distances instead of random latent variables \(\varvec{z}\). Also, the model proposed in Dallaire-Demers and Killoran (2018) introduces the latent variables \(\varvec{z}\) as the computational basis of an initial state, in the form \(| \phi _{\varvec{\theta }}(\varvec{z}) \rangle =U(\varvec{\theta })| \varvec{z} \rangle \); however, in this model, states with different latent variables are always orthogonal to each other, and thus the model cannot capture a small change of the state in the Hilbert space via a change of the latent variables. In contrast, the model (3.6) fulfills this purpose as long as the expressivity of the state with respect to \(\varvec{z}\) is sufficient. In addition, our model is advantageous in that the analytical derivative is available by the parameter shift rule (Mitarai et al. 2018; Schuld et al. 2019), not only for the tunable parameters \(\varvec{\theta }\) but also for the latent variables \(\varvec{z}\). This feature will be effectively utilized in the anomaly detection problem in Sect. 5.
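For concreteness, the following is a minimal sketch of the parameter-shift rule applied to an expectation value such as \(p^{(k)}\) in Eq. 3.3, viewed as a function of the rotation angles (whether \(\varvec{\theta }\) or \(\varvec{z}\)); the gradient of \(c_\textrm{local}\) then follows by the chain rule. The function name and the gate convention assumed in the shift are illustrative.

```python
import numpy as np

def parameter_shift_gradient(expval_fn, angles, shift=np.pi / 2):
    """Sketch of the parameter-shift rule for an expectation value expval_fn(angles).

    Assumes each angle enters through a gate of the form exp(-i * angle * P / 2)
    with P a Pauli operator; other gate conventions rescale the shift and prefactor."""
    grad = np.zeros_like(angles)
    for k in range(len(angles)):
        plus, minus = angles.copy(), angles.copy()
        plus[k] += shift
        minus[k] -= shift
        grad[k] = 0.5 * (expval_fn(plus) - expval_fn(minus))
    return grad
```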

Next, as for the learning cost, we take the following empirical estimator of OTL, calculated from the training data \(\{| \psi _i \rangle \}_{i=1}^{M}\) and the samples of latent variables \(\{\varvec{z}_j\}_{j=1}^{M_g}\):

$$\begin{aligned}&\mathcal {L}_{c_{\text {local}}}\left( \{| \psi _i \rangle \}_{i=1}^{M},\{ | \phi _{\varvec{\theta }}(\varvec{z}_j) \rangle \}_{j=1}^{M_g}\right) = \min _{\{\pi _{i,j}\}_{i,j=1}^{{M},{M_g}} } \sum _{i=1}^{M}\sum _{j=1}^{M_g} c_{\text {local},i,j}\,\pi _{i,j}, \nonumber \\&\mathrm {subject\ to} \quad \sum _{i=1}^{M}\pi _{i,j} = \frac{1}{{M_g}}, ~~ \sum _{j=1}^{M_g}\pi _{i,j} = \frac{1}{{M}}, ~~ \pi _{i,j} \ge 0, \end{aligned}$$
(3.7)

where \(c_{\text {local},i,j}\) is the ground cost given by

$$\begin{aligned}&c_{\text {local},i,j} = \sqrt{\frac{1}{n}\sum _{k=1}^n (1-p^{(k)}_{i,j})}, \nonumber \\&p^{(k)}_{i,j} =\textrm{Tr} \left[ P_0^k U^\dagger (\varvec{z}_j,\varvec{\theta })| \psi _i \rangle \langle \psi _i |U(\varvec{z}_j,\varvec{\theta })\right] , \nonumber \\&P_0^k = \mathbb {I}_1\otimes \mathbb {I}_2\otimes \cdots \otimes \overbrace{| 0 \rangle \langle 0 |_k}^{k\text {-}\mathrm {th\ qubit}}\otimes \cdots \otimes \mathbb {I}_n. \end{aligned}$$
(3.8)

Note that in practice \(c_{\text {local},i,j}\) is estimated with a finite number of measurements (shots); we denote by \(\tilde{c}_{\text {local},i,j}^{({N_s})}\) the estimator with \(N_s\) shots of the ideal quantity \(c_{\text {local},i,j}\), and in this case the OTL is denoted as \(\mathcal {L}_{\tilde{c}_{\text {local}}^{({N_s})}}\).

Based on the OTL (3.7), the pseudo-code of the proposed algorithm is shown in Algorithm 1. The total number of training quantum states \(| \psi _i \rangle \) required for a parameter update is of the order \(O({M}M_g N_s)\) in step 3, and \(O(\max (M,M_g) N_s N_p)\) in step 5, with \(N_p\) the number of trainable parameters, since the parameter shift rule (Mitarai et al. 2018; Schuld et al. 2019) is applicable.

Algorithm 1
figure a

Learning algorithm with quantum optimal transport loss Eq. (3.7).

4 Performance analysis of the cost and its gradient

In this section, we analyze the performance of the proposed OTL (3.7) and its gradient vector. First, in Sect. 4.1, we numerically study the approximation error of the loss, focusing on its dependence on the intrinsic dimension of the data and the number of qubits, to see whether results similar to Theorem 1 and Corollaries 1 and 2 hold even though the OTL is now a divergence rather than a distance. Then, in Sect. 4.2, we provide numerical and theoretical analyses of the approximation error as a function of the number of measurements (shots). Finally, in Sect. 4.3, we numerically show that the OTL avoids the vanishing gradient issue; i.e., thanks to the locality of the cost, its gradient does not decay exponentially fast. All the analysis in this section focuses on the properties of the cost at a fixed point of the learning process (say, at the initial time); the performance during the training process is discussed in the next section.

Fig. 1
figure 1

Structure of the parameterized quantum circuit (ansatz) used in the analysis in Sects. 4 and 5. This ansatz consists of repeated layers with a similar structure. In the \(\ell \)-th layer, single-qubit rotation operators with angles \(\{\theta _{\ell ,j} \cdot z_{\eta _{\ell ,j}}\}_{j=1}^n\) and directions \(\{\xi _{\ell ,j}\}_{j=1}^n\) are applied to each qubit, followed by a ladder of controlled-Z gates. The rotation directions \(\varvec{\xi }=\{\xi _{\ell ,j}\}_{\ell ,j=1}^{N_L,n}\) and the index parameters of the latent variables, \(\varvec{\eta }=\{\eta _{\ell ,j}\}_{\ell ,j=1}^{N_L,n}\), are randomly chosen at the beginning of the learning process and never changed during learning

We employ the parameterized unitary matrix \(U(\varvec{z},\varvec{\theta })\) shown in Fig. 1 to construct the implicit generative model (3.6), which is similar to that given in Ref. McClean et al. (2018) except that our model contains the latent variables \(\varvec{z}\). That is, the model is composed of the following \(N_L\) repeated unitaries (we call the \(\ell \)-th unitary the \(\ell \)-th layer):

$$\begin{aligned} U_{N_L,\varvec{\xi },\varvec{\eta }} (\varvec{z},\varvec{\theta }) = \prod _{\ell =1}^{N_L} W V_{\varvec{\xi }_\ell ,\varvec{\eta }_\ell }(\varvec{z},\varvec{\theta }_\ell ), \end{aligned}$$
(4.1)

where \(\varvec{\theta }_\ell =\{\theta _{\ell ,j}\}_{j=1}^n\), \(\varvec{\xi }_\ell =\{\xi _{\ell ,j}\}_{j=1}^n\), and \(\varvec{\eta }_\ell =\{\eta _{\ell ,j}\}_{j=1}^n\) are n-dimensional parameter vectors in the \(\ell \)-th layer. We summarize these vectors to \(\varvec{\theta }=\{\varvec{\theta }_\ell \}_{\ell =1}^{N_L}\), \(\varvec{\xi }=\{\varvec{\xi }_\ell \}_{\ell =1}^{N_L}\), and \(\varvec{\eta }=\{\varvec{\eta }_\ell \}_{\ell =1}^{N_L}\). Here \(\varvec{\theta }\) are trainable parameters and \(\varvec{z}\) are latent variables. W is a fixed entangling unitary gate composed of the ladder-structured controlled-Z gates; that is, W operates the two-qubit controlled-Z gate on all adjacent qubits;

$$\begin{aligned} W =\prod _{i=1}^{n-1} CZ_{i,i+1}, \end{aligned}$$
(4.2)

where \(CZ_{i,i+1}\) is the controlled-Z gate acting on the i-th and \((i+1)\)-th qubits. The operator \(V_{\varvec{\xi }_\ell ,\varvec{\eta }_\ell }(\varvec{z},\varvec{\theta }_\ell )\) consists of the single-qubit rotation operators:

$$\begin{aligned} V_{\varvec{\xi }_\ell ,\varvec{\eta }_\ell }(\varvec{z},\varvec{\theta }_\ell ) =\prod _{i=1}^n R_{\xi _{\ell ,i}}(\theta _{\ell ,i}z_{\eta _{\ell ,i}}), \end{aligned}$$
(4.3)

where \(R_{\xi _{\ell ,i}}(\theta _{\ell ,i}z_{\eta _{\ell ,i}})\) is the single-qubit rotation operator with angle \(\theta _{\ell ,i}z_{\eta _{\ell ,i}}\) and direction \({\xi _{\ell ,i}}\in \{X,Y,Z\}\) in the \(\ell \)-th layer, such as \(R_X(\theta _{\ell ,3} z_5)=\textrm{exp}(-i \theta _{\ell ,3} z_5 \sigma _x)\). The index parameters \(\eta _{\ell ,i}\in \{0,1,\cdots ,N_z\}\) and the directions \({\xi _{\ell ,i}}\in \{X,Y,Z\}\) are randomly chosen at the beginning of learning and never changed during learning. Also, we have introduced a constant bias term \(z_0=1\) so that the ansatz \(| \phi _{\varvec{\theta }}(\varvec{z}) \rangle =U(\varvec{z},\varvec{\theta })| 0 \rangle ^{\otimes n}\) can take a variety of states (for instance, if \(z_1, \ldots , z_{N_z}\approx 0\) in the absence of the bias, then \(| \phi _{\varvec{\theta }}(\varvec{z}) \rangle \) would be limited to the vicinity of \(| 0 \rangle ^{\otimes n}\)). We used Qiskit (ANIS et al. 2021) in all the simulation studies.
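As an illustration of this ansatz, the following Qiskit sketch builds the circuit of Eqs. 4.1, 4.2 and 4.3 for given \(\varvec{\theta }\), \(\varvec{\xi }\), \(\varvec{\eta }\), and \(\varvec{z}\); the function name is ours, and note that Qiskit's rotation gates implement \(\exp (-i\,\text {angle}\,\sigma /2)\), so a factor-of-two rescaling of the angles may be needed relative to the convention used in the text.

```python
import numpy as np
from qiskit import QuantumCircuit

def build_ansatz(n, n_layers, theta, xi, eta, z):
    """Sketch of U_{N_L, xi, eta}(z, theta) of Eqs. (4.1)-(4.3).

    theta: array (n_layers, n) of trainable angles
    xi:    array (n_layers, n) of rotation axes in {'x', 'y', 'z'}
    eta:   array (n_layers, n) of latent-variable indices in {0, ..., N_z}
    z:     latent-variable vector with the bias convention z[0] = 1"""
    qc = QuantumCircuit(n)
    for l in range(n_layers):
        for j in range(n):
            angle = theta[l, j] * z[eta[l, j]]
            getattr(qc, 'r' + xi[l, j])(angle, j)   # single-qubit rx/ry/rz rotation
        for j in range(n - 1):                      # ladder of controlled-Z gates (W)
            qc.cz(j, j + 1)
    return qc

# Example: random fixed (xi, eta) and random (theta, z) for a 4-qubit, 3-layer model
rng = np.random.default_rng(0)
n, n_layers, nz = 4, 3, 2
theta = rng.uniform(0, 2 * np.pi, size=(n_layers, n))
xi = rng.choice(['x', 'y', 'z'], size=(n_layers, n))
eta = rng.integers(0, nz + 1, size=(n_layers, n))
z = np.concatenate([[1.0], rng.uniform(0, 1, size=nz)])   # z_0 = 1 is the bias term
circuit = build_ansatz(n, n_layers, theta, xi, eta, z)
```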

4.1 Approximation error with respect to the number of training data

We have seen in Sect. 2.2 that, for the p-Wasserstein distance, the convergence rate of the approximation error is of the order \(\mathcal {O}(M^{-1/s})\) rather than \(\mathcal {O}(M^{-1/N_x})\), where M is the number of training samples, \(N_x\) is the dimension of the sample space \(\mathcal {X}\), and s can be interpreted as the intrinsic dimension of the data or the latent space dimension. This is desirable, because in general \(s \ll N_x\). However, the OTL (3.7) is not a distance but a divergence. Hence it is not clear whether it theoretically enjoys a similar error scaling; in this subsection we give numerical evidence supporting this conjecture.

We present the following two types of numerical simulations. The aim of the first one (Experiment A) is to see whether our OTL satisfies a property similar to Eq. 2.7, which describes the difference of two empirical distributions sampled from a common hidden distribution. The second one (Experiment B) addresses the case of Eq. 2.8, which describes the difference between the ideal OTL, assuming infinitely many samples are available, and the OTL of the empirical distributions. We used the statevector simulator (ANIS et al. 2021) to calculate the ideal OTL assuming an infinite number of measurements; in the next subsection, we study the influence of a finite number of shots on the performance. Throughout all the numerical experiments, we randomly chose the parameters \(\varvec{\xi },\varvec{\eta },\varvec{\theta },\tilde{\varvec{\xi }},\tilde{\varvec{\eta }}, \tilde{\varvec{\theta }}\) and did not change these values.

Table 1 List of parameters for numerical simulation of Sect. 4.1

Experiment A. As an analogue of the quantity appearing in Eq. 2.7, we focus on the following expected value of the empirical OTL defined in Eq. 3.7:

$$\begin{aligned} \mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}}\left[ J^{c_{\text {local}}}_{\varvec{{\xi }},{\varvec{\eta }};\varvec{\xi },\varvec{\eta }} (\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \right] , \end{aligned}$$
(4.4)

where

$$\begin{aligned}&J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }} (\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \nonumber \\&=\mathcal {L}_{c_{\text {local}}}\left( \{ U_{N_L,\tilde{\varvec{\xi }},\tilde{\varvec{\eta }}}(\tilde{\varvec{z}}_i,\tilde{\varvec{\theta }})| 0 \rangle ^{\otimes n}\}_{i=1}^{M}, ~ \{ U_{N_L,\varvec{\xi },\varvec{\eta }}(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{j=1}^{M} \right) . \end{aligned}$$
(4.5)

In Eq. 4.4, we set \(\varvec{\xi }=\tilde{\varvec{\xi }}\) and \(\varvec{\eta }=\tilde{\varvec{\eta }}\) for the two unitary operators that appear in the argument of \(\mathcal {L}_{c_{\text {local}}}\) in Eq. 4.5. This means that \(J^{c_{\text {local}}}_{\varvec{{\xi }},{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},{\varvec{\theta }}:\varvec{z},\varvec{\theta };M)\) in Eq. 4.4 converges to zero in the limit of an infinite number of training data (\(M\rightarrow \infty \)). The expectation in Eq. 4.4 is taken with respect to the latent variables \(\tilde{\varvec{z}}_i\) and \(\varvec{z}_j\), which follow the \(N_z\)-dimensional uniform distribution \(U(0,1)^{N_z}\); we numerically approximate it by \(N_{\text {Monte}}\) Monte Carlo samplings. The other conditions of the numerical simulation are shown in Table 1.

Figure 2 plots the values of Eq. 4.4 for several \(N_z\) (the dimension of the latent variables) and n (the number of qubits). In the figures, the dotted lines show the scaling curve \(M^{-1/N_z}\). Notably, for a large number of training data, the points and dotted lines are almost consistent, regardless of the number of qubits. This implies that the OTL between two different ensembles given by Eq. 4.4 is almost independent of the number of qubits n and depends mainly on the latent dimension \(N_z\), just as in the case of the distance-based loss function proven in Corollary 1.

Fig. 2
figure 2

Results of numerical simulations on the relationship between the number of training data and the OTL given by Eq. 4.4, for several qubit numbers n. Each subgraph shows the results for various latent dimensions \(N_z\). For reference, the scaling curves \(M^{-1/N_z}\) are added as dotted lines. These graphs show that Eq. (4.4) mainly scales as \(M^{-1/N_z}\) and is almost independent of the number of qubits, n

Fig. 3
figure 3

Simulation results of the fitting parameter b with respect to (a) the number of qubits n and (b) the latent dimension \(N_z\). The fitting parameter b is obtained by fitting the second term of Eq. 4.6 using Eq. 4.7. Subfigure (a) shows that b is almost independent of n, while subfigure (b) shows that b scales linearly with \(N_z\)

Experiment B. We next turn to the second experiment, which confirms that the approximation error of the proposed OTL scales similarly to Eq. 2.8. Specifically, we numerically show the dependence of the following expectation value on the number of training data M:

$$\begin{aligned}&\mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}} \left[ \lim _{K\rightarrow \infty } J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };K) \right. \nonumber \\&\qquad \qquad \qquad \qquad \qquad \quad -\left. J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \right] , \end{aligned}$$
(4.6)

where \(J^{c_{\text {local}}}\) is defined in Eq. 4.5. In this case, we set different fixed parameters \(\varvec{\xi }\ne \tilde{\varvec{\xi }}\) and \(\varvec{\eta }\ne \tilde{\varvec{\eta }}\) for the two unitary operators in Eq. 4.5. The parameters used in the numerical simulation are shown in Table 1.

The first term of Eq. 4.6 is the ideal quantity assuming an infinite number of training data. The second term is expected to take the following form, as suggested by Eq. 2.8:

$$\begin{aligned} \mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}} \left[ J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M)\right] = a M^{-1/b} + c. \end{aligned}$$
(4.7)

To identify the parameter b, we use the Monte Carlo method to calculate the left-hand side of Eq. 4.7 as a function of M and then fit the curve \(a M^{-1/b} + c\). We repeat this procedure for several values of the number of qubits n and the latent dimension \(N_z\); see Appendix B for a more detailed discussion. The result of the parameter identification is depicted in Fig. 3, which shows that the fitting parameter b is almost independent of n and scales linearly with \(N_z\). This result is indeed consistent with Eq. 2.8.
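The curve fitting of Eq. 4.7 itself is a standard nonlinear least-squares problem; the following sketch shows how the exponent parameter b can be extracted with SciPy, where the data points are placeholders rather than the actual simulation values.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_model(M, a, b, c):
    # Fitting form of Eq. (4.7): a * M^{-1/b} + c
    return a * M ** (-1.0 / b) + c

# Placeholder Monte Carlo averages of the OTL for several M (illustrative numbers)
M_values = np.array([8, 16, 32, 64, 128, 256])
J_values = np.array([0.52, 0.44, 0.38, 0.34, 0.31, 0.29])

(a_fit, b_fit, c_fit), _ = curve_fit(fit_model, M_values, J_values, p0=(1.0, 2.0, 0.1))
print(f"fitted exponent parameter b = {b_fit:.2f}")
```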

Table 2 List of parameters for numerical simulation of Sect. 4.2

The results obtained in Experiments A and B suggest the following conjecture:

Conjecture 1

The scaling of the approximation error of OTL (3.7) with respect to the number of training data M is determined via the latent dimension \(N_z\) as follows:

$$\begin{aligned}&\mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}} \left[ \mathcal {L}_{c_\mathrm{{local}}}\left( \{ U(\tilde{\varvec{z}}_i,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{i=1}^{\infty }, \{ U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{j=1}^{M}\right) \right] \lesssim O({M}^{-1/N_z}), \end{aligned}$$
(4.8)

irrespective of the number of qubits.

This conjecture means that the proposed OTL can be efficiently computed via sampling, when the intrinsic dimension of the data and the latent dimension are sufficiently small. Proving this conjecture would be challenging and is the subject of future work.

4.2 Approximation error with respect to the number of shots

Fig. 4
figure 4

Simulation results of the approximation error of the OTL due to the finite number of shots, defined in Eq. (4.10). The left and right panels show the dependence of the error on the number of training data M and the number of shots \(N_s\), respectively. The results for different latent dimensions \(N_z\) are shown in panels (a), (b), and (c). For reference, we added the curve \(M^{-1/2}\) as a dotted line in the left panels. The fit of the points at \(N_s =128\) with the curve \(\sqrt{c_1\ln (M)+c_2}\) is added as a dashed line in the left panels. Also, we added the curve \(N_s^{-1/2}\) as a dotted line in the right panels

The error analysis in Sect. 4.1 assumes that the number of shots is infinite and thereby that the ground cost between quantum states can be determined exactly. Here, we analyze the effect of a finite number of shots on the approximation error. The following proposition serves as the basis of the analysis.

Proposition 2

Let \(\tilde{c}^{(N_s)}_\mathrm{{local}}\) be an estimator of the ground cost \(c_\mathrm{{local}}\) of Eq. 3.3 using \(N_s\) samples. Suppose that the supports of the two probability distributions are strictly separated; as a result, there exists a lower bound \(g>0\) on the ground cost for any \(i,j\in \{1,2,\ldots ,M\}\), i.e., \(c_\mathrm{{local}}(| \psi _i \rangle ,U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n})>g, ~\forall i,j\). Then, for any positive constant \(\delta \), the following inequality holds:

$$\begin{aligned} P\left( \Big | \mathcal {L}_{c_\mathrm{{local}}}\left( \{| \psi \rangle _i\}_{i=1}^{M},\{ U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{j=1}^{M}\right) \right. \nonumber \\ - \mathcal {L}_{\tilde{c}^{(N_s)}_\mathrm{{local}}}\left( \{| \psi \rangle _i\}_{i=1}^{M},\{ U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}_{j=1}^{M}\right) \Big | \nonumber \\ \left. \ge \sqrt{\frac{2M}{\delta }}\sqrt{ \frac{1-g}{N_s} + \frac{(1-g)^2}{4N_s^2 g}}+\frac{1-g}{2 N_s \sqrt{g}}\right) \le \delta . \end{aligned}$$
(4.9)

The proof is given in Appendix C. Proposition 2 states that the approximation error of the OTL is upper bounded by a quantity of the order \(O(\sqrt{M/N_s})\), under the condition \(M\gg 1\) and \(N_s \gg 1\). Therefore, if Conjecture 1 is true, the approximation error due to the finiteness of \(N_s\) and M is upper bounded by \(O(M^{-1/N_z})+O\left( \sqrt{{M}/N_s}\right) \), where \(N_z\) is the latent dimension.
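Since the bound in Eq. 4.9 is explicit, it can be evaluated numerically to choose a shot budget \(N_s\) for a given ensemble size M; the following sketch simply evaluates the deviation threshold appearing in Eq. 4.9, with illustrative input values.

```python
import numpy as np

def otl_shot_error_bound(M, N_s, g, delta):
    """Evaluate the deviation threshold appearing in Proposition 2 (Eq. 4.9)."""
    first = np.sqrt(2 * M / delta) * np.sqrt((1 - g) / N_s + (1 - g) ** 2 / (4 * N_s ** 2 * g))
    return first + (1 - g) / (2 * N_s * np.sqrt(g))

# Example: M = 64 training states, N_s = 1024 shots, g = 0.1, delta = 0.05
print(otl_shot_error_bound(M=64, N_s=1024, g=0.1, delta=0.05))
```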

We then provide a numerical simulation to show the averaged approximation error as a function of M as well as \(N_s\). Again, we employ the hardware efficient ansatz shown in Fig. 1 of Sect. 4.1. The purpose of the numerical simulation is to see the dependence of the following expectation value on M and \(N_s\):

$$\begin{aligned}&\mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}} \left[ \left| J^{\tilde{c}^{(N_s)}_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \right. \right. \nonumber \\&\qquad \qquad \qquad \qquad \qquad \left. \left. - J^{c_{\text {local}}}_{\varvec{\tilde{\xi }},\tilde{\varvec{\eta }};\varvec{\xi },\varvec{\eta }}(\tilde{\varvec{z}},\tilde{\varvec{\theta }}:\varvec{z},\varvec{\theta };M) \right| \right] , \end{aligned}$$
(4.10)

where \(J^{\tilde{c}^{(N_s)}_{\text {local}}}\) is computed via Eq. (4.5) with \(c_{\text {local}}\) replaced by its \(N_s\)-shot estimator. As in the numerical simulation in Sect. 4.1, we approximate the expectation \(\mathbb {E}_{\tilde{\varvec{z}},\varvec{z}\sim U(0,1)^{N_z}}[\bullet ]\) by Monte Carlo sampling of \(\tilde{\varvec{z}}\) and \(\varvec{z}\) from the uniform distribution \(U(0,1)^{N_z}\). Also, we randomly choose the fixed parameters \(\varvec{\xi },\varvec{\eta },\varvec{\theta },\tilde{\varvec{\xi }},\tilde{\varvec{\eta }},\tilde{\varvec{\theta }}\) prior to the simulation. The other simulation parameters are given in Table 2. Simulation results are depicted in Fig. 4, where the notable points are summarized as follows.

  • For a small number of training data M, the approximation error is roughly proportional to \(M^{-1/2}\).

  • For a large number of training data M, the approximation error behaves as \(\sqrt{c_1 \ln M+c_2}\) with constants \(c_1\) and \(c_2\).

  • The dependence of the error on the number of shots \(N_s\) is roughly proportional to \(N_s^{-1/2}\).

Appendix D provides an intuitive explanation of these results. In particular, they show that the number of shots must be chosen appropriately, relative to the number of training data, to keep the approximation error small.

Table 3 List of parameters for numerical simulation of Sec. 4.3
Fig. 5
figure 5

The gradient of the cost function as a function of the number of qubits, n, for the case of (a) the global cost (3.2) and (b) the local cost (3.3). Panel (c) shows the dependence on the number of training data, M, for the case of the local cost. Several curves are depicted for different numbers of layers \(N_L\). A clear exponential decay is observed in (a), but it is avoided in (b). The polynomial decay (\(\simeq M^{-1}\)) observed in (c) implies simple statistical scaling

4.3 Avoidance of the vanishing gradient issue

In Sect. 3.1 we chose a cost function composed of local measurements, as a minimal requirement for avoiding the vanishing gradient issue. Note, however, that employing a local cost alone is not enough to avoid this issue; for instance, Ref. Cerezo et al. (2021) proposed using a special type of parameterized quantum circuit called the alternating layered ansatz (ALA) in addition to the local cost, which is proven to avoid the issue. Nevertheless, we here numerically demonstrate that our method can mitigate the decay of the gradient even without such an additional condition, compared to the case with the global cost.

More specifically, we calculated the variance of the partial derivative of the OTL (3.7) over random parameter choices, based on the training ensemble \(\{| \psi _i \rangle \}^{M}_{i=1}\) and the sampled data from the generative model \(\{U(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\}^{M}_{j=1}\);

$$\begin{aligned} \mathbb {V}_{\varvec{\xi },\varvec{\eta },\varvec{\theta },\varvec{z}}\left[ \frac{\partial }{\partial \theta } \mathcal {L}_{c_{\text {local}}}\left( \{ | \psi \rangle _i\}, \{ U_{N_L,\varvec{\xi },\varvec{\eta }}(\varvec{z}_j,\varvec{\theta })| 0 \rangle ^{\otimes n}\} \right) \right] . \end{aligned}$$
(4.11)

The partial derivative is calculated using the parameter shift rule (Mitarai et al. 2018). The variance is approximated by Monte Carlo calculations with respect to \(\varvec{z}\), \(\varvec{\xi }\), \(\varvec{\eta }\), and \(\varvec{\theta }\), where \(\varvec{z}\) is sampled from the uniform distribution U(0, 1) and \(\varvec{\xi },\varvec{\eta },\varvec{\theta }\) are sampled from \({\varvec{\xi }}\in \{X,Y,Z\}^{n N_L}\), \(\varvec{\eta }\in \{0,1,\ldots ,N_z\}^{n N_L}\), and \(\varvec{\theta } \in [0,2\pi ]^{n N_L}\). The structure of the generative model \(U(\varvec{z},\varvec{\theta })| 0 \rangle ^{\otimes n}\) is the same as that shown in Fig. 1. The derivative is taken with respect to \(\theta _{1,1}\). The training ensemble \(\{| \psi _i \rangle \}_{i=1}^{M}\) is prepared as follows:

$$\begin{aligned} | \psi \rangle _i = W' V'_2(\varvec{\zeta }^i_2)W' V'_1(\varvec{\zeta }^i_1)| 0 \rangle ^{\otimes n}, \end{aligned}$$

where \(\varvec{\zeta }^i_1=\{\zeta ^i_{1,j}\}_{j=1}^n\) and \(\varvec{\zeta }^i_2=\{\zeta ^i_{2,j}\}_{j=1}^n\) are randomly chosen from the uniform distribution on \([0,2\pi ]\) and fixed during Monte Carlo calculation. The operators \(W'\), \(V'_1\), and \(V'_2\) are defined as follows:

$$\begin{aligned}&W' =\prod _{j=1}^{n-1} CX_{j,j+1}, \nonumber \\&V'_1(\varvec{\zeta }^i_1) =\prod _{j=1}^n R_{j,Y}(\zeta ^i_{1,j}), \nonumber \\&V'_2(\varvec{\zeta }^i_2) =\prod _{j=1}^n R_{j,Z}(\zeta ^i_{2,j}), \end{aligned}$$
(4.12)

where \(CX_{j,k}\) denotes the controlled-X gate, which applies an X gate to the k-th qubit controlled on the j-th qubit. \(R_{j,Y}\) and \(R_{j,Z}\) denote single-qubit Pauli rotations of the j-th qubit around the y and z axes, respectively. The other simulation parameters are given in Table 3.

Figure 5 shows the numerical simulation results for the variance (4.11), for the cases where (a) the global cost (3.2) is used and (b) the local cost (3.3) is used. The number of training data is fixed to \(M=8\). A clear exponential decay of the gradient variance is observed for the global cost, regardless of \(N_L\). In contrast, for the local cost, the relatively shallow circuits with \(N_L=10, 25\) exhibit approximately constant scaling for \(n \ge 10\), while the deep circuits with \(N_L\ge 50\) also exhibit slower decay than the global case and keep a larger variance even when \(n \ge 8\). This result implies that the OTL with the local ground cost can avoid the vanishing gradient issue, even though the circuit is not specifically designed for this purpose. Note also that the result is consistent with that reported in Holmes et al. (2022), which studied a cost function composed of a single ground-cost term of the type used in our setting.

In addition, Fig. 5(c) shows the variance (4.11) as a function of the number of training data, M, in the case of \(n=14\). In the figure, the points represent the Monte Carlo numerical results and the dotted lines represent the scaling curves \(M^{-x}\), where the value x is determined by fitting. This fitting result implies that the gradient obeys simple statistical scaling with respect to M, and thus the proposed algorithm would enjoy efficient learning even for a large training ensemble.

5 Demonstration of the generative model training and application to anomaly detection

5.1 Quantum anomaly detection

Anomaly detection is the task of judging whether given test data \(\varvec{x}^{(t)}\) are anomalous (rare) or not, based on the knowledge learned from past training data \(\varvec{x}_i,(i=1,2,\ldots ,M)\), i.e., the generative model. Unlike typical classification tasks, this problem involves a large imbalance between the number of normal data and the number of anomalous data; the former is usually much larger than the latter. Therefore, typical classification methods are not suitable for this task, and specialized schemes have been widely developed (Chandola et al. 2009).

The anomaly detection problem is important in the field of quantum technology. That is, to realize accurate state preparation and control, we are required to detect contaminated quantum states and remove them as quickly as possible. Previous quantum anomaly detection schemes rely on measurement-based data processing (Hara et al. 2014, 2016), which, however, requires a large number of measurements, as in quantum state tomography. In contrast, our anomaly detection scheme directly inputs quantum states to the constructed generative model and then diagnoses the abnormality with far fewer measurements.

The following is the procedure for constructing the conventional anomaly detector based on the generative model (Ide 2015), which we apply to our quantum case.

  1. (Distribution estimation) Construct a model probability distribution from the normal dataset.

  2. (Anomaly score design) Define an Anomaly Score (AS), based on the model distribution of normal data.

  3. (Threshold determination) Set a threshold of the AS for diagnosing abnormality.

Table 4 List of parameters for numerical simulation in Sect. 5.2

Of these steps, the model probability distribution in Step 1 is constructed by the learning algorithm presented in Sect. 3.2. To design the AS in Step 2, we refer to AnoGAN (Schlegl et al. 2017) in classical machine learning. Namely, we define a loss function

$$\begin{aligned} \mathcal {L}\left( U(\varvec{z},\varvec{\theta })| 0 \rangle ^{\otimes n},| \psi ^{(t)} \rangle \right) \end{aligned}$$

for a test state \(| \psi ^{(t)} \rangle \) and the generative model \(U(\varvec{z},\bar{\varvec{\theta }})\) constructed from the training dataset with the learned parameters \(\bar{\varvec{\theta }}\). Then we take the minimum with respect to the latent variables \(\varvec{z}\) to calculate the AS:

$$\begin{aligned} (\text {Anomaly Score}) =\min _{\varvec{z}} \mathcal {L} \left( U(\varvec{z},\bar{\varvec{\theta }})| 0 \rangle ^{\otimes n},| \psi ^{(t)} \rangle \right) . \end{aligned}$$
(5.1)

As the loss function \(\mathcal {L}\), we use the local ground cost \(c_\textrm{local}(| \psi ^{(t)} \rangle ,U(\varvec{z},\bar{\varvec{\theta }})| 0 \rangle ^{\otimes n})\) given in Eq. 3.3. The above minimization is executed via gradient descent with respect to \(\varvec{z}\), where the gradient is obtained via the parameter shift rule, similarly to the derivative with respect to \(\varvec{\theta }\). Algorithm 2 summarizes the procedure.

Algorithm 2
figure b

Algorithm to calculate Anomaly Score.
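A minimal sketch of this procedure is the following: starting from a random \(\varvec{z}\), the loss is minimized over the latent variables by gradient descent, and the final loss value is returned as the AS. The function names are ours; a simple finite-difference gradient is used here as a stand-in for the parameter-shift derivative described in the text, and a plain gradient-descent update replaces whatever optimizer is actually employed.

```python
import numpy as np

def anomaly_score(loss_fn, nz, n_steps=200, lr=0.05, seed=0):
    """Sketch of Eq. (5.1): minimize loss_fn(z) over the latent variables z.

    loss_fn: callable z -> estimate of c_local(|psi_t>, U(z, theta_bar)|0...0>)
    nz:      dimension of the latent space"""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, size=nz)
    for _ in range(n_steps):
        grad = np.zeros(nz)
        for k in range(nz):
            zp, zm = z.copy(), z.copy()
            zp[k] += 1e-3
            zm[k] -= 1e-3
            grad[k] = (loss_fn(zp) - loss_fn(zm)) / 2e-3   # central finite difference
        z -= lr * grad                                      # gradient-descent update
    return loss_fn(z)
```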

5.2 Distributed dataset

The first demonstration is to construct a generative model that learns a quantum ensemble distributed on the equator of the generalized Bloch sphere. That is, the training ensemble (i.e., the normal dataset) \(\{| \psi _j \rangle \}_{j=1}^{M}\) to be learned is set as follows:

$$\begin{aligned} | \psi _j \rangle = \textrm{cos}(\pi /4)| 0 \rangle +e^{2\pi i \phi ^j}\textrm{sin}(\pi /4)| 2^n -1 \rangle , \end{aligned}$$
(5.2)

where \(\phi ^j\) is randomly generated from the uniform distribution on [0, 1] and \(| x \rangle \) denotes the x-th computational basis state in the \(2^n\)-dimensional Hilbert space. Note that the configuration of this ensemble cannot be learned by the existing mixed-state-based quantum anomaly detection schemes (Hara et al. 2014, 2016), because the mixed state corresponding to this ensemble is the maximally mixed state on the subspace spanned by \(| 0 \rangle \) and \(| 2^n-1 \rangle \), the learning of which does not give us a generative model recovering the original ensemble.
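For completeness, the training ensemble (5.2) can be generated classically as statevectors with the following sketch; the function name and the random-seed handling are illustrative.

```python
import numpy as np

def equator_ensemble(n, M, seed=0):
    """Generate M states of Eq. (5.2) on the equator spanned by |0...0> and |1...1>."""
    rng = np.random.default_rng(seed)
    states = []
    for _ in range(M):
        phi = rng.uniform(0.0, 1.0)
        psi = np.zeros(2 ** n, dtype=complex)
        psi[0] = np.cos(np.pi / 4)                                # amplitude on |0...0>
        psi[-1] = np.exp(2j * np.pi * phi) * np.sin(np.pi / 4)    # amplitude on |2^n - 1>
        states.append(psi)
    return states
```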

We employ the same ansatz as that given in Sect. 4 with the parameters shown in Table 4 and construct the generative model according to Algorithm 1. As the optimizer, we take Adam (Kingma and Ba 2014) with learning rate 0.01. The number of learning iterations (i.e., the number of parameter updates) is set to 1500 for \(n=2\) and 10000 for \(n=10\).

Once the model for the normal ensemble is constructed, it is used for anomaly detection. Here the set of test states \(\{| \psi ^{(t)} \rangle \}\) is given by

$$\begin{aligned} \Big |{\psi ^{(t)}}\Big \rangle = \textrm{cos}\left( \frac{\pi }{2}\theta ^{(t)}\right) | 0 \rangle +e^{2\pi i \phi ^{(t)}}\textrm{sin}\left( \frac{\pi }{2}\theta ^{(t)}\right) | 2^n -1 \rangle , \end{aligned}$$
(5.3)

where \(\theta ^{(t)}, \phi ^{(t)} \in \{0, 0.1, 0.2, \ldots ,2\}\). We calculate the AS using Algorithm 2. The other simulation parameters are shown in Table 4.

The numerical simulation results in the cases of \(n=2\) and \(n=10\) are presented in Figs. 6 and 7, respectively. The training ensemble is shown in panels (a), where each blue point corresponds to a generalized Bloch vector. Some output states of the constructed generative model, corresponding to different values of \(z \in [0,1]\), are shown in panels (b). Both the red and blue points in panels (c) represent the test states (5.3). Panels (d) show the calculated AS, where the blue and red plots correspond to the blue and red points in (c), respectively. The dotted lines in (d) illustrate the theoretical values expected if the model perfectly learned the training data.

Fig. 6
figure 6

Simulation results in the case of \(n=2\), for the distributed training ensemble. (a) Generalized Bloch vector representation of the training ensemble. (b) Output states from the generative model with latent variables \(\varvec{z} \in [0,1]\). (c) Test data (both red and blue points). (d) Anomaly scores for different test data. The blue and red plots correspond to the blue and red points in (c), respectively. The values of the angles \(\theta ^{(t)}\) and \(\phi ^{(t)}\) are given on the bottom and top horizontal axes, respectively. The dotted line represents the theoretical value of the AS

Fig. 7
figure 7

Simulation results in the case of \(n=10\), for the distributed training ensemble. (a) Generalized Bloch vector representation of the training states. (b) Output states from the generative model with latent variables \(\varvec{z} \in [0,1]\). (c) Test data (both red and blue points). (d) Anomaly scores for different test data. The blue and red plots correspond to the blue and red points in (c), respectively. The values of the angles \(\theta ^{(t)}\) and \(\phi ^{(t)}\) are given on the bottom and top horizontal axes, respectively

Firstly, we see a clear correlation between the AS and the theoretical curve in Fig. 6, implying that the AS is appropriately calculated via the proposed method. In a practical use case, a user defines a threshold on the AS depending on the task and then compares the calculated AS with the threshold to identify anomalous quantum states. For instance, if we set the threshold to \(\textrm{AS}=0.3\), the test states satisfying \(0.3 \le \theta ^{(t)}/\pi \le 0.7\) in Fig. 6 are judged as normal while the others are judged as anomalous. Moreover, the output states of the learned generative model and the result of anomaly detection in the case of \(n=10\) are shown in Fig. 7. Although the result displayed in (b) might suggest that the learning failed, the output states are in fact correlated with the training states displayed in (a); indeed, the output states lie on the xy-plane of the generalized Bloch sphere spanned by \(| 0 \rangle ^{\otimes 10}\) and \(| 1 \rangle ^{\otimes 10}\). In addition, it is notable that \(N_s=100\) is sufficient to perform anomaly detection even in the case of \(n=10\), provided that an appropriate generative model is obtained from the training normal states. This is a practical advantage, as it indicates that the proposed method may scale well with respect to the number of qubits.

5.3 Localized dataset

Next, let us consider a localized quantum ensemble. That is, the training ensemble \(\{| \psi _j \rangle \}_{j=1}^{M}\) corresponding to the normal dataset consists of the states

$$\begin{aligned} | \psi _j \rangle = \textrm{cos}\left( \frac{\pi }{2}{\varDelta }\theta _j\right) | 0 \rangle +e^{2\pi i {\varDelta }\phi _j}\textrm{sin}\left( \frac{\pi }{2}{\varDelta }\theta _j\right) | 2^n -1 \rangle , \end{aligned}$$
(5.4)

where n is the number of qubits, and \({\varDelta }\theta _j\) and \({\varDelta }\phi _j\) are sampled from the normal distribution \(N(\mu , \sigma )\) and the uniform distribution \(U(a, b)\), respectively (\(\mu \) and \(\sigma \) represent the mean and the variance, respectively). We consider the two cases \((\mu , \sigma , a, b) = (0, 0.02, 0, 0.1)\) for \(n=6\) and \((\mu , \sigma , a, b) = (0, 0.02, 0, 0.2)\) for \(n=10\). Note that, with this choice of parameters, the ensemble \(\{| \psi _j \rangle \}_{j=1}^{M}\) is nearly two-dimensionally distributed on the generalized Bloch sphere, as illustrated in Fig. 9(a, d). The other simulation parameters are shown in Table 5.

Table 5 List of parameters for numerical simulation of Sect. 5.3

Fig. 8

Learning curves for the cases (a) \(n=6\) and (b) \(n=10\). The OTL values are calculated with the parameters obtained at each iteration, trained with either the local cost (3.3) or the global cost (3.2). Note that, for a fair comparison, the displayed values in both the blue and orange cases are the costs evaluated with the global cost (3.2). The error bars represent one standard deviation

To construct a generative model via learning this two-dimensional distribution, we set the dimension of the latent variable to \(N_z=2\), based on the above observation on the dimensionality of \(\{| \psi _j \rangle \}_{j=1}^{M}\). Also, we here take the so-called alternating layered ansatz (ALA), which, together with the use of the local cost, is guaranteed to mitigate the gradient vanishing issue (Cerezo et al. 2021; Nakaji and Yamamoto 2021). This ansatz is more favorable than the previous one, which we here call the hardware efficient ansatz (HEA), in view of its ability to avoid the gradient vanishing issue; it is therefore worth comparing their learning curves. Typical learning curves are shown in Fig. 8. The blue plots, labeled "local", represent the case where the cost is the local one (3.3) and the ansatz is ALA; we denote this case as L-ALA. The orange plots, labeled "global", represent the case where the cost is the global one (3.2) and the ansatz is HEA; we denote this case as G-HEA. Note that both displayed costs are evaluated as the global one, to compare them directly; that is, "local" shows, at each iteration, the global cost evaluated with the parameters obtained by optimizing the local cost. We observe that L-ALA has a clear advantage over G-HEA in terms of convergence speed. This result coincides with that of Sect. 4.3, indicating the advantage of the local cost. In addition to the convergence speed, the final cost of L-ALA is lower than that of G-HEA. Note that the learning performance heavily depends on the initial random seed, yet it was difficult to find a successful setting for G-HEA; in all cases we tried, the trajectory appeared to be trapped in a local minimum, presumably because the gradient variance of G-HEA is much smaller than that of L-ALA.
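The precise circuit structures are given in the cited references; purely as a structural illustration of what an alternating layered ansatz looks like, the following sketch builds a brickwork circuit of parameterized single-qubit rotations followed by entangling gates on alternating neighbouring pairs. The specific gate choice and depth here are our assumptions, not the circuit actually used in the simulations.

```python
import numpy as np
from qiskit import QuantumCircuit

def alternating_layered_ansatz(n_qubits, n_layers, params):
    """Brickwork-style ALA sketch: layers of Ry rotations followed by CZ gates
    on alternating (even/odd) neighbouring pairs."""
    qc = QuantumCircuit(n_qubits)
    p = iter(params)
    for layer in range(n_layers):
        for q in range(n_qubits):
            qc.ry(next(p), q)
        start = layer % 2                      # alternate the pairing between layers
        for q in range(start, n_qubits - 1, 2):
            qc.cz(q, q + 1)
    return qc

n_qubits, n_layers = 6, 4
params = np.random.uniform(0, 2 * np.pi, size=n_qubits * n_layers)
print(alternating_layered_ansatz(n_qubits, n_layers, params))
```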

We apply the constructed generative model to the anomaly detection problem. In Fig. 9(b) and (e), the test quantum states are displayed for the cases of \(n=6\) and \(n=10\), respectively. The resulting anomaly scores for the test data are shown in Fig. 9(c, f). In both cases, we can say that the models are trained appropriately. In particular, the variance of the distribution of \(\{| \psi _j \rangle \}_{j=1}^{M}\), i.e., the distribution of the blue points in (a, d), is well captured by the width of the dip of the red lines in (c, f). Finally, note that in this section we use the QASM simulator for the numerical simulation; the number of shots is 1000 for each measurement in the learning process, and 50 for the anomaly detection task, even in the case of \(n=10\). Compared to state tomography, these shot counts are very small. Nonetheless, the proposed method enabled the model to learn the training ensemble with such a small number of shots.
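To give a feel for such small shot counts, the following sketch classically simulates a compute-uncompute overlap test, in which the all-zero outcome occurs with probability equal to the fidelity \(|\langle \phi | \psi \rangle |^2\); with only 50 shots the estimate is correspondingly coarse. This illustrates shot noise only and is not the measurement scheme prescribed by Algorithm 2.

```python
import numpy as np

def estimate_fidelity_with_shots(psi, phi, shots, seed=0):
    """Simulate a compute-uncompute overlap test: each shot returns the
    all-zero outcome with probability |<phi|psi>|^2."""
    rng = np.random.default_rng(seed)
    p = abs(np.vdot(phi, psi)) ** 2
    hits = rng.binomial(shots, p)
    return hits / shots

# two nearby states supported on |0...0> and |1...1> in a 2^n-dimensional space
n = 10
psi = np.zeros(2 ** n, dtype=complex); psi[0] = np.cos(np.pi / 8); psi[-1] = np.sin(np.pi / 8)
phi = np.zeros(2 ** n, dtype=complex); phi[0] = np.cos(np.pi / 8); phi[-1] = np.exp(0.2j) * np.sin(np.pi / 8)

print(estimate_fidelity_with_shots(psi, phi, shots=50))   # coarse, 50-shot estimate
print(abs(np.vdot(phi, psi)) ** 2)                        # exact value for comparison
```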

Fig. 9

Simulation results for the localized training ensemble, in the cases of (a, b, c) \(n=6\) and (d, e, f) \(n=10\). (a, d) Generalized Bloch vector representation of the training states. (b, e) Test data (both red and blue points). (c, f) Anomaly scores for different test data. The blue and red plots correspond to the blue and red points in (b, e), respectively. The values of the angles \(\theta ^{(t)}\) and \(\phi ^{(t)}\) are given on the bottom and top horizontal axes, respectively. The dotted line represents the theoretical value of AS

6 Conclusion

In classical machine learning, many generative models have been vigorously studied, but there are only a few studies on quantum generative models for quantum data. This paper offers a new approach for building such a quantum generative model, i.e., a learning method for quantum ensembles in an unsupervised machine learning setting. For that purpose, we proposed a loss function based on the optimal transport loss (OTL) that measures the distance between the set of model pure states and the set of training pure states, rather than the distance between the corresponding two mixed states. We then modified the proposed OTL to a sum of local costs, in order to avoid the vanishing gradient problem and thereby improve trainability, although this modification makes the cost no longer a distance. Hence, we showed that the localized OTL satisfies the properties of a divergence between probability distributions and confirmed that it is suitable as a cost for a generative model. We then theoretically and numerically analyzed the properties of the OTL. Our analysis indicates that the proposed OTL is a good cost function when the quantum data ensemble has a certain structure, i.e., when it is confined to a relatively low-dimensional manifold. In addition, we numerically showed that the OTL can avoid the vanishing gradient issue thanks to the locality of the cost. Finally, we demonstrated anomaly detection problems for quantum states that cannot be handled via existing methods.

Fig. 10

Simulation results for the relationship between the approximation error of the proposed loss, Eq. (4.6), and the number of training data M. The points represent the results of the proposed loss, and the dotted lines represent the fitting results with Eq. (4.7). The values of the parameter B obtained by the fitting are shown in the legend of the dotted lines

There remain many directions to explore along the lines of this paper. First, this paper assumed that quantum processing is performed individually for each quantum state, but it would be interesting to consider other settings, such as the case allowing coherent access (Aharonov et al. 2022) or quantum memory (Huang et al. 2022). In this scenario, the anomaly detection technique could be used to process quantum states coming from an external quantum processor, such as a quantum sensor. Also, it would be interesting to extend the cost to the entropy-regularized Wasserstein distance (Cuturi 2013; Feydy et al. 2018; Genevay et al. 2019; Amari et al. 2016, 2018), which is effective for dealing with higher-dimensional data in classical generative models. In this case, however, the localized cost does not satisfy the axioms of a distance, just as in the case demonstrated in this paper; this nevertheless remains an interesting topic for future work. Lastly, note that classical data can also be within the scope of our method, provided it can be effectively embedded into quantum data; then, for instance, anomaly detection of financial data might be a suitable target.