1 Introduction

Machine Learning has become ubiquitous, with applications in nearly every aspect of society today, in particular image and speech recognition, traffic prediction, product recommendation, medical diagnosis, stock market trading and fraud detection. One specific Machine Learning tool, deep neural networks, has seen tremendous developments over the past few years. Despite clear advances, these networks often suffer from a lack of training data: in Finance, the time series of a stock price only occurs once, and physical experiments are sometimes too expensive to run many times. To palliate this, attention has turned to methods aimed at reproducing existing data with a high degree of accuracy. Among these, Generative Adversarial Networks (GANs) are a class of unsupervised Machine Learning devices in which two neural networks, a generator and a discriminator, contest against each other in a minimax game in order to generate information similar to a given dataset (Goodfellow et al. 2014). They have been successfully applied in many fields over the past few years, in particular image generation (Yu et al. 2018; Schawinski et al. 2017), medicine (Anand and Huang 2018; Zhavoronkov 2019), and Quantitative Finance (Ruf and Wang 2021). They however often suffer from instability issues, vanishing gradients and potential mode collapse (Saxena and Cao 2021). Even Wasserstein GANs, which use the Wasserstein distance from optimal transport instead of the classical Jensen–Shannon divergence, are still subject to slow convergence and potential instability (Gulrajani et al. 2017).

In order to improve the accuracy of this method, Lloyd and Weedbrook (2018) and Dallaire-Demers and Killoran (2018) simultaneously introduced a quantum component to GANs, where the data consists of quantum states or classical data while the two players are equipped with quantum information processors. Preliminary works have demonstrated the quality of this approach, in particular for high-dimensional data, thus leveraging the exponential advantage of quantum computing (Huang et al. 2021). An experimental proof-of-principle demonstration of QuGAN in a superconducting quantum circuit was shown in Hu et al. (2019), while in Stein et al. (2020) the authors made use of quantum fidelity measurements to propose a loss function acting on quantum states. Further recent advances, providing more insights on how quantum entanglement can play a decisive role, have been put forward in Niu et al. (2022). While actual quantum computers are not available yet, Noisy Intermediate-Scale Quantum (NISQ) algorithms are already here and allow us to perform quantum-like operations (Bharti et al. 2021). The importance of such computations can be seen through the lens of data. Indeed, over the past five years, Quantitative Finance has put a large emphasis on data-based models (with the use of deep learning and reinforcement learning), with an obvious increasing need for large amounts of data for training purposes. Generative models (Kondratyev and Schwarz 2019) have thus found themselves key to generating (any amount of) realistic data that can then be used for training, and any computational speedup (given the extremely large size of these datasets) is urgently welcome, in particular that of quantum computing. In fact, quoting from Herman et al. (2022), ‘Numerous financial use cases require the ability to assess a wide range of potential outcomes. To do this, banks employ algorithms and models that calculate statistical probabilities. Such techniques are fairly effective, but not infallible. In a world where huge amounts of data are generated daily, computers that can compute probabilities accurately are becoming a predominant need. For this reason, several banks are turning to quantum computing given its promise to analyse vast amounts of data and compute results faster and more accurately than what any classical computer has ever been able to do’.

We focus here on building a fully connected Quantum Generative Adversarial Network (QuGAN), namely an entire quantum counterpart to a classical GAN. A quantum version of GANs was first introduced in Dallaire-Demers and Killoran (2018) and Lloyd and Weedbrook (2018), showing that it may exhibit an exponential advantage over classical adversarial networks. We should also like to mention some closely related works, in particular Situ et al. (2020), making clever use of Matrix Product State (MPS) quantum circuits, Nakaji and Yamamoto (2021) for classification, and Zoufal et al. (2019), where the generated distributions are brilliantly used to bypass the need to load classical data into quantum computers (here for option pricing purposes), a standard bottleneck in quantum algorithms. However, all these advances use a quantum generator and a classical discriminator, slightly different from our approach here, which builds a fully quantum GAN.

The paper is structured as follows: in Section 2, we recall the basics of a classical neural network and show how to build a fully quantum version of it. This is incorporated into the full architecture of a Quantum Generative Adversarial Network in Section 3. Since classical GANs are becoming an important focus in Quantitative Finance (Koshiyama et al. 2021; Buehler et al. 2019; Ni et al. 2020; Wiese et al. 2020), we provide an example of application of QuGAN to volatility modelling in Section 4, hoping to bridge the gap between the Quantum Computing and the Quantitative Finance communities. For completeness, we gather some essential background on Quantum Computing in Appendix A.

2 A quantum version of a non-linear quantum neuron

The quantum phase estimation procedure lies at the very core of building a quantum counterpart for a neural network. In this part, we will mainly focus on how to build a single quantum neuron. As the fundamental building block of artificial neural networks, a neuron classically maps a normalised input x = (x0,…,xn− 1)∈ [0,1]n to an output g(xw), where w = (w0,…,wn− 1)∈ [− 1,1]n is the weight vector, for some activation function g. The non-linear quantum neuron requires the following steps:

  • Encode classical data into quantum states (Section 2.2);

  • Perform the (quantum version of the) inner product xw (Section 2.3);

  • Apply the (quantum version of the) non-linear activation function (Section 2.4).

Before diving into the quantum version of neural networks, we recall the basics of classical (feedforward) neural networks, which we aim at mimicking.

2.1 Classical neural network architecture

Artificial neural networks (ANNs) are a subset of machine learning and lie at the heart of Deep Learning algorithms. Their name and structure are inspired by the human brain (Marblestone et al. 2016), mimicking the way that biological neurons signal to one another. They consist of several layers, with an input layer, one or more hidden layers, and an output layer, each one of them containing several nodes. An example of ANN is depicted in Fig. 1.

Fig. 1
figure 1

ANN with one input layer, 2 hidden layers and one output layer

Given an input vector \(\boldsymbol {\mathrm {x}} = (x_{1},\ldots ,x_{n})\in \mathbb {R}^{n}\), the connectivity between x and the j th neuron \(h^{(1)}_{j}\) of the first hidden layer (Fig. 1) is given by \(h^{(1)}_{j}=\sigma _{1,j}(b_{1,j}+{\sum }_{i=1}^{n} x_{i}w_{i,j})\), where σ1,j is called the activation function. Denoting by \(H_{k}=(h^{(k)}_{1},\ldots ,h^{(k)}_{s_{k}})\in \mathbb {R}^{s_{k}}\), with \(s_{k}\in \mathbb {N}^{*}\), the vector of the k th hidden layer, the connectivity model generalises to the whole network:

$$ h_{j}^{(k+1)}=\sigma_{k+1,j}\left( b_{k+1,j}+{\sum}_{i=1}^{s_{k}} h_{i}^{(k)}w_{i,k+1,j}\right), $$
(2.1)

where j ∈{1,…,sk+ 1}. Therefore, for l hidden layers the entire network is parameterised by \({\Omega }=(\sigma _{k,r_{k}},b_{k,r_{k}},w_{v_{k},k,r_{k}})_{k,r_{k},v_{k}}\), where 1 ≤ k ≤ l, 1 ≤ rk ≤ sk and 1 ≤ vk ≤ sk− 1. For a given training data set (Xi,Yi)i= 1,…,N of size N, the goal of a neural network is to build a mapping between (Xi)i= 1,…,N and (Yi)i= 1,…,N. The idea behind the neural network structure comes from the Kolmogorov-Arnold representation theorem (Arnold 1957; Kolmogorov 1956):
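For concreteness, the forward pass 2.1 can be written in a few lines. The following Python sketch (using numpy, with a sigmoid activation for every layer, an illustrative choice of ours rather than a requirement of the text) evaluates a two-hidden-layer network on a random input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Forward pass of 2.1: `layers` is a list of (W, b) pairs; every activation
    is taken to be a sigmoid, purely for illustration."""
    h = x
    for W, b in layers:
        h = sigmoid(b + h @ W)
    return h

rng = np.random.default_rng(0)
n, s1, s2 = 4, 5, 1   # input size and the widths of the two layers
layers = [(rng.normal(size=(n, s1)), rng.normal(size=s1)),
          (rng.normal(size=(s1, s2)), rng.normal(size=s2))]
print(forward(rng.normal(size=n), layers))
```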

Theorem 2.1

Let \(f: [0,1]^{d}\rightarrow \mathbb {R}\) be a continuous function. There exist sequences (Φi)i= 1,…,2d and (Ψi,j)i= 1,…,2d;j= 1,…,d of continuous functions from \(\mathbb {R}\) to \(\mathbb {R}\) such that for all (x1,…,xd) ∈ [0,1]d,

$$ f(x_{1},\ldots,x_{d})={\sum}_{i=1}^{2d}{\Phi}_{i}\left( {\sum}_{j=1}^{d}{\Psi}_{i,j}(x_{j})\right). $$
(2.2)

The representation of f resembles a two-hidden-layer ANN, where the Φi and Ψi,j are the activation functions.

2.2 Quantum encoding

Since a quantum computer only takes qubits as inputs, we first need to encode the classical data into a quantum state. For xj ∈ [0,1] and \(p\in \mathbb {N}\), denote by \(\frac {x_{j,1}}{2} + \frac {x_{j,2}}{2^{2}} + {\ldots } + \frac {x_{j,p}}{2^{p}}\) the p-binary approximation of xj, where each xj,k belongs to {0,1}, for k ∈{1,2,…,p}. The quantum code for the classical value xj is then defined via this approximation as

$$ |{x_{j}}\rangle := |{x_{j,1}}\rangle\otimes|{x_{j,2}}\rangle\otimes\ldots\otimes|{x_{j,p}}\rangle=|{x_{j,1}x_{j,2}{\ldots} x_{j,p}}\rangle, $$

and therefore the encoding for the vector x is

$$ |{\boldsymbol{\mathrm{x}}}\rangle := |{x_{0,1} x_{0,2}{\ldots} x_{0,p}}\rangle\otimes\ldots\otimes|{x_{n-1,1}{\ldots} x_{n-1,p}}\rangle. $$
(2.3)
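A minimal Python sketch of this encoding (the helper names are ours, and we simply compute the index of the corresponding computational-basis state rather than preparing an actual quantum register):

```python
import numpy as np

def binary_fraction(x, p):
    """Bits (x_1, ..., x_p) of the p-binary approximation of x in [0, 1)."""
    bits = []
    for _ in range(p):
        x *= 2
        bit = int(x)          # next binary digit
        bits.append(bit)
        x -= bit
    return bits

def encode_vector(x_vec, p):
    """Concatenate the p-bit codes of each coordinate, as in 2.3."""
    return [b for x in x_vec for b in binary_fraction(x, p)]

bits = encode_vector([0.625, 0.3], p=3)     # 0.625 -> [1, 0, 1]; 0.3 -> [0, 1, 0] (approximation)
index = int("".join(map(str, bits)), 2)     # index of the basis state |x> among the 2^(n p) states
state = np.zeros(2 ** len(bits)); state[index] = 1.0
print(bits, index)
```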

2.3 Quantum inner product

We now show how to build the quantum version of the inner product performing the operation

$$ |{0}\rangle^{\otimes m}|{\boldsymbol{\mathrm{x}}}\rangle\rightarrow |\widetilde{\mathbf{{x}}}^{\top} \boldsymbol{\mathrm{w}}\rangle|{\boldsymbol{\mathrm{x}}}\rangle. $$

Denote the two-qubit controlled Z-Rotation gate by

$$ {~}_{\mathrm{c}}\mathrm{R}_{z}(\alpha)= \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & \mathrm{e}^{2\mathrm{i}\pi \alpha} \end{pmatrix}, $$

where the parameter α determines the phase shift e2iπα. For x ∈{0,1} and \(|{+}\rangle :=\frac {1}{\sqrt {2}}(|{0}\rangle +|{1}\rangle )\), note that, for \(k\in \mathbb {N}\),

$$ {~}_{\mathrm{c}}\mathrm{R}_{z}\left( \frac{1}{2^{k}}\right) \left( |{+}\rangle|{x}\rangle\right) =\frac{1}{\sqrt{2}}\left( |{0}\rangle|{x}\rangle + \exp\left\{\frac{2\mathrm{i}\pi x}{2^{k}}\right\}|{1}\rangle|{x}\rangle\right) $$

Indeed, either x = 0 and then |x〉 = |0〉 so that

$$ {~}_{\mathrm{c}}\mathrm{R}_{z}\left( \frac{1}{2^{k}}\right) \left( |{+}\rangle|{x}\rangle\right) =\frac{1}{\sqrt{2}} \left( |{0}\rangle|{0}\rangle+|{1}\rangle|{0}\rangle\right), $$

or x = 1 and hence

$$ {~}_{\mathrm{c}}\mathrm{R}_{z}\left( \frac{1}{2^{k}}\right) \left( |{+}\rangle|{x}\rangle\right) =\frac{1}{\sqrt{2}}\left( |{0}\rangle|{1}\rangle + \exp\left\{\frac{2\mathrm{i}\pi}{2^{k}}\right\}|{1}\rangle|{1}\rangle\right). $$

The gate \(_{\mathrm {c}}\mathrm {R}_{z}\left (\alpha \right )\) acts on two qubits, where the first one constitutes what is called an ancilla qubit, since it controls the computation. From there, we define the ancilla register as the register composed of all the qubits used as control qubits.
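The phase-kickback identity above is easy to check numerically. The following numpy sketch (a direct linear-algebra simulation of ours, not a circuit on quantum hardware) verifies it for k = 3.

```python
import numpy as np

def cRz(alpha):
    """Two-qubit controlled phase gate diag(1, 1, 1, e^{2 i pi alpha}), as in the text."""
    return np.diag([1.0, 1.0, 1.0, np.exp(2j * np.pi * alpha)])

plus = np.array([1.0, 1.0]) / np.sqrt(2.0)
k = 3

for x in (0, 1):
    ket_x = np.eye(2)[x]                              # |x>
    out = cRz(1 / 2 ** k) @ np.kron(plus, ket_x)      # cRz(1/2^k) |+>|x>
    # expected: (|0>|x> + exp(2 i pi x / 2^k) |1>|x>) / sqrt(2)
    expected = (np.kron(np.eye(2)[0], ket_x)
                + np.exp(2j * np.pi * x / 2 ** k) * np.kron(np.eye(2)[1], ket_x)) / np.sqrt(2.0)
    assert np.allclose(out, expected)
```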

2.3.1 The case with m ancilla qubits and x w ∈{0,…,2m − 1}

The first part of the circuit consists of applying Hadamard gates on the ancilla register |0〉⊗m, which produces

$$ \mathrm{H}^{\otimes m}|{0}\rangle^{\otimes m}|{\boldsymbol{\mathrm{x}}}\rangle =\left( \frac{1}{\sqrt{2^{m}}}\sum\limits_{j=0}^{2^{m}-1}|{j}\rangle\right)\otimes|{\boldsymbol{\mathrm{x}}}\rangle. $$
(2.4)

The goal here is then to encode as a phase the result of the inner product xw. With the binary approximation 2.3 for |x〉 and m ancilla qubits, define for l ∈{1,…,m}, j ∈{0,…,n − 1} and k ∈{1,…,p}, \({~}_{\mathrm {c}}\mathrm {R}_{z}^{l,j,k}\left (\alpha \right )\), the cRz(α) matrix applied to the qubit |xj,k〉 with the l th qubit of the ancilla register as control. Finally, introduce the unitary operator

$$ \mathrm{U}_{\boldsymbol{\mathrm{w}},m} := \prod\limits_{l=0}^{m-1}\left\{\prod\limits_{j=0}^{n-1}\prod\limits_{k=1}^{p}{~}_{\mathrm{c}}\mathrm{R}_{z}^{m-l,j,k}\left( \frac{w_{j}}{2^{m+k}}\right)\right\}^{m-l}. $$
(2.5)

Proposition 2.2

The following identity holds for all \(n,p,m \in \mathbb {N}\):

$$ \mathrm{U}_{\boldsymbol{\mathrm{w}},m}\mathrm{H}^{\otimes m}|{0}\rangle^{\otimes m}|{\boldsymbol{\mathrm{x}}}\rangle = \left( \frac{1}{\sqrt{2^{m}}}{\sum}_{j=0}^{2^{m}-1} \exp\left\{2\mathrm{i}\pi j \frac{\widetilde{\boldsymbol{\mathrm{x}}}^{\top}\boldsymbol{\mathrm{w}}}{2^{m}}\right\}|{j}\rangle\right)\otimes|{\boldsymbol{\mathrm{x}}}\rangle, $$
(2.6)

where

$$ \widetilde{\boldsymbol{\mathrm{x}}}^{\top}\boldsymbol{\mathrm{w}} := {\sum}_{j=0}^{n-1}w_{j}{\sum}_{k=1}^{p}\frac{x_{j,k}}{2^{k}} $$

is the p-binary approximation of xw.

Proof

We prove the proposition for n = p = m = 2 for simplicity and the general case is analogous. Therefore we consider \(\mathrm {U}_{\boldsymbol {\mathrm {w}},2} :=\left \{{\prod }_{j=0}^{1}{\prod }_{k=1}^{2} {~}_{\mathrm {c}}\mathrm {R}_{z}^{2,j,k}\left (\frac {w_{j}}{2^{2+k}}\right )\right \}^{2}\)\( {\prod }_{j=0}^{1}{\prod }_{k=1}^{2} {~}_{\mathrm {c}}\mathrm {R}_{z}^{1,j,k}\left (\frac {w_{j}}{2^{2+k}}\right ).\) First, we have

$$ \begin{array}{@{}rcl@{}} &&{\prod}_{j=0}^{1}{\prod}_{k=1}^{2} {~}_{\mathrm{c}}\mathrm{R}_{z}^{1,j,k}\left( \frac{w_{j}}{2^{2+k}}\right) \left[\left( \frac{1}{\sqrt{2^{2}}}{\sum}_{j=0}^{2^{2}-1}|{j}\rangle\right) \otimes|{\boldsymbol{\mathrm{x}}}\rangle\right] \\&&=\frac{1}{\sqrt{2^{2}}}\left( |{0}\rangle+|{1}\rangle\right)\left( |{0}\rangle+\exp\left\{2\mathrm{i}\pi \frac{\widetilde{\boldsymbol{\mathrm{x}}}^{\top}\boldsymbol{\mathrm{w}}}{2^{2}}\right\}|{1}\rangle\right)\otimes|{\boldsymbol{\mathrm{x}}}\rangle, \end{array} $$

result to which we apply \(\left \{{\prod }_{j=0}^{1}{\prod }_{k=1}^{2} {~}_{\mathrm {c}}\mathrm {R}_{z}^{2,j,k}\left (\frac {w_{j}}{2^{2+k}}\right )\right \}^{2}\) which yields

$$ \frac{1}{\sqrt{2^{2}}}\left( |{0}\rangle+\exp\left\{2\mathrm{i}\pi 2 \frac{\widetilde{\boldsymbol{\mathrm{x}}}^{\top}\boldsymbol{\mathrm{w}}}{2^{2}}\right\}|{1}\rangle\right) \otimes \left( |{0}\rangle+\exp\left\{2\mathrm{i}\pi \frac{\widetilde{\boldsymbol{\mathrm{x}}}^{\top}\boldsymbol{\mathrm{w}}}{2^{2}}\right\}|{1}\rangle\right)\otimes|{\boldsymbol{\mathrm{x}}}\rangle, $$

which completes the proof of 2.6. □

From the definition of the Quantum Fourier transform in A.3, if \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top }\boldsymbol {\mathrm {w}}=k\in \{0,\ldots ,2^{m}-1\}\), the resulting state is

$$ \mathrm{U}_{\boldsymbol{\mathrm{w}},m}\left( \left( \mathrm{H}^{\otimes m}|{0}\rangle^{\otimes m}\right)\otimes|{\boldsymbol{\mathrm{x}}}\rangle\right) = \left( {~}_{\mathrm{q}}\mathcal{F}|{k}\rangle\right)\otimes|{\boldsymbol{\mathrm{x}}}\rangle =\left( {~}_{\mathrm{q}}\mathcal{F}|{\widetilde{\boldsymbol{\mathrm{x}}}^{\top} \boldsymbol{\mathrm{w}}}\rangle\right)\otimes|{\boldsymbol{\mathrm{x}}}\rangle. $$

Thus applying the inverse Quantum Fourier Transform is enough to retrieve \(|{\widetilde {\boldsymbol {\mathrm {x}}}^{\top }\boldsymbol {\mathrm {w}}}\rangle \). The pseudo-code is detailed in Algorithm 1 and the quantum circuit in the case n = p = m = 2 is depicted in Fig. 2 (and detailed in Example 2.3).

Algorithm 1
figure a

Quantum Inner Product (QIP) (w,x,Uw,m,m,p,ε)

Fig. 2
figure 2

QIP circuit for m = 2 ancilla qubits. The c line represents the classical register from which we retrieve the outcomes of the measurements. The controlled gate γ performs as \( C(\gamma ): |{q_{1}}\rangle |{q_{2}}\rangle \mapsto \mathbb{1}_{|{q_{1}}\rangle =|{1}\rangle }(|{q_{1}}\rangle )|{1}\rangle \otimes \mathrm {e}^{-\mathrm {i}\frac {\pi }{4}}|{q_{2}}\rangle +\mathbb{1}_{|{q_{1}}\rangle =|{0}\rangle }(|{q_{1}}\rangle )|{0}\rangle \otimes |{q_{2}}\rangle \)
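To illustrate the mechanism, here is a small numpy simulation (ours, not part of the original algorithm) of the ancilla register only: starting from the phase-encoded state of Proposition 2.2 and applying the inverse Quantum Fourier Transform recovers |x̃⊤w〉 with probability one when the inner product is an integer in {0,…,2m − 1}.

```python
import numpy as np

m = 4                                    # number of ancilla qubits
N = 2 ** m
k = 6                                    # assume x~.w = 6, an integer in {0, ..., 2^m - 1}

# ancilla register after U_{w,m} H^{x m}, cf. 2.6 (the |x> register is omitted here)
ancilla = np.exp(2j * np.pi * np.arange(N) * k / N) / np.sqrt(N)

# inverse Quantum Fourier Transform as a matrix: entries exp(-2 i pi a b / N) / sqrt(N)
iqft = np.exp(-2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N) / np.sqrt(N)

probs = np.abs(iqft @ ancilla) ** 2
print(np.argmax(probs), probs.max())     # -> 6 1.0: measuring the ancillas yields |k> with certainty
```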

Example 2.3

To understand the computations performed by the quantum gates, consider the case where n = p = 2. Therefore we only need 2 × 2 qubits to represent each element of the dataset, and these qubits constitute the main register. Introduce an ancilla register composed of m = 2 qubits, each initialised at |0〉, and suppose that the input state on the main register is |x〉. The goal is then to encode as a phase the result of the inner product xw, where w = (w0,w1). So in this example the entire wave function, combining both the main register’s qubits and the ancilla register’s qubits, is encoded in six qubits. Denote by \({~}_{\mathrm {c}}\mathrm {R}_{z}^{1,j,k}(\alpha )\) the cRz(α) matrix applied to the first qubit of the ancilla register and the qubit \(|{x_{j,k}}\rangle \), and by \({~}_{\mathrm {c}}\mathrm {R}_{z}^{2,j,k}(\alpha )\) the cRz(α) matrix applied to the second qubit of the ancilla register and the qubit |xj,k〉. The gates in 2.5 then read

$$ \begin{array}{@{}rcl@{}} &&\mathrm{U}_{\boldsymbol{\mathrm{w}},1} = {\prod}_{j=0}^{1}{\prod}_{k=1}^{2} {~}_{\mathrm{c}}\mathrm{R}_{z}^{1,j,k}\left( \frac{w_{j}}{2^{1+k}}\right) \quad\text{and}\quad\\&& \mathrm{U}_{\boldsymbol{\mathrm{w}},2} = \left\{{\prod}_{j=0}^{1}{\prod}_{k=1}^{2} {~}_{\mathrm{c}}\mathrm{R}_{z}^{2,j,k}\left( \frac{w_{j}}{2^{2+k}}\right)\right\}^{2} {\prod}_{j=0}^{1}{\prod}_{k=1}^{2} {~}_{\mathrm{c}}\mathrm{R}_{z}^{1,j,k}\left( \frac{w_{j}}{2^{2+k}}\right).\end{array} $$

Remark 2.4

There is an interesting and potentially very useful difference here between the quantum and the classical versions of a feedforward neural network; in the former, the input x is not lost after running the circuit, while this information is lost in the classical setting. This in particular implies that it can be used again for free in the quantum setting.

2.3.2 The case x w∉{0,…,2m − 1}

What happens if \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}\) is not an integer and \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}\geq 0\)? Again, the short answer is that we are able to obtain a good approximation of \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}\), which is itself already an approximation of the true value of the inner product xw. Indeed, with the gates constructed above, the QIP performs exactly like the QPE. A quick comparison between what is obtained at stage 3 of the QPE algorithm (Algorithm 2) and the output obtained at the third stage of the QIP 2.6 is enough to see that the QIP is simply an application of the QPE procedure. Thus \(\left \{{\prod }_{j=0}^{n-1}{\prod }_{k=1}^{p}{~}_{\mathrm {c}}\mathrm {R}_{z}^{1,j,k}\left (\frac {w_{j}}{2^{m+k}}\right )\right \}\) is a unitary matrix such that |1〉⊗|x〉 is an eigenvector with eigenvalue \(\exp \left \{2\mathrm {i}\pi \frac {\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}}{2^{m}}\right \}\).

Algorithm 2
figure b

Quantum phase estimation (U|u〉,m,ε)

Let \(\phi :=\frac {1}{2^{m}}\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}\); the QPE procedure (Appendix A) can only estimate ϕ ∈ [0,1). However, ϕ < 0 can happen, and so can \(\lvert \phi \rvert \geq 1\); such circumstances therefore have to be addressed. A first step is to require w ∈ [− 1,1]n, so that \(\lvert \widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}} \rvert \leq n\). Then one should take m (the number of ancillas) large enough so that

$$ \left| \frac{\widetilde{\boldsymbol{\mathrm{x}}}^{\top} \boldsymbol{\mathrm{w}} }{2^{m}}\right| \leq 1, $$
(2.7)

which amounts to \(m\geq \log _{2}(n)\). With these constraints respected, one obtains |ϕ|≤ 1, which is not enough since we need ϕ ∈ [0,1). The main idea to solve this is to compute \(\frac {\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}} }{2}\) instead of \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}\), which means dividing by 2 all the parameters of the \({~}_{\mathrm {c}}\mathrm {R}_{z}^{m,j,k}\) gates. Indeed, with 2.7, we have \(-2^{m} \leq \widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}} \leq 2^{m}\), and thus \(-2^{m-1} \leq \frac {1}{2} \widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}} \leq 2^{m-1}\).

  • In the case where \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}\geq 0\) we have \(\frac {\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}}{2} \in [0,2^{m-1}]\) and then by defining \(\widetilde {\phi }^{+}:=\frac {1}{2^{m}}\frac {\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}}{2}\) we then obtain \(\widetilde {\phi }^{+} \in [0,\frac {1}{2}]\), therefore the QPE can produce an approximation of \(\widetilde {\phi }^{+}\) as put forward in Algorithm 2 which then can be multiplied by 2m+ 1 to retrieve \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top }\boldsymbol {\mathrm {w}}\).

  • In the case where \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}} \leq 0\), then \(\frac {\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}}{2} \in [-2^{m-1},0]\). As above, |1〉⊗|x〉 is an eigenvector of \(\left \{{\prod }_{j=0}^{n-1}{\prod }_{k=1}^{p}{~}_{\mathrm {c}}\mathrm {R}_{z}^{1,j,k}\left (\frac {\frac {w_{j}}{2}}{2^{m+k}}\right )\right \}\) with corresponding eigenvalue \(\exp \left \{2\mathrm {i}\pi \frac {\frac {\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}}{2}}{2^{m}}\right \}= \exp \left \{2\mathrm {i}\pi \left [1+ \frac {\frac {\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}}{2}}{2^{m}}\right ]\right \}\). Defining \(\widetilde {\phi }^{-} := \frac {1}{2^{m}}\left (2^{m}+ \frac {\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}}{2}\right ) = 1+\frac {1}{2^{m}}\frac {\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}}{2}\) we then obtain \(\widetilde {\phi }^{-} \in [\frac {1}{2},1]\) which a QPE procedure can estimate and from which we can retrieve \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}\)

For values of ϕ measured in \([0,\frac {1}{2}) \cup (\frac {1}{2},1)\) we are sure about the associated value of the inner product. This means that for a fixed x, the map

$$ f: \Big[0,\frac{1}{2}\Big) \cup \left( \frac{1}{2},1\right)\ni \phi \mapsto \widetilde{\boldsymbol{\mathrm{x}}}^{\top} \boldsymbol{\mathrm{w}} \in [-n,n] $$

is injective. A measurement output equal to half could mean either that \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}=2^{m}\) or \(\widetilde {\boldsymbol {\mathrm {x}}}^{\top } \boldsymbol {\mathrm {w}}=-2^{m}\), which could be prevented for w ∈ [− 1,1]n and m large enough such that n < 2m. Under these circumstances, f can be extended to an injective function on [0,1), with 1 being excluded since the QPE can only estimate values in [0,1).

2.4 Quantum activation function

We consider an activation function \(\sigma :\mathbb {R}\to \mathbb {R}\). A classical example is the sigmoid \(\sigma (x):=\left (1+\mathrm {e}^{-x}\right )^{-1}\). The goal here is to build a circuit performing the transformation |x〉↦|σ(x)〉 where |x〉 and |σ(x)〉 are the quantum encoded versions of their classical counterparts as in Section 2.2. Again, we shall appeal to the Quantum Phase Estimation algorithm. For a q-qubit state \(|{x}\rangle =|{x_{1}{\ldots } x_{q}}\rangle \in \mathbb {C}^{2^{q}}\), we wish to build a matrix \(\mathrm {U} \in {\mathscr{M}}_{2^{q}}(\mathbb {C})\) such that

$$ \mathrm{U}|{x}\rangle=\mathrm{e}^{2\mathrm{i}\pi \sigma(x)}|{x}\rangle.$$

Considering

$$ \mathrm{U} := \text{Diag}\left( \mathrm{e}^{2\mathrm{i}\pi \sigma(0)},\mathrm{e}^{2\mathrm{i}\pi \sigma(1)},\mathrm{e}^{2\mathrm{i}\pi \sigma(2)},\ldots,\mathrm{e}^{2\mathrm{i}\pi \sigma(2^{q}-1)}\right), $$

then, for m ancilla qubits, the Quantum Phase estimation yields

$$ \text{QPE}: |{0}\rangle^{\otimes m}\otimes|{x}\rangle \mapsto|{\widetilde{\sigma(x)}}\rangle\otimes|{x}\rangle, $$

where again \(\widetilde {\sigma (x)}\) is the m-bit binary fraction approximation of σ(x), as detailed in Algorithm 2. In Fig. 3, we can see that the information flows from |x〉 = |x0,1x1,1x2,1x3,1〉 to the register attached to |q2〉 to obtain the inner product, and from the register |q2〉 to |q1〉 for the activation of the inner product. This explains why measuring only the register |q1〉 is enough to retrieve \(\sigma (\widetilde {\boldsymbol {\mathrm {x}}}^{\top }\boldsymbol {\mathrm {w}})\).

Fig. 3
figure 3

Quantum single neuron for \(|{\boldsymbol {\mathrm {x}}}\rangle \in \mathbb {C}^{2^{4}},\) one ancilla qubit |q2〉 for the QIP implemented via the controlled gate Uw,1 for w ∈ [− 1,1]4, and one ancilla qubit |q1〉 for the activation function σ
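As an illustration, the diagonal unitary U of this section can be written down explicitly for a small register. The numpy sketch below (with the integer x standing for the value encoded on the q qubits, and the sigmoid as activation, both illustrative choices) checks the eigenvalue relation that the QPE then exploits.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

q = 3                      # number of qubits carrying |x>
dim = 2 ** q

# U = Diag(e^{2 i pi sigma(0)}, ..., e^{2 i pi sigma(2^q - 1)})
U = np.diag(np.exp(2j * np.pi * sigmoid(np.arange(dim))))

x = 5                      # the integer encoded on the q qubits
ket_x = np.eye(dim)[x]
assert np.allclose(U @ ket_x, np.exp(2j * np.pi * sigmoid(x)) * ket_x)
# QPE applied to U with eigenvector |x> then writes an m-bit approximation of sigma(x)
# onto the ancilla register, i.e. the map |0>^m |x> -> |sigma(x)~> |x> described above.
```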

3 Quantum GAN architecture

A Generative Adversarial Network (GAN) is a network composed of two neural networks. In a classical setting, two agents, the generator and the discriminator, compete against each other in a zero-sum game (Kakutani 1941), playing in turns to improve their own strategy; the generator tries to fool the discriminator while the latter aims at correctly distinguishing real data (from a training database) from generated ones. As put forward in Goodfellow et al. (2014), the generative model can be thought of as an analogue to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminator plays the role of the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles. Under reasonable assumptions (the strategy spaces of the agents are compact and convex) the game has a unique (Nash) equilibrium point, where the generator is able to reproduce exactly the target data distribution. Therefore, in a classical setting, the generator G, parameterised by a vector of parameters 𝜃G, produces a random variable \(X_{\boldsymbol {\theta }_{G}}\), which we can write as the map

$$ \textbf{G}: \boldsymbol{\theta}_{G} \rightarrow X_{\boldsymbol{\theta}_{G}}. $$

The goal of the discriminator D, parameterised by 𝜃D, is to distinguish samples \(\boldsymbol {\mathrm {x}}_{\boldsymbol {\theta }_{G}}\) of \(X_{\boldsymbol {\theta }_{G}}\) from \(\boldsymbol {\mathrm {x}}_{\textit {Real}} \in \mathcal {D}\), where xReal has been sampled from the underlying distribution \(\mathbb {P}_{\mathcal {D}}\) of the database \(\mathcal {D}\). The map D thus reads

$$ \textbf{D}: \boldsymbol{\mathrm{x}}_{\boldsymbol{\theta}_{G}},\boldsymbol{\theta}_{D} \mapsto \mathbb{P}_{\boldsymbol{\theta}_{D}}\left( \boldsymbol{\mathrm{x}}_{\boldsymbol{\theta}_{G}} \text{ sampled from } \mathbb{P}_{\mathcal{D}}\right). $$

We aim here at mimicking this classical GAN architecture in a quantum version. Not surprisingly, we first build a quantum discriminator, followed by a quantum generator, and we finally develop the quantum equivalent of the zero-sum game, defining an objective loss function acting on quantum states.

3.1 Quantum discriminator

In the case of a fully connected quantum GAN (which we study here), where both the discriminator and the generator are quantum circuits, one of the main differences between a classical GAN and a QuGAN lies in the input of the discriminator. Indeed, as said above, in a classical discriminator the input is a sample \(\boldsymbol {\mathrm {x}}_{\boldsymbol {\theta }_{G}}\) generated by the generator G, whereas in a quantum discriminator the input is a wave function

$$ |{v_{\boldsymbol{\theta}_{G}}}\rangle={\sum}_{j=0}^{2^{n}-1}v_{j,\boldsymbol{\theta}_{G}}|{j}\rangle $$
(3.1)

generated by a quantum generator. In such a setting, the goal is to create a wave function of the form 3.1 which is a physical way of encoding a given discrete distribution, namely

$$ \mathbb{P}\left( |{v_{\boldsymbol{\theta}_{G}}}\rangle=|{j}\rangle\right) = |v_{j,\boldsymbol{\theta}_{G}}|^{2}=p_{j}, \qquad\text{for each } j=0,\ldots, 2^{n}-1, $$
(3.2)

where \((p_{j})_{j=0,\ldots , 2^{n}-1} \in [0,1]^{2^{n}}\) with \({\sum }_{j=0}^{2^{n}-1}p_{j}=1\). We choose here a simple architecture for the discriminator, as a quantum version of a perceptron with a sigmoid activation function (Fig. 4).
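Numerically, encoding a discrete distribution as in 3.1-3.2 simply amounts to taking square-root amplitudes; a small sketch (with an arbitrary toy distribution and real non-negative amplitudes, one valid choice among many since phases are free):

```python
import numpy as np

# A toy 3-qubit distribution (p_0, ..., p_7) and its amplitude encoding, cf. 3.1-3.2.
p = np.array([0.10, 0.05, 0.20, 0.15, 0.05, 0.25, 0.10, 0.10])
assert np.isclose(p.sum(), 1.0)

v = np.sqrt(p)                     # one valid choice of amplitudes: |v_j|^2 = p_j
# sampling the encoded distribution = measuring the state in the computational basis
rng = np.random.default_rng(0)
samples = rng.choice(len(p), size=10_000, p=np.abs(v) ** 2)
print(np.bincount(samples, minlength=len(p)) / 10_000)   # empirical frequencies, close to p
```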

Fig. 4
figure 4

Classical perceptron mapping \(\boldsymbol {\mathrm {x}}\in \mathbb {R}^{n}\) to \(\sigma \left (\boldsymbol {\mathrm {x}}^{\top }\boldsymbol {\mathrm {w}}\right ) \in \mathbb {R}\)

This approach to building the circuit is new: the existing papers that use quantum discriminators rely on so-called ansatz circuits (Braccia et al. 2021), in other words generic circuits built with layers of rotation gates and controlled rotation gates (see 3.6 and 3.7 below for the definition of these gates). Such ansatz circuits are therefore parameterised circuits, as put forward in Chakrabarti et al. (2019), where in general the circuit’s architecture cannot be interpreted as a classifying neural network. As pointed out in Braccia et al. (2021), the architectures of both the generator and the discriminator are then the same, which on the one hand removes the need to monitor a possible imbalance in expressivity between the generator and the discriminator; on the other hand, it prevents us from giving a straightforward interpretation of the given architectures.

The main task here is then to translate these classical computations to a quantum input for the discriminator. This challenge has been taken up in both Sections 2.3 and 2.4 where we have built from scratch a quantum perceptron which performs exactly like a classical perceptron. There is however one main difference in terms of interpretation: let the wave function 3.1 be the input for the discriminator with N = 2n and, for \(j = \overline {j_{1}{\cdots } j_{n}}\) (defined in A.4), define ϕj := (j1,…,jn). Denote \(\mathfrak {D}(\boldsymbol {\mathrm {w}}) \in {\mathscr{M}}_{2^{n+m_{1}+m_{2}}}(\mathbb {C})\) the transformation performed by the entire quantum circuit depicted in Fig. 5, where \(\mathfrak {D}(\boldsymbol {\mathrm {w}})\) is unitary and \(\boldsymbol {\mathrm {w}}\in \mathbb {R}^{n}\), namely for m1 + m2 ancilla qubits,

$$ \mathfrak{D}(\boldsymbol{\mathrm{w}})|{0}\rangle^{\otimes{m_{1}+m_{2}}}|{j}\rangle = |{\sigma\left( \phi_{j}^{\top} \boldsymbol{\mathrm{w}}\right)}\rangle|{\phi_{j}^{\top}\boldsymbol{\mathrm{w}}}\rangle|{j}\rangle, $$

where \(|{\sigma \left (\phi _{j}^{\top } \boldsymbol {\mathrm {w}}\right )}\rangle \in \mathbb {C}^{2^{m_{1}}}\) and \(|{\phi _{j}^{\top } \boldsymbol {\mathrm {w}}}\rangle \in \mathbb {C}^{2^{m_{2}}}\) and where we only measure \(|{\sigma \left (\phi _{j}^{\top } \boldsymbol {\mathrm {w}}\right )}\rangle \). Thus, for the input 3.1, the discriminator outputs the wave function (with m1 + m2 ancilla qubits)

Fig. 5
figure 5

Quantum perceptron with \(\boldsymbol {\mathrm {w}} \in \mathbb {R}^{4}\) and one ancilla qubit for the inner product (m2 = 1) and one ancilla qubit for the activation (m1 = 1). Here we only measure the result produced by the activation function

$$ \mathfrak{D}(\boldsymbol{\mathrm{w}})|{0}\rangle^{\otimes{m_{1}+m_{2}}}|{v_{\boldsymbol{\theta}_{G}}}\rangle = {\sum}_{j=0}^{2^{n}-1}v_{j,\boldsymbol{\theta}_{G}}|{\sigma\left( \phi_{j}^{\top} \boldsymbol{\mathrm{w}}\right)}\rangle|{\phi_{j}^{\top} \boldsymbol{\mathrm{w}}}\rangle|{j}\rangle. $$
(3.3)

Therefore, in a QuGAN setting, the goal of the discriminator is to distinguish the target wave function |ψtarget〉 from the generated one \(|{v_{\boldsymbol {\theta }_{G}}}\rangle \). In Zoufal et al. (2019), for a distribution with 2^3 = 8 possible outcomes, the authors use a classical discriminator composed of a 512-node input layer, a 256-node hidden layer, and a single-node output layer; in contrast, our quantum discriminator only has n = 3 input qubits. Therefore, while achieving comparable results, our approach avoids an over-parameterisation of the discriminator. While this over-parameterisation may be useful (for example to reduce the error of the estimation made by sampling from the generator, as in Zoufal et al. (2019)), it is not always desirable as the interpretability of the network may suffer (Molnar 2020). A precise characterisation of the optimal network (number of gates for example) is still an open question, as in classical machine learning, which we shall investigate in the future.

Example 3.1

As an example, consider m2 = 1 ancilla qubit for the inner product, m1 = 1 ancilla qubit for the activation, |ψtarget〉 = ψ0|0〉 + ψ1|1〉 and \(|{v_{\boldsymbol {\theta }_{G}}}\rangle =v_{0,\boldsymbol {\theta }_{G}}|{0}\rangle +v_{1,\boldsymbol {\theta }_{G}}|{1}\rangle \). As we only measure the outcome produced by the activation function, the only possible outcomes are |0〉 and |1〉. Therefore, measuring the output of the discriminator only consists of a projection on either |0〉 or |1〉. Define these projectors

$$ {\Pi}_{0} := |{0}\rangle\langle{0}|\otimes \mathrm{I_{d}}^{\otimes m_{2}+n} \in \mathcal{M}_{2^{m_{1}+n+m_{2}}}(\mathbb{C}) \qquad\text{and}\qquad {\Pi}_{1} := |{1}\rangle\langle{1}|\otimes \mathrm{I_{d}}^{\otimes m_{2}+n} \in \mathcal{M}_{2^{m_{1}+n+m_{2}}}(\mathbb{C}), $$

where m2 = 1 and n = 1 since in our toy example the wave functions encoding the distributions are 1-qubit distributions. Interpreting measuring |0〉 as labelling the input distribution Fake and measuring |1〉 as labelling it Real, the optimal discriminator with parameter w would perform as

$$ \begin{array}{@{}rcl@{}} \mathbb{P}\left( \mathfrak{D}(\boldsymbol{\mathrm{w}}^{*})|{0}\rangle^{\otimes{m_{1}+m_{2}}}|{v_{\boldsymbol{\theta}_{G}}}\rangle =|{0}\rangle\otimes{\sum}_{j=0}^{2^{n}-1}v_{j,\boldsymbol{\theta}_{G}}|{\phi_{j}^{\top} \boldsymbol{\mathrm{w}}^{*}}\rangle|{j}\rangle\right) & = & \left\|{\Pi}_{0}\mathfrak{D}(\boldsymbol{\mathrm{w}}^{*})|{0}\rangle^{\otimes{m_{1}+m_{2}}}|{v_{\boldsymbol{\theta}_{G}}}\rangle\right\|^{2} =1,\\ \mathbb{P}\left( \mathfrak{D}(\boldsymbol{\mathrm{w}}^{*})|{0}\rangle^{\otimes{m_{1}+m_{2}}}|{\psi_{\text{target}}}\rangle =|{1}\rangle\otimes{\sum}_{j=0}^{2^{n}-1}\psi_{j}|{\phi_{j}^{\top} \boldsymbol{\mathrm{w}}^{*}}\rangle|{j}\rangle\right) & = & \left\|{\Pi}_{1}\mathfrak{D}(\boldsymbol{\mathrm{w}}^{*})|{0}\rangle^{\otimes{m_{1}+m_{2}}}|{\psi_{\text{target}}}\rangle\right\|^{2} =1, \end{array} $$
(3.4)

where still in our toy example we have n = 1, m1 = 1 and m2 = 1. Here n could be any positive integer. We illustrate the circuit in Fig. 5.
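The two probabilities in 3.4 are plain squared norms of projected states. The following numpy sketch computes them for an arbitrary (randomly chosen, purely illustrative) three-qubit state in the setting m1 = m2 = n = 1; it is not the actual discriminator output.

```python
import numpy as np

rng = np.random.default_rng(0)
psi = rng.normal(size=8) + 1j * rng.normal(size=8)   # arbitrary 3-qubit state (m1 = m2 = n = 1)
psi /= np.linalg.norm(psi)

I4 = np.eye(4)
Pi0 = np.kron(np.diag([1.0, 0.0]), I4)   # |0><0| on the activation qubit, identity elsewhere
Pi1 = np.kron(np.diag([0.0, 1.0]), I4)   # |1><1| on the activation qubit, identity elsewhere

p_fake = np.linalg.norm(Pi0 @ psi) ** 2  # probability of labelling the input Fake
p_real = np.linalg.norm(Pi1 @ psi) ** 2  # probability of labelling the input Real
assert np.isclose(p_fake + p_real, 1.0)
print(p_fake, p_real)
```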

3.1.1 Bloch sphere representation

The Bloch sphere (Nielsen and Chuang 2000) is important in Quantum Computing, providing a geometrical representation of pure states. In our case, it yields a geometric visualisation of the way an optimal quantum discriminator works as it separates the two complementary regions

$$ \begin{array}{@{}rcl@{}} \mathcal{R}_{F} &:=& \left\{{\sum}_{i=0}^{2^{m-1} -1}\alpha_{i}|{i}\rangle \text{ such that } {\sum}_{i=0}^{2^{m-1} -1}|\alpha_{i}|^{2}=1\right\},\\ \mathcal{R}_{T} &:=& \left\{{\sum}_{i=2^{m-1}}^{2^{m} -1}\alpha_{i}|{i}\rangle \text{ such that } {\sum}_{i=2^{m-1}}^{2^{m} -1}|\alpha_{i}|^{2}=1\right\}, \end{array} $$
(3.5)

where m := m1 + m2 + n is the total number of qubits for the inputs of the discriminator. The optimal discriminator \(\mathfrak {D}(\boldsymbol {\mathrm {w}}^{*})\) would perform as

$$ \mathfrak{D}(\boldsymbol{\mathrm{w}}^{*})|{\textit{Fake}}\rangle \in \mathcal{R}_{F} \quad\text{and}\quad \mathfrak{D}(\boldsymbol{\mathrm{w}}^{*})|{\textit{Real}}\rangle \in \mathcal{R}_{T}, \quad\text{almost surely}, $$

where \(|{\textit {Fake}}\rangle :=|{0}\rangle |{0}\rangle |{v_{\boldsymbol {\theta }_{G}}}\rangle \) and |Real〉 := |0〉|0〉|ψtarget〉. Now, the challenge lies in finding such an optimal discriminator; note, however, that the nature of the state |Fake〉 plays a major role in finding it. Therefore, in the following section we focus on the generator, responsible for generating |Fake〉.

Example 3.2

Consider Example 3.1 with \((\psi _{0}, \psi _{1}) = (\frac {1}{\sqrt {2}}, \frac {1}{\sqrt {2}})\) and \((v_{0,\boldsymbol {\theta }_{G}}, v_{1,\boldsymbol {\theta }_{G}}) = (\frac {\sqrt {3}}{2}, \frac {1}{2})\). The states |ψtarget〉 and \(|{v_{\boldsymbol {\theta }_{G}}}\rangle \) are shown in Fig. 6. The wave function produced by the discriminator is composed of three qubits (m1 = 1, m2 = 1 and n = 1 qubit for the input wave function 3.3); therefore, one optimal transformation for the discriminator having |ψtarget〉 as an input is one such that the first qubit never collapses onto the state |0〉 (Fig. 7).

Fig. 6
figure 6

Bloch spheres representations for |ψtarget〉 (left) and \(|{v_{\boldsymbol {\theta }_{G}}}\rangle \) (right) where there is no phase shift between |0〉 and |1〉 and where the sizes of the lobes are proportional to the probability of measuring the associated states

Fig. 7
figure 7

Left: \(\mathfrak {D}(w^{*}_{1})|{0}\rangle |{0}\rangle |{\psi _{\text {target}}}\rangle \). Total system post-one optimal discriminator transformation. The first qubit never collapses onto |0〉 and therefore such a discriminator is optimal at labelling |ψtarget〉 as Real. Right: \(\mathfrak {D}(w^{*}_{2})|{0}\rangle |{0}\rangle |{v_{\boldsymbol {\theta }_{G}}}\rangle \). Total system post-one optimal discriminator transformation. The first qubit never collapses onto |1〉 and therefore such a discriminator is optimal at labelling \(|{v_{\boldsymbol {\theta }_{G}}}\rangle \) Fake

3.2 Quantum generator

The quantum generator is a quantum circuit producing a wave function that encodes a discrete distribution. Such a circuit takes as an input the ground state \(|{0}\rangle^{\otimes (n-m_{1}-m_{2})}\) and outputs a wave function \(|{v_{\boldsymbol {\theta }_{G}}}\rangle \) parameterised by 𝜃G, the set of parameters of the generator. We recall here a few quantum gates that will be key to constructing a quantum generator. Recall that a quantum gate can be viewed as a unitary matrix; of particular interest are gates acting on two (or more) qubits, as they allow quantum entanglement, thus fully leveraging the power of quantum computing. The NOT gate X acts on one qubit and is represented as

$$ \mathrm{X} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, $$

so that X|0〉 = |1〉 and X|1〉 = |0〉. The RY is a one-qubit gate represented by the matrix

$$ \mathrm{R}_{\mathrm{Y}}(\theta) := \begin{pmatrix} \cos\left( \frac{\theta}{2}\right) & -\sin\left( \frac{\theta}{2}\right)\\ \sin\left( \frac{\theta}{2}\right) & \cos\left( \frac{\theta}{2}\right) \end{pmatrix}, $$
(3.6)

thus performing as

$$ \mathrm{R}_{\mathrm{Y}}(\theta)|{0}\rangle = \cos\left( \frac{\theta}{2}\right)|{0}\rangle + \sin\left( \frac{\theta}{2}\right)|{1}\rangle \qquad\text{and}\qquad \mathrm{R}_{\mathrm{Y}}(\theta)|{1}\rangle = \cos\left( \frac{\theta}{2}\right)|{1}\rangle - \sin\left( \frac{\theta}{2}\right)|{0}\rangle. $$

The cRY gate is the controlled version of the RY gate, acting on two qubits, one control qubit and one transformed qubit, and producing quantum entanglement. The RY transformation is applied to the second qubit only when the control qubit is in state |1〉; otherwise the second qubit is left unaltered. Its matrix representation is

$$ {~}_{\mathrm{c}}\mathrm{R}_{\mathrm{Y}}(\theta)=\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & \cos\left( \frac{\theta}{2}\right) & -\sin\left( \frac{\theta}{2}\right)\\ 0 & 0 & \sin\left( \frac{\theta}{2}\right) & \cos\left( \frac{\theta}{2}\right) \end{pmatrix}. $$
(3.7)

Given n qubits, let X := (X1,…,Xn) be a random vector taking values in \(\mathcal {X}_{n} := \{0, 1\}^{n}\). Set

$$ p_{\boldsymbol{\mathrm{x}}} := \mathbb{P}[\boldsymbol{\mathrm{X}} = \boldsymbol{\mathrm{x}}], \quad\text{for } \boldsymbol{\mathrm{x}}\in \mathcal{X}_{n}. $$

When building the generator we are looking for a quantum circuit that implements the transformation

$$ |{0}\rangle^{\otimes n}\mapsto{\sum}_{\boldsymbol{\mathrm{x}}\in \{0,1\}^{n}}\sqrt{p_{\boldsymbol{\mathrm{x}}}}\mathrm{e}^{{\mathrm{i}\theta_{\boldsymbol{\mathrm{x}}}}}|{\boldsymbol{\mathrm{x}}}\rangle. $$

We could follow a classical algorithm. For 1 ≤ k ≤ n, let x:k := (x1,…,xk) and, given \(\boldsymbol {\mathrm {x}}\in \mathcal {X}_{n}\),

$$ q_{\boldsymbol{\mathrm{x}}_{:k}} := \left\{ \begin{array}{ll} \mathbb{P}[X_{1} = 0], & \text{if }k=1,\\ \mathbb{P}[X_{k} = 0|\boldsymbol{\mathrm{X}}_{:k-1} = \boldsymbol{\mathrm{x}}_{:k-1}], & \text{if }2\leq k\leq n. \end{array} \right. $$
(3.8)

We then proceed by induction: start with a random draw of X1 as a Bernoulli sample with failure probability \(q_{\boldsymbol {\mathrm {x}}_{1}}\). Assuming that X:k− 1 has been sampled as x:k− 1 for some 2 ≤ k ≤ n, sample Xk from a Bernoulli distribution with failure probability \(q_{\boldsymbol {\mathrm {x}}_{:k-1}}\). The quantum circuit will equivalently consist of n stages, where at each stage 1 ≤ k ≤ n we only work with the first k qubits, and at the end of each stage the first k qubits carry the correct distribution, in the sense that, upon measuring, their distribution coincides with that of X:k.

The first step is simple: a single Y-rotation of the first qubit with angle 𝜃 ∈ [0,π] satisfying \(\cos \limits (\frac {\theta }{2}) = \sqrt {q_{\boldsymbol {\mathrm {x}}_{1}}}\). In other words, with U1 := RY(𝜃), we map |0〉 to \(\mathrm {U}_{1}|{0}\rangle = \sqrt {q_{\boldsymbol {\mathrm {x}}_{1}}}|{0}\rangle + \sqrt {1-q_{\boldsymbol {\mathrm {x}}_{1}}}|{1}\rangle .\) Clearly, when measuring the first qubit, we obtain the correct law. Now, inductively, for 2 ≤ k ≤ n, suppose the first k − 1 qubits fixed, namely in the state

$$ {\sum}_{\boldsymbol{\mathrm{x}}_{:k-1}\in\mathcal{X}_{k-1}}\sqrt{p_{\boldsymbol{\mathrm{x}}_{:k-1}}}|{\boldsymbol{\mathrm{x}}_{:k-1}}\rangle|{0}\rangle^{\otimes n-k+1}, $$

For each \(\boldsymbol {\mathrm {x}}_{:k-1}\in \mathcal {X}_{k-1}\), let \(\theta _{\boldsymbol {\mathrm {x}}_{:k-1}}\in [0,\pi ]\) satisfy \(\cos \limits \left (\frac {1}{2} \theta _{\boldsymbol {\mathrm {x}}_{:k-1}}\right )=\sqrt {q_{\boldsymbol {\mathrm {x}}_{:k-1}}}\) and consider the gate \(\mathrm {C}_{\boldsymbol {\mathrm {x}}_{:k-1}}\) acting on the first k qubits, which is a \(\mathrm{R}_{\mathrm{Y}}(\theta_{\boldsymbol{\mathrm{x}}_{:k-1}})\) on the last qubit k, controlled on whether the first k − 1 qubits are equal to x:k− 1. We then have

$$ C_{\boldsymbol{\mathrm{x}}_{:k-1}}|{\boldsymbol{\mathrm{y}}}\rangle|{0}\rangle = \left\{\begin{array}{ll} \sqrt{q_{\boldsymbol{\mathrm{x}}_{:k-1}}}|{\boldsymbol{\mathrm{x}}_{:k-1}}\rangle|{0}\rangle + \sqrt{1-q_{\boldsymbol{\mathrm{x}}_{:k-1}}}|{\boldsymbol{\mathrm{x}}_{:k-1}}\rangle|{1}\rangle, & \quad \text{if } \boldsymbol{\mathrm{y}} = \boldsymbol{\mathrm{x}}_{:k-1},\\ |{\boldsymbol{\mathrm{y}}}\rangle|{0}\rangle, \quad\text{for }\boldsymbol{\mathrm{y}} \ne \boldsymbol{\mathrm{x}}_{:k-1}. \end{array} \right. $$
(3.9)

Therefore, defining \(\mathrm {U}_{k} := {\prod }_{\boldsymbol {\mathrm {x}}_{:k-1}\in \mathcal {X}_{k-1}}\mathrm {C}_{\boldsymbol {\mathrm {x}}_{:k-1}}\), and noting that the order of multiplication does not affect the computations below, it follows that

$$ \begin{array}{@{}rcl@{}} \mathrm{U}_{k}{\sum}_{\boldsymbol{\mathrm{x}}_{:k-1}\in\mathcal{X}_{k-1}}\sqrt{p_{\boldsymbol{\mathrm{x}}_{:k-1}}}|{\boldsymbol{\mathrm{x}}_{:k-1}}\rangle|{0}\rangle^{\otimes n-k+1} & =& {\sum}_{\boldsymbol{\mathrm{x}}_{:k-1}\in\mathcal{X}_{k-1}}\left\{\sqrt{p_{\boldsymbol{\mathrm{x}}_{:k-1}}q_{\boldsymbol{\mathrm{x}}_{:k-1}}}|{\boldsymbol{\mathrm{x}}_{:k-1}}\rangle|{0}\rangle +\sqrt{p_{\boldsymbol{\mathrm{x}}_{:k-1}}\left( 1-q_{\boldsymbol{\mathrm{x}}_{:k-1}}\right)}|{\boldsymbol{\mathrm{x}}_{:k-1}}\rangle|{1}\rangle\right\}|{0}\rangle^{\otimes n-k}\\ & =& {\sum}_{\boldsymbol{\mathrm{x}}_{:k}\in\mathcal{X}_{k}}\sqrt{p_{\boldsymbol{\mathrm{x}}_{:k}}}|{\boldsymbol{\mathrm{x}}_{:k}}\rangle|{0}\rangle^{\otimes n-k}, \end{array} $$

where the last equality follows from properties of conditional expectations since

$$ p_{\boldsymbol{\mathrm{x}}_{:k-1}} q_{\boldsymbol{\mathrm{x}}_{:k-1}} = p_{{\boldsymbol{\mathrm{x}}_{:k-1}}.0} \qquad\text{and}\qquad p_{\boldsymbol{\mathrm{x}}_{:k-1}}\left( 1-q_{\boldsymbol{\mathrm{x}}_{:k-1}}\right)=p_{{\boldsymbol{\mathrm{x}}_{:k-1}}.1}, $$

for \({\boldsymbol {\mathrm {x}}_{:k-1}}\in \mathcal {X}_{k-1}\), \({\boldsymbol {\mathrm {x}}_{:k-1}}.0 \in \mathcal {X}_{k}\) and \({\boldsymbol {\mathrm {x}}_{:k-1}}.1 \in \mathcal {X}_{k}\) (see after A.4 for the binary representation of decimals). This concludes the inductive step. The generator has therefore been built according to a ‘classical’ algorithm, however only up to \(\mathcal {X}_{2}\) (see Fig. 8 for the architecture for qubits q3 and q2), to avoid a network that is too deep and therefore untrainable in a differentiable manner because of the barren plateau phenomenon (McClean et al. 2018). Indeed, in order to build Uk from simple controlled gates (with only one control qubit), the number of gates is of order \(\mathcal {O}(2^{k-1})\), making the generator deeper. Thus the number of gates we would have to use would be of order \(\mathcal {O}(2^{n})\), making the generator very expressive yet very hard to train.
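As a sanity check of this construction, the rotation angles can be pre-computed classically for a known target distribution (in the QuGAN they are instead learnt by training); a Python sketch for n = 2, with a toy distribution of our choosing:

```python
import numpy as np

# Toy 2-qubit target distribution p_{x1 x2}, ordered as (00, 01, 10, 11).
p = np.array([0.4, 0.1, 0.2, 0.3])

# Stage 1: q = P[X1 = 0] and the Y-rotation angle with cos(theta/2) = sqrt(q).
q0 = p[0] + p[1]
theta_1 = 2 * np.arccos(np.sqrt(q0))

# Stage 2: q_{x1} = P[X2 = 0 | X1 = x1] and the controlled-RY angles, cf. 3.8-3.9.
q_given_0 = p[0] / (p[0] + p[1])
q_given_1 = p[2] / (p[2] + p[3])
theta_2_0 = 2 * np.arccos(np.sqrt(q_given_0))   # RY controlled on X1 = 0
theta_2_1 = 2 * np.arccos(np.sqrt(q_given_1))   # RY controlled on X1 = 1

# Check: the state built this way reproduces the target probabilities.
amps = np.array([
    np.sqrt(q0) * np.sqrt(q_given_0),
    np.sqrt(q0) * np.sqrt(1 - q_given_0),
    np.sqrt(1 - q0) * np.sqrt(q_given_1),
    np.sqrt(1 - q0) * np.sqrt(1 - q_given_1),
])
assert np.allclose(amps ** 2, p)
print(theta_1, theta_2_0, theta_2_1)
```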

Fig. 8
figure 8

Entangled generator composed of RY, cRY and X gates, with parameters values for {𝜃1,…,𝜃9} indicated alongside the gates

Example 3.3

With n = 4, the architecture for our generator is depicted in Fig. 8 and the full QuGAN (generator and discriminator) algorithm in Fig. 9.

Fig. 9
figure 9

The entire associated entangled QuGAN

3.3 Quantum adversarial game

In GANs, the goal of the discriminator (D) is to discriminate real (R) data from the fake data generated by the generator (G), while the goal of the latter is to fool the discriminator by generating fake data. Here both real and generated data are modelled as quantum states, respectively described by their wave functions |ψtarget〉 and \(|{v_{\boldsymbol {\theta }_{G}}}\rangle \). Define the objective function

$$ \begin{array}{@{}rcl@{}} &&\mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}) := \mathbb{P}\Big(\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{\psi_{\text{target}}}\rangle\in \mathcal{R}_{T}\Big) \\&&\quad- \mathbb{P}\Big(\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{v_{\boldsymbol{\theta}_{G}}}\rangle\in \mathcal{R}_{T}\Big), \end{array} $$

where the regions \(\mathcal {R}_{F}\) and \(\mathcal {R}_{T}\) are defined in 3.5. Here \(\mathbb {P}(\mathfrak {D}(\boldsymbol {\mathrm {w}}_{D})|{0}\rangle |{0}\rangle |{\psi _{\text {target}}}\rangle \in \mathcal {R}_{T})\) is the probability of labelling the real data |0〉|0〉|ψtarget〉 as real via the discriminator, and \(\mathbb {P}(\mathfrak {D}(\boldsymbol {\mathrm {w}}_{D})|{0}\rangle |{0}\rangle |{v_{\boldsymbol {\theta }_{G}}}\rangle \in \mathcal {R}_{T})\) is the probability of having the generator fool the discriminator. As stated in 3.4, for two ancilla qubits (m1 + m2 = 2, i.e. one qubit for the inner product and one qubit for the activation) we have

$$ \mathbb{P}\Big(\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{\psi_{\text{target}}}\rangle\in \mathcal{R}_{T}\Big) = \left\|{\Pi}_{1}\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{\psi_{\text{target}}}\rangle\right\|^{2}. $$

By defining the projection of the output of the discriminator onto \(\mathcal {R}_{T}\),

$$ |{\psi_{\text{out},\text{target},\boldsymbol{\mathrm{w}}_{D}}}\rangle := {\Pi}_{1}\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{\psi_{\text{target}}}\rangle, $$

we can also write

$$ \mathbb{P}\Big(\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{\psi_{\text{target}}}\rangle\in \mathcal{R}_{T}\Big) = \text{Tr}(\rho_{\text{out},\text{target},\boldsymbol{\mathrm{w}}_{D}}), $$

where \(\rho _{\text {out},\text {target},\boldsymbol {\mathrm {w}}_{D}}:=|{\psi _{\text {out},\text {target},\boldsymbol {\mathrm {w}}_{D}}}\rangle \langle {\psi _{\text {out},\text {target},\boldsymbol {\mathrm {w}}_{D}}}|\) is the density operator associated to \(\psi _{\text {out},\text {target},\boldsymbol {\mathrm {w}}_{D}}\). The same goes for the probability of fooling the discriminator, namely

$$ \begin{array}{@{}rcl@{}} &&\mathbb{P}\Big(\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{v_{\boldsymbol{\theta}_{G}}}\rangle\in \mathcal{R}_{T}\Big) = \left\|{\Pi}_{1}\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{v_{\boldsymbol{\theta}_{G}}}\rangle\right\|^{2} \\&&\quad=\text{Tr}(\rho_{\text{out},\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}}), \end{array} $$

where \(|{\psi _{\text {out},\boldsymbol {\theta }_{G},\boldsymbol {\mathrm {w}}_{D}}}\rangle :={\Pi }_{1}\mathfrak {D}(\boldsymbol {\mathrm {w}}_{D})|{0}\rangle |{0}\rangle |{v_{\boldsymbol {\theta }_{G}}}\rangle \) and \(\rho _{\text {out},\boldsymbol {\theta }_{G},\boldsymbol {\mathrm {w}}_{D}}:=|{\psi _{\text {out},\boldsymbol {\theta }_{G},\boldsymbol {\mathrm {w}}_{D}}}\rangle \langle {\psi _{\text {out},\boldsymbol {\theta }_{G},\boldsymbol {\mathrm {w}}_{D}}}|\). The min-max game played by the Generative Adversarial network is therefore defined as the optimisation problem

$$ \min_{\boldsymbol{\theta}_{G}}\max_{\boldsymbol{\mathrm{w}}_{D}} \mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}). $$
(3.10)

Moreover, since \(\mathcal {S}\) is differentiable and given the architecture of our circuits, according to the parameter-shift rule (Schuld et al. 2019), the partial derivatives of \(\mathcal {S}\) admit the closed-form representations

$$ \begin{array}{@{}rcl@{}} \nabla_{\boldsymbol{\theta}_{G}}\mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}) & =& \frac{1}{2}\left\{ \mathcal{S}\left( \boldsymbol{\theta}_{G}+\frac{\pi}{2},\boldsymbol{\mathrm{w}}_{D}\right) - \mathcal{S}\left( \boldsymbol{\theta}_{G}-\frac{\pi}{2},\boldsymbol{\mathrm{w}}_{D}\right)\right\},\\ \nabla_{\boldsymbol{\mathrm{w}}_{D}}\mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}) & = &\frac{1}{2}\left\{ \mathcal{S}\left( \boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}+\frac{\pi}{2}\right) - \mathcal{S}\left( \boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}-\frac{\pi}{2}\right)\right\}, \end{array} $$
(3.11)

so that training will be based on stochastic gradient ascent and descent. The reason for a stochastic algorithm lies in the nature of \(\mathcal {S}(\boldsymbol {\theta }_{G},\boldsymbol {\mathrm {w}}_{D})\), seen as the difference between two probabilities to estimate. A natural estimator for l measurements/observations is

$$ \widehat{\mathcal{S}}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D})_{l} := \frac{1}{l}{\sum}_{k=1}^{l} \left[\mathbb{1}_{\left\{\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{\psi_{\text{target}}^{k}}\rangle\in \mathcal{R}_{T}\right\}} - \mathbb{1}_{\left\{\mathfrak{D}(\boldsymbol{\mathrm{w}}_{D})|{0}\rangle|{0}\rangle|{v^{k}_{\boldsymbol{\theta}_{G}}}\rangle\in \mathcal{R}_{T}\right\}}\right], $$

where \(|{v_{\boldsymbol {\theta }_{G}}^{k}}\rangle \) is the k th wave function produced by the generator and \(|{\psi _{\text {target}}^{k}}\rangle \) is the k th copy for the target distribution.
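Operationally, each term of this estimator is just an empirical frequency of measurement outcomes. The sketch below (which samples the outcomes directly from assumed labelling probabilities instead of simulating the discriminator circuit) illustrates the estimator and its Monte Carlo noise.

```python
import numpy as np

def estimate_S(p_real_in_RT, p_fake_in_RT, l, rng):
    """Monte Carlo estimator of S: empirical frequency of the real state being labelled
    real, minus that of the generated state being labelled real, from l shots each.
    (Sketch: the labelling probabilities are assumed, not obtained from a circuit.)"""
    real_hits = rng.random(l) < p_real_in_RT   # indicator of landing in R_T
    fake_hits = rng.random(l) < p_fake_in_RT
    return real_hits.mean() - fake_hits.mean()

rng = np.random.default_rng(1)
print(estimate_S(0.9, 0.3, l=1_000, rng=rng))   # fluctuates around the true value 0.6
```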

Given the nature of the problem, two strategies arise: for fixed parameters 𝜃G, when training the discriminator, we first minimise the labelling error, i.e.

$$ \max_{\boldsymbol{\mathrm{w}}_{D}}\mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}), $$

which we achieve by stochastic gradient ascent with a learning rate ηD = 0.9. Moreover, we chose to initialise the weights following a Uniform distribution as \(\boldsymbol {\mathrm {w}}_{D} \sim \mathcal {U}([-1,1])\). Then, when training the generator the goal is to fool the discriminator, so that, for fixed wD, the target is

$$ \min_{\boldsymbol{\theta}_{G}}\mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}), $$

which is achieved by stochastic gradient descent with a learning rate ηG = 0.05. Similarly to the discriminator, we initialise the weights as \(\boldsymbol {\theta }_{G} \sim \mathcal {U}([0,2\pi ])\). Our experiments seem to indicate that other initialisation assumptions yield overall analogous results. This choice of learning rates may look arbitrary at first sight; unfortunately, there is as yet no rigorous approach to finding optimal learning rates, even in the classical machine learning / stochastic gradient literature. One could also use tools from annealing, i.e. start with large learning rates and slowly decrease them, to go from exploration to exploitation, but we leave this to future investigations.

Remark 3.4

In the classical GAN setting, this optimisation problem may fail to converge (Goodfellow 2014). Over the past few years, progress has been made to improve the convergence quality of the algorithm and to improve its stability, using different loss functions or adding regularising terms. We refer the interested reader to the corresponding papers (Arjovsky et al. 2017; Denton et al. 2015; Deshpande et al. 2018; Gulrajani et al. 2017; Miyato et al. 2018; Radford et al. 2016; Salimans et al. 2016), and leave it to future research to integrate these improvements into a quantum setting.

Proposition 3.5

The solution \((\boldsymbol {\theta }_{G}^{*}, \boldsymbol {\mathrm {w}}_{D}^{*})\) to the \(\min \limits -\max \limits \) problem 3.10 is such that the wave function \(|{v_{\boldsymbol {\theta }_{G}^{*}}}\rangle \) satisfies \(|\langle {\psi _{\text {target}}}|{v_{\boldsymbol {\theta }_{G}^{*}}}\rangle |^{2}=1\), namely, for each i ∈{0,…,2n − 1},

$$ \mathbb{P}(|{\psi_{\text{target}}}\rangle=|{i}\rangle)=\mathbb{P}(|{v_{\boldsymbol{\theta}_{G}^{*}}}\rangle=|{i}\rangle). $$

Proof

Define the density matrices ρtarget := |ψtarget〉 〈ψtarget| and \(\rho _{\boldsymbol {\theta }_{G}}:=|{v_{\boldsymbol {\theta }_{G}}}\rangle \langle {v_{\boldsymbol {\theta }_{G}}}|\) as well as the operator \(P_{\boldsymbol {\mathrm {w}}_{D}}^{R} := \mathfrak {D}(\boldsymbol {\mathrm {w}}_{D})^{\dagger }{\Pi }_{1}^{\dagger }{\Pi }_{1}\mathfrak {D}(\boldsymbol {\mathrm {w}}_{D})\). Then

$$ \mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D})= \text{Tr}\left( P_{\boldsymbol{\mathrm{w}}_{D}}^{R}\{\rho_{\text{target}}-\rho_{\boldsymbol{\theta}_{G}}\}\right). $$

Since Π1 + Π0 = Id and \(\mathfrak {D}(\boldsymbol {\mathrm {w}}_{D})\) is unitary, setting \(P_{\boldsymbol {\mathrm {w}}_{D}}^{F} := \mathfrak {D}(\boldsymbol {\mathrm {w}}_{D})^{\dagger }{\Pi }_{0}^{\dagger }{\Pi }_{0}\mathfrak {D}(\boldsymbol {\mathrm {w}}_{D})\), it is straightforward to rewrite \(\mathcal {S}(\boldsymbol {\theta }_{G},\boldsymbol {\mathrm {w}}_{D})\) as

$$ \mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}) = \text{Tr}\left( P_{\boldsymbol{\mathrm{w}}_{D}}^{R}\rho_{\text{target}}\right)+\text{Tr}\left( P_{\boldsymbol{\mathrm{w}}_{D}}^{F}\rho_{\boldsymbol{\theta}_{G}}\right) - 1, $$

since \(\text {Tr}(\rho _{\boldsymbol {\theta }_{G}})=1\) according to the Born Rule (Theorem A.1) and \(P_{\boldsymbol {\mathrm {w}}_{D}}^{R}+P_{\boldsymbol {\mathrm {w}}_{D}}^{F}=\mathrm {I_{d}}\). Again, we also have

$$ \begin{array}{@{}rcl@{}} &&\mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}) = -1 + \frac{1}{2}\text{Tr}\left( \left( P_{\boldsymbol{\mathrm{w}}_{D}}^{R}+P_{\boldsymbol{\mathrm{w}}_{D}}^{F}\right) \left( \rho_{\text{target}}+\rho_{\boldsymbol{\theta}_{G}}\right)\right) \\&&\quad+ \frac{1}{2}\text{Tr}\left( \left( P_{\boldsymbol{\mathrm{w}}_{D}}^{R}-P_{\boldsymbol{\mathrm{w}}_{D}}^{F}\right) \left( \rho_{\text{target}}-\rho_{\boldsymbol{\theta}_{G}}\right)\right), \end{array} $$

and finally

$$ \mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}) = \frac{1}{2}\text{Tr}\left( \left( P_{\boldsymbol{\mathrm{w}}_{D}}^{R}-P_{\boldsymbol{\mathrm{w}}_{D}}^{F}\right)\left( \rho_{\text{target}}-\rho_{\boldsymbol{\theta}_{G}}\right)\right). $$

Recall that for two Hermitian matrices A,B, the inequality Tr(AB) ≤∥ApBq holds for p,q ≥ 1 with \(\frac {1}{p}+\frac {1}{q}=1\), where ∥⋅∥p denotes the p-norm. Since \(P_{\boldsymbol {\mathrm {w}}_{D}}^{R}\) and \(P_{\boldsymbol {\mathrm {w}}_{D}}^{F}\) are Hermitian, we obtain (with \(p=\infty \) and q = 1)

$$ \mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D})\leq\frac{1}{2} \left\|P_{\boldsymbol{\mathrm{w}}_{D}}^{R}-P_{\boldsymbol{\mathrm{w}}_{D}}^{F}\right\|_{\infty} \left\|\rho_{\text{target}}-\rho_{\boldsymbol{\theta}_{G}}\right\|_{1}, $$

where \(\left \|P_{\boldsymbol {\mathrm {w}}_{D}}^{R}-P_{\boldsymbol {\mathrm {w}}_{D}}^{F}\right \|_{\infty }\leq 1\). Thus the optimal \(\boldsymbol {\mathrm {w}}_{D}^{*}\) satisfies

$$ \max_{\boldsymbol{\mathrm{w}}_{D}}\mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D})=\mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D}^{*})=\frac{1}{2}\left\|\rho_{\text{target}}-\rho_{\boldsymbol{\theta}_{G}}\right\|_{1}. $$

Again, since \(\|\rho _{\text {target}}-\rho _{\boldsymbol {\theta }_{G}}\|_{1}\geq 0\), the optimal \(\boldsymbol {\theta }_{G}^{*}\) yields

$$ \min_{\boldsymbol{\theta}_{G}}\max_{\boldsymbol{\mathrm{w}}_{D}}\mathcal{S}(\boldsymbol{\theta}_{G},\boldsymbol{\mathrm{w}}_{D})=\mathcal{S}(\boldsymbol{\theta}_{G}^{*},\boldsymbol{\mathrm{w}}_{D}^{*})=0, $$

which is equivalent to \(\|\rho _{\text {target}}-\rho _{\boldsymbol {\theta }_{G}^{*}}\|_{1}=0\), itself equivalent to \(\mathbb {P}(|{v_{\boldsymbol {\theta }_{G}^{*}}}\rangle =|{i}\rangle )=\mathbb {P}(|{\psi _{\text {target}}}\rangle =|{i}\rangle )=p_{i}\), for all \(i \in \{0,\ldots ,2^{n}-1\}\). □
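The key identity in the proof, namely that the optimal discriminator achieves half the trace distance, can be checked numerically on random pure states: the maximiser of \(\text{Tr}(P(\rho_{\text{target}}-\rho_{\boldsymbol{\theta}_{G}}))\) over operators \(0\preceq P\preceq \mathrm{I_{d}}\) is the projection onto the positive eigenspace of the difference. The following NumPy sketch is ours and purely illustrative (the helper random_pure_density is not part of the paper's code):

```python
import numpy as np

def random_pure_density(n_qubits, rng):
    """Return the density matrix |psi><psi| of a random pure state on n_qubits."""
    dim = 2 ** n_qubits
    psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    psi /= np.linalg.norm(psi)
    return np.outer(psi, psi.conj())

rng = np.random.default_rng(0)
rho_target = random_pure_density(3, rng)
rho_theta = random_pure_density(3, rng)

delta = rho_target - rho_theta                  # Hermitian with zero trace
eigvals, eigvecs = np.linalg.eigh(delta)

# Optimal "discriminator" measurement: projector onto the positive eigenspace of delta.
positive = eigvecs[:, eigvals > 0]
best_score = np.trace(positive @ positive.conj().T @ delta).real

half_trace_norm = 0.5 * np.abs(eigvals).sum()   # (1/2) ||rho_target - rho_theta||_1
print(np.isclose(best_score, half_trace_norm))  # True
```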

Remark 3.6

Our strategy to approximate a solution to the \(\min \limits -\max \limits \) problem is as follows: we train the discriminator by stochastic gradient ascent \(n_{D}\) times, then train the generator by stochastic gradient descent \(n_{G}\) times, and repeat this alternation over \(\mathfrak {e}\) epochs.
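In pseudocode, this alternating schedule reads as follows. This is only a sketch: update_discriminator and update_generator stand for one stochastic gradient ascent/descent step on the score, and are placeholders rather than the paper's circuit-based estimators.

```python
def train_qugan(theta_G, w_D, n_epochs, n_D, n_G,
                update_discriminator, update_generator):
    """Alternating optimisation of the min-max problem.

    update_discriminator: one stochastic gradient *ascent* step in w_D
    update_generator:     one stochastic gradient *descent* step in theta_G
    Both are placeholders for the quantum-circuit gradient estimators.
    """
    for epoch in range(n_epochs):
        for _ in range(n_D):                  # discriminator phase
            w_D = update_discriminator(theta_G, w_D)
        for _ in range(n_G):                  # generator phase
            theta_G = update_generator(theta_G, w_D)
    return theta_G, w_D
```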

4 Financial application: SVI goes quantum

We provide here a simple example of data generation in a financial context, with the aim of strengthening the interplay between quantitative finance and quantum computing.

4.1 Financial background and motivation

Some of the most standard and liquid traded financial derivatives are the so-called European Call and Put options. A Call (resp. Put) gives its holder the right, but not the obligation, to buy (resp. sell) an asset at a specified price (the strike price K) at a given future time (the maturity T). Mathematically, the setup is that of a filtered probability space \(({\Omega }, \mathcal {F},(\mathcal {F}_{t})_{t\geq 0}, \mathbb {P})\), where \((\mathcal {F}_{t})_{t\geq 0}\) represents the flow of information; on this space, an asset \(S = (S_{t})_{t\geq 0}\) is traded and assumed to be adapted (namely \(S_{t}\) is \(\mathcal {F}_{t}\)-measurable for each t ≥ 0). We further assume that there exists a probability measure \(\mathbb {Q}\), equivalent to \(\mathbb {P}\), under which S is a martingale. This martingale assumption is key, as the Fundamental Theorem of Asset Pricing (Delbaen and Schachermayer 1994) in particular implies that it is equivalent to Call and Put prices being respectively equal, at inception of the contract, to

$$ \mathrm{C}(K,T) = \mathbb{E}[\max(S_{T}-K, 0)|\mathcal{F}_{0}] \qquad\text{and}\qquad \mathrm{P}(K,T) = \mathbb{E}[\max(K-S_{T}, 0)|\mathcal{F}_{0}], $$

where the expectation \(\mathbb {E}\) is taken under the risk-neutral probability \(\mathbb {Q}\). Under sufficient smoothness of the law of \(S_{T}\), differentiating the Call price twice with respect to the strike yields that the probability density function of the log stock price \(\log (S_{T})\) is given by

$$ p_{T}(k) = \left( \frac{\partial^{2}\mathrm{C}(K,T)}{\partial K^{2}}\right)_{K=S_{0}\mathrm{e}^{k}}, $$
(4.1)

implying that the real distribution of the (log) stock price can in principle be recovered from options data. However, prices are not quoted smoothly in (K,T), and interpolation and extrapolation are needed. Doing so at the level of prices turns out to be rather cumbersome, and market practice usually works instead at the level of the so-called implied volatility. The fundamental model of a continuous-time financial martingale is the Black-Scholes model (Black and Scholes 1973), under which

$$ \frac{\mathrm{d} S_{t}}{S_{t}} = \sigma \mathrm{d} W_{t}, \qquad S_{0}>0, $$

where σ > 0 is the (constant) instantaneous volatility and W a standard Brownian motion adapted to the filtration \((\mathcal {F}_{t})_{t\geq 0}\). In this model, Call prices admit the closed-form formula

$$ \mathrm{C}_{\text{BS}}(K,T,\sigma) :=\mathbb{E}[\max(S_{T}-K, 0)|\mathcal{F}_{0}] = S_{0} \text{BS}\left( \log\left( \frac{K}{S_{0}}\right), \sigma^{2} T\right), $$

where

$$ \text{BS}(k,v) := \left\{ \begin{array}{ll} \mathcal{N}(d_{+}(k,v)) - \mathrm{e}^{k}\mathcal{N}(d_{-}(k,v)), & \text{if } v>0, \\ (1-\mathrm{e}^{k})_{+}, & \text{if } v=0, \end{array} \right. $$

with \(d_{\pm }(k,v):=-\frac {k}{\sqrt {v}} \pm \frac {\sqrt {v}}{2}\), where \(\mathcal {N}\) denotes the cumulative distribution function of the Gaussian distribution. With a slight abuse of notation, we shall from now on write \(\mathrm {C}_{\text {BS}}(K,T,\sigma ) = \mathrm {C}_{\text {BS}}(k,T,\sigma )\), where \(k:= \log (\frac {K}{S_{0}})\) denotes the log-moneyness.
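For reference, the normalised Black-Scholes functional above translates directly into code. The following sketch assumes only NumPy and SciPy; the function names bs and call_bs are our own.

```python
import numpy as np
from scipy.stats import norm

def bs(k, v):
    """Normalised Black-Scholes price BS(k, v), with k = log(K/S0) and v = sigma^2 * T."""
    if v <= 0.0:
        return max(1.0 - np.exp(k), 0.0)
    sqrt_v = np.sqrt(v)
    d_plus = -k / sqrt_v + sqrt_v / 2.0
    d_minus = -k / sqrt_v - sqrt_v / 2.0
    return norm.cdf(d_plus) - np.exp(k) * norm.cdf(d_minus)

def call_bs(k, T, sigma, S0=1.0):
    """Black-Scholes Call price C_BS(k, T, sigma) = S0 * BS(k, sigma^2 * T)."""
    return S0 * bs(k, sigma ** 2 * T)
```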

Definition 4.1

Given a strike K ≥ 0, a maturity T ≥ 0 and a Call price C(K,T) (either quoted on the market or computed from a model), the implied volatility \(\sigma _{\text {imp}}(k,T)\) is defined as the unique non-negative solution to the equation

$$ \mathrm{C}_{\text{BS}}(k, T, \sigma_{\text{imp}}(k,T))=\mathrm{C}(K,T). $$
(4.2)

Note that this equation may not always admit a solution. However, under no-arbitrage assumptions (equivalently, under bound constraints for C(K,T)), it does. We refer the interested reader to the volatility bible (Gatheral 2006) for a full account of these subtle details. It turns out that the implied volatility is a much nicer object to work with (both practically and academically); plugging this definition into (4.1) shows that the map \(k \mapsto \sigma _{\text {imp}}(k,T)\) fully characterises the distribution of \(\log (S_{T})\) as

$$ p_{T}(k) = \left( \frac{\partial^{2} \mathrm{C}_{\text{BS}}(k, T, \sigma_{\text{imp}}(k,T))}{\partial K^{2}}\right)_{K=S_{0}\mathrm{e}^{k}}. $$
(4.3)
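In practice, (4.2) is solved numerically, for instance by root-finding on the map \(\sigma \mapsto \mathrm{C}_{\text{BS}}(k,T,\sigma)\), which is strictly increasing in σ. A minimal sketch using scipy.optimize.brentq and the call_bs helper from the previous snippet; the bracket [1e-6, 5.0] is an arbitrary illustrative choice.

```python
from scipy.optimize import brentq

def implied_volatility(call_price, k, T, S0=1.0, lo=1e-6, hi=5.0):
    """Solve C_BS(k, T, sigma) = call_price for sigma on the bracket [lo, hi]."""
    return brentq(lambda sigma: call_bs(k, T, sigma, S0) - call_price, lo, hi)
```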

While a smooth input \(\sigma _{\text {imp}}(\cdot ,T)\) is still needed, interpolating the implied volatility is easier than interpolating option prices directly. A market standard is the Stochastic Volatility Inspired (SVI) parameterisation proposed by Gatheral (2004) (and improved in Gatheral and Jacquier (2013) and Guo et al. (2016)), where the total implied variance \(w_{\text {SVI}}(k,T):=\sigma _{\text {imp}}^{2}(k,T)T\) is assumed to satisfy

$$ w_{\text{SVI}}(k,T) = a+b\left( \rho(k-m) + \sqrt{(k-m)^{2}+\xi^{2}}\right), \quad\text{for any }k \in \mathbb{R}, $$
(4.4)

with the parameters ρ ∈ [− 1,1], a,b,ξ ≥ 0 and \(m \in \mathbb {R}\). The probability density function (4.1) of the log stock price then admits the closed-form expression (Gatheral 2004)

$$ p_{T}(k) = \frac{g_{\text{SVI}}(k,T)}{\sqrt{2\pi w_{\text{SVI}}(k, T)}}\exp\left\{-\frac{d_{-}(k,w_{\text{SVI}}(k,T))^{2}}{2}\right\}, $$
(4.5)

where

$$ \begin{array}{@{}rcl@{}} &&g_{\text{SVI}}(k,T) := \left( 1-\frac{k w^{\prime}_{\text{SVI}}(k,T)}{2 w_{\text{SVI}}(k,T)}\right)^{2} \\&&\quad- \frac{w^{\prime}_{\text{SVI}}(k,T)^{2}}{4}\left( \frac{1}{4}+\frac{1}{w_{\text{SVI}}(k,T)}\right) + \frac{w^{\prime\prime}_{\text{SVI}}(k,T)}{2}, \end{array} $$

where all the derivatives are taken with respect to k. In Fig. 10, we plot the typical shape of the implied volatility smile, together with the corresponding density for the following parameters:

$$ a =0.030358 ,\qquad b = 0.0503815,\qquad \rho = -0.1 ,\qquad m =0.3 ,\qquad \xi = 0.048922 ,\qquad T = 1. $$
(4.6)
Fig. 10  Density of \(\log (S_{T})\) computed from (4.5) and the corresponding SVI total variance (4.4). The parameters are given in (4.6)
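The density (4.5) is straightforward to evaluate; the sketch below reproduces the quantities plotted in Fig. 10 for the parameters (4.6). The function and variable names are ours, and the derivatives of the SVI total variance are written out explicitly.

```python
import numpy as np

a, b, rho, m, xi, T = 0.030358, 0.0503815, -0.1, 0.3, 0.048922, 1.0

def w_svi(k):
    """SVI total implied variance (4.4)."""
    return a + b * (rho * (k - m) + np.sqrt((k - m) ** 2 + xi ** 2))

def w_svi_prime(k):
    """First derivative of w_SVI with respect to k."""
    return b * (rho + (k - m) / np.sqrt((k - m) ** 2 + xi ** 2))

def w_svi_second(k):
    """Second derivative of w_SVI with respect to k."""
    return b * xi ** 2 / ((k - m) ** 2 + xi ** 2) ** 1.5

def svi_density(k):
    """Density p_T(k) of log(S_T) from (4.5)."""
    w, wp, wpp = w_svi(k), w_svi_prime(k), w_svi_second(k)
    g = (1 - k * wp / (2 * w)) ** 2 - (wp ** 2 / 4) * (0.25 + 1 / w) + wpp / 2
    d_minus = -k / np.sqrt(w) - np.sqrt(w) / 2
    return g / np.sqrt(2 * np.pi * w) * np.exp(-d_minus ** 2 / 2)

k_grid = np.linspace(-1.0, 1.0, 401)
density = svi_density(k_grid)            # values plotted in Fig. 10
```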

4.2 Numerics

The goal of this numerical section is to generate discrete versions of the SVI probability distribution (4.5). Our target distribution is the one plotted in Fig. 10, corresponding to the parameters (4.6). Since the Quantum GAN algorithm (like its classical counterpart) starts from a discrete distribution, we first need to discretise the SVI one. For convenience, we normalise the distribution on the closed interval [− 1,1] and discretise it over the uniform grid

$$ \left\{\left\lfloor(2^{n}-1)\left( \frac{k+1}{2}\right)\right\rfloor\right\}_{k=0,\ldots, 2^{n}-1}, $$

which we then convert into binary form. This uniform discretisation does not take into account the SVI probability masses at each point, and a clear refinement would be to use a one-dimensional quantisation of the SVI distribution. Indeed, the latter (see Pagès et al. (2004) for full details on the methodology) minimises the distance (with respect to some chosen norm) between the initial distribution and its discretised version. We leave this precise study and its error analysis to further research, for fear that it would clutter the present description of the algorithm. The discretised distribution, with n qubits, together with the binary mapping, is plotted in Fig. 11 and gives rise to the wave function

$$ |{\psi_{\text{target}}}\rangle = {\sum}_{i=0}^{2^{n}-1}\sqrt{p_{i}}|{i}\rangle, $$

where, for each i ∈{0,…,2n − 1},

$$ p_{i} = \mathbb{P}\left( \log(S_{T})\in\Bigg[ -1+\frac{2i}{2^{n}},-1+\frac{2(i+1)}{2^{n}} \Bigg)\right). $$
Fig. 11  Discretised version of the distribution of \(\log (S_{T})\) on [− 1,1] with \(2^{4}\) points
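A sketch of this discretisation step, using the svi_density function from the previous snippet (the helper discretise_target is ours): the probability masses p_i are obtained by integrating the density over each of the 2^n bins of [−1,1], and their square roots give the amplitudes of \(|\psi_{\text{target}}\rangle\).

```python
import numpy as np
from scipy.integrate import quad

def discretise_target(density, n_qubits):
    """Bin a density over [-1, 1] into 2**n_qubits probability masses p_i."""
    n_bins = 2 ** n_qubits
    edges = -1.0 + 2.0 * np.arange(n_bins + 1) / n_bins
    p = np.array([quad(density, edges[i], edges[i + 1])[0] for i in range(n_bins)])
    p /= p.sum()                       # renormalise the mass lost outside [-1, 1]
    amplitudes = np.sqrt(p)            # sqrt(p_i): amplitudes of |psi_target>
    return p, amplitudes

p_target, amp_target = discretise_target(svi_density, n_qubits=4)
```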

We need metrics to monitor the training of our QuGAN algorithm, for example the Fidelity function (Nielsen and Chuang 2000, Chapter 9.2.2)

$$ \mathcal{F}: |{v_{1}}\rangle,|{v_{2}}\rangle\in \mathbb{C}^{2^{n}}\times \mathbb{C}^{2^{n}} \mapsto |\langle{v_{1}}|{v_{2}}\rangle|, $$

so that for the wave function (3.1) \(|{v_{\boldsymbol {\theta }_{G}}}\rangle ={\sum }_{i=0}^{2^{n}-1}v_{i,\boldsymbol {\theta }_{G}}|{i}\rangle \), the goal is to obtain \(\mathcal {F}\left (|{v_{\boldsymbol {\theta }_{G}}}\rangle ,|{\psi _{\text {target}}}\rangle \right )=1\), which gives \(\mathbb {P}(|{v_{\boldsymbol {\theta }_{G}}}\rangle = |{i}\rangle ) = \left |v_{i,\boldsymbol {\theta }_{G}}\right |^{2}=p_{i}\), for all \(i \in \{0,\ldots ,2^{n}-1\}\). The Kullback-Leibler Divergence is also a useful monitoring metric, defined as

$$ \text{KL}(|{\psi_{\text{target}}}\rangle,|{v_{\boldsymbol{\theta}_{G}}}\rangle) :={\sum}_{i=0}^{2^{n}-1}p_{i}\log\left( \frac{p_{i}}{\left|v_{i,\boldsymbol{\theta}_{G}}\right|^{2}}\right). $$
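Both monitoring metrics only involve the amplitudes and measurement probabilities, and are cheap to compute from the generated and target distributions. A minimal sketch (variable names are our own; eps is an arbitrary regularisation to avoid division by zero):

```python
import numpy as np

def fidelity(amp_generated, amp_target):
    """|<v_1|v_2>| for two amplitude vectors (here real non-negative sqrt(p_i))."""
    return np.abs(np.vdot(amp_generated, amp_target))

def kl_divergence(p_target, p_generated, eps=1e-12):
    """KL(target || generated) between two discrete distributions."""
    return np.sum(p_target * np.log((p_target + eps) / (p_generated + eps)))
```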

4.2.1 Training and generated distributions

In the training of the QuGAN algorithm, in each epoch \(\mathfrak {e}\) we train the discriminator \(n_{D} = 9\) times and the generator \(n_{G} = 1\) time. The results, in Figure 4.2.1, are quite interesting, as the QuGAN manages to learn the overall shape of the SVI distribution. Aside from the limited number of qubits, the remaining discrepancy can be explained by the expressivity of our network, which is only parameterised by \((\theta _{i})_{i\in \{1,\ldots ,9\}}\) and \((\mathrm {w}_{i})_{i\in \{1,\ldots ,4\}}\), and this is clearly not enough. This lack of expressivity is a deliberate choice: adding parameters deepens the network, but can create a barren plateau phenomenon (McClean et al. 2018), where the gradient vanishes as \(\mathcal {O}(2^{-d})\), with d the depth of the network. This would in turn require an exponentially larger number of shots to obtain a good enough estimation of (3.11), thereby creating a trade-off between expressivity and trainability.

(Figure 4.2.1: QuGAN training results)

4.2.2 Results: further improvements

The results above show that the training routine converges. However, this convergence does not occur in a neighbourhood of zero for the Kullback-Leibler Divergence proxy metric, which can be explained by the shape of the target distribution. Indeed, given any target distribution, the generator's architecture can reproduce it exactly only for a unique set of parameters 𝜃. Combining this uniqueness of the optimal solution with the geometry that the shape of the target induces on the score function being optimised, there is a risk of converging to sub-optimal points, namely saddle points in our case. A thorough study of this geometry, together with the development of a strategy to avoid such saddle points, is left for future research (Fig. 12).

Fig. 12  Evolution of \(\||{v_{\boldsymbol {\theta }_{G}}}\rangle \langle {v_{\boldsymbol {\theta }_{G}}}|-|{\psi _{\text {target}}}\rangle \langle {\psi _{\text {target}}}|\|_{1}\) during QuGAN training

All the numerics in the paper were performed using the IBM-Qiskit library in Python.
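As an illustration of the Qiskit side, a minimal generator ansatz and the readout of its implied distribution might look as follows. This is an illustrative sketch only, not the exact circuit of Section 3: the layer structure, parameter count and seed are our own choices, and only standard Qiskit calls are used.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

def generator_circuit(thetas, n_qubits=3):
    """Hardware-efficient ansatz: layers of R_y rotations followed by a CNOT chain."""
    qc = QuantumCircuit(n_qubits)
    idx = 0
    for _ in range(len(thetas) // n_qubits):
        for q in range(n_qubits):
            qc.ry(thetas[idx], q)
            idx += 1
        for q in range(n_qubits - 1):
            qc.cx(q, q + 1)
    return qc

thetas = np.random.default_rng(1).uniform(0, 2 * np.pi, size=9)
probs = Statevector.from_instruction(generator_circuit(thetas)).probabilities()
# probs[i] plays the role of |v_{i, theta_G}|^2 in the text.
```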