A continuous variable Born machine

Generative modelling has become a promising use case for near-term quantum computers. Due to the fundamentally probabilistic nature of quantum mechanics, quantum computers naturally model and learn probability distributions, perhaps more efficiently than can be achieved classically. The quantum circuit Born machine is an example of such a model, easily implemented on near-term quantum computers. However, the Born machine was originally defined to naturally represent discrete distributions. Since probability distributions of a continuous nature are commonplace in the world, it is essential to have a model which can efficiently represent them. Some proposals have been made in the literature to supplement the discrete Born machine with extra features to more easily learn continuous distributions; however, all invariably increase the resources required. In this work, we discuss the continuous variable Born machine, built on the alternative architecture of continuous variable quantum computing, which is much more suitable for modelling such distributions in a resource-minimal way. We provide numerical results indicating the model’s ability to learn both quantum and classical continuous distributions, including in the presence of noise.


Introduction
With the dawn of the noisy intermediate-scale quantum (NISQ) [1] device era comes a possibility of performing useful and large-scale computations that implement quantum information processing.While NISQ technologies do not entail fault-tolerance or large numbers of qubits (generally in the range of about  which we expect to be necessary for obtaining useful processing power, they do provide avenues for brand new methods of exploiting quantum information.One type of framework which can utilise the restricted architectures of NISQ devices is that of hybrid quantum-classical (HQC) methods, which have found many applications recently in fields like quantum chemistry [2] and quantum machine learning (QML) [3].These depend on dividing an algorithm into several parts which can be delegated to either quantum and classical servers, reducing the amount of quantum resources required to generate a solution.
In the field of quantum machine learning (QML), the benefits of HQC are key in approaches that employ paramterized quantum circuits (PQC) (also referred to as a quantum neural networks), which act as an Ansatz solution to some particular problem that can be optimised classically.QML has employed PQC's for several problems, including classification, [4][5][6][7][8], generative modelling, [9][10][11][12][13], and problems in quantum information and computation themselves [14][15][16][17][18][19].The 'learnability' and expressive power of these models has been also been studied [8,[20][21][22][23].One can, for example, use quantum states as sources of probability distributions, with measurement playing the role of random sampling.These quantum states are obtained via PQC's which are composed of a number of tunable quantum gates with parameters that can be optimised using a classical subroutine.The process is typically iterative, requiring smaller quantum circuits with less depth to be run several times, thus decreasing information loss and quantum resource requirements.Alternatively, one could use coherent quantum training procedures [24] which may in fact be necessary to see quantum advantages in certain cases [25].
QML is not restricted to a specific quantum computing paradigm and several models have shown potential in recent years, often divided into two broad categories: discrete-and continuous-variable [26] (DV and CV) systems.DV systems allow for individually addressable states that are finite, often preferred for their analogous nature to classical computers.CV systems, on the other hand, deal with quantum states which behave as bosonic quantum modes (and are therefore referred to commonly as qumodes) which effectively have infinite eigenstates with an added difficulty in addressing each individual state.This challenge notwithstanding, CV quantum computers allow for a far more effective manner of dealing with problems which require continuous values, making them a perfect candidate for modelling continuous distributions.Furthermore, much progress has been made in the field of QML using continuous variables, with software packages specifically created to deal with such scenarios [27].
In this work, we use the CV model to study the Born machine (BM) [9,28,29], a mathematical model which generates statistics from a probability distribution p(x) according to the fundamental randomness of quantum mechanics, i.e.Born's measurement rule (see Section 3).We can generate samples of a distribution defined according to Born's rule via the measurement of some quantum state, making this a generative QML method.Most commonly, a quantum circuit Born machine (QCBM) is implemented, meaning the quantum state is prepared as by a PQC, although other definitions are possible, for example the parameterisation of density matrices via a combination of classical and quantum resources [30][31][32].The output distribution is then altered via a classical optimisation of the PQC's parameters in order to match the distribution of some target data.The target distribution may be the output of a quantum system -such as a quantum computer or an experimental measurement -making the quantum generative model naturally suited for learning it.Furthermore, Born machines are promising candidates for models which could demonstrate a quantum advantage in machine learning in the near term, [22,31,[33][34][35].

Continuous Variable Quantum Computing
Here we give a brief overview of the nature of CV quantum computing as is pertinent to the rest of the work.For a full treatment of the topic, please refer to [41,42].CV states can be represented in both phase space and Fock space [36][37][38], owing to the wave-particle duality of quantum mechanics.Importantly, both formulations give equivalent predictions about the behaviour of quantum systems.In Fock space, states {|n } ∞ n=0 are eigenstates of the photon number operator n = â † â (where â and â † are the canonical creation and annihilation operators), giving n |n = n |n .Importantly, the Fock basis is discrete as in the case of qubits (although with an infinite basis), so if we want to extract the benefits of continuous variables from CV systems, we need to consider them via the phase space approach.
In the phase space formulation, any state of a single qumode can be represented as a real-valued function F(x, p), (x, p) ∈ R 2 in phase space called the Wigner quasi-probability function [38][39][40].The two axes of phase space are then the quadrature variables governed by quadrature operators x and p which have a continuous and infinite basis with eigenstates |x and |p , representing conjugate variables of a quantum system.The marginals of the Wigner function are the probability distributions of each of the quadrature variables, |ψ(x)| 2 and |φ(p)| 2 where ψ(x), φ(p) are complex-valued functions that can be understood as the amplitudes of their wavefunctions |x and |p .
The most basic qumode state is the vacuum state: where |0 is its vector representation in the Fock basis (n = 0).The vacuum state often serves as a starting point for a computation (similar to the discrete variable case) which can then be expressed as an evolution according to some bosonic Hamiltonian H for a time t: In the above, H can be seen as a polynomial function of the quadrature operators H = H(x, p) with arbitrary but fixed degree.This Hamiltonian can be decomposed into a series of CV quantum gates (see Table 1) in order to be implemented in a quantum circuit setting.
States |Ψ (t) for which the Hamiltonian is at most quadratic in x and p are called Gaussian [41].For the single qumode, the most common Gaussian quantum gates are rotation: R(φ), displacement: D(α) and squeezing: S(r, φ) (see Table 1).The simplest two-mode gate is the beamsplitter, BS(θ), which is a rotation between two qumodes.These gates are parametrised accordingly: Importantly, Gaussian gates are linear when acting on quantum states in phase space.On n qumodes, a general Gaussian operator has the effect [40]: with symplectic matrix M [41] and complex vector β ∈ C 2n .The variables x and p contain the position and momentum information of each qumode, and are vectors in C n .Because of this linearity, Gaussian states and their operators are so-called "easy" operations of a CV quantum computer.They can be efficiently simulated using classical methods and are generally easy to implement experimentally [40,41].
In order to achieve a notion of universality [26] in the CV architecture we need the ability to construct a Hamiltonian as in (Eq.2) that can translate to every possible state in the Hilbert space.This is done by introducing something called a non-Gaussian transformation into the toolbox, which is essentially a non-linear transformation on (x, p).Non-Gaussian gates involve quadrature operators that have a degree of 3 or higher.Some examples we consider in this work are the cubic phase gate V (γ) and the Kerr interaction K(κ) with γ, κ ∈ C ∼ = R 2 .Non-Gaussian gates are generally difficult to implement and difficult to accurately simulate using a classical processor, as they cannot be decomposed in the same fashion as Gaussian gates (Eq.3).This lack of decomposition implies that a completely accurate classical simulation of non-Gaussianity requires access to infinite matrices, thus requiring a choice of cut-off dimension which introduces some errors.On a quantum processor, this notion of infinity is inherent in the native physics of the hardware.
A list of the CV gates used throughout this work as well as their exponential operator form is presented in the Table 1 below.Note that the squeezing gate is parameterised by ζ and encompasses the parameters r (squeezing strength) and φ (squeezing direction) via the relationship ζ = re iφ :

Gate
Operators Finally, classical information can be extracted from CV systems via three types of measurement: Homodyne, Heterodyne and photon counting.For the purposes of the CVBM, we are interested largely in Homodyne detection, which returns real, continuous values of the two quadratures of a CV state.We discuss the measurement in more detail in the next section.

A Continuous Variable Born Machine
Having defined the CV model we can now construct a continuous variable Born machine (CVBM).As discussed in the introduction, the Born machine can generate statistics sampled from a probability distribution according to Born's measurement rule: The state |ψ(θ) is generated by evolving the vacuum state (Eq. 1) according to a Hamiltonian H that is constructed from CV gates given in Table 1.These gates form a PQC which is parameterised by the variables governing each gate (indicated in (Eq.4) by θ).These parameters should be easily tunable and allow for an evolution to any state that can serve as the solution to the given problem.
Thus, in ML terms we can think of the state |ψ(θ) as a generative model, which when measured in some pre-determined basis will generate samples of a distribution that is of interest.This model is parameterised by θ, which defines an n-qumode quantum circuit U (θ) made up of a set of quantum gates such that: Now the final ingredient is the measurement of this parameterised state to extract the samples.This is key to the resource efficiency of the model.Here, a sample from a single qumode (either its position, x, or momentum, p) is extracted by measuring the corresponding quadrature using the infinite basis of operators in a homodyne measurement 4 [42]: By tuning the angle to φ = 0, we can extract the position, x, and for φ = π/2, we can get the momentum p.For this work, we set φ = 0 for all examples, and leave incorporation of alternative measurement strategies to future work.A full output sample, x, can then be determined by measuring all n qumodes.The parameters of these gates then need to be trained according to the problem at hand (as will be discussed in more detail in the next section).This generally involves a so-called cost function whose value depends on the output of the model as well as some training data samples and which needs to be minimised by varying the values of the parameters of the circuit U (θ).This is referred to as empirical risk minimisation [43] in the classical ML literature.The choice of cost function and gates with which the circuit is built is problem-dependent, although there are often multiple cost functions which can be utilised for the same model, varying in efficiency and accuracy.
Once the value of the cost function converges to a minimum, the CVBM should be able to generate samples that behave like those of a target distribution (to be learned).Importantly, the purpose of the CVBM is not to exactly reproduce the samples that were fed into it during the training, but rather to emulate the original distribution from which they arose.This makes the CVBM a synthetic data generator if the original is no longer available, as well as opening up new avenues to study the model itself: the number of, and importance of, the degrees of freedom of the model, for example (given by the gates and their influence on the final distribution).It also means that there is not necessarily a requirement to reach a global minimum of the cost function once training is completed, as a local minimum may lead to a model that is just as effective in achieving the desired samples.Furthermore, different cost function minima might present further insight into the target distribution and its behaviour through the parameters of the model.This is particularly pertinent when talking about QML methods versus classical ones, as the model itself is inherently quantum and the physics may be studied natively within the experimental set-up once the model is trained.
The notion of 'quantum' parameters leads to one of the key reasons to employ a CVBM in learning continuous probability distributions which are quantum mechanical in nature: as well as lending itself to further analysis once trained, it also promises to be far more efficient and accurate in learning a quantum state than any classical model.Primarily this is down to the fact that the degrees of freedom we would expect to affect the distributions are captured explicitly by the parameters θ, saving us gymnastics of developing a new mathematical model that might strive to capture their behaviour.While it is possible to write emulators of quantum systems on classical machines for simple cases, once we are outside of the regime of being able to simulate a quantum system classically, the Born Machine can act as an exponentially complex black-box used to generate samples that could otherwise not be obtained, governed by a tractable number of parameters that can be understood and manipulated.This is irrespective of whether or not we know anything about the quantum system in question, which makes it an advantage even over classical methods that may employ approximations and other shortcuts in dealing with the exponential size of the Hilbert space they need to explore.
Finally, we may add that while a CVBM may be perfectly suited for learning continuous probability distributions which are quantum mechanical in nature, there is no reason to believe that they could not be used to learn classical distributions.Indeed, a Gaussian distribution (Eq.15) is parameterised in a way that has direct correspondence with the Squeezing S(ζ) and Displacement D(α) gates (see Table 1) via its standard deviation and mean respectively.Thus, a CVBM circuit composed of only those two gates is expressive enough to capture any Gaussian distribution (for multidimensional Gaussians we simply need to add a qumode for each new dimension).Other classical distributions are not necessarily as straightforward, but we expect that a well-tailored CVBM (size, gate-set and their sequence may all be important factors) could be suitable adapted.

Previous Work
The CVBM is introduced as a resource efficient method of generating continuous probability distributions.In order to do so, it is instructive to revisit other attempts to generate continuous distributions using quantum generators (here we compare only the sample generation mechanism, and not specifics of the model or training, which we discuss for the CVBM in the following sections).As discussed above, the CVBM allows sample generation via single homodyne measurements of a particular quadrature, for example the position x5 , so an element of a real sample vector is generated without any post-processing.
Firstly, the work of [10] numerically tested the performance of a QCBM when trained differentiably to learn a continuous distribution composed of a mixture of Gaussian distributions.This work used 10 qubits which results in approximating the real distribution using 2 10 = 1024 basis states without any measurement post-processing.Secondly, the work of [12,44] explicitly addresses this question of generating real valued distributions using a Born machine as a generator in an adversarially trained scenario.Specifically, previous works utilised a Born machine to generate n-bit binary strings, x ∈ {0, 1} n by simply taking the measurement result from measuring (for example) every qubit in the computational basis, which generates one bit, x i ∈ {0, 1} per qubit.In contrast, the innovation of [12] was to instead evaluate the expectation value (for example) of the Pauli-Z observable6 from the measurement results to generate a real value, x i ∈ [−1, 1] for each qubit.More concretely, the final sample is an n bit string, x ∈ R n generated by the following process: The intermediate quantity, x is the vector of expectation values for each qubit, x ∈ [−1, 1] n , which is fed into a classical function, f : R n → R n .The function could be, for example, one layer of a classical feedforward neural network, where φ := (W , b) are the weights and biases of the network.This method addresses the continuous distribution problem, but at the expense of adding O(n/ 2 log δ) extra measurements which must be evaluated (and hence circuits which must be run) in order to compute the expectation values, x, with sufficiently high probability (1 − δ) by Hoeffding's inequality [45].Hence, this adds a large overhead to the efficiency of the model in order to generate a single sample, x ∈ R n .An alternate approach, intermediate to [10] (using a completely discrete output) and [12], (which increases number of circuit evaluations), is to use a QCBM, with n qubits, and convert the resulting n bit binary outputs into real valued numbers with a corresponding precision.Again, this method is resource intensive as n qubits are required to generate each vector element in R, and so is less ideal than our main approach.We illustrate this latter method numerically in Section 5 to contrast with the efficiency of the CVBM to learn a Gaussian distribution.

Training
In this work, we are specifically interested in training a CVBM to generate data samples which behave as if they were sampled from some unknown target continuous probability distribution.We need to be able to do this while having access only to some limited number of training samples from the distribution we wish to learn as well as samples from the CVBM at any point in the training process.The metric we choose to implement in training our model based on these requirements is the Maximum Mean Discrepancy (MMD), which we discuss in the next section.

Maximum Mean Discrepancy
The MMD is a suitable metric for our purposes in several respects, having been used to train a discrete Born machine in the past [10] and requiring a relatively low number of samples from each distribution [46].
A key component of the MMD is the kernel function (defined below) and ML methods which use them are unsurprisingly known collectively as 'kernel methods' [5,47,48].The so-called 'kernel-trick' is useful for comparing data points, even when the underlying feature map may be difficult to compute.They use a similarity measure κ(x, x ) between two data points x and x in order to construct models that capture the properties and patterns of a data distribution.This measure of distance is related to the inner products of a feature space, the idea of which is key to kernel methods.
Kernels are symmetric functions of the form κ : H × H → C wherein H is a Hilbert space (the aforementioned feature space).The main idea behind Kernel Methods is to embed data samples from their original sample space X into a space H via a mapping φ : X → H.This is called a feature map and plays the the role of a 'filter' for the samples, with the aim to, say, achieve a reduction in dimensionality or some form of useful restructuring of the data that might aid in the training procedure.A (positive definite, real-valued) kernel inner product should also have the property κ(x, x ) 0 as well as being symmetric: κ(x, x ) = κ(x , x).
A typical example is the so-called 'Gaussian' kernel: where x and y are two data points (generally called feature vectors), || • || 2 is the Euclidean distance between them and σ is a constant which represents the variance or the 'bandwidth' which determines the scale at which the points are compared.The kernel value decreases with distance between the two data points and as such is an effective similarity measure [49].
Quantum Kernels: One nice consequence of implementing kernels is that any positive definite kernel can be replaced by another and this opens up the doors to a brand new approach of improving ML algorithms.Before moving on to the MMD itself, we note the fact that quantum states are like feature vectors themselves in that they also reside in Hilbert spaces and allow for a very natural definition of a quantum kernel.if we find an effective way of encoding input data points x ∈ X into quantum states |φ(x) then we end up with a feature map.Furthermore, the overlap of two quantum states can then be implemented as a kernel distance, with increasing orthogonality of two states leading to a decrease in kernel value.
In order to compute this overlap, one could use CV SWAP-test like primitives, [50,51], but due to the special form of the kernel, it can be evaluated more simply on NISQ devices [6] by running a unitary and then its inverse.Quantum kernels have already been explored within the CVQC framework and we refer to [5] for a full treatment of the topic.For the purposes of this work, we employ several different encodings of classical data samples into CV quantum kernels to explore their efficiency and effectiveness in training the CVBM.While determining a good kernel mapping is a complex and generally problemdependent issue, there are grounds for the use of quantum kernels as they may promise a far more complex mapping than anything that can be achieved classically and could prove to be useful in settings involving large data sets and particularly in terms of high dimensionality or correlation within the data.Quantum kernels were first investigated in the context of generative modelling in [22,52].

MMD Estimator:
The kernel acts as a feature map which embeds data samples in a Reproducing Kernel Hilbert Space (RKHS) [5] and it can be shown that the MMD is a metric describing exactly the difference in mean embeddings between two distributions from which the samples are collected [47,49].
With this, we can define a cost function associated with the metric for distributions P and Q: where x ∼ Q indicates a sample x drawn from distribution Q and κ is the MMD kernel.Given i.i.d.samples drawn from each distribution, the MMD can be estimated by replacing the expectation values in (Eq.10) by their empirical values to produce the (unbiased) MMD estimator : with M samples x := (x 1 , ..., x M ) drawn from distribution P and N samples ỹ := (y 1 , ..., y N ) drawn from Q.Given a large enough number of samples, (Eq.11) should converge to the true expectations given in (Eq.10).The MMD is useful as an estimator due to the relatively low number of samples it requires to satisfy the above requirement [46] (O( −2 ) to reach a precision ).The estimator allows for a practical, numerical implementation to train a model in order to reproduce samples from a target distribution.In this work, we want to apply it to the CVBM.

Training the CVBM
The CVBM is parameterised by the operations given in Table 1 and in order to be able to train it efficiently, we need to determine a way to quickly navigate the parameter space of the MMD estimator for any quantum circuit composed of any of the given operations.A common method used in many machine learning algorithms is gradient descent, often implemented in a stochastic fashion, with many varieties to facilitate a trade off between accuracy and speed [53].
A key component of gradient descent is the calculation of loss function gradients with respect to each of the parameters.For each parameter θ k , we need to determine ∂ θ k L MMD [P (θ), Q] wherein the distribution P (θ) is a CVBM circuit composed of gates parameterised by θ := θ 1 , ..., θ l .
The work of [10] shows that in the case of the discrete Born machine, by measuring an observable Ô = |x x|, the gradient of the probability distribution generated by a QCBM, p θ , with respect to parameter θ k is: Where the parameters θ ± k imply that the gate parameter θ k has been shifted by an amount ± π 2 .In the CVQC case, this needs to be adapted for CV operators by adding specific scaling factors and choosing different shift amounts for θ ± k depending on the circuit gate (see Table 2).We chose to implement the analytic gradients derived in [54] (also see [55][56][57]) for the Gaussian gates of Table 1 as well as additional approximations of gradients for the cubic phase gate V (γ) and Kerr gate K(κ) (see Appendix A for details of the approximation and a more detailed derivation of the gradients).
The gradient of the MMD Estimator with respect to the CVBM parameters can be described by: where p, q samples ã := {a  2, we can determine the gradient of the MMD Estimator with respect to any gate from the set given in Table 1.Note that the error on the gradient for non-Gaussian gates scales approximately as the small-angle approximation error sinh(t) ≈ t.
We train the CVBM iteratively by implementing batch gradient descent for each parameter θ k : Table 2: The gradients of the MMD cost function with respect to the parameters of each possible gate.The expression ∂ θ k LMMD is equivalent to (Eq.13).For the non-Gaussian gates V (γ) and K(κ), the gradients are approximations for small enough parameter shifts t rather than analytic gradients.It should be noted that while these latter approximations increase in accuracy with decreasing t, there is an inherent difficulty in implementing such small parameter shifts on current CV hardware.However, in practice it is a far more effective and accurate approach than any finite-difference method even with such drawbacks.v with t representing the iteration number and µ being the learning rate [53].This is done sequentially for each θ k per iteration of the training in order to update their values in a way that minimises the MMD estimator (Eq.11).The training terminates once a set number of iterations has been performed or else when the cost function to observed to converge.In Figure 1, we emphasise the key ingredients of the model; a compact data encoding method for continuous distributions via the CVBM itself, an efficient training method via the MMD and a potentially classically-hard-to-compute ingredient in the quantum kernel.In the next section we validate its efficacy through numerical results on example classical and quantum distributions.

Numerical Experiments
Here we present numerical results demonstrating the performance of the model on both classical and quantum data sets, as well as the impact of a noise channel and several different kernels.The simulations are implemented using Xanadu's Strawberry Fields API [27], which uses the symplectic matrix approach in simulating Gaussian states (see (Eq. 3)) and a truncation of dimensions in Fock space when dealing with non-Gaussianity.Throughout all experiments, a cut-off dimension of 7 was used unless explicitly stated otherwise, owing to the fact that states with higher Fock numbers are likely to have little impact on the statistics of the single and two-qumode states that were used.Fig. 2: Comparing a QCBM to our CVBM for learning a simple Gaussian distribution (a) with mean, µ = 0, and standard deviation σ = 1.In (c), we show the QCBM with increasing numbers of qubits (2, 4 and 6 qubits respectively, which results in increased precision for each sample.The QCBM is trained with the Adam optimiser [58] using the Sinkhorn divergence [22,[59][60][61].In (b), we show a single qumode with one parameterised squeezing and displacement gate for the CVBM, which can produce a much better fit to the data distribution with significantly fewer resources.

A Classical Distribution
An obvious choice of classical distribution to train the model on is the canonical Gaussian distribution.The classical Gaussian probability density function (PDF) is parameterised by mean µ and standard deviation σ, (which can correspond to the displacement α and squeezing ζ operators as discussed above).
To generate data representing the (single mode) classical Gaussian PDF, we take M samples from π, as given by: To illustrate the advantage of using our method with a natural continuous encoding, we compare to a discrete variable Born machine, which outputs binary strings of length n.However, as mentioned above, in order to use a discrete Born machine to learn real valued distributions, we must decide on the precision we wish to use.In Fig. 2 we use a discrete Born machine (QCBM) with 2, 4 and 6 qubits to learn a Gaussian distribution, N (0, 1) with increasingly higher precision, and also we use the CVBM with a single qumode to learn the same distribution.For the QCBM, we simulate the results using Pyquil [62] and for an Ansatz we choose a hardware efficient layered Ansatz.Each layer consists of CZ matching the topology of a sublattice of the Rigetti Aspen-7 chip, with parameterised R y (θ) gates.For the 2, 4 and 6 qubit QCBMs, we use 8, 4 and 6 layers respectively resulting in 64, 16 and 36 trainable parameters.To convert between the binary outputs of the QCBM, and the continuous valued data, we use a simple conversion algorithm described in [63].We also mention that the comparisons we make here are preliminary, and open the door to rigorous comparison and benchmarking between the CVBM and other models.

Quantum Distributions
Next, we focus on learning quantum distributions, i.e. one which arise as a result of measurements on a quantum state.This in some sense is equivalent to learning parameters which affect a quantum state, and in turn, gives some information about the parameters of the unitary which prepared the state, as a form of weak compilation, as noted by [22].We generate data by sampling CV circuits with different parameters and then use a CVBM that is parameterised by the same gates.Fig. 3 shows the learning process for a Gaussian and a non-Gaussian state using a classical kernel (Eq.8).
It should be noted that the slightly larger discrepancy in 3b is due to the difficulty in capturing the parametrisation and its relationship to the distribution of such states via sampling.This issue calls for a future investigation into finding better kernel functions for non-Gaussianity -ones which could capturing this behaviour more effectively and efficiently.In the Gaussian case, however (3a), the model converges to a very good approximation of the original distribution within less than 25 iterations of training while requiring only 50 samples (measurements) of the quantum state.This bodes well for the application of the method in -for example -characterising experimental data in the CV setting with very few data samples.Fig. 3: Snapshots of the CVBM learning process at iterations 1, 25 and 50 for two single-qumode quantum states.50 data samples are used for the CVBM and target distributions respectively with 30 samples from shifted circuits for gradient computation.

Quantum Kernels
Here, we explore the behaviour of the MMD estimator with two separate kernel mappings: the cubic phase kernel φ V (Eq.16) and the squeezed kernel (Eq.17) φ S .For data point x = x 1 , x 2 , ..., x n we can implement n modes, applying the given unitary with strength x i to the qumode indexed by i.
φ S : The full feature map is given by the tensor product of each of these states, for example, with (Eq.16), We can then compute the overlap of two mapped states |V xi and V x i to extract the kernel as in (Eq.9).Note that the overlap is computed by expressing the state in Fock space with a selected dimensional cut-off dependent on the number of qumodes required to estimate the kernel.Fig. 4 demonstrates the behaviour of the MMD estimator during training for both of the quantum kernels as well as the classical Gaussian kernel given in (Eq.8).Fig. 4: A 3 qumode Gaussian CVBM composed of squeezing, displacement and beamsplitter gates is trained to learn a distribution generated by another CV circuit made of the same components in a different order.The plot illustrates the behaviour of the loss when using a classical Gaussian kernel (Eq.8), a cubic phase kernel (Eq.16) and a squeezed kernel (Eq.17).We use 100 data samples each for the CVBM and the target distribution with 50 samples from shifted circuits to calculate MMD gradients.The shading around the plots indicates the standard deviation for each kernel after 5 runs of the training algorithm.
While both quantum kernels exhibit a convergence to some minimum of the MMD loss, they do not show any improvement over the classical kernel with regards of the resulting distributions.The squeezed kernel, while more erratic, gives a better final result while the cubic kernel is very poor for this task (note the cut-off dimension used in simulating Fock space for both kernels was 15).This is by no means a reason to discount the use of quantum kernels in such algorithms, but does indicate that more analysis needs to be done for which types of data might benefit more from particular mappings.

Noise
Finally, we investigate the effect of a simple noise model on the training of the CVBM.In general, noise in quantum systems can be modelled using a completely positive trace preserving (CPTP) map, N , which can equivalently be expressed in an operator-sum [64] formalism, decomposed into Kraus operators.For the CVBM, we choose a simple noise model available in Strawberry Fields [27], in order to study the effect of loss, whose Kraus representation, acting on a state, ρ, is modelled by: where This has the effect of coupling a mode â to another mode in the vacuum state b via the transformation: which is then traced out.The noise parameter T represents energy transmissivity and T = 1 represents the identity map.For T = 0, the state is mapped to a vacuum state.Fig. 5 shows the effect of noise on the CVBM learning process for both a Gaussian and non-Gaussian quantum state.
As might be expected, the Gaussian state is significantly more robust to noise since the noise channel given by (Eq.18) couples the mode of interest to a vacuum state which is itself a Gaussian distribution.).We use 100 data samples each for the CVBM and the target distribution with 50 samples from shifted circuits to calculate MMD gradients as well as the classical Gaussian kernel (Eq.8).The shading around the plots indicates the standard deviation for 5 runs of the training algorithm.
However, we can see that in the non-Gaussian case the algorithm adapts to some level of noise also, lending credence to using a CVBM in less-than-perfect experimental set-ups.Note that the target distribution data is assumed to have no noise channel affecting it as the assumption is that its source is unknown.However, the CVBM can also be used to determine the strength of noise on quantum data if its own coupling to a relevant noise channel can be manipulated.

Conclusion
We have presented a new QML method within the CVQC model for learning continuous probability distributions.While there is still much to be explored (in particular, in the areas of selecting more apt kernels for the MMD estimator -be they classical or quantum in nature -as well as optimizing its implementation both on the classical and quantum hardware), it promises to be an interesting tool in exploring the relationship between the statistics of quantum experiments on CV states and the unitaries that 'compile' them.It should be noted that the larger the number of samples from the target distribution, the more accurate the CVBM result is in producing relevant samples or in determining how close its circuit is to reproducing a particular state, particularly in the case of non-Gaussian states.It can also be implemented in learning classical distributions, particularly with high dimensionality, in cases where their parameterisation is difficult to capture with a classical model.As with many QML algorithms, the greatest advantage of this method is that it presents one with a complex model with an inherent large memory (Hilbert space) and a parameterisation that is mathematically complex yet easily accessible once run on a quantum device as well as being native to quantum physics.

∂LMMD ∂θ k =
x,y κ(x, y) p θ (y) ∂p θ (x) ∂θ k + p θ (x) ∂p θ (y) ∂θ k − 2 x,y κ(x, y) ∂p θ (x) ∂θ k π(y) If we now treat p θ (x) as a Born machine with observable x, i.e. the distribution we are interested in training, then we need to determine its derivative with respect to a particular parameter, ∂p θ /∂θ k .Luckily, we can refer to [54] for the derivatives of each of the Gaussian gates with respect to their parameters as well as to Appendix A.2 for the non-Gaussian ones.For the sake of demonstrating the method, we choose the displacement gate D(α).In [54] its partial derivative is given as: According to the same paper, this gradient corresponds to the gradient of an observable, which in our case would be the Homodyne measurement (Eq.6).Since the gradient of the Born machine is parametrised the same way we arrive at the partial derivative: Where α ± = α ± s, s ∈ R. Substituting (Eq.24) into (Eq.22) we get: x,y κ(x, y)pα(y)p α + (x) − x,y κ(x, y)pα(y)p α − (x)+ x,y κ(x, y)pα(x)p α + (y) − x,y κ(x, y)pα(x)p α − (y) − 1 s x,y x,y κ(x, y)p α + (x)π(y) − x,y κ(x, y)p α − (x)π(y) Then we can use the symmetric condition of the kernel, κ(x, y) = κ(y, x) to arrive at the form of the gradient given in Table 2.The same method can be applied to each of the parameters of each of the CV gates in order to derive the rest of the gradients given in Table 2.

A.2 Partial Derivatives for Non-Gaussian Transformations
While in [54] the gradients that are derived for Gaussian gates are analytic and based on their decomposition into covariance matrices in phase space, no such simplification was possible for the non-Gaussian gates: Luckily, these unitaries exhibit certain properties which can be exploited in order to derive an approximate gradient.We take the cubic phase gate V (γ) as an example.
First, we notice that regardless of what dimension we choose to truncate the unitary matrix that describes the gate at, the form of it will always look like: where the terms (x11...xNN ) come from the N −dimensional matrix of the operator x3 = 2 (â + â † ) and whose values will vary slightly depending on the dimension at which we truncate the operators â and â † .The derivative of the cubic phase gate with respect to its parameter γ is then of the form: 1 , ..., a R }, b := {b 1 , ..., b S } are drawn from shifted circuits p θ + k (a) and p θ − k (b) respectively while x = {x 1 , ..., x M } and ỹ = {y 1 , ..., y N } are drawn from the CVBM and the target distribution.Using the above equation and the scaling factors and shift amounts given in Table

Fig. 1 :
Fig. 1: Components of the Hybrid Quantum-Classical CVBM.The Quantum hardware is used to produce measurement samples of the parameterised circuit U (θ k ) as well as samples from circuits with shifted parameter values θ ± k .These samples are then employed to classically compute the value of the cost function as well as its gradient in order to update the circuit parameters.If the cost function implements a kernel, this can be a classical function (such as the Gaussian kernel displayed in the Figure above) or quantum in nature, for example by running the corresponding encoding circuit E φ(x) , with φ(x) being a quantum feature encoding for a data point x.

Fig. 5 :
Fig.5: Plots of MMD Loss for single-mode CVBMs learning a Gaussian (Figure5a) and non-Gaussian (Figure5b) single-qumode state.The CVBM is coupled to a loss channel with different values of parameter T (as given in (Eq.18)).We use 100 data samples each for the CVBM and the target distribution with 50 samples from shifted circuits to calculate MMD gradients as well as the classical Gaussian kernel (Eq.8).The shading around the plots indicates the standard deviation for 5 runs of the training algorithm.

Table 1 :
A table of the most common Gaussian and several non-Gaussian transformations in the CV quantum computing model.