Variational quantum Boltzmann machines

This work presents a novel realization approach to quantum Boltzmann machines (QBMs). The preparation of the required Gibbs states, as well as the evaluation of the loss function’s analytic gradient, is based on variational quantum imaginary time evolution, a technique that is typically used for ground-state computation. In contrast to existing methods, this implementation facilitates near-term compatible QBM training with gradients of the actual loss function for arbitrary parameterized Hamiltonians which do not necessarily have to be fully visible but may also include hidden units. The variational Gibbs state approximation is demonstrated with numerical simulations and experiments run on real quantum hardware provided by IBM Quantum. Furthermore, we illustrate the application of this variational QBM approach to generative and discriminative learning tasks using numerical simulation.


I. INTRODUCTION
Boltzmann Machines (BMs) [1,2] offer a powerful framework for modelling probability distributions. These types of neural networks use an undirected graph structure to encode relevant information. More precisely, the respective information is stored in bias coefficients and connection weights of network nodes, which are typically related to binary spin systems and grouped into those that determine the output, the visible nodes, and those that act as latent variables, the hidden nodes. Furthermore, the network structure is linked to an energy function which facilitates the definition of a probability distribution over the possible node configurations by using a concept from statistical mechanics, i.e., Gibbs states [3,4]. The aim of BM training is to learn a set of weights such that the resulting model approximates a target probability distribution which is implicitly given by training data. This setting can be formulated as a discriminative as well as a generative learning task [5]. Applications have been studied in a large variety of domains such as the analysis of quantum many-body systems, statistics, biochemistry, social networks, signal processing, and finance, see, e.g., [6-15]. However, BMs are complicated to train in practice because the loss function's derivative requires the evaluation of a normalization factor, the partition function, which is generally difficult to compute. Usually, it is approximated using Markov Chain Monte Carlo methods, which may require long runtimes until convergence [16,17]. Alternatively, the gradients could be estimated approximately using contrastive divergence [18] or pseudo-likelihood [19], potentially leading to inaccurate results [20,21].
Quantum Boltzmann Machines (QBMs) [22] are a natural adaption of BMs to the quantum computing framework. Instead of an energy function with nodes represented by binary spin values, QBMs define the underlying network using a Hermitian operator, a parameterized Hamiltonian

$H_\theta = \sum_{i=0}^{p-1} \theta_i h_i,$

with $\theta \in \mathbb{R}^p$ and $h_i = \bigotimes_{j=0}^{n-1} \sigma_{j,i}$ for $\sigma_{j,i} \in \{ I, X, Y, Z \}$ acting on the $j$th qubit. The network nodes are hereby characterized by the Pauli matrices $\sigma_{j,i}$. This Hamiltonian relates to a quantum Gibbs state, $\rho^{\mathrm{Gibbs}} = e^{-H_\theta/(k_B T)}/Z$, with $k_B$ and $T$ denoting the Boltzmann constant and the system temperature, and $Z = \mathrm{Tr}\left[ e^{-H_\theta/(k_B T)} \right]$. It should be noted that those qubits which determine the model output are referred to as visible qubits and those which act as latent variables as hidden qubits. The aim of the model is to learn Hamiltonian parameters such that the resulting Gibbs state reflects a given target system. In contrast to BMs, this framework allows the use of quantum structures which are potentially inaccessible classically. Equivalently to the classical model, QBMs are suitable for discriminative as well as generative learning.
We present here a QBM implementation that circumvents certain issues which emerged in former approaches. The first paper on QBMs [22] and several subsequent works [23-26] are incompatible with efficient evaluation of the loss function's analytic gradients if the given model has hidden qubits and $\exists j: \left[ H_\theta, \partial H_\theta / \partial \theta_j \right] \neq 0$.
Instead, the use of hidden qubits is either avoided, i.e., only fully visible settings are considered [24-26], or the gradients are computed with respect to an upper bound of the loss [22-24], which is based on the Golden-Thompson inequality [27,28]. It should be noted that training with an upper bound renders the use of transverse Hamiltonian components, i.e., off-diagonal Pauli terms, difficult and imposes restrictions on the compatible models.
Further, we would like to point out that, in general, it is not trivial to evaluate a QBM Hamiltonian with a classical computer, i.e., using exact simulation with Quantum Monte Carlo methods [29], because the underlying Hamiltonian can suffer from the so-called sign problem [30-34]. As already discussed in [35], evaluations on quantum computers can avoid this problem.
Our QBM implementation works for generic Hamiltonians $H_\theta$ with real coefficients $\theta$ and arbitrary Pauli terms $h_i$ and, furthermore, is compatible with near-term, gate-based quantum computers. The method exploits Variational Quantum Imaginary Time Evolution [36,37] (VarQITE), which is based on McLachlan's variational principle [38], to not only prepare approximate Gibbs states, $\rho^{\mathrm{Gibbs}}_\omega$, but also to train the model with gradients of the actual loss function. During each step of the training, we use VarQITE to generate an approximation to the Gibbs state underlying $H_\theta$ and to enable automatic differentiation for computing the gradient of the loss function, which is needed to update $\theta$. This Variational QBM algorithm (VarQBM) is inherently normalized, which implies that the training does not require the explicit evaluation of the partition function.
We focus on training quantum Gibbs states whose sampling behavior reflects a classical probability distribution. However, the scheme could easily be adapted to an approximate quantum state preparation scheme by using a loss function which is based on the quantum relative entropy [24-26]. Hereby, the approximation to $\rho^{\mathrm{Gibbs}}$ is fitted to a given target state $\rho^{\mathrm{data}}$. Notably, this approach is not necessarily suitable for learning classical distributions. More precisely, we do not need to train a quantum state that captures all features of the density matrix $\rho^{\mathrm{data}}$ but only those which determine the sampling probability. It follows that fitting the full density matrix may impede the training.
The remainder of this paper is structured as follows. First, we review classical BMs and VarQITE in Sec. II. Then, we outline VarQBM in Sec. III. Next, we illustrate the feasibility of the Gibbs state preparation and present QBM applications in Sec. IV. Finally, a conclusion and an outlook are given in Sec. V.

II. PRELIMINARIES
This section introduces the concepts which form the basis of our VarQBM algorithm. First, classical BMs are presented in Sec. II A. Then, we discuss VarQITE, the algorithm that VarQBM uses for approximate Gibbs state preparation, in Sec. II B.

A. Boltzmann Machines
Here, we will briefly review the original concept of classical BMs [1]. A BM represents a network model that stores the learned knowledge in connection weights between network nodes. More explicitly, the connection weights are trained to generate outcomes according to a probability distribution of interest, e.g., to generate samples which are similar to given training samples or to output correct labels depending on input data samples.
Typically, this type of neural network is related to an Ising-type model [39,40] such that each node $i$ corresponds to a binary variable $z_i \in \{ -1, +1 \}$. Now, the set of nodes may be split into visible and hidden nodes representing observed and latent variables, respectively. Furthermore, a certain configuration $z = \{v, h\}$ of all nodes, visible and hidden, determines an energy

$E_z = -\sum_i \theta_i z_i - \sum_{i<j} \theta_{ij} z_i z_j,$

with $\theta_i, \theta_{ij} \in \mathbb{R}$ denoting the weights and $z_i$ representing the value taken by node $i$. It should be noted that the parameters $\theta_{ij}$ correspond to the weights of connections between different nodes. More explicitly, if two nodes are connected in the network, then a respective term appears in the energy function. The probability to observe a configuration $v$ of the visible nodes is defined as

$p^{\mathrm{BM}}_v = \frac{1}{Z} \sum_h e^{-E_{\{v,h\}}/(k_B T)}, \qquad (1)$

where $k_B$ denotes the Boltzmann constant, $T$ the system temperature, and $Z$ the canonical partition function $Z = \sum_{z=\{v,h\}} e^{-E_z/(k_B T)}$. We would like to point out that BMs adopt a concept from statistical mechanics. Suppose a closed system that is in thermal equilibrium with a coupled heat bath at constant temperature. The possible configuration space is determined by the canonical ensemble, i.e., the probability for observing a configuration is given by the Gibbs distribution [3,4], which corresponds to Eq. (1). Now, the goal of a BM is to fit the target probability distribution $p^{\mathrm{data}}$ with $p^{\mathrm{BM}}$. Typically, this training objective is achieved by optimizing the cross-entropy

$L = -\sum_v p^{\mathrm{data}}_v \log p^{\mathrm{BM}}_v. \qquad (2)$

In theory, fully connected BMs have interesting representation capabilities [1,41,42], i.e., they are universal approximators [43]. However, in practice they are difficult to train as the optimization easily gets expensive. Thus, it has become common practice to restrict the connectivity between nodes, which leads to restricted Boltzmann Machines (RBMs) [44]. Furthermore, several approximation techniques, such as contrastive divergence [18], have been developed to facilitate BM training. However, these approximation techniques typically still face issues such as long computation times due to a large number of required Markov chain steps, or poor compatibility with multimodal probability distributions [17]. For further details, we refer the interested reader to [45-47].
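To make these definitions concrete, the following minimal Python sketch enumerates all node configurations of a toy BM, computes the Gibbs distribution and the marginal $p^{\mathrm{BM}}_v$ of Eq. (1), and evaluates the cross-entropy of Eq. (2) by brute force. The sizes, weights, and target distribution are illustrative placeholders, not values from this work.

```python
# Brute-force sketch of a tiny classical BM (toy sizes and weights).
import itertools
import numpy as np

n_visible, n_hidden = 2, 1
rng = np.random.default_rng(0)
theta = rng.uniform(-1, 1, size=n_visible + n_hidden)            # biases theta_i
theta_ij = np.triu(rng.uniform(-1, 1,
                   size=(n_visible + n_hidden,) * 2), k=1)       # weights, i < j
kBT = 1.0

def energy(z):
    # E_z = -sum_i theta_i z_i - sum_{i<j} theta_ij z_i z_j
    z = np.asarray(z)
    return -theta @ z - z @ theta_ij @ z

configs = list(itertools.product([-1, 1], repeat=n_visible + n_hidden))
weights = np.array([np.exp(-energy(z) / kBT) for z in configs])
Z = weights.sum()                                 # partition function

# marginalize over the hidden nodes to obtain p_v^BM
p_bm = {}
for z, w in zip(configs, weights):
    v = z[:n_visible]
    p_bm[v] = p_bm.get(v, 0.0) + w / Z

p_data = {(-1, -1): 0.5, (1, 1): 0.5}             # example target distribution
loss = -sum(p * np.log(p_bm[v]) for v, p in p_data.items())  # cross-entropy
print(p_bm, loss)
```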

B. Variational Quantum Imaginary Time Evolution
Imaginary time evolution (ITE) [48] is an approach that is well known for (classical) ground-state computation [36,49,50].
Suppose a starting state $|\psi_0\rangle$ and a time-independent Hamiltonian $H = \sum_{i=0}^{p-1} \theta_i h_i$ with real coefficients $\theta_i$ and Pauli terms $h_i$. Then, the normalized ITE propagates $|\psi_0\rangle$ with respect to $H$ for time $\tau$ according to

$|\psi_\tau\rangle = \frac{e^{-H\tau} |\psi_0\rangle}{\sqrt{\langle \psi_0 | e^{-2H\tau} | \psi_0 \rangle}}.$

The terms in $e^{-H\tau}$ corresponding to small eigenvalues of $H$ decay more slowly than those corresponding to large eigenvalues; due to the continuous normalization, the smallest eigenvalue dominates for $\tau \to \infty$, which explains the use of ITE for ground-state computation. The differential equation that describes this evolution is the Wick-rotated Schrödinger equation

$\frac{\partial |\psi_\tau\rangle}{\partial \tau} = -\left( H - E_\tau \right) |\psi_\tau\rangle, \qquad (3)$

where $E_\tau = \langle \psi_\tau | H | \psi_\tau \rangle$ accounts for the normalization. This non-unitary evolution cannot be implemented directly on a gate-based quantum computer, but it can be approximated variationally using McLachlan's variational principle [38]. The basic idea of the method is to introduce a parameterized trial state $|\psi_\omega\rangle$ and to project the temporal evolution of $|\psi_\tau\rangle$ onto the parameters, i.e., $\omega := \omega(\tau)$. We refer to this algorithm as VarQITE and now discuss it in more detail.
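As a numerical aside, the following sketch illustrates the ground-state convergence of exact normalized ITE using plain matrix exponentials, not the variational algorithm itself; the single-qubit Hamiltonian and initial state are arbitrary examples.

```python
# Exact normalized ITE: |psi_tau> = exp(-H tau)|psi_0> / norm.
import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = 0.5 * Z + 0.25 * X                      # example single-qubit Hamiltonian

psi0 = np.array([1, 1], dtype=complex) / np.sqrt(2)
for tau in [0.0, 1.0, 5.0, 20.0]:
    psi = expm(-H * tau) @ psi0
    psi /= np.linalg.norm(psi)              # continuous normalization
    print(tau, (psi.conj() @ H @ psi).real) # energy decreases with tau

# the energy approaches the smallest eigenvalue of H
print("ground-state energy:", np.linalg.eigvalsh(H)[0])
```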
First, we define an input state $|\psi_{\mathrm{in}}\rangle$ and a quantum circuit $V(\omega) = U_q(\omega_q) \cdots U_1(\omega_1)$ with parameters $\omega \in \mathbb{R}^q$ to generate the parameterized trial state

$|\psi_\omega\rangle = V(\omega) |\psi_{\mathrm{in}}\rangle.$

McLachlan's variational principle,

$\delta \left\| \left( \frac{d}{d\tau} + H - E_\tau \right) |\psi_\omega\rangle \right\| = 0, \qquad (4)$

determines the time propagation of the parameters $\omega(\tau)$. This principle aims to minimize the distance between the right-hand side of Eq. (3) and the change $d|\psi_\omega\rangle/d\tau$. Eq. (4) leads to a system of linear equations for $\dot\omega = d\omega/d\tau$, i.e.,

$A \dot\omega = C, \qquad (5)$

with

$A_{kl} = \mathrm{Re}\left( \frac{\partial \langle \psi_\omega |}{\partial \omega_k} \frac{\partial |\psi_\omega\rangle}{\partial \omega_l} \right), \qquad C_k = -\mathrm{Re}\left( \frac{\partial \langle \psi_\omega |}{\partial \omega_k} H |\psi_\omega\rangle \right), \qquad (6)$

where $\mathrm{Re}(\cdot)$ denotes the real part and $\rho_{\mathrm{in}} = |\psi_{\mathrm{in}}\rangle\langle\psi_{\mathrm{in}}|$.
The vector $C$ describes the derivative of the system energy $\langle \psi_\omega | H | \psi_\omega \rangle$, and $A$ is proportional to the classical Fisher information matrix, a metric tensor that reflects the system's information geometry [51]. To evaluate $A$ and $C$, we compute expectation values with respect to quantum circuits of a particular form, which is illustrated and discussed in Appendix A. This evaluation is compatible with arbitrary parameterized unitaries in $V(\omega)$ because all unitaries can be written as $U(\omega) = e^{iM(\omega)}$, where $M(\omega)$ denotes a parameterized Hermitian matrix. Further, Hermitian matrices can be decomposed into weighted sums of Pauli terms, i.e., $M(\omega) = \sum_p m_p(\omega) h_p$ with $m_p(\omega) \in \mathbb{R}$ and $h_p$ given as tensor products of Pauli matrices acting on the individual qubits. The gradients of $U_k(\omega_k)$ can then be expressed in terms of this Pauli decomposition, which allows us to compute $A$ and $C$ with the techniques described in [36,37,53]. Furthermore, it should be noted that Eq. (5) is often ill-conditioned and may, thus, require the use of regularized regression methods, see Sec. IV A. Now, we can use, e.g., an explicit Euler method to evolve the parameters as

$\omega(\tau + \delta\tau) \approx \omega(\tau) + \delta\tau \, \dot\omega(\tau).$
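A minimal sketch of a single VarQITE parameter update under these equations, assuming $A$ and $C$ have already been estimated on the quantum device; the Tikhonov regularization stands in for the regularized regression mentioned above, and all numerical values are placeholders.

```python
# One regularized Euler step of VarQITE: solve A omega_dot = C, then update.
import numpy as np

def varqite_euler_step(omega, A, C, d_tau, reg=1e-4):
    # Tikhonov regularization: min ||A w - C||^2 + reg ||w||^2
    q = len(omega)
    omega_dot = np.linalg.solve(A.T @ A + reg * np.eye(q), A.T @ C)
    return omega + d_tau * omega_dot            # omega(tau + d_tau)

# usage with placeholder values for A and C
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)); A = A @ A.T        # symmetric placeholder metric
C = rng.normal(size=4)
omega = varqite_euler_step(np.zeros(4), A, C, d_tau=0.05)
```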

III. QUANTUM BOLTZMANN MACHINE ALGORITHM
A QBM is defined by a parameterized Hamiltonian

$H_\theta = \sum_{i=0}^{p-1} \theta_i h_i,$

with $h_i = \bigotimes_{j=0}^{n-1} \sigma_{j,i}$ for $\sigma_{j,i} \in \{ I, X, Y, Z \}$ acting on the $j$th qubit. Equivalently to classical BMs, QBMs are typically represented by an Ising model [39], i.e., a 2-local system [54] with nearest-neighbor coupling that is defined with regard to a particular grid. In principle, however, any Hamiltonian compatible with Boltzmann distributions could be used.
In contrast to BMs, the network nodes, given by the Pauli terms $\sigma_{j,i}$, do not represent the visible and hidden units. These are defined with respect to certain subsets of qubits. More explicitly, those qubits which determine the output of the QBM are the visible qubits, whereas the others correspond to the hidden qubits. Now, the probability to measure a configuration $v$ of the visible qubits is defined with respect to a projective measurement $\Lambda_v = |v\rangle\langle v| \otimes I$ on the quantum Gibbs state $\rho^{\mathrm{Gibbs}} = e^{-H_\theta/(k_B T)}/Z$, i.e., the probability to measure $|v\rangle$ is given by

$p^{\mathrm{QBM}}_v = \mathrm{Tr}\left[ \Lambda_v \rho^{\mathrm{Gibbs}} \right].$

For the remainder of this work, we assume that $\Lambda_v$ refers to projective measurements with respect to the computational basis of the visible qubits. Thus, the configuration $v$ is determined by $v_i \in \{ 0, 1 \}$. It should be noted that this formulation does not require the evaluation of the configuration of the hidden qubits.
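The following sketch evaluates $p^{\mathrm{QBM}}_v = \mathrm{Tr}[\Lambda_v \rho^{\mathrm{Gibbs}}]$ for a placeholder state with one visible and one hidden qubit; the qubit counts and the maximally mixed stand-in for $\rho^{\mathrm{Gibbs}}$ are illustrative assumptions.

```python
# Sampling probabilities of the visible qubits: p_v = Tr[Lambda_v rho].
import numpy as np

n_visible, n_hidden = 1, 1
dim_v, dim_h = 2 ** n_visible, 2 ** n_hidden

# placeholder Gibbs state: maximally mixed (replace with e^{-H/(kB T)} / Z)
rho = np.eye(dim_v * dim_h) / (dim_v * dim_h)

for v in range(dim_v):
    ket_v = np.zeros(dim_v); ket_v[v] = 1.0
    Lambda_v = np.kron(np.outer(ket_v, ket_v), np.eye(dim_h))  # |v><v| (x) I
    p_v = np.trace(Lambda_v @ rho).real
    print(f"p({v:0{n_visible}b}) = {p_v:.4f}")
```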
Our goal is to train the Hamiltonian parameters $\theta$ such that the sampling probabilities of the corresponding $\rho^{\mathrm{Gibbs}}$ reflect the probability distribution underlying given classical training data. For this purpose, the same loss function as in the classical case, see Eq. (2), can be used,

$L = -\sum_v p^{\mathrm{data}}_v \log p^{\mathrm{QBM}}_v, \qquad (7)$

where $p^{\mathrm{data}}_v$ denotes the occurrence probability of item $v$ in the training data set.
To enable efficient training, we want to evaluate the derivative of $L$ with respect to the Hamiltonian parameters. Unlike existing QBM implementations, VarQBM facilitates the use of analytic gradients of the loss function given in Eq. (7) for generic QBMs. The presented algorithm involves the following steps. First, we use VarQITE to approximate the Gibbs state, see Sec. III A for further details. Then, we compute the gradient of $L$ to update the parameters $\theta$ with automatic differentiation, as discussed in Sec. III B. The parameters are trained with a classical optimization routine, where one training step consists of the Gibbs state preparation with respect to the current parameter values and a consecutive parameter update, as illustrated in Fig. 1 and sketched in the code below.
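A high-level sketch of this training loop follows; prepare_gibbs_and_gradients is a hypothetical stub standing in for the VarQITE Gibbs state preparation and automatic differentiation of Secs. III A and III B, and plain gradient descent replaces the classical optimizer used in the experiments.

```python
# Skeleton of the VarQBM training loop (quantum subroutines stubbed out).
import numpy as np

def prepare_gibbs_and_gradients(theta):
    # hypothetical stub: returns the sampling probabilities p_QBM of the
    # prepared Gibbs state and the Jacobian dp_QBM/dtheta
    p_qbm = np.full(4, 0.25)
    return p_qbm, np.zeros((4, len(theta)))

def train_varqbm(theta0, p_data, n_epochs=200, lr=0.1):
    theta = np.array(theta0, dtype=float)
    for _ in range(n_epochs):
        # 1) VarQITE: prepare rho_Gibbs(theta) and accumulate sensitivities
        p_qbm, dp_dtheta = prepare_gibbs_and_gradients(theta)
        # 2) analytic gradient of the cross-entropy loss, Eqs. (7) and (8)
        grad = -np.sum((p_data / p_qbm)[:, None] * dp_dtheta, axis=0)
        # 3) classical parameter update (AMSGrad is used in Sec. IV)
        theta -= lr * grad
    return theta

theta = train_varqbm([0.1, -0.2, 0.3], p_data=np.array([0.5, 0.0, 0.0, 0.5]))
```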
In the remainder of this section, we discuss Gibbs state preparation with VarQITE in Sec. III A and VarQBM in more detail in Sec. III B.

A. Gibbs State Preparation with VarQITE
The Gibbs state $\rho^{\mathrm{Gibbs}}$ describes the probability density operator of the configuration space of a system in thermal equilibrium with a heat bath at constant temperature $T$ [55]. Originally, Gibbs states were studied in the context of statistical mechanics but, as shown in [56], the density operator also facilitates the description of quantum statistics. Gibbs state preparation can be approached from different angles, and the different techniques not only have different strengths but also different drawbacks. Some schemes [57-59] use Quantum Phase Estimation [60] as a subroutine, which is likely to require error-corrected quantum computers. Other methods enable the evaluation of quantum thermal averages [61-63] for states with finite correlations. However, since QBM-related states may exhibit long-range correlations, these methods are not the first choice for the respective preparation. A thermalization-based approach is presented in [23], where the aim is to prepare a quantum Gibbs state by coupling the state register to a heat bath given in the form of an ancillary quantum register. Correct preparation requires a thorough study of suitable ancillary registers for a generic Hamiltonian, as the most useful ancilla system is not known a priori. Further, a variational Gibbs state preparation method has been presented [64] which is based on the fact that Gibbs states minimize the free energy of a system at constant temperature. Thus, the goal is to fit a parameterized quantum state such that it minimizes the free energy. The parameter update is hereby conducted with a finite difference method instead of analytic gradients, which may impair the training accuracy. Additionally, the method requires the application of Quantum Amplitude Estimation [65], as well as matrix exponentiation of the input state, and is thus not well suited for near-term quantum computing applications.
In contrast to these Gibbs state preparation schemes, VarQITE is compatible with near-term quantum computers and is neither limited to states with finite correlations nor requires ambiguous ancillary systems. In the following, we discuss how VarQITE can be utilized to generate an approximation of the Gibbs state $\rho^{\mathrm{Gibbs}}$ for a generic $n$-qubit Hamiltonian $H_\theta = \sum_{i=0}^{p-1} \theta_i h_i$ with $\theta \in \mathbb{R}^p$ and $h_i = \bigotimes_{j=0}^{n-1} \sigma_{j,i}$ for $\sigma_{j,i} \in \{ I, X, Y, Z \}$ acting on the $j$th qubit.
First, we need to choose a suitable variational quantum circuit $V(\omega)$, $\omega \in \mathbb{R}^q$, and a set of initial parameters $\omega(0)$ such that the initial state is

$|\psi_0\rangle = |\psi_{\omega(0)}\rangle = \bigotimes_{i=1}^{n} |\phi^+\rangle_i, \qquad (9)$

where $|\phi^+\rangle = (|00\rangle + |11\rangle)/\sqrt{2}$ represents a Bell state. We define two $n$-qubit sub-systems $a$ and $b$ such that the first and second qubit of each $|\phi^+\rangle$ is in $a$ and $b$, respectively. Accordingly, an effective $2n$-qubit Hamiltonian $H^{\mathrm{eff}} = H^a_\theta + I^b$, where $H_\theta$ and $I$ act on sub-systems $a$ and $b$, is considered. It should be noted that tracing out sub-system $b$ from $|\psi_0\rangle$ results in the $n$-qubit maximally mixed state,

$\mathrm{Tr}_b\left[ |\psi_0\rangle\langle\psi_0| \right] = \frac{I}{2^n}.$

Now, the Gibbs state approximation $\rho^{\mathrm{Gibbs}}_\omega$ can be generated by propagating the trial state with VarQITE with respect to $H^{\mathrm{eff}}$ for $\tau = 1/(2 k_B T)$. The resulting state gives an approximation of the Gibbs state of interest after tracing out the ancillary sub-system $b$. We would like to point out that the VarQITE propagation relates $\omega$ to $\theta$ via the energy derivative $C$ given in Eq. (6). Equivalently to Eq. (5), the corresponding system of linear equations, Eq. (10), is also prone to being ill-conditioned. Thus, the use of regularization schemes may be required.
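The purification argument can be checked numerically with exact linear algebra, without any variational circuit; the single-qubit Hamiltonian below is an arbitrary example.

```python
# Sanity check: ITE of a Bell-pair purification under H_eff = H (x) I for
# tau = 1/(2 kB T), traced over sub-system b, yields e^{-H/(kB T)} / Z.
import numpy as np
from scipy.linalg import expm

Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = 0.7 * Z                                      # example 1-qubit Hamiltonian
kBT = 1.0
tau = 1.0 / (2.0 * kBT)

phi_plus = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)   # |00> + |11>
psi = np.kron(expm(-H * tau), np.eye(2)) @ phi_plus  # ITE acts on sub-system a
psi /= np.linalg.norm(psi)

rho_full = np.outer(psi, psi.conj()).reshape(2, 2, 2, 2)
rho_a = np.einsum('ibjb->ij', rho_full)          # trace out sub-system b

gibbs = expm(-H / kBT)
gibbs /= np.trace(gibbs)
print(np.allclose(rho_a, gibbs))                 # True
```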
Notably, this is an approximate state preparation scheme that relies on the representation capabilities of $|\psi_\omega\rangle$. However, since the algorithm is employed in the context of machine learning, we do not necessarily require perfect state preparation. The noise may even improve the training, as discussed, e.g., in [66].
McLachlan's variational principle is not only the key component for Gibbs state preparation. It also enables QBM training with gradients of the actual loss function for generic Pauli terms in $H_\theta$, even if some of the qubits are hidden. Further details are given in Sec. III B.
B. Variational QBM

In the following, VarQBM and the respective utilization of McLachlan's variational principle and VarQITE are discussed. We consider training data that takes at most $2^n$ different values and is distributed according to a discrete probability distribution $p^{\mathrm{data}}$. The aim of a QBM is to train the parameters of $H_\theta$ such that the sampling probability distribution of the corresponding $\rho^{\mathrm{Gibbs}}_\omega$, given by $p^{\mathrm{QBM}}_v = \mathrm{Tr}\left[ \Lambda_v \rho^{\mathrm{Gibbs}}_\omega \right]$ for $|v\rangle$, $v \in \{ 0, \ldots, 2^n - 1 \}$, approximates $p^{\mathrm{data}}$. The QBM model is trained to represent $p^{\mathrm{data}}$ by minimizing the loss, given in Eq. (7), with respect to the Hamiltonian parameters $\theta$. VarQITE facilitates gradient-based optimization with the derivative of the actual loss function by using the chain rule, i.e., automatic differentiation. More precisely, the gradient of $L$ can be computed as

$\frac{\partial L}{\partial \theta_i} = -\sum_v \frac{p^{\mathrm{data}}_v}{p^{\mathrm{QBM}}_v} \sum_k \frac{\partial p^{\mathrm{QBM}}_v}{\partial \omega_k(\tau)} \frac{\partial \omega_k(\tau)}{\partial \theta_i}. \qquad (8)$

Firstly, $\partial p^{\mathrm{QBM}}_v / \partial \omega_k(\tau)$ can be evaluated with the quantum gradient methods discussed in [67-71], because the term corresponds to the derivative of an expectation value of the projector $\Lambda_v$ with respect to a circuit parameter. Secondly, $\partial \omega_k(\tau) / \partial \theta_i$ is evaluated by computing the derivative of Eq. (5) with respect to the Hamiltonian parameters,

$\frac{\partial A(\tau)}{\partial \theta_i} \dot\omega(\tau) + A(\tau) \frac{\partial \dot\omega(\tau)}{\partial \theta_i} = \frac{\partial C(\tau)}{\partial \theta_i}.$

This gives the following system of linear equations:

$A(\tau) \frac{\partial \dot\omega(\tau)}{\partial \theta_i} = \frac{\partial C(\tau)}{\partial \theta_i} - \frac{\partial A(\tau)}{\partial \theta_i} \dot\omega(\tau). \qquad (10)$

Now, solving for $\partial \dot\omega(\tau) / \partial \theta_i$ in every time step of the Gibbs state preparation enables the use of, e.g., an explicit Euler method to get

$\frac{\partial \omega(\tau + \delta\tau)}{\partial \theta_i} \approx \frac{\partial \omega(\tau)}{\partial \theta_i} + \delta\tau \frac{\partial \dot\omega(\tau)}{\partial \theta_i}.$

We discuss the structure of the quantum circuits used to evaluate $\partial_{\theta_i} A$ and $\partial_{\theta_i} C$ in Appendix A.
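A minimal sketch of the resulting coupled propagation for a single Hamiltonian parameter $\theta_i$, assuming $A$, $C$, $\partial A/\partial\theta_i$, and $\partial C/\partial\theta_i$ are estimated on the device at every time step; the regularized solves mirror Sec. IV A, and all inputs in the usage example are placeholders.

```python
# Joint Euler propagation of omega and d(omega)/d(theta_i).
import numpy as np

def varqite_step_with_gradient(omega, domega, A, C, dA, dC, d_tau, reg=1e-4):
    q = len(omega)
    M = A.T @ A + reg * np.eye(q)                # Tikhonov-regularized solves
    omega_dot = np.linalg.solve(M, A.T @ C)      # Eq. (5)
    # differentiating A omega_dot = C w.r.t. theta_i gives Eq. (10):
    #   A d(omega_dot)/d(theta_i) = dC/d(theta_i) - dA/d(theta_i) omega_dot
    domega_dot = np.linalg.solve(M, A.T @ (dC - dA @ omega_dot))
    # explicit Euler updates for omega and d(omega)/d(theta_i)
    return omega + d_tau * omega_dot, domega + d_tau * domega_dot

# usage for one time step with placeholder inputs
q = 3
A, C = np.eye(q), np.ones(q)
dA, dC = np.zeros((q, q)), np.zeros(q)
omega, domega = varqite_step_with_gradient(
    np.zeros(q), np.zeros(q), A, C, dA, dC, d_tau=0.05)
```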
In principle, the gradient of the loss function could also be approximated with a finite difference method. If the number of Hamiltonian parameters is smaller than the number of trial state parameters, this requires fewer evaluation circuits. However, given a trial state that has fewer parameters than the respective Hamiltonian, the automatic differentiation scheme presented in this section is favorable in terms of the number of evaluation circuits. A more detailed discussion of this topic can be found in Appendix B.
An outline of the Gibbs state preparation and the evaluation of $\partial \omega_k(\tau) / \partial \theta_i$ with VarQITE is presented in Algorithm 1.

IV. RESULTS
In this section, the Gibbs state preparation with VarQITE is demonstrated using numerical simulation as well as the quantum hardware provided by IBM Quantum [74]. Furthermore, we present numerically simulated QBM training results for a generative and a discriminative learning task. First, aspects which are relevant for the practical implementation are discussed in Sec. IV A. Next, experiments on quantum Gibbs state preparation with VarQITE are shown in Sec. IV B. Then, we illustrate the training of a QBM with the goal to generate a state which exhibits the sampling behavior of a Bell state, see Sec. IV C, and to classify fraudulent credit card transactions, Sec. IV D.

A. Methods
To begin with, we discuss the choice of a suitable parameterized trial state consisting of $V(\omega)$ and $|\psi_{\mathrm{in}}\rangle$. Most importantly, the initial state $|\psi_{\mathrm{in}}\rangle$ must not be an eigenstate of $V(\omega)$, as this would imply that the circuit could only act trivially on the state. Furthermore, the state needs to be able to represent a sufficiently accurate approximation of the target state. If we have to represent, e.g., a non-symmetric Hamiltonian, the chosen trial state needs to be able to generate non-symmetric states. Moreover, $V(\omega)$ should not exhibit too much symmetry, as this may lead to a singular $A$, which in turn causes ill-conditioning of Eq. (5). Assume, e.g., that all entries of $C$ are zero and, thus, that Eq. (5) is homogeneous. If $A$ is singular, infinitely many solutions exist and it is difficult for the algorithm to estimate which path to choose. If $A$ is non-singular, the solution is $\dot\omega = 0$ and the evolution stops, although we might have only reached a local extreme point. Another possibility to cope with ill-conditioned systems of linear equations is to use least-squares methods in combination with regularization schemes. We test Tikhonov regularization [75] and Lasso regularization [76] with an automatic parameter evaluation based on L-curve fitting [77], as well as an $\epsilon$-perturbation of the diagonal, i.e., $A \to A + \epsilon I$. It turns out that all regularization methods perform similarly well.
The results discussed in this section employ Tikhonov regularization. Furthermore, we use trial states which are parameterized by Pauli-rotation gates. Therefore, the gradients of the QBM probabilities with respect to the trial state parameters can be computed using a $\pi/2$-shift method, which is, e.g., described in [71]. All experiments employ an additional qubit $|0\rangle_{\mathrm{add}}$ and parameter $\omega_{\mathrm{add}}$ to circumvent a potential phase mismatch between the target $|\psi_\tau\rangle$ and the trained state $|\psi(\omega(\tau))\rangle$ [36,37,51]. Notably, the additional parameter increases the dimensions of $A$ and $C$ by one. The effective temperature, which in principle acts as a scaling factor on the Hamiltonian parameters, is set to $k_B T = 1$ in all experiments.
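The following self-contained sketch illustrates the $\pi/2$-shift rule for a single $R_Y$ rotation and a computational-basis projector; it reproduces the analytic derivative, although the concrete one-gate circuit is an illustrative assumption rather than a circuit used in the experiments.

```python
# pi/2-shift rule: df/dw = (f(w + pi/2) - f(w - pi/2)) / 2 for Pauli rotations.
import numpy as np

def ry(w):
    # single-qubit R_Y rotation matrix
    return np.array([[np.cos(w / 2), -np.sin(w / 2)],
                     [np.sin(w / 2),  np.cos(w / 2)]])

def p0(w):
    # sampling probability of |0> for the state R_Y(w)|0>, i.e. cos^2(w/2)
    psi = ry(w) @ np.array([1.0, 0.0])
    return psi[0] ** 2

w = 0.3
shift_grad = 0.5 * (p0(w + np.pi / 2) - p0(w - np.pi / 2))
exact_grad = -0.5 * np.sin(w)                  # analytic d/dw of cos^2(w/2)
print(np.isclose(shift_grad, exact_grad))      # True
```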

B. Gibbs State Preparation with VarQITE
To demonstrate that VarQITE is able to generate suitable approximations to Gibbs states, we illustrate the convergence of the state fidelity with respect to the target state for two simple one- and two-qubit Hamiltonians, $H_1$ and $H_2$. The results are computed using the parameterized quantum circuits shown in Fig. 2.
The algorithm is executed for 10 time steps on different backends: an ideal simulator and the ibmq_johannesburg 20-qubit backend. Notably, readout error mitigation [78-80] is used to obtain the final results run on real quantum hardware. Fig. 3 depicts the results in terms of the fidelity between the trained and the target Gibbs state at each time step. It should be noted that the fidelity evaluations for the quantum backend employ state tomography. The plots illustrate that the method approximates the states of interest reasonably well and that the real quantum hardware achieves fidelity values over 0.99 and 0.96, respectively.

C. Generative Learning
Now, the results of an illustrative example of a generative QBM model are presented. More explicitly, the QBM is trained to mimic the sampling statistics of a Bell state $(|00\rangle + |11\rangle)/\sqrt{2}$, which is a state that exhibits non-local correlations. Numerical simulations show that the distribution can be trained with a fully visible QBM based on a two-qubit Hamiltonian $H_\theta$. We draw the initial values of the Hamiltonian parameters $\theta$ from a uniform distribution on $[-1, 1]$. The optimization runs on an ideal simulation of a quantum computer using AMSGrad [81] with initial learning rate 0.1, maximum number of iterations 200, first momentum 0.7, and second momentum 0.99 as optimization routine; a sketch of this update rule is given below. The Gibbs state preparation uses the initial trial state shown in Fig. 4 and 10 steps per state preparation.
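For reference, a minimal AMSGrad sketch with the hyperparameters stated above; grad_fn is a hypothetical callable returning the gradient of the loss, and the toy quadratic loss in the usage line is only for illustration.

```python
# AMSGrad optimizer with lr = 0.1, beta1 = 0.7, beta2 = 0.99.
import numpy as np

def amsgrad(theta, grad_fn, lr=0.1, beta1=0.7, beta2=0.99,
            eps=1e-8, n_iter=200):
    m = np.zeros_like(theta)        # first-moment estimate
    v = np.zeros_like(theta)        # second-moment estimate
    v_hat = np.zeros_like(theta)    # AMSGrad: non-decreasing second moment
    for _ in range(n_iter):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = np.maximum(v_hat, v)
        theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta

# usage on a toy quadratic loss with gradient 2 * theta
theta = amsgrad(np.array([1.0, -2.0]), grad_fn=lambda t: 2 * t)
```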
The training is run 10 times using different randomly drawn initial parameters. The averaged values of the loss function, as well as the distance between the target distribution $p^{\mathrm{data}} = [0.5, 0, 0, 0.5]$ and the trained distribution $p^{\mathrm{QBM}}$ with respect to the $\ell_1$ norm, are illustrated over 50 optimization iterations in Fig. 5. The plot shows that loss and distance converge toward the same values for all sets of initial parameters. Likewise, the trained parameters $\theta$ converge to similar values. Furthermore, Fig. 6 illustrates the target probability distribution alongside the best and worst of the trained distributions. The plot reveals that the model is able to learn the respective distribution very well.

D. Discriminative Learning
QBMs are applicable not only to generative but also to discriminative learning. We discuss the application to a classification task: the identification of fraudulent credit card transactions.
To enable discriminative learning with QBMs, we use the input data points $x$ as bias for the Hamiltonian weights. More explicitly, the parameters of the Hamiltonian are given by functions $f_i(\theta, x)$ which map $\theta$ and $x$ to a scalar in $\mathbb{R}$. The respective loss function is the cross-entropy of Eq. (7), evaluated with respect to $\rho^{\mathrm{Gibbs}(x)}_\omega$, where $\rho^{\mathrm{Gibbs}(x)}_\omega$ denotes the approximate Gibbs state corresponding to $H_\theta(x)$. The model encodes the class labels in the measured output configuration of the visible qubits $v$ of $\rho^{\mathrm{Gibbs}(x)}_\omega$. Now, the aim of the training is to find Hamiltonian parameters $\theta$ such that, given a data sample $x$, the probability of sampling the correct output label from $\rho^{\mathrm{Gibbs}(x)}_\omega$ is maximized. The training is based on 500 artificially created credit card transactions [82] with about 15% fraudulent instances. To avoid redundant state preparation, the training is run for all unique item instances in the data set and the results are averaged according to the items' occurrence counts. The dataset includes the following features: location (ZIP code), time, amount, and Merchant Category Code (MCC) of the transactions. To facilitate the training, the features of the given data set are discretized and normalized as follows. Using k-means clustering, each of the first three features is independently discretized into 3 reasonable bins. Furthermore, we consider MCCs < 10000 and group them into 10 different categories. The discretization is discussed in more detail in Table I. Furthermore, for each feature, we map the values $x$ to $x' = (x - \mu)/\sigma$, with $\mu$ denoting the mean and $\sigma$ the standard deviation. The complexity of this model demands a Hamiltonian with sufficient representation capabilities. Our choice is a two-qubit Hamiltonian whose coefficients are given by $f_i(\theta, x) = \theta_i \cdot x$, the dot product of the vector corresponding to the data item $x$ and a parameter vector $\theta_i$ of equal length; a sketch of this coefficient mapping follows below. Additionally, the first and second qubits correspond to a hidden and a visible qubit, respectively. Since the numerical simulation of variational Gibbs state preparation for various $H_\theta(x)$, with $x$ corresponding to all unique data items, is computationally expensive, we decided to train the parameters $\theta$ using an exact representation of the quantum Gibbs states. The resulting $\theta$ are then used for Gibbs state preparation with VarQITE. Even though the parameters are not trained with variational Gibbs state preparation, the results discussed in this section demonstrate that we can find suitable parameters $\theta$ such that VarQBM corresponds to a well-performing discriminative model.
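A minimal sketch of this data-dependent coefficient mapping, $f_i(\theta, x) = \theta_i \cdot x$, applied to standardized features; the feature values and dimensions are illustrative placeholders, not the actual dataset.

```python
# Data-dependent Hamiltonian coefficients for the discriminative QBM.
import numpy as np

def standardize(X):
    # per-feature mapping x -> (x - mu) / sigma, as described above
    return (X - X.mean(axis=0)) / X.std(axis=0)

def hamiltonian_coefficients(theta, x):
    # f_i(theta, x) = theta_i . x: one scalar coefficient per Pauli term h_i
    # theta has shape (number of Pauli terms, number of features)
    return theta @ x

# illustrative placeholder transactions: (amount, time bin, MCC group)
X = np.array([[1200.0, 2.0, 5.0],
              [80.0,   0.0, 1.0],
              [640.0,  1.0, 7.0]])
Xs = standardize(X)
theta = np.random.default_rng(2).normal(size=(3, X.shape[1]))
coeffs = hamiltonian_coefficients(theta, Xs[0])  # coefficients of H_theta(x)
```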
The exact training uses a Truncated Newton optimization routine [72] with a maximum iteration number of 100 and the step size for the numerical approximation of the Jacobian set to 10. Given a test data set consisting of 250 instances, with about 10% fraudulent transactions, the Gibbs states corresponding to the unique items of the test data are approximated using VarQITE with the trained parameters $\theta$ and the trial state shown in Fig. 7. To predict the labels of the data instances, we sample from the states $\rho^{\mathrm{Gibbs}}_\omega$ and choose the label with the highest sampling probability. These results are then used to evaluate the accuracy, precision, recall, and F1 score. It should be noted that we choose a relatively simple quantum circuit to keep the simulation cost small. However, it can be expected that a more complex parameterized quantum circuit would lead to further improvements in the training results.
The resulting values are compared to a set of standard classifiers defined in a scikit-learn [83] classifier comparison tutorial [84], see Table II. The respective classifiers are used with the hyperparameters defined in this tutorial. Notably, the Linear SVM does not classify any test data item as fraudulent and, thus, its precision and recall scores are 0. The comparison reveals that the QBM performs similarly well to the classical classifiers considering accuracy, is competitive regarding precision, and even outperforms them in terms of recall. The best F1 score is achieved with VarQBM.

V. CONCLUSION AND OUTLOOK
This work presents the application of McLachlan's variational principle to facilitate VarQBM, a variational QBM algorithm that is compatible with generic Hamiltonians and can be trained using analytic gradients of the actual loss function, even if some of the qubits are hidden. Given a sufficiently powerful variational trial state, the presented scheme is compatible not only with local but also with long-range correlations, and works for arbitrary system temperatures.
We outline the practical steps for utilizing VarQITE for Gibbs state preparation and verify that it can train states which are reasonably close to the target using simulation as well as real quantum hardware. Moreover, applications to generative learning and classification are discussed and illustrated with further numerical results. The presented model offers a versatile framework which facilitates the representation of complex structures with quantum circuits.
An interesting question for future research is the investigation of performance measures that improve our understanding of the model's representation capabilities. Furthermore, QBMs are not limited to the presented applications. They could also be utilized to train models for data from experiments with quantum systems, a problem that has recently gained interest, see, e.g., [85]. Additionally, they might be employed for combinatorial optimization. Classical BMs have been investigated in this context [86], and developing and analyzing quantum algorithms for combinatorial optimization is an active area of research [87,88].
All in all, there are many possible applications which still have to be explored.

VI. ACKNOWLEDGMENTS
We would like to thank Erik Altman for making the synthetic credit card transaction dataset available to us. Moreover, we are grateful to Pauline Ollitrault, Guglielmo Mazzola, and Mario Motta for sharing their knowledge and engaging in helpful discussions. Furthermore, we thank Julien Gacon for his help with the implementation of the algorithm and all of the IBM Quantum team for its constant support.
Also, we acknowledge the support of the National Centre of Competence in Research Quantum Science and Technology (QSIT).
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. The current list of IBM trademarks is available at https://www.ibm.com/legal/copytrade.

Appendix A: Evaluation of A, C and their gradients

The elements of the matrix $A$ and the vector $C$, see Eq. (6), are of the following form:

$\mathrm{Re}\left( e^{i\alpha} \mathrm{Tr}\left[ U^\dagger V \rho_{\mathrm{in}} \right] \right), \qquad (\mathrm{A}1)$

with $\mathrm{Re}(\cdot)$ denoting the real part and $\rho_{\mathrm{in}} = |\psi_{\mathrm{in}}\rangle\langle\psi_{\mathrm{in}}|$. As discussed in [36,37,53], such terms can be computed by sampling the expectation value of an observable $Z$ with respect to the quantum circuit shown in Fig. 8. Notably, the phase $e^{i\alpha}$ in the first qubit is needed to include phases which may occur from gate derivatives. In our case, $\alpha$ needs to be set to $0$ respectively $\pi/2$ when computing the terms of $A$ or $C$. More precisely, the first qubit is initialized by an H gate for $A$, and an H gate followed by an S gate for $C$. These phases come from the fact that the trial states used in this work are constructed via Pauli rotations, i.e., $U(\omega) = R_{\sigma_l}(\omega)$ with $\sigma_l \in \{ X, Y, Z \}$, which leads to gate derivatives of the form $\partial R_{\sigma_l}(\omega)/\partial\omega = -\frac{i}{2} \sigma_l R_{\sigma_l}(\omega)$. Furthermore, this method can be applied for the evaluation of $\partial A/\partial\theta$ and $\partial C/\partial\theta$, i.e., the respective terms can also be written in the form of Eq. (A1). Hereby, $\alpha$ must be set to $\pi/2$ respectively $0$ for the terms in $\partial A/\partial\theta$ respectively $\partial C/\partial\theta$. This is achieved with the same gates as mentioned before.

Appendix B: Complexity of the gradient evaluation

VarQITE needs to be run once to prepare $\rho^{\mathrm{Gibbs}}_\omega$; for $t$ time steps, $q$ trial state parameters, and $p$ Hamiltonian parameters, the corresponding number of circuits is $\Theta(tq(q+p))$. Now, computing the gradient with forward finite differences reads

$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \epsilon) - L(\theta)}{\epsilon},$

for $0 < \epsilon \ll 1$. For this purpose, VarQITE must be run once with $\theta$ and $p$ times with an $\epsilon$-shift, which leads to a total number of $\Theta(tpq(q+p))$ circuits.
The automatic differentiation gradient, given in Eq. (8), requires the evaluation of $\partial p^{\mathrm{QBM}}_v / \partial \omega_k$ and $\partial \omega_k / \partial \theta_i$. Furthermore, the evaluation of $\partial \omega_k / \partial \theta$ requires that $\partial A/\partial\theta$ and $\partial C/\partial\theta$ are computed for every step of the Gibbs state preparation. This leads to $\Theta(tq^2(q+p))$ circuits, and hence an overall complexity of $\Theta(tq^2(q+p))$ circuits.
The results are summarized in Table III. Automatic differentiation is more efficient than finite differences if $q < p$. For $q > p$, on the other hand, focusing mainly on computational complexity, one should rather use finite differences. Consider, e.g., a $k$-local Ising model that corresponds to a Hamiltonian with $O(n^k)$ parameters. Suppose that we can find a reasonable variational $n$-qubit trial state with $O(n)$ layers of parameterized and entangling gates, which results in $q = O(n^2)$ parameters; then, automatic differentiation would outperform finite differences for $k > 2$.

TABLE III Number of circuits required by the two gradient evaluation methods.

Method                      Number of circuits
Finite differences          Θ(tpq(q + p))
Automatic differentiation   Θ(tq²(q + p))

FIG. 1 The VarQBM training includes the following steps. First, we need to fix the Pauli terms for $H_\theta$ and choose initial parameters $\theta$. Then, VarQITE is used to generate $\rho^{\mathrm{Gibbs}}_\omega$.


FIG. 2 The depicted circuits illustrate the initial trial states for the Gibbs state preparation of (a) $\rho^{\mathrm{Gibbs}}_1$ and (b) $\rho^{\mathrm{Gibbs}}_2$.

FIG. 4 We train a QBM to mimic the sampling behavior of a Bell state. The underlying Gibbs state preparation with VarQITE uses the illustrated parameterized quantum circuit to prepare the initial trial state. The first two qubits represent the target system and the last two qubits are ancillas needed to generate the maximally mixed state as starting state for the evolution.

FIG. 6 The figure illustrates the sampling probability of the Bell state (blue), as well as the best (pink) and worst (purple) probability distributions achieved from 10 different random seeds.
FIG. 7 Given a transaction instance, the measurement output of the QBM labels it as either fraudulent or valid. The underlying Gibbs state preparation with VarQITE uses the illustrated parameterized quantum circuit as initial trial state. The first qubit is the visible node that determines the QBM output, the second qubit represents the hidden unit, and the last two qubits are ancillas needed to generate the maximally mixed state as starting state for the evolution.