1 Introduction

Modelling the structure of direct interaction between distinct random variables is a fundamental sub-task in various applications of artificial intelligence (Wang et al. 2013), including natural language processing (Lafferty et al. 2001), computational biology (Kamisetty et al. 2008), sensor networks (Piatkowski et al. 2013), and computer vision (Yin and Collins 2007). Probabilistic models, such as discrete graphical models (Wainwright and Jordan 2008), allow for an explicit representation of these structures and hence build the foundation for various classes of machine learning techniques. The existing literature on quantum algorithms for learning and inference of probabilistic models includes quantum Bayesian networks (Low et al. 2014), quantum Boltzmann machines (Amin et al. 2018; Kieferova and Wiebe 2017; Wiebe and Wossnig 2019; Zoufal et al. 2021), and Markov random fields (Zhao et al. 2021; Bhattacharyya 2021; Nelson et al. 2021). These methods are either approximate or rely strongly on so-called fault-tolerant quantum computers, a concept that cannot yet be realized with state-of-the-art quantum hardware. In this work, we derive a quantum circuit construction for sampling from probabilistic graphical models that is exact and shows promise for scalability on near-term quantum computing hardware.

While discrete graphical models are powerful and may be employed for diverse use-cases, their applicability can face difficulties. More specifically, for structures with high-order interactions, probabilistic inference can become challenging, since the calculation of the related normalizing constant is in general computationally intractable. A common way to circumvent the explicit evaluation of the normalizing constant is to compute the quantities of interest from samples drawn from the graphical model. The problem of generating samples from graphical models has already been discussed in seminal works on the Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) and Gibbs sampling (Geman and Geman 1984). As of today, Markov chain Monte Carlo (MCMC) methods are still the most commonly used approach to generate samples from high-dimensional graphical models, where a model's dimension refers to the number of vertices of the underlying graph. MCMC methods are iterative, i.e., an initial guess is randomly modified repeatedly until the chain converges to the desired distribution. The actual time until convergence is model dependent (Bubley and Dyer 1997) and in general intractable to compute. A more recent promising line of research for sampling relies on random perturbations of the model parameters. These perturb-and-MAP (PAM) techniques (Hazan et al. 2013) compute the maximum a posteriori (MAP) state of a graphical model whose potential function is perturbed by independent samples from a Gumbel distribution. The resulting perturbed MAP state can be shown to be an unbiased independent sample from the underlying graphical model. Assigning the correct Gumbel noise (see Appendix A.1 for details) and solving the MAP problem both result in algorithmic runtimes that are exponential in the number of variables. Efficient perturbations have been discovered (Niepert et al. 2021), sacrificing the unbiasedness of samples while delivering viable practical results. However, the exponential time complexity of MAP computation still renders the method intractable in general.
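To make the exponential cost of exact PAM concrete, the following minimal classical sketch applies the Gumbel-max trick to a toy model: every one of the \(2^n\) joint states receives its own independent Gumbel perturbation, and the perturbed argmax is an unbiased sample. The model size and the randomly drawn log-weights are illustrative assumptions.

```python
import numpy as np

# Exact perturb-and-MAP via the Gumbel-max trick on a toy model with
# n = 3 binary variables. The log-weights theta^T phi(x) are drawn at
# random for illustration; note that every one of the 2^n states needs
# its own Gumbel perturbation, which is the exponential cost noted above.
rng = np.random.default_rng(0)
n = 3
log_weights = rng.normal(size=2 ** n)       # one entry per joint state
probs = np.exp(log_weights) / np.exp(log_weights).sum()

def pam_sample(rng):
    gumbel = rng.gumbel(size=2 ** n)        # i.i.d. Gumbel(0, 1) noise
    return int(np.argmax(log_weights + gumbel))

samples = np.array([pam_sample(rng) for _ in range(50_000)])
empirical = np.bincount(samples, minlength=2 ** n) / len(samples)
# empirical now approximates probs (unbiased, independent samples)
```

The empirical histogram of perturbed argmax states converges to the model distribution, illustrating the unbiasedness claim above.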

In this work, we propose a method for generating unbiased and independent samples from discrete undirected graphical models on quantum processors, termed quantum circuits for graphical models (QCGM). The samples may be used for maximum likelihood learning, MAP state approximation, and other inference tasks. Instead of constructing samples iteratively as in MCMC or via a perturbation as in PAM, our method coherently generates samples from measurements of a quantum state that reflects the probability distribution underlying the graphical model. The state preparation algorithm employs a repeat-until-success scheme in which a subsystem measurement indicates whether a sample is drawn successfully or has to be discarded; the success probability is independent of the number of variables. This is in contrast to MCMC or PAM samplers, where individual samples are not guaranteed to reflect the properties of the desired probability distribution. Notably, the resources for QCGM scale in the worst case exponentially in the number of cliques and in the norm of the model parameter vector, which can be a significant improvement over classical methods. To demonstrate the practical viability of our approach, we provide proof-of-concept experimental results with quantum simulation as well as actual quantum hardware. Notably, the average fidelity of the quantum hardware results, executed with simple readout error mitigation, is 0.987. These results show that our method has the potential to enable unbiased and statistically sound sampling and parameter learning for practically interesting problems on near-term quantum computers.

2 Problem definition

In this section, we formalize the preliminaries and define the problem of drawing unbiased and independent samples from a graphical model using a coherent quantum embedding for Gibbs state preparation. Hence, this section sets the stage for the main results presented in Section 3. To simplify notation, we frequently use the shorthand \(A\oplus B:=|0\rangle \langle 0|\otimes A + |1\rangle \langle 1|\otimes B\) for scalars or matrices A and B.

2.1 Parametrized family of models

We consider positive joint probability distributions over n discrete variables \(\varvec{X}_v\) with realizations \(\varvec{x}_v\in \{0,1\}\) for \(v\in V\). The set of variable indices \(V=\{1,\dots ,n\}\) is referred to as vertex set and its elements \(v\in V\) as vertices. Without loss of generality, the positive probability mass function (pmf) of the n-dimensional random vector \(\varvec{X}\) can be expressed as

$$\begin{aligned} \mathbb {P}_{\varvec{\theta },\phi }(\varvec{X}=\varvec{x}) = \frac{1}{Z(\varvec{\theta })} \exp \left( \sum _{j=1}^d\varvec{\theta }_j \phi _j(\varvec{x}) \right) , \end{aligned}$$
(1)

where \(\phi =(\phi _1,\dots ,\phi _d)\) is a set of basis functions or sufficient statistics that specify a family of distributions and \(\varvec{\theta }\in \mathbb {R}^d\) are parameters that specify a model within this family. Each basis function \(\phi _j\) depends on a specific subset \(C\subseteq V\) of vertices; we call C a clique, and its vertices are called connected. Thus, \(\phi \) implies a graphical structure \(G=(V,E)\) among all vertices, called the conditional independence structure. The number of parameters, d, is the sum over all clique states: \(d=\sum _{C\in \mathcal {C}}2^{|C|}\), i.e., a clique C that contains |C| vertices has \(2^{|C|}\) distinct clique states.
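For concreteness, the following brute-force sketch evaluates Eq. 1 for a hypothetical model with \(n=3\) variables and cliques \(\{0,1\}\) and \(\{1,2\}\); the clique structure and random parameters are illustrative assumptions, and the normalizing constant is computed by explicit summation over all \(2^n\) states.

```python
import itertools
import numpy as np

# Brute-force evaluation of Eq. 1 for a hypothetical model with n = 3
# variables, cliques {0,1} and {1,2}, and indicator statistics: each
# clique contributes the parameter of the one clique state it realizes.
# Clique structure and parameters are illustrative assumptions.
n, cliques = 3, [(0, 1), (1, 2)]
rng = np.random.default_rng(1)
theta = {C: rng.normal(size=(2,) * len(C)) for C in cliques}

def log_potential(x):
    # theta^T phi(x)
    return sum(theta[C][tuple(x[v] for v in C)] for C in cliques)

states = list(itertools.product([0, 1], repeat=n))
Z = sum(np.exp(log_potential(x)) for x in states)   # normalizing constant

def pmf(x):
    return np.exp(log_potential(x)) / Z

total = sum(pmf(x) for x in states)   # sums to one by construction
```

The explicit sum over all \(2^n\) states is exactly the step that becomes intractable for larger n.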

When \(\phi \) is clear from the context, we simply write \(\mathbb {P}_{\varvec{\theta }}\) and drop the explicit dependence on \(\phi \). The quantity \(Z(\varvec{\theta })=\sum _{\varvec{x}\in \{0,1\}^n} \exp \sum _{j=1}^d\varvec{\theta }_j \phi _j(\varvec{x})\) denotes the model’s partition function and is required for normalization such that \(\mathbb {P}\) becomes a proper probability mass function. Equation 1 can be rewritten as

$$\begin{aligned} \mathbb {P}_{\varvec{\theta }}(\varvec{X}=\varvec{x})&= \frac{1}{Z(\varvec{\theta })} \prod _{C\in \mathcal {C}} \exp \left( \sum _{\varvec{y}\in \{0,1\}^{|C|}} \varvec{\theta }_{C,\varvec{y}} \phi _{C,\varvec{y}}(\varvec{x}) \right) \nonumber \\&\quad = \frac{1}{Z(\varvec{\theta })} \prod _{C\in \mathcal {C}} \psi _C(\varvec{x}_C), \end{aligned}$$
(2)

where \(\mathcal {C}\) denotes the set of maximal cliques of the graph implied by \(\phi \). A clique is maximal when it is not contained in any other clique. The equality between (1) and (2) is known as the Hammersley-Clifford theorem (Hammersley and Clifford 1971). Setting

$$\begin{aligned} \phi _{C,\varvec{y}}(\varvec{x}) = \prod _{v\in C} \mathbbm {1}(\varvec{x}_v = \varvec{y}_v), \end{aligned}$$
(3)

is sufficient for representing any arbitrary pmf with conditional independence structure G (Pitman 1936; Besag 1975; Wainwright and Jordan 2008). In this case, the graphical model is called Markov random field. Moreover, \(\phi (\varvec{x})=(\phi _{C,\varvec{y}}(\varvec{x}):C\in \mathcal {C}, \varvec{y}\in \{0,1\}^{|C|})\) represents an overcomplete family, since there exists an entire affine subset of parameter vectors \(\varvec{\theta }\), each associated with the same distribution.
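The overcompleteness can be checked numerically: since the indicator statistics of Eq. 3 satisfy \(\sum _{\varvec{y}}\phi _{C,\varvec{y}}(\varvec{x})=1\) for every \(\varvec{x}\), shifting all parameters of one clique by a constant leaves the distribution unchanged. A minimal sketch with an assumed clique structure:

```python
import itertools
import numpy as np

# Overcompleteness check: the indicator statistics of Eq. 3 satisfy
# sum_y phi_{C,y}(x) = 1 for every x, so adding a constant to all
# parameters of one clique rescales every unnormalized weight by the same
# factor, which cancels against Z. Clique structure is an assumption.
n, cliques = 3, [(0, 1), (1, 2)]
rng = np.random.default_rng(2)
theta = {C: rng.normal(size=(2,) * len(C)) for C in cliques}

def pmf(theta):
    states = itertools.product([0, 1], repeat=n)
    w = np.array([
        np.exp(sum(theta[C][tuple(x[v] for v in C)] for C in cliques))
        for x in states
    ])
    return w / w.sum()

shifted = {C: theta[C] + 0.7 for C in cliques}   # same distribution
```

Both parameter vectors induce identical probability mass functions, which is exactly the affine non-identifiability described above.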

In machine learning, the maximum likelihood principle is applied to estimate the parameters of Eq. 1 based on a given data set \(\mathcal {D}\).

2.2 Quantum embedding of probabilistic graphical models

Classically generating samples from Eq. 1 via naive inversion sampling is intractable as there are \(\mathcal {O}(2^n)\) distinct probabilities involved. For special types of graphical models, e.g., decomposable models, the graphical structure can be exploited to yield inference methods whose time complexity is instead exponential in the size of the largest clique (Wainwright and Jordan 2008) and not in the number of variables. These methods are efficient whenever the size of the largest clique is bounded. However, in the general non-decomposable case, structure cannot be exploited. The aim of this work is hence to devise a coherent quantum embedding which enables an efficient sampling process for general graphical models. More explicitly, we prepare a quantum state whose sampling behavior reflects the sampling behavior of a graphical model. For this purpose, a Hamiltonian \(H_{\varvec{\theta }}\) is defined which represents the conditional independence structure of the graphical model. Furthermore, the non-unitary mapping \(\exp (-H_{\varvec{\theta }})\) is implemented by a gate-based quantum circuit \(\varvec{C}\). Applying \(\exp (-H_{\varvec{\theta }})\) to an initial state given as \(|+\rangle ^{\otimes n}\), where n corresponds to the number of discrete variables in the graphical model, gives a Gibbs state of the form \(2^{-n}{\exp (-H_{\varvec{\theta }})}\). Measuring this quantum state in the computational basis lets us directly extract unbiased samples that are generated by the graphical model at hand. Notably, a (unitary) quantum circuit \(\varvec{C}\) mapping of the (non-unitary) transformation \(\exp (-H_{\varvec{\theta }})\) requires the use of \(n_{\varvec{a}}\) auxiliary qubits \(\varvec{a}\), corresponding to a random variable \(\varvec{A}\). The successful implementation of \(\exp (-H_{\varvec{\theta }})\) is conditioned on measuring \(\varvec{a}\) in the \(|0\rangle ^{\otimes n_{\varvec{a}}}\) state. 
Hence, the probability for measuring a specific bit string \(\varvec{x}\) from the graphical model as the output of the circuit may be written in terms of the Born rule as follows:

$$\begin{aligned} \mathbb {P}_{\varvec{C}}(\varvec{X}= \varvec{x},\varvec{A}=\varvec{0}) = \left| \left( \langle 0|^{\otimes n_{\varvec{a}}}\otimes \langle \varvec{x}|\right) \varvec{C}\left( |0\rangle ^{\otimes n_{\varvec{a}}}\otimes |+\rangle ^{\otimes n}\right) \right| ^2. \end{aligned}$$
(4)

Now, we can formally define the sampling problem.

Definition 1

(Graphical model quantum state preparation problem) Given any discrete graphical model over n binary variables, defined via \((\varvec{\theta },\phi )\), find a quantum circuit \(\varvec{C}\) which maps an initial quantum state \(|+\rangle ^{\otimes n}\) to a quantum state such that

$$\begin{aligned} \mathbb {P}_{\varvec{\theta }}(\varvec{X}= \varvec{x}) = \frac{ \mathbb {P}_{\varvec{C}}(\varvec{X}= \varvec{x},\varvec{A}=\varvec{0})}{\mathbb {P}_{\varvec{C}}(\varvec{A}= \varvec{0})}, \end{aligned}$$
(5)

as specified by Eqs. 1 and 4. Moreover, when \(\varvec{A}\not =\varvec{0}\), the relation between \(\mathbb {P}_{\varvec{\theta }}(\varvec{X}= \varvec{x})\) and \(\mathbb {P}_{\varvec{C}}(\varvec{X}= \varvec{x},\varvec{A}\not =\varvec{0})\) is undefined. We denote the quantity \(\mathbb {P}_{\varvec{C}}(\varvec{A}= \varvec{0})\) as the “success probability.”

In what follows, we will eventually derive a circuit \(\varvec{C}\) for which \(\mathbb {P}_{\varvec{C}}(\varvec{A}=\varvec{0})\) is provably lower bounded by \(\exp (-|\mathcal {C}|\Vert \varvec{\theta }\Vert _\infty )\), i.e., the success probability decays exponentially in the number of cliques and the norm of the parameter vector. In other words, the success probability is at least \(\delta \) when \(\Vert \varvec{\theta }\Vert _\infty \le -\log (\delta )/|\mathcal {C}|\).

Algorithm 1: Pauli-Markov sufficient statistics.

3 Main results

We devise the quantum algorithm QCGM in which each vertex of a graphical model over binary variables is mapped to one qubit of a quantum circuit. In addition, \(|\mathcal {C}|+1\) auxiliary qubits are required to realize specific operations as explained below. Our result consists of two parts. First, we present a derivation of the Hamiltonian \(H_{\varvec{\theta }}\) encoding the un-normalized, negative log-probabilities. Then, we employ \(H_{\varvec{\theta }}\) to construct a quantum circuit that allows us to draw unbiased and independent samples from the respective graphical model.

3.1 The Hamiltonian

We start by transferring the sufficient statistics of the graphical model family into a matrix form which allows us to construct a \(|C_{\max }|\)-local Hamiltonian \(H_{\varvec{\theta }}\) that encodes the parameters \(\varvec{\theta }\) as well as the conditional independence structure G, where \(C_{\max }\) denotes the largest clique in \(\mathcal {C}\) and \(|C_{\max }|\) the number of its vertices. We now explain, step by step, how this Hamiltonian is constructed.

Definition 2

(Pauli-Markov sufficient statistics) Let \(\phi _{C,\varvec{y}}:\{0,1\}^n\rightarrow \mathbb {R}\) for \(C\in \mathcal {C}\) and \(\varvec{y}\in \{0,1\}^{|C|}\) denote the sufficient statistics of some overcomplete family of graphical models. The diagonal matrix \(\Phi _{C,\varvec{y}} \in \{0,1\}^{2^n \times 2^n}\), defined via

$$\begin{aligned} (\Phi _{C,\varvec{y}})_{i,j} = {\left\{ \begin{array}{ll} \phi _{C,\varvec{y}}(\varvec{x}^{(j)}),&{}\text {if }i=j\\ 0,&{}\text {otherwise}, \end{array}\right. } \end{aligned}$$
(6)

denotes the Pauli-Markov sufficient statistic, where \(\varvec{x}^{(j)}\) denotes the j-th full n-bit joint configuration w.r.t. some arbitrary but fixed order.

A naive computation of the Pauli-Markov sufficient statistic for any fixed \((C,\varvec{y})\)-pair is intractable due to the sheer dimension of \(\Phi _{C,\varvec{y}}\). However, it turns out that \(\Phi _{C,\varvec{y}}\) can be efficiently represented with a linear number of Kronecker products of single-qubit Pauli matrices, as described in Algorithm 1. In a nutshell, the algorithm marks all full joint states that coincide with \(\varvec{y}\in \{0,1\}^{|C|}\) on the clique C by setting the corresponding diagonal entries of \(\Phi \) to 1 and all remaining entries to 0.

Obviously, the Pauli representation of the tensor product computed by Algorithm 1 has length \(\Theta (n)\). Hence, the algorithm runs in time linear in n. The correctness of Algorithm 1 is implied by the following theorem.
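The core idea of this construction can be sketched in a few lines: \(\Phi _{C,\varvec{y}}\) equals the Kronecker product of the single-qubit projectors \((I + (-1)^{\varvec{y}_v} Z)/2\) on clique vertices and identities elsewhere. The qubit ordering and the test instance below are assumptions for illustration.

```python
import itertools
import numpy as np

# Sketch of the idea behind Algorithm 1: Phi_{C,y} is the Kronecker
# product of the single-qubit projectors (I + (-1)^{y_v} Z)/2 on clique
# vertices and identities elsewhere, i.e., n single-qubit factors in total.
# Qubit 0 is taken as the most significant bit (an ordering assumption).
I2 = np.eye(2)
Z2 = np.diag([1.0, -1.0])

def phi_matrix(n, C, y):
    M = np.array([[1.0]])
    for v in range(n):
        if v in C:
            M = np.kron(M, (I2 + (-1) ** y[C.index(v)] * Z2) / 2)
        else:
            M = np.kron(M, I2)
    return M

# compare against the definition in Eq. 6 for a small instance
n, C, y = 3, (0, 2), (1, 0)
Phi = phi_matrix(n, C, y)
indicator = np.array([
    float(all(x[v] == y[C.index(v)] for v in C))
    for x in itertools.product([0, 1], repeat=n)
])
```

The diagonal of the Kronecker-product matrix coincides with the indicator values \(\phi _{C,\varvec{y}}(\varvec{x}^{(j)})\), using only n single-qubit factors.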

Theorem 1

(Statistics) Algorithm 1 computes \(\Phi _{C,\varvec{y}}\) with \(\mathcal {O}(n)\) Kronecker products.

The reader will find the proof of Theorem 1 in Appendix 3. Next, we use the construction from Algorithm 1 to define \(H_{\varvec{\theta }}\), which encodes the conditional independence structure G of the underlying random vector via the \(\Phi _{C,\varvec{y}}\) as well as the model parameters \(\varvec{\theta }\). This constitutes the first part of our main result, stated in the following theorem.

Theorem 2

(Hamiltonian) Assume an overcomplete binary graphical model specified by \((\varvec{\theta },\phi )\) is encoded into the \(|C_{\max }|\)-local Hamiltonian \(H_{\varvec{\theta }} = -\sum _{C\in \mathcal {C}} \sum _{\varvec{y}\in \{0,1\}^{|C|}} \varvec{\theta }_{C,\varvec{y}} \Phi _{C,\varvec{y}}\), with \(|C_{\max }|\) denoting the number of nodes in the largest clique \(C_{\max }\) in \(\mathcal {C}\). Then \(\mathbb {P}_{\varvec{\theta }}(\varvec{x}^{(j)})={\text {Tr}}\left[ |j\rangle \langle j|\exp (-H_{\varvec{\theta }})/{\text {Tr}}\left[ \exp (-H_{\varvec{\theta }})\right] \right] \), where \({\text {Tr}}\) denotes the trace and \(|j\rangle \langle j|\) acts as a projector onto the \(j^{\text {th}}\) diagonal entry of the matrix \(\exp (-H_{\varvec{\theta }})/{\text {Tr}}\left[ \exp (-H_{\varvec{\theta }})\right] \).

The reader will find the proof of Theorem 2 in Appendix 4.
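Theorem 2 can be verified numerically on a small assumed instance: assembling \(H_{\varvec{\theta }}\) from Kronecker-product statistics and normalizing the diagonal of \(\exp (-H_{\varvec{\theta }})\) by its trace reproduces Eq. 1. Since \(H_{\varvec{\theta }}\) is diagonal, its matrix exponential is simply the elementwise exponential of its diagonal.

```python
import itertools
import numpy as np

# Numerical check of Theorem 2 on an assumed 3-variable model: H_theta is
# assembled from Kronecker-product statistics; since it is diagonal, its
# matrix exponential is the elementwise exponential of its diagonal.
n, cliques = 3, [(0, 1), (1, 2)]
rng = np.random.default_rng(3)
theta = {C: rng.normal(size=(2,) * len(C)) for C in cliques}
states = list(itertools.product([0, 1], repeat=n))
I2, Z2 = np.eye(2), np.diag([1.0, -1.0])

def phi_matrix(C, y):
    # Phi_{C,y} as n Kronecker products of single-qubit matrices (Alg. 1)
    M = np.array([[1.0]])
    for v in range(n):
        M = np.kron(M, (I2 + (-1) ** y[C.index(v)] * Z2) / 2
                    if v in C else I2)
    return M

H = sum(-theta[C][y] * phi_matrix(C, y)
        for C in cliques
        for y in itertools.product([0, 1], repeat=len(C)))

gibbs_diag = np.exp(-np.diag(H))           # diagonal of exp(-H_theta)
p_gibbs = gibbs_diag / gibbs_diag.sum()    # Tr-normalized, Theorem 2

# reference pmf computed directly from Eq. 1
w = np.array([np.exp(sum(theta[C][tuple(x[v] for v in C)] for C in cliques))
              for x in states])
p_ref = w / w.sum()
```

The Hamiltonian route and the direct evaluation of Eq. 1 yield identical probability vectors.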

3.2 The circuit

Based on the Hamiltonian \(H_{\varvec{\theta }}\) from the previous section, we now construct a circuit that implements the non-unitary operation \(\exp {\left( - H_{\varvec{\theta }}\right) }\). The construction relies on the unitary embedding of \(H_{\varvec{\theta }}\), which corresponds to a special pointwise polynomial approximation, and the factorization over the cliques. Application of this quantum circuit results in a quantum state whose sampling distribution is proportional to that of any desired graphical model over binary variables. Our findings are summarized in the following theorem.

Theorem 3

(Quantum circuit for discrete graphical models) Given any overcomplete discrete graphical model over n binary variables, defined via \((\varvec{\theta },\phi )\) with \(\varvec{\theta }\in \mathbb {R}^d\), there exists a quantum circuit \(\varvec{C}_{\varvec{\theta }}\) over \(m=n+1+|\mathcal {C}|\) qubits that prepares a quantum state whose sampling distribution is equivalent to the graphical model. That is, it solves the problem of Definition 1, namely \(\mathbb {P}_{\varvec{\theta }}(\varvec{x}) = \mathbb {P}_{\varvec{C}_{\varvec{\theta }}}(\varvec{X}=\varvec{x},\varvec{A}=\varvec{0})/\mathbb {P}_{\varvec{C}_{\varvec{\theta }}}(\varvec{A}=\varvec{0})\).

The reader will find the proof of Theorem 3 in Appendix 3. In practice, samples from the discrete graphical model are drawn by measuring the joint state of auxiliary and target qubits: a sample is valid and kept only if every auxiliary qubit is measured as \(|0\rangle \), and discarded otherwise.

Corollary 1

(Sampling success probability) The success probability for measuring a sample \(\varvec{x}\) with a quantum circuit satisfying Theorem 3 is lower bounded via

$$\begin{aligned} \mathbb {P}(\text {\texttt {SUCCESS}})=\mathbb {P}_{\varvec{C}_{\varvec{\theta }}}(\varvec{A}=\varvec{0}) \ge \exp (-|\mathcal {C}|\Vert \varvec{\theta }\Vert _\infty )\;. \end{aligned}$$

Corollary 1 is proven in Appendix 6. It should be noted that while the lower bound on the success probability depends on the number of cliques, it is independent of the total number of variables. Interestingly, setting \(\Vert \varvec{\theta }\Vert _\infty = k \log \left( \root |\mathcal {C}| \of {n} \right) \) expresses our lower bound on the success probability as a polynomial in n, i.e., \(\mathbb {P}(\text {\texttt {SUCCESS}}) \ge n^{-k}\), for some constant k.

Fig. 1: Exemplary quantum circuit \(\varvec{C}_{\varvec{\theta }}\) as specified in Eq. 9. In this example, the underlying graph has vertex set \(V=\{v_0,v_1,v_2\}\) and clique set \(\mathcal {C}=\{A,B\}\). The circuit requires |V| target qubits, denoted by \(x_0,x_1,x_2\), and \(|\mathcal {C}|+1=3\) auxiliary qubits: \(a_0\) for the unitary embedding \(U^j\) of the Hamiltonian characterizing the sufficient statistic, and one qubit per clique (\(a_1\) and \(a_2\)) for the real part extraction.

An exemplary circuit according to Theorem 3 is shown in Fig. 1. Each vertex in the graph is identified with a circuit qubit, i.e., the first n qubits of the circuit realize the target register \(|\varvec{x}\rangle \) that represents the n binary variables of the graphical model. Furthermore, the latter \(n_{\varvec{a}}=1+|\mathcal {C}|\) qubits represent an auxiliary register \(|\varvec{a}\rangle \) which is required for the unitary embedding and the extraction of real parts as described in Appendix 3. The Hadamard gates at the beginning are required to bring the target register into the state \(|+\rangle ^{\otimes n}\), as described in Section 2.2. The structural information corresponding to the cliques can be seen to be implemented via a unitary embedding. Finally, the sample readout is realized via a measurement of the target system and conditioned on the measurement outcome of the last \(|\mathcal {C}|\) auxiliary qubits. Notably, these auxiliary qubits can already be measured before sampling the state of \(|\varvec{x}\rangle \). This allows for an early restart whenever real part extraction fails.

The unitaries \(U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}})\) manipulate the quantum state such that the sampling distribution of the target register \(|\varvec{x}\rangle \) becomes proportional to \(\mathbb {P}_{\varvec{\theta }}\). More specifically, these unitaries embed the statistics \(\exp (\varvec{\theta }_{C,\varvec{y}} \Phi _{C,\varvec{y}})\), where \(\Phi _{C,\varvec{y}}\) is computed via Algorithm 1. The embedding is

$$\begin{aligned} U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}}) = I^{\otimes n+1} + \left( \exp (i2\varvec{\gamma }_{C,\varvec{y}}) - 1\right) \left( \Phi _{C,\varvec{y}} \oplus (I^{\otimes n} - \Phi _{C,\varvec{y}})\right) \end{aligned}$$
(7)

with \(\varvec{\gamma }_{C,\varvec{y}} = (1/2) \arccos (\exp (\varvec{\theta }_{C,\varvec{y}}/2))\), which is well-defined for \(\varvec{\theta }_{C,\varvec{y}}\le 0\); due to overcompleteness, the parameters can always be shifted to be non-positive without changing the distribution. The decomposition of \(U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}})\) into basis gates is explained in Section 3.3. We see from Eq. 7 that the first auxiliary qubit arises from the construction of \(U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}})\). Moreover, a real part extraction is required for each

$$\begin{aligned} U^C = \prod _{\varvec{y}\in \{0,1\}^{|C|}} U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}})\;. \end{aligned}$$
(8)

As described in Appendix 2, real part extraction is a repeat-until-success procedure—an auxiliary qubit indicates whether the extraction was successful. Since real parts of all \(U^C\) are required, this contributes \(|\mathcal {C}|\) additional auxiliary qubits, which gives us a total of \(n_{\varvec{a}}=1+|\mathcal {C}|\). Making the real part extraction explicit reveals that our circuit construction shares a defining property of undirected graphical models.
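The net effect of the construction can be sketched classically. Assuming the angle convention \(\cos (2\varvec{\gamma }_{C,\varvec{y}})=\exp (\varvec{\theta }_{C,\varvec{y}}/2)\) (consistent with the relation \(\varvec{\theta }_j = 2\log \cos (2\varvec{\gamma }_j)\) used in Section 5.2) and parameters shifted to be non-positive, post-selecting all auxiliary qubits on \(|0\rangle \) leaves each basis state \(|\varvec{x}\rangle \) with an amplitude proportional to the product of its per-clique cosine factors; squaring and renormalizing recovers Eq. 5. The model instance is an illustrative assumption.

```python
import itertools
import numpy as np

# Post-selected sampling distribution, computed analytically. Assumption:
# cos(2*gamma_{C,y}) = exp(theta_{C,y}/2), matching theta = 2 log cos(2*gamma)
# from Section 5.2; parameters are shifted to be non-positive per clique
# (allowed by overcompleteness) so that all angles are real.
n, cliques = 3, [(0, 1), (1, 2)]
rng = np.random.default_rng(4)
theta = {C: rng.normal(size=(2,) * len(C)) for C in cliques}
theta = {C: t - t.max() for C, t in theta.items()}    # non-positive shift
gamma = {C: 0.5 * np.arccos(np.exp(theta[C] / 2)) for C in cliques}

states = list(itertools.product([0, 1], repeat=n))

# amplitude of (x, a = 0), up to the uniform factor 2^{-n/2} from |+>^n
amp = np.array([
    np.prod([np.cos(2 * gamma[C][tuple(x[v] for v in C)]) for C in cliques])
    for x in states
])
p_post = amp ** 2 / (amp ** 2).sum()   # Born rule + renormalization, Eq. 5

# reference pmf of Eq. 1 (invariant under the per-clique shift)
w = np.array([np.exp(sum(theta[C][tuple(x[v] for v in C)] for C in cliques))
              for x in states])
p_ref = w / w.sum()
```

Under the stated angle convention, the post-selected Born probabilities coincide exactly with the pmf of the graphical model.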

Corollary 2

(Unitary Hammersley-Clifford) Take \(U^C\) from Eq. 8 and set

$$ \tilde{U}^C = H_j ( I^{\otimes (|\mathcal {C}|-j-1)} \otimes ( ( I^{\otimes j} \otimes U^C) \oplus (I^{\otimes j} \otimes {(U^C)}^{\dagger } ))) H_j\;, $$

where \(j:=j(C)\) is the 0-based index of the clique C in some arbitrary but fixed ordering of all cliques \(\mathcal {C}\), and \(H_j:= I^{\otimes (|\mathcal {C}|-j-1)} \otimes H \otimes I^{\otimes (n+1+j)}\). The circuit can then be rewritten as

$$\begin{aligned} \varvec{C}_{\varvec{\theta }} = \prod _{C\in \mathcal {C}} \tilde{U}^C\;, \end{aligned}$$
(9)

which reveals the clique factorization of probabilistic graphical models, as predicted by the Hammersley-Clifford theorem (Hammersley and Clifford 1971).

3.3 Decomposition of clique gates

At the core of our circuit construction lies a unitary embedding of the clique factors for each clique C and each corresponding clique-state \(\varvec{y}\in \{0,1\}^{|C|}\) as given by Eq. 7. To arrive at a decomposition of \(U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}})\) in terms of basis gates, one has to notice two facts: (i) \(U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}})\) is diagonal, and (ii) the diagonal of \(U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}})\) carries only two distinct values, namely \(u_0=1\) and \(u_1=\exp (i2\varvec{\gamma }_{C,\varvec{y}})\). Any unitary U with these properties can be decomposed whenever one can identify a unitary \(U_f: |\varvec{x},a\rangle \rightarrow |\varvec{x},a~\texttt {XOR}~f(\varvec{x})\rangle \) with \(f: \{0,1\}^n \rightarrow \{0,1\}\) such that the function value of f at \(\varvec{x}\) indicates whether the corresponding diagonal entry is the first or the second value (Hogg et al. 1999, [Section 2.1]). We now apply this construction to find a decomposition of our clique gates.

Theorem 4

(Decomposition) Consider \(U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}})\) from Eq. 7. It holds that \(U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}}) = U^{C,\varvec{y}}_{\texttt {AND}} ( P(2\varvec{\gamma }_{C,\varvec{y}}) \otimes I^{\otimes n} ) U^{C,\varvec{y}}_{\texttt {AND}}\). Here, P is a phase gate, and \(U^{C,\varvec{y}}_{\texttt {AND}}\) is a Boolean-AND gate over the qubits in C, whose inputs are negated in accordance with \(\varvec{y}\).
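Theorem 4 can be checked numerically for a small assumed instance by comparing the matrix of Eq. 7 against the AND-phase-AND product; the ancilla is taken as the leading tensor factor, matching the \(\oplus \) convention of Section 2. Instance size, clique, and angle are illustrative assumptions.

```python
import itertools
import numpy as np

# Numerical check of Theorem 4: n = 2, C = {0, 1}, y = (1, 0), gamma = 0.4.
n, C, y = 2, (0, 1), (1, 0)
gamma = 0.4
states = list(itertools.product([0, 1], repeat=n))

# diagonal of Phi_{C,y}: indicator of x agreeing with y on the clique
phi = np.array([float(all(x[v] == y[C.index(v)] for v in C)) for x in states])
I_n = np.eye(2 ** n)
P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])

def oplus(A, B):
    # A (+) B = |0><0| x A + |1><1| x B, ancilla as leading factor
    return np.kron(P0, A) + np.kron(P1, B)

# Eq. 7 directly
U = (np.eye(2 ** (n + 1), dtype=complex)
     + (np.exp(2j * gamma) - 1) * oplus(np.diag(phi), I_n - np.diag(phi)))

# decomposition of Theorem 4: AND gate, phase on the ancilla, AND gate
U_and = np.kron(np.eye(2), np.diag(1 - phi)) + np.kron(X, np.diag(phi))
P_phase = np.diag([1.0, np.exp(2j * gamma)])
U_dec = U_and @ np.kron(P_phase, I_n) @ U_and
```

Both constructions yield the same unitary matrix, confirming the decomposition for this instance.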

The reader will find the proof of Theorem 4 in Appendix 7. Finally, the above result implies an upper bound on the depth of \(\varvec{C}_{\varvec{\theta }}\).

Corollary 3

(Depth of \(\varvec{C}_{\varvec{\theta }}\)) The depth of \(\varvec{C}_{\varvec{\theta }}\) is in \(\mathcal {O}(d\times {\text {depth}}(\texttt {AND}))\).

The corollary follows by observing that \(\varvec{C}_{\varvec{\theta }}\) essentially consists of d \(U^{C,\varvec{y}}({\varvec{\theta }_{C,\varvec{y}}})\) gates, each of which consists of two AND gates. Here, \({\text {depth}}(\texttt {AND})\) denotes the circuit depth of a unitary operator that implements a Boolean AND over \(|C_{\max }|\) qubits (Nielsen and Chuang 2016).

4 Limitations

As shown in Corollary 1, drawing an unbiased and independent sample might fail with probability \(1-\delta \le 1-\exp (-|\mathcal {C}|\Vert \varvec{\theta }\Vert _{\infty })\). The practical impact of this fact is studied in the experiments. Fortunately, measuring the extra qubits tells us whether the real part extraction succeeded. In expectation, we have to repeat the procedure at most \(\exp (|\mathcal {C}|\Vert \varvec{\theta }\Vert _{\infty })\) times until we observe a success. As shown in Appendix 6, the exact success probability depends on \(Z(\varvec{\theta })\).

In fact, computing this quantity requires a full and hence exponentially expensive simulation of the quantum system. In practice, we only have access to an empirical estimate \(\hat{\delta }:=\hat{\mathbb {P}}(\text {\texttt {SUCCESS}})\) obtained from multiple QCGM runs. Amplification techniques (Brassard et al. 2000; Grover 2005; Yoder et al. 2014; Gilyen et al. 2019) can help us to increase the success probability at the cost of additional auxiliary qubits or a higher circuit depth. Specifically, applying a singular value transformation (Gilyen et al. 2019) can raise \(\delta \) to \(1 - \varepsilon \) for any desired \(\varepsilon >0\) with additional depth \(\Omega (\log (1/\varepsilon )/\sqrt{\delta })\). Moreover, the real part extraction necessitates the use of additional auxiliary qubits, one per graph clique. The increase in qubit number can be prohibitive for the realization of large models with current quantum computers. In principle, the number of auxiliary qubits may be reduced with intermediate measurements. Since the real part extractions are applied in series, one may use a single auxiliary qubit which is measured and reset to \(|0\rangle \) after every extraction. Due to limited coherence times and physical noise, increasing the qubit number or the circuit depth makes it harder to run the algorithm on near-term quantum hardware. Thus, we apply neither intermediate measurements nor amplitude amplification in order to ensure the feasibility of our approach on actual quantum computing hardware. The hardware-related limitations of our method can be summarized as follows.

Theorem 5

(Resource limitations) The circuit construction from Theorem 3 requires \(|\mathcal {C}|+1\) extra qubits. The expected runtime until a valid sample is generated is \(\mathcal {O}(1/\delta )\) with \(\delta =\exp (-|\mathcal {C}|\Vert \varvec{\theta }\Vert _{\infty })\), and hence exponential in the number of cliques.
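The \(\mathcal {O}(1/\delta )\) expected runtime is just the mean of a geometric distribution, which a short simulation illustrates; the value of \(\delta \) below is an arbitrary illustrative choice, not derived from a concrete model.

```python
import numpy as np

# Repeat-until-success as a geometric distribution: with per-shot success
# probability delta, the expected number of circuit executions until the
# first success is 1/delta. delta = 0.2 is an arbitrary illustrative value.
rng = np.random.default_rng(5)
delta = 0.2

def shots_until_success(rng):
    shots = 1
    while rng.random() >= delta:   # auxiliary measurement failed: restart
        shots += 1
    return shots

trials = np.array([shots_until_success(rng) for _ in range(100_000)])
mean_shots = trials.mean()          # concentrates around 1/delta = 5
```

The empirical mean of the restart counts concentrates around \(1/\delta \), matching the expected runtime stated in Theorem 5.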

Finally, our method assumes a discrete graphical model family with binary variables and overcomplete sufficient statistics. Nevertheless, any discrete family with vertex alphabets of size k can be transformed into an equivalent family with \(\mathcal {O}(n \log _2 k)\) binary variables. Clearly, increasing the number of variables increases the number of required qubits which complicates the execution of our method on actual quantum processors.

5 Inference

The four main inference tasks that can be done with graphical models are (i) sampling, (ii) MAP inference, (iii) parameter learning, and (iv) estimating the partition function. The ability to generate samples from the graphical model follows directly from Theorem 3. Here, we provide the foundations required to address inference tasks (ii)–(iv) based on our circuit construction.

5.1 MAP prediction

Computing the MAP state of a discrete graphical model is required when the model serves as the underlying engine of some supervised classification procedure. More precisely, the MAP problem is \( \varvec{x}^* = {\text {arg\,max}}_{\varvec{x}\in \{0,1\}^n} \mathbb {P}_{\varvec{\theta }}(\varvec{x}) = {\text {arg\,max}}_{\varvec{x}\in \{0,1\}^n} \varvec{\theta }^\top \phi (\varvec{x}) \). Theorem 2 asserts that our Hamiltonian \(H_{\varvec{\theta }}\) carries \(-\varvec{\theta }^\top \phi (\varvec{x})\) for all \(\varvec{x}\in \{0,1\}^n\) on its diagonal. Since \(H_{\varvec{\theta }}\) is itself diagonal, \(-\varvec{\theta }^\top \phi (\varvec{x}^*)\) is the smallest eigenvalue of \(H_{\varvec{\theta }}\). Computing the smallest eigenvalue and the corresponding eigenvector—which corresponds to \(\varvec{x}^*\) in the \(2^n\)-dimensional state space—is a well-studied QMA-hard problem. Heuristic algorithms like the variational quantum eigensolver (Peruzzo et al. 2014) or variational quantum imaginary time evolution (McArdle et al. 2019; Yuan et al. 2019) can be applied to our Hamiltonian in order to approximate the MAP state.
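As a classical reference point, the diagonal structure makes the MAP problem a brute-force argmin over the diagonal of \(H_{\varvec{\theta }}\); the sketch below, on an assumed toy model, computes what a quantum heuristic such as VQE would approximate.

```python
import itertools
import numpy as np

# Brute-force MAP reference on an assumed toy model: H_theta is diagonal
# with entries -theta^T phi(x), so the MAP state corresponds to the
# smallest diagonal entry (the minimal eigenvalue).
n, cliques = 3, [(0, 1), (1, 2)]
rng = np.random.default_rng(6)
theta = {C: rng.normal(size=(2,) * len(C)) for C in cliques}
states = list(itertools.product([0, 1], repeat=n))

H_diag = np.array([
    -sum(theta[C][tuple(x[v] for v in C)] for C in cliques) for x in states
])
x_map = states[int(np.argmin(H_diag))]   # MAP state

# sanity: the MAP state also maximizes the (unnormalized) probability
p_unnorm = np.exp(-H_diag)
```

The minimal eigenvalue of the diagonal Hamiltonian and the maximal unnormalized probability single out the same state, as the MAP formulation requires.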

Moreover, sampling at low temperature can also be applied to approximate the MAP state: sampling with inverse temperature \(\beta \) effectively samples from a model with parameters \(\varvec{\theta }'=\beta \varvec{\theta }\). According to Corollary 1, however, driving the temperature to 0, i.e., \(\beta \rightarrow \infty \), drives \(\Vert \varvec{\theta }'\Vert _{\infty }\) to \(\infty \), which would result in an arbitrarily small success probability.

5.2 Parameter learning

Parameters of the graphical model can be learned consistently via the maximum likelihood principle. Given a data set \(\mathcal {D}\) that contains samples from the desired distribution \(\mathbb {P}^*\), we have to minimize the convex objective \(\ell (\varvec{\theta })=-(1/|\mathcal {D}|)\sum _{\varvec{x}\in \mathcal {D}} \log \mathbb {P}_{\varvec{\theta }}(\varvec{x})\) with respect to \(\varvec{\theta }\). Differentiation reveals \(\nabla \ell (\varvec{\theta })=\hat{\varvec{\mu }}-\tilde{\varvec{\mu }}\) where \(\tilde{\varvec{\mu }}=(1/|\mathcal {D}|)\sum _{\varvec{x}\in \mathcal {D}}\phi (\varvec{x})\) and \(\hat{\varvec{\mu }}=\sum _{\varvec{x}\in \{0,1\}^n}\mathbb {P}_{\varvec{\theta }}(\varvec{x}) \phi (\varvec{x})\). The latter quantity can be estimated by drawing samples from the graphical model while \(\tilde{\varvec{\mu }}\) is a constant computed from \(\mathcal {D}\). Notably, our circuit does not require a quantum-specific training procedure, since the circuit \(\varvec{C}_{\varvec{\theta }}\) is itself parametrized by the canonical parameters \(\varvec{\theta }\) of the corresponding family. This allows us to run any iterative numerical optimization procedure on a digital computer and employ our circuit as a sampling engine for estimating \(\hat{\varvec{\mu }}\). That is,

$$\begin{aligned} \varvec{\theta }^{(t+1)} = \varvec{\theta }^{(t)} - \frac{1}{2|\mathcal {C}|} \nabla \ell (\varvec{\theta }^{(t)}) \end{aligned}$$
(10)

Therein, \(\nicefrac {1}{2|\mathcal {C}|}\) is a step size that guarantees convergence to an \(\epsilon \)-optimal solution within \(\mathcal {O}(|\mathcal {C}|)\) steps (Piatkowski 2018, Theorem 2.4).
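The moment-matching nature of this update can be sketched classically on a toy model. Here \(\hat{\varvec{\mu }}\) is computed by exact enumeration instead of circuit sampling, and the feature map, data set, and iteration count are illustrative choices, not taken from the paper:

```python
import numpy as np
from itertools import product

# Pairwise feature map on n = 2 binary variables; |C| = 1 maximal clique.
def phi(x):
    return np.array([x[0], x[1], x[0] * x[1]], dtype=float)

states = [np.array(s) for s in product([0, 1], repeat=2)]

def model_marginals(theta):
    # \hat mu = E_theta[phi(x)], here by exact enumeration (stand-in for sampling).
    w = np.exp([theta @ phi(x) for x in states])
    p = w / w.sum()
    return sum(pi * phi(x) for pi, x in zip(p, states))

# Synthetic data and its empirical sufficient statistics \tilde mu.
data = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
mu_data = np.mean([phi(x) for x in data], axis=0)

theta = np.zeros(3)
step = 1.0 / (2 * 1)                           # 1/(2|C|) with |C| = 1
for _ in range(2000):
    grad = model_marginals(theta) - mu_data    # nabla ell = \hat mu - \tilde mu
    theta = theta - step * grad                # descent step on the convex loss
```

At convergence the model moments match the data moments, i.e., \(\hat{\varvec{\mu }} \approx \tilde{\varvec{\mu }}\).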

We remark that our circuit allows for an alternative way to estimate the parameters. The circuit can be parametrized by a vector of rotation angles \(\varvec{\gamma }\). Thus, we may learn \(\varvec{\gamma }\) directly, utilizing a quantum gradient \(\nabla _{\varvec{\gamma }}^{\varvec{x}}\) (Dallaire-Demers and Killoran 2018; Zoufal 2021), where

$$\begin{aligned} \left( \nabla _{\varvec{\gamma }}^{\varvec{x}}\right) _j = \textstyle {\frac{\partial }{\partial \varvec{\gamma }_j}} \sum \limits _{i\in \{0,1\}}|(\langle 0|^{\otimes |\mathcal {C}|}\otimes \langle i|\otimes \langle \varvec{x}|)\varvec{C}_{\varvec{\gamma }} |+\rangle ^{\otimes m}|^2. \end{aligned}$$

The corresponding canonical parameters can be recovered from \(\varvec{\gamma }\) via \(\varvec{\theta }_j = 2\log \cos (2\varvec{\gamma }_j)\).
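The map between rotation angles and canonical parameters is invertible on the relevant domain, which a few lines make explicit. Note that \(\varvec{\theta }_j = 2\log \cos (2\varvec{\gamma }_j)\) requires \(\cos (2\varvec{\gamma }_j) \in (0,1]\), i.e., non-positive \(\varvec{\theta }_j\), consistent with the negative parameters used in the experiments; the inverse map below is derived from this identity and is our addition:

```python
import numpy as np

def gamma_to_theta(gamma):
    # theta_j = 2 * log(cos(2 * gamma_j)), valid for gamma_j in [0, pi/4)
    return 2.0 * np.log(np.cos(2.0 * gamma))

def theta_to_gamma(theta):
    # Inverse map: gamma_j = arccos(exp(theta_j / 2)) / 2, valid for theta_j <= 0
    return np.arccos(np.exp(theta / 2.0)) / 2.0

theta = np.array([-0.5, -1.0, -0.1])
gamma = theta_to_gamma(theta)
print(np.round(gamma_to_theta(gamma), 6))  # round trip recovers theta
```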

5.3 Approximating the partition function

Estimating the partition function \(Z(\varvec{\theta })\) of a graphical model allows us to compute the probability of any desired state directly via the exponential family form \(\mathbb {P}_{\varvec{\theta }}(\varvec{x})=\exp (\varvec{\theta }^\top \phi (\varvec{x})-\ln Z(\varvec{\theta }))\). Computing \(Z(\varvec{\theta })\) is a well-recognized problem because of its complexity—the problem is #P-hard. Nevertheless, given a set \(\mathcal {S}\) of samples from \(\mathbb {P}_{\varvec{\theta }}\), the Ogata-Tanemura method (Ogata and Tanemura 1981; Potamianos and Goutsias 1997) can be applied to obtain an unbiased estimate of the inverse partition function via \({1}/{\hat{Z}(\varvec{\theta })} = (2^{-n}/|\mathcal {S}|) \sum _{\varvec{x}\in \mathcal {S}} 1/\exp (\varvec{\theta }^\top \phi (\varvec{x}))\). Moreover, following Bravyi et al. (2021), there is a circuit that employs our \(\varvec{C}_{\varvec{\theta }}\) to estimate \(\tau (\varvec{\theta })=2^{1-n} Z(\varvec{\theta })\) with accuracy \(\varepsilon \) and probability of at least 3/4. The procedure adds \(\mathcal {O}(\log \nicefrac {1}{\varepsilon })\) extra auxiliary qubits and increases the depth to \(\mathcal {O}( d {\text {depth}}(\texttt {AND}) ({\text {poly}}(n)+1/(\varepsilon \sqrt{\tau (\varvec{\theta })})))\). The basic idea is to apply quantum trace estimation as defined in (Bravyi et al. 2021, Theorem 7) to the matrix from Eq. E4 (provided in Appendix 5). Due to its high depth, the resource consumption of this procedure is prohibitive for current quantum computers. Nevertheless, it opens up avenues for probabilistic inference on upcoming fault-tolerant quantum hardware.
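The Ogata-Tanemura estimator can be verified on a toy model where \(Z(\varvec{\theta })\) is still computable by enumeration. The feature map and parameters below are illustrative, and exact samples drawn from the enumerated distribution stand in for samples produced by \(\varvec{C}_{\varvec{\theta }}\):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def phi(x):
    return np.array([x[0], x[1], x[0] * x[1]], dtype=float)

theta = np.array([-0.4, -0.8, -0.6])
n = 2
states = [np.array(s) for s in product([0, 1], repeat=n)]

# Ground truth for comparison (feasible only for tiny n).
Z = sum(np.exp(theta @ phi(x)) for x in states)

# Exact samples from P_theta (stand-in for circuit samples).
p = np.array([np.exp(theta @ phi(x)) for x in states]) / Z
idx = rng.choice(len(states), size=100_000, p=p)
samples = [states[i] for i in idx]

# Ogata-Tanemura: unbiased estimate of 1/Z from model samples.
inv_Z_hat = (2.0 ** -n) * np.mean([1.0 / np.exp(theta @ phi(x)) for x in samples])
print(1.0 / inv_Z_hat, Z)
```

Unbiasedness follows since \(\mathbb {E}_{\varvec{\theta }}[e^{-\varvec{\theta }^\top \phi (\varvec{x})}] = \sum _{\varvec{x}} e^{\varvec{\theta }^\top \phi (\varvec{x})}/Z \cdot e^{-\varvec{\theta }^\top \phi (\varvec{x})} = 2^n/Z\).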

6 Experimental evaluation

Here, we want to evaluate the practical behavior of our method, named QCGM, by answering a set of questions which are discussed below. Theorem 3 provides the guarantee that the sampling distribution of our QCGM is proportional to that of some given discrete graphical model. However, actual quantum computers are not perfect and the computation is influenced by various sources of quantum noise, each having an unknown distribution (Nielsen and Chuang 2016). Hence, we investigate the following:

(Q1):

How close is the sampling distribution of QCGM on actual state-of-the-art quantum computing devices to a noise-free quantum simulation, classical Gibbs sampling, and classical perturb-and-MAP sampling?

Measuring the auxiliary qubits of \(\varvec{C}_{\varvec{\theta }}\) tells us if the real part extraction has failed or not. While our theoretical insights provide a lower bound on the success probability, the exact success probability is unknown. The second question we address with our experiments is hence as follows:

(Q2):

What empirical success probability should we expect and how does \(\hat{\delta }\) vary as a function of \(\Vert \varvec{\theta }\Vert _{\infty }\)?

Third, as explained in Section 5.2, parameter learning for QCGM can be done analogously to the classical graphical model, based on a data set \(\mathcal {D}\) and samples \(\mathcal {S}\) from the circuit. As explained above, samples from the actual quantum processor will be noisy. It has long been known that error-prone gradient estimates can still lead to reasonable models as long as inference is carried out via the same error-prone method (Wainwright 2006). Our last question is thus as follows:

(Q3):

Can we estimate the parameters of a discrete graphical model in the presence of quantum noise?

Table 1 Average Hellinger fidelities (F) and their standard errors over ten runs, computed from \(N=10,000\) samples

6.1 Experimental setup

For question \({\textbf {(Q1)}}\), we design the following experiment. First, we fix the conditional independence structures shown in the top row of Table 1. For each structure, we generate 10 graphical models with random parameter vectors drawn from a negative half-normal distribution with scales \(\sigma \in \{\nicefrac {1}{10},\nicefrac {1}{4},\nicefrac {1}{2}\}\). QCGM is implemented using Qiskit (Abraham et al. 2019) and realized by applying the circuit \(\varvec{C}_{\varvec{\theta }}\) to the state \(|0\rangle ^{\otimes (|\mathcal {C}|+1)}\otimes |+\rangle ^{\otimes n}\). The probabilities for sampling \(\varvec{x}\) are evaluated by taking \(N=10,000\) samples (shots) from the prepared quantum state and computing the relative frequencies of the respective \(\varvec{x}\). The quantum simulation is carried out by the Qiskit QASM simulator which is noise-free. Any error that occurs in the simulation runs is thus solely due to the fact that we draw a finite number of samples. The experiments on actual quantum computers are carried out on three superconducting quantum processors (IBM Quantum 2021): a 27-qubit IBM Quantum Falcon QPU (ibmq_ehningen), a 127-qubit IBM Quantum Eagle QPU (ibm_sherbrooke), and a 133-qubit IBM Quantum Heron QPU (ibm_torino). The results are post-processed with Matrix-free Measurement Mitigation (M3) (Nation et al. 2021).

For a comparison with classical sampling methods, we consider the benchmark methods described in Appendix A.1. First, Gibbs sampling is performed with a fixed burn-in of \(b=10\) samples. Moreover, we discard b samples between every two accepted Gibbs samples to promote independence. These choices are heuristics and prone to error. Second, we apply perturb-and-MAP sampling (Papandreou and Yuille 2011; Hazan et al. 2013) with sum-of-gamma (SoG) perturbations (Niepert et al. 2021). The SoG approach is an efficient heuristic that results in a slightly biased sampling distribution (see Appendix A.1 for details).
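A minimal sketch of the Gibbs baseline with burn-in and thinning gap \(b=10\) is given below. The feature map and parameter vector are toy stand-ins; only the burn-in/thinning scheme mirrors the setup described above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy pairwise model on 3 binary variables (features are illustrative).
def phi(x):
    return np.array([x[0], x[1], x[2], x[0] * x[1], x[1] * x[2]], dtype=float)

theta = np.array([-0.3, -0.5, -0.2, -0.9, -0.4])
n, b = 3, 10  # b: burn-in length and thinning gap, as in the experiments

def gibbs_samples(num):
    x = rng.integers(0, 2, size=n)
    out, t = [], 0
    while len(out) < num:
        for i in range(n):            # one full sweep over all variables
            x0, x1 = x.copy(), x.copy()
            x0[i], x1[i] = 0, 1
            w0 = np.exp(theta @ phi(x0))
            w1 = np.exp(theta @ phi(x1))
            x[i] = rng.random() < w1 / (w0 + w1)
        t += 1
        if t > b and (t - b) % b == 0:  # burn-in, then keep every b-th sweep
            out.append(x.copy())
    return np.array(out)

S = gibbs_samples(1000)
print(S.mean(axis=0))  # empirical marginals of the three variables
```

As noted in the text, \(b\) is a heuristic: too small a gap leaves residual correlation between accepted samples, with no built-in indication of convergence.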

The quality of each sampling procedure is assessed by the Hellinger fidelity, defined for two probability mass functions \(\mathbb {P}\) and \(\mathbb {Q}\) via \(F(\mathbb {P},\mathbb {Q}) = (\sum _{\varvec{x}\in \{0,1\}^n} \sqrt{\mathbb {P}(\varvec{x})\mathbb {Q}(\varvec{x})})^2\). For question \({\textbf {(Q2)}}\), we consider the very same setup as above, but instead of F, we compute the empirical success rate of the QCGM as \(\hat{\delta }=\text {number of succeeded samplings}/N\). This is computed for each of the 10 runs on each quantum computer and the quantum simulator.
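The Hellinger fidelity and its estimation from samples amount to a few lines; the helper for empirical probability mass functions is our addition for illustration:

```python
import numpy as np
from collections import Counter
from itertools import product

def hellinger_fidelity(P, Q):
    # F(P, Q) = (sum_x sqrt(P(x) * Q(x)))^2 for pmfs over the same state space
    return float(np.sum(np.sqrt(P * Q)) ** 2)

def empirical_pmf(samples, n):
    # Relative frequencies over all 2^n binary states, in lexicographic order.
    counts = Counter(map(tuple, samples))
    states = list(product([0, 1], repeat=n))
    return np.array([counts.get(s, 0) for s in states]) / len(samples)

P = np.array([0.5, 0.25, 0.0, 0.25])
Q = empirical_pmf([(0, 0), (0, 1), (0, 0), (1, 1)], 2)
print(hellinger_fidelity(P, Q))  # identical distributions give fidelity 1
```

Fidelity 1 indicates identical distributions, while disjoint supports yield 0.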

Finally, for question \({\textbf {(Q3)}}\), we draw N samples from a graphical model with edge set \(\{(0,1),(1,2)\}\) via Gibbs sampling. Based on these samples, learning \(\varvec{\theta }\) is carried out as described in Section 5.2. We choose ADAM (Kingma and Ba 2015) for numerical optimization to compensate for the noisy gradient information. For each of the 30 training iterations, we report the estimated negative average log-likelihood and the empirical success rate. Since the loss function is convex, the optimization procedure is independent of the initial solution. For simplicity, we initialized the training at \(\varvec{\theta }=\varvec{0}\).

Fig. 2
figure 2

Left: Learning curve for QCGM over 30 ADAM iterations on a QASM simulation and an actual superconducting quantum processor (ibmq_ehningen). Right: The empirical success rate \(\hat{\delta }\) of each training iteration

6.2 Experimental results

The results of our experimental evaluation are shown in Table 1 and Fig. 2. Reported numbers are averages and standard errors over 10 runs. Regarding question (Q1), we see that the fidelity of the simulation outperforms the classical benchmark methods. This can be explained by the fact that QCGM is guaranteed to return unbiased and independent samples from the underlying model. Gibbs sampling can only achieve this if the hyper-parameters are selected carefully, and PAM only when an exact perturbation and an exact MAP solver are employed. Moreover, the fidelity on actual quantum hardware degrades as the model becomes more complex. Yet, the average fidelity rarely drops below 0.98, and each of these drops occurs on the same two structures that are also the worst cases for Gibbs sampling. Since both models constitute the largest structures considered, they also correspond to the largest circuits, which clearly suffer from the largest amount of hardware noise. On the larger QPUs (ibm_sherbrooke and ibm_torino), one of these structures exhibits not only the worst fidelity but also the largest standard errors. We conjecture that QPUs with a large qubit count incur a larger uncertainty for the multi-qubit operations required for implementing the two cliques of size three. When one considers the best result on each structure (and not the average), QCGM frequently attains fidelities of \(>0.98\) also on quantum hardware. In summary, we conclude that QCGM produces valid samples in the presence of actual quantum noise and has the potential to outperform classical sampling methods.

Fig. 3
figure 3

The empirical success rate \(\hat{\delta }\) and its relation to the uniform norm of the parameter vector and the scale of the underlying parameter distribution, respectively. Each success rate is estimated from 10k shots for a model with structure \(G=(\{0,1\},\{(0,1)\})\) and a random parameter vector \(\varvec{\theta }\), sampled from a half-normal distribution with scales \(\sigma \in \{\nicefrac {1}{10},\nicefrac {1}{4},\nicefrac {1}{2}\}\). Top: QASM simulation. Bottom: Superconducting quantum processor (ibm_torino)

For question (Q2), we see from Table 1 that the success probability degrades when the number of maximal cliques increases. For both the simulation and the actual QPUs, the worst success probabilities are obtained on the structures with the largest clique count (\(|\mathcal {C}|\)). The sheer number of variables (n) does not have the same impact: structures with the same number of variables but fewer maximal cliques attain a consistently larger success probability. In Fig. 3, we consider the empirical success probability \(\hat{\delta }\) as a function of the parameter’s uniform norm \(\Vert \varvec{\theta }\Vert _{\infty }\) and of the parameter prior’s scale \(\sigma \), respectively. As predicted by Corollary 1, \(\hat{\delta }\) degrades exponentially with increasing \(\Vert \varvec{\theta }\Vert _{\infty }\), or, alternatively, with increasing \(\sigma \). Interestingly, the second plot of Fig. 2 reveals that \(\hat{\delta }\) also degrades as the model’s entropy decreases over the course of training: since we initialize the training with all elements of \(\varvec{\theta }\) being 0, we start at maximum entropy, and as the parameters are refined during learning, the entropy is reduced.

Finally, the first plot in Fig. 2 shows that the training progresses for both the simulated and the hardware runs. Although the hardware noise prevents the optimization from reaching the same likelihood as the simulation, the likelihood improves with additional training iterations. We hence answer \({\textbf {(Q3)}}\) in the affirmative.

7 Conclusion

We introduce an exact representation of discrete graphical models via a quantum circuit construction acting on \(n+1+|\mathcal {C}|\) qubits that shows potential for compatibility with near-term quantum hardware. This method enables unbiased, hyper-parameter-free sampling while keeping the theoretical properties of the undirected model intact; e.g., our quantum circuit factorizes over the set of maximal cliques, as predicted by the Hammersley-Clifford theorem. Although our results are stated for binary models, equivalent results for arbitrary discrete state spaces can be derived, where multiple qubits encode one logical non-binary random variable. The full compatibility between the classical graphical model and our unitary embedding is significant, since it allows us to benefit from existing theory as well as quantum sampling. A distinctive property of QCGM is that the algorithm itself indicates whether a sample should be discarded. Here, we show a lower bound for the success probability that depends only on the number of maximal cliques and the model parameter norm. The experiments conducted with numerical simulations and actual quantum hardware show that QCGMs perform well for certain conditional independence structures but suffer from small success probabilities for structures with large \(|\mathcal {C}|\). When compared to state-of-the-art MCMC-based sampling methods, our proposed method has a subtle advantage in terms of parallelization. Given M quantum processors, we can produce \(\delta M\) samples in expectation by running the proposed circuit once on each processor in parallel. In contrast, classical Markov chains have to perform an exponential number of iterations until each chain has reached the stationary distribution. This cannot be accelerated by running more chains in parallel, since a chain does not indicate whether it has attained stationarity.
It remains open for future research to potentially remove the dependence of \(\delta \) on the number of maximal cliques and to further explore the differences and potential relations in limiting factors between classical and quantum sampling methods. Finally, our results open up new avenues for probabilistic machine learning on quantum computers by showcasing that quantum models can be employed to generate unbiased and independent samples for a large and relevant class of generative models with sufficiently good fidelity.