1 Introduction

In recent years, the field of quantum computing has made significant strides towards practical usefulness, sparking increasing interest in many areas, including machine learning (Perdomo-Ortiz et al. 2018; Benedetti et al. 2019). The growing field of quantum machine learning has since led to proposals for quantum analogs of many types of classical models, such as convolutional neural networks (Cong et al. 2019) and graph neural networks (Verdon et al. 2019).

Many existing quantum machine learning approaches rely on the assumption that the exponentially large Hilbert space spanned by possible quantum states will lead to an advantage over classical methods. This, however, is far from clear: encoding useful quantum states efficiently and measuring them accurately are challenges that make straightforward speed-ups difficult (Aaronson 2015). Furthermore, since existing quantum devices are very limited, empirical benchmarks are often impossible at the scales where quantum methods might lead to a real advantage. Due to these difficulties, theoretical analysis plays a fundamental role, and recent works focusing on characterizing the capabilities and limitations of potential quantum models have shown significant results (Schuld et al. 2021; Liu et al. 2021; Kübler et al. 2021; Goto et al. 2021).

The goal of this paper is to establish a framework for learning functions over graphs using quantum methods and to study its theoretical properties. Graphs play a key role in modern machine learning, and are used to encode various forms of relational data, such as knowledge graphs (Bordes et al. 2011), social networks (Zhang and Chen 2018), and importantly also molecules (Wu et al. 2018), which are a particularly promising application domain of quantum computing due to their inherent quantum properties.

Graph neural networks (GNNs) (Kipf and Welling 2017; Veličković et al. 2018) are prominent models for classical relational learning, as they encode desirable properties such as permutation invariance (resp., equivariance) relative to graph nodes, enabling a strong relational inductive bias (Battaglia et al. 2018). While broadly applied, the expressive power of prominent GNN architectures, such as message-passing neural networks (MPNNs) (Gilmer et al. 2017), has been shown to be upper bounded by the 1-dimensional Weisfeiler-Lehman graph isomorphism test (Xu et al. 2019; Morris et al. 2019). This limitation motivated a large body of work aiming at more expressive models, including higher-order models (Morris et al. 2019; Maron et al. 2019a), as well as extensions of MPNNs with unique node identifiers (Loukas 2020), or with random node features (Sato et al. 2021; Abboud et al. 2021).

In this paper, we investigate quantum analogs of GNNs and make the following contributions:

  • We define criteria for quantum circuits to respect the invariances of the graph domain, leading to equivariant quantum graph circuits (EQGCs) (Section 4).

  • We define equivariant hamiltonian quantum graph circuits (EH-QGCs) and equivariantly diagonalizable unitary quantum graph circuits (EDU-QGCs) as special subclasses, and relate these classes to existing proposals, providing a unifying perspective for quantum graph representation learning (Section 4.2).

  • We characterize the expressive power of EH-QGCs and EDU-QGCs, proving that they are universal approximators of functions defined over arbitrarily large (but bounded) graph domains. This result is achieved by showing a correspondence between EDU-QGCs and MPNNs enhanced with random node initialization, which are universal approximators over bounded graphs (Abboud et al. 2021). In contrast, our model does not require any extraneous randomization, and the result follows from the model properties (Section 5).

  • We experimentally show that even simple EDU-QGCs go beyond the capabilities of popular GNNs, by empirically verifying that they can discern graph pairs that are indistinguishable by standard MPNNs (Section 6).

This paper is based on work done for the MSc dissertation of the first author at the University of Oxford, first published in ICML 2022. This version includes extended details of all proofs and constructions.

The rest of this paper is organized as follows. We first discuss related work in the field of quantum machine learning in Section 2, then give an overview of important methods and results in graph representation learning that we build on in Section 3. After these preliminaries, we present our proposed framework and discuss important subclasses in Section 4, show our theoretical results on model expressivity in Section 5, and provide an empirical evaluation in Section 6. We finish with a discussion of our results and possible further directions in Section 7.

2 Related work

The field of quantum machine learning includes a wide range of approaches. Early work had partial successes in speeding up important linear algebra subroutines (Harrow et al. 2009), but these methods usually came with caveats (e.g., requiring the input to be easy to prepare or sparse, or approximate knowledge of the final state to be sufficient) that made them hard to apply to large problem classes in practice (Aaronson 2015). Recent approaches tend to use quantum circuits to mimic or replace larger parts of classical techniques: quantum kernels use a quantum computer to implement a fixed kernel function in a classical learning algorithm (Schuld and Killoran 2019; Liu et al. 2021), while parameterized quantum circuits (PQCs) use tunable quantum circuits as machine learning models in a manner similar to neural networks (Perdomo-Ortiz et al. 2018; Benedetti et al. 2019). Since standard backpropagation is not possible on quantum hardware, gradients are calculated in alternative ways (Schuld et al. 2019), and gradient-free optimization methods are also used (Ostaszewski et al. 2021). In this paper, we focus on PQCs.

There is also a growing body of work on the capabilities and limitations of such models. Ciliberto et al. (2018) and Kübler et al. (2021) give rigorous results about when we can and cannot expect the inductive bias of quantum kernels to give them an advantage over classical methods; Servedio and Gortler (2004) and Liu et al. (2021) demonstrate carefully chosen function classes that quantum kernels can provably learn more efficiently than any classical learner. PQCs have been harder to reason about due to their non-convex nature, but there have been important steps in showing conditions under which certain PQCs are universal function approximators over vector spaces (Schuld et al. 2021; Goto et al. 2021), similarly to multi-layer perceptrons in the classical world (Hornik et al. 1989). There has also been rigorous work on the PAC-learnability of the output distributions of local quantum circuits (Hinsche et al. 2021).

For learning functions over graphs, the literature is sparse: there are some proposals supported by small-scale experiments, but there is generally a lack of formal justification for the particular model choices. In particular, we are not aware of any theoretical work on the capabilities of these models. We propose a framework unifying PQC models that build a circuit for each example graph in a structurally appropriate way when running inference, such as Verdon et al. (2019), Zheng et al. (2021), and Henry et al. (2021). Such PQCs are also used as a building block by Ai et al. (2022), who apply them to subgraphs, thereby requiring fewer qubits and enabling scaling to larger graphs. We discuss considerations for these, and investigate their expressive power.

There are also other approaches that we do not cover, such as using edges primarily in classical pre- or post-processing steps of a PQC (Chen et al. 2021), or running a PQC for each node independently and using the connectivity only to formulate the error terms calculated from the measurements (Beer et al. 2021).

3 Graph neural networks

GNNs date back to the early works of Scarselli et al. and Gori et al., and are designed to have a graph-based inductive bias: the functions they learn should be invariant to the ordering of the nodes or edges of the graph, since the ordering is just a matter of representation and not a property of the graph. This includes invariant functions, which output a single value that should be unchanged on permuting nodes, and equivariant functions, which output a representation for each node that is reordered consistently as the input is shuffled (Hamilton 2020).

Formally, a function f is invariant over graphs if, for isomorphic graphs \(\mathcal {G},{\mathscr{H}}\), it holds that \(f(\mathcal {G}) {=} f({\mathscr{H}})\); a function f mapping a graph \(\mathcal {G}\) with vertices \(V(\mathcal {G})\) to vectors \({\boldsymbol x\in \mathbb {R}^{|V(\mathcal {G})|}}\) is equivariant if, for every permutation π of \(V(\mathcal {G})\), it holds that \({f(\mathcal {G}^{\pi })=f(\mathcal {G})^{\pi }}\).

Message-passing neural networks (MPNNs) (Gilmer et al. 2017) are a popular and highly effective class of GNNs that iteratively update the representations of each node based on their local neighborhoods. In an MPNN, each node v is assigned some initial state vector \({\boldsymbol {h}}_{v}^{(0)}\) based on its features. This is iteratively updated based on the current state of its neighbors \(\mathcal N(v)\) and its own state, as follows:

$$ {\boldsymbol{h}}_{v}^{(k+1)} = \textsc{upd}^{(k)}\Big({\boldsymbol{h}}_{v}^{(k)}, \textsc{agg}^{(k)}\big(\{\!\!\{{\boldsymbol{h}}_{u}^{(k)} \mid u \in \mathcal N(v) \}\!\!\}\big)\Big), $$

where \(\{\!\!\{\cdot\}\!\!\}\) denotes a multiset, and \(\textsc{agg}^{(k)}(\cdot)\) and \(\textsc{upd}^{(k)}(\cdot)\) are differentiable aggregation and update functions.

The choice for the aggregate and update functions varies across approaches (Kipf and Welling 2017; Veličković et al. 2018; Xu et al. 2019; Li et al. 2016). After several such layers have been applied, the final node embeddings are pooled to form a graph embedding vector to predict properties of entire graphs. The pooling often takes the form of simple averaging, summing or elementwise maximum.
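To make the update rule concrete, the following minimal sketch (our illustration in plain numpy, not tied to any particular GNN library) implements one message-passing layer with sum aggregation and a simple one-layer update, followed by sum pooling:

```python
import numpy as np

def mpnn_layer(H, A, W_self, W_agg, b):
    """One message-passing layer with sum aggregation:
    h_v' = relu(W_self h_v + W_agg sum_{u in N(v)} h_u + b).
    H: (n, d) node states; A: (n, n) adjacency matrix."""
    messages = A @ H                                  # sum over each node's neighbors
    return np.maximum(0.0, H @ W_self.T + messages @ W_agg.T + b)

def sum_pool(H):
    """Permutation-invariant readout: pool node states into a graph embedding."""
    return H.sum(axis=0)
```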

The expressive power of MPNNs is upper bounded by the 1-dimensional Weisfeiler-Lehman algorithm (1-WL) for graph isomorphism testing (Xu et al. 2019; Morris et al. 2019). Considering a pair of 1-WL indistinguishable graphs, such as those shown in Fig. 1, any MPNN will learn the exact same representations for these graphs, yielding the same prediction for both, irrespective of the target function to be learned. In particular, this means that MPNNs cannot learn functions such as counting cycles, or detecting triangles.

Fig. 1 Two graphs indistinguishable by 1-WL: \({\mathcal {G}}_{1}\) consisting of two triangles (left), and \({\mathcal {G}}_{2}\) being a single 6-cycle (right)

The limitations in the expressive power of GNNs motivated a large body of work. Xu et al. (2019) proposed graph isomorphism networks (GINs) as maximally expressive MPNNs, and showed that this model is as powerful as 1-WL, owing to its potential of learning injective aggregate-update functions. To break the expressiveness barrier, some approaches considered unique node identifiers (Loukas 2020), random pre-set color features (Dasoulas et al. 2020), and the like, so as to make graphs discernible by construction (since 1-WL can distinguish graphs with unique node identifiers), but these approaches suffer in generalization. Other approaches are based on higher-order message passing (Morris et al. 2019), or higher-order tensors (Maron et al. 2019b; Maron et al. 2019a), and typically have a prohibitive computational complexity, making them less viable in practice.

Rather recently, MPNNs enhanced with random node initialization (Sato et al. 2021; Abboud et al. 2021) have been shown to increase expressivity without incurring a large computational overhead, while preserving invariance properties in expectation. Sato et al. showed that such randomized MPNNs can detect any fixed substructure (e.g., a triangle) with high probability, and Abboud et al. proved that randomized MPNNs are universal approximators for functions over bounded graphs, building on an earlier logical characterization of MPNNs (Barceló et al. 2020). Intuitively, random node initialization assigns unique identifiers to different nodes with high probability, and the model becomes robust via more sampling, leading to strong generalization. However, these models are harder to train, since they need to see many different random labelings to eventually become robust to this variation. The extent of this effect can be mitigated by using fewer randomized dimensions (Abboud et al. 2021).

4 Equivariant quantum graph circuits

In this section, we describe the class of models we consider and formalize the requirement of respecting the graph structure in our definition of equivariant quantum graph circuits. We then discuss two subclasses and their relation to each other.

4.1 Model setup

Let \({\mathbb {G}}^{n}\) be the set of graphs with at most n nodes. Consider a graph \({\mathcal {G}} \in {\mathbb {G}}^{n}\), with adjacency matrix \({\boldsymbol {A}} \in \mathbb B^{n \times n}\) and a node feature vector x_i for each node \(i \in \{1 {\dots } n\}\). We consider a broad class of models with the following simple structure, as shown in Fig. 2:

  1. For each node with features x_i, a quantum state \(|{v_{i}}\rangle = |{\rho ({\boldsymbol {x}}_{i})}\rangle \in \mathbb {C}^{s}\) is prepared via some fixed feature map ρ(⋅). The dimensionality of this state is s = 2^q when using q qubits per node.

  2. The node states are composed with the tensor product to form the product state \(|{{\boldsymbol {v}}}\rangle = \bigotimes _{i=1}^{n} |{v_{i}}\rangle \in \mathbb {C}^{s^{n}}\).

  3. We apply some circuit encoding a unitary matrix \({\boldsymbol {C}}_{{\boldsymbol {\theta }}}({\boldsymbol {A}}) \in \mathbb {C}^{s^{n} \times s^{n}}\), dependent on the adjacency matrix A and tunable parameters θ, to the initial state of the system.

  4. Each node state is measured in the computational basis, leading to a one-hot binary vector \(|{y_{i}}\rangle \in \mathbb B^{s}\) for each node. Over the entire system, we measure any \(|{{\boldsymbol {y}}}\rangle = \bigotimes _{i=1}^{n} |{y_{i}}\rangle \in \mathbb B^{s^{n}}\) with probability P(y) = |⟨y|C_θ(A)|v⟩|² as dictated by the Born rule. This means the probability of any specific measurement is given by the squared magnitude of a single element in the final state vector \({\boldsymbol {C}}_{{\boldsymbol {\theta }}}({\boldsymbol {A}}) |{{\boldsymbol {v}}}\rangle \in \mathbb {C}^{s^{n}}\).

  5. These are aggregated by some permutation-invariant parameterized classical function \(g_{{\boldsymbol {\theta }}^{\prime }}\) to provide a prediction \(g_{{\boldsymbol {\theta }}^{\prime }}({\boldsymbol {y}})\).

Fig. 2 Overview of our model setup: (a) a product state is prepared based on individual nodes, (b) a parameterized circuit C is applied based on the adjacency matrix A, (c) the node states are measured, and (d) aggregated by some classical function g

While this setup rules out certain possibilities, such as using mixed-state quantum computing with mid-circuit measurements, or somehow aggregating the node states inside the quantum circuit, it still leaves a broad and powerful framework that subsumes existing methods (as we will discuss in Section 4.2). We do not consider details of how to design the classical aggregator \(g_{{\boldsymbol {\theta }}^{\prime }}\); for questions of expressivity, we will simply assume that it is a universal approximator over multisets, which is known to be achievable by combining multi-layer perceptrons with sum aggregation (Zaheer et al. 2017; Xu et al. 2019). The choice of the feature map ρ does have to be made upfront, but our proofs all use simple constructions encoding the data in the computational basis.
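Putting steps 1-5 together, the pipeline is small enough to state as a classical simulation sketch (ours, assuming numpy; rho, circuit, and aggregate are placeholder callables standing in for ρ, C_θ(A), and g_θ′):

```python
import numpy as np
from functools import reduce

def eqgc_forward(features, A, rho, circuit, aggregate, shots=100, seed=0):
    """Steps 1-5: prepare node states, take their tensor product, apply
    C_theta(A), measure in the computational basis, aggregate classically."""
    node_states = [rho(x) for x in features]          # step 1: |v_i> in C^s
    v = reduce(np.kron, node_states)                  # step 2: |v> = tensor product
    psi = circuit(A) @ v                              # step 3: apply unitary C_theta(A)
    probs = np.abs(psi) ** 2                          # step 4: Born rule
    rng = np.random.default_rng(seed)
    outcomes = rng.choice(len(psi), size=shots, p=probs / probs.sum())
    return aggregate(outcomes)                        # step 5: invariant aggregator g
```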

Our focus is instead on the circuit C_θ(A), and how it should behave in order to interact well with the graph. As in the case of classical GNNs, we want to make sure the ordering of nodes and edges does not matter. In our case, this means that for any input, reordering the nodes and edges should reorder the probabilities of all measurements appropriately.

Example 1

With n = 3 nodes represented by a single qubit each (s = 2), the probability of observing some output ⟨y1y2y3| is p = |⟨y1y2y3|C_θ(A)|v1v2v3⟩|². If we cycle the nodes around to form the input state |v2v3v1⟩, and also use an appropriately reordered adjacency matrix \({\boldsymbol {A}}^{\prime }\), we should find the probability of the reordered observation, |⟨y2y3y1|C_θ(A′)|v2v3v1⟩|², to be p as well.

This brings us to the definition of equivariant quantum graph circuits (EQGCs):

Definition 1

Let \({\boldsymbol {A}} \in {\mathbb {B}}^{n \times n}\) be an adjacency matrix, \({{\boldsymbol {P}} \in {\mathbb {B}}^{n \times n}}\) a permutation matrix representing a permutation p over n elements, and \(\tilde {{\boldsymbol {P}}} \in {\mathbb {B}}^{s^{n} \times s^{n}}\) a larger matrix that reorders the tensor product, mapping any \(|{v_{1}}\rangle |{v_{2}}\rangle \dots |{v_{n}}\rangle \) with \(|{v_{i}}\rangle \in \mathbb {C}^{s}\) to \(|{v_{p(1)}}\rangle |{v_{p(2)}}\rangle \dots |{v_{p(n)}}\rangle \).

An EQGC is an arbitrary parameterized function C_θ(⋅) mapping an adjacency matrix \({\boldsymbol {A}} \in {\mathbb {B}}^{n \times n}\) to a unitary \({\boldsymbol {C}}_{{\boldsymbol {\theta }}}({\boldsymbol {A}}) \in \mathbb {C}^{s^{n} \times s^{n}}\) that behaves equivariantly for all θ:

$$ {\boldsymbol{C}}_{{\boldsymbol{\theta}}}({\boldsymbol{A}}) = \tilde{{\boldsymbol{P}}}^{T} {\boldsymbol{C}}_{{\boldsymbol{\theta}}}({\boldsymbol{P}}^{T}{\boldsymbol{A}}{\boldsymbol{P}}) \tilde{{\boldsymbol{P}}} $$
(1)

In the following sections, we will generally leave the parameter θ, and sometimes also A, implicit when they are clear from context.
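Definition 1 can be tested numerically on small instances. The sketch below (our illustration; the convention P e_j = e_{p(j)} for the permutation matrix is our assumption) builds the lifted permutation \(\tilde{{\boldsymbol{P}}}\) and checks Eq. 1 for one permutation:

```python
import numpy as np

def lift_permutation(p, s):
    """Build the s^n x s^n matrix mapping |v_1>...|v_n> to |v_p(1)>...|v_p(n)>."""
    n = len(p)
    dims = [s] * n
    P_tilde = np.zeros((s**n, s**n))
    for idx in np.ndindex(*dims):                     # input basis state |i_1...i_n>
        src = np.ravel_multi_index(idx, dims)
        dst = np.ravel_multi_index(tuple(idx[p[j]] for j in range(n)), dims)
        P_tilde[dst, src] = 1.0
    return P_tilde

def check_equivariance(C, A, p, s):
    """Test Eq. 1: C(A) == P~^T C(P^T A P) P~ for one permutation p."""
    P = np.eye(len(p))[:, p]                          # convention: P e_j = e_{p(j)}
    Pt = lift_permutation(p, s)
    return np.allclose(C(A), Pt.T @ C(P.T @ A @ P) @ Pt)
```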

In accordance with our model setup, an EQGC C_θ(⋅) represents a probabilistic model over graphs only when combined with a fixed feature map ρ(⋅) to prepare each node state, as well as measurement and classical aggregation \(g_{{\boldsymbol {\theta }}^{\prime }}\) at the end of the circuit. Putting these together, we can formally speak of the capacity of EQGCs in representing functions.

Definition 2

We say that a (Boolean or real) function f defined on \({\mathbb {G}}^{n}\) can be represented by an EQGC C_θ with error probability ϵ if there is some feature map ρ and invariant classical aggregation function \(g_{{\boldsymbol {\theta }}^{\prime }}\), such that for any input graph \({\mathcal {G}} \in {\mathbb {G}}^{n}\) the model's output is \(f({\mathcal {G}})\) with probability at least 1 − ϵ. In the special case where ϵ = 0, we simply say that the function f can be represented by an EQGC C_θ.

Remark 1 (A note on directedness)

Unlike many works on GNNs, our definition of EQGCs allows us to consider directed graphs naturally, and this will also be true for the subclasses we consider later. Of course, we can still easily operate on undirected data by either adding edges in both directions, or placing extra restrictions on our models. For the purposes of expressivity, we will still focus on classifying graphs in the undirected case, as this is better explored in previous works on classical methods.

4.2 Subclasses of EQGCs

Note that we cannot and should not aim to use all possible EQGCs as a model class. If we did, the prediction of our models on any graph would not restrict their behavior on other, non-isomorphic graphs in any way. This would not only make such a class impossible to characterize with a finite set of parameters θ, but the models would also have no way to generalize to unseen inputs. Therefore, EQGCs should be seen as a broad framework, and we investigate more restricted subclasses that do not have such problems.

We are particularly interested in subclasses that scale well with the number of nodes in a graph, so in the following sections we discuss approaches based on uniform single-node operations and two-node interactions at edges. All of the following models are parameterized by identical operations being applied for each node or for each edge, ensuring that a single model can efficiently learn about graphs of various sizes. This is also a useful starting point for ensuring equivariance, although as we will see, we also have to make sure that the ordering of these operations does not affect our results.

Note, however, that to keep our analysis feasible, our model classes are not closely tied to realizations in quantum gates. We consider arbitrary Hamiltonian and unitary operators, which can be approximated with a universal gate set to any required accuracy, but this might require very deep circuits. Due to this, as well as the number of qubits per node derived in Theorem 2, we do not expect our specific constructions to be practically realized on near-term hardware; rather, their primary value is in characterizing the capabilities of a broad class of models, and we leave more practical parameterizations for future work.

4.2.1 Parameterization by Hamiltonians

Operations on the quantum states of nodes or pairs of nodes can be easily represented as unitaries, but these are tricky to parameterize directly: e.g., a linear combination of unitaries is generally not unitary. One alternative is to use the fact that any unitary U can be expressed using its Hamiltonian H, a Hermitian matrix of the same size such that \({\boldsymbol {U}} = \exp (-i{\boldsymbol {H}})\). We can let the Hamiltonian depend linearly on the adjacency matrix, with Hermitian operators applied based on the structure of the graph:

Definition 3

An equivariant hamiltonian quantum graph circuit (EH-QGC) is an EQGC given by a composition of finitely many layers \({\boldsymbol {C}}_{{\boldsymbol {\theta }}}({\boldsymbol {A}}) = {\boldsymbol {L}}_{{\boldsymbol {\theta }}_{1}}({\boldsymbol {A}}) \circ {\dots } \circ {\boldsymbol {L}}_{{\boldsymbol {\theta }}_{k}}({\boldsymbol {A}})\), with each \({\boldsymbol {L}}_{{\boldsymbol {\theta }}_{j}}\) for 1 ≤ j ≤ k given as:

$$ {\boldsymbol{L}}_{{\boldsymbol{\theta}}}({\boldsymbol{A}}) = \exp\left( -i\left( \sum\limits_{{\boldsymbol{A}}_{jk}=1}{\boldsymbol{H}}^{\text{(edge)}}_{j,k} + \sum\limits_{i=1}^{n}{\boldsymbol{H}}^{\text{(node)}}_{i}\right)\right), $$
(2)

where the parameter set θ = (H^{(edge)}, H^{(node)}) consists of two Hermitian matrices over one- and two-node states, and the indexing \({\boldsymbol {H}}^{\text {(edge)}}_{j,k}, {\boldsymbol {H}}^{\text {(node)}}_{i}\) refers to the same operators applied at the specified node(s); i.e., one EH-QGC layer is fully specified by a single one-node Hamiltonian and a single two-node Hamiltonian.

This means that if the graph is permuted, the operators will be applied at the appropriately changed positions. For example, \({\boldsymbol {H}}^{\text {(node)}}_{3} = {\boldsymbol {I}} \otimes {\boldsymbol {I}} \otimes \hat {\boldsymbol {H}}^{\text {(node)}} \otimes {\boldsymbol {I}}\) in the case of n = 4 nodes. Moreover, a summation imposes no sequential ordering of operations, so the model is equivariant.
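As an illustration of Definition 3 for q = 1 qubit per node, the following sketch (ours; it uses scipy's matrix exponential) sums the embedded local Hermitians over the graph and exponentiates the result to obtain one layer:

```python
import numpy as np
from functools import reduce
from scipy.linalg import expm

def embed_two_node(h, j, k, n, s=2):
    """Embed an s^2 x s^2 operator h on node registers j and k (j != k),
    acting as the identity on all remaining registers."""
    dims = [s] * n
    H = np.zeros((s**n, s**n), dtype=complex)
    for out in np.ndindex(*dims):
        for inp in np.ndindex(*dims):
            if all(out[m] == inp[m] for m in range(n) if m not in (j, k)):
                H[np.ravel_multi_index(out, dims), np.ravel_multi_index(inp, dims)] = \
                    h[out[j] * s + out[k], inp[j] * s + inp[k]]
    return H

def eh_qgc_layer(A, H_edge, H_node, s=2):
    """One EH-QGC layer (Eq. 2): exp(-i (sum_edges H_edge + sum_nodes H_node))."""
    n = A.shape[0]
    total = np.zeros((s**n, s**n), dtype=complex)
    for j in range(n):
        for k in range(n):
            if A[j, k] == 1:
                total += embed_two_node(H_edge, j, k, n, s)
    eye = np.eye(s, dtype=complex)
    for i in range(n):                       # I x ... x H_node x ... x I at node i
        total += reduce(np.kron, [H_node if m == i else eye for m in range(n)])
    return expm(-1j * total)
```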

EH-QGCs are closely related to the approach taken by Verdon et al. (2019) for their quantum graph convolutional neural network (QGCNN) model, as well as to the parameterized quantum evolution kernel of Henry et al. (2021). Both define operations in terms of Hamiltonians based on the graph structure. The difference is that for any given learning task, they consider a restricted class of models with hand-picked Hermitians, and leave only the scalar weights multiplying these as learnable parameters. This helps with efficiently compiling small circuits, and allows better scaling to a larger number of qubits per node (which should be possible on future hardware). If we consider the full set of possible choices for these QGCNN models, we get exactly our set of EH-QGCs as defined above. For our purposes, working with the broader class of arbitrary Hamiltonians lends itself better to theoretical analysis, and we leave it to future work to investigate circuit classes with better scaling in the number of qubits.

4.2.2 Parameterization by commuting unitaries

A similar, but more direct approach would be to consider two-node unitaries instead of Hamiltonians and apply a single learned unitary for each edge of the graph. As before, this ensures the number of operations scales linearly with the number of edges in a graph. This is also the approach taken by Zheng et al. (2021), but we need to add extra conditions that they do not consider to ensure equivariance.

Specifically, we need to enforce that the order in which we apply these unitaries does not matter. Writing \({\boldsymbol{U}}_{j,k}\) for the unitary U applied to the registers of nodes j and k (in this order), acting as the identity elsewhere, this gives us the following commutativity condition for a two-node unitary U, stated on three nodes:

$$ {\boldsymbol{U}}_{1,2}\, {\boldsymbol{U}}_{2,3} = {\boldsymbol{U}}_{2,3}\, {\boldsymbol{U}}_{1,2} $$
(3)

If the graphs are undirected, we should also ensure that the direction in which the edge unitary is applied does not affect our predictions. Writing SWAP for the unitary exchanging two node registers, this amounts to:

$$ \text{SWAP}\, {\boldsymbol{U}}\, \text{SWAP} = {\boldsymbol{U}} $$
(4)

In the case of directed graphs, Eq. 4 need not apply, but Eq. 3 is also not sufficient in itself, since we need to consider cases where the unitary might be applied in different directions. Specifically, we need to ensure the following extra conditions:

$$ {\boldsymbol{U}}_{1,3}\, {\boldsymbol{U}}_{2,3} = {\boldsymbol{U}}_{2,3}\, {\boldsymbol{U}}_{1,3} $$
(5)
$$ {\boldsymbol{U}}_{1,2}\, {\boldsymbol{U}}_{1,3} = {\boldsymbol{U}}_{1,3}\, {\boldsymbol{U}}_{1,2} $$
(6)
$$ {\boldsymbol{U}}_{1,2}\, {\boldsymbol{U}}_{2,1} = {\boldsymbol{U}}_{2,1}\, {\boldsymbol{U}}_{1,2} $$
(7)

Equation 5 ensures commutativity of directed edges to the same target, Eq. 6 of edges from the same source, and Eq. 7 of 2-cycles between two nodes.

Of course, such a directed unitary can also be used for undirected graphs by applying it in both directions: in fact, if Eq. 7 is satisfied, this composition itself satisfies the undirected Eq. 4:

$$ \text{SWAP}\, ({\boldsymbol{U}}_{1,2}\, {\boldsymbol{U}}_{2,1})\, \text{SWAP} = {\boldsymbol{U}}_{2,1}\, {\boldsymbol{U}}_{1,2} = {\boldsymbol{U}}_{1,2}\, {\boldsymbol{U}}_{2,1} $$
(8)

It is not clear whether we can parameterize the space of all such commuting unitaries, but we can focus on a subclass.

Definition 4

An equivariantly diagonalizable unitary (EDU) is a unitary that can be expressed in the form \({\boldsymbol{U}} = ({\boldsymbol{V}}^{\dagger} \otimes {\boldsymbol{V}}^{\dagger}) {\boldsymbol{D}} ({\boldsymbol{V}} \otimes {\boldsymbol{V}})\) for a unitary \({\boldsymbol {V}}\in \mathbb {C}^{s \times s}\) and diagonal unitary \({\boldsymbol {D}} \in \mathbb {C}^{s^{2} \times s^{2}}\).

Note that all unitaries can be diagonalized in the form \({\boldsymbol{U}} = {\boldsymbol{P}}^{\dagger} {\boldsymbol{D}} {\boldsymbol{P}}\) for some other unitary P and diagonal unitary D. The above is simply the case when P decomposes as V ⊗ V for one single-node unitary V. All EDUs satisfy the given commutativity conditions. Since the V factors on the uninvolved node cancel, we have \({\boldsymbol{U}}_{1,2} = ({\boldsymbol{V}} \otimes {\boldsymbol{V}} \otimes {\boldsymbol{V}})^{\dagger} ({\boldsymbol{D}} \otimes {\boldsymbol{I}}) ({\boldsymbol{V}} \otimes {\boldsymbol{V}} \otimes {\boldsymbol{V}})\), and similarly for \({\boldsymbol{U}}_{2,3}\). Using the facts that I ⊗ D and D ⊗ I are still diagonal matrices and that diagonal matrices commute, we can see that equivariantly diagonalizable unitaries satisfy Eq. 3:

$$ {\boldsymbol{U}}_{1,2}\, {\boldsymbol{U}}_{2,3} = ({\boldsymbol{V}} \otimes {\boldsymbol{V}} \otimes {\boldsymbol{V}})^{\dagger} ({\boldsymbol{D}} \otimes {\boldsymbol{I}}) ({\boldsymbol{I}} \otimes {\boldsymbol{D}}) ({\boldsymbol{V}} \otimes {\boldsymbol{V}} \otimes {\boldsymbol{V}}) = ({\boldsymbol{V}} \otimes {\boldsymbol{V}} \otimes {\boldsymbol{V}})^{\dagger} ({\boldsymbol{I}} \otimes {\boldsymbol{D}}) ({\boldsymbol{D}} \otimes {\boldsymbol{I}}) ({\boldsymbol{V}} \otimes {\boldsymbol{V}} \otimes {\boldsymbol{V}}) = {\boldsymbol{U}}_{2,3}\, {\boldsymbol{U}}_{1,2} $$
(9)

The directed versions (Eqs. 5, 6 and 7) are similar, since V ⊗ V and V† ⊗ V† commute with the swap, and then analogous derivations apply.

Furthermore, a matrix of the form \({\boldsymbol{P}}^{\dagger} {\boldsymbol{D}} {\boldsymbol{P}}\) with unitary P is itself unitary if and only if all of its eigenvalues (the diagonal elements of D) have absolute value 1. We can therefore parameterize these unitaries by combining arbitrary single-node unitaries V with diagonal matrices D of unit-modulus numbers.
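For instance, an EDU over two single-qubit nodes can be sampled and its commutativity verified numerically, as in the following sketch (ours, assuming numpy; V is drawn via a QR decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
s = 2                                                  # one qubit per node

# Arbitrary single-node unitary V via QR of a complex Gaussian matrix.
Q, R = np.linalg.qr(rng.normal(size=(s, s)) + 1j * rng.normal(size=(s, s)))
V = Q * (np.diag(R) / np.abs(np.diag(R)))              # fix column phases

# Diagonal unitary D with unit-modulus entries, and the EDU of Definition 4.
D = np.diag(np.exp(1j * rng.uniform(0.0, 2 * np.pi, s**2)))
U = np.kron(V, V).conj().T @ D @ np.kron(V, V)         # (V^dag x V^dag) D (V x V)

# Commutativity condition (Eq. 3): U on nodes (1,2) commutes with U on (2,3).
I = np.eye(s)
U12, U23 = np.kron(U, I), np.kron(I, U)
assert np.allclose(U12 @ U23, U23 @ U12)
assert np.allclose(U.conj().T @ U, np.eye(s**2))       # U is indeed unitary
```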

This allows us to parameterize the following class of EQGCs:

Definition 5

An equivariantly diagonalizable unitary quantum graph circuit (EDU-QGC) is an EQGC expressed as a composition of node layers \({\boldsymbol {L}}_{\text {node}}\) and edge layers \({\boldsymbol {L}}_{\text {edge}}\), given as follows on a graph with node and edge sets \((\mathcal V, \mathcal E)\):

$$ {\boldsymbol{L}}_{\text{node}} = {\boldsymbol{V}}^{\otimes |\mathcal{V}|} $$
(10)
$$ {\boldsymbol{L}}_{\text{edge}} = \prod\limits_{(j,k)\in\mathcal{E}} {\boldsymbol{U}}_{j,k} $$
(11)

In short, we either apply the same single-node unitary to all nodes, or we apply the same EDU appropriately for each edge. Since both types of layers are equivariant by construction, so is their composition, hence EDU-QGCs are a valid EQGC class.

It can be shown that EDU-QGCs are a subclass of the Hamiltonian-based EH-QGCs discussed in Section 4.2.1. This is particularly useful for investigating questions of expressivity: we also get a result about the expressivity of EH-QGCs by showing the existence of EDU-QGC constructions representing some function.

Theorem 1

Any EDU-QGC can be expressed as an EH-QGC.

To show this result, we consider node layers and edge layers separately and show that both can be represented by one or more EH-QGC layers. We first prove the case for node layers, then diagonal edge layers; finally, we build on these two to prove the case for all edge layers, completing the proof. The details are provided in Appendix A.

5 Expressivity results

In this section, we analyze the expressivity of the EQGCs discussed in Section 4.2: Hamiltonian-based EH-QGCs and EDU-QGCs defined using commuting unitaries.

Quantum circuits operate differently from MPNNs and other popular GNN architectures, so one might hope that they are more expressive. Since current classical methods with high expressivity are either computationally expensive (like higher-order GNNs) or require a large number of training samples to converge (like GNNs with random node initialization), this could in principle lead to a form of quantum advantage with sufficiently large-scale quantum computers.

We first show that EDU-QGCs subsume MPNNs: a class of MPNNs, including maximally expressive architectures, can be "simulated" by a suitable EDU-QGC configuration. We then prove that they are in fact universal models for arbitrary functions on bounded-size graphs, building on prior results regarding randomized MPNNs. Since we have proven EDU-QGCs to be a subclass of EH-QGCs in Theorem 1, the results immediately follow for EH-QGCs as well.

5.1 Simulating MPNNs

Recall that MPNNs are defined via aggregate and update functions, as described in Section 3. In this section, we focus on MPNNs where the aggregation is of the form \(\textsc {agg}^{(k)}(\{\!\!\{{\boldsymbol {h}}_{i}\}\!\!\}) = {\sum }_{i} {\boldsymbol {h}}_{i}\), which includes many common architectures.

Remark 2

We consider MPNN node states with real numbers represented in fixed-point arithmetic. Although GNNs tend to be defined with uncountable real vector state spaces, these can be approximated with a finite set if the data comes from a bounded set.

We show that EDU-QGCs can simulate MPNNs with sum aggregation in the following sense:

Theorem 2

Any (Boolean or real) function over graphs that can be represented by an MPNN with sum aggregation, can also be represented by an EDU-QGC.

We prove this result by giving an explicit construction to simulate an arbitrary MPNN with sum aggregation, detailed in Appendix B.1. In particular, our construction for Theorem 2 implies that for an MPNN with k layers with an embedding dimensionality of w, with a fixed-point real representation of b bits per real number, this EDU-QGC needs (2k + 1)wb qubits per node.
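For illustration, under this bound a hypothetical small MPNN with k = 3 layers, embedding width w = 64, and b = 16 bits per number would already require (2 · 3 + 1) · 64 · 16 = 7168 qubits per node, which is why, as noted in Section 4.2, we do not expect these particular constructions to run on near-term hardware.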

Since MPNNs with sum aggregation (e.g., GINs) can represent any function learnable by any MPNN (Xu et al. 2019), we obtain the following corollary to Theorem 2:

Corollary 2.1

Any (Boolean or real) function that can be represented by any MPNN can also be represented by some EDU-QGC.

5.2 Universal approximation

We build on results about randomization in classical MPNNs, discussed in Section 3 (Sato et al. 2021; Abboud et al. 2021), to show that our quantum models are universal.

We simulate classical models that randomize some part of the node state by putting some qubits into the uniform superposition over all bitstrings, then operating in the computational basis. Unlike in the classical case, where this randomization had to be explicitly added to extend model capacity, we can do this without modifying our model definition: our results apply to EDU-QGCs and their superclasses. Analogously to the universality of MPNNs with random features, this allows us to prove the following theorem:
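The randomization step itself is cheap: applying a Hadamard to each of r qubits and measuring them yields a uniformly random r-bit string. A minimal sketch of this correspondence (our illustration):

```python
import numpy as np

r = 4                                            # randomized qubits per node
state = np.ones(2**r) / np.sqrt(2**r)            # Hadamard on each qubit of |0...0>
probs = np.abs(state) ** 2                       # Born rule: every bitstring has prob 2^-r
sample = np.random.default_rng().choice(2**r, p=probs)
print(format(sample, f"0{r}b"))                  # acts like a random node identifier
```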

Theorem 3

For any real function f defined over \({\mathbb {G}}^{n}\), and any ϵ > 0, an EDU-QGC can represent f with an error probability ϵ.

We cannot directly rely on the results of either Abboud et al. (2021) or Sato et al. (2021): although our theorem is analogous to that of Abboud et al., they used MPNNs extended with readouts at each layer, which our quantum models cannot simulate. Sato et al. used MPNNs without readouts, but did not quite prove such a claim of universality. Therefore, we give a novel MPNN construction that is partially inspired by Sato et al., but relies solely on the results of Xu et al. (2019), and use it to show Theorem 3.

Briefly, we use the fact that for bounded-size graphs individualized by random node features, a GIN can in principle assign final node states that injectively depend on the isomorphism class of each node’s connected component. These node embeddings can be pooled to give a unique graph embedding for each isomorphism class of bounded graphs, which an MLP can map to any desired results. All of this can be simulated on an EDU-QGC, hence they are universal models. The details are given in Appendix B.2.

6 Empirical evaluation

While our primary focus is theoretical, and it is challenging to execute experiments large enough to give interesting results, we also performed two small experiments. We first look at a very restricted EDU-QGC model and observe that it can distinguish the graphs \({\mathcal {G}}_{1}\) and \({\mathcal {G}}_{2}\) of Fig. 1 with nontrivial probability (which is beyond the capabilities of MPNNs), and also reason about this simple case analytically. After this, we construct a small classification dataset of cycle graphs in such a way that MPNNs could achieve no more than 50% accuracy, and we successfully train deeper EDU-QGCs to high performance.

6.1 Testing expressivity beyond 1-WL

We performed a simple experiment to verify that EDU-QGC models can give different outputs for graphs that are indistinguishable by deterministic classical MPNNs. As our inputs, we used the two graphs \({\mathcal {G}}_{1}\) and \({\mathcal {G}}_{2}\) shown in Fig. 1 without node features (i.e., fixed initial node states in our quantum circuit), the simplest example where MPNNs fail. Our models should identify which graph is input. Using a single qubit per node, we expect our accuracy to be better than 50%, but far from perfect.

Experimental setup

To keep the experiment as simple as possible, we used a very simple subset of EDU-QGCs parameterized by a single variable α, similar to instantaneous quantum polynomial circuits (Bremner et al. 2016):

  • Each node state |vi⟩ is initialized as the \(|{+}\rangle =H|{0}\rangle =\frac {1}{\sqrt 2}(|{0}\rangle +|{1}\rangle )\) state on one qubit per node (q = 1), where \(H = \frac {1}{\sqrt 2}\left (\begin {array}{cc} 1 & 1 \\ 1 & -1 \end {array}\right )\) denotes the Hadamard gate.

  • We apply an edge layer as given by Eq. 11, with a \(CZ(\alpha ) = \text {diag}(1,1,1,\exp (-i\alpha ))\) gate as the applied unitary acting on two neighboring node-qubits.

  • We apply a node layer with an H gate at each node.

  • In a single measurement shot, we observe k nodes in the |1⟩ state and 6 − k in |0⟩. For each value of k, the classical aggregator g(⋅) can map this to a different prediction (a simulation sketch of this circuit follows below).
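This one-parameter model is small enough to simulate exactly; the following sketch (ours, in numpy, not the authors' code) computes the outcome distributions for the two graphs:

```python
import numpy as np

def outcome_distribution(edges, n, alpha):
    """Exactly simulate the circuit: |+>^n, CZ(alpha) on each edge, H on each
    node; return the probability of observing k ones, for k = 0..n."""
    state = np.full(2**n, 2 ** (-n / 2), dtype=complex)        # |+>^n
    bits = (np.arange(2**n)[:, None] >> np.arange(n)[::-1]) & 1
    for (j, k) in edges:                                       # CZ(alpha) is diagonal
        state[(bits[:, j] == 1) & (bits[:, k] == 1)] *= np.exp(-1j * alpha)
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    for q in range(n):                                         # Hadamard on qubit q
        state = state.reshape(2**q, 2, 2 ** (n - q - 1))
        state = np.einsum("ab,ibj->iaj", H, state).reshape(-1)
    probs = np.abs(state) ** 2
    ones = bits.sum(axis=1)
    return np.array([probs[ones == k].sum() for k in range(n + 1)])

two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
six_cycle = [(i, (i + 1) % 6) for i in range(6)]
print(outcome_distribution(two_triangles, 6, np.pi))
print(outcome_distribution(six_cycle, 6, np.pi))
```

At α = π this matches the analysis of Section 6.1.1: the two triangles yield 2, 4, or 6 ones with probabilities 9/16, 6/16, and 1/16, while the 6-cycle yields 0, 2, or 4 ones with probabilities 1/16, 6/16, and 9/16.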

Using ZX-diagram notation (Coecke and Kissinger 2018), Fig. 3 (top) shows the circuits we get for our choice of C in the case of \({\mathcal {G}}_{1}\) and \({\mathcal {G}}_{2}\). The probabilities of observing k |1⟩s for each graph and all possible values of k as a function of our single parameter α are also shown in Fig. 3 (bottom).

Fig. 3 The two circuits in the experiment in top-to-bottom ZX-diagram notation, with the α-box between white spiders representing a CZ(α) gate (a standard ZX-calculus shorthand (Coecke and Kissinger 2018)), followed by probabilities of observing a given number of |1⟩s as a function of α ∈ [−π, π] for each circuit. The two distributions differ most visibly when α is near ±π

We find that as α gets near ±π, the distributions of the number of |1⟩s measured do differ, and an accuracy of 0.625 is achievable with a single measurement shot (and an arbitrarily low error rate can be achieved with a sufficiently high number of measurements). This would naturally get better as we increase the number of qubits used, but this already shows an expressivity exceeding that of deterministic MPNNs.

6.1.1 Theoretical analysis of the experiment

In an effort to better understand the power of such circuits, we focused on analyzing the most well-behaved special case of the above EDU-QGC, with CZ(π) rotations, and were able to analytically derive the observed measurement probabilities of this simple IQP circuit for any graph consisting of cycles.

Using the ZX-calculus, we show that applying it to any n-cycle graph results in a uniform distribution over certain measurement outcomes, give a simple algorithm to check whether a given n-length bitstring is one of these possible outcomes, and prove that the number of measured |1⟩s always has the same parity as the size n of the graph.

With α = π, the α-boxes representing the CZ-gates in Fig. 3 turn into simple Hadamard gates. So for any specific bitstring \(|{b_{1}{\dots } b_{n}}\rangle \), we can get the probability of measuring it by simplifying the following scalar, whose squared norm is that probability:

[ZX-diagram omitted: the amplitude \(\langle{b_{1}{\dots} b_{n}}| H^{\otimes n} \big({\prod }_{(j,k)\in \mathcal E} CZ_{j,k}\big) |{+}\rangle ^{\otimes n}\), drawn as a ZX-diagram with each CZ as a Hadamard edge carrying a normalization factor of \(\frac{1}{\sqrt 2}\)]

where the numerical term comes from normalizing each CZ-gate with a factor of \(\sqrt {2}\).

We can substitute the appropriate white and gray spiders for the |+⟩, |0⟩ and |1⟩ states to apply ZX-calculus techniques (Coecke and Kissinger 2018): a white spider with phase 0 for the |+⟩ state, and gray spiders with 0 and π phases respectively for |0⟩ and |1⟩. All of these need to be normalized with a factor of \(\frac {1}{\sqrt 2}\). Due to the Hadamard gates, these all turn into white spiders that can be fused together, so this is equal to a simple trace calculation:

[ZX-diagram omitted: \(\big (\frac {1}{\sqrt 2}\big )^{n}\) times a cycle of white spiders with phases α_1, …, α_n, joined by Hadamard edges]

where α_i = 0 if b_i = 0, and α_i = π if b_i = 1.

This can be simplified step by step. Firstly, as long as there are any spiders with α_i = 0 and two distinct neighbors (i.e., there are at least 3 nodes in total), we can remove them and fuse their neighbors:

[ZX-diagram omitted: a phase-0 spider is removed and its two neighbors fuse]
(12)

After repeating this, we get one of two outcomes. We might end up with one of 3 possible diagrams that still have some α_i = 0 but fewer than 3 nodes, which we can evaluate by direct calculation of their matrices:

[ZX-diagram omitted: direct evaluation of the three sub-3-node base cases that still contain a phase-0 spider]
(13)

Otherwise, all the remaining spiders have α_i = π, and we can repeatedly eliminate them in groups of 4:

[ZX-diagram omitted: four π-phase spiders are eliminated at a time]
(14)

On repeating this, we end up with 0 to 3 nodes with α_i = π, which we can evaluate directly:

[ZX-diagram omitted: direct evaluation of the remaining 0 to 3 π-phase spiders]
(15)

Observe that during the simplifications, we only introduced phases with an absolute value of 1, which do not affect measurement probabilities. Furthermore, we always decreased the number of nodes involved by 2 or 4, hence the parity is unchanged. This means for odd n, we will always end up with one of the odd-cycle base cases with a trace of 0 or \(\pm \sqrt 2\), while for even n, we get to the even-cycle base cases with traces of 0 or 2.

Combining with the initial coefficient of \(\big (\frac {1}{\sqrt 2}\big )^{n}\) and taking squared norms, we get that for odd n, each bitstring is observed with probability 0 or \(\frac {1}{2^{n-1}}\) (so half of all possible bitstrings are observed), while with even n, each bitstring is observed with probability 0 or \(\frac {1}{2^{n-2}}\) (so we see only a quarter of all bitstrings).

Furthermore, to check which bitstrings are observed, we can summarize the ZX-diagram simplification as a simple algorithm acting on cyclic bitstrings (where the first and last bits are considered adjacent):

  • As long as there is a 0 in the bitstring and the length of the bitstring is more than 2, remove the zero along with its two neighbors, and replace them with the XOR of the neighbors.

  • If you end up with just |00⟩, the state has a positive probability to be observed. If you end up with |0⟩ or |01⟩, it has 0 probability.

  • When there are only |1⟩s remaining, if the number of these is 2 mod 4, the input has 0 probability to be observed, otherwise positive.

This shows us why the observed number of |1⟩s always has the same parity as n: at each step, both the parity of |1⟩s and the parity of the bitstring's length are unchanged. The only even-length base case with an odd number of ones is |01⟩, which corresponds to states with 0 probability; and similarly, the only odd-length base case with an even number of ones is |0⟩, which has the same outcome.
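The reduction can also be written as a short program; a sketch of the check for a single n-cycle (our illustration):

```python
from itertools import product

def observable(bits):
    """Check whether a cyclic bitstring can be observed with positive
    probability on the CZ(pi) circuit applied to a single n-cycle."""
    bits = list(bits)
    while 0 in bits and len(bits) > 2:
        i = bits.index(0)
        left = (i - 1) % len(bits)
        bits = bits[left:] + bits[:left]          # rotate to [left, 0, right, ...]
        bits = [bits[0] ^ bits[2]] + bits[3:]     # merge the triple into its XOR
    if 0 in bits:                                 # base cases |0>, |00>, |01>
        return bits == [0, 0]
    return len(bits) % 4 != 2                     # all ones: 2 mod 4 means prob 0

# Observable outcomes of a triangle: (0,0,1), (0,1,0), (1,0,0), (1,1,1).
print([b for b in product([0, 1], repeat=3) if observable(b)])
```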

We can also derive the specific probabilities observed in the experiment. It is easy to see from this that in the case of a triangle, the observable states are |001⟩, |010⟩, |100⟩, |111⟩. This allows us to calculate the probabilities observed for the case of two triangles. For the 6-cycle, the observable states are |000000⟩, six rotations of |000101⟩, six rotations of |001111⟩, and three rotations of |101101⟩, giving the expected probabilities as well.

6.2 Synthetic dataset of cycle graphs

We created a synthetic dataset of 6- to 10-node graphs consisting of either a single cycle, or two cycles. The single-cycle graphs were oversampled to create two equally sized classes for a binary classification task. Eight-cycle graphs were reserved for evaluation, while all others were used for training.

We trained EDU-QGC models of various depths with a single qubit per node on this dataset. Each node state was initialized as \(|{+}\rangle =\frac {1}{\sqrt 2}(|{0}\rangle +|{1}\rangle )\), then an equal number \(k\in \{1, \dots , 14\}\) of general node and edge layers was applied in alternation. After measurement, the fraction of observed |1⟩s was used to predict the input's class through a learnable nonlinearity. Exact probabilities of possible outcomes were calculated, and the Adam optimizer was used to minimize the expected binary cross-entropy loss for 100 epochs, with an initial learning rate of 0.01 and an exponential learning rate decay with a coefficient of 0.99 applied at each epoch.

Results are shown in Fig. 4. We report the one-sample accuracy (the average probability of a correct prediction across the dataset), and the highest achievable many-sample accuracy (the fraction of the dataset where a model was right with at least 50% probability). Importantly, we observe a consistent benefit from increasing depth, in contrast with the oversmoothing problems of GNNs (Li et al. 2018). We also did not experience any issues with the near-zero gradients or "barren plateaus" that make it challenging to optimize many PQC models (McClean et al. 2018), although we have not investigated whether this would hold with the noisy gradients one would get in a real quantum experiment as opposed to our exact classical simulation.

Fig. 4 Accuracies of EDU-QGC models on the synthetic cycles dataset. The many-sample accuracy bound is calculated as the fraction of examples in the dataset where the model was correct with more than 50% probability. Results are based on an average of 10 runs, with the shaded region representing standard deviation

Interestingly, the model performs better on the evaluation set than on the training set. This is because it is hard for the model to reliably classify 9- and 10-node graphs containing two cycles when these contain subgraphs that also occur as single-cycle examples. For example, if the model associates a high number of measured |1⟩s with single-cycle graphs, then a 6-cycle will lead to many |1⟩s. Since a disjoint union of a 6-cycle and a 3-cycle contains this subgraph, it will also have a relatively high fraction of |1⟩s, leading to an incorrect prediction. Clearly, this would not be an issue if more qubits per node could be used (which may be feasible in the future): the size of a cycle could be encoded exactly in the larger set of possible observations, and this could easily be aggregated invariantly to count the number of cycles. Note also that one of the 10 runs was dropped as an outlier in the case of 4 layers: through some unlucky initialization, the model failed to learn anything and stayed at 50% accuracy in this single run.

6.2.1 Effective parameter count

The model was able to fit this dataset with a very small number of parameters: after accounting for redundancy, the model contains only 6 real-valued degrees of freedom for each pair of node and edge layers:

  • The node layer applies an arbitrary single-qubit unitary, which can be specified by 3 Euler-angle rotations of the Bloch sphere.

  • The edge layer can involve an arbitrary equivariantly diagonalizable unitary \(({\boldsymbol{V}}^{\dagger} \otimes {\boldsymbol{V}}^{\dagger}) {\boldsymbol{D}} ({\boldsymbol{V}} \otimes {\boldsymbol{V}})\) as given in Definition 4. However, the V is redundant when the edge layer is surrounded by two node layers applying single-node unitaries U1 and U2 everywhere: modifying these to be \({\boldsymbol{V}} {\boldsymbol{U}}_{1}\) and \({\boldsymbol {U}}_{2} {\boldsymbol {V}}^{\dagger }\) respectively would have the same effect. Hence it suffices to consider the diagonal unitary D, which applies some phase in each of the |00⟩, |01⟩, |10⟩ and |11⟩ cases. To satisfy the undirected graph constraint of Eq. 4, the phases for |01⟩ and |10⟩ need to be the same. This leaves us with 3 real parameters, one for each of the remaining phases.

Note that in order to have an efficient implementation, we implemented edge layers as just diagonal unitaries over two nodes. This is justified by the above argument regarding the redundancy of V for all layers except the last, which is not surrounded by node layers; in this case it could slightly affect the performance of the model in principle.
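For concreteness, this 6-parameter layer pair could be realized as follows (a sketch, ours; the function names and the ZYZ convention are our choices):

```python
import numpy as np

def node_layer_unitary(a, b, c):
    """Arbitrary single-qubit unitary (up to a global phase) via ZYZ Euler angles."""
    Rz = lambda t: np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])
    Ry = lambda t: np.array([[np.cos(t / 2), -np.sin(t / 2)],
                             [np.sin(t / 2),  np.cos(t / 2)]])
    return Rz(a) @ Ry(b) @ Rz(c)

def edge_layer_unitary(p00, p01, p11):
    """Diagonal two-qubit unitary with the |01> and |10> phases tied together,
    satisfying the undirected-graph constraint of Eq. 4."""
    return np.diag(np.exp(1j * np.array([p00, p01, p01, p11])))
```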

7 Conclusions, discussions, and outlook

In this paper, we proposed equivariant quantum graph circuits, a general framework of quantum machine learning methods for graph-based machine learning, and explored possible architectures within that framework. Two subclasses, EH-QGCs and EDU-QGCs, were proven to have desirable theoretical properties: they are universal approximators for functions defined over graphs up to a fixed size, just like randomized MPNNs. Our experiments were small-scale due to the computational difficulties of simulating quantum computers classically, but they did confirm that the distinguishing power of our quantum methods exceeds that of deterministic MPNNs.

The framework of EQGCs and its subclasses raises many questions that we did not explore in this paper. EDU-QGCs and EH-QGCs have important limitations: using arbitrary node-level Hamiltonians or unitaries allowed us to show expressivity results, but they are not feasible to scale to a large number of qubits per node, since the space of parameters grows exponentially. Perhaps a small number of qubits will already turn out to be useful, but EQGC classes with better scalability to large node states should also be investigated.

There are also design choices beyond the EQGC framework that might be interesting. For example, rather than measuring only at the end of the circuit, mid-circuit measurements and quantum-classical computation might offer possibilities that we have not analyzed.

Ultimately, the biggest questions in the field of quantum computing are about quantum advantage: what useful tasks can we expect quantum computers to speed up, and what kind of hardware do these applications require? Recent work on the theoretical capabilities of quantum machine learning architectures is already contributing to this: it has been shown that we can carefully engineer artificial problems that provably favor quantum methods (Kübler et al. 2021; Arute et al. 2019; Liu et al. 2021), but this is yet to be seen for practically significant problem classes. At the same time, there are convincing arguments that quantum computers will be useful for computational chemistry tasks such as simulating molecular dynamics, where EQGCs could be useful, which is a direction worth exploring.