Unsupervised Event Classification with Graphs on Classical and Photonic Quantum Computers

Photonic Quantum Computers provides several benefits over the discrete qubit-based paradigm of quantum computing. By using the power of continuous-variable computing we build an anomaly detection model to use on searches for New Physics. Our model uses Gaussian Boson Sampling, a $\#$P-hard problem and thus not efficiently accessible to classical devices. This is used to create feature vectors from graph data, a natural format for representing data of high-energy collision events. A simple K-means clustering algorithm is used to provide a baseline method of classification. We then present a novel method of anomaly detection, combining the use of Gaussian Boson Sampling and a quantum extension to K-means known as Q-means. This is found to give equivalent results compared to the classical clustering version while also reducing the $\mathcal{O}$ complexity, with respect to the sample's feature-vector length, from $\mathcal{O}(N)$ to $\mathcal{O}(\mbox{log}(N))$. Due to the speed of the sampling algorithm and the feasibility of near-term photonic quantum devices, anomaly detection at the trigger level can become practical in future LHC runs.


Introduction
The search for New Physics depends on our ability to separate Standard Model events from the much rarer and complex signal events from within the data from the LHC. It is desirable to perform data-driven searches, i.e. train on this data directly, without making specific assumptions on the new physics scenario realised by nature. While this allows the search to be widely applicable, it requires the classification method to learn and identify the important features of the background data to discriminate them from rare signal events which do not conform to the same features. Within the realm of classical network methods, autoencoders have been designed for the purpose of anomaly detection and unsupervised event classification [1,2]. Other, cluster based methods have also been developed [3].
Careful consideration of how the data is represented can improve the performance of searches. Events in high energy physics naturally fit into graph structures. These structures allow us not only to define the features of individual constituents in the event but also how they are related to each other. Graphs have proven to be a powerful representation for LHC data [4][5][6] and combined with an anomaly detection method could be useful in data-driven searches.
To further boost the performance of an anomaly detection search we propose the use of a quantum device. Quantum computing has shown to be able to solve classically hard problems, such as database searching and factoring [7,8]. In HEP, quantum computers have become popular to solve a range of tasks. Annealers have been used to study field theories [9,10] and optimisation problems [11]. Quantum neural networks to solve classification problems have been built from quantum gates [12]. Calculation of multi-particle interactions [13][14][15][16][17][18][19][20][21][22] accomplished with the mapping of field theories onto quantum walks [23][24][25][26] or a combination of quantum and classical ideas [27][28][29][30] have also all been done using models built from quantum gates.
Previous work, mostly, focuses on quantum annealers or the discrete, qubit-based paradigm of quantum computing. While these models are very successful, another scheme can be employed. The continuous-variable (CV) model of quantum computer differs from its use of qumodes over qubits [31]. Data is embedded into these qumodes, an infinitedimensional object. This is typically an electromagnetic field, allowing CV quantum devices to be constructed using quantum photonics hardware. Programming photonic quantum devices proceeds in stark similarity to qubit-based devices, with both allowing the construction of circuits from quantum gates. As qumodes are represented by an infinitedimensional uncountable basis, the CV model is a more natural choice to simulate bosonic and continuous systems. In the particular interest of this paper is applications for graph data [32] and machine learning [33].
Photonic devices use single photon emissions, manipulated through squeezers and beamsplitters. A photonic system, even one that uses a non-interacting source of photons, can still exhibit quantum properties such as entanglement. An example of one such system is boson sampling [34]. Boson sampling techniques, such as Gaussian boson sampling (GBS), is an application of continuous-variable quantum computers where there is a clear advantage over classical devices [35]. A GBS device emits photons into an interferometer, generating a sample by counting the photons that exit the device. This setup can be constructed as a circuit in a quantum device. It is difficult to classically simulate the probability distribution of this process as it requires the calculation of the hafnian, a #P problem.
However, when access to GBS devices become more accessible samples can be generated at the rate of 10 5 every second [36]. By embedding graph data into this device we can use it to generate a lower-dimensional representation of the data which is easier to handle when trying to build classifiers. These fast response times make GBS-based anomaly detection methods suitable to implement at the trigger level in future runs of the LHC. While the L1 trigger operates at 1 µs, the higher-level trigger functions at < 100 ms. A GBS device could produce around 100, 000 samples in a similar time frame, significantly more than is required in the method proposed here.
The operation of a photonic device also presents some technical advantages over most qubit-based machines. Temperatures of less than 100 millikelvin are required for modern solid-state machines to operate efficiently [37]. Photonic devices are built from photonic hardware that can run at room temperature.
We aim to harness the power of continuous variable quantum computing, specifically Gaussian boson sampling, to survey particle physics events represented as graphs. These samples could then be used as input into various anomaly detection techniques. The anomaly detection method we use is built around the K-means clustering algorithm. We also present a quantum extension to this. Q-means clustering can be implemented on both a qubit-based, and qumode-based, quantum computer. In its most basic form, Qmeans provides a substantial advantage over K-means. The size of the feature vector being used grows the time complexity of K-means linearly, whereas for Q-means it grows logarithmically, an exponential improvement [38,39]. Specifically, the process of anomaly detection we present has three steps: (i) the creation of data and graphs, (ii) embedding of graphs in a lower dimension representation and (iii) the classification procedure. A classical method is shown as well as the use of quantum equivalents for parts (ii) and (iii). We apply our method to a search for hadronically decaying scalar resonances, produced through a Higgs-portal interaction in the pp → HZ channel [40,41]. To allow the process to trigger, we require the Higgs boson to recoil against a boosted leptonically-decaying Z boson. The major Standard Model background for such a signal is pp → Z + jets.
These events can be represented as a set of graph adjacency matrices, weighted by the event constituent's features (ie. m ij , ∆R). We then find lower-dimensional embeddings for these objects and perform anomaly detection with them. To use as a baseline we create embeddings from the matrix eigenvalues and perform the classification using a K-means clustering algorithm. We then propose a novel method using GBS to create the embeddings and perform anomaly detection using a quantum equivalent of K-means known as Q-means. We find the GBS embedding method has improved performance compared to the classical version.
The paper has the following structure: Section 2 discusses the event generation and the process of encoding this into a graph structure. In Section 3 we introduce our first method of transforming the graph structure into something that can be used to train a classification algorithm. Here, we also present results as to how well the "classical" method of vector embedding performs for anomaly detection. The photonic methodology is discussed in Section 4. We detail how a photonic circuit is created, how GBS sampling is performed and the results it gives when its samples are used for classification. Then, in Section 5 a quantum equivalent to K-means classification is adopted. Finally, a summary and conclusion is presented in Section 6 2 Analysis Setup

Data Generation
To use as our background and signal samples we generate pp → Z + jets and pp → HZ events, with subsequent decays H → A 1 A 2 , A 2 → gg and A 1 → gg. All events have been generated with a centre of mass energy of 14 TeV and a minimum p T of the hard process of at least 140 GeV. We force the Z boson to decay leptonically to either e or µ. We have set the Higgs mass to 125 GeV, the A 2 mass to 40 GeV and the mass of A 1 to be 60 GeV. We use Pythia 8.2 to generate events and perform parton showering [42].  Such a scenario could be realised through derivative interactions between the Higgs boson and the pseudoscalars A 1 and A 2 , which in turn form an effective, yet highly suppressed, interaction with gluons. Thus, their decay to gluons could still be prompt, whereas their direction production cross section in proton collisions was tiny. In such a scenario the observation of A 1 or A 2 would have to proceed through Higgs decays.
Our analysis follows the methodology used when using jet substructure to find the Higgs for our analysis [43][44][45]. We will cluster our events into a fat jet, before reclustering its contents into "microjets" [44,46]. First, though, we impose a rapidity cut of 2.5 and a p T cut of 10 GeV on all final state leptons. To reconstruct the Z boson we ensure to retain two charged leptons within |y l | ≤ 2.5 and require them to have an invariant mass of 80 ≤ m ll ≤ 100 GeV. To explore the boosted region, we only consider events where the p T of the Z boson is greater than 150 GeV. Based on the fat-jet properties only, signal events resemble background events very closely.
The remaining objects are subject to a rapidity cut of 5. Using FastJet [47] we cluster these objects into jets using the Cambridge-Aachen algorithm with R = 1.5 [48] and demand that there is at least one jet in the event with a transverse momentum of p T > 150 GeV. Figure 1 shows the invariant mass of the leptons and fat jet in the event and also the transverse momentum of the fat jet.
The hardest jet is then reclustered into a series of "microjets" using the anti-kt algorithm [49]. Here, we choose R = 0.2 and force a transverse momentum of p T > 5 GeV. Only events with 3-6 microjets are selected for analysis. The choice of this limit is to informed by the runtime of the quantum sampling process used. Larger graphs begin to take inhibitive amounts of time to be sampled on a simulator using our quantum sampling method. The number of jets chosen strikes a balance between the length of time it takes and being able to capture the maximum amount of information on the event. The number of microjets found, and the total mass of a fat jet's microjets is shown in Figure 2. These microjets are   the objects used to construct our graphs. The result of the cuts we have applied is shown in Table 1. Here, we see the fraction of remaining events after each constraint is applied.

Constructing the Graphs
A method of constructing graphs from our events is to use only the final states [4,5]. Specifically, we use the microjets found from each event as the nodes to construct a set of graphs. We choose to connect nodes in the final state to every other node. Figure 3 shows an example of a graph we may construct. Signal graphs in our sample contain on average 4.9 nodes, while background graphs contain 3.9. From here we can construct a set of adjacency matrices that can be weighted to give detail on certain features of our event. Adjacency matrices are matrix representations of a graph where the rows and columns are defined by the graph nodes. Entries in the matrix are either 1 or 0, based on whether 2 nodes are connected. However, more information than this can be stored -the matrix entry can be weighted to show how strongly the nodes are connected. Some graph classification methods separate the graph information into feature and adjacency matrices [50]. In these cases, the adjacency matrix will include information on how the nodes in the graph are connected while the node's features are detailed in the feature matrix. However, for many graph sampling techniques, which require as input only symmetric matrices, a feature matrix cannot be readily included. We are limited to how we can manipulate adjacency matrices. Therefore, our aim is to construct a set of symmetric matrices that will include as much information about the node connections, and therefore our event, as possible. Our goal here does not require a model as complex as a neural network. We aim just to find a good representation of the graph, allowing us to use other methods to make classifications.
To begin with, an adjacency matrix can be constructed with connections between nodes weighted by their distance to each other in the (y, φ) plane. This distance is given by where ∆y is the difference in rapidity between particle i and j, while ∆φ is the difference between the azimuthal angle of particles i and j. The dimensions of the resulting matrix will be determined by the number of nodes in the graph, whereas the matrix entries are found from Eq. (2.1). We will eventually want to compare graphs, and hence matrices, of different sizes. Therefore, we pad each matrix with zeros on the right and lower sides such that they all have dimensions of 6 × 6. To be explicit, we can also construct weighted matrices from ∆φ and ∆y. Other adjacency matrices can be constructed using a measure of 2 objects energy E i E j or invariant mass m ij .
To be able to compare features of individual events with the global properties of the background event sample, each set of adjacency matrices are individually scaled with a constant s = 1/x, where x is the maximum value in the set of matrices made from the background samples. For example, the m ij matrices are scaled by a term s m ij , while the E i E j matrices are scaled by s E i E j . Thus, the five matrices (∆R, ∆y, ∆φ, E i E j and m i,j ) are all scaled accordingly. Figure 3 shows the process of taking one final event graph, finding a selection of weighted adjacency matrices (∆R, ∆y, ∆φ, E i E j and m i,j ) and then the final set of scaled and padded matrices. These adjacency matrices, like those in Figure 3 will be embedded in a lower-dimensional space and then used as input into a classifier.

Anomaly Detection on a Classical Computer
To make predictions the graph data will be embedded into vectors. By being a 1-dimensional vector, rather than a 2-dimensional matrix, it is simpler to use them as input into a classifier. These vectors will then be used in an unsupervised K-means method of anomaly detection. What is being done here is somewhat different from kernel-based methods of graph classification. If one were to use the kernel framework the entire graph dataset would be embedded into a kernel. These kernels provide a measure of how similar each graph is to the other. This object could then be passed to a kernel-based classification algorithm (ie, a SVM) and from there predictions could be made. While this method is popular and  Figure 3: For a graph we find five weighted adjacency matrices from it, each one capturing features of the graph nodes. These five are then scaled and padded, giving the final matrices on the right. These can be embedded and used as input to classification algorithms.
has been shown to have predictive powers it is not without its limitations. By creating a graph-kernel one is required to use a kernel-based algorithm. However, by creating feature vectors from the graphs we can be more versatile when creating a model as the range of algorithms available to use is broader.
The adjacency matrices created in Section 2 will be the objects used to create feature vectors. For the classical example we will simply take the eigenvalues of the five matrices. Each of the matrices gives us 6 eigenvalues. When these 5 objects are combined it results in a vector of length 30. Finally, we apply another round of scaling. This is done using StandardScaler from scikit-learn [51].
This method of creating a vector is based on the use of Laplacian Eigenvalues [52]. This method first transforms the adjacency matrix into its Laplacian -another form of graph matrix representation. The eigenvalues are taken of this and ordered. We found that the use of a vector created this way and one created with the more straightforward method described above gives similar results.
We prepare 1000 background samples and 200 signal samples for use. The background set is then split into 800 training examples and 200 test examples. The number of events we use is limited due to the nature of sampling from a simulated GBS device, which is discussed in Section 4.
K-means clustering is a popular method of unsupervised classification [53,54]. Its aim is to separate samples into several clusters. Each cluster has a centroid that is defined by the mean µ of the samples inside it. To begin finding these clusters the centroids can be initialised randomly. Then, they are updated by the algorithm by repeating 2 steps: (i) assign every point a label. This label describes what cluster it belongs to and is decided by the centroid closest to the point. (ii) The positions of the centroids are then updated by taking the average of every point in the cluster. These steps repeat until the change in centroid location is less than a selected value. As the centroids are updated the process aims to minimise the within-cluster sum-of-squares where C is the set of clusters (C 1 , ..., C n ) each with a mean (µ 1 , ..., µ n ). In our method, we only cluster our samples into one cluster. In this scenario, clustering is a somewhat-trivial problem. However, our choice of algorithm here is simply to develop a baseline to allow us to compare our classical sampling technique to a quantum equivalent. Regardless, the cluster is created using the 800 background training samples and results in the centroid c.
When testing on a sample, x, we can then define a loss function which measures the distance between a point and the cluster centroid. Intuitively, it can be assumed points closer to the centroid will more likely come from the background sample while points further away will more likely be signal. Therefore, a point with a larger loss value can be classified as an outlier and, thus, as signal. Figure 4 (a) shows the ROC curve from testing the fitted K-means algorithm, giving a result of 0.74 AUC. Figure 4 (b) shows the distribution of distances for the background and signal test sets.
To give some context for this result we can compare it to another classical method of anomaly detection. This will be done using an autoencoder. Autoencoders are neural networks where the input and out dimensions match while the middle of the network forms a bottleneck. The network's optimisation task is to recreate the input exactly in the output, but this, of course, is impossible due to the smaller number of nodes in the centre of the network. Like the K-means method, the autoencoder is only trained on background data. By comparing the loss associated with background test data and signal data one can create an anomaly detector [2]. We find this method gives the same AUC of 0.74.

Anomaly Detection on a Photonic Quantum Devices
In Section 3 we described a classical method of embedding the matrices we created into a vector and classifying them. Here, we replace the embedding procedure with an equivalent that can be performed on a quantum device.
The continuous-variable quantum computing regime differs from traditional, discrete, qubit-based quantum computing in many key areas. In the qubit system we can describe states as being in the form However, when we move to the CV form of quantum computing our states are no longer represented as qubits, but rather as qumodes. Qumodes are vectors expressed in an infinite dimensional uncountable basis |ψ = ψ(x)|x dx. A specific qumode k, at its simplest, can be described as a harmonic oscillator [55] with the position and momentum operatorŝ These operators both depend on the creation and annihilation ladder operatorsâ andâ † . By using multiple modes, and evolving their state, we can build a continuous variable quantum computer. While having an infinite dimensional basis is a fundamental change of the configuration space of our quantum system compared to Eq. (4.1), the dynamics of the state |ψ are still governed by unitary operators. Thus, by embedding data into a qumode one can apply gates to evolve the state, which is then measured at the end to obtain the result. However, the gate-set available to construct these circuits differ between discrete quantumgate computers and CV devices [55].
A gate can take the form of the unitary U = exp(−itH), where H is the generating Hamiltonian with time t. On a CV device we can classify the gates into two groups based on the form of this generator. Gaussian gates are at most quadratic, and take one or two modes as input. Squeezed states can be created using the squeeze gate where a andâ are the ladder operators and z = re iφ . The value r controls the squeeze amount and φ is the squeeze phase angle. The gate, applied to a mode, performs the transformationŜ † (z)âŜ(z) =âcosh(r) +â † e iφ sinh(r), (4.7) For a position-momentum pair, the squeezing gate will amplify one while reducing the other. States can be rotated through the rotation gate Here, θ controls the rotation angle. The gates application on a state will rotate a states position and momentum The final Gaussian gate we will introduce is the beamsplitter gate. This a 2 mode gate (ie. it takes two qumodes as input) that transforms states |â 1 ,â 2 to |â 1 ,â 2 , wherê a 1 =â 1 cos(θ) −â 2 e −iφ sin(θ), (4.12) a 2 =â 2 cos(θ) +â 1 e −iφ sin(θ).  As well as Gaussian gates there are also a set of non-Gaussian gates. These gates differ from Gaussian gates as they will only have 1 mode as input, and also through the degree of H. If the hamiltonian in the unitary has as degree of 3 or more (the position, momentum or a ladder operator in the gate is cubed, or more) it is classed as non-Gaussian. An example of this is the cubic phase gate With a set of Gaussian gates it is possible to construct all quadratic unitaries. However, a gate-set combining both Gaussian and non-Gaussian gates allows for universal quantum computation. This is defined by the computers ability to implement in a finite number of steps, with arbitrary precision, a unitary which is polynomial [31]. Continuous-variable photonic quantum devices have been shown to solve some classically difficult problems. One such problem is Gaussian boson sampling. This is the quantum method we will use to embed our matrices. For now, the classification step (K-means) will remain unchanged.

Gaussian Boson Sampling
An area where photonic devices are proposed to demonstrate an advantage is through boson sampling [34,35]. Boson samplers are made up of an array of emitters designed to emit single photons into an interferometer. Samples are found from the regularity of photons seen by the detectors after being outputted from the device. By embedding matrices into the device one can use the device to sample the object. To create samples of our graphs from a quantum device we will embed the graph adjacency matrices into the GBS device. From the samples that will be output one can create a feature vector to use as classifier input [32,36,56].
A simple analogue to boson sampling is the Galton board. A Galton board is built from a vertical board with a series of pegs attached. Balls are dropped into the device and make there way down the board, their path being altered by the pegs they hit. At the bottom of the board the balls are collected. In the boson sampling device however bosons can be emitted from multiple locations, instead of just one. Similar to the balls in the Galton board, the bosons do not interact with each other, To encode information into the device one can change the layout of the "pegs". The probability of finding a ball in a specific bin (or a photon at a specific detector) requires the calculation of the permanent The permanent is found from the matrix A, whose entries A ij will be the probability that a photon i will be found in detector j, and S N , the entire set of permutations of N photons. The permanent can also be defined by its relation to graphs. Calculating the permanent of the adjacency matrix of a bipartite graph is the equivalent of summing all the graphs perfect matchings. A graph, or subgraph, can be described as a "matching" if every vertex in the graph is connected with, at most, one other vertex. A "perfect matching" is the situation when every vertex is connected to one other vertex. The calculation of the permanent is a #P-problem and hard to solve on classical computers [57].
Boson Sampling requires the generation of deterministic sources of single photonssomething not currently available. Extensions to the above boson sampling model have been proposed to get around this. One of these is Gaussian boson sampling (GBS) [35]. GBS, instead of single photon states, uses single mode squeezed states. This setup has the same problem complexity as typical boson sampling. However, since squeezed states can be generated deterministically, it does not have the same scaling issues.
Unlike "standard" boson sampling, which requires the calculation of the permanent, to simulate Gaussian boson sampling requires the calculation of the hafnian. The hafnian and permanent are related through where M is a matrix. The hafnian will calculate how many perfect matchings are in an arbitrary graph. If the edges are weighted (ie. there is a weighted adjacency matrix) the hafnian calculation will sum the weights associated with the vertices of the perfect matchings. The hafnian is calculated as . For a graph containing n nodes, P n would find every set of perfect matchings. Note, in the n = 4 example that the three sets found are the indices of nodes in the three possible perfect matching sets. If n were to equal an odd number, there could not be a perfect matching and the hafnian would equal zero. The hafnian is a more general function than the permanent but like it, there is no known method of classically calculating it efficiently. The state prepared by a GBS device with N modes is fully described by a 2N × 2N covariant matrix σ and a displacement term, d. Here, we focus solely on σ. For such a state the GBS device can be sampled. What will be outputted is an array of photon counts. The arrayn will be filled such thatn = [n 1 , ..., n N ], where each n i represents how many photons have been counted at each detector. The probability of seeing a resultn is Here,n! = n 1 !n 2 !...n N !. The matrixÃn and Q are both related to the covariance matrix σ. This can be seen through the relations Q = σ + I/2,Ã = X(I − Q −1 ) and X = 0 I I 0 [35]. In Eq. (4.19) the hafnian ofÃ is not being found, but rather a modified versionÃn . The new matrix will be the same, except for the removal (or duplication) or certain rows and columns fromÃ, depending on the values inn. Ifn i = 0, then the i-th row and column of A will not be included, ifn i = 1 the row/ column will not change, and ifn i > 1 the row/ column will be duplicated. Finally, if all values inn are 1 thenÃ =Ãn.
For our purposes, we need to be able to embed our graphs into the device to retrieve samples. Therefore, there is a need to relate the adjacency matrices A to the covariance matrix σ. This is done through the double encoding strategỹ (4.20) Eq (4.19) can be written such to show the probability of sampling a photon eventn with respect to our adjacency matrices Again, the hafnian of the entire matrix A is not found, but rather a submatrix An. Practically, to sample from this distribution we embed the matrix into the device via parameters in the squeezing gates and interferometer of our photonic circuit. Overall, the probability of a set of photonsn being detected is therefore proportional to the weighted number of perfect matchings of the subgraph An. By sampling from a GBS device we can create feature vectors to use as input to classifiers.

Constructing a GBS Circuit
Due to the hafnian calculation, simulating Gaussian boson sampling is difficult on a classical machine. However, using quantum gates one can build a Gaussian boson sampling device that will run on a photonic quantum computer. The sampler is created from squeezing, rotation and beamsplitter gates and is shown in Figure 5. The gate parameters (detailed in Eqs. (4.6), (4.9) and (4.14)) are all determined by the adjacency matrix A. The matrix A can be decomposed, using the Takagi-Autonne decomposition, to obtain Here, U is a unitary matrix and the set of λ's are the matrix eigenvalues [32]. It is possible to further decompose the matrix U to give parameters for the set of interferometer gates [58]. To define a term to use for our squeeze gates parameters we need to introduce a scaling term c. This term is defined such that wheren is again the mean number of photons, a parameter we can choose [32]. From this and the eigenvalue we can find a squeezing parameter z i = tanh −1 (cλ i ). Altogether, this will allow us to embed our adjacency matrices within a GBS device and produce samples from it.
The construction of the circuit on a real device allows large amounts of sampling to be performed very quickly. Current devices find around 10 5 samples every second [36]. However, access to these devices is currently limited, resulting in the process needing to be simulated. This is accomplished by using the Python library Strawberry Fields [59].

Creating Feature Vectors from GBS Samples
By first transforming graphs into their adjacency matrix representation (as carried out in Section 2) they can then be embedded into the GBS device. When the device is sampled the process will therefore be driven by the information contained in the matrix. The generated samples can be used to create feature vectors to use in a classification procedure. The output from the GBS device will be arrays N long, where N equals the number of nodes in our graph. This vector will be made from information on frequencies of measurements from the device. As we run the device, we can choose from either thresholding the output, or not. In "threshold" mode, the output sample of the device only tracks if a photon has been detected in a specific mode or not, rather than count how many have been seen. On a simulator, not thresholding the results proved very time-intensive. Therefore, the sample results where thresholded. The output of the device will therefore be an array N long, filled with zeroes or ones depending on whether a photon was detected.
Regardless, sampling using the GBS simulator is time-consuming with or without photon thresholds. This constrains us with the amount of data we can use. As mentioned, feature vectors are built from frequencies of certain measurements from within an events set of samples. It follows that we need to sample each graph multiple times to accurately determine these measurements. The amount of samples taken from each event is another factor that influences runtime. We choose to sample each of the 5 matrices in a single event 6500 times, giving a total of 32,500 samples per event. This value gives us a good result.
The generated set of samples can then be used to calculate feature vectors. Here, a few methods exist to construct these [36]. An option is to craft a vector that contains probabilities of a photon being seen at each detector. If we have an event with a device containing k photon detectors we can construct a vector v = (p k 1 , p k 2 , ..., p k N ). (4.24) Here, p k is the chance a photon will be seen in this position by the detector. This setup, though it can be computationally expensive at large numbers of photons, grants us a lot of freedom to create the vector.
For each graph, we have a set of five adjacency matrices. For each matrix a feature vector is found (just as in Eq. (4.24)). The five vectors representing the graph are then combined, making a vector of length 30. This procedure mirrors how we handled the classical eigenvalue scenario. However, here we create the samples from the GBS device.

K-means Anomaly Detection using Gaussian Boson Sampling
To set a benchmark for anomaly detection with samples from the GBS device we will once again use K-means clustering. The pattern for this will match what was done in Section 3. The vectors will, firstly, be scaled using scikit-learn's StandardScaler. The K-means algorithm will then be run on 800 background samples to find a centroid. Distances can be calculated from the centroid to points in our test set of 400 background and signal graphs. This distance will be used as an error value in the fit, as shown in Eq. (3.2). Results for this method are shown in Figure 6. We achieve an AUC score of 0.79. This is an improvement over the classical scenario where an AUC of 0.74 was achieved. The GBS sampling method appears to provide an advantage in creating samples to use for anomaly detection tasks.

Q-means Clustering
As mentioned in Section 1 we can consider our anomaly detection problem in three parts: (i) graph creation, (ii) embedding of matrices and (iii) the classification method. In Section 4.1 we discussed a quantum method to embed the matrices. In this section, we introduce Q-means clustering, a quantum equivalent of K-means clustering, which we will use for anomaly detection.
K-means can be described as having two steps: (i) assigning points a label and (ii) updating the centroids. To successfully complete part (i) there is a need to be able to deduce what centroid each point is closest to. In this variant of Q-means we focus specifically on this.
As with the GBS sampling device, the Q-means method is applied using a quantum circuit. The circuit constructed aims to provide a measure of how close a point is to a centroid. The circuit itself can be split into three parts: data embedding, distance calculation and readout. After calculating which centroid the point is closest to we can then move onto the second step of the clustering procedure, updating the centroids. The centroids are chosen such that they are the mean of every point in the cluster. These steps will be repeated until the distance between the updated cluster location and the previous location is less than a threshold [38].
We will first view the task of Q-means classification in the discrete qubit scenario. To embed our data to use in the circuit we will use U 3 gates, .
The vectors are embedded in the θ, φ and λ parameters. This takes the values and embeds them as angles in a qubit. For vectors with more than three features we must use more than one qubit. After being embedded the states must enter into a circuit designed to measure some metric of distance. This will be done using the SwapTest circuit. See Appendix A for more information. For two states |α and |β the SwapTest routine will measure the overlap between them. If we find a probability of P (|0 ) = 0.5 then the two states are orthogonal, if P (|0 ) = 1 then the states are the same. If the centroids are embedded into a state alongside a sample vector we can measure which centroid state overlaps the most with the vector. The centroid with the most overlap will be chosen as the closest and the vector will be assigned to that cluster.
SwapTest circuits are constructed from Hadamard H and SWAP gates, respectively and A CSWAP gate is similar to a SWAP gate, Eq. (5.3), but depends on the state of a control bit. The Hadamard gate is used regularly when building quantum circuits to introduce entanglement into the system. These gates, alongside the U 3 embedding gates are combined to form the circuit shown in Figure 7. The circuit shown is designed for a model where vectors contain between 6 and 9 features, and hence need to use 3 qubits to store the information. The vectors created from the GBS method contain 30 elements and hence will require 10 qubits.
We have focused on a single variant of Q-means here. However, the model can be expanded to improve its performance further. Including other quantum algorithms (such as Grovers) can improve the label assigning stage [38], while a deeper reliance on qRAM can allow for an improved scaling complexity with respect to the data-set size itself [60].
What is appealing about the construction of this circuit is that (assuming states are prepared) the SwapTest routine does not depend on the number of points in a vector. To encode a state into quantum memory has the time complexity of O(log(N )), where N is the number of features in our vector [38,39]. This matches with what we may intuitively expect -for N features we require log 2 (N ) qubits. For regular K-means we expect a time complexity of O(N ). By using a quantum variant of K-means we see an exponential performance increase, with respect to the number of features in our sample.  Figure 7: Diagram of the Q-means quantum circuit. The feature vector and centroid is encoded through U 3 gates, while a SwapTest is used to provide a measure of their similarity.

Implementation, Performance and Scalability
The discussion in the previous section is based on the discrete qubit quantum computing regime. However, this methodology is also possible to carry out in the continuous-variable, qumode scenario. The Q-means method presented here is based on using the SwapTest to measure the overlap of two states. This has been shown to be the equivalent of Hong-Ou-Mandel effect [61]. The Hong-Ou-Mandel effect is a way to compare two arbitrary states in quantum photonics. Making use of this would allow one to implement Q-means on a photonic device straightforwardly.
To implement the Q-means quantum circuit shown in Figure 7 we use the Python package PennyLane [62]. The Q-means algorithm is trained with the same parameters as the previous two scenarios -using 800 background samples, each with 30 features. We use both methods of generating samples to train the model. In both cases, the classical eigenvalue scenario and the GBS sample variant, we find Q-means and K-means results to be equivalent. If one judges the Q-means performance based on the potential efficiency improvements there is a clear benefit to its use. The power of Q-means lies in its ability to scale to larger feature-vectors. The increase in the availability of quantum devices, and the ability to run these algorithms on their native devices, allows clustering methods to be more viable.

Conclusions
Separating rare unspecified signal events from common Standard Model backgrounds is one of the most important tasks multi-purpose experiments at the LHC perform. Datadriven searches, ones that train directly on the data without making assumptions about any features in new physics models, are an appealing option. By learning solely the distributions of the background Standard Model events it is possible to flag events that do not share similar features as signal. When building models and constructing datasets to train on it is worth considering how data is represented. Particle physics events are well-suited to be stored as graph structures, allowing relationships between event constituents to be naturally captured. To use these graph structures for classification one needs to use a model the structure naturally fits (such as a graph neural network) or be able to embed the structure (for example, into a kernel or vector).
Continuous-variable (CV) quantum computing, through the use of Gaussian Boson Sampling, provides a method of embedding a graph into a lower-dimensional representation. CV quantum computers differ from discrete, qubit, models of quantum computing by their reliance on infinite-dimensional qumodes. These devices offer unique quantum advantages, one of which is boson sampling. Boson sampling creates samples from detecting photons after they travel through an interferometer. Simulating this, due to the matrix operations involved, is a #P problem and therefore a classically hard task.
We propose using Gaussian Boson Sampling to embed generated data into a feature vector and perform anomaly detection with it. This method appears to be a viable choice for trigger level anomaly detection during future LHC runs because of the speed sampling can occur on a near term photonic device. Results using GBS to generate features compare favourably to results found when the feature vectors have been classically generated. Another quantum advantage can be gained from the use of Q-means clustering. We find performing anomaly detection using Q-means clustering gives equivalent results to K-means but has the potential to scale to large feature vectors more efficiently. The variant of Q-means presented here has a O(log(N)) complexity, with respect to the size of the feature vector. This is compared to the K-means complexity of O(N ). While Q-means is implemented here in the discrete qubit-based model it should be possible to expand this to the CV paradigm.

A SwapTest
A SwapTest circuit provides a method of checking the overlap between two states. Following [39], we can construct the circuit from Hadamard and CSWAP gates. We will use the information on the overlap of the states in our implementation of a quantum K-means routine. To understand how SwapTest operates we begin by defining a state |ψ = |0, α, β . (A.1) The state ψ contains the two states we wish to measure (α and β) which can contain multiple qubits and a control qubit. This initial state is evolved by applying a Hadamard gate to the control bit, resulting in the new state The next gate applied is CSWAP. The controlled SWAP gate will swap qubits based on the value of the control qubit. The application of a CSWAP will therefore evolve our state to Finally, another Hadamard gate is applied to the control qubit: The probability of measuring the state |0 will give us a measure of the overlap. If states |α and |β are identical then P (|0 ) will equal 0, if they are orthogonal than P (|0 ) = 0.5