Learning the latent structure of collider events

We describe a technique to learn the underlying structure of collider events directly from the data, without having a particular theoretical model in mind. It allows to infer aspects of the theoretical model that may have given rise to this structure, and can be used to cluster or classify the events for analysis purposes. The unsupervised machine-learning technique is based on the probabilistic (Bayesian) generative model of Latent Dirichlet Allocation. We pair the model with an approximate inference algorithm called Variational Inference, which we then use to extract the latent probability distributions describing the learned underlying structure of collider events. We provide a detailed systematic study of the technique using two example scenarios to learn the latent structure of di-jet event samples made up of QCD background events and either $t\bar{t}$ or hypothetical $W' \to (\phi \to WW) W$ signal events.


Introduction
With the discovery of the the Higgs boson [1,2], all the degrees of freedom that form our current consistent theoretical understanding of fundamental quantum interactions -the standard model (SM)have been experimentally established. The theoretical and phenomenological program that enabled this fundamental scientific achievement has lasted for more than three decades starting when the dominant SM Higgs boson production and decay modes have been first identified and computed [3][4][5]. It involved detailed theoretical calculations of both the eventual signal, but also the most relevant (and typically much more abundant) background processes from which the (relatively small) signal had to be painstakingly extracted with the use of advanced statistical methods [1,2,6].
Contrary to the hunt for the SM Higgs boson (and also other SM heavy resonances, like the top quark or the weak gauge bosons), whose properties, processes and signatures at high energy colliders were well predicted and understood before their discovery, our current quest for uncovering possible new physics (NP) degrees of freedom beyond those of the SM faces a much bigger challenge. Namely, there is no unique well established model of NP which would convincingly address all the known SM shortcomings and whose phenomenology could be precisely studied and targeted experimentally. Instead there exist a plethora of NP proposals and possibilities. The few simplest, most elegant and thus most compelling possibilities have already been mostly excluded or pushed into fringe corners of their respective parameter spaces, while systematically exploring the phenomenology of the whole vast imaginable model space is clearly beyond current human capabilities.
In the last few years, machine learning (ML) tools have opened new avenues in NP searches, see e.g. [7][8][9][10] and references therein. The currently most widely used framework is that of Neural Networks (NNs) as efficient likelihood approximators trained on vast amounts of data. Since these supervised ML approaches commonly rely on theoretical predictions for both NP (signal) and SM (background) training data sets (typically through Monte Carlo (MC) generators), their use in searches for a priori unknown new phenomena in LHC events is severely limited.
There have been recent advances in unsupervised or semi-supervised ML techniques designed to be able to separate signal and background events in mixed samples, and could therefore be run directly on experimental data without the need for pure MC training samples, see e.g. refs. . They rely on categorizing and comparing datasets with different expected signal and background admixtures or identifying anomalous events inside large datasets. While these approaches ameliorate the model dependence of fully supervised ML, they are still potentially susceptible to correlated systematics (i.e. detector) effects and/or subject to large look-elsewhere effects. In addition, they generally work best when applied on very large datasets. Consequently their performance may suffer when looking for effects in tails of distributions.
Recently [37], we have proposed a new technique to classify jets and events in situ within a single mixed event sample, using tools developed in a branch of ML called generative statistical modeling, see e.g. [38]. Developed primarily to identify emergent themes in collections of documents, these models infer the hidden (or latent) structure of a document corpus using posterior Bayesian inference based on word and theme co-occurence [39][40][41][42][43][44][45]. Using the example of jet substructure observables based on the clustering history, we have shown how to construct statistical mixed membership models of jet substructure. In particular, using the model of Latent Dirichlet Allocation (LDA) [40], which can be solved efficiently using e.g. Variational Inference (VI) [46] techniques, we were able to define robust parametric jet and event classifiers. 1 In the present work we provide further details of this approach, building upon the basic assumptions and premises about the relevant measurements/observables and their statistical modeling, in order to construct generative Bayesian models most relevant and practical for particle physics event classification, in particular LDA. We also provide further justification for why jet clustering history observables are a particularly interesting and applicable example for these methods. Finally, we perform a systematic study of the parametric and Bayesian prior dependence and performance of VI and LDA, respectively, based on two representative examples of boosted tt events and events containing hypothetical boosted color neutral, but hadronically decaying scalars [13,20,26]. In particular, we identify a robust measure of LDA and VI performance, which does not rely on access to labelled data but at the same time correlates strongly with traditional classification performance measures (like tagging efficiency and mistag rate), and use it to identify parameter and prior ranges most suitable for the example datasets.
The paper is structured as follows: In Sec. 2 we outline the statistical premises and introduce Bayesian generative models upon which LDA is based. We also provide details of LDA training and inference methods and how they can be applied to event classification. In Sec. 3 we apply the general framework to the multi-jet event data in the form of jet clustering history observables and discuss the most appropriate data representations. The benchmark event samples used for our study are introduced in Sec. 4, where we also discuss the data preparation steps that need to be considered when using LDA. Sec. 5 contains the main results of our systematic study of LDA based classification methods applied to the example datasets. Finally we summarize our conclusions and provide an outlook in Sec. 6.

Probabilistic generative modelling for collider experiments
The goal in high energy collider experiments is to gain understanding of the underlying physical processes taking place at very high energies during the events, i.e. the hard collisions. Each event can result in anywhere from O (10) to O(1000) particles being detected away from the beamline and the detector records the energy, momentum, and tracking information of these particles. One must then analyse this high-dimensional dataset and compare it to what is expected from theoretical predictions in order to gain an understanding of the underlying physics. To perform this analysis in practice, typically the dimensionality of the dataset must be drastically reduced. How this is done depends on the type of underlying physics one wants to study and typically involves some combination of jet clustering and grooming, pile-up subtraction, the use of certain high-level observables, and performing cuts to remove unnecessary elements of the dataset. Once the dataset has been processed and the relevant high-level observables have been collated, a statistical analysis comparing the measurements with the theoretical predictions can quantify how much a particular physics model agrees with the data.
Bayesian probabilistic generative modelling is an unsupervised machine learning approach in which one constructs a probabilistic model for a dataset, and then uses approximate inference techniques to estimate the parameters of this probabilistic model directly from the data. If the probabilistic model is a good approximation to how the data was actually generated, this in turn allows to identify patterns in the dataset. In our case these patterns could contain important information on the underlying physical processes registered in collider events.
In the most general sense, a single event e j (j = 1, 2, . . . , N e ) can be represented by a finite list of measurements, e j = {o j,1 , o j,2 , . . .}, where o j,i (i = 1, 2, . . . , N j ) are in general functions (or mappings) of the relevant multi-particle phase-space. We can construct a model for the events by supposing that the measurements have been sampled from a (presumably complicated) joint probability distribution p(e j ) = p(o j,1 , o j,2 , . . .). This is the starting point for the unsupervised analysis techniques used in this paper. Writing a general statistical model describing the generative process of events is not possible in practice. To proceed, it is necessary to impose a set of simplifying assumptions on the joint probability distribution. The functional dependence of this distribution on o j,i must of course be flexible enough in order to account for the multiple physical processes manifest in each event, but it also must be simple enough such that efficient inference techniques can be implemented. In order to model the events using the techniques described in this paper, the phase-space observables furthermore need to be labeled and binned so that the possible measurements o j,i are discrete and finite in number. This requirement allows to construct probabilistic models based on multinomial distributions that describe the occurrence of the measurement bins o j,i in events. If we were to consider unbinned observables then p(e j ) would be constructed from continuous probability distributions. However, given that in practice the measurements we work with are also coarse-grained due to detector granularity and reconstruction uncertainties it is intuitive to select bins for the measurements that reflect this.

Probabilistic generative models
In order to introduce the reader to the concepts and models used in this work we will discuss two different models for p(e j ) of increasing complexity: mixture models, and mixed-membership models. The starting point for the construction of these models is de Finetti's representation theorem [54], which states that if the measurements in the joint probability distribution are exchangeable then the measurements are conditionally independent given some latent variables. The exchangeability requirement simply means that the joint probability distribution should be invariant under a re-ordering or exchanging of the measurements. Note that measurement exchangeability is not to be confused with measurement independence (i.i.d). The former is a weaker condition that leads to more flexible probabilistic models capable of capturing complicated hierarchical patterns in the data. Formally, the theorem states that the joint probability distribution can be written in the form where θ (Θ) represents a hidden latent parameter (space) which is marginalised over in the joint probability. From the p(o j,i |θ) factor on the right-hand side of the above equation we can see that the independence of the different measurements is manifest, and is replaced by a conditional dependence on the latent space variables. When viewing this result from a Bayesian perspective, the probability functions p(o |θ) represent likelihoods while the function p(θ) outside of the product acts as a prior distribution over the latent space. This theorem underpins probabilistic models such as mixture models and mixed-membership models, which we will discuss in the following sections.

Mixture models
We now present one of the simplest probabilistic models for a sample of collider events. The mixture model consists of T probability distributions over the measurement bins, represented by p(o|t, β) for t = 1, . . . , T . These probability distributions are M -dimensional multinomials (multidimensional generalisations of the binomial), where M is the number of bins in observable-space. The parameters of these multinomials are represented by the elements of a T×M dimensional matrix, β t,m having the property that M m=1 β t,m = 1 for all t. Each row in β t,m contains the parameters of the multinomial associated to one of the T probability distributions. A key feature of mixture models is that they assume measurements in a single event have been sampled from just one of these T multinomial distributions. So for each event one of the T distributions is selected from a multinomial probability distribution p(t|ω), where ω = (ω 1 , . . . , ω T ) are the probabilities to select each T . They satisfy 0 ≤ ω t ≤ 1 and T t ω t = 1. In this work we refer to each T latent multinomial distribution as a 'theme', in reference to the field of 'topic modelling' in text analysis where these methods were popularised, thus the ω parameters are referred to as theme weights.
It is useful to describe the probabilistic model in terms of a generative process, outlining the underlying assumption on how the events were generated. The generative process for a collection of events in a mixture model goes as follows: i. Randomly sample a theme t ∼ p(t|ω).
ii. Randomly sample a measurement o j,i ∼ p(o|t, β t ).
iii. Repeat step (ii) for each measurement in the event.
iv. Repeat steps (i)-(iii) for each event in the sample.
The mathematical structure of the model can be realised by taking the representation of the joint probability distribution due to de Finetti's theorem (2.1) and defining θ ∈ R with a prior distribution over the latent space as p(θ) = T t p(θ|ω)δ(θ − t) where p(θ|ω) is a density such that p(θ = t|ω) = ω t . This leads us to the form (2. 2) The generative process described here can be visually represented using a so-called graphical model, see e.g. [38]: the unobserved variables (Latent random variables and model parameters) are represented by white circles, observed data (measurements) are represented by shaded circles, while the conditional dependencies and i.i.d samplings are represented by arrows. To indicate that certain steps in the generative process are replicated, a labelled box or plate is drawn around the relevant parts of the diagram, with the integer label representing the number of times these steps are to be repeated (thus such graphical models are often also referred to as plate diagrams). The graphical mixture model described here is shown in the upper diagram in Figure 1. We can see there that the free parameters of the mixture model given by the theme proportions ω t and the multinomial probabilities β t,m , located outside all plates, have to be defined for the whole event sample in order to initiate the generative process that leads to the measurements o j,i in the inner-most plate.
In collider physics, it is implicit that event samples arise from a statistical mixture of multiple underlying physical scattering process, where each event is a result of one such particular scattering process. Once the corresponding differential cross-sections are binned, the scattering processes can be identified with themes in a multinomial mixture model as described above. Traditionally, the weights ω are computed from first principles using a combination of Quantum Field Theory, Montecarlo event generators tuned to data and experimental knowledge of the detector response.
More recently, mixture models have been used for semi-supervised classification of event samples where the mixture proportions of the themes are a priori unknown. For instance, in the CWoLa framework [11] a set of event samples M 1 and M 2 are taken as mixtures of two underlying themes (with different mixing proportions) S and B, corresponding to signal and background themes. Along this same line, a mixture model was used in Ref. [21] with the aim of disentangling the jet substructure distributions (e.g. constituent multiplicity) of quark and gluon jets using mixed event samples. The same model and technique was also used in ref. [55] in an attempt to separate pp → tttt from backgrounds in inclusive same-sign dilepton events using jet multiplicity distributions.
There are several drawbacks when using mixture models for (unsupervised) event classification tasks. These come from the assumption that all measurements in an individual event are drawn from one theme. The main (related) issues are the following: • Measurements on a single collider event typically receive contributions from many sources, for example in measuring tt production it is inevitable that much of the measurements will be of soft QCD radiation rather than the hard decays products of the top quarks. Mixture models fail to differentiate between different underlying processes in individual events.
• Mixture models are not well suited for modelling datasets where events generated from different themes share common features. Admittedly this is a problem in extracting the themes with the approximate inference techniques, but is a drawback nonetheless, see e.g. ref. [40].
In general, mixtures are useful representations of the data if the mixing proportions ω can be computed from first principles or estimated with other means (such as in (semi)supervised ML), but tend to be less suitable if w are a-priori unknown, as in unsupervised ML. Next we discuss mixed-membership models, which address these issues in an efficient way.

Mixed-membership models
Mixed-membership models also consist of T themes, however a single event is now generated from a mixture of themes rather than being generated from a single theme, as in the mixture model. Each event e j now has its own theme weights ω j = (ω j,1 , . . . , ω j,T ). These are now latent variables of the model (not parameters) that are sampled from a prior distribution p(ω|α), with α being the parameters of the distribution. This prior is in general defined over the (T − 1)-dimensional simplex describing the space of all theme weights (i.e. the space of T -vectors with positive entries that sum up to one). The generative process for a mixed-membership model goes as follows: i. Randomly sample a set of T theme proportions from the prior, ω j ∼ p(ω|α).
ii. Randomly sample a theme t ∼ p(t|ω j ).
iii. Randomly sample a measurement o j,i ∼ p(o|t, β).
iv. Repeat steps (ii)-(iii) for each measurement in the event.
v. Repeat steps (i)-(iv) for each event in the sample.
We can derive the mixed-membership representation of the joint probability by taking Eq. (2.1) and assigning where p(t|ω j ) = ω j,t . Therefore the mixed-membership model for an event is defined by where Ω is the simplex. Note the slight change in notation: the latent space variable θ from Eq. (2.1) has been replaced with ω ← θ (and Ω ← Θ) to keep the notation for mixed-membership models in line with the notation for mixture models. The generative process for the mixed-membership model is shown in the lower diagram of Figure 1. In comparison to the mixture model plate diagram, notice that the theme selection step is now inside the event plate indicating the mixed-membership nature of the model. The free parameters of the mixed-membership model are α from the prior and the multinomial probabilities β t,m of the themes. Using mixed-memberships resolves the problem mixture models have when modelling events sharing similar features. It is therefore possible to model events that are much more heterogenous. It is also clear that the model can now describe events where measurements receive contributions from multiple sources, accommodated now by each event having measurements in a single event sampled from different themes.

Latent Dirichlet Allocation
In Bayesian probabilistic modelling the inference of model parameters is one of the primary tasks, and will be discussed in detail in Sec. 2.3. Choosing the prior p(ω|α) to be the conjugate distribution to the likelihood function makes the parameter inference easier. For mixed-membership models with a multinomial likelihood, as we have here, the conjugate prior is the Dirichlet distribution D(·|α). Choosing p(ω|α) in Eq. (2.4) to be a Dirichlet distribution leads us to Latent Dirichlet Allocation (LDA) [40,42]. The Dirichlet distribution defined over the simplex Ω is a multivariate generalisation of the beta distribution over the unit interval [0, 1], reducing to the beta distribution for T = 2. It is in fact a parametric family of distributions, defined by T positive non-zero parameters, α = α 1 , . . . , α T , and has the explicit form where Γ(x) is the gamma function. In LDA, the Dirichlet prior encodes prior information on how we expect the themes to contribute both to individual events and to the whole sample of events. It does this by influencing the possible proportions ω j selected in the generative process. For example a particular choice of parameters (α) could define a model in which one particular theme contributes much less to individual events than another, or it could define a model in which some events are composed almost exclusively of one theme while others are more equal mixtures of several themes. As mentioned in the introduction, we will be concerned solely with scenarios in which only two themes are relevant. We will therefore focus on the T = 2 case where the Dirichlet prior reduces to a beta distribution. For each event we sample a variable ω 1 from the Dirichlet (beta) distribution representing the proportion of the first theme p(o|1, β), while the proportion of the second theme p(o|2, β) is given by ω 2 = 1 − ω 1 . In this two-theme scenario, the analytical form of the Dirichlet is given by where for now we drop the j subscript labelling the event. When inspecting the above distribution family for different values of the α parameters, one identifies several cases that give rise to different types of distribution shapes. These different shapes encode different assumptions about underlying event data. For instance D(ω 1 |1, 1) corresponds to the flat distribution over the unit interval and would describe events for which the occurrence of either theme in an event is completely random (shown in gray dashed line in Fig. 2). The other more interesting shapes are the following: 1. α 1 < 1, α 2 < 1: bi-modal distributions (shaded in red in Fig. 2) with two maxima at the boundaries of the unit interval (ω 1 = 0 and ω 1 = 1). Physically, this scenario describes samples for which one group of events have measurements predominantly sampled from the first theme, and another group for which measurements are mostly sampled from the second theme. The relative size between each group of events is controlled by the ratio α 2 /α 1 .
2. α 1 > 1, α 2 < 1: uni-modal distributions with a maximum located at one boundary of the interval and the distribution tail stretching towards the opposite boundary (shaded in yellow in Fig. 2). In this case we expect most events to be generated mostly by one predominant theme.
3. α 1 > 1, α 2 > 1: uni-modal distributions with one maximum located at ω 1 = α1−1 α1+α2−2 and two tails stretching towards both boundaries of the interval (shaded in blue in Fig. 2). In this case we expect the bulk of events to be generated by non-negligible mixtures of both themes, with very few events where just one theme completely dominates. However the exact distribution depends strongly on the hierarchy between α 1 and α 2 .
In the following sections we will rely on a useful re-parameterisation of the Dirichlet where we trade the (α 1 , α 2 ) parameters for (Σ, ρ) defined as By convention we have fixed here α 2 ≤ α 1 , hence 0 < ρ ≤ 1. The ρ parameter controls the asymmetry in the shape of the Dirichlet distribution. In Fig. 2 (right) we present a visualisation of the the different shapes taken by the Dirichlet distribution, in terms of these ρ and Σ parameters. The smaller ρ is, the more probable events will be composed of measurements drawn from the first theme (t = 1). A way to see this, is by considering the expectation for sampling the themes from the Dirichlet during one measurement sampling. One finds where p t are shorthand for the theme multinomials p(o|t, β t,m ) and E D [·] denotes the expectation with respect to the Dirichlet distribution. To derive this we have used the relation for the mean value µ = 1+ρ . This indicates that in the limit ρ → 0 of asymmetric Dirichlet priors, there will be a prevalence of the first theme over the second theme when sampling measurements for an event, while in the limit ρ → 1 the priors become symmetric and events will tend on average to have measurements coming from both themes in similar proportions. The parameter Σ, on the other hand, controls to what degree individual events in the model are described by mixtures of themes for a fixed value of the asymmetry parameter ρ, i.e. to what degree the model is a mixed-membership rather than just a mixture model. For large Σ we expect that events are generated from mixtures of both themes, whereas for Σ 1 we expect that events are generated from pre-dominantly one theme. In fact, it is known that the Beta distribution will approach the Bernoulli distribution in the limit of Σ → 0 with fixed ρ. In general, a Dirichlet for T themes will approach a T -dimensional multinomial distribution in the limit T t=1 α t → 0 [56]. In this limit the Bernoulli probability p is equal to the expectation value of the Dirichlet, 1 1+ρ . Therefore in the Σ 1 limit each event is approximately generated by just one theme, and the LDA mixed-membership model tends to the mixture model described previously. What happens is that when for every event you sample (ω j,1 , ω j,2 ) from the Dirichlet, the only weights that have a non-zero probabilities in the Dirichlet distribution are (ω j,1 = 1, ω j,2 = 0) and (ω j,1 = 0, ω j,2 = 1), where the probabilities for selecting each of these from the Dirichlet is 1/(1 + ρ) and ρ/(1 + ρ), respectively. In this mixture model limit ρ takes on the role of the ratio of theme weights ω 1 /ω 2 . In Fig. 2 we can then identify the boundary at the y-axis as a mixture model with ω 1 /ω 2 = ρ.
The event samples we analyse can contain anywhere from O(10 3 ) to O(10 6 ) events and the number of unique measurements can also be very large. This means that for parts of the event sample the data will be very sparse, i.e. there will be many o j,i that do not appear often in the sample. This can lead to issues in the inference procedure, with these rare measurements being assigned zero probability in the themes, which then leads to problems during the classification of events. This issue can be solved by so-called 'smoothing' [40]. Smoothing involves placing a M -dimensional Dirichlet prior on the variables of the theme probability distributions, such that no measurement can have a zero probability. The generative process is then augmented as shown in the plate diagram in Fig. 3. We fix each of the M − 1 parameters of the Dirichlet prior to 1/M as default, changing this does not lead to significant changes in the output of the algorithm. Henceforth we will focus on smoothed LDA and refer to it simply as LDA.

Approximate inference
Ultimately, the goal is to estimate the posterior distributions for the variables in the LDA model given the observation of experimental data. The joint probability over all events e = (e 1 , . . . , e Ne ) for the LDA model can be written as where we have not marginalised over the model variables. On the left hand side of this equation ω represents the list of theme weights for all events in the sample, and t represents the list of topic assignments for each o j,i in all events in the sample. The joint probability is the probability of having generated these events given the LDA model with the themes each being sampled from a Dirichlet parameterised by η, and the theme weights being sampled per event from a Dirichlet parameterised by α. From this we want to approximate the posterior distribution p(β, ω, t|e). Bayes theorem states that this posterior should have the form p(β, ω, t|e) ∝ p(e, β, ω, t)/p(e). The term in the numerator is calculable, however the difficulty lies in the normalisation term, the evidence. This term is an intractable integral and prevents us from straightforwardly obtaining a closed form expression for the posterior distribution. We can however obtain an approximation to the posterior distribution using approximate inference techniques. Following [46,57] we choose the Variational Inference (VI) technique. With VI the log of the evidence is written as where KL stands for the Kullback-Leibler divergence [58] and we use ξ as a shorthand for the model variables, (β, ω, t). The function q(ξ) has been introduced here as an approximation to the posterior distribution, where q(β, ω, t) is assumed to factorise in each variable, reflecting how these are grouped in LDA. The goal of VI is to approximate this q(ξ) function. On the right-hand-side of Eq. (2.10) we have two terms: the Evidence-Lower-BOund (ELBO) L, and the KL divergence between the posterior and its approximation. The KL divergence is always greater than zero, and is equal to zero only when q(β, θ, t) = p(β, θ, t|e). The term L is then a lower bound on the evidence, hence calling it the ELBO. We cannot compute the KL divergence because we cannot compute the posterior, however the joint likelihood and therefore the ELBO term can be computed. The goal is then to maximise the ELBO with respect to q(β, ω, t).
Because the evidence term on the right-hand-side is completely independent of q(β, ω, t), maximising the ELBO is equivalent to minimising the KL divergence between q(β, ω, t) and the posterior, thus finding a q(β, ω, t) which is a good approximation to the posterior. VI gives us a prescription for doing this in mixed-membership models like LDA. The LDA model belongs to the conjugate exponential family of models. For these, one can show that the terms in the posterior approximation must have the following form: (2.12) So to optimise q(β, ω, t) we need to maximise the ELBO with respect to the parameters φ j,i , α j,t , γ t,m . Note that there are T (n e + M ) + N J parameters here, where N J is the total number of measurements in all events in the sample. Due to the specific structure of LDA, i.e. the conditional dependencies and the use of conjugate priors, closed form expressions of the parameters that maximise the ELBO can be written in terms of each other (see below). The VI algorithm then dictates how to update the parameters iteratively such that it converges to a maximum of the ELBO function. Due to the large number of events from which we infer the parameters of the approximate posterior, the basic VI algorithm is inefficient. To implement this in a way which scales well to large datasets we employ an extension of this algorithm called Stochastic Variational Inference (SVI). This technique uses results from stochastic optimisation methods to speed up the inference by inferring from smaller randomly sampled subsets of the data on each update. These are called chunks of data, and their size is determined by the chunk size n c . The algorithm will run for a total number of passes through the dataset, defined by n p . We denote the total number of chunks of data processed by N . The algorithm is thoroughly defined as follows:
• Procedure i. Update q(φ j,i ) by iterating through j and i and setting (2.13) ii. Update q(ω j ) by iterating through t and setting: (2.14) iii. Check for convergence: if the change in α is less than the threshold parameter α thresh , end loop.
i. Update q(β t ) by iterating through t and m and setting: (2.15) (d) Evaluate the normalised (per-o j,i ) ELBO for this chunk of data from the dataset, L n . This can be used to check for convergence.
The algorithm makes use of the hierarchical structure of the model, with local variables (ω, t) being optimised until convergence before an update on the global variables (themes, β) is performed. While optimising the local variables the algorithm uses the (l − 1) th approximation of α j,t and the (n − 1) th approximation of γ t,m to calculate the l th approximation of φ j,i , before using the l th approximation of φ j,i to calculate the l th approximation to α j,t . Once L updates of this sort have been done, or until convergence has been met according to α thresh , the themes are updated using the local variables obtained at the end of the inner loop. A few points of note: (i) the ψ(·) in Eq. (2.13) arises from the expectation of the natural logarithm of the Dirichlet distribution, (ii) the η in Eq. (2.15) is from the prior on the theme distributions, (iii) the I(·) in Eq. (2.15) is an indicator function, which is equal to 1 when the equality in the brackets is true, and equals zero when it is not.
It is also important to note the key role played by the latent variable φ j,i , which encodes information on which theme each measurement in each event was sampled from. This variable captures the cooccurrences between different measurements in the event sample. For example, if some measurement m co-occurs with another measurement m in many events, this information is stored by the φ j,i variable and through iterative updates these two measurements are more likely to end up with large weights in the same theme distribution. It is through the presence of co-occurring measurements in the data that this algorithm is able to disentangle different underlying physical processes occurring in the events. Without these co-occurrences, or a method to utilise them, the best an unsupervised algorithm can do in identifying rare events is to search for outliers in the data. Thus searching for these co-occurrences is essential in extracting a generative description of the events. We pay particular attention to this in deciding upon a data representation (Sec. 4.4) for our benchmark studies.
In this work we have used the implementation of the SVI procedure as described above within gensim [59], a software package for performing unsupervised semantic modelling of plain text. The parameters of the SVI algorithm are the chunk size n c , the number of iterations L, the alpha threshold α thresh , and the number of passes n P . On the other hand, the learning rate δ n is not constant in gensim but follows where τ 0 is the offset and κ is the decay parameter. This stochastic inference procedure is proven to converge to a local minimum if . The convergence of the whole algorithm can be assessed using the ELBO, or equivalently, the perplexity defined as P n = 2 −Ln . (2.17) A lower perplexity means that the ELBO is larger and thus the KL-divergence between the posterior and approximate posterior is smaller. In Sec. 5.3 we study how the choices of the offset and the chunk size parameters of the algorithm affect the convergence and performance of the models as well as their final perplexity.
With the posterior distributions at hand, we would typically like to infer the theme distributions and the theme weights of individual events. Ideally, we would maximise the posterior distributions with respect to the variational parameters to obtain best estimates of the theme parameters and mixing weights for the LDA model, however this is computationally difficult [60]. A good approximation for the theme parameters and mixing weights can instead be obtained by simply taking the expectation values,

Event classification
Once we have the posterior approximation and the estimates of the theme distributions (β t,m ) and the theme weights for each event (ω j,t ), we want to be able to use this information to cluster events into one of two clusters, C 1 or C 2 . The mixed-membership model assumes that each event is already a mixture of two types of underlying themes, so we could simply cluster the events by placing cuts onω j,1 for each event:ω (2.20) Classifying events in this way does yield good classification performance, as demonstrated in our earlier work [37]. However, we have also found that using instead the likelihood-ratio classifier yields a more robust performance over a larger region of model prior space. The likelihood ratio classifier is constructed using just the themes as extracted from the data, and not the theme weights. The likelihood ratio can be written as where I(·) is again the indicator function, equal to 1 when the expression in brackets is true and equal to 0 when it is not. With the likelihood ratio we also need to perform a cut in order to cluster the events,

Evaluating the performance of a model
We can evaluate how well a particular classification technique performs using Monte-Carlo generated data, for which we know the truth labels. Suppose we generate two samples of events, sample 1 and sample 2, and we produce a mixed sample of events from both pure samples. We can train an LDA model with T = 2 on this mixed sample to extract 2 theme distributions that describe the data. We can then either use the extracted theme weights, or the likelihood ratio to cluster the events in either C 1 or C 2 . Suppose the goal is to cluster events from sample 1 into C 1 , and events from sample 2 into C 2 . We can test how well the algorithm performs using the truth labelled data. We can compute the fraction of events from sample 2 correctly assigned to C 2 as a function of the cut, ε 2 (c). And analogously we can compute the fraction of events from sample 1 incorrectly assigned to C 2 as a function of the cut,ε 1 (c). The Receiver-Operating-Characteristic (ROC) curve is then defined as the curve tracing the true positive rate as a function of the false positive rate, i.e. ε 2 (ε 1 ). Two measures we use to evaluate the performance of the LDA models we have trained are 1. Area Under Curve (AUC): the integrated area under the ROC curve.
2. Inverse mistag at fixed efficiency: The AUC is a useful statistic when we are interested in the overall general performance of the classifier. However when the experimental analysis is focused on identifying rare signals in a sample of events, the AUC statistic is not always the most relevant indicator of performance. What is required is a statistic which demonstrates a strong rejection of background events coinciding with the acceptance of a moderately large number of signal events. This is captured by the inverse mistag at fixed efficiency.
The goal is to determine the LDA priors and VI parameters that lead to the best performing models -the ones best characterising and differentiating the pure samples and at the end offering the best classification performance. As we demonstrate in Sec. 5 the classification performance is indeed correlated with the perplexity of the model, Eq. (2.17), when calculated using all the data in the event sample. A decrease in perplexity corresponds to an increase in the ELBO and therefore a decrease in the KLdivergence between the approximate and true posteriors. Hence the better we approximate the posterior, the better we expect the classifier to perform.

Learning latent jet substructure
So far we have introduced probabilistic generative generative models as a tool for analysing experimental data, in particular for extracting rare signals in a dataset. As an example of how this works in practice, we apply LDA to the analysis of di-jet events. In this section we explain how to represent di-jet events in terms of a sequence of exchangeable measurements o j,i , and discuss how the mixed-membership model is well suited for describing di-jet events, and finally we discuss our choice of o j,i representation and basis.

Jet de-clustering and substructure observables
When coloured particles are produced at high energy colliders the subsequent QCD showering, fragmentation, and hadronization results in many hadrons in the final state. If the transverse momentum of the initial particle is large enough, all of these final-state hadrons will be registered by the detector within a single localised region in (η, φ). These clusters of hadrons are referred to as jets, and there have been many different clustering techniques developed to define jets based on the four-momenta of the constituent hadrons. Of these different techniques, the sequential recombination schemes [61][62][63][64][65] have become the standard algorithms for jet clustering. When applied to data collected for a single collider event, the algorithm can reduce the complexity of the data to a handful of jets, each representing a final state of some high-energy parton produced in the hard collision. In order to arrive at a single clustered jet from hundreds of hadrons, the sequential recombination scheme goes through a set of pairwise intermediate clusterings in which the four-momenta of two subjets are combined to form a larger subjet. De-clustering the jet and analysing these individual splittings can provide crucial information into the physical processes taking place during the event. For example, if the initial particle is a top quark with a large transverse momentum, the resulting jet will contain splittings that describe the decay of the top Figure 4: In the left plot we show schematically how the sequential unclustering algorithm proceeds, with the whole jet J being repeatedly separated into two subjets (j1 and j2 with mj 1 > mj 2 ). In the right plot we then show how the feature representation of the data maps onto this unclustering, with each oj,i being mapped to a node in the unclustering tree. Note that the ordering of these oj,i terms with a single jet does not matter.
quark into a bottom quark and a W boson, and splittings describing the decay of the W boson. These features are readily exploited by existing traditional top taggers, see e.g. [66].
To analyse di-jet events with the probabilistic generative models outlined in this paper, we extract measurements from each of these splittings in the (de)clustering procedure. At each splitting we construct a number of observables using the four-momenta of the subjet being de-clustered (j 0 ), and the two subjets resulting from the de-clustering (j 1 and j 2 ). The process for doing this is straight-forward but we must decide on a fixed set of observables to use throughout, this will be discussed in detail in Sec. 4. Once we have collected a set of measurements at a splitting, we must then bin their values, e.g. according to the detector resolution, but more importantly according to what the algorithm can realistically handle. The relationship between the size of the observable bins and the algorithm performance is discussed in detail in Sec. 4.4. One of these binned lists of observables is what we refer to as a measurement o j,i in the probabilistic model. Because of the binning there will be a finite (although possibly very large) number of values that each o j,i can take. In addition to the kinematical observables at each splitting we add one more categorical observable, that is a label identifying which jet the splitting belongs to. With these methods we are describing the whole event rather than a single jet, so the information to which jet a splitting belongs is important to properly characterise the whole event. Of course, including measurements from all splittings in the jet clustering history is not necessary, and would hinder the VI algorithm in extracting themes relevant for describing a potential signal. Thus we need to impose cuts such that most of the splittings that are irrelevant to uncovering rare signal events are removed, for example a simple cut on subjet masses removing splittings of subjets with m 0 < m cut could remove many of the soft emissions occurring near the end of the QCD showering process. The whole process, starting from the raw event data, can be described as follows: 1. Cluster the event with a large jet radius, and keep only the two hardest jets.
2. De-cluster each jet, extracting a list of measurements at each splitting.
3. Bin the measurements from each splitting appropriately, and assign a label identifying which jet the splitting belongs to.

Apply kinematical cuts on the splittings.
An event is then described by an ordered sequence of o j,i each representing a splitting, where each o j,i consists of a list of binned measurements and a label identifying which jet the splitting occurred in. We would like to point out that this method, and the model in general, does not rely on any specific clustering scheme. Any set of measurements which describe substructure kinematics of the jets could be used.

Probabilistic models of jet substructure
At the core of the probabilistic models discussed in Sec. 2 is De Finetti's theorem. Under the assumption that the measurements o j,i used to describe the events are exchangeable, this theorem allows us to derive, based on additional modelling assumptions, the different latent structures in mixture models and mixedmembership models. Constructing the o j,i variables for jet substructure as described in the previous section is in line with the exchangeability assumption. Sequential jet clustering algorithms do impose an ordering on the splittings due to the pairwise nature of the algorithms and the procedure through which the next subjets to be clustered are selected. However it is the kinematical properties of the splittings that cary most of the interesting physical information, not the order in which they occur, as shown e.g. in [67]. We see then that the latent themes in both the mixture and mixed-membership models for di-jet events are probability distributions over the space of possible splittings (de-clusterings) that can occur within the two leading jets. The generative processes for the mixture model and mixed-membership (LDA) model are of course different. In a mixture model a theme would ideally be associated to the specific (hard) partons produced in the collision. Each splitting in an event described by a mixture model is sampled from just one theme, therefore this theme must represent all of the physical processes occurring within the jets produced within that event. In a mixed-membership model however, different themes can be associated to different physical processes occurring within the jets of a single event. Each event in a mixed-membership model is composed of a mixture of themes, just as there are mixtures of different physical processes occurring within each event. The theme proportions for each event are selected individually from a prior distribution, whose parameters are important in the modelling. Measurements in each event are 'generated' by first sampling theme proportions from the prior, then for each splitting o j,i a theme is drawn from the theme proportions, and a splitting is sampled from that theme.
As an example consider modeling a mixed sample of events consisting of QCD di-jet events (pp → jj), and top quark pair-production events (pp → tt → (W + b)(W −b ) → jj) where the top quarks are boosted enough such that the decay products of a single top are clustered into a single jet. The splittings within a QCD jet will be predominantly soft with the number of splittings at higher k T being monotonically suppressed. For the top jets the decay chain also involves many coloured particles (the top, the bottom, the decay products of the W boson), therefore there will be many, predominantly soft, gluon emissions occurring within the top jets as well. However there will always also be a few hard splittings corresponding to the decay of the top quark to the bottom quark and W boson, and the decay of the W boson to light quarks. Using a two-theme mixture to model this event sample would ideally lead to one theme describing all the splittings within QCD jets, and one theme describing both the hard (decay) splittings and soft (QCD) splittings within the top jets. With a mixed-membership model on the other hand, the soft splittings occurring within both the QCD and top jets can be modeled by one theme, with the other theme describing just the hard splittings related to the decay dynamics inside top jets. This seems like a natural setting in which to search for rare new physics signals in di-jets at high-energy colliders.

Choosing a data representation for the jet substructure
The discussion so far has not been specific to which observables are to be measured at each j 0 → j 1 j 2 splitting in the jets. In this subsection we will discuss and justify two bases of observables, with each basis using a different cut to determine which splittings are included in the analysis. Note that in the end we will only use a subset the observables from each basis, as explained in more detail in Sec. 4. The first choice is what we refer to as the mass basis, see e.g. [50,66]: These are the mass of the (mother) (sub)jet being de-clustered, the mother/daughter subjet mass drop, the daughter subjets' mass ratio, the k T distance between the daughter subjets defined in the usual way as k T = p T,2 ∆, where ∆ 2 = (y 2 − y 1 ) 2 + (φ 2 − φ 1 ) 2 , and the helicity angle between the mother (sub)jet and the daughter subjets as defined e.g. in [53,68]. In this basis we only include splittings from the jets in which the subjet being de-clustered has a mass m 0 > 30 GeV. The second choice is what we refer to as the Lund basis [67]: Lund-basis: where R is the jet radius, z = p T,2 /(p T,1 + p T,2 ), κ = z∆, ψ = tan −1 (y 2 − y 1 )/(φ 2 − φ 1 ), and p T,1 > p T,2 .
In this basis we only include splittings from the jets which lie on the primary Lund plane. The primary Lund plane is defined as the path through the clustering history, starting from the clustered jet, and continually moving through the pairwise splittings to the subjet with the largest p T until the end of the clustering history. One advantage of the primary Lund plane compared to the mass-basis is that it offers a clearer interpretation in terms of hard vs. soft (i.e. perturbative vs. non-perturbative) splittings, see Ref. [67] for details. We emphasise that these two bases do not just differ in the observables (in fact both bases include the subjet mass m 0 and (log of) k T ), but the different cuts make a considerable difference in the splittings which are used for the description of the jets. In Sec. 4.4 we explore how some features of the dataset change as we vary the binning used for these observables. Here we only specify the default bin sizes: for the mass-basis observables we bin the measurements in intervals of {10 GeV, 0.05, 0.05, 0.05, 0.1} , while for the Lund basis we use {10 GeV, 0.2, 0.2, 0.05, 0.2, 0.2} .
The last thing to discuss in terms of the data representation are the jet labels. In Sec. 3.1 we discussed the importance of including jet labels to differentiate between splittings occurring in the two jets, however we did not specify how these jets should be labelled. Naively, because we select the jets according to p T , one might choose to also label the jets in the same way with J 1 being the jet leading in p T and J 2 being the jet subleading in p T , i.e. p T,J1 > p T,J2 . However this is not suitable in practice. In the top quark pair production example discussed in previous subsection the ordering of the jet labels is not so important, since both jets in the event are top jets and have the same decay structure. However not all of the signals we may imagine will be so simple. In many cases, including the new physics example studied in this paper, the two jets in the final state will have been seeded by two different particles of different mass and thus they will both have distinctly different decay dynamics. Being able to differentiate between these different structures is not just important for classification, but is also important for a physical interpretation of the themes learned through the VI algorithm. Therefore in the case where the signal events contain two different jets, we would like to be able to associate the (J 1 , J 2 ) labels with splittings from one jet or the other, consistently across the whole sample. This will not happen if we label the jets by their p T , instead the best way to do this is by labelling the jets according to their jet mass m J , such that m 1 > m 2 .

Algorithm set-up
There are a number of parameters that determine how the VI algorithm is implemented, these have been discussed in Sec. 2.3. In our benchmark examples we use the following choices, which produce robust results across a wide range of scenarios: passes n p = 200, chunk size n c = 10 4 , iterations L = 100, offset τ 0 = 1000, α thresh = 10 −8 , decay κ = 0.5. These choices are justified in Sec. 5.3 where we discuss in particular how changing the chunk size and the offset affects the convergence of the algorithm, and the performance of the classifier.

Benchmark di-jet events
We perform our analysis using two benchmark scenarios, (i) boosted top quark pair-production pp → tt → bbW + W − , and (ii) a hypothetical 3 TeV vector W plus a 400 GeV scalar φ model, with the dominant production and decay chain pp → W → W (φ → W W ). Since the choices of observables here focus only on the jet substructure, we consider only the hadronic final states of the W bosons in both cases. Consequently, the main background process in both scenarios is the QCD di-jet production. All event samples were generated using aMC@NLO [69] interfaced with Pythia 8.2 [70] for showering and hadronization, and FastJet 3.4.1 [71] for jet clustering. The events were generated at a collision energy of 13 TeV and the jets were clustered using the CA algorithm [64,65] with R = 1.5. No jet grooming was performed. Finally, for tt (W ), jets with p T < 300 (400) GeV were discarded. The detector effects were not simulated, however we checked that the effects of subcluster energy smearing consistent with the Delphes 3.4.1 [49] simulation of the ATLAS detector had no significant effect on the results.

Boosted top quark pair-production
In the recent years the pp → tt → bbW + W − process has become a standard benchmark for supervised machine learning applications to particle physics [72]. Despite there being no need for an unsupervised top tagging algorithm, we find that this is a nice example to demonstrate the power of these techniques as applied to a physical process that is already well measured and understood.
In Figs. 5 and 6 we plot the pure signal (tt jets) and background (QCD di-jets) samples in the (m 0 , m 1 /m 0 ) and (log R/∆, log k T ) planes, respectively. We see in Fig. 5 that the hard splittings corresponding to the decay of the top quark to the W boson and the decay of the W boson to light jets are clearly discernable. The top quark decay is indicated by the two clusters (overdensities) of measurements at m 0 175 GeV, with the cluster at m 1 /m 0 1 being due to the clustering of light radiation around the subjet containing all of the top quark decay products, while the cluster at m 1 /m 0 m W /m t corresponds to the splitting that separates the bottom and W subjets from within the top jet. The decay of the W boson is indicated by the two clusters at m 0 80 GeV. Again the cluster at m 1 /m 0 1 is due to the clustering of soft radiation around the subjet containing the W boson decay products, while the cluster at lower mass drop shows splittings that separate the decay products of the W boson. The fact that this cluster is at m 1 /m 0 0.2 does not indicate that the W boson is decaying to a state of mass 0.2m W , instead this is an artifact of the definition of mass drop m 1 /m 0 with m 1 > m 2 ordering. In mass drop we take m 1 to be the heaviest of the subjets in the splitting, therefore the distribution of the mass drop is skewed away from zero. If we instead had plotted m 2 /m 0 we would see that this cluster is pushed towards m 2 /m 0 0. The (m 0 , m 1 /m 0 ) distribution for the background QCD jets is smooth and monotonically decaying at large m 0 and small m 1 /m 0 . In Fig. 6 we see that the splittings corresponding to the hard decays of the top quark and the W boson are indicated by the two overlapping clusters at log k T 5 and log R/∆ 1. Apart from the obvious difference in choice of observables here, we should also keep in mind that the actual splittings which pass the cuts here are different than those that pass the cuts in Fig. 5 (see Sec. 3.3). This choice leads to a larger overlap between the background and signal distributions, as seen by the stream of splittings at low log k T , however there is still a good separation between the features that distinguish the tt jets from the QCD background jets.

A 3 TeV W model with a 400 GeV scalar
The second benchmark is an example of a new physics signature which could be searched for at highenergy colliders using these techniques. The new physics process is the production of a 3 TeV W boson at a collision energy of 13 TeV, which decays to a SM W boson and a 400 GeV new physics scalar boson φ. The scalar boson φ then subsequently decays to two SM W bosons. The model has been introduced and previously studied in [13,20,26]. For the study in this paper we consider only the hadronic final states of the W bosons. The mass difference between the W and its decay products mean that the constituents from the scalar boson and the W will be clustered into a pair of boosted jets, making the jet substructure an important tool for any analysis of these events. This was first studied in [20] and used as a benchmark for the unsupervised CWoLa search technique in [13]. In the precursor to this paper [37] this example was also used. For this benchmark, in addition to the p T cut at 400 GeV, events were selected in the di-jet invariant mass window [2700, 3300] GeV to encapsulate the peak in the production cross-section of the W boson.
In Figs. 7 and 8 we plot the pure signal (hadronic W final states) and background (QCD di-jets) samples in the (m 0 , m 1 /m 0 ) and (log R/∆, log k T ) planes, respectively. The origin of the features in these plots is completely analogous to those for tt in Figs. 5 and 6. One important difference to note is that the plots for the leading and subleading signal jets are different, while for tt they were equivalent. This is obviously because here our signal consists of two jets of different origin. This highlights the importance of including labels for the jets in our representation of the measurements in LDA, in order to properly characterise the signal from the posterior theme distributions. In Fig. 7  in the tt case. The distributions for the background jets in Fig. 8 are similar to the distributions for the background jets in Fig. 6, as expected. Interestingly, the distributions for the W and tt signal jets in Fig.'s 8 and 6 are more similar than they are in Fig.'s 7 and 5. This is because the observables in the former case measure the k T and angular separation, rather than the masses of the (sub)jets in the splittings. Also, the observables are now both binned and displayed on a logarithmic scale, making any differences at large k T less pronounced. One obvious difference between the W and tt distributions in Fig.'s 8 and 6 is that the clusters associated with the different hard decays are more distinguishable from each other in the W case than in the tt case. This is primarily because the mass difference between the scalar boson φ and the SM W bosons is much larger than the mass difference between the top quark and the SM W bosons. Another difference is in the amount of soft radiation in the tt jets and the W jets, this is due to the top quark carrying color charge and the φ boson being color-neutral. The similarities in the two distributions do however suggest that any classifier selecting events with splittings in the large k T region may work reasonably well as a generic anti-QCD tagger.

Comparing classification power of different observables
There are many possible choices of observables that we could include in our analysis of di-jet events using LDA. All of the observables discussed in Sec. 2 carry some ability to distinguish between signal events and QCD background events, and some observables will be more useful than others depending on what the signal process is. In this section we study the classification power of each of these observables, and some combinations of them, using a simple binned likelihood classifier. To construct the binned likelihood classifier we split our signal and background datasets each into 'training' and 'testing' sets. We then compile counts of how often each measurement bin occurrs in each of the signal and background training sets, and normalise these to give us a discrete probability distribution for the signal and background samples. For each event in the testing sets we then compute the likelihood ratio as defined in Eq. (2.21), except with the β's replaced with the binned likelihood multinomials. The results are summarised in Fig. 9. First thing to notice is that the observables are in general better at classifying W events than tt events, the obvious reason being that the W signal contains splittings that are very rare in QCD background events, in particular rarer than the splittings in tt events. In the first row we show the classification performance of the observables for the tt sample. The best performing individual observables are log R/∆ and m 0 from the Lund basis. Note that m 0 appears twice, once in the mass basis and once in the Lund basis. The difference in classification power here comes only from the cuts performed on the dataset, because in the mass and Lund bases these cuts differ, as explained in Sec. 3.3. In combining observables we see that the best performing pair of observables in the Lund basis are log k T and z. In the mass basis the best performing pair are m 0 and k T , however the differences between this pair and others are miniscule. In the second row we in turn show the analogous plots demonstrating the classification power of the observables for the W sample. The results here are different than for tt, which is not surprising since not only are the masses of the particles produced in the collision different, but also the top quarks are coloured, have spin 1/2, and therefore produce a very different radiation pattern than the W decay products (colourless W with spin 1 and φ with spin 0 ). The best performing individual observable here is the subjet mass m 0 in both bases, mass and Lund. In combining observables we find that the best performing pair of observables in the Lund basis are m 0 and log k T , while in the mass basis the best performing pair are m 0 and m 1 /m 0 . Again the differences between these pairs and some of the others are very small. We do not study combinations of more than two observables, because we find that adding more observables to the best performing pairs does not provide any appreciable difference in classification power of the binned likelihood. Interestingly however, we have also found that the observables which provide the best performance with the supervised binned likelihood classifier are not necessarily the best to use in an unsupervised analysis based on LDA and VI, which crucially depend on patterns of concurrence of two or more measurements within the same event. Therefore in the unsupervised analyses we focus solely on two pairs of observables (i) (m 0 , m 1 /m 0 ), (ii) (log k T , log R/∆), based on their robust performance, good interpretability and since they are already commonly used in jet classification tasks. Note that while the classification power of any combination of observables in the supervised binned likelihood does not necessarily indicate the best choice to use in any given analysis, according to the Neyman-Pierson lemma [73] it does represent an upper bound on the classification power of any unsupervised classifier based on the corresponding themes of these same binned observables, as extracted from LDA.

Measurement co-occurrences
When introducing LDA and VI in Sec. 2 we first encountered the importance of measurement cooccurrence in individual events. Certain measurements within individual events must exhibit a pattern of co-occurrences in order for the inference algorithm to recognise and extract the corresponding theme distributions. In other words, VI is unable to extract any information from unique measurements o j,i appearing only once in the dataset. In this section we explore how the measurement co-occurrences in a dataset vary with the choice of observables and their binning, highlighting the importance of the data representation in the construction of the unsupervised classification strategy. We demonstrate this on the example of the W model, while we have checked that the tt example exhibits analogous behaviour. We quantify the measurement co-occurrences by calculating the number of unique o j,i per bin of one of the observables, marginalising over the rest, and dividing this by the total number of measurements in that bin. The lower this 'fraction of unique o j,i ' is the stronger the co-occurrences are, and the easier it will be for the VI algorithm to extract themes accurately describing the underlying structure of the events. In the upper row of Fig. 10 we show how the co-occurrences in the mixed W -QCD sample with observables (m 0 , m 1 /m 0 , m 2 /m 1 , k T , cos θ) vary per m 0 bin. On the left hand side we do this for a mixed sample of 9×10 4 signal and background events with varying S/B, while on the right hand side we do it for varying amounts of pure signal events (100, 500, and 1000). We focus on such small numbers of signal events because we are interested in finding rare signals, in which the co-occurrences will inevitably be less apparent. Conversely, in a sample containing a large fraction of signal events the structure of the signal events would be more easily uncovered due to the strong co-occurrences between the measurements. As expected, the co-occurrences are strongest at m 0 m W and m 0 m φ , since the signal events are more likely to contain splittings with these masses (see Fig. 7). However, as discussed in the previous subsection, including more than two of these observables in the analysis does not significantly increase the classification power of the binned likelihood classifier. At the same time we demonstrate in the second row how restricting the analysis to including just one such pair (m 0 , m 1 /m 0 ) can drastically increase the strength of the co-occurrences in the event sample. This provides further justification for including no more than two observables in the LDA analysis. In Fig. 11 we display the same information for the Lund observables where we measure the co-occurrences as a function of log k T . We see again that the co-occurrences are strongest at the points where the signal features are most pronounced (see Fig.  8), and that by restricting the observables used at each splitting we can increase the frequency of these co-occurrences significantly.
Before moving on we examine another handle we have on increasing co-occurrences in the event sample, that is by varying the binning used for each of the observables. To do this we keep with the W sample and focus on just the pair of (m 0 , m 1 /m 0 ) observables. The results are summarised in Fig. 12. We note that some of the bin sizes used in this plot would be impossible to use in practice due to the finite experimental resolution, however they still serve as useful examples to demonstrate the potential effects of varying bin sizes in the analysis. In the upper left plot we show the co-occurrences for the whole mixed sample with S/B=5% and four different choices of bin sizes. In each of the other three plots we then show the co-occurrences for different numbers of signal events (again 100, 500, and 1000) and varying bin sizes. As expected, larger bin sizes result in stronger co-occurrences, however the size of this effect is not as large as the effect of removing observables from the analysis completely. For example, in all cases the strength of the co-occurrences at m 0 m W is almost the same for all choices of the binning. The effect due to different bin sizes is more clearly seen away from these areas of strongest co-occurrence, where larger bin sizes result in stronger co-occurrences across the whole m 0 range. In particular, this may aid in better modelling of the signal and background distributions away from m 0 m W and m 0 m φ . On the other hand, increasing the bin size will also make the signal features less pronounced, potentially reducing the classification power in the same way as a binned likelihood classifier becomes worse and worse approximation to the Neyman-Pearson un-binned likelihood. Therefore there is trade-off here between potential classification power and the ability of VI to extract optimal theme distributions from the data. In particular we find that the constant δm 0 = 10 GeV bin size provides the best trade-off for the examples satudied here and is also in practice close to the variable binning δm 0 = 0.05m 0 mimicking the typical (energy) resolution of modern particle detector calorimeters. case because ultimately we are interested in uncovering rare new physics signals in the data. For each benchmark we construct two separate sets of mixed samples, one using the mass basis observables (m 0 , m 1 /m 0 ) and one using the (primary) Lund plane (log k T , log R/∆), as outlined in Sec. 4.3. For each mixed sample, 12 in total, we train LDA models with different Dirichlet parameters, extract the themes, and use them to cluster/classify the events in the sample. We perform an extensive scan over the Dirichlet parameters, training 961 models for each mixed sample, scanning over the (ρ,Σ) parameter space in the ranges −3 ≤ log 10 ρ ≤ 0 and 0 ≤ Σ ≤ 3 with resolution δ log 10 ρ = 0.1 and δΣ = 0.1, respectively. For each of these mixed samples we plot the inverse perplexity (P −1 ) calculated over the whole sample, as well as the AUC, and the inverse mistag at fixed efficiency, both calculated on separate pure signal and background samples. In particular, we are interested in how the inverse perplexity, which can be computed from unlabelled data alone, is correlated with the performance of the classifier (AUC and −1 b ( s = 0.5)), which is inaccessible in absence of labelled data. We also consider how both the inverse perplexity and performance behave in different regions of the Dirichlet parameter space, as discussed in Sec. 2.2.

Unsupervised classification of boosted tt production
Starting with the boosted tt scans using the (m 0 , m 1 /m 0 ) observables in Fig. 13, the first obvious trend we see is that the performance of the LDA classifiers is reduced as the number of signal events in the sample is decreased. This occurs because with less signal events it becomes more difficult for the algorithm to extract an accurate description of the signal in terms of the themes. Thus when we construct the likelihood ratio classifier as in Eq. (2.21), the performance is reduced for smaller S/B. In a full search strategy this would make it harder to isolate low S/B signals. This is a universal trend that we also see in Fig.'s 14, 16, and 17. For S/B = 1 the best performing models tend to be in the regions with larger ρ and smaller Σ. This intuitively makes sense, as the larger ρ implies that the algorithm attempts to extract themes under the assumption that the different types of events making up the mixed sample occur in comparable proportions. Also, the smaller Σ (≤ 1) means that the prior over the theme proportions is bi-modal (see red contours Fig. 2), and so events are assumed to be mostly composed of one dominant theme. We see some correlation between the perplexity and performance here, although because the performance is good in most of the parameter space for S/B = 1 there is not much to note. As we lower the S/B we see immediately, even at S/B = 0.5, that a ridge-like feature forms in the same region in both the perplexity and performance plots. This persists all the way to S/B = 0 and seems to be related to the transition with the prior moving from a uni-modal to a bi-modal shape. We see that the performance of the classifiers is better on one side of the ridge than the other, with the performance being better at lower, rather than larger, ρ as the S/B is lowered. Regions with large inverse perplexity do seem to indicate regions where the classifier is better at identifying the signal, with one exception. In the ρ, Σ → 0 limit the LDA model tends to a mixture model where only a single theme is relevant. In this region the inverse perplexity invariably reduces, while the performance of the classifier tends to flatten out. This result also shows that mixed membership models preform better at describing the data compated to mixture models. Finally we note that the presence or absence of a signal cannot be inferred by looking at the inverse perplexity plot alone. For very low S/B (including S/B = 0), LDA is unable to learn signal features effectively and (depending on the priors) learns to predominantly isolate (rare) co-occurance patterns in the background data. The resulting classifiers in this case (at small ρ 0.1 and moderate Σ ∼ O(1)) are effectively anti-QCD taggers.
In Fig. 14 we show in turn the results of the S/B and (ρ,Σ) scan for tt using the pair of (log k T , log R/∆) observables. We see a very similar general behaviour as in the mass basis example: similar correlation between perplexity and performance, with similar behaviour at ρ, Σ → 0 and a similar ridgelike feature forming at Σ < 1. Albeit the performance of the classifiers here is generally worse than in the (m 0 , m 1 /m 0 ) case in Fig. 13. We also see that the AUC even slightly improves for lower S/B when ρ is small and Σ is large. This is again the effect of anti-QCD tagging, where the algorithm learns the distribution describing the QCD background quite well but does not learn the signal distribution. Sometimes this is enough to see a classification performance which is slightly better than random.
As an example of the themes that are extracted using this technique we select a model from the (m 0 , m 1 /m 0 ) scan with S/B= 0.1. It is important to emphasise that when using these techniques in practice one does not in general have access to pure distributions with which to gauge the classification performance, so the perplexity will be an important statistic to judge which prior parameters best describe the data (and in turn allow for the best classification performance). Therefore we have selected the model with ρ = 0.1 and Σ = 1.5, since this model exhibits a large inverse perplexity. The corresponding learned model themes are shown in Fig. 15. These extracted latent distributions are remarkably similar to the pure distributions of the tt event samples shown in Fig. 5. Remember that there is a direct expected correlation between LDA themes that are similar to the pure signal and background distributions and the performance of the corresponding theme-based likelihood classifier between signal and background events. Despite the similarities, there is one important difference between the theme distributions in Fig.  15 and pure sample distributions in Fig. 5. That is the soft QCD splittings (at m 0 → 0, m 1 /m 0 → 1) present in both background and signal events in Fig. 5 are not present in one of the extracted themes. This is precisely due to the mixed-membership nature of LDA and is related to the anti-QCD part of the classifier. The model was able to identify the smooth QCD distribution peaking towards m 0 → 0, m 1 /m 0 → 1 as a theme. While it was able to recognise these same features both in background as well as in signal events, in the later case it also picked up on co-occouring hard splittings corresponding to the t andt decays and isolated them to a separate theme.

Unsupervised characterisation of new physics
Moving on to the new physics benchmark pp → W → φW, φ → W W , with m W = 3 TeV and m φ = 400 GeV, the scan plots in Figs. 16 and 17 are analogous to those for tt, with different S/B's. The same features we have seen for tt persist here: the ridge-like structure at ρ → 0 and Σ < 0.5, the correlation between perplexity and classifier performance, and the dip in inverse perplexity at ρ, Σ → 0. Clearly the performance of the classifier degrades as the S/B in the sample is reduced, but we see that the performance for the mass basis observables in Fig. 16 remains acceptable down to S/B= 0.01, while at S/B=0.005 it seems that the algorithm finds it difficult to robustly extract a reliable description of the signal. Note that even in the case with no signal there are still some features in the perplexity plot, and the performance is quite a bit better than a random tagger in some cases. This is again due to the anti-QCD tagging effect, which is stronger here than in the tt benchmark because the defining features of the W events in this representation have a smaller overlap with the QCD background. The performance of the Lund basis observables in Fig. 17 is a bit worse than for the mass basis observables, but is still quite good for larger S/B. It is notable that the results from the Lund basis observables here are much more uniform across the (ρ, Σ) landscape than they are for the mass basis observables. This could be due to the fact that the relevant features in the mass basis are much finer than they are in the Lund basis, as can be seen by comparing Fig. 7 and Fig. 8. This is partially due to the logarithmic binning used in the Lund basis observables, but is mostly due to the actual choice of observables. In Fig. 18 we show an example of the themes extracted using the VI algorithm for these scans, with S/B= 5%, using Lund basis observables, and with (ρ, Σ) = (0.1, 1.0) yielding the largest inverse perplexity in the scan. We see that the typical signal features are well distinguishable in one of the themes (theme 2). The two clusters in the leading jet distribution of this theme correspond to the decays of the φ boson and the W boson, with the single cluster in the subleading jet corresponding to the decay of the W boson. As in the mass basis example for boosted tt, we observe some notable differences when comparing the two themes to pure signal and background distributions, especially in the soft (k T → 0) and collinear (∆ → 0) regions of the Lund plane. Theme 2 predominantly captures that hard splittings associated with the massive resonance decays, while the softer splittings are predominantly captured by theme 1. Interestingly, it seems that the algorithm in this case picked up some distinguishable features of the signal (a deficit below the W peak) even in the non-perturbative (low k T ) regime. We warn however that these effects are very subtle and the least robust, since they vary considerably dependent on the model priors.

Systematics
In order to use the techniques presented above in practice it is important that the VI algorithm produces results which are stable under changes in the random initialisations of the model variables, i.e. the random seed. Also important is to verify that the algorithm parameters chosen for the inference procedure (see Sec. 2.3) are sensible given the datasets being used. The most important algorithm parameters here are the offset (τ 0 ), the chunk size (n c ), and the number of passes (n p ). The offset affects the learning rate, both the overall magnitude and as a function of the global updates, see Eq. (2.16). The chunk size changes how many events are used to optimise the local parameters before an update on the global parameters is performed. Finally, the number of passes must simply be large enough such that the algorithm converges.

Offset
We start with the offset, and to be clear what the actual consequences of particular offset values are in the inference algorithm, we show in Fig. 19 how the offset affects the learning rate (δ n ) as a function of the number of global updates. We see that the larger offsets inevitably mean a smaller learning rate, but also a learning rate which is more constant across the global updates. Smaller offsets lead to very large learning rates at earlier global updates, and larger learning rates overall.
To demonstrate the effect that the learning rate and offset have on the results, we have chosen a single parameter point from the scans performed on the W /QCD mixed event sample, with S/B = 2.5% and [ρ, Σ] = [0.05, 1.3] in the mass basis. We keep all of the parameters as they were in the scan, except now the offset is varied from 1 to 2×10 5 . For each offset we calculate the perplexity, the AUC, and the inverse  mistag rate so that we can analyse changes in performance. To assess the stability of the algorithm as a function of offset we repeat this for 100 different random seeds, calculating the mean and the (upper and lower) standard deviation of the resulting distribution. These results are shown in Fig. 20.
The first clear effect we see is that both the perplexity and the performance of these models increase with the size of the offset, degrading heavily at low offsets. The reason for this is simple, the learning rate is too large to sufficiently resolve the maxima in the ELBO. We also see that the random seed induced variance of the results increases considerably at low offsets. This is partially due to the overall size of the learning rate, but is also affected by the significantly increased learning rate in the initial global updates, as can be seen in Fig. 19. Because the chunks of data are sampled randomly at the beginning of the analysis, a different random seed means that a different subset of the data will have more influence on the inference, hence the larger variance. The second effect we see is the change in behaviour at very large offset. The AUC and inverse mistag are both good measures of performance for the model so we might expect that an increase in one leads to an increase in the other, however we see here that this is not the case. At offset ∼ 10 4 the AUC begins to degrade while the inverse mistag at fixed efficiency of ε s = 0.5 continues to improve somewhat.
The learning rate also affects the speed of convergence of VI. In the algorithm described in Sec. 2.3 we allow the algorithm to run for a fixed number of passes over the data without checking for convergence. However one could easily change this to check explicitly for convergence and end the algorithm early. In Fig. 21 we look at how many passes over the data the algorithm takes to converge, seeing that runs with larger offsets take much longer to converge. This is easily understood due to the smaller learning rate implied by larger offsets.
From these observations we deduce that an offset in the range 10 3 -10 4 is the best choice for both the  performance and stability of the inference algorithm as applied to our example datasets. Correspondingly, the suitability of a paticular offset choice on other datasets can be readily verified by checking for convergence as well as model perplexity dependence on this algorithm parameter.

Chunk size
In the prior scans in Secs. 5.1 and 5.2 the chunk size was 10 4 while the samples contained almost 10 5 events, i.e. 10 chunks per pass over the sample. Since we are looking for rare signals it is possible that the signal events could be very unevenly distributed throughout these different chunks, resulting in each chunk having significantly different perplexity, i.e. ELBO. Therefore the algorithm would essentially be attempting to optimise one model for these 10 different chunks, and the resulting posterior approximation would fail to accurately describe the true posterior. To test that this is not an issue in the scans, we have performed the same offset scan as in Fig. 20 but now for a smaller event sample (10 4 events), where the global updates are performed only after seeing the whole dataset, i.e. the chunk size is equal to the size of the event sample. We see these results in Fig. 22 and it is clear that while they differ slightly, qualitatively the same behavior is observed in the perplexity and performance at different offsets.
To properly study the effect of changing the chunk size we need to find a better way to compare models trained with different chunk sizes. Changing the chunk size significantly affects how much of the data the algorithm analyses before it converges, and we would like to disentangle this effect from the effect due to less data being analysed per global update in the algorithm. To do this we vary the offset simultaneously with the chunk size such that the learning rate at one pass over the data is held constant.  Fig. 23, where we clearly see the disadvantages in using very small chunk sizes. When the chunk size reaches O(5%) of the size of the event sample the perplexity and performance statistics reach a plateau. As we vary the chunk size we see that the results are not very sensitive to the random seed. This is because the learning rate is kept at a constant (small enough) value by also varying the offset accordingly. So while choosing the chunk size to be equal to the size of the event sample is certainly a good idea, especially with smaller datasets and rare signals, we conclude that for the event samples that we have analysed in this paper setting the chunk size to only a fraction of total dataset does not significantly impede the quality or robustness of VI while significantly improving its convergence. In particular, our reasoning in choosing the chunk size to be 10 4 rather than 10 5 for our prior scans in Secs. 5.1 and 5.2 is thus simply that the algorithm converges 10 times faster.

Conclusions
In this work we have described a general unsupervised framework capable of learning rare patterns in event data collected at high-energy colliders. We use a Bayesian probabilistic modelling technique called Latent Dirichlet Allocation (LDA), an unsupervised ML approach that was first introduced in the context of BSM collider physics in a previous paper [37]. We started by representing individual collider events as sequences of binned exchangeable measurements, and assumed a simplified picture in which the events are generated by sampling these measurements from some underlying joint probability distribution. The assumption of exchangeability of measurements in an event guarantees, through de Finnetti's theorem, that the sequence of measurements in an event are conditionally dependent on a latent variable sampled from (marginalised over) a prior distribution in latent space. Through some basic assumptions on this latent space we arrived at the LDA model, which we focus on throughout the paper. LDA is a mixedmembership model, meaning that under this model it is assumed that the measurements in each event are assumed to have been sampled from multiple (two, in our case) different multinomial distributions themes. These themes encode information on the underlying structure, i.e. hidden patterns, in the event data represented in terms of binned measurements. The mixing proportions of themes are sampled from a prior taking the form of a Dirichlet distribution, a parametric family of distributions over the simplex. Mixed membership models have the advantage of describing different events which share features arising from the same underlying physical source. Depending on the Dirichlet prior, the generative model can naturally describe event samples where certain combinations of measurements appear rarely, which is crucial for uncovering rare signals. Given the LDA model and the event data, we described in detail a stochastic variational inference technique for approximating or learning the underlying themes from which the data is assumed to have been generated. We then described how the extracted themes can be used to construct a classifier to cluster events into two categories, potentially aligned with the background and signal classes. We finally identified a measure of classification performance based solely the learned themes, the perplexity, which does not require truth labels to compute and can thus be extracted directly from mixed data. In particular, we found that perplexity correlates strongly with the widely used traditional measures of classification performance based on the ROC curves -the AUC and inverse-mistag rate at fixed efficiency, which do require truth labels.
To demonstrate the power of this technique we considered the analysis of di-jet events at the LHC focusing on two benchmark examples; boosted SM tt production and a hypothetical BSM production of W → (φ → W W )W . We described in detail how to pre-process the event data to express each event as a sequence of exchangeable measurements, and how the generative model for di-jet events is to be interpreted using LDA. Our choice of jet substructure observables that we used in the analysis is based upon high level observable combinations that have previously been shown to be good for identifying massive resonance decay chains within large radius jets with supervised methods: the traditional mass drop basis (see e.g. Ref. [53]) and the primary Lund plane basis [67]. 2 Through a study of the classification power of these different observables, and of how strong their co-occurrences are in the data, we have identified most promising pairs of observables in each basis for our unsupervised classification approach.
The results for each of the benchmark di-jet examples from this study are presented in Sec. 5. Using the perplexity, AUC, and inverse-mistag rate at a fixed signal efficiency as performance indicators, we analysed how well the two-theme LDA models classified events over a large range of values of Dirichlet prior parameters ((ρ, Σ)). For each benchmark we considered six different samples with varying S/B, ranging from 0.01 to 1.0 for the boosted top-quark example, and from 0.005 to 0.1 for the W example, including background only samples for reference. For both benchmarks the mass drop observables generally outperform the Lund observables in classification, however both choices lead to complementary results with the extracted themes in each case holding valuable information about the signal and background processes. From the results it is clear that the inference algorithm was able to separate measurement patterns corresponding to the massive resonance decays within the signal jets from patterns corresponding to light QCD emissions present within all jets. This is achieved due to the mixed-membership nature of the generative model, where QCD-like patterns found both in the signal and background jets were identified as having been sampled from the same theme describing QCD-like splittings in the jet substructure.
Finally, in Sec. 5.3 we studied how the results and performance of the chosen inference technique depend on the tunable parameters of the algorithm, in particular the chunk size and the offset. We demonstrated that the results of the algorithm are in fact stable over a large range of these parameters, and that the algorithm tends to converge within 100 passes for the example datasets.
Perhaps the most important result of this work is that over the (ρ, Σ) Dirichlet parameter plane the AUC and inverse-mistag rate, calculated using truth label information, are strongly correlated with the perplexity, which is calculated without truth label information. This implies that, not only can perplexity be used as a practical measure to asses LDA model convergence, but it can also provide guidance when selecting the most viable and robust Dirichlet priors for unsupervised collider analyses and searches. By allowing the algorithm to select optimal ρ and Σ parameters we would not need to perform a search for each choice of parameters considered, meaning that there would be no contribution to the trials factor due to these parameters. This result is a crucial step towards the next part of this work programme, constructing a full unsupervised di-jet search strategy for new physics at the LHC using LDA. In a recent letter [36] the ATLAS collaboration published an analysis of a weakly supervised di-jet resonance search in which contributions to the trials factor associated to the masses of the final state jets are eliminated by allowing the ML algorithm to define the classifier using the event data alone. We would like to stress that the method presented in this paper also benefits from such a reduction in the trials factor. In fact, because we represent each jet as a sequence of splittings corresponding to possibly many massive resonance intermediate decays, this method has the potential to describe arbitrarily complicated jet substructure signatures without paying any penalty in the trials factor. We reserve a full discussion of the search strategy to an upcoming publication.