Topic Model for four-top at the LHC

We study the implementation of a Topic Model algorithm in four-top searches at the LHC as a test probe of a non-ideal system for this technique. We examine the behavior of the Topic Model as its hypotheses, such as mutual irreducibility and equal underlying distributions in all samples, drift away from being exactly true. The four-top final state at the LHC is relevant not only because it does not fulfill these conditions, but also because it is a difficult and inefficient system to reconstruct, and current Monte Carlo modeling of signal and backgrounds suffers from non-negligible uncertainties. We implement the Topic Model algorithm in the same-sign lepton channel, where S/B is of order one and none of the backgrounds can have more than two b-jets at parton level. We define different mixtures according to the number of b-jets and use the total number of jets to demix. Since only the background has an anchor bin, we find that we can reconstruct the background in the signal region independently of Monte Carlo. We propose to use this information to tune the Monte Carlo in the signal region and then compare the signal prediction with data. We also explore Machine Learning techniques applied to this Topic Model algorithm and find slight improvements, as well as potential roads to investigate. Although our findings indicate that the implementation would still be challenging with the full LHC Run 3 data, we pursue through this work ways to reduce the impact of Monte Carlo simulations in four-top searches at the LHC.


Introduction
The LHC is a very successful machine that has achieved remarkable results, such as the discovery of the Higgs boson [1,2], the detailed study of the top quark [3,4] and Higgs boson [5,6] properties, and the current exploration of physics at the TeV scale as no other available experiment can. The latter pushes the frontiers of our knowledge by testing many of the available new theories. The LHC is designed to collect over the next twenty years approximately an order of magnitude more data than has been collected so far. To enhance the LHC discovery scope beyond this increase in luminosity, it is crucial that the community commits to exploring and finding new observables and different analyses which can probe the Standard Model (SM) beyond current tests. For these reasons, we investigate in this article the prospects for a new analysis of the four-top final state based on Topic Model techniques.
Four-top is arguably among the last benchmarks of Standard Model physics at the LHC, together with the tth [7,8] and hh [9][10][11] final states. In particular, four-top is especially sensitive to light New Physics (NP) which couples predominantly to the top quark [12,13]. There exist also in the literature many works that point out four-top as a sensitive channel to test heavy NP and/or Effective Field Theory effects [14][15][16][17][18][19]. However, since it is a very populated final state with many different possible channels, its reconstruction is inefficient, and its measurement [20,21] is therefore carried out mainly through signal regions which are compared to Monte Carlo (MC) predictions. Although MC predictions have reached a complex level of development, including Next-to-Leading-Order (NLO) matrix-element calculations for four-top production [22], there are still many measurements of the four-top signal and backgrounds which require renormalizations to data in control regions that still need further understanding. Because of this, we investigate in this work a direction to reduce the impact of MC simulations and tuning in the extraction of physical quantities in the four-top final state. It is worth stressing at this point that, even with all the new Machine Learning and Topic Model techniques, it is not possible to avoid a dependence on MC simulations in extracting absolute physical quantities from LHC data. We pursue the goal of reducing the impact of these simulations by replacing some of the required simulations and/or calibrations with different techniques.
Topic modeling is a subject of natural language processing and Machine Learning that concerns the study of statistical models to recover the abstract topics that occur in a corpus of documents. One of the seminal works in this subject defines the Latent Dirichlet Allocation (LDA) [23] technique to determine and extract topics from a corpus of documents. It considers topics as probability distributions over words, and resorts to different strategies for recovering these distributions from the observed word distributions in the documents. Some works build on LDA, as for instance the Dynamic Topic Model [24], which considers ordered documents within the corpus; the Correlated Topic Model [25], which considers correlations among the topics; the Online LDA [26], which allows for a streaming of the documents in the training process; or the Decontamination of Mutual Contamination Models [27], which tackles the demixing of mixed membership models, among others. Some of these sets of tools have recently found their way into jet physics, where now the documents are the distributions over some jet observable, and the words are the bins in these distributions [28]. The underlying topics then cease to be abstract notions and become meaningful distributions, for example signal and background for a particular process. For the simple case in which there are two observed distributions and two underlying distributions to be recovered, a data-driven implementation of mixed membership topic modeling has been applied to differentiate quark from gluon jets [29,30]. Throughout this work we investigate this particular framework applied to four-top physics. We refer to this technique as the Topic Model Demixer algorithm, or Demixer algorithm for short [29].
We investigate in this article the four-top final state and its backgrounds as two specific topics which are mixed in different proportions in two defined samples according to the number of jets that are b-tagged in the final state. We explore within this framework the perspectives for measuring four-top physical properties, combining the information that can be extracted from the Demixer algorithm with MC simulations and techniques.
This work is organized as follows. In Section 2 we present the Demixer algorithm and investigate its behavior in cases where its hypotheses are not fully satisfied. In Section 3 we apply the Demixer algorithm to four-top production and its main backgrounds, ttW and ttH, and show how to recover important physical properties and distributions without relying on MC simulations. In Section 4 we discuss improvements and alternative strategies that could enhance the results of this work, including a possible strategy to tune a MC generator using the information extracted from the Demixer algorithm. We summarize our conclusions in Section 5. In Appendix A we test Machine Learning techniques on the Demixer algorithm to improve the description of the signal and background distributions and purity fractions.

Demixer algorithm
In this section we give a brief review of the Demixer algorithm. We first describe the algorithm along with its basic features, which allow one to recover two underlying distributions and their fractions from a pair of mixed samples. We then discuss under which conditions the algorithm works properly, and we analyze more realistic cases which relax some of these conditions. We study how the algorithm can still be used to recover sensible topics, as long as the departure from these hypotheses is tractable.

Demixer algorithm in the ideal case
Throughout this work we consider the case of two samples M_1 and M_2 that are mixtures, in different proportions, of two underlying sources which we call signal and background. In the following paragraphs we summarize the basic layout of the problem for the ideal case; further details can be found elsewhere [27,29].
Assuming that some features of the elements in samples M_{1,2} can be described by a given observable x, we can define a probability distribution of these elements in each sample as

p_{M_i}(x) = f_i p_S(x) + (1 - f_i) p_B(x),  i = 1, 2.   (1)

Here p_S(x) and p_B(x) are the two underlying distributions in x of signal and background, respectively, and f_i is the fraction of signal events in sample M_i. These are the unknowns that one would like to recover, or demix, from the original samples M_1 and M_2. Observe that one of the important assumptions is that p_S(x) and p_B(x) are the same for both samples.
In order to demix these samples we perform a maximal subtraction through the definition of the reducibility factors

kappa(M_i|M_j) = min_x [ p_{M_i}(x) / p_{M_j}(x) ].   (2)

We maximally subtract sample M_2 from sample M_1 and normalize the result in order to define the distribution of the reconstructed signal topic T_S over x as

p_{T_S}(x) = [ p_{M_1}(x) - kappa(M_1|M_2) p_{M_2}(x) ] / [ 1 - kappa(M_1|M_2) ].   (3)

In a similar way, we can obtain the distribution in x of the reconstructed background topic T_B, exchanging the roles of M_1 and M_2.
Without loss of generality, we suppose that sample M 1 has a larger fraction of signal events than sample M 2 (f 1 > f 2 ), and therefore T S is the one matching the signal distribution, whereas topic T B would match the background distribution.
For the demixer to work properly, besides having different fractions f_i in Eq. 1, we need to have anchor bins. That is, there must be at least two points (or bins, in the discrete case) x_S, x_B such that

p_S(x_S) = 0, p_B(x_S) > 0  and  p_B(x_B) = 0, p_S(x_B) > 0.   (4)

Defining the irreducibility factors kappa(B|S) and kappa(S|B) for the underlying distributions by replacing M_{i,j} by S and B in Eq. 2, this implies that the reducibility factors kappa(B|S) = kappa(S|B) = 0. This hypothesis is usually called mutual irreducibility. When this condition is guaranteed, the topics reconstructed by maximally subtracting the samples from each other match the underlying signal and background distributions, and the fractions f_{1,2} can be recovered by inverting

kappa(M_1|M_2) = (1 - f_1)/(1 - f_2),  kappa(M_2|M_1) = f_2/f_1.

If there is no mutual irreducibility, sample demixing still leads to relevant quantities. For instance, the reconstructed topic for the more signal-like sample leads to the background-subtracted signal distribution:

p_{T_S}(x) = [ p_S(x) - kappa(S|B) p_B(x) ] / [ 1 - kappa(S|B) ],   (5)

and the analogous holds for the other sample, by swapping B and S. If there is extra input on the size of these reducibility factors, from either theoretical principles or some given estimation, then the two equations can be solved for the pure distributions. In this sense, kappa_SB and kappa_BS can be thought of as hyperparameters, since prior information on them provides a better determination of the underlying topic distributions.
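The ideal-case procedure above can be sketched in a few lines of Python. This is a toy implementation on binned distributions, not the code used in this work, and the function names are ours; it assumes mutual irreducibility (for each topic there is a bin where it vanishes while the other does not) and f_1 > f_2:

```python
import numpy as np

def reducibility(p1, p2):
    """kappa(M1|M2) = min_x p_M1(x) / p_M2(x), over bins where p2 > 0 (Eq. 2)."""
    mask = p2 > 0
    return np.min(p1[mask] / p2[mask])

def demix(pM1, pM2):
    """Recover topic distributions and signal fractions from two binned mixtures."""
    k12 = reducibility(pM1, pM2)            # = (1 - f1) / (1 - f2) in the ideal case
    k21 = reducibility(pM2, pM1)            # = f2 / f1 in the ideal case
    # Maximal subtraction, then normalization (Eq. 3 and its analogue):
    pTS = (pM1 - k12 * pM2) / (1.0 - k12)   # signal topic
    pTB = (pM2 - k21 * pM1) / (1.0 - k21)   # background topic
    # Invert the kappa relations for the fractions:
    f1 = (1.0 - k12) / (1.0 - k12 * k21)
    f2 = k21 * f1
    return pTS, pTB, f1, f2
```

For mutually irreducible inputs, `demix` returns the underlying distributions and fractions exactly; with anchor bins missing, its output degrades to the background-subtracted combinations of Eq. 5.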

Demixer algorithm beyond the ideal case
In the above paragraphs we have shown the basic features of extracting two topics from two samples that contain different mixtures of these topics in an ideal case. In a real-case scenario the procedure faces additional complications, and many factors may affect the conclusions. For instance, since the data has uncertainties, the minimum in Eq. 2 may not correspond to the true minimum; one of the underlying distributions may not have an anchor bin; or the signal or background distributions may differ between M_1 and M_2, among others. In the following paragraphs we perform a brief study of how some of these real-case factors may affect the extraction of topics and fractions. In this overview we neglect experimental factors such as statistical and systematic uncertainties, and we focus on studying cases where some of the hypotheses needed for the Demixer algorithm are not satisfied. A more exhaustive study scrutinizing the effect of all these factors lies beyond the scope of this work. Nevertheless, such a study would be useful for further understanding the reliability of the Demixer algorithm in particle physics, where not only is the ideal case always far from reality, but the topics (and their fractions) are usually not abstract distributions and instead have physical meaning.
Let us consider cases in which the mixture samples stray from the optimal conditions for demixing. In order to keep track of these conditions, we define two basic hypotheses:
• H1: Mutual irreducibility. Both the underlying signal and background have anchor bins.
• H2: Same underlying distributions. The two samples are sums of the same signal and background distributions, differing only in the fractions.
In many real-case scenarios H1 is relaxed, either because mutual irreducibility is a priori unknown, or because it is well known that one of the underlying distributions does not have an anchor bin. Relaxing H1 means that the reconstructed topics now match the background-subtracted signal distribution and vice-versa, as in Eq. 5. However, if one knew the reducibility factors kappa_SB and kappa_BS, inverting this system of equations would give the true underlying distributions. In such cases one option is to resort to theoretical or experimental arguments, or to simulations, in order to estimate the size of the reducibility factors. In the same direction, one could justify that one of the reducibility factors is zero -or negligible- and extract useful information, even though the other is unknown. In fact, as shown in Ref. [30] or in Eq. 5, this means that one can reconstruct the distribution that has an anchor bin exactly, without resorting to additional information.
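The inversion of this system of equations can be written in closed form. A minimal sketch (our own helper, assuming both topics take the form of Eq. 5 with given reducibility factors kappa_SB and kappa_BS):

```python
import numpy as np

def invert_topics(pTS, pTB, k_SB, k_BS):
    """Invert p_TS = (p_S - k_SB p_B)/(1 - k_SB) and
    p_TB = (p_B - k_BS p_S)/(1 - k_BS) for the underlying p_S and p_B."""
    det = 1.0 - k_SB * k_BS
    pS = ((1.0 - k_SB) * pTS + k_SB * (1.0 - k_BS) * pTB) / det
    pB = ((1.0 - k_BS) * pTB + k_BS * (1.0 - k_SB) * pTS) / det
    return pS, pB
```

With kappa_BS = 0 (background anchor bin present), the formula reduces to p_B = p_TB, i.e. the background topic is already the true background distribution, in line with the statement above.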
In order to quantitatively discuss deviations from H2 we need to quantify the difference between probability distributions. There are many options to quantify the difference between two curves; however, given that these two curves are probability distributions, we adopt the Hellinger distance [31], defined for discrete distributions p and q as

H(p, q) = (1/sqrt(2)) [ sum_x ( sqrt(p(x)) - sqrt(q(x)) )^2 ]^{1/2}.

Through the use of a square root, this distance provides a relative enhancement in importance to regions with smaller values. The purpose of this is to be more sensitive to differences in the small-probability regions, where the anchor bins are located.
In any case, it is worth mentioning that we have verified that using other notions of distance such as L 1 or L 2 does not yield qualitative changes to our conclusions.
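For reference, the Hellinger distance between two binned distributions takes one line of NumPy (we assume the conventional 1/sqrt(2) normalization, so that 0 <= H <= 1, with H = 1 for disjoint distributions):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete (binned) probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```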
With this notion of distance, we can measure the Demixer performance when the hypothesis of same underlying distributions is not fulfilled. We define delta_{S,B} to quantify how different the signal or background distributions are in each sample as

delta_S = H(p_S^1, p_S^2),  delta_B = H(p_B^1, p_B^2).

Here the sub-index 1 or 2 refers to the underlying distributions in samples M_1 and M_2, respectively. To determine the performance of the algorithm, we can measure how good the signal and background reconstruction is by defining the distances

Delta_{S_i} = H(p_{T_S}, p_S^i),  Delta_{B_i} = H(p_{T_B}, p_B^i).

Since sample M_1 is the one with the larger fraction of signal, and sample M_2 the one with the larger fraction of background, we define for practicality Delta_S = Delta_{S_1} and Delta_B = Delta_{B_2}.
In order to keep the same notion of mutual irreducibility in the presence of four underlying distributions p_S^{1,2}(x) and p_B^{1,2}(x), we consider the deviation between underlying distributions to be generated by multiplicative noise, that is,

p_S^2(x) = p_S^1(x) (1 + xi_S(x)),

with a noise function xi_S satisfying

sum_x p_S^1(x) xi_S(x) = 0   (7)

to keep p_S^2(x) normalized. For the background we have the analogous relationships, exchanging S -> B and 1 <-> 2. In this way p_S^1(x) and p_B^2(x) are the reference distributions for signal and background, respectively.
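A noise function with these properties can be sampled as follows (a sketch of one possible construction; the subtraction step enforces the constraint of relation (7), so that the perturbed distribution stays normalized):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(p, amplitude=0.1, rng=rng):
    """Return p * (1 + xi) with sum_x p(x) xi(x) = 0, so the result stays normalized."""
    xi = rng.uniform(-amplitude, amplitude, size=p.shape)
    xi -= np.sum(p * xi)      # project out the component that would change the norm
    return p * (1.0 + xi)
```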
To better explore the Demixer algorithm conditions and their subsequent relaxation we split the space of possible combinations into a few cases. For H1 we have either:

a) both underlying distributions have anchor bins;
b) only one of them has an anchor bin;
c) neither has an anchor bin.

If H2 holds, in case b) the algorithm still recovers correctly the distribution that has the anchor bin. If in addition the reducibility factors are known, in all three cases the demixer succeeds in recovering the underlying distributions. For H2 we distinguish three cases:

0) delta_S = delta_B = 0;
1) delta_S != 0 and delta_B = 0;
2) delta_S != 0 and delta_B != 0.

Case 0) is where H2 holds. For case 1) there is a mirrored case that occurs by switching the S and B labels. Case 2) is the most general one, and can be particularized to the other cases for either delta_B -> 0, delta_S -> 0, or both.
We can now see how the demixer works when H2 does not hold. In this case the factors kappa_ij get modified due to the difference between the underlying signal and background functions in each sample. As we parameterize the difference between functions as multiplicative noise, we get

kappa(M_1|M_2) = [(1 - f_1)/(1 - f_2)] (1 + xi_B^0),  kappa(M_2|M_1) = (f_2/f_1) (1 + xi_S^0).

Here xi_B^0 and xi_S^0 are the values of the noise functions at the anchor bins, which should not be confused with the functions themselves. Performing the maximal subtraction on sample M_1 yields -assuming f_1 > f_2- the signal topic

p_{T_S}(x) = { [f_1(1-f_2) - f_2(1-f_1)(1+xi_B^0)(1+xi_S(x))] p_S^1(x) + (1-f_1)(1-f_2)[xi_B(x) - xi_B^0] p_B^2(x) } / [ (f_1-f_2) - (1-f_1) xi_B^0 ].

The background topic is found by replacing f_1 -> 1 - f_2, f_2 -> 1 - f_1, and flipping the B and S labels. We can first consider the simpler case 1), which corresponds to p_B^1(x) = p_B^2(x). In this case the expressions simplify to

p_{T_S}(x) = p_S^1(x) [ 1 - (f_2(1-f_1)/(f_1-f_2)) xi_S(x) ],   (11)

p_{T_B}(x) = { f_1 f_2 [xi_S(x) - xi_S^0] p_S^1(x) + [(f_1-f_2) - f_2(1-f_1) xi_S^0] p_B(x) } / [ (f_1-f_2) - f_2 xi_S^0 ].   (12)

We can see that in general the signal reconstruction is better than the background reconstruction, as Eq. 11 involves only the underlying signal distribution, whereas the background topic involves both signal and background distributions, in addition to denominators which may enhance the disagreement between p_{T_B}(x) and p_B(x). More quantitative statements can be made by calculating the distances Delta_B, Delta_{S_1} and delta_S. For instance, by computing Delta_{S_1} using Eq. 11 and expanding in xi_S we obtain

Delta_{S_1} ~ [ f_2(1-f_1)/(f_1-f_2) ] delta_S.   (13)

That is, Delta_S, a measure of the performance of the signal reconstruction, follows a linear dependence on the distance between the two signal distributions, which is a measure of the breaking of condition H2. We also see that the slope is simply a function of the signal fractions f_1, f_2 in the samples, and that increasing f_1, the amount of signal in sample M_1, decreases this slope. This is expected, since the purer the samples are in a given distribution, the better the reconstruction of the corresponding topic.
To better visualize the above results, we have performed numerical simulations scanning over different function and noise shapes. We have used Gaussian distributions for S and B, randomly sampling their means and widths and keeping only cases with mutual irreducibility within a certain tolerance. For the noise functions xi_{S,B}(x) we randomly sampled for each bin a value in [-0.1, 0.1], then renormalized the functions in order to satisfy relation (7) (and the analogous one for the p_B(x) distribution). The two mixture samples, with fractions f_1 = 0.45 and f_2 = 0.22, were then generated by summing the underlying distributions.
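A stripped-down version of one step of such a simulation, for case 1), can be sketched as follows; the Gaussian parameters and the seed are arbitrary illustrative choices, not the ones scanned in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 50)

def gaussian_hist(mu, sigma):
    """Binned, normalized Gaussian shape."""
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return p / p.sum()

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

f1, f2 = 0.45, 0.22                # fractions used in the text
pS = gaussian_hist(-1.5, 0.8)      # reference signal p_S^1
pB = gaussian_hist(+1.5, 0.8)      # background, equal in both samples (case 1)

# Multiplicative noise on the signal of sample M2: p_S^2 = p_S^1 (1 + xi_S)
xi = rng.uniform(-0.1, 0.1, size=x.size)
xi -= np.sum(pS * xi)              # enforces relation (7): p_S^2 stays normalized
pS2 = pS * (1.0 + xi)

pM1 = f1 * pS + (1 - f1) * pB
pM2 = f2 * pS2 + (1 - f2) * pB

k12 = np.min(pM1 / pM2)                               # reducibility factor kappa(M1|M2)
pTS = np.clip(pM1 - k12 * pM2, 0, None) / (1 - k12)   # reconstructed signal topic
                                                      # (clip guards against rounding)
delta_S = hellinger(pS, pS2)       # breaking of H2
Delta_S1 = hellinger(pTS, pS)      # quality of the signal reconstruction
slope = f2 * (1 - f1) / (f1 - f2)  # predicted slope of Eq. 13
```

Repeating this over many random shapes and noise draws produces scatter plots like Fig. 1, with Delta_S1 close to slope * delta_S.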
Using these simulations we have computed the above distances for each topic. Each simulated point corresponds to a pair of mutually irreducible functions p_S^1(x) and p_B^2(x), and two noise functions xi_S, xi_B of the same amplitude. A plot of these distances can be seen in Fig. 1 for case 1), which corresponds to setting p_B^2(x) = p_B^1(x), that is, using a single noise function xi_S(x). We see the linear dependence of the signal reconstruction on the signal difference delta_S. We have verified a similar result for other distances such as L_1. This behavior is well explained by Eq. 13. These points can be linearly interpolated for different values of f_1 > f_2, to see that the slopes follow the predicted values. From Eqs. 11 and 12 one can see that the background topic is expected to be noisier. In fact, one can see from the RHS of Eq. 13 that for fixed delta_S the distance Delta_{S_1} is approximately constant, whereas the same procedure for Delta_B is more involved due to the other factors present in Eq. 12, which yield stochastic noise. We also see in Fig. 1 that the Hellinger distance Delta_B is an order of magnitude larger than delta_S, showing that the background reconstruction is sensitive to the distance between the underlying signal distributions delta_S. We find less enhancement if the L_1 distance is used.

Figure 1: Testing H2 under case 1). Scatter plot of the distance between the reconstructed signal and background topics and the corresponding underlying distribution, as a function of the distance between the two underlying signal distributions. We show in dashed black the line with the slope given by Eq. 13. The vertical axis corresponds to Delta_{S_1} for orange (square) points and to Delta_B for blue (circle) points.
For case 2), with different signal and background distributions in each sample M_i, reaching a closed form for the distances Delta_B and Delta_S is considerably more involved. We have computed these distances numerically, using the same simulation scheme as in the previous case, and obtained the plots in Fig. 2. From using specific forms of the xi(x) noise functions one can infer that the topic with the larger fraction in the sample is the one with the better reconstruction. We have verified this statement by simulating many cases as in Fig. 2 with different fractions.

Summary
From the above results we see that there is an interplay between the distances of the underlying distributions and the purity fractions of each topic in the samples in determining which topic is better reconstructed and to what extent. For instance, in Fig. 1 we see that having a non-zero distance between the signal distributions, delta_S != 0, dominates, and the background is less well reconstructed. This statement remains valid for specific values of the fractions f_i that increase the slope of the orange points, up to the limit in which both topics are equally reconstructed. On the other hand, in Fig. 2 we see that for delta_S ~ delta_B != 0, the value of the fractions f_i is what determines which topic is better reconstructed.
As a summary of the above paragraphs, we can extract some useful statements regarding the validity of the demixer when the working conditions do not fully satisfy H1 and H2.
• When mutual irreducibility is not guaranteed, but one of the topics has an anchor bin, this topic is better reconstructed by the algorithm. (Case b0.)

• If the underlying distributions for one topic are different in the two samples M_i, while the other topic has equal underlying distributions, then the former is the one with the better reconstruction. (Case a1.)

• If both topics have different distributions across the two samples, then topic reconstruction is mainly ruled by the sample purities. If the samples are mostly background, then the background reconstruction will be better than the signal reconstruction, and vice-versa. (Case a2.)

Demixer algorithm in four-top at the LHC
The four-top final state at the LHC is a very busy channel with little chance of being correctly reconstructed. In addition, both the signal and its backgrounds suffer from important MC uncertainties. It is therefore an attractive channel in which to apply the Demixer algorithm, and one where we can expect to find many of the difficulties discussed in Section 2.
To choose the decay channel in which to apply the Demixer algorithm, we note that the algorithm requires a non-negligible fraction of signal in the samples. Considering the branching ratios and background processes, we find it suitable to focus on the same-sign dilepton and tri-lepton channels, where the main backgrounds are ttW and ttH, with the W decaying leptonically and the H decaying semi-leptonically or leptonically through WW*. This multilepton channel provides the highest signal-over-background ratio, provided we apply the appropriate cuts beforehand, following Ref. [20]. We have performed MC event simulations of pp → tttt up to detector level, using MadGraph [32] for the matrix-element calculation, Pythia [33] for showering and hadronization, and Delphes [34] for the detector simulation, applying the same basic cuts as in Ref. [20], but at leading order and with only up to one extra parton. We have set the mixture fractions to agree with the event yields reported in Ref. [20].
In order to have a good demixer we need to find two sets of observables as uncorrelated as possible: one to define M_1 and M_2, and the other to play the role of the demixing variable x, as defined in Section 2. This lack of correlation is what provides approximately the same underlying distributions for signal and background in both samples, namely H2. A natural set of observables from which to choose consists of N_b, N_j and the p_T, energy and Delta R of all the reconstructed objects in the event. In order to build a simple demixing model, we choose only one variable to define M_1 and M_2 and another one to demix. Using a multidimensional demixing with a Neural Network as in CWoLa [35] does not show considerable improvements at this stage. This is discussed in Appendix A.
On physical grounds, the most direct distinction between signal and background is the number of b-tags (N_b). Observe that the SS multilepton channel has the special feature that none of the backgrounds has more than 2 b-jets. Meanwhile, as shown in Ref. [13], the number of reconstructed jets (N_j) is also a good discriminator between signal and background. We consider these two variables for the M_i definition and the demixing, leaving p_T, energy and Delta R for setting cuts to accept or reject reconstructed objects such as leptons, jets or b-jets. Considering H2, we require the two background processes ttW and ttH contained in both M_i to yield approximately the same underlying distribution in each sample. Since these backgrounds have different numbers of jets at parton level, it is advisable to avoid dividing the samples using N_j, because the relative proportions of these backgrounds would suffer a non-negligible change in each sample, yielding different underlying distributions for the background. In addition, N_b is a variable with very few discrete values, which makes it unsuitable as a demixing observable. Therefore, we divide the samples as M_1: events with 3 b-tags; M_2: events with 2 b-tags.
N_j is then used as the variable on which the demixing algorithm is performed.
Using the above M_i definition we can construct the two mixture samples from the simulated events. We show in Fig. 3a the distributions of the M_i in N_j. As expected, the M_1 mixture -which has the largest fraction of signal- is shifted towards large N_j with respect to the M_2 mixture. We have computed that for the full LHC Run 3 the two distributions would be distinguishable at the ~3σ level in the N_j = 7 bin. We can perform the demixing algorithm as described in the previous section assuming, as a first step, that this is an ideal case. Since we can expect on general grounds that the background distribution goes to zero at large N_j, we can predict a good reconstruction for the background. We plot the reconstructed underlying probability densities as well as the truth-level topic distributions in Fig. 3b. In fact, the background underlying distributions are properly approximated by the reconstructed background topic. As studied in Section 2, this is because the background has a proper anchor bin at large N_j, it has similar distributions in M_1 and M_2, and it has a larger fraction than the signal in both samples. On the other hand, the reconstructed signal topic does not match the underlying signal distributions because of the lack of an anchor bin. As discussed below, obtaining a trustworthy background distribution in the signal region from data provides a new way of tuning the Monte Carlo event generator in the signal region.
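This mechanism can be illustrated with a toy version of the setup: hypothetical N_j shapes (not the simulated ones) in which the background vanishes at large N_j -- its anchor bin -- while the signal populates every bin. The background topic then comes out exactly, whereas the signal topic only matches the background-subtracted distribution of Eq. 5, developing a spurious zero in the first bin:

```python
import numpy as np

# Illustrative shapes for bins N_j = 5..10; fractions are also illustrative.
pS = np.array([0.05, 0.15, 0.25, 0.25, 0.18, 0.12])  # four-top: no anchor bin
pB = np.array([0.42, 0.30, 0.18, 0.08, 0.02, 0.00])  # ttW/ttH: vanishes at large N_j

f1, f2 = 0.40, 0.15          # signal fractions in the 3b and 2b samples (toy values)
pM1 = f1 * pS + (1 - f1) * pB
pM2 = f2 * pS + (1 - f2) * pB

# Background topic: maximally subtract the signal-rich sample from the signal-poor one.
k21 = np.min(pM2 / pM1)      # = f2/f1, realized at the last bin, where pB = 0
pTB = np.clip(pM2 - k21 * pM1, 0, None) / (1 - k21)   # recovers pB exactly

# Signal topic: exact recovery would need a bin with pS = 0, which does not exist.
k12 = np.min(pM1 / pM2)      # larger than (1 - f1)/(1 - f2) -> over-subtraction
pTS = np.clip(pM1 - k12 * pM2, 0, None) / (1 - k12)   # background-subtracted signal
```

In this toy, `pTB` equals `pB`, while `pTS` is forced to zero in the first bin, mirroring the zero probability at N_j = 5 seen in Fig. 3b.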
Further analysis of the demixing algorithm indicates that the reconstructed fractions misestimate their true values by about ~30%. This result is obtained through an MC-independent algorithm, and the shift is within the order of magnitude of the usual MC normalizations performed in four-top signal and background predictions. Moreover, this result is the product of applying the demixing assuming that H1 holds, which we know is not true on theoretical grounds: the signal does have events in all the N_j bins. As a matter of fact, we can see in Fig. 3b that this yields a background-subtracted signal distribution which assigns zero probability to N_j = 5. Therefore, we can still improve this result, but at the price of including Monte Carlo input. As a second step, we address the demixing algorithm using the workaround for the absence of anchor bins, the kappa factors defined in Section 2. These kappa factors can be understood as hyperparameters of the algorithm, since prior knowledge of or constraints on their possible values provide a better adjustment of the estimated fractions. We can study the performance of the reconstructed fractions in the hyperparameter plane (kappa_SB, kappa_BS). In Fig. 4 we scan over these kappa for the real case of different underlying distributions in both samples (Fig. 4a) and for the adjusted case of equal underlying distributions in both samples (Fig. 4b). In both cases we see that using the prior theoretical knowledge that kappa_SB > 0, and manually tuning kappa_SB to larger values while leaving kappa_BS = 0, pushes f_1^R/f_1 to one. Moreover, we can see that when the underlying distributions are not equal in both samples (Fig. 4a), the kappa's that correctly reconstruct f_1^R/f_1 do not coincide with the solution corresponding to the correct underlying distributions. It is interesting to see in Fig. 5 how, by manually tuning kappa_SB to larger values, we reach a solution in which the fractions are correctly reconstructed but not the distributions, and then vice-versa.
This behavior is still obtained if one also varies κ BS , since the reason behind this disagreement is that H2 is not fulfilled.
If, on the other hand, we satisfy H2 by forcing the same underlying distributions (Fig. 4b), we can tune the hyperparameters kappa_{SB,BS} to correctly reconstruct both the fractions and the distributions. For completeness, we show in Fig. 6 the output of the demixing algorithm in this case. The plots in Fig. 4 show the sensitivity of the algorithm to H2, as noticed in the previous section. We also see in this figure that the correct solution is more sensitive to kappa_SB than to kappa_BS. This is expected, since the background does have an (approximate) anchor bin, and thus kappa_BS is expected to be close to zero.
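One way such a kappa_SB correction could enter the fraction estimate, under the assumption kappa_BS = 0, is the following (a sketch of our own, not necessarily the exact procedure behind Fig. 4): with a background anchor bin, kappa(M_2|M_1) = f_2/f_1 remains exact, while kappa(M_1|M_2) = (f_1 kappa_SB + 1 - f_1)/(f_2 kappa_SB + 1 - f_2), which can be solved for f_1 given an assumed kappa_SB:

```python
import numpy as np

def f1_estimate(K12, K21, k_SB=0.0):
    """Signal fraction in M1 from the measured reducibility factors,
    correcting for an assumed kappa_SB (kappa_BS taken to be zero)."""
    return (1.0 - K12) / ((1.0 - k_SB) * (1.0 - K12 * K21))

# Toy check: shapes with kappa_SB = min pS/pB = 0.1 and a background anchor bin.
pS = np.array([0.05, 0.10, 0.25, 0.60])
pB = np.array([0.50, 0.30, 0.20, 0.00])   # vanishes in the last bin
f1, f2 = 0.5, 0.2
pM1 = f1 * pS + (1 - f1) * pB
pM2 = f2 * pS + (1 - f2) * pB
K12 = np.min(pM1 / pM2)   # biased upwards by the non-zero kappa_SB
K21 = np.min(pM2 / pM1)   # = f2/f1, exact thanks to the background anchor bin
```

Setting `k_SB = 0` reproduces the ideal-case fraction formula; increasing `k_SB` pushes the estimate upwards, the behavior observed when tuning kappa_SB in Fig. 4.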
In the following section we complete this discussion with possible methods to combine the demixing algorithm with MC simulations to reduce the impact of MC tuning in the extraction of physical quantities.

Figure 5: The truth-level distributions are S_1 and S_2, whereas S_R is the distribution obtained using Eq. 5. This is an explicit and graphical demonstration of the sensitivity of the algorithm to H2.

Discussion
In the previous sections we studied how the Demixer algorithm recovers signal and background distributions from mixture samples, and how to apply these tools to the four-top process. In this section we discuss the strengths and some of the shortcomings of this implementation, along with possible improvements. We first discuss possible goals relative to the analysis in Section 2, when the Topic Model hypotheses are not fully satisfied. We then discuss possible variations of the Demixer algorithm implementation presented in Section 3, together with their pros, cons, and features to be further explored. We end with a discussion of how the presented Demixer algorithm could be implemented to extract physical quantities in four-top while reducing the impact of MC simulations.
In Section 2 we detailed the hypotheses behind the extraction of the topics and fractions, exploring what happens when these hypotheses are not valid. A more exhaustive study can be made, both analytically and numerically. For instance, the error propagation of the demixer algorithm can be studied for the more realistic case in which the uncertainty in each sample is taken into account. Error bars in the samples p_{M_1}(x) and p_{M_2}(x) would translate into error bars in p_S(x) and p_B(x), along with the reconstructed fractions f^R_{1,2}. Analyzing this error propagation and its dependence on the validity of conditions H1 and H2 defined in Section 2 would be an important step towards better understanding the algorithm and its potential.
In Section 3 we used the SS multilepton channel to apply the Demixer algorithm to the four-top process. The main reasons for this choice are the relatively high S/B ratio, and the special condition that all background processes in this channel have at most two b-jets at parton level. It is worth noticing at this point that other minor backgrounds in the SS channel, such as non-prompt leptons and ttZ, behave similarly to the main backgrounds in what concerns the number of b-jets and the anchor bin at large N_j. We have also studied other channels, such as the mono-lepton one, for which we only mention the following results. We find that an S/B below O(10^-2) makes it challenging to correctly apply the algorithm. Moreover, even in the hypothetical case that new cuts could increase S/B, we find that using N_b to define the samples and N_j to demix yields different underlying background distributions, which breaks H2 and therefore spoils the results. In fact, since in this case ttbb is among the main backgrounds, the relative contribution of this background to the background topic would change considerably between the N_b = 2 and N_b = 3 samples. Despite these difficulties in approaching this channel with the Demixer algorithm, we still find it an interesting goal, since having a less MC-dependent prediction of backgrounds such as ttbb is an important avenue in four-top and in heavy-flavor physics.
Throughout the article we have used b-tagging information through N_b to define the samples M_i, and N_j to demix. Inverting the roles of these variables would be an interesting study. Of course, in this case one could not demix using N_b, since it takes very few discrete values. To explore such an inversion one should instead define the samples as, for instance, M_1: N_j > 7 (signal enhanced) and M_2: 5 ≤ N_j ≤ 7 (signal suppressed), and construct a continuous variable encoding the probability of having b-tags among the jets. This variable could be, for instance, the sum of the MV2c20 [36] variable of the four leading jets. Demixing on such a continuous variable would provide an alternative prediction for four-top with the Demixer algorithm.
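A rough sketch of such a variable is given below; the per-event data format is our assumption (parallel arrays of jet p_T and per-jet tagger scores), with MV2c20-like scores taken to lie in [-1, 1]:

```python
import numpy as np

def btag_demix_variable(jet_pt, jet_btag_score, n_lead=4):
    """Continuous demixing variable: sum of the b-tagging
    discriminant of the n_lead pT-leading jets of the event.
    `jet_pt` and `jet_btag_score` are parallel per-event arrays
    (hypothetical event format); the score is an MV2c20-like
    output in [-1, 1]."""
    jet_pt = np.asarray(jet_pt, dtype=float)
    score = np.asarray(jet_btag_score, dtype=float)
    lead = np.argsort(jet_pt)[::-1][:n_lead]   # pT-descending order
    return float(score[lead].sum())
```

For example, an event with jets of p_T = (200, 120, 80, 50, 30) GeV and scores (0.3, 0.9, -0.2, 0.1, 0.5) yields 0.3 + 0.9 - 0.2 + 0.1 = 1.1, since the fifth jet does not enter the sum.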
We have also implemented Machine Learning techniques to explore the ability of a Neural Network to best discriminate M_1 from M_2, and therefore signal from background. The bottleneck in this direction is to choose input parameters for the Neural Network whose distributions remain as similar as possible between M_1 and M_2, since otherwise the breaking of H2 spoils any improvement the Neural Network could provide. We have discussed different architectures and choices of parameters in Appendix A, where we show the demixing result for each case. In Fig. 7 we show the result of teaching a Neural Network to distinguish M_1 from M_2 and then using its output as the demixing variable. Giving up the physical interpretation of the demixing variable in order to ensure equal underlying distributions creates potential issues in guaranteeing H2, which translates into a more challenging topic reconstruction. More details on the procedure and the architectures can be found in Appendix A. We find moderate improvements compared to demixing in N_j alone, meaning that N_j is the main discriminator between signal and background at this level. It is also worth noting that the Demixer algorithm is a data-driven technique that does not require training: using eight i7 CPU cores it runs in times of order O(10^{-3}) seconds, while the CWoLA algorithm requires times of order O(10^{2}) seconds for a simple architecture and a relatively small dataset as implemented in Appendix A. One of the reasons for this difference is the small number of bins over which we perform the demixing. Nevertheless, we consider that more work in this direction, including a continuous variable for the b-tagging information, could still provide further improvements to the main result of this work.
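For reference, a minimal version of such a training-free demixer, including the recovery of the signal fractions from the two κ factors, can be written in a few lines. This is an illustrative sketch with toy histograms, not the article's actual implementation:

```python
import numpy as np

def demix_with_fractions(p1, p2, eps=1e-12):
    """Demix two normalized histograms p1, p2 (same binning).
    Returns the reconstructed topics and the signal fraction in
    each mixture. Exact only when both topics have anchor bins
    (H1) and the underlying distributions are equal in the two
    samples (H2); otherwise the output is shifted."""
    k12 = np.min(p1 / np.maximum(p2, eps))
    k21 = np.min(p2 / np.maximum(p1, eps))
    t1 = (p1 - k12 * p2) / (1.0 - k12)     # signal-like topic
    t2 = (p2 - k21 * p1) / (1.0 - k21)     # background-like topic
    f1 = (1.0 - k12) / (1.0 - k12 * k21)   # signal fraction in M1
    f2 = k21 * f1                          # signal fraction in M2
    return t1, t2, f1, f2

# Toy check: mixtures of tS = (0.6, 0.4, 0) and tB = (0, 0.5, 0.5)
# with true signal fractions f1 = 0.7 and f2 = 0.2.
tS = np.array([0.6, 0.4, 0.0])
tB = np.array([0.0, 0.5, 0.5])
p1 = 0.7 * tS + 0.3 * tB
p2 = 0.2 * tS + 0.8 * tB
t1, t2, f1, f2 = demix_with_fractions(p1, p2)
# -> t1 ~ (0.6, 0.4, 0), t2 ~ (0, 0.5, 0.5), f1 ~ 0.7, f2 ~ 0.2
```

In this ideal toy, both topics have an anchor bin, so topics and fractions are recovered exactly; the runtime is dominated by a handful of array operations over the bins, which is why no training step is needed.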
We have seen throughout this work the sensitivity of the algorithm to hypothesis H2. Since we lump different backgrounds together into the background topic, it becomes easier to break H2 by having different underlying distributions in each sample. This is in particular a crucial failure for the mono-lepton channel briefly discussed above. To tackle this issue, a generalization of the Demixer to, for instance, three samples, now distinguishing between the backgrounds themselves, would be an interesting problem to address.
We end this section with a brief discussion on how the results obtained using the Demixer algorithm on four-top could be used to reduce the impact of MC simulations and their tuning in the extraction of an absolute physical quantity such as the four-top production cross-section, σ(tttt). We consider one specific possibility, although there are other options which could be explored. As discussed in the introduction, we consider that it is not possible to avoid a MC dependence when measuring a quantity such as σ(tttt), which can be compared to theoretical matrix-element calculations. However, reducing the impact of these predictions and tunings is a crucial task in the four-top final state, which involves many complex ingredients such as ISR/FSR, hadronization, jet reconstruction and isolation, among others, which still need to be further understood.
The Demixer algorithm applied to four-top as implemented in the previous section yields a reasonably reliable distribution for the background topic, since the background approximately satisfies having an anchor bin and is purer in both M_i samples. This is an important result because it provides a MC-independent prediction for the background in a region where signal is expected. One could therefore implement a data-driven MC tuning for the background in the signal region, that is, tune the MC to reproduce the background shape in N_j extracted from the Demixer algorithm. In contrast to the usual MC tuning in control regions, where there is no signal, this kind of tuning has the advantage that it does not require an extrapolation to different regions of parameter space. Once this is performed, one could use this MC to predict κ_SB and, from this number, extract the shape of the signal distribution in N_j from the Demixer algorithm. A comparison of the signal shape predicted by the tuned MC with the shape predicted by the Demixer once κ_SB has been determined would be a measure of the success of the method. In such a case, one could extract the fractions of signal and background in M_i and rely on this tuned MC to extract the tttt production cross-section.
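Schematically, once the tuned MC supplies the reducibility factor, the signal shape follows from the same linear combination used in the demixing, with κ_SB now an external input rather than a value read off an anchor bin in data. The sketch below uses our own notation and conventions and is not the article's implementation:

```python
import numpy as np

def signal_shape_from_kappa(p1, p2, kappa_sb):
    """Signal shape in N_j from the normalized mixture histograms
    p1 (signal enriched) and p2 (signal suppressed), given a
    kappa_SB supplied externally (e.g. by the tuned MC). Negative
    bins from statistical noise are clipped before renormalizing."""
    t = (p1 - kappa_sb * p2) / (1.0 - kappa_sb)
    t = np.clip(t, 0.0, None)
    return t / t.sum()
```

Comparing this κ-driven shape with the shape from the tuned MC itself then provides the closure test described above: agreement between the two would validate both the tuning and the extracted fractions.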

Conclusions
We have studied the Demixer algorithm as well as some of its limitations, and applied it to the four-top signal and its background at the LHC, where many of the required hypotheses are not fully satisfied.
In realistic scenarios the requirements for applying the Demixer algorithm are usually not exactly satisfied. We have analyzed how the outcome of this algorithm is affected when some of the required assumptions break down, and find that, to some extent, the method is still useful. We have shown explicitly how the topic reconstruction and fraction estimates may shift when the underlying distributions in the initial mixture samples are not equal and/or mutual irreducibility does not hold. The study is not exhaustive, and we conclude that more work in this direction is needed to better understand the Demixer algorithm and its scope beyond ideal cases.
We have implemented the Demixer algorithm for the tttt final state and its main backgrounds. We have used the Same-Sign multilepton channel, which assures a reasonable S/B and has the special feature that all backgrounds have no more than two b-jets in their final state. The implementation has been made using simulated events up to detector level. We have defined the two mixture samples using the number of b-tags, N_b = 2 and N_b = 3, while we have used N_j to demix the samples. We have also attempted to demix using Machine Learning through a Neural Network which best discriminates the mixed samples, as in CWoLA. We find that the Neural Network results can be slightly better, but at the price of considerably obscuring the clarity of just using N_j. Neglecting statistical and systematic uncertainties, we have shown that the reconstruction of signal and background topics is as predicted by the Demixer algorithm beyond the ideal case: i) the lack of an anchor bin for the signal implies that only the background is correctly reconstructed; ii) the signal is correctly reconstructed provided the corresponding reducibility κ-factor is supplied, which requires input beyond the data-driven procedure; and iii) the estimated fractions and reconstructed topics are somewhat shifted from their true values due to slight differences in the underlying distributions. The Demixer algorithm in the described framework, without using MC inputs, predicts the fraction of signal in the mixture samples with a misestimation of approximately 30%.
We have discussed some possible future directions and improvements regarding the results in the article, for instance inverting the roles of N_b and N_j in defining the samples and demixing, replacing N_b by a continuous b-tagging variable. We have proposed to use the Demixer algorithm to obtain a MC-independent prediction for the four-top background, which would allow one to tune the MC parameters for the background prediction in the signal region. This tuned MC could then be used as a new tool to extract physical quantities from the signal region, for instance the tttt production cross-section, thereby achieving the general goal of this work: to reduce the impact of MC in the extraction of physical quantities.
The four-top is a very challenging final state within any framework, and the Demixer algorithm is no exception. The actual implementation of the algorithm for four-top at the LHC along the lines presented in this manuscript would require the full LHC Run 3 dataset, and many experimental aspects and uncertainties would still have to be further analyzed. Among the main issues to be addressed in a real implementation, finding a discriminating variable better than N_j, providing an improved differentiation between the mixture samples, would be a crucial milestone. We present the results in this article as an alternative step towards reducing the impact of MC generators in the extraction of physical quantities in four-top physics.

Acknowledgments
We thank Jesse Thaler for very useful conversations during the development of this work. E.A. thanks the participants of the Voyages Beyond the SM III Workshop for useful discussions.

A Using NNs for CWoLA implementation in 4-tops
In Section 3 we detail an implementation of the Demixer algorithm to demix four-top from two backgrounds in the Same-Sign lepton channel. The Demixer algorithm, as detailed in Section 2, is implemented in a simple and clear way, with only one observable (N_b) to define the orthogonal regions M_1 and M_2 and one observable (N_j) to demix and obtain the topics T_S and T_B. In principle this could be improved, as detailed by CWoLA [35], with the use of a larger set of observables combined in some way to obtain an optimal classifier for M_1 versus M_2, which corresponds to the optimal classifier for signal versus background. However, to identify the output of CWoLA with the real signal and background distributions, one still has to consider the validity of H1 and H2 in Section 2. In this appendix we study the use of Neural Networks (NN) to search for a better discriminant than N_j alone, while we maintain the definition of M_1 and M_2 as a function of N_b. We implement our algorithms through the Keras package [37] for Python 3. This discriminant should provide a better resolution between M_1 and M_2 while also providing anchor bins for both of the reconstructed distributions. These distributions can be identified with signal and background if the hypotheses in Section 2 are fulfilled.
We feed the NN with the same simulated events, labeling each event according to its classification into M_1 or M_2. We test different NN architectures and different observables of the reconstructed events. We split the samples into training, validation and testing samples. The training is performed using 200 epochs, provided there is no overfitting. We make use of the class-weights option provided by Keras to account for the different number of events in M_1 and M_2 reported in Ref. [20], without discarding any simulated event. From each event we extract N_j, the p_T and energy of all reconstructed objects, the angular distance between any pair of reconstructed objects, and the total transverse energy H_T. In the following we use layers of neurons with ReLU activation functions and a final neuron with a sigmoid activation function, forming a feed-forward NN trained with a binary cross-entropy loss function. The notation for the chosen observables is "N_1 object_1 - N_2 object_2", which means that we use the p_T and energy of the p_T-leading N_1 and N_2 objects of types object_1 and object_2, and the (1/2)(N_1 + N_2)(N_1 + N_2 - 1) angular distances between all of them. For instance, by "Nj-2leps-2jets" we mean that the NN is fed with N_j and all the information of the 2 leading leptons and 2 leading jets in the form of p_T, energy and angular distances between them.
In Figs. 8, 9 and 10 we show three simple architectures, each with three different choices of observables. We plot the NN output for M_1 and M_2 and the result of using the demixing algorithm on this output to recover the underlying distributions and the signal fractions in each mixture. As a general result, we find that feeding the NN with b-jets introduces correlations between the demixing and the samples, and then, even if the discrimination may be efficient, the algorithm fails to reconstruct the underlying distributions since they do not fulfill H2. If we instead feed the NN with the leading jets, regardless of whether they are b-tagged or not, such correlations are suppressed and the reconstruction of the underlying distributions is better. We also find that using fewer jets proves to be slightly better. In Fig. 8 we investigate a NN with 3 layers of 4 nodes each and a final sigmoid node, which we refer to as "4+4+4+1". We find that this NN is too simple and the reconstructed and underlying topics come out as jagged lines. In Fig. 9 we add nodes and use a 32+32+32+1 NN to obtain smoother curves for the reconstructed and underlying topics. Adding still one more layer with 32 nodes (Fig. 10) brings further smoothness and a good agreement in the reconstructed fractions.
We find the best setup by using a 32+32+32+32+1 NN fed with N_j, the two leading leptons and the two leading jets, Figs. 10c and 10d, which we also display in Fig. 7. Further exploration along these lines could bring better and more solid results for the main purpose of this work.

Figure 10: Same as Fig. 8, but for a more complex and deeper 32+32+32+32+1 NN.