ThickBrick: optimal event selection and categorization in high energy physics. Part I. Signal discovery

We provide a prescription called ThickBrick to train optimal machine-learning-based event selectors and categorizers that maximize the statistical significance of a potential signal excess in high energy physics (HEP) experiments, as quantified by any of six different performance measures. For analyses where the signal search is performed in the distribution of some event variables, our prescription ensures that only the information complementary to those event variables is used in event selection and categorization. This eliminates a major misalignment with the physics goals of the analysis (maximizing the significance of an excess) that exists in the training of typical ML-based event selectors and categorizers. In addition, this decorrelation of event selectors from the relevant event variables prevents the background distribution from becoming peaked in the signal region as a result of event selection, thereby ameliorating the challenges imposed on signal searches by systematic uncertainties. Our event selectors (categorizers) use the output of machine-learning-based classifiers as input and apply optimal selection cutoffs (categorization thresholds) that are functions of the event variables being analyzed, as opposed to flat cutoffs (thresholds). These optimal cutoffs and thresholds are learned iteratively, using a novel approach with connections to Lloyd's k-means clustering algorithm. We provide a public Python implementation of our prescription, also called ThickBrick, along with usage examples.


JHEP03(2021)291

1 Introduction
The increase in complexity and scale of high energy physics (HEP) experiments over the years has been accompanied by an increased sophistication of the techniques employed in the data analysis. Among the more recent additions to the medley of analysis techniques used in experimental HEP are machine learning algorithms. Machine learning techniques have found applications in various aspects of the analysis [1][2][3], and are an active area of research, both from the standpoint of finding new applications as well as refining their current usage.
An important step in any collider data analysis is event selection and/or categorization. Event selection is the process of selecting a subset of collider events for further analysis. Event categorization is the process of grouping the selected events into disjoint categories, each of which will be individually analyzed.
Event selection serves two purposes. First, it helps bring down the volume of recorded data within the computational limits of the experiment (storage and processing power). This is done at the trigger level [4,5] and works by making rapid decisions on whether or not an event is "interesting enough" to warrant further processing, given the goals of the experiment and its computational limitations. Secondly, event selection is used in individual analyses to reduce the amount of "background" in the event sample and make it more "signal"-rich, with signal and background events defined appropriately. This is the type of event selection we will focus on in this paper.
A sample of collider events can often be thought of as being composed of several subsamples, although real collider events do not necessarily carry a label indicating which subsample they are from. Such mixture models are used in the Monte Carlo simulation of events, with simulated events carrying a label to identify the subsample they are from. These subsamples can be labeled as signal and background (or bg1, bg2, etc.), as appropriate for the particular analysis at hand. The event selection process can then be thought of as selectively including signal-like events (and rejecting background-like events) to "improve" the analysis. Let us call this the "signal is better than background" heuristic.
The goal of event selection in an analysis searching for a signal (due to new physics or an interesting Standard Model process) is to increase the expected significance of a potential signal excess, or to improve the expected upper limit on the signal cross-section (in the case of a null result). It is easy to see how the heuristic mentioned above can help improve such an analysis. On the other hand, the goal in a parameter measurement analysis (say a top quark or a W-boson mass measurement) is to increase the precision of the measurement. In such analyses, subsamples whose distributions are independent of the parameter(s) being measured can be labeled as background, and subsamples whose distributions do depend on the parameter(s) being measured can be labeled as signal. Again, we can see how the "signal is better than background" heuristic can improve the precision of measurement by making the distribution of the selected event sample more sensitive to the parameter(s) being measured.

The "signal is better than background" heuristic allows us to rephrase the event selection problem as an event classification problem. Over the last two decades, machine learning methods like artificial neural networks [6] and boosted decision trees (BDT) [7,8], originally developed to solve classification problems, have been successfully employed in collider physics analyses [1][2][3]. These algorithms are trained on simulated data samples to learn features that distinguish signal and background events. They can then provide a measure of how signal-like an event in the real data is, and the event selection can be done based on this measure.
But despite the rationale behind the "signal is better than background" heuristic, it is not perfectly aligned with the physics goals of improving the search sensitivity and measurement precision. In this work, we will prioritize the physics goals, and focus on maximizing the expected sensitivity of the analysis that uses the data after event selection/categorization. We will introduce appropriate performance measures that capture this sensitivity and provide prescriptions to maximize them. This is the optimality alluded to in the title of the paper, and it is in this sense that the word "optimal" is used throughout.
In this work, we introduce a framework named ThickBrick to construct optimal event selection and categorization prescriptions for physics analyses within the supervised machine learning paradigm, with cost functions defined on a per-event basis. Along the way, we will point out exactly how our methods realign the goals of event selection and categorization with the physics goals mentioned above. To facilitate clarity of presentation and improve accessibility, we will present the ThickBrick approach in three parts (tentative).
• In part I (this paper), we will focus on event selection and categorization for signal discovery (and upper-limit setting). We will provide a prescription to train event selectors and categorizers, which are optimal for an exactly specified signal distribution.
-We will not account for unspecified signal parameter(s), like the mass of the dark photon in a dark photon search. In other words, in this paper we will train our categorizers to be optimal for a given set of values of signal parameters. We will also not concern ourselves with systematic uncertainties in the model specification. Both of these issues will be addressed in part III.
-The technique introduced in part I is directly applicable in searches where the signal parameters are known beforehand, e.g., rare Higgs signatures, where the mass of the Higgs is known a priori.
• In part II [9], we will provide event selection and categorization prescriptions for reducing the expected statistical uncertainty of the measurement in parameter measurement experiments.
-As in part I, we will not be accounting for model (systematic) uncertainties, which will be addressed in part III.
• In part III [10], we will provide prescriptions to train optimal event selectors and categorizers accounting for unspecified signal parameters and systematic uncertainties.
In other words,

-We will provide prescriptions to train event selectors and categorizers that improve sensitivity over a range of unspecified signal parameter values.
-We will also provide prescriptions to maximize the significance of a signal excess in the presence of model uncertainties, and minimize the combined statistical and systematic uncertainty in parameter measurement experiments.

Dimensionality reduction and complementary information
Data recorded in collider experiments is typically high-dimensional. For example, in LHC experiments, a typical recorded event leaves thousands of hits in the detector. Even at the parton level, depending on the number of reconstructed final state particles, events can have ∼10 or more attributes. Ideally, one would want to draw statistical inferences on the underlying physics by analyzing the distribution of events in these high-dimensional spaces. But analyzing the distribution of collider data in a high-dimensional space (using multi-dimensional histograms, likelihood-ratio tests, the Matrix Element Method [11][12][13], or machine-learning-based inference [14][15][16][17]) comes with several challenges: the computational power needed to systematically scan the phase-space grows exponentially with the number of dimensions; the volume of data (real and simulated) needed to populate the space grows exponentially with the number of dimensions; directly evaluating the likelihood-ratio using the Matrix Element Method is computationally expensive [13]; Monte Carlo validation in high-dimensional phase-space (e.g., in the context of machine learning) can be challenging; etc.
One way to deal with the curse of dimensionality is to reduce the dimensionality by constructing event variables like pseudorapidities, transverse momenta, invariant or transverse masses of sets of final state particles, etc. After reducing the dimensionality of data to a more tractable number of event variables, we can limit the analyses to studying the distribution of these event variables.
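As a simple illustration of such a dimensionality-reducing event variable, the sketch below maps two reconstructed four-momenta to their pair invariant mass; the (E, px, py, pz) ordering and natural units are assumptions made for this example.

```python
import math

def invariant_mass(p1, p2):
    """Invariant mass of two four-momenta given as (E, px, py, pz)
    tuples, in natural units. Reduces eight reconstructed numbers
    to a single event variable, discarding all other information.
    """
    E, px, py, pz = (p1[i] + p2[i] for i in range(4))
    m2 = E * E - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # clamp tiny negative values from smearing

# Two back-to-back massless particles with E = 50 each reconstruct
# to a pair invariant mass of 100.
m = invariant_mass((50.0, 0.0, 0.0, 50.0), (50.0, 0.0, 0.0, -50.0))
```

Any analysis that histograms only `m` has discarded the remaining six degrees of freedom of the pair, which is precisely the information loss discussed next.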
But this dimensionality reduction is accompanied by information loss, as expected from the data processing inequality [19]: typically the dimensionality-reduced data contains less information than the original data, be it for the purpose of signal search or parameter measurement. In the case of signal search analyses, the information loss can be understood as follows: all the information contained in a collider event for the purpose of testing for the presence of a signal over a background can be captured by the likelihood-ratio of the event under the signal and background distributions (a consequence of the Neyman-Pearson lemma [20]). Reducing the dimensionality of the data mixes up regions of phase-space with different values of this likelihood-ratio, but the same value of the event variable(s). To see how such a mixing causes information loss, consider the simple scenario with two regions of phase-space where the expected number of signal and background events are, respectively, (S_1, B_1) and (S_2, B_2), all non-zero. Let the events in the two regions be mixed and analyzed together, and let the combined (expected) signal and background counts be (S_tot, B_tot) ≡ (S_1 + S_2, B_1 + B_2). Note that

    S_tot/B_tot = (B_1/B_tot) (S_1/B_1) + (B_2/B_tot) (S_2/B_2),    (1.1)

i.e., S_tot/B_tot is a convex combination of S_1/B_1 and S_2/B_2. This implies, by Jensen's inequality [19,21], that

    B_tot f_convex(S_tot/B_tot) ≤ B_1 f_convex(S_1/B_1) + B_2 f_convex(S_2/B_2)    (1.2)

for any convex function f_convex, with equality occurring iff S_1/B_1 = S_2/B_2. By choosing appropriate forms of f_convex, it can be shown that

    S_tot²/B_tot ≤ S_1²/B_1 + S_2²/B_2  and  S_tot²/N_tot ≤ S_1²/N_1 + S_2²/N_2,  with N_i ≡ S_i + B_i,    (1.3)

where eq. (1.3) illustrates the loss of sensitivity of the analysis to the presence of a signal caused by phase-space mixing or dimensionality reduction, with sensitivity quantified by several measures, including the familiar "S/√B or S/√N added in quadrature". Many commonly used analysis techniques, including event selection and categorization, can be thought of as attempts to mitigate the information loss caused by dimensionality reduction. Event categorization helps by separating regions of phase-space, that would otherwise be mixed together, into disjoint categories. Similarly, event selection would help when not including certain parts of phase-space leads to better sensitivity than mixing them with regions of higher purity: in our example, if S_1²/B_1 > S_tot²/B_tot, then rejecting region 2 is better than mixing the two regions together.
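The mixing inequality can be checked numerically. In the following sketch (with made-up expected counts), keeping the two regions as separate categories beats rejecting region 2, which in turn beats mixing the two regions together:

```python
def sensitivity(S, B):
    """The 'S^2/B' measure: square of the naive significance S/sqrt(B)."""
    return S * S / B

# Illustrative expected counts in two regions of phase-space.
S1, B1 = 10.0, 100.0   # higher-purity region
S2, B2 = 5.0, 400.0    # lower-purity region
S_tot, B_tot = S1 + S2, B1 + B2

mixed = sensitivity(S_tot, B_tot)                      # regions mixed together
separate = sensitivity(S1, B1) + sensitivity(S2, B2)   # disjoint categories
rejected = sensitivity(S1, B1)                         # region 2 discarded

# Jensen's inequality guarantees mixed <= separate; in this example
# rejection also beats mixing, since S1^2/B1 > S_tot^2/B_tot.
assert mixed <= rejected <= separate
```

With these particular counts, mixing gives 0.45, rejecting region 2 gives 1.0, and categorizing gives 1.0625, illustrating both claims in the text.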
Let e represent all the observable information pertaining to a single collider event. Let x be an event variable constructed out of e whose distribution will be analyzed to search for the presence of a signal. In order to mitigate the loss of information caused by the dimensionality reduction e → x, event selection or categorization needs to be done based on the information lost during dimensionality reduction, i.e., information complementary to x. Event selection or categorization based on the information contained in x itself cannot improve the sensitivity of the experiment, and can potentially be detrimental to it. This is illustrated in figure 1, which shows event selection performed based only on x. This only removes some bins from the analysis, hence removing some non-negative, potentially positive, terms from the sum Σ_{i ∈ x-bins} s_i²/b_i (or other measures of sensitivity).

[Figure 1. An illustration of event selection based solely on the event variable to be analyzed after the selection process. The dark shaded region represents the rejected region of phase-space. This kind of event selection does not improve the sensitivity of the analysis. To improve sensitivity, background needs to be selectively removed from each bin as opposed to removing entire bins.]

Typically, event selection based on machine learning (ML) classifiers (trained on the full event information e) is done by choosing a working point or selection threshold on the ML output. The problem here is that the ML classifier learns to distinguish between signal and background based on the full information in e, and not just the complementary information. As a result, this selection procedure has to find a compromise between the gain in sensitivity from tapping into complementary information and the loss in sensitivity resulting from using non-complementary information.
This problem is particularly severe when x is a "good event variable" in terms of separating signal and background, since such a variable will be strongly correlated with the classifier output. This is the source of the "misalignment" addressed in this paper. As we will demonstrate, such a compromise is not necessary, and it is possible to force an event selector to learn from only information complementary to x. This can be done by using different selection thresholds at different values of x. Intuitively, this can be understood as follows: to maximize Σ_{i ∈ x-bins} s_i²/b_i using an event selector, one needs to maximize s_i²/b_i in each x-bin, by choosing an appropriate, bin-dependent threshold on the ML classifier output. On the other hand, even the best bin-independent threshold will only be a compromise value.
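The advantage of bin-dependent cutoffs can be seen in a small numerical toy. Below, each x-bin contains a "low" and a "high" classifier-score region with made-up expected (signal, background) counts; a flat cut must keep the same score regions in every x-bin, while a bin-dependent cut chooses per x-bin:

```python
# Toy: two x-bins; in each, the classifier score splits events into a
# "low" and a "high" score region with illustrative expected
# (signal, background) counts.
bins = [
    {"low": (2.0, 50.0), "high": (8.0, 10.0)},   # x-bin 1
    {"low": (6.0, 20.0), "high": (4.0, 40.0)},   # x-bin 2
]

def fom(kept_per_bin):
    """Sum over x-bins of s_i^2/b_i for the kept score regions."""
    total = 0.0
    for kept in kept_per_bin:
        s = sum(region[0] for region in kept)
        b = sum(region[1] for region in kept)
        if b > 0:
            total += s * s / b
    return total

# Flat cut: the same score regions are kept in every x-bin.
best_flat = max(
    fom([[xb["high"]] for xb in bins]),             # keep "high" only
    fom([[xb["low"], xb["high"]] for xb in bins]),  # keep everything
)

# Bin-dependent cut: the kept regions are chosen per x-bin.
def best_in_bin(xb):
    options = ([], [xb["high"]], [xb["low"], xb["high"]])
    return max(fom([kept]) for kept in options)

best_binwise = sum(best_in_bin(xb) for xb in bins)
assert best_binwise >= best_flat   # extra flexibility can only help
```

In this toy, the best flat cut keeps only the high-score region everywhere, while the bin-dependent cut keeps everything in x-bin 2 (where the low-score region is still relatively pure) and outperforms any flat cut.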
In the case of event categorizers, although using information contained in the event variable x is not detrimental to the sensitivity, it does not improve it either. So a similar compromise between using complementary and non-complementary information exists in typical ML-classifier-based categorization as well (flat thresholds between categories), which can also be avoided using x-dependent thresholds between categories.
In addition to leading to sub-optimal event selectors and categorizers in terms of statistical sensitivity, non-complementary information used in event selection often causes the background distribution (post-selection) to be peaked in the same region as the signal. This amplifies the challenge posed to signal searches by systematic uncertainties in the background specification. Although this paper is not geared towards reducing systematic uncertainties (which we will do in part III [10]), event selectors decorrelated from the event variables x can make it easier to estimate the background distribution from the data itself, thereby reducing the impact of systematic uncertainties. This is illustrated in figure 2 using plots from a toy example discussed in section 4. The left panel shows the distributions of signal and background before selection. The remaining two panels show their distributions post-selection using event selectors based on the full information e (middle panel) and complementary information only (right panel). We observe that in the middle panel, the background distribution becomes sculpted and peaks in the same region as the signal, while in the right panel, with the procedure which we advocate below, the background distribution remains flat.

The landscape of related methods
It was pointed out in [22,23] that the performance of event selectors in terms of classification accuracy is not perfectly correlated with the sensitivity and precision of the physics analysis that uses them: neural networks trained to directly optimize for sensitivity (as in a Higgs search) or precision (as in a top mass measurement) were demonstrated to outperform event selectors trained for maximal classification accuracy (with thresholds or working points chosen to optimize precision). It is precisely this misalignment that we seek to explain and alleviate in this paper and the subsequent work [9,10].
In [22,23], an evolutionary method for training neural networks called NeuroEvolution of Augmenting Topologies (NEAT) [24] was used in the direct optimization of event selectors. Another approach to directly optimizing neural networks for discovery significance was introduced in [25], using cost functions defined over batches of training events. ThickBrick, on the other hand, relies on simply using the output of typical ML-based classifiers in an optimal manner.
At the heart of this paper lies the following observation: the task of finding the event selector/categorizer that maximizes the statistical sensitivity of a search analysis (performed in the distribution of an event variable) is intrinsically related to the space of event selectors/categorizers over which the maximization is performed. We claim that selectors/categorizers based only on classifier output (flat thresholds) lack the flexibility needed to achieve global optimality (i.e., in the space of all selectors/categorizers), and expanding the space of categorizers under consideration can potentially lead to improved results. This is in line with the findings of [26], where, in addition to using thresholds on a BDT classifier output, the flexibility of cutting on another carefully chosen variable was added to the categorization process, leading to a better performing categorizer. We show in section 2 that, from an information theory standpoint, global optimality can be attained with our proposal to use the ML classifier output as well as the event variable x, and impose x-dependent thresholds.
Our work is closely related to refs. [27,28], which consider the optimal partitioning of events into different categories, in order to maximize the sensitivity of an analysis that searches for a signal or measures a parameter by counting the number of events in each category. They approach this task by maximizing the Fisher information [19] left in the data after reducing each event to the category it belongs to. Parts I and II of our work can be viewed as maximizing the Fisher information left in the data after reducing each event to the category it belongs to and an event variable x. Note that a signal search problem can be viewed as a signal cross-section measurement problem, and the performance measures introduced in our paper reduce to the relevant Fisher information in the faint signal limit (when the 'background only' and 'background+signal' hypotheses are only infinitesimally apart).
Refs. [29,30] considered double Higgs production in the bbγγ channel, and performed a simultaneous optimization of cuts on kinematic variables and BDT hyperparameters, for an analysis that used both filters before counting the number of events that pass them. For such a counting analysis (as opposed to a distribution analysis), applying a threshold on the score from a perfectly trained classifier should yield the highest sensitivity (from the Neyman-Pearson lemma). However, due to practical limitations the BDT training cannot be perfect, and cuts on kinematic variables can be used to improve the performance.
Several approaches exist in the literature for decorrelating the output of a classifier from a particular event variable. One possibility is to use only features decorrelated from the event variable as inputs to the ML classifier [31][32][33]. Another way is to force decorrelation during training with appropriate penalties [34][35][36]. In our case, the decorrelation is achieved naturally as a byproduct of maximizing performance measures more in line with the sensitivity of the physics analysis than, say, classification accuracy. Another technique related to the correlation between variables is the sPlot tool [37], which can extract the distribution of a variable x within each subsample of events, given a mixture of the different subsamples and a variable y, uncorrelated with x, whose distribution within each subsample is known.
The rest of this paper is organized as follows: in section 2, we will introduce several performance measures for event selectors/categorizers, and lay the groundwork for a training prescription to maximize them. We will also explain in more detail how our methods realign the goals of event selection and categorization with the physics goals. Section 3 will be a description of our training algorithm, intended to serve as an implementation and usage guide. To this end, we will keep the material in section 3 mostly self-contained, with minimal dependence on section 2. In section 4, we will demonstrate the working of our algorithm using a toy example analyzed using the ThickBrick package [38], our publicly available implementation of the algorithm. Finally, we will provide some outlook on the future of this work and possible extensions in sections 5 and 6.

Setup and notation
In this subsection we will establish the setup and notation for a generic signal search procedure considered in this paper. Throughout this paper, we will work under the frequentist approach to statistical inference, 5 so no notion of probability will be associated with the presence or absence of a signal.
Let S be a collection of collider events. We will treat a given instance of S as one of the possible outcomes of the experiment performed to produce it. The events in S are assumed to be independent and identically distributed. The total number of events in S is assumed to be a Poisson distributed random variable whose mean could possibly depend on the underlying theory. In practice, S can be a collection of events recorded at the collider that pass some event pre-selection criteria.
Let e be a random vector which contains all the observable information (that is available to an analysis) pertaining to a single collider event. In practice, e can include the estimated number and IDs of parton-level final state particles in the event, their reconstructed (i.e., after detector smearing) energies and momenta, etc.
Let x be an event variable (also a random vector) constructed out of e whose distribution is to be analyzed to search for the presence of a signal. For example, x could be the 4-lepton invariant mass for a Higgs search in the 4-lepton channel.
In general, the vectors e and x can have continuous (real-valued) and discrete components. 6 Let Ω_e and Ω_x represent, respectively, the sample spaces e and x belong to, with appropriate probability measure functions. 7

We consider a generic signal search analysis performed in the following two steps:

Step 1. Event selection/categorization. From S, events are categorized based on their e values into C categories (C ≥ 1) numbered 1, . . . , C. If we allow for event rejection, an event can also be not included in any of the C categories (rejected) based on its e value. For now we will assume that the number of categories, C, is predetermined.
Let S_c be the collection of events in category c, for c ∈ {1, . . . , C}.
Note that if we use only one category and allow for event rejection, this behaves as a simple in-or-out event selection procedure. Henceforth, "event categorization" will refer to both event selection and event categorization, unless a distinction is explicitly made.

5 The techniques we will introduce can have applications in Bayesian inference as well. Under Bayesian inference, the goal of an analysis would be to reduce the expected posterior entropy of the presence or absence of a signal (i.e., reduce the uncertainty on whether or not the signal exists). Our techniques can roughly be interpreted as minimizing this expected entropy.
6 For categorical/discrete-valued components of e, like the number and IDs of reconstructed final state particles, it is a good idea in practice to group events based on those features and analyze each group separately, although our formalism does not assume this to have been done.
7 Using as σ-algebra the product of Borel algebras for real-valued components and power sets for discrete-valued components.
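Concretely, an event categorizer of the kind described in Step 1 is just a function mapping e to a category index in 1, . . . , C (or None for rejection). The following is a minimal sketch, with a hypothetical score function standing in for the ML classifier output; the names and thresholds are illustrative assumptions, not the ThickBrick package API.

```python
from typing import Optional

def make_threshold_categorizer(score, thresholds):
    """Build a categorizer eta: e -> {1, ..., C} or None from a
    per-event score function and an increasing list of C thresholds.
    Events scoring below thresholds[0] are rejected (None);
    otherwise the category is the number of thresholds passed.
    """
    def eta(e) -> Optional[int]:
        y = score(e)
        if y < thresholds[0]:
            return None                                  # event rejected
        return sum(1 for t in thresholds if y >= t)      # category in 1..C
    return eta

# Hypothetical score: read a precomputed classifier output stored on
# the event record; any function of the full event e would do.
eta = make_threshold_categorizer(lambda e: e["y"], [0.2, 0.7])
assert eta({"y": 0.1}) is None   # rejected
assert eta({"y": 0.5}) == 1
assert eta({"y": 0.9}) == 2
```

Note that this sketch uses flat thresholds; the x-dependent thresholds advocated in this paper would let `thresholds` vary with the event variable x.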

Step 2. Number density analysis. The event variable x is calculated for events in each of the C categories.
The number density distribution of x in each of the categories S_c is then analyzed to test for the presence of a signal, say, using the likelihood-ratio test or the χ² difference test.
We will assume that there are two subpopulations of events, background and signal, each with its own exactly specified number density distribution of e. The null hypothesis, H_b, is that only the background subpopulation exists in the true event population, and the alternative hypothesis, H_b+s, is that both signal and background events exist in the true event population.
The goal of this paper is to provide a prescription to perform the event categorization step (step 1) "optimally". To this end, we need a measure of how distinguishable the two hypotheses are after reducing each event to its x value and the category it belongs to, i.e., after event categorization. We lay the notational groundwork for that in table 1.
Note that both the null and the alternative hypotheses are point hypotheses, i.e., there are no free parameters in either H_b or H_b+s, including the relative total abundance of signal and background, S/B. The effect of scaling the signal and background cross-sections relative to each other will become apparent at the end of the construction.

Statistical distances
To formalize the problem of optimizing the event selection/categorization procedure, we need a measure of how distinguishable the null and the alternative hypotheses are after reducing each event to the category it belongs to and its x value. In eq. (2.1) we provide six such measures, referred to as statistical distances between the hypotheses. The prescription we develop in this paper is compatible with any of these distance measures; they represent a choice available when using ThickBrick.

Table 1. Notation and definitions/interpretations.

e: Random vector which contains all the observable information (that is available to an analysis) pertaining to a single collider event. Randomness of e refers to the randomness of collider events.

x or x(e): Event variable (also a random vector) constructed out of e whose distribution is to be analyzed to search for the presence of a signal.

Ω_e: The sample space of event e.

Ω_x: The sample space of event variable x.

H_b: Null hypothesis that only the background subpopulation exists in the true event population.

H_b+s: Alternative hypothesis that both background and signal subpopulations exist in the true event population.

η(e): Represents the event categorizer used in the analysis. η is a function of e and returns the category (integer) between 1 and C to place an event in. η can also return None if event rejection is allowed.

p(e): The probability that an event sampled from the alternative hypothesis H_b+s is a signal event, conditional on its e value.

p_c(x): The probability that an event sampled from the alternative hypothesis H_b+s is a signal event, conditional on its x value and the category c it belongs to. p_c(x) can also be thought of as the expected value of p(e) conditional on (c, x).

Note: all quantities with the subscript c implicitly depend on the categorizer η used.

We will only provide a brief description of these distances here, and relegate the derivation of these formulas to appendix A. We will go over the estimation of these integrals in section 2.3, and provide a detailed comparison between the distance measures in appendix D.
• D_Neymχ² is the asymptotic expected value of the Neyman χ² statistic [39], sometimes referred to as the modified χ² statistic, in the fine-binning limit under the alternative hypothesis. This may be familiar as "s/√n summed over bins in quadrature".
• D_Pearχ² is the asymptotic expected value of the Pearson χ² statistic [40] in the fine-binning limit under the alternative hypothesis. Again, this may be familiar as "s/√b summed over bins in quadrature".
• D_KL is the Kullback-Leibler divergence [41] from the null hypothesis to the alternative hypothesis. Up to a factor of 1/2, this form may be familiar as the square of the (asymptotic) expected significance of excess under the alternative hypothesis [42]. We provide an alternative interpretation of D_KL in appendix B.
• D_revKL is the Kullback-Leibler divergence [41] from the alternative hypothesis to the null hypothesis. We provide an interpretation of D_revKL in appendix B.
• D_J is the Jeffreys divergence [43] between the null and alternative hypotheses. It is a symmetrized version of the Kullback-Leibler divergence, defined as D_J = D_KL + D_revKL.

• D_B is the Bhattacharyya distance [44] between the null and alternative hypotheses.
In the limit s_c(x) ≪ b_c(x) for all c and almost all x, these distance measures agree with one another (up to conventional normalization factors) to leading order in s_c(x)/n_c(x). Generically, we will refer to these distances as D_stat.
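For intuition, the first two distance measures can be approximated on binned expected counts. The following is a minimal sketch with illustrative numbers; the versions in eq. (2.1) are integrals over Ω_x, per category, of which this is a coarse-binned stand-in.

```python
def d_neym_chi2(s, n):
    """Binned stand-in for D_Neymchi2: sum over bins of s^2/n
    ("s/sqrt(n) summed over bins in quadrature")."""
    return sum(si * si / ni for si, ni in zip(s, n) if ni > 0)

def d_pear_chi2(s, b):
    """Binned stand-in for D_Pearchi2: sum over bins of s^2/b
    ("s/sqrt(b) summed over bins in quadrature")."""
    return sum(si * si / bi for si, bi in zip(s, b) if bi > 0)

# Illustrative expected per-bin signal and background counts,
# flattened over categories and x-bins.
s = [5.0, 8.0, 6.0]
b = [100.0, 40.0, 10.0]
n = [si + bi for si, bi in zip(s, b)]   # total counts under H_b+s

# Since n >= b bin by bin, D_Neymchi2 <= D_Pearchi2; both vanish
# iff s is identically zero (the hypotheses coincide).
assert 0.0 < d_neym_chi2(s, n) < d_pear_chi2(s, b)
assert d_neym_chi2([0.0] * 3, b) == 0.0
```

The final assertions illustrate key point 1 below: the distances vanish exactly when the two hypotheses are indistinguishable.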
Some key points about these distances relevant to our discussion are:

1. They capture the distinguishability of the hypotheses. The measures given in eq. (2.1) do not depend on actual data. Instead, they depend on the two hypotheses, H_b and H_b+s, for the distribution of data after event categorization. The greater the value of these distances, the more distinguishable the hypotheses.
Although not explicit from their form, it can be shown that D_stat ≥ 0, with equality occurring iff b_c(x) = n_c(x) for all c ∈ {1, . . . , C} and almost all x ∈ Ω_x, i.e., iff the hypotheses H_b and H_b+s are experimentally indiscernible from one another.

2. They depend on the event categorization criteria. How events are categorized implicitly affects each of the distance measures. This allows us to phrase the problem of optimizing the event categorization procedure as maximizing D stat by varying the event categorization criteria.
3. They scale linearly with the integrated luminosity of the experiment. Scaling the number of signal and background events proportionally, without changing their distributions or relative abundance, will scale D_stat by the same factor. As a result, we can write each of our statistical distances as N times the expected value of a per-event contribution defined appropriately. We will use this in constructing a training prescription to maximize our distance measures.

4. They depend on each event only through its x and p(e) values. This means that there exists an optimal categorization criterion based only on the x and p values of events. This is to be expected from the Neyman-Pearson lemma, a consequence of which is that p(e) contains all 10 the information in an event relevant to distinguishing between the hypotheses H_b and H_b+s. The optimal categorization criterion will also take into account the x value of an event to make sure that only information complementary to x is used in the categorization.
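The linear scaling in point 3 can be checked directly on the binned "s²/b summed over bins" measure; a small self-contained sketch with made-up counts:

```python
def d_pear(s, b):
    """One of the D_stat measures, in binned form: sum over bins of s^2/b."""
    return sum(si * si / bi for si, bi in zip(s, b) if bi > 0)

s = [5.0, 8.0, 6.0]        # illustrative expected signal counts per bin
b = [100.0, 40.0, 10.0]    # illustrative expected background counts per bin

lumi = 3.0                 # scale up the integrated luminosity threefold
s_hi = [lumi * si for si in s]
b_hi = [lumi * bi for bi in b]

# Scaling all expected counts by `lumi` scales D_stat by the same factor,
# since each term s^2/b picks up exactly one power of the scale factor.
assert abs(d_pear(s_hi, b_hi) - lumi * d_pear(s, b)) < 1e-9
```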

Eventwise contribution to D_Neymχ²
Let us choose D_Neymχ² as a representative distance measure with which to construct our prescription. For notational simplicity, wherever applicable, x will refer to x(e) of the particular event e under consideration. Let us begin with a (training) set of events sampled from H_{b+s}, i.e., both signal and background events. We want to train an event categorizer on this event sample to maximize D_Neymχ². We will assume that the values of x and p(e) are known for each event. p(e) can be estimated separately using machine learning techniques as shown in [14,17]. In section 3.1, we will discuss some practical aspects of learning p(e) for our event categorization purpose.
In the interest of providing some intuition for what follows, let us momentarily sacrifice mathematical rigor (footnote 11). Since we are looking to train an event categorizer, let us look at the effect of modifying a given event categorizer 'slightly'. Let us modify an event categorizer so that a given event e moves from category c_1 to category c_2. This changes D_Neymχ² by

ΔD_Neymχ² = Σ_c [ −(s_c(x)/n_c(x))² Δn_c(x) + 2 (s_c(x)/n_c(x)) Δs_c(x) ].    (2.4)

Here Δn_c(x) and Δs_c(x) represent small changes to n_c(x) and s_c(x) caused by the change in the event categorizer, with Δs_c(x) = p(e) Δn_c(x). Note that for our modification of moving event e from c_1 to c_2, Δn_{c_2} = −Δn_{c_1}. The first term captures the loss (gain) from increasing (decreasing) n_c, and the second term captures the gain (loss) from increasing (decreasing) s_c. Armed with this intuition, let us rewrite D_Neymχ² as follows:

D_Neymχ² = Σ_c ∫_Ω_x dx s_c²(x)/n_c(x)    (2.5)
= ∫ dε n(e) Σ_c ∫_Ω_x dx δ_{c,η(e)} δ_x(x(e) − x) s_c²(x)/n_c²(x)    (2.6)
= ∫ dε n(e) Σ_c ∫_Ω_x dx δ_{c,η(e)} δ_x(x(e) − x) [ 2 (s_c(x)/n_c(x)) (s_c(x)/n_c(x)) − (s_c(x)/n_c(x))² ]    (2.7)
= ∫ dε n(e) Σ_c ∫_Ω_x dx δ_{c,η(e)} δ_x(x(e) − x) [ 2 (s_c(x)/n_c(x)) (s(e)/n(e)) − (s_c(x)/n_c(x))² ]    (2.8)
= ∫ dε n(e) Σ_c δ_{c,η(e)} [ 2 p_c(x) p(e) − p_c²(x) ]    (2.9)
= N MEAN_{e∼H_{b+s}} [ Σ_c δ_{c,η(e)} ( 2 p_c(x) p(e) − p_c²(x) ) ],    (2.10)

where δ is the Kronecker delta, and MEAN_{e∼H_{b+s}}[. . .] represents the expected value computed over events e sampled from H_{b+s}. As a reminder, in eqs. (2.6)-(2.10), x represents x(e) of the event e being integrated over. Note the commonality between the right-hand sides of eqs. (2.4) and (2.10). We will first justify each step of the derivation, and then unpack the power of the last equation for our categorizer training purpose. Eq. (2.6) follows from the fact that

n_c(x) = ∫ dε n(e) δ_{c,η(e)} δ_x(x(e) − x),    (2.11)

where δ is the Kronecker delta function, and δ_x satisfies (footnote 12)

∫_Ω_x dx δ_x(x(e) − x) f(x) = f(x(e))    (2.12)

for any suitable test function f.

Footnote 10: Since the total number of events is itself a random variable with different distributions under H_b and H_{b+s}, the mere occurrence/existence of an event in data also carries some information relevant to distinguishing between H_b and H_{b+s}, in addition to p(e). But in the context of event categorization, this is a moot point.

Footnote 11: The following argument can be presented more rigorously by considering a probabilistic event categorizer that returns the probabilities of assigning an event to different categories, and using functional derivatives to capture small changes to the categorizer. This, however, is unnecessary for our purposes here.

Eq. (2.8) follows from the fact that s_c(x)/n_c(x) is the expected value of s(e)/n(e) over events sampled from H_{b+s}, conditional on their c and x values. More concretely, eq. (2.8) can be shown to follow from

s_c(x) = ∫ dε n(e) (s(e)/n(e)) δ_{c,η(e)} δ_x(x(e) − x).    (2.13)

Eq. (2.9) follows from the definitions of p_c(x) and p(e), and eq. (2.10) follows from the definition of the expected value of a random variable over events sampled from a given distribution. Based on the form of eq. (2.10), let us define a function Λ_Neymχ²:

Λ_Neymχ²(U, V) ≡ 2UV − U².    (2.14)

Λ_Neymχ²(U, V) represents the eventwise contribution to the statistical distance D_Neymχ² of adding an event e with p(e) = V to a category c with p_c(x(e)) = U. This lets us write eq. (2.10) as

D_Neymχ² / N = MEAN_{e∼H_{b+s}} [ Σ_c δ_{c,η(e)} Λ_Neymχ²(p_c(x), p(e)) ].    (2.15)

To understand the meaning of the phrase "eventwise contribution to the statistical distance", let us see how eq. (2.15) can be used to estimate D_Neymχ²/N for a given event categorizer η.
Estimating D_Neymχ²/N for a given event categorizer η:

1. Use the event categorizer η to categorize events sampled from H_{b+s}.
2. Estimate p c (x) for all c and x using the categorized events. This can be done either as the average of p(e) conditional on (c, x), or directly from the label carried by each training event indicating whether it is a signal or background event.
3. Next, for each event e, estimate its contribution to D_Neymχ² as Λ_Neymχ²(p_c(x), p(e)), where c is the category it belongs to under the given event categorizer η. If event rejection is allowed, the contribution from rejected events is just 0. The average value of this contribution over the training events is an estimate of D_Neymχ²/N (footnote 13). In this way, the Λ_Neymχ² term represents the eventwise contribution to D_Neymχ². While this is admittedly a roundabout way of estimating D_Neymχ² from a training sample, it leads to a method of training an optimal event categorizer iteratively, as we will see next.
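The three steps above can be sketched numerically. The snippet below assumes the quadratic form Λ_Neymχ²(U, V) = 2UV − U² implied by eq. (2.17); the event sample and the fixed categorizer (a simple cut on p, with no x dependence) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def lam_neym(U, V):
    # Assumed eventwise contribution: Lambda(U, V) = 2*U*V - U**2,
    # i.e. V**2 - (U - V)**2: linear in V, maximal in U at U = V.
    return 2.0 * U * V - U**2

# Toy event sample from H_{b+s}: one x-"bin", two categories split on p.
p = rng.uniform(0.0, 1.0, size=100_000)      # stand-in for p(e)
cat = (p > 0.5).astype(int)                  # step 1: a fixed categorizer eta

# Step 2: p_c = mean of p(e) in each category (no x dependence in this toy).
p_c = np.array([p[cat == 0].mean(), p[cat == 1].mean()])

# Step 3: the average eventwise contribution estimates D_stat / N ...
d_over_n = lam_neym(p_c[cat], p).mean()

# ... which is bounded above by MEAN[Lambda(p, p)] = MEAN[p^2],
# the infinite-categories limit discussed later.
print(d_over_n, (p**2).mean())
```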

Iterative maximization of D_Neymχ²
The idea behind writing the statistical distance, D_Neymχ², in terms of an eventwise contribution, Λ_Neymχ², is that from an event categorizer η, we can get a "better" event categorizer η̃ by requiring that η̃ place each event in the category that maximizes Λ_Neymχ²(p_c(x), p(e)) (with p_c(x) corresponding to the categorizer η).
This leads to the following iterative prescription to train an event categorizer: 1. Let the current categorizer be η. Categorize events sampled from H b+s as per η.
2. Estimate p c (x) for all c and x (for the current η) using the categorized events.
3. Construct a new categorizer η̃ which places each event e in the category c with the highest Λ_Neymχ²(p_c(x), p(e)). If event rejection is allowed, then an event will be rejected by η̃ iff Λ_Neymχ²(p_c(x), p(e)) < 0 for all categories c.
We will postpone the detailed description of this training procedure to section 3, which is intended to serve as a usage guide independent of the build-up we are doing now. But first, to gain some intuition, let us take a closer look at what the iterative prescription described above does operationally. From a given categorizer η with corresponding signal-fraction functions p_c(x), we build a new categorizer η̃ as (footnote 14)

η̃(e) = argmax_{c ∈ {1, . . . , C}} Λ_Neymχ²(p_c(x(e)), p(e)).    (2.16)

Note from eq. (2.14) that Λ_Neymχ²(U, V) has a global maximum in the first argument at U = V, and monotonically decreases as U moves away from V. It is also symmetric (footnote 15) in U about this maximum. Combining this observation with eq. (2.16), we can see that η̃ places each event in the category with p_c(x) closest to p(e). We can think of η̃ as grouping events by their p(e) values, with x-dependent boundaries between the different categories; figure 3 shows a cartoon of what such a categorization could look like. This makes sense considering that the purpose of categorization is to reduce the information loss arising from mixing up regions of phase-space with different p(e) values when reducing the full event e to the event variable x.

Footnote 14: To keep the presentation simple, we are ignoring the case where event rejection is allowed; η̃(e) will be "rejected" (or None) if every category leads to a negative contribution.

Footnote 15: The maximum in the first argument at U = V and the monotonic decrease as U moves away from this maximum will hold for the eventwise contributions of the other statistical distances as well, but not the symmetry.
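The iterative prescription is, operationally, a k-means-style loop: estimate p_c(x), reassign events, repeat. The sketch below implements it for D_Neymχ² with a binned estimate of p_c(x) on a made-up training sample; the Λ form (2UV − U²), the binning, and all numbers are illustrative assumptions, not the ThickBrick package implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def lam_neym(U, V):
    # Assumed eventwise contribution for D_Neymchi2 (cf. eq. (2.17)):
    # Lambda(U, V) = 2*U*V - U**2 = V**2 - (U - V)**2.
    return 2.0 * U * V - U**2

# Toy training sample from H_{b+s}: event variable x and classifier output
# p(e), deliberately correlated with x.
N = 50_000
x = rng.uniform(-1.0, 1.0, N)
p = np.clip(0.5 + 0.3 * (1.0 - np.abs(x)) * rng.normal(size=N), 0.0, 1.0)

C = 2                                     # number of categories (no rejection)
edges = np.linspace(-1.0, 1.0, 21)        # fixed x-bins for estimating p_c(x)
xb = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)

cat = (p > np.median(p)).astype(int)      # an arbitrary initial categorizer
history = []                              # running estimate of D / N

for step in range(50):
    # Estimate p_c(x) as the mean p in each (category, x-bin).
    p_c = np.full((C, len(edges) - 1), 0.5)
    for c in range(C):
        for b in range(len(edges) - 1):
            sel = (cat == c) & (xb == b)
            if sel.any():
                p_c[c, b] = p[sel].mean()
    history.append(lam_neym(p_c[cat, xb], p).mean())
    # Reassign each event to the category maximizing Lambda(p_c(x), p(e)).
    new_cat = np.argmax(lam_neym(p_c[:, xb], p[None, :]), axis=0)
    if (new_cat == cat).all():            # Lloyd-style termination
        break
    cat = new_cat

print(f"stopped after {step + 1} steps; D/N estimate: {history[-1]:.4f}")
```

Both half-steps (reassignment for fixed p_c, and re-estimation of p_c for fixed assignments) can only increase the estimated distance, which is the essence of the convergence argument referred to in appendix C.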

Recall that p_c(x) is the mean p in category c conditional on x. This process of iteratively placing events in the category with the closest mean, and then recomputing the mean for each category, could be reminiscent of Lloyd's k-means clustering algorithm [45,46]. This connection can be made clearer by noting from eq. (2.14) that

Λ_Neymχ²(U, V) = −(U − V)² + V².    (2.17)

The first term is congruous to [X − X̄_c]², the objective function k-means clustering seeks to minimize the average of, where X̄_c is the sample mean within cluster c which contains X, and the last term is independent of the event categorization criteria. More explicitly, the objective function to be minimized in k-means clustering can be written as

MEAN_X [ Σ_c δ_{c,cluster(X)} (X − X̄_c)² ],

which bears a close resemblance to the right-hand side of eq. (2.15). In fact, for the case of D_Neymχ², our iterative training procedure can be thought of as the k-means clustering algorithm, with events within each x-"bin" being clustered based on their p(e) values. But instead of doing this bin-by-bin in x, we can take an unbinned approach to the estimation of p_c(x). We will elaborate more on this in section 3. Note that we have not yet proved that the categorizer will "improve" during each iterative step, or that it will converge on an optimal categorizer in a finite number of steps; at this point the algorithm is held together by the heuristic argument presented in eq. (2.4) and the connection to k-means clustering. In appendix C, we present a proof of the correctness of the iterative training algorithm (for D_Neymχ², as well as the other statistical distances). The proof will be similar to that of the generic k-means clustering algorithm, and will rely on the fact that Λ_Neymχ²(p_c(x), p(e)) and its counterparts for the other distances are linear in p(e) and have a global maximum in the first argument at p_c(x) = p(e).

Generalizing to other statistical distances
We can repeat the process in section 2.3 for the other statistical distances in eq. (2.1) and generate eventwise contributions to them as well. We will skip the derivations and only present the relevant results here. First, let us define D̃_stat as the corresponding D_stat rescaled by an overall constant; the numerical factors of 2 and 8 are introduced to make all the D̃_stat-s converge to each other in the limit s_c(x)/n_c(x) → 0 for all c and almost all x. For all D̃_stat-s, we have

D̃_stat / N = MEAN_{e∼H_{b+s}} [ Σ_c δ_{c,η(e)} Λ_stat(p_c(x), p(e)) ],    (2.20)

where, for each distance measure, Λ_stat is a function from [0, 1) × [0, 1] to ℝ (footnote 16). With these choices, the validity of eq. (2.20) can be verified (after some manipulation) by noting that, because Λ_stat(p_c(x), p(e)) is linear in p(e), its expectation over events with given c and x values can be computed by replacing p(e) with p_c(x). For each Λ_stat it can be shown that, for all V ∈ [0, 1], Λ_stat(U, V) is monotonically increasing in U for 0 ≤ U ≤ V and monotonically decreasing in U for V ≤ U < 1, with the global maximum in U attained at U = V. This means that in each iterative step, the "next" categorizer will place event e in one of the two categories with "current" p_c(x) closest to p(e) on either side, or reject the event if that option is allowed.

The rectified "misalignment"
In the introductory section 1.1, the source of the misalignment between the physics goals and typical ML-based event selectors was attributed to not ensuring that categorization is based only on information complementary to the event variable x. Now we are in a position to point out what that misalignment is. In typical ML-based selector training, all signal events are treated on the same footing and all background events are treated on the same footing, in the sense that the reward/penalty for including/excluding an event depends only on whether the event is from the signal or background sample. On the other hand, our eventwise contribution functions Λ_stat(p_c(x), p(e)) suggest that the reward/penalty should also depend on the x value of the event. For the rest of this subsection, let us restrict our attention to event selectors only, for simplicity. The negative quantity Λ_stat(p_c, 0) represents the penalty for selecting a background event into an x-"bin" of signal purity p_c. As can be seen from the left panel of figure 4, this penalty increases with p_c for each of our distance measures. Similarly, Λ_stat(p_c, 1) represents the reward for selecting a signal event into an x-"bin" of signal purity p_c. As can be seen from the right panel of figure 4, this reward also increases with p_c.
An event e will warrant selection only if the following net reward corresponding to p(e) is non-negative:

p(e) Λ_stat(p_c(x), 1) + (1 − p(e)) Λ_stat(p_c(x), 0).    (2.22)

By the linearity of Λ_stat in its second argument, this net reward is precisely Λ_stat(p_c(x), p(e)). Let p_sel_thresh^stat(p_c) be the selection threshold on p(e) above which an event warrants being selected into a bin of purity p_c. This threshold is defined by the equation

Λ_stat(p_c, p_sel_thresh^stat(p_c)) = 0.    (2.23)

Figure 5 shows that this threshold is an increasing function of p_c for each of our statistical distances, implying that the threshold is stronger in x-"bins" of higher purity. This is how ThickBrick naturally leads to a decorrelation of the event selection criteria from the event variable x. The failure to incorporate these x dependences or, more precisely, the p_c(x) dependences, is the misalignment present in typical ML-based selector training that our ThickBrick prescription rectifies. The fact that p_c(x) itself depends on the event selector is the reason our training prescription changes the per-event reward function being maximized at each iterative step.
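For the Neyman χ² measure, with the assumed form Λ(U, V) = 2UV − U², eq. (2.23) can be solved in closed form: the threshold is p_c/2, which indeed grows with the bin purity p_c. A generic bisection solver (a sketch, usable for any Λ_stat that is negative below the threshold and positive above it) recovers this:

```python
def lam_neym(U, V):
    # Assumed Neyman-chi2 eventwise contribution: Lambda(U, V) = 2*U*V - U**2.
    return 2.0 * U * V - U * U

def sel_threshold(p_c, lam, tol=1e-10):
    """Solve lam(p_c, V) = 0 for V in (0, 1) by bisection: the value of p(e)
    at which the net reward for selection into a bin of purity p_c vanishes."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lam(p_c, mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# For Lambda(U, V) = 2UV - U^2 the root is exactly V = U/2, so the
# threshold increases with the purity p_c of the x-"bin".
for p_c in (0.2, 0.5, 0.8):
    print(p_c, sel_threshold(p_c, lam_neym))
```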

Prescription for training an optimal event categorizer
In this section we provide a description of our training prescription; it contains almost all the information we need to carry from the previous section (groundwork) into this one.
Let us begin with a sample of simulated events containing both signal and background, split into training and validation sets. For every simulated event i:

• Let e_i represent all the observable information from the event.
• Let w_i be the non-negative weight associated with the event (footnote 17).

• Let y_i be the label specifying the subsample the event is from: 1 for signal, 0 for background. In HEP applications, this labeling is trivially possible because we know which model was used to simulate a particular event.
• Let x i = x(e i ) be the value of the event variable x whose distribution is to be analyzed to search for the presence of a signal.
Let S̃/B̃ be the overall signal-to-background ratio in the training sample. We will not assume that this matches the overall signal-to-background ratio S/B expected in an experiment under the hypothesis H_{b+s}. This is to allow for the fact that this ratio could be a priori unknown in the search (if the signal cross-section is an unspecified parameter in the theory). Even in cases when the ratio S/B is known, working with a highly imbalanced sample with very different values for S̃ and B̃ can make learning p(e) using ML techniques difficult. So it is often preferable to use training samples with S̃/B̃ close to 1.

Preprocessing: learning p(e)
From the training data we can estimate p(e) as the expected value of y_i, weighted by w_i, conditional on e. If the dimensionality of e is large, this can be done with machine learning techniques [14,17], or the Matrix Element Method [48,49]. The accuracy of the estimate will depend on several factors including, but not limited to, the amount of computational resources and training data available.

Footnote 17: We accommodate the possibility of weighted events in order to allow for easy scans of BSM parameter spaces by simple reweighting of training events using the matrix element [2,48,49].

We can address the limitation posed by inaccuracies in the estimation of p(e) by treating the output of the classifying algorithm (a BDT or neural network, for example) not as p(e) but simply as a property of the event to base our categorization on. Let O_ML(e) be the machine learning classifier output. We can estimate p(e) as the probability that an event sampled from H_{b+s} will be a signal event, conditional on both O_ML(e) and x(e), as opposed to conditional on e. Presumably the dimensionality of x(e) is small enough to perform this estimation with sufficient accuracy. This can be done either using traditional regression methods (parameterized or unparameterized), or with another machine learning layer (with O_ML and x as inputs). Henceforth, p(e_i) and p_i will both refer to this estimate of p(e_i) (footnote 18).

Out of the six statistical distance measures, only D̃_Pearχ² maximization leads to a categorizer that is independent of the value of the overall signal-to-background ratio S/B under H_{b+s}. Training using the other five measures is sensitive to S/B; if training is to be done using these measures, w_i and p_i both need to be appropriately scaled before proceeding further. To scale the signal cross-section by a factor of λ (i.e., if S/B = λ S̃/B̃), we need to apply the following transformations (footnote 19):

w_i → λ w_i    (for signal events only; footnote 20),    (3.1a)
p_i → λ p_i / (1 − p_i + λ p_i)    (for all events).    (3.1b)

In searches where there is no a priori expectation for the size of the signal cross-section (footnote 21), one option is to use D̃_Pearχ² as the distance measure, or to choose a very small value of λ (which is equivalent to using D̃_Pearχ²). Another option is to choose a value of λ that puts the signal cross-section either near the low end of discoverability, or near the potential upper limit on the signal cross-section (low end of rejectability) expected from the available experimental data.
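The scaling step can be sketched as follows; the function name is our own, and the odds-based form of the p transformation, p/(1 − p) → λ p/(1 − p), is an illustrative assumption consistent with scaling the signal number density by λ:

```python
import numpy as np

def rescale_signal(w, p, y, lam):
    """Sketch of rescaling the signal cross-section by a factor lam:
    signal-event weights are scaled by lam, and every event's signal
    probability p is updated via the odds map p/(1-p) -> lam * p/(1-p)."""
    w_new = np.where(y == 1, lam * w, w)      # weights: signal events only
    p_new = lam * p / (1.0 - p + lam * p)     # probabilities: all events
    return w_new, p_new

w = np.ones(4)
p = np.array([0.1, 0.5, 0.9, 0.0])
y = np.array([1, 0, 1, 0])
w2, p2 = rescale_signal(w, p, y, lam=0.5)
print(w2, p2)   # halving the signal pushes every p toward 0
```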
At the end of this preprocessing, each training event should have w, y, x, and p associated with it (possibly scaled). In addition, we should have a way of computing x and p for validation MC events as well as for real events from the experiment (by first running the event e through the machine learning classifier to get O_ML(e), using that to get p, and performing the scaling operations in eq. (3.1) if applicable). It can be verified that, by construction,

p ≈ E_{H_{b+s}} [ y | x, p ].    (3.2)

Eq. (3.2) just means that p approximately equals the probability of an event sampled from H_{b+s} being a signal event, conditional on (x, p), with the inexactness resulting from training limitations.

Choices to be made
1. Choose a distance measure to train with. As mentioned earlier, ThickBrick involves choosing a distance measure between the 'background only' and 'background+signal' hypotheses, to be maximized by the event categorizer. The possible options are D̃_Neymχ², D̃_Pearχ², D̃_KL, D̃_revKL, D̃_J, and D̃_B.
In actual experiments, previously undetected signals (if present) are expected to be faint. In this limit, all the statistical distances converge to one another up to constant factors, as per eq. (2.2). As a result, the choice of distance measure will be of little consequence in most situations. However, here we will comment on the differences between the distance measures (when the signal being searched for is not weak), for academic purposes.
As mentioned above in section 3.1, maximizing D̃_Pearχ² leads to a categorizer whose optimality is independent of the overall signal-to-background ratio. This is because scaling the signal number density by an overall factor (s(e) → λ s(e)) only changes D̃_Pearχ² by a multiplicative factor (D̃_Pearχ² → λ² D̃_Pearχ²). This makes D̃_Pearχ² a good option when there is a priori no expectation for the potential signal cross-section.
We show in appendix B that training using D̃_revKL (with the training signal cross-section near the low end of discoverability) is theoretically optimal for signal discovery. Similarly, training using D̃_KL (with the training signal cross-section near the expected upper limit on it) is theoretically optimal for upper-limit setting.
In practice, the main difference between the different distance measures lies in the (x-dependent) boundaries on the values of p between different categories: they have different thresholds to move events from a low purity category to a high purity category. Let us call a distance measure that leads to higher thresholds 'stricter'. We show in appendix D that the distance measures arranged in increasing order of strictness are as follows (left to right, least strict to most strict):

D̃_Neymχ² ≺ D̃_revKL ≺ D̃_B ≺ D̃_J ≺ D̃_KL ≺ D̃_Pearχ².    (3.3)

Here, the notation D̃_Neymχ² ≺ D̃_revKL just means that D̃_Neymχ² precedes D̃_revKL in the order of strictness (as described above). Note that the hierarchy (3.3) is reflected in the ordering of the selection thresholds in figure 5.
One might want to try out different distance measures, and pick one based on practical considerations like having sufficient number of events in each category, shape features of the distribution of x post-categorization, etc. -it is possible that background MC validation (post-categorization) is easier when using certain distance measures than others.
Henceforth, where applicable, D̃_stat will refer to the chosen distance measure.

2. Choose the number of categories (int C ≥ 1) and whether or not to allow event rejection (bool allow_rejection).
To use our prescription to train an event selector (as opposed to a categorizer), set the number of categories C to 1, and set allow_rejection to True.
Theoretically, in the asymptotic (large number of events) limit, having more categories will lead to higher sensitivity. In fact, in the infinite categories limit, the sensitivity of the search will approach the sensitivity of a likelihood-ratio test performed in the full phase-space e, which is the theoretical upper-limit of the sensitivity of the search. Also, for a given number of categories, allowing event rejection will lead to higher sensitivity than disallowing it. In practice, however, increasing the number of categories will lead to data getting diluted out across categories, which can lead to problems.
Fortunately, though, continually increasing the number of categories typically leads to diminishing returns (footnote 22). In section 3.4 we will show how to estimate the value of D̃_stat (post training) in the C → ∞ limit. This can be used to assess the utility of increasing the number of categories and weigh it against the corresponding disadvantages, in order to settle on a good value of C.

Iterative training prescription
Algorithm 1 shows the iterative event categorizer training prescription introduced in section 2.4. We have already motivated the algorithm in section 2, and the proof of correctness has been relegated to appendix C. Here, we will elaborate on different parts of the algorithm from an implementation point of view, as well as provide diagnostic notes from a practical usage point of view.
Lines 1 and 2 refer to the preprocessing covered in section 3.1 and the choices covered in section 3.2.

Initialization (line 3):
the categorizer η will categorize events based on their x and p values. It can be initialized to apply arbitrary cutoffs/thresholds on the p values. The initial cutoffs/thresholds could possibly be x dependent, though this is not necessary. η can also be initialized to a randomized categorizer or to a constant function.
Note that η being a function of x and p is consistent with our original notation of η being a function of event e, since x(e) and p(e) are part of the event description after the preprocessing step. Also note that in our description here, η is expected to return an integer between 1 and C or, alternatively, "rejected" if allow_rejection is set to True.
Lines 5-21 contain the code block for one iterative training step.

Footnote 22: The diminishing returns can be attributed to the differentiable behavior of Λ_stat(U, V) in the parameter U at/near the maximum U = V. As a result, the returns from creating more categories to reduce the distance between p(e) and the closest p_c(x) are diminishing.

Estimation of p_c(x) (lines 5, 6): to estimate the signal-fraction functions p_c(x) for a given categorizer η, training events are first categorized based on η. Then, from the training events in a given category c, p_c(x) can be estimated as the expected value of either p or y (weighted by w) conditional on x. This regression problem can be solved using a plethora of techniques, binned and unbinned, parametric and non-parametric (like kernel regression) (footnote 23). The estimation of p_c(x) for a given categorizer comes with the typical challenges of machine learning, including the bias-variance trade-off. One might want to employ cross-validation techniques (for example, in picking the bandwidth for kernel regression) to get the best results. If the estimation of p_c(x) is affected in some regions of x by a low density of training events, it might be helpful to increase the sampling density in those regions (and appropriately decrease the event weights). Of course, with a reasonably large number of training events, and a reasonably small dimensionality of x and number of categories, these measures will be rendered inessential.

Footnote 23: Note that the estimation of p_c(x) could be split between line 6 and line 8, with line 6 representing any required preprocessing and line 8 representing the actual estimation for particular values of x and c. This is typically the case in kernel regression.
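As one concrete non-parametric option, a weighted Nadaraya-Watson estimate of E[t | x] with a Gaussian kernel can be sketched as follows (a minimal illustration, not the ThickBrick implementation; in practice the bandwidth would be chosen by cross-validation, and the regression target t can be either p or the label y):

```python
import numpy as np

def nadaraya_watson(x_train, t_train, w_train, x_query, bandwidth):
    """Weighted Nadaraya-Watson estimate of E[t | x] with a Gaussian kernel,
    as one might use to estimate p_c(x) from the events in category c."""
    # Pairwise kernel weights between query points and training points.
    k = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    k *= w_train[None, :]                       # fold in event weights
    return (k @ t_train) / k.sum(axis=1)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 5000)
y = (rng.uniform(size=5000) < 0.5 + 0.4 * x).astype(float)  # P(y=1|x) = 0.5 + 0.4 x
w = np.ones_like(x)

xq = np.array([-0.5, 0.0, 0.5])
est = nadaraya_watson(x, y, w, xq, bandwidth=0.1)
print(est)   # approximately [0.3, 0.5, 0.7]
```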
Defining the next categorizer (lines 7-20): the next categorizer is constructed so as to reject events if Λ stat (p c (x), p) < 0 for all categories c (and if event rejection is allowed). Otherwise, the next categorizer groups events at a given value of x into different categories, based on their p values.
The sorting in line 8 ensures that, at each value of x, events in a higher category will have higher values of p. In other words, line 8 ensures that category boundaries (on values of p) do not interweave or cross each other as x is varied. Note that our distance measures D̃_stat are all invariant under an x-dependent cyclic permutation of categories. But having the categories ordered by their p_c at each x makes the categorization more meaningful. This also helps with the estimation of the signal-fraction functions p_c(x), by reducing their sensitivity to changes in x.
Note that during each iteration, the next categorizer depends implicitly on the estimated functions, p c (x), for the current categorizer. This means that the estimated functions p c (x) will need to be packaged into the final product of the training, which is the trained categorizer. The details of this packaging will depend on the particular regression technique employed.

Termination condition (line 21):
Lloyd's k-means clustering algorithm converges and terminates when the category assignments do not change in a training step. This convergence is guaranteed within a finite number of steps, and in practice the fraction of events whose category assignments change during a step drops significantly over only a few tens of iterations.
The guarantee of convergence within a finite number of steps extends to our algorithm as well (as argued in appendix C) if the estimation of p_c(x) is performed by binning events into different (fixed) x-bins and finding the expected value of p or y in each bin. But if one takes an unbinned approach to the estimation of p_c(x), it is possible to run into cycles near the end of the training process, due to artifacts of the regression technique. This can be handled by terminating the iterative training when the fraction of events reassigned in a step drops below a certain threshold.
One can also terminate the algorithm after a predetermined number of steps; the convergence of the algorithm will typically be fast enough (for small values of C) to warrant this. Alternatively, we can employ a stopping condition based on the saturation of D̃_stat, as explained below.

The value of D̃_stat/N for the current categorizer can be estimated, as in section 2.3, as the average over (appropriately weighted) training or validation events of Λ_stat(p_c(x), p(e)), with rejected events contributing 0. This can serve multiple purposes. First, we can use it to devise a termination condition: terminate the training when the fractional or absolute increase in the (estimated) value of D̃_stat drops below a certain threshold (footnote 24). Secondly, the values of D̃_stat for categorizers trained with different values of C (estimated from validation datasets) can be used to choose between different values of C. Furthermore, in the C → ∞ limit, the value of D̃_stat (for the optimal categorizer) can be estimated as

lim_{C→∞} D̃_stat / N = MEAN_{e∼H_{b+s}} [ Λ_stat(p(e), p(e)) ].    (3.5)

This is also the theoretical upper limit on D̃_stat achievable by analyzing the distribution of events in the full phase-space e. As mentioned in section 3.2, this can be used to assess the improvement one stands to gain by continually increasing C. Eq. (3.5) follows from eq. (2.20) by noting that the optimal categorizer in the infinite-categories limit will keep events with different values of p from mixing. Note that comparisons between categorizers based on D̃_stat will only be meaningful if a common performance measure is employed, regardless of the performance measure the individual categorizers were trained to maximize. In fact, it can be shown that for any given H_b, H_{b+s}, and categorizer η, the different D̃_stat values will satisfy

D̃_Neymχ² ≤ D̃_revKL ≤ D̃_B ≤ D̃_J ≤ D̃_KL ≤ D̃_Pearχ²,    (3.6)

which also happens to be the same as the order of strictness in eq. (3.3).

Miscellaneous notes
• Note that the iterative training procedure will converge on a locally optimal categorizer. To ensure that the training does not lead to a categorizer much worse than the globally optimal one, one may have to repeat the training procedure with different initial categorizers.
• It is possible that one is not interested in x-dependent cutoffs and thresholds on p, and simply wants to maximize the overall Σ_c S_c²/N_c (or the equivalent for a different statistical distance), using flat boundaries between categories. In such situations, a dummy event variable x can be used, with each event having the same value of the event variable. Alternatively, instead of estimating the signal-fraction functions p_c(x) in line 6 of algorithm 1, one can estimate the expected value of p conditional only on c. Either of these simplifications leads to an elegant solution to the problem of finding the optimal cutoffs or "working points" on ML classifier outputs.

• Different training runs can converge on somewhat different categorizers with nearly identical performance. This leeway in categorizer specification can be interpreted as certain decisions regarding the category assignments of events not mattering much, and could be a sign that the number of categories C is too high (to be warranted by the amount of training data used). In such situations, it may be possible to reduce the variability in the trained categorizer by reducing the number of categories C without a significant drop in performance. Alternatively, one can reduce the variability in the trained categorizer by sufficiently increasing the number of training events, in order to reliably probe the small variation in performance.
• Note that one has no control over the fraction of events that get placed in different categories by the trained categorizer. This could lead to very lopsided categorizations, leading to situations where increasing the number of categories C makes the expected number of events in some categories inconveniently small while other categories still have sufficient number of events to warrant further splitting. In such situations, one can freeze certain categories and force others to be split further by passing them through another categorizer. This is similar to the sequential splitting of events into categories used in [26].
One could also use different values of C in different regions of x, with each region using a different categorizer.
• It is possible to train a categorizer stochastically instead of iteratively, by maintaining a moving estimate of p c (x), updated based on a few randomly sampled training events (mini-batch) in each step. Similar to stochastic k-means clustering algorithms [50,51], such a stochastic implementation will scale up better with large training datasets and large number of categories C.
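A minimal sketch of such a stochastic variant, maintaining an exponential moving estimate of p_c(x) updated from mini-batches, could look like the following (the Λ form, learning rate, binning, and data are all illustrative assumptions, not the package implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

def lam_neym(U, V):
    # Assumed Neyman-chi2 eventwise contribution: 2UV - U^2.
    return 2.0 * U * V - U**2

# Toy stream of events: an x-bin index and p(e) for each event
# (p is taken to be independent of x for simplicity).
N, n_xbins, C = 100_000, 10, 2
xb = rng.integers(0, n_xbins, N)
p = rng.uniform(0, 1, N)

p_c = np.tile(np.linspace(0.3, 0.7, C)[:, None], (1, n_xbins))  # initial p_c(x)
eta = 0.05                                  # learning rate of the moving estimate

for start in range(0, N, 256):              # mini-batches of 256 events
    sl = slice(start, start + 256)
    cat = np.argmax(lam_neym(p_c[:, xb[sl]], p[None, sl]), axis=0)
    # Nudge the moving estimate of p_c(x) toward the mini-batch means.
    for c in range(C):
        for b in range(n_xbins):
            m = (cat == c) & (xb[sl] == b)
            if m.any():
                p_c[c, b] += eta * (p[sl][m].mean() - p_c[c, b])
    p_c.sort(axis=0)                        # keep categories ordered in purity

print(p_c.round(2))   # each bin drifts toward the k-means fixed point
```

For p uniform on [0, 1] and two categories, the k-means fixed point puts the category means near 0.25 and 0.75, which the moving estimates approach without ever holding the full dataset in a training step.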
As one can see, our training prescription is not hard-and-fast. It is more of a paradigm with several choices and variations, not all of which have been covered here, and a comprehensive exploration of these variations is beyond the scope of this work.

The ThickBrick package: a public implementation in Python
We provide a publicly available implementation of ThickBrick as a Python 3 package of the same name [38]. In this section we will demonstrate our ThickBrick package with a toy example. A guided tutorial of the package in the context of a physics example (a Higgs search in associated production ZH → (µ⁺µ⁻)(bb̄)) was recently presented at the PyHEP 2020 virtual workshop [52]. Additional physics-motivated examples will be demonstrated in a forthcoming publication [53].
Toy example. For our toy example, we will use two-dimensional data e ≡ (x, x_c) generated as follows. The background is taken to be uniformly distributed over the square-shaped region −10 ≤ x ≤ 10, −10 ≤ x_c ≤ 10. The signal is taken to be constrained within the same region, with a truncated bivariate normal distribution given by

f_s(x, x_c) = A exp[ −x²/(2σ²) − x_c²/(2σ_c²) ],    (4.2)

with σ = 2 and σ_c = 3, and A is the normalization factor given by

A⁻¹ = 2π σ σ_c erf(10/(√2 σ)) erf(10/(√2 σ_c)).    (4.3)

Figure 6 shows the unit-normalized signal distribution. In this example, we will use x as the event variable which will be used to search for the signal. x_c represents the information in e complementary to x, which in a real physics example could be high-dimensional. In the preprocessing step, we can take training events from the background and signal distributions in a 1:1 ratio, and estimate the probability p that an event came from the signal distribution, conditional on (x, x_c), using an ML classifier training algorithm as per section 3.1. But since the focus of our work here is not this preprocessing step, for simplicity we will use our knowledge of the underlying distributions to directly calculate p(x, x_c) as (footnote 25)

p(x, x_c) = (S/B) f_s(x, x_c) / [ f_b + (S/B) f_s(x, x_c) ],    (4.4)

with S/B set to 1, and f_b = 1/400 the unit-normalized background density. This p(x, x_c) is plotted in the top-right panel of figure 7. In this section, we will consider other values of the overall signal-to-background ratio S/B as well, but we will let the ThickBrick package handle the scaling internally. Therefore, in all plots in this section, p will refer to the value corresponding to S/B = 1.

Footnote 25: Note that the quality of the event selection/categorization is dependent on the quality of the estimation of p(e), and here we are bypassing this step.
The top-left panel of figure 7 shows the unit-normalized 2d distribution of (x, logit(p(x, x_c))). For better visualization of the spread of p at low values of |x|, here we use the logit function given by

logit(p) ≡ ln[ p / (1 − p) ] ,    (4.5)

which also has the convenient property that scaling the signal cross-section (or, equivalently, changing S/B) only changes logit(p) by an additive constant. In the top-left panel of figure 7, the function logit(p) appears to be bounded from below only because of the truncation of the data at the boundaries x_c = ±10. Finally, the bottom two panels of figure 7 show the distribution of signal events (orange) and background events (blue) in the event variable x before any event selection. The bottom-left panel has S/B = 1 (the signal is 50% of the total) and the bottom-right panel has S/B = 1/9 (the signal is 10% of the total), with the background distributions in both panels normalized to 1.

Figure 8 illustrates the best event selector with an x-independent threshold on p. The black horizontal line in the top-left panel shows the constant threshold that maximizes D̂_Pearχ²; events above this threshold will be selected. The top-right panel shows the region in the full phase-space that is selected by this threshold (the region bounded by the orange curve). The bottom two panels show the signal and background distributions in x after event selection. The background distribution before selection is normalized to 1 in both, and the signal-to-background ratio before selection is taken to be 1 (bottom-left panel) and 1/9 (bottom-right panel). These plots can therefore be directly compared to the bottom panels of figure 7. Notice the strong correlation between the selection criterion and the event variable x in the top-right panel, and the resulting background shaping in the bottom panels.
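A flat-threshold selector like the one in figure 8 can be emulated by scanning constant cutoffs on p and keeping the one that maximizes a binned estimate of the Pearson-χ² distance. This is an illustrative sketch, not the ThickBrick API; the per-bin form (n − b)²/(2b) and its overall normalization are assumptions.

```python
import numpy as np

def pearson_chi2_distance(x_sig_sel, x_bkg_sel, sig_weight, bins):
    # Binned estimate of the Pearson-chi^2 distance: sum over x-bins of
    # (n - b)^2 / (2b) = s^2 / (2b). The factor 1/2 is an assumed normalization.
    b, edges = np.histogram(x_bkg_sel, bins=bins)
    s, _ = np.histogram(x_sig_sel, bins=edges)
    s = s * sig_weight
    mask = b > 0
    return np.sum(s[mask] ** 2 / (2.0 * b[mask]))

def best_flat_threshold(p_sig, x_sig, p_bkg, x_bkg, s_over_b=1.0,
                        bins=np.linspace(-10, 10, 41)):
    # Per-event signal weight so the selected signal histogram is normalized
    # relative to background according to the assumed S/B (pre-selection).
    w = s_over_b * len(x_bkg) / len(x_sig)
    best_cut, best_d = 0.0, -np.inf
    # Scan candidate flat cutoffs on p (quantiles of the combined p sample)
    for cut in np.quantile(np.concatenate([p_sig, p_bkg]),
                           np.linspace(0.0, 0.99, 100)):
        d = pearson_chi2_distance(x_sig[p_sig > cut], x_bkg[p_bkg > cut],
                                  w, bins)
        if d > best_d:
            best_cut, best_d = cut, d
    return best_cut, best_d
```

Because the cutoff is the same at every x, events with large p tend to cluster at small |x|, which is exactly the background-shaping effect visible in figure 8.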

Next, let us consider a selector trained as per our prescription to maximize the same distance measure D̂_Pearχ². We used the 1d Nadaraya-Watson kernel regression technique [54, 55] to estimate p_c in each iterative step (implemented within the ThickBrick package). A training sample of 50000 background and 50000 signal events was used to train the selector, and the training converged in 10 training steps. Figure 9 illustrates the trained event selector; all the panels are analogous to their counterparts in figure 8. In the top-right panel, the region within the two horizontal orange lines is selected. Note how the selection is decorrelated from the event variable, confirming that event selection is occurring only based on the complementary information contained in x_c.
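The signal-fraction estimate p_c(x) can be obtained with a textbook 1d Nadaraya-Watson regressor. A minimal version with a Gaussian kernel is sketched below; the kernel and bandwidth choices are assumptions, and the package's internal implementation may differ.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.5):
    """1d Nadaraya-Watson kernel regression with a Gaussian kernel.

    Estimates m(x) = E[y | x] as a kernel-weighted average of the y values;
    applied to y = p within a category, this estimates p_c(x).
    """
    # Pairwise kernel weights K((x_query_i - x_train_j) / h)
    u = (x_query[:, None] - x_train[None, :]) / bandwidth
    w = np.exp(-0.5 * u ** 2)
    # Weighted average; clip guards against an all-zero weight row
    return (w @ y_train) / np.clip(w.sum(axis=1), 1e-300, None)
```

In each iterative step, the (x, p) pairs of the events currently assigned to category c play the role of (x_train, y_train).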
A word on decorrelation. In this toy example, x_c and x were constructed to be independent of each other under both the signal and background distributions for illustrative purposes only; our training process never uses this fact. The optimal selection thresholds were learned only from the distribution of events in the (x, p) space. One can see that our event selector uses only complementary information from the fact that the selection cuts in the top-right panel of figure 9 are horizontal.
Note that different distance measures will decorrelate the selection criteria from the event variable x to different degrees; D̂_Pearχ² is a special distance measure that leads to categorizers that are completely decorrelated from the event variable x. Since the decorrelation properties of our distance measures are not the main focus of this paper, we only briefly discuss this point in section 6. In figure 10, we illustrate a selector trained to maximize D̂_Neymχ², which is at the other end of the spectrum in terms of degree of decorrelation. Since training with distances other than D̂_Pearχ² is sensitive to the value of S/B, and the contrast between different distance measures increases with increasing S/B, we will use a high signal-to-background ratio of 1 for illustrative purposes.

Note that the implication of figure 10 is not that the corresponding event selector bases its decision partially on information non-complementary to x. In our toy example, the threshold on p at a given value of x will not change if we scale up the number of signal events at other values of x. The correct interpretation is that the selector bases its decision on the number density distributions of x_c under H_b and H_b+s conditional on x, instead of the probability distributions of x_c conditional on x; the background and signal distributions in x affect the selection criteria through this subtle difference.
Event categorization. Next we will consider event categorization with event rejection turned off (allow_rejection = False). As before, we will only consider D̂_Pearχ² and use a training sample of 50000 background and 50000 signal events. The top and bottom rows of figure 11 illustrate categorizers trained with the number of categories set to 2 and 3, respectively. In both cases, the black curves in the left panels show the thresholds between categories. The right panels show the corresponding boundaries in the (x, x_c) phase-space. In the top-right panel, the region within the orange horizontal lines comprises category 2, while the two regions outside the orange lines comprise category 1. In the bottom-right panel, the region within the orange horizontal lines comprises category 3, the two regions bounded by an orange and a white line comprise category 2, and the two regions outside the white lines comprise category 1.
Performance comparison using D̂_Pearχ². Figure 12 illustrates the performance of the event selectors and categorizers trained using our prescription, as captured by the estimated D̂_Pearχ² with S/B set to 1/9. For comparison, the best event selector that uses a flat threshold on p(e) is also shown, along with the theoretical upper limit on D_Pearχ² achievable by analyzing the distribution of events in the full phase-space (x, x_c). The x-axis of the plot shows the number of selected categories C. The red cross corresponds to the best flat-threshold selector, while the blue triangle corresponds to an event selector trained using our prescription. The three black dots correspond to event categorizers with different values of C (all with allow_rejection = True). The horizontal green dashed line in figure 12 shows the theoretical upper limit on D̂_Pearχ² for C → ∞, which is set by the data processing inequality. There are no appreciable error bars to report in figure 12. Note that, as expected, the event selector trained with ThickBrick performs better than a flat-threshold selector in terms of the statistical distance measure D̂_Pearχ². Also note the diminishing returns with increasing values of C, as anticipated in section 3.2.
Performance comparison using the median significance for signal discovery. In order to obtain a median significance value for discovering a signal, we rewrite D_KL in terms of B_sel and N_sel, the expected numbers of non-rejected events under H_b and H_b+s, respectively. This rewriting shows that D_KL is the value of the test statistic ln(P_b+s/P_b) for an "Asimov dataset" [42] produced under H_b+s, where P_b+s and P_b are the probability densities for
observing the dataset (with events characterized by the category they belong to and their x values) under H_b+s and H_b, respectively. Using the results of [42], for large N, the median significance of discovering a signal, given that H_b+s is true, is given by eq. (4.8). For the event selectors and categorizers shown in figure 12, the median significance of discovery (estimated using (4.8)) is shown in figure 13, for S/B = 1/99 and N = 50000. The green line in figure 13 corresponds to the median significance achievable by performing the likelihood-ratio test on the full event information e.
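The Asimov construction above can be sketched numerically. Here we assume the familiar counting-bin form of the result of [42], Z = sqrt(2 D_KL) with the per-bin Poisson D_KL; this is an assumption consistent with the Asimov formalism, not necessarily the exact eq. (4.8).

```python
import numpy as np

def asimov_significance(n, b):
    """Median discovery significance from the Asimov dataset, Z = sqrt(2 D_KL).

    Assumes independent Poisson bins with expected counts b under H_b and
    n = b + s under H_{b+s}, so that
        D_KL = sum_i [ n_i ln(n_i / b_i) - (n_i - b_i) ].
    The sqrt(2 D_KL) relation is the standard Asimov result of [42].
    """
    n, b = np.asarray(n, float), np.asarray(b, float)
    d_kl = np.sum(n * np.log(n / b) - (n - b))
    return np.sqrt(2.0 * d_kl)
```

For a single bin with b = 100 and s = 10 this gives a value slightly below the naive s/sqrt(b) = 1, as expected from the exact Poisson likelihood.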

Conclusions
In this paper, we have provided an optimal event selection and categorization prescription, called ThickBrick, to maximize the statistical significance of a potential signal excess in the distribution of a low-dimensional event variable. This statistical significance is captured by one of six different statistical distance measures given in eq. (2.1). Our prescription calls for first learning the probability p(e) that an event (sampled from the "sig+bg" hypothesis, H_b+s) is a signal event, possibly using machine learning methods. Let us refer to the probability thus learned as "the machine learning output". ThickBrick then finds the optimal selection thresholds (for event selectors) or boundaries between categories (for event categorizers) in the machine learning output in an iterative manner.²⁶ These thresholds and boundaries are dependent on the event variable. The thresholds can be thought of as chosen to extract the best sensitivity out of each x-"bin".
Note that since p(e) contains all the relevant useful information, in principle it can be used directly for statistical inference, for example by constructing the Neyman-Pearson test statistic. However, in practice this is complicated by the fact that p(e) is estimated from simulations, which may have known and unknown systematic uncertainties. In such cases, analyses based on event variables tend to be more robust because they allow for extensive validation of the simulations, e.g., using control regions. Such analyses still benefit from the information orthogonal to the event variable x by using that information for event selection and categorization. The technique introduced in this paper optimizes this aspect of the analysis and requires only minimal changes to current practices: event selectors and categorizers trained by our technique can be used as drop-in replacements for the currently used ones. Any errors and uncertainties in the estimation of p(e) will not affect the robustness of the analyses, only the optimality of event selection.
Event selectors (categorizers) trained using ThickBrick are expected to outperform typical ML-classifier-based selectors (categorizers) in terms of statistical sensitivity, as demonstrated using a toy example in section 4. In addition, our event selectors and categorizers will also be decorrelated from the event variable used in the analysis, which can help reduce the impact of systematic uncertainties on signal search analyses.

Epilogue: decorrelated classifiers and taggers
The distance measure D_Pearχ² from eq. (2.1b) is given by

A Statistical distances
In this section, we will derive the expressions for the different statistical distances given in eq. (2.1). Let S be an ordered²⁸ collection of collider events. A given instance of S will be treated as one of the possible outcomes of the experiment performed to produce it.
The events in S are independent and identically distributed. The total number of events in S is a Poisson-distributed random variable whose mean could possibly depend on the underlying theory. After categorizing the events in S into C categories (and possibly rejecting some), let S_c be the ordered²⁸ collection of events in category c. Since hypothesis testing is to be performed by analyzing the Υ-s, we will next write down their probability densities under the two hypotheses. Let P_b(·) ≡ P(·; H_b) and P_b+s(·) ≡ P(·; H_b+s) represent the probability density of · under the null and alternative hypotheses, respectively. We can write the probability densities of Υ_c as in eqs. (A.1a) and (A.1b); these probability densities are normalized as in eq. (A.2).²⁹

²⁸ Ordering of events can be derived from some form of event id. Ordering is demanded just so we don't have to worry about combinatorial factors when writing down the probability (density) of a particular instance of S, S_c, or Υ_c, since the events are distinguishable.

²⁹ Eq. (A.2) implicitly specifies the reference measure with respect to which the probability densities of Υ_c and {Υ} are defined.

JHEP03(2021)291
Since the different Υ-s are independent of each other, their joint probability density is given by the product in eqs. (A.3a) and (A.3b). From eqs. (A.1a), (A.1b), (A.3a), and (A.3b), the ratio of the probability densities can be written as in eq. (A.4). This expression will be used in the derivation of some of our statistical distances.

A.1 D_Neymχ² and D_Pearχ²: asymptotic expected values of the Neyman-χ² and Pearson-χ² statistics
For a binned analysis of the distribution of x within each category, the Neyman-χ² and Pearson-χ² test statistics are given in eq. (A.5). In the asymptotic and fine-binning limit we can show, for both variants of the χ² statistic, the statements (A.6a)-(A.6c). We will omit an explicit proof of statements (A.6a)-(A.6c) here, and instead only provide an outline. We begin with the following observations: if X is a Poisson-distributed random variable with mean µ, then the expected value of X² is µ² + µ. Similarly, the expected value of 1/X for positive X tends to 1/µ + 1/µ² in the large-µ limit [62], and the probability of X being 0 vanishes in the large-µ limit. Using these ideas, we can show that asymptotically (up to the fastest growing term), the left-hand sides of (A.6a)-(A.6c) tend to sums, over the categories c and the x-bins i, of the Neyman-χ² and Pearson-χ² per-bin forms. These are approximations to the expressions for D_Neymχ² and D_Pearχ² in eqs. (2.1a) and (2.1b), and the approximations can be made arbitrarily precise in the fine-binning limit.
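The two Poisson facts invoked in this outline are easy to check numerically. A quick Monte-Carlo sanity check (the sample size and value of µ are arbitrary choices for illustration):

```python
import numpy as np

# Numerical check of the Poisson facts used in the proof outline:
# E[X^2] = mu^2 + mu exactly, and E[1/X] -> 1/mu + 1/mu^2 for large mu
# (restricting to X > 0, whose complement has vanishing probability).
rng = np.random.default_rng(3)
mu = 200.0
x = rng.poisson(mu, size=1_000_000).astype(float)

second_moment = np.mean(x ** 2)           # should be close to mu^2 + mu
inv_moment = np.mean(1.0 / x[x > 0])      # should be close to 1/mu + 1/mu^2
```

At µ = 200 the probability of X = 0 is e^(-200), so conditioning on X > 0 is numerically irrelevant here.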

A.2 D_KL: Kullback-Leibler divergence from null to alternative
The Kullback-Leibler divergence from the null hypothesis to the alternative hypothesis (for the Υ-s) is given by eq. (A.9), where d{Υ} is a shorthand for the product of the integration measures over the individual Υ-s. Using the fact that the different Υ-s are independent of each other, and the fact that the x_c,i for different i values are independent of each other, we can simplify this; the result, eq. (A.14), matches the expression for D_KL given in eq. (2.1c).

A.3 D_revKL: Kullback-Leibler divergence from alternative to null
The Kullback-Leibler divergence from the alternative hypothesis to the null hypothesis (for the Υ-s) is defined analogously. Using manipulations similar to the ones we performed for D_KL, we can show that the result, eq. (A.16), matches the expression for D_revKL given in eq. (2.1d). This expression can also be derived from eq. (A.14) by making the substitutions n_c ←→ b_c and s_c → −s_c.

A.4 D_J: Jeffreys divergence
Jeffreys divergence is a symmetrized version of the Kullback-Leibler divergence, given by the sum of the two asymmetric KL divergences. From eqs. (A.14) and (A.16), we can see that the sum matches the expression for D_J given in eq. (2.1e).

A.5 D_B: Bhattacharyya distance
The Bhattacharyya distance between the null and alternative hypotheses (for the Υ-s) is given by the negative logarithm of the Bhattacharyya coefficient ∫ d{Υ} √(P_b P_b+s). We can rewrite this as eq. (A.22).

JHEP03(2021)291
Using the power series representation of the exponential function, we can rewrite eq. (A.22) in a form that matches the expression for D_B given in eq. (2.1f) (after taking the natural log and multiplying by −1).
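For a single Poisson category, the manipulation outlined above goes through explicitly. A sketch, using the series e^z = Σ z^k/k! and writing b and n = b + s for the expected counts under the two hypotheses:

```latex
% Bhattacharyya coefficient for a single Poisson count:
\sum_{k=0}^{\infty} \sqrt{\frac{e^{-b} b^{k}}{k!}\,\frac{e^{-n} n^{k}}{k!}}
  = e^{-(b+n)/2} \sum_{k=0}^{\infty} \frac{\big(\sqrt{bn}\big)^{k}}{k!}
  = e^{-(b+n)/2}\, e^{\sqrt{bn}}
  = \exp\!\Big[-\tfrac{1}{2}\big(\sqrt{n}-\sqrt{b}\big)^{2}\Big],
% so each Poisson count contributes (sqrt(n) - sqrt(b))^2 / 2 to the
% Bhattacharyya distance after taking -ln of the coefficient.
```

This single-count result illustrates how the power series collapses the sum into a closed form.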

B An interpretation of Kullback-Leibler divergence
Let us consider a generic statistical test that decides between H_b and H_b+s, based on the data (post-categorization, i.e., after reduction of events to their c and x values). The power of the test, 1 − β, is the probability of the test choosing H_b+s, given that H_b+s is true. It captures the ability of the test to detect the presence of a signal.
The size (or significance level) of the test, α, is the probability of the test choosing H_b+s, given that H_b is true. It captures the false alarm rate, and is typically set to ∼ 3 × 10⁻⁷ (5σ) in high energy physics.
A statistical test that has the greatest power 1 − β* among all tests of a given size α* is called a uniformly most powerful (UMP) test (with α* being a tunable parameter of the test). The likelihood-ratio test is a UMP test for testing point hypotheses (like our H_b and H_b+s). It can be shown for point hypotheses that:

1. For a UMP test of a given (fixed) power 0 ≤ 1 − β* < 1, the size satisfies statement (B.1).

2. For a UMP test of a given (fixed) size 0 < α* ≤ 1, the power satisfies statement (B.2).

Here "asymptotically" refers to the large-number-of-events limit, and o(D_KL) and o(D_revKL) are functions asymptotically dominated by D_KL and D_revKL, respectively. The formal definition of the little-o notation is: f = o(g) if and only if f/g → 0 in the relevant (here, large-luminosity) limit. Note that D_KL and D_revKL both scale linearly with the integrated luminosity. This means that any non-zero difference between D_KL and D_revKL will not be asymptotically dominated by either of them.

JHEP03(2021)291
When the goal of event categorization is to maximize the chance of signal discovery (assuming its existence), we want to maximize the power for a given size (say, p-value corresponding to 5σ). This can be done by maximizing D revKL , according to statement (B.2). When there is no a priori expectation for the size of the signal cross-section, it makes sense to work with a signal cross-section near the low end of discoverability.
On the other hand, when the goal of event categorization is to maximize the chance of ruling out the existence of a signal (assuming its absence), we want to minimize the size for a given power (say, 95%: this is the confidence that the signal does not exist; otherwise it would have been seen). This can be done by maximizing D_KL, according to statement (B.1). When there is no a priori expectation for the size of the signal cross-section and the goal is to set the best upper limit on it, it makes sense to work with a signal cross-section value near the expected upper limit.
This discussion is purely theoretical, and it is likely that the difference between training with D̂_revKL and with D̂_KL will not be practically significant in most situations. Nevertheless, we find this (slight) misalignment between the goals of event categorization for signal discovery and for upper-limit setting a curiosity worth noting.
Proof of statement (B.1): for any hypothesis test performed using the categorized events described by {Υ}, the power of the test, 1 − β, and the size of the test, α, are given by the integrals of P_b+s and P_b, respectively, over the critical region of the test. From these, we can write α = E_b+s[(P_b/P_b+s) 𝟙_crit], where E_b+s represents the expectation under H_b+s and 𝟙_crit is the indicator of the critical region. This result holds for any hypothesis test. Now, let us turn to a UMP test performed using the test statistic T ≡ ln(P_b({Υ})/P_b+s({Υ})).
The critical region of the test contains all the outcomes {Υ} for which T({Υ}) is below a certain threshold. We are interested in the asymptotic distribution of the test statistic T under H_b+s. Using (A.4), we can expand T as follows.

JHEP03(2021)291
This shows that T can be written as the sum of ln(b_c/n_c) over all the selected events, plus a constant term (independent of the observed {Υ}, but dependent on the integrated luminosity). In the asymptotic limit (large expected number of events), this sum converges in distribution to a normal distribution, as per the central limit theorem.³⁰ Let µ and σ² be the mean and variance of T. Two key points here are:

• Asymptotically, both µ and σ² scale linearly with the integrated luminosity of the experiment.
• Under the alternative hypothesis H_b+s, the expected value of T is −D_KL, as can be seen from (A.9). In other words, µ = −D_KL.
Let T* be the threshold value of T that leads to a given power 1 − β* ∈ [0, 1). The critical region of the test is T({Υ}) ≤ T*. Asymptotically, we have eq. (B.11), namely 1 − β* ≈ Φ((T* − µ)/σ), where Φ is the cumulative distribution function of the standard normal distribution.
In light of (B.7), let us rephrase the UMP test in terms of the ratio P_b/P_b+s. For convenience, let W ≡ P_b({Υ})/P_b+s({Υ}), so that T = ln W. The critical region of the test is then W ≤ W* ≡ e^(T*). Furthermore, it follows from the asymptotic distribution of T that the asymptotic distribution of W under H_b+s is Lognormal(µ = −D_KL, σ²). Plugging these into (B.7), we can write the size of the UMP test α*, in the asymptotic limit, as the conditional expectation of W below the threshold, for W distributed as per Lognormal(µ = −D_KL, σ²). The conditional expectation of W under the lognormal distribution can be written in closed form.³¹

³⁰ Since the number of events in the sample (and hence the number of terms in the summation) is random, here we invoke a special form of the central limit theorem for the sum of a random number of independent random variables [60, 61].

³¹ This can be proved by making the substitution t = (ln(W) − (µ + σ²))/σ in the integral expression for the conditional expectation.
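The closed form alluded to in footnote 31 is the standard lognormal partial expectation. A sketch of the computation, using the substitution from that footnote:

```latex
% For W ~ Lognormal(mu, sigma^2) and threshold w^*:
\mathbb{E}\big[W\,\mathbf{1}\{W \le w^{*}\}\big]
  = \int_{0}^{w^{*}} w\,\frac{1}{w\,\sigma\sqrt{2\pi}}\,
      e^{-\frac{(\ln w-\mu)^{2}}{2\sigma^{2}}}\, dw
  = e^{\mu+\sigma^{2}/2}\;
    \Phi\!\left(\frac{\ln w^{*}-\mu-\sigma^{2}}{\sigma}\right),
% obtained via t = (ln w - (mu + sigma^2)) / sigma, which completes
% the square in the exponent.
```

Setting µ = −D_KL then makes the leading exponential behavior of the size explicit.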

JHEP03(2021)291
Φ((T* − µ)/σ) asymptotes to (1 − β*), as per (B.11). On the other hand, the remaining factor can be written in terms of the error function erf, and the quantity erf⁻¹(1 − 2β*) appearing in it is finite (and fixed). Recall that σ² scales linearly with the integrated luminosity in the asymptotic limit. So, we can expand this expression up to leading order in 1/σ for large σ, which gives (B.19). Plugging (B.11), (B.15), and (B.19) into (B.14), and using µ = −D_KL, we obtain an asymptotic expression for ln α*. Noting that (a) erf⁻¹(1 − 2β*) is finite and fixed, (b) D_KL scales linearly with the integrated luminosity, and (c) σ scales as the square root of the integrated luminosity, we can absorb the subleading terms into o(D_KL), which completes the proof of (B.1). Statement (B.2) can be proved in a similar fashion. Alternatively, it can also be derived from (B.1) by making the substitutions α* → β* (Type I error → Type II error) and D_KL → D_revKL.

C Proof of correctness of the iterative training algorithm
In this section, we will prove the correctness as well as the convergence of our event categorizer training prescription. The key ingredients of the proof are two properties of the eventwise objectives: (a) a linearity property and (b) a global-maximum property. The linearity property (a) holds by construction of the Λ_stat-s from the D_stat-s. The global-maximum property (b) follows from the nature of the D_stat-s, which penalize mixing of regions of phase-space with different p(e) values.

JHEP03(2021)291
In each iterative step, from the current categorizer η, a new categorizer η̃ is constructed. Let p_c(x) and p̃_c(x) be the signal-fraction functions of η and η̃, respectively. Recall that η̃ rejects an event (if allowed) if Λ_stat(p_c(x), p(e)) < 0 for all categories c. Otherwise, η̃ places each event in the category that maximizes Λ_stat(p_c(x), p(e)). From the construction of η̃, the inequality (C.1) is clear. By treating "averaging over all events" as averaging over events with a given value of (c, x), followed by an (appropriately weighted) average over (c, x) values, we arrive at eq. (C.6), which shows that the categorizer "improves" in each iterative step. The only caveat here is the assumption that p_c(x) can be estimated accurately enough at each step.
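One iterative step can be sketched as follows. The quadratic Λ(q, p) = 2pq − q² used here is a hypothetical stand-in for Λ_stat, chosen only because its between-category threshold is the midpoint of the two p_c values, which makes the k-means analogy explicit; the binned re-estimation of p_c(x) is likewise an illustrative choice.

```python
import numpy as np

def lambda_stat(q, p):
    # Illustrative eventwise objective (an assumption, not the paper's exact
    # Lambda_stat): maximized over q at q = p, so each event gravitates toward
    # the category whose signal fraction is nearest -- the k-means-like step.
    return 2.0 * p * q - q ** 2

def training_step(x, p, pc_funcs, allow_rejection=True, n_bins=20):
    """One iterative step: reassign events, then re-fit p_c(x) per category.

    pc_funcs[c] is a callable returning the current p_c(x).
    Returns (assignments, new binned p_c estimates); category -1 = rejected.
    """
    scores = np.stack([lambda_stat(f(x), p) for f in pc_funcs])  # (C, N)
    assign = np.argmax(scores, axis=0)
    if allow_rejection:
        # Reject events for which every category's objective is negative
        assign[np.max(scores, axis=0) < 0.0] = -1
    # Re-estimate p_c(x) as the binned mean of p within each category
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    new_pc = np.full((len(pc_funcs), n_bins), 0.5)
    for c in range(len(pc_funcs)):
        for b in range(n_bins):
            sel = (assign == c) & (idx == b)
            if sel.any():
                new_pc[c, b] = p[sel].mean()
    return assign, new_pc
```

With this quadratic Λ, two constant categories at q = 0.3 and q = 0.7 split events exactly at p = 0.5, and events with p below q/2 for every category get rejected.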
To see that the algorithm will converge in a finite number of steps, note that during the training process, a categorizer in the current step is defined completely by the assignment of training events to categories in the previous step. If the categorizer has to improve in each step, it has to be different in each step (no cycles), and there is only a finite number of possible category assignments of the training events. So, in a finite number of iterations, the algorithm will converge on a categorizer that maximizes (the estimate of) D̂_stat locally.

JHEP03(2021)291
A caveat to this argument is that, due to artifacts in the unbinned estimation of the signal-fraction functions p_c(x), it is possible to run into cycles near the end of the training process. So it may make sense to terminate the iteration process when the fraction of events reassigned in a step drops below a certain threshold, or when the (absolute or relative) gain in the estimated D̂_stat drops below a certain threshold. For a binned analysis, the convergence argument presented above holds without caveats, and it is guaranteed that η̃ will be the same as η in some training step, although it may take an unfeasibly large number of iterations to reach this stage.
The time complexity of the training process is O(CNi + f(C, N)i), where N is the number of training events, C is the number of categories, i is the number of iterative steps, and f(C, N) is the cost of re-estimating the signal-fraction functions in each step. While the number of iterations i needed for convergence in the worst case is superpolynomial [63] (not bounded by any polynomial), in practice, for reasonable values of C and d, the value of D̂_stat flattens out after a few tens of iterations, just like with Lloyd's k-means clustering algorithm. So, with appropriate termination criteria, the iterative training process can be implemented to have, in practice, a time complexity of O(CN) when using binned, and O(CN log N) when using unbinned, estimation of p_c (ignoring factors of i and 2^d).

D A comparison of the different distance measures
In a given iterative step of training, each event e will be categorized into one of the categories based on the p_c(x(e)) values from the previous step: event e will be placed into one of the two categories whose previous p_c(x(e)) values are closest to p(e) on either side (if event rejection is allowed, rejected events can be thought of as forming a category of their own, with p_c(x) ≡ 0). The only difference between using different distance measures is in the threshold value of p(e), for a pair of categories, above which events will be placed in the "higher purity" category.
Let c₁ and c₂ be two adjacent categories when ordered by p_c(x) for a given x, with p_c₁(x) < p_c₂(x). At the threshold value of p(e), say p_stat^thresh, between p_c₁(x) and p_c₂(x), we have Λ_stat(p_c₁(x), p_stat^thresh) = Λ_stat(p_c₂(x), p_stat^thresh). For values of p(e) greater (lesser) than p_stat^thresh, Λ_stat(p_c₂(x), p(e)) is greater (lesser) than Λ_stat(p_c₁(x), p(e)).
Formally, let us define the threshold value p_stat^thresh(p₁, p₂) of p(e), at a given value of x, between two categories with p_c₁(x) = p₁ and p_c₂(x) = p₂, as satisfying

min(p₁, p₂) ≤ p_stat^thresh(p₁, p₂) ≤ max(p₁, p₂) ,    (D.1a)

Λ_stat(p₁, p_stat^thresh(p₁, p₂)) = Λ_stat(p₂, p_stat^thresh(p₁, p₂)) .    (D.1b)
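Given any concrete Λ_stat, the implicit definition (D.1) can be solved numerically. A minimal bisection sketch, illustrated with a hypothetical quadratic Λ (not the paper's actual Λ_stat) whose threshold is the midpoint (p₁ + p₂)/2:

```python
def p_threshold(lmbda, p1, p2, tol=1e-10):
    """Solve Lambda(p1, t) = Lambda(p2, t) for t in [min(p1,p2), max(p1,p2)]
    by bisection, as in the implicit definition (D.1).

    `lmbda(q, p)` is any eventwise objective; we assume the difference
    Lambda(p2, t) - Lambda(p1, t) changes sign exactly once on the interval.
    """
    lo, hi = min(p1, p2), max(p1, p2)
    g = lambda t: lmbda(p2, t) - lmbda(p1, t)
    a, b = lo, hi
    while b - a > tol:
        m = 0.5 * (a + b)
        # Keep the half-interval on which g changes sign
        if g(a) * g(m) <= 0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)
```

For Λ(q, p) = 2pq − q², the difference g(t) = (p₂ − p₁)(2t − p₁ − p₂) vanishes exactly at the midpoint, so the solver should recover (p₁ + p₂)/2.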

JHEP03(2021)291
It can be proved that, ∀ (p₁, p₂) ∈ [0, 1) × [0, 1), the threshold values p_stat^thresh(p₁, p₂) for the different statistical distances obey the ordering in eq. (D.2), with equality iff p₁ = p₂. This property is written in eq. (3.3) as the "strictness" order of the statistical distances, with D̂_Neymχ² and D̂_Pearχ² being the least and most strict, respectively. In figure 14, we plot p_stat^thresh(p₁, p₂) for the different statistical distances (middle column), arranged in decreasing order of strictness (top to bottom). Figure 14 also shows the ratio between p_stat^thresh(p₁, p₂) for consecutive distances, in decreasing order of strictness (right column). The ratios being greater than or equal to one demonstrates the correctness of the ordering in eq. (D.2).
Finally, we want to point out that there is nothing sacred about the six distance measures used in this paper. Any measure of distance that is jointly convex in the distributions under H_b and H_b+s, and proportional to the Fisher-Rao metric distance between them in the H_b+s → H_b limit, can be used for training. The six distances provided in this paper are just a few named distances known in the statistics literature for which we were able to find usable closed-form expressions in terms of the number density functions s_c(x) and b_c(x). For example, any linear combination of the D̂_stat-s from eq. (2.19) can also be used to train categorizers, along with the same linear combination of the eventwise contributions Λ_stat from eq. (2.21), although this much choice is probably overkill for most applications.
On a related note, we would like to point out the following approximation, useful for estimating the statistical significance of deviations from expectation: the linear combination of the Neyman-χ² and Pearson-χ² test statistics, (χ²_Neym + 2 χ²_Pear)/3, dubbed the "combined Neyman-Pearson χ²" in [64], approximates the likelihood-ratio-based statistic better than either of them does individually. In terms of our statistical distances between H_b and H_b+s, this approximation can be written as D_KL ≈ (D_Neymχ² + 2 D_Pearχ²)/3.
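The combined Neyman-Pearson approximation is easy to verify numerically for independent Poisson bins. The per-bin normalizations below (each carrying a factor 1/2) are assumptions chosen so that all three distances agree at leading order in (n − b); with them, the combined form matches D_KL through third order in (n − b)/b.

```python
import numpy as np

# Numerical check of D_KL ~ (D_Neymchi2 + 2 D_Pearchi2) / 3 for independent
# Poisson bins with expected counts b (null) and n = b + s (alternative).
# The per-bin normalizations are assumptions for illustration.
def distances(n, b):
    n, b = np.asarray(n, float), np.asarray(b, float)
    d_kl = np.sum(n * np.log(n / b) - (n - b))
    d_neym = np.sum((n - b) ** 2 / (2.0 * n))
    d_pear = np.sum((n - b) ** 2 / (2.0 * b))
    return d_kl, d_neym, d_pear

b = np.array([100.0, 400.0, 50.0])
n = b + np.array([15.0, 30.0, 10.0])
d_kl, d_neym, d_pear = distances(n, b)
combined = (d_neym + 2.0 * d_pear) / 3.0
```

Expanding each distance in ε = n − b shows why: D_KL = ε²/(2b) − ε³/(6b²) + O(ε⁴) per bin, while the 1:2 mixture of the Neyman and Pearson forms reproduces both leading terms.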

Open Access. This article is distributed under the terms of the Creative Commons
Attribution License (CC-BY 4.0), which permits any use, distribution and reproduction in any medium, provided the original author(s) and source are credited.