Finding Wombling Boundaries in LHC Data with Voronoi and Delaunay Tessellations

We address the problem of finding a wombling boundary in point data generated by a general Poisson point process, a specific example of which is an LHC event sample distributed in the phase space of a final state signature, with the wombling boundary created by some new physics. We discuss the use of Voronoi and Delaunay tessellations of the point data for estimating the local gradients and investigate methods for sharpening the boundaries by reducing the statistical noise. The outcome from traditional wombling algorithms is a set of boundary cell candidates with relatively large gradients, whose spatial properties must then be scrutinized in order to construct the boundary and evaluate its significance. Here we propose an alternative approach where we simultaneously form and evaluate the significance of all possible boundaries in terms of the total gradient flux. We illustrate our method with several toy examples of both straight and curved boundaries with varying amounts of signal present in the data.


Introduction
Collider data in high energy physics can be viewed, at least at the parton level, as a collection of points { x 1 , x 2 , . . . , x N } in the relevant phase space P of the final state signature. The ultimate goal of a high-energy physics experiment is then to test whether the distribution of those N points (commonly referred to as "events") in P follows the probability distribution predicted in some theory model by the fully differential cross-section where x ∈ P is a particular phase space point and {α} is a set of input model parameters such as particle masses, widths, couplings, etc. More specifically, in searches for new physics (NP) beyond the standard model (SM), eq. (1.1) can be split into respective SM and NP contributions where {α SM } and {α N P } label respectively the set of SM parameters and the set of additional NP parameters. Let the corresponding regions of phase space populated by SM and NP events be P SM and P N P , then the respective total cross-sections are given by Since the standard model is well known, its distribution f SM , a.k.a. "the background", is calculable theoretically (up to some fixed order in perturbation theory). However, once we account for the experimental realities, the result may suffer from non-negligible systematic uncertainties, particularly in the case of challenging signatures involving QCD and/or reducible backgrounds. This is the main roadblock in NP searches via counting experiments, where one focuses on a suitably chosen "signal region" P SR ⊂ P N P and looks for an excess over the SM expectation P SR f SM ( x, {α SM }) d x. 1 Instead, in this paper we shall consider methods which could allow us to infer, at least in principle, the existence of the NP contribution f N P without any prior knowledge of the SM prediction f SM . Recently there has been hightened interest in such "blind" or "background-independent" searches for NP, particularly using machine learning techniques [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Here, instead of looking for an excess in the signal region P SR , we shall follow up on the idea of Refs. [16,17] to target directly the boundary ∂P N P of the NP phase space region P N P , by using the fact that the combined distribution (1.2) is non-differentiable anywhere on ∂P N P where f N P is non-vanishing. As it turns out, the latter is a very safe assumption -if anything, f N P is not only non-vanishing, but often enhanced and even singular on the boundary ∂P N P [18][19][20][21][22][23][24][25][26][27]. Since the background distribution f SM is a smooth function across ∂P N P , the presence of f N P creates a discontinuous "jump" in the combined event density (1.2), precisely at the location of the boundary ∂P N P [16,17]. We can thus reformulate the original problem of finding evidence of NP in the collider data as follows: Given a collection of N points { x 1 , x 2 , . . . , x N } in the phase space P, identify (the locations of) any candidate wombling boundaries and estimate their statistical significance.
The detection of such difference boundaries (or wombling boundaries, named after a pioneer in the field, W. H. Womble [28]) is a well-known problem in spatial statistics, see, e.g., [29]. Broadly speaking, wombling is any of a number of techniques used for identifying zones of rapid change, typically in some quantity as it varies across some geographical or Euclidean space. Wombling techniques are being applied in a wide variety of disciplines, including computational ecology, anthropology, linguistics, geography and many others. 2 In our case here, we shall be interested in identifying the phase space region in the vicinity of ∂P N P where the density of points is changing significantly.
Before going through the typical steps of a wombling analysis, several comments are in order. First, the discontinuous jump of the combined distribution (1.2) across ∂P N P in practice will be smoothed out to some extent by the detector resolution and finite width effects, leading to a well-defined, finite, density gradient everywhere in P. This precludes us from using methods specifically designed to detect image discontinuities such as ridges and cliffs but do not admit gradients [40]. Second, a good wombling method should also be able to pick up any boundaries created within P SM by interesting SM subprocesses, e.g., top, Higgs or heavy gauge boson production. This would guarantee an opportunity for the LHC experiments to start testing and validating the method with existing real data, before any NP discovery. Third, while most applications of wombling in the literature have been limited to two-dimensional data, the method shoud be readily generalizable to higher dimensions, if it is to be of any interest to the high-energy physics community where the dimensionality of the relevant phase space P is typically much higher (although in some special cases it can be reduced to 2 or even 1 through suitable projections preserving the boundary). Along those lines, it is also important to choose a good parametrization of P, so that the dimensionality can be reduced by projecting out uninteresting degrees of freedom without washing out the wombling boundary.
The main steps of a typical wombling analysis are the following [41].
• Data preparation and preprocessing. The starting point in wombling is a spatially referenced dataset ( x i , f i ), i = 1, 2, . . . , N, (1.4) where a set of values {f i } for the function of interest f ( x) are obtained at some finite number N of point locations { x i }. In some applications, e.g., for aerial and remotely sensed images, the locations { x i } can be chosen by the experimenter -then it may be convenient to arrange them in some kind of a regular lattice, as required by some wombling algorithms, including the original proposal in [28]. Alternatively, when the data are gathered by an irregular or random design, eq. (1.4) is known as pointreferenced or geostatistical data. While in many other fields of science the functional values {f i } for geostatistical data are obtained directly from field observations, for the Monte Carlo simulations used in high-energy physics the situation is more subtle -each f i is supposed to be a measure of the local point density at x i and needs to be evaluated as a preprocessing step. For this purpose, Refs. [16,17] proposed to consider the Voronoi tessellation in P of the dataset (1.4), since the geometric volume v i of the Voronoi cell containing x i provides a natural local estimator of the point density at x i [42]: While Voronoi tessellations have been widely used in many other fields of science, they seem to be underutilized in high energy physics where their application has been limited to jet clustering [43] and the partitioning of the signature phase space into search regions, as implemented in the SLEUTH algorithm [44][45][46] which was used to perform model-independent new physics searches at D0 [47][48][49], HERA [50] and CDF [51][52][53]. Yet, the Voronoi approach 3 is ideally suited for finding interesting (e.g., singular) features in f ( x), since it preserves the maximum spatial resolution in the data [54]. For example, the standard approach of binning the data in order to obtain a local density estimate necessarily throws away a certain amount of useful information, and is associated with some arbitrariness in the exact choice of binning [16].
• Gradient estimation. Once we have the point-referenced dataset (1.4), the next step is usually to estimate the magnitude of the local gradient ∇f of the function f ( x), since any zone of rapid change is necessarily associated with large values for | ∇f |. In calculating the gradient, one has to overcome the fact that the data is a) discrete and b) irregularly sampled. One possible approach is to obtain a continuous approximation for f ( x) via some spatial interpolation method [42,55]. Unfortunately, interpolation techniques tend to smooth out not only the noise but also small local discontinuities, which can result in masking some true boundaries [34]. For this reason, Refs. [16,17] explored several boundary detection techniques (further developed and illustrated in [22,23,56]) which continued to use the Voronoi tessellation of the data and the fundamental relation (1.5). Among the different options studied in [16,56], the normalized standard deviation (sometimes also called the coefficient of variance) of the volumes of the neighboring cells emerged as a viable measure of the magnitude of the local gradi-ent within a given Voronoi cell. In this paper, we shall pursue a somewhat orthogonal and more traditional approach, known in the literature as "triangulation wombling" [34,39], which makes use of the dual Delaunay tessellation of the data (1.4). Even though the two types of tessellations are dual to each other 4 , the Delaunay version seems more natural for the specific problem at hand of calculating gradients -this will be further discussed and illustrated in Section 2 below.
• Tagging selection. Having constructed the Delaunay or Voronoi tessellation and obtained estimates for the local gradients, the next task is to identify elements of the tessellation (edges or vertices) which are likely to be located on or near a wombling boundary. This is typically accomplished with a cut on the ranked values of | ∇f | for all elements in the tessellation, e.g., selecting elements whose calculated gradients are in a certain upper percentile 5 . However, using such a simple threshold for tagging boundary elements has been viewed as somewhat arbitrary and rather subjective [41] -in the absence of any robust guidelines, typical values used in the literature range in the 5 th to 10 th percentile [30,31,36,39]. Furthermore, for any given value of the threshold, there will always be a certain number of elements passing the cut, including elements "in the bulk", i.e., away from any wombling boundaries, and one has to design a prescription on how to deal with such false positives. Obviously, a value for the cut which is too stringent will miss many true boundary elements, while a value which is too generous will bring about a lot of false positives. Finally, given that the dimensionalities of P N P and ∂P N P necessarily differ by one, the concept of a "boundary element" can be open to interpretation -how close to the boundary does an element have to be in order to be considered a "boundary element"?
• Agglomeration. The previous step results in a collection of tagged boundary elements scattered throughout P, so now the question is how to use that information to reconstruct the complete boundary. As a first step, one can start forming subboundaries by linking adjacent tagged boundary elements, possibly subject to some additional criteria, e.g., that the directions 6 of their gradients are within 30 • of each other [31]. This agglomeration procedure will result in a graph whose nodes are the tagged boundary elements [57]. The properties of this graph can then be studied to determine its statistical significance [32,34] (see the next item) and to get some idea about the shape of the boundary. 4 In graph theory, the dual graph of a plane graph G is a graph that has a vertex for each face of G.
Correspondingly, each edge e of G has a corresponding dual edge, whose endpoints are the dual vertices corresponding to the faces on either side of e. Some care must be exercised for the Delaunay edges along the convex hull of the point set -their Voronoi duals are infinite rays which can be turned into finite line segments by adding an artificial point at infinity which serves as a common endpoint for all the rays. In our analysis, this complication will not arise since we will perform our analysis in the interior of P. 5 In NP scenarios where the signal density is additionally enhanced on the boundary ∂PNP , Ref. [22] proposed a two-dimensional cut, simultaneously targeting both large gradients and large values of the function. 6 The two requirements -that the gradients are large and that their directions are correlated -can be conveniently encoded in the scalar (dot) product of the gradient vectors of neighboring elements [16].
At this point it is worth mentioning that in collider physics applications there can be situations where the shape of the boundary is parametrically known. This is precisely the case with "simplified model" NP searches at the LHC [58] -once the event topology is assumed, the geometry of the NP final state phase space P N P is also fixed. As an example, consider a sequence of three two-body decays, which is the classic squark signature in supersymmetry (SUSY) [59]. The relevant phase space P N P is three-dimensional and can be parametrized by the invariant masses of the three pairs of visible decay products. The equation for the boundary ∂P N P is known analytically [21,60,61] in terms of just four parameters -the masses of the SUSY particles participating in the decay chain. In that situation, Ref. [23] proposed to bypass both of the last two steps (the tagging and the agglomeration) altogether and instead fit the equation for the surface ∂P N P to the full tessellation. Operationally this was done by computing a quantity inspired by Bayesian wombling (see next bullet), namely, a two-dimensional surface integral of the gradient magnitude over the boundary surface, normalized to the total area of the surface: In Ref. [23], it was demonstrated that this quantity is maximized for the true values of the SUSY masses, resulting in a novel method for SUSY mass measurements.
• Significance estimation. Any wombling algorithm as described so far will produce numerical results regardless of whether a true pattern exists or not. The crucial question now is to assess the likelihood that the observed pattern could have been produced from random fluctuations instead of a true boundary. One possible approach, known as sub-boundary statistics [32,34], is to analyze the properties of the graph mentioned in the previous step, formed out of the tagged boundary elements. Strictly speaking, sub-boundary statistics tests whether the different components of the graph are sufficiently contiguous (and not whether the rates of change are sufficiently large) [29,41]. To this end, one looks at (distributions of) quantities which would characterize coherent boundaries formed from connected boundary elements, such as: the total number of subgraphs, the number of single-node subgraphs, the maximum and the mean of the length and/or the diameter of the subgraphs, the superfluity, etc. An alternative approach, named Bayesian wombling, starts with an ansatz for the shape of the wombling line (in two spatial dimensions) and then computes the average flux of the two-dimensional gradient field through all possible such lines [40,62,63]. The idea is that the average flux will be maximized when the ansatz matches the true wombling boundary. As already mentioned in relation to eq. (1.6), the advantage of this approach is that it avoids the subjectiveness associated with the steps of tagging and agglomeration. With either approach, one has to specify a null hypothesis, in order to quantify the confidence level. Unlike other fields of science, where the null hypothesis may not be immediately obvious and one typically has to rely on a ran-domization scheme [29], in high-energy physics the null hypothesis is well definedit is the SM.
In this paper we further develop and refine the Voronoi boundary detection methods from Refs. [16,22,23,56]. As before, the main goal will be to outline a method for discovering new physics in LHC collider data by identifying wombling boundaries in phase space. The paper is structured around the five typical steps of algorithmic wombling described above. The novel elements in the analysis presented here are the following: • In addition to the Voronoi tessellations utilized in [16,22,23,56], here we also consider Delaunay tessellations. In Section 2 we shall briefly review the two types of tessellations and outline the range of new possibilities offered by the use of a Delaunay tessellation for our purposes. In particular, in Section 2.3 we shall illustrate the four possible types of boundary elements, discuss their relations to each other, and how each can be potentially targeted in the tagging step of the wombling algorithm.
• Unlike previous work in high-energy physics, here our main tool for calculating the local gradients will be the Delaunay tessellation (often referred to as "triangulation" since in two dimensions the Delaunay polygons are triangles). Section 3 is devoted to the topic of gradient estimation -after a brief review in Section 3.1 of previous work on estimating gradients from a Voronoi tessellation, in Section 3.2 we shall describe the gradient calculation from the Delaunay triangulation. In the process, we shall pay special attention to techniques for reducing the random fluctuations in the obtained gradient values -three such procedures and their interplay and optimization are discussed in Section 3.3.
• Different techniques for tagging boundary elements will be discussed in Section 4. In addition to tagging Voronoi cells [16,22], here we shall also be interested in tagging Voronoi edges, Delaunay cells and Delaunay edges as well. Although this is not our main focus here, in Section 5 we shall illustrate how these tagging methods can be used for agglomeration.
• The main results of the paper are presented in Sections 6-8. We follow the approach of Bayesian wombling [40,62,63], which lets us avoid the intricacies and uncertainties of the tagging and agglomeration steps. To gain some intuition, in Section 6 we first go over a toy example where the function f ( x) can be sampled continuously from a distribution which resembles real data including finite width effects and detector resolution. Then in Section 7 we study point-referenced data of the type (1.4) with a straight (Section 7.1) or circular (Section 7.2) boundary. The corresponding estimates of the statistical significance of the obtained wombling boundaries are performed in Section 8.

Simulation Details
Virtually all wombling studies in the literature have been concerned with two-dimensional point data generated, e.g., from field samples taken within a certain geographical area, from remotely sensed images, etc. In this paper, we shall continue to work in two-dimensions, but this will be done only for clarity of the presentation, since it is difficult to visualize Voronoi and Delaunay tessellations in more than two dimensions; the methods which we shall describe will be applicable to higher dimensional data as well. To be specific, we shall consider the Cartesian plane where the data points are specified by their coordinates (x i , y i ), so that the dataset (1.4) reduces to For concreteness, we shall choose our field of view to be the unit square, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, although data will be generated beyond the boundaries of the unit square -this will eliminate any spurious boundary effects like clipping which would modify the statistical properties of the Voronoi cells near the boundaries [64]. Following Refs. [16,17,22], the datasets will be generated according to (1.2) with the following assumptions for f SM and f N P : • Background. As in previous work [16,17,22], our proxy for the SM background distribution f SM will be the uniform distribution The exact value of the constant will depend on the normalization: for pure-background samples within the unit square the constant is 1, while for background plus signal samples, it will depend on the relative strength of the signal. Strictly speaking, the assumption (2.2) is unrealistic from the point of view of a high energy physicist, since f SM is in general a function of the kinematic variables parametrizing the phase space P. Nevertheless, it is good enough for our purposes here -the important point is that any realistic SM distribution is very weakly varying across the boundary ∂P N P , which justifies the use of (2.2) for our model-independent toy examples below. A typical background distribution of N = 500 points within the unit square is shown in the left panels of Figs. 1 and 2. It is evident that such a distribution does not have any obvious features and any wombling boundary would have to be created purely by chance.
• Signal with a flat boundary. As our first example of a hypothesized NP signal we shall consider a distribution f N P populating a region P N P with a flat boundary. Again following Refs. [16,17,22], we shall take the boundary to be the vertical line at x = 0.5 and the corresponding signal distribution f line to be flat and non-zero only to the left of the boundary: 3) where H is the Heavyside step function. When adding this signal to the background distribution, it is important to specify the mixing ratio.
In the middle and right panel of Fig. 1 we show distributions of N = 500 points according to (2.5) with ρ = 1.5 and ρ = 5, respectively. In the latter case, the value of ρ is sufficiently large that the boundary is clearly visible with the naked eye. As in Refs. [16,17,22] (which considered an even more extreme value of ρ = 6), the case of relatively large ρ is meant mostly for illustration -it makes it easier to visualize the benefits from the various wombling and denoising techniques introduced below. Our real target will be the case of relatively low values of ρ as shown in the middle panel of Fig. 1, where it is rather difficult to discern any apparent wombling boundary.
• Signal with a circular boundary. For completeness, we shall also consider an example of a signal in a domain bounded by a curvilinear boundary. Following [22], we shall take the signal distribution f circle to be confined to a circular region of radius R < 0.5 centered at (x, y) = (0.5, 0.5): so that the combined total distribution (1. 2) becomes with the ρ parameter suitably defined in analogy to (2.4) as the ratio of point densities across the circular boundary. The middle and right plots in Fig. 2 depict typical distributions of N = 500 points according to (2.5) with ρ = 1.5 and ρ = 5, respectively. As before, the boundary for ρ = 5 is clearly visible, but the case of ρ = 1.5 appears much more challenging.

Voronoi and Delaunay tessellations
The Voronoi and Delaunay tessellations of the point data sample in the right panel of Fig. 1 are illustrated in Fig. 3. In the left panels, which show the Voronoi tessellation, the data points appear as dots, while in the right panels, which illustrate the Delaunay tessellation, the data points are located at the vertices of the Delaunay triangles and are not explicitly shown.
As illustrated in the left panels of Fig. 3, a Voronoi tessellation of N points (often referred to as generators) in the plane is constructed as follows (see, e.g., [42]). Every location in the plane is assigned to the closest member of the point set. If a location happens to be equally close to two (or more) generator points, it is assigned to all of those points; all such eqidistant locations form the edges of the Voronoi graph. The set of locations assigned to a given member of the point set forms the Voronoi cell corresponding to that generator point; as seen in Fig. 3, the Voronoi cells in the plane are polygons, with the corresponding generator point located somewhere in the polygon's interior, but not necessarily at its geometric center. Note that Voronoi polygons have different shapes and sizes; in particular, the number of edges of a polygon varies greatly, the average number being no more than six [42]. An endpoint of a Voronoi edge is called a Voronoi vertex; alternatively, a vertex may be defined as a point shared by three (or more) Voronoi edges. When each vertex belongs to three and only three edges, the Voronoi tessellation is nondegenerate; as seen in Fig. 3 this will be our case as well, since the probability of generating a degenerate vertex in Monte Carlo sampled data is vanishingly small.
The right panels in Fig. 3 depict the corresponding Delaunay tessellation of the same data. Since the Voronoi and Delaunay tessellations are dual to each other, one way to construct the Delaunay tessellation is to start from the Voronoi diagram and join all pairs of generator points whose Voronoi polygons share a common Voronoi edge. If the Voronoi tessellation is non-degenerate, each Voronoi vertex belongs to exactly three Voronoi edges, which in turn define a triangular polygon in the Delaunay tessellation (see Fig. 3); for this reason the Delaunay tessellation is often referred to as a triangulation. The described procedure 9 also manifestly pairs up all Voronoi edges with their corresponding dual edges in the Delaunay tessellation; if each such pair of dual edges from the two tessellations has a common point, i.e., the dual edges cross each other, the Delaunay triangulation is known as Pitteway triangulation [65,66]. As we shall see explicitly below in Fig. 4, our datasets will generally not lead to Pitteway triangulations; so it will be important to keep in mind that some dual pairs of edges may be slightly offset and not intersect each other.
In what follows, our discussion will often switch back and forth between the two types of tessellations, so at this point it may be useful to build some intuition by recapping some of the relationships between the constituent objects of the Voronoi and Delaunay tessellations for a dataset consisting of N points: • Each data point defines both a Voronoi polygon and a corresponding Delaunay vertex; the total number of Voronoi polygons or Delaunay vertices is thus N . The number of sides of a Voronoi polygon is equal to the number of Delaunay edges joining at the corresponding Delaunay vertex.
• Each Voronoi edge has a dual Delaunay edge; the total number of Voronoi edges is therefore equal to the total number of Delaunay edges and is on the order of, but slightly less than, 3N [42].
• Each Voronoi vertex defines a corresponding Delaunay triangle; the total number of such objects is on the order of, but slightly less than, 2N [42].
In summary, we have the following duality relations between the elements of the Voronoi and Delaunay tessellations: Note the trade-off in complexity -in the Voronoi case, vertices are labelled with exactly 3 indices, but the number of polygon edges N i varies, while in the Delaunay case, the number of polygon edges is always 3, but vertices are labelled with a varying number of indices.

Candidate boundary objects
Having described the Voronoi and Delaunay tessellations of the data, before proceeding to the next two stages of gradient computation and tagging, it is worth pausing for a moment to discuss which elements of the tessellation are best suited for describing a wombling boundary. In Fig. 3 the theoretical boundary at x = 0.5 was marked with a vertical blue dashed line, but this was done only to guide the eye, since in reality both the existence and location of the boundary will be a priori unknown. In practice, we need to identify individual elements of the tessellations located at (or close to) the boundary which could be targeted by the wombling analysis. By including the Delaunay tessellation into our discussion, we obtain three new possibilities in addition to the approach of Refs. [16,17]. The four panels in Fig. 3 illustrate these four options: • Boundary Voronoi Cells (BVCs), shown in the upper left panel in Fig. 3. When working with the Voronoi tessellation, this is the most natural and perhaps only option. A Boundary Voronoi Cell was defined as any Voronoi cell which is crossed by the theoretical boundary [16,17]. In the upper left panel of Fig. 3 and in the left panel of Fig. 4, the BVCs are shaded in yellow.
• Boundary Delaunay Triangles (BDTs), shown in the upper right panel in Fig. 3. When working with the Delaunay triangulation, in analogy we can now define a Boundary Delaunay Triangle to be any Delaunay triangle which is crossed by the theoretical boundary. In the upper right panel of Fig. 3 and in the right panel of Fig. 4, the BDTs are shaded in yellow.
• Boundary Delaunay Edges (BDEs), shown in the lower right panel in Fig. 3. Similarly, we can define a Boundary Delaunay Edge to be any Delaunay edge which is crossed by the theoretical boundary. In the lower right panel of Fig. 3 and in the two panels of Fig. 4, the BDEs are indicated with green lines. • Boundary Voronoi Edges (BVEs), shown in the lower left panel in Fig. 3. Finally, using the duality relation (2.8b) we can define a Boundary Voronoi Edge to be any Voronoi edge which is dual to a Boundary Delaunay Edge. In the lower left panel of Fig. 3 and in the two panels of Fig. 4, the BVEs are indicated with red lines.
Note that the last three options all rely on the Delaunay triangulation and would not have been possible if we were only considering the Voronoi tessellation of the data. Given the duality relations (2.8) between the Voronoi and Delaunay tessellations, the different categories of boundary objects defined above are related to each other. These relationships are exhibited in Fig. 4, where we simply superimpose some of the results from Fig. 3 in order to better see the existing correlations. Figs. 3 and 4 demonstrate that all four definitions lead to a contiguous set of boundary objects strung along the theoretical boundary. Now the question becomes how to tag these boundary objects with a suitable algorithm using their geometric properties.

Gradient estimation from a Voronoi tessellation
As discussed in the Introduction, the Voronoi tessellation provides a natural estimate (1.5) for the values of the function f at the location of each generator point. In our case, since we are dealing with a two-dimensional dataset (2.1), eq. (1.5) reduces to where a i is the area of the i th Voronoi cell V i . The identification (3.1) is pictorially illustrated in Fig as [16] ( where the prefactor of (a i a j ) 3 4 was included to make the directional derivative dimensionless. Since each Voronoi cell V i has a varying number of edges N i ≡ |N i |, there will be a different number of directional derivatives available at each point i, but they can all be fitted to the expected distribution from the true gradient, thus producing an estimate G i of the gradient vector at the i th generator point P i [16]. Another variable explored in Ref. [16] was the relative standard deviationσ i of the areas of the neighboring cells, is the mean area of the neighbors of the i th Voronoi cell. As demonstrated in Refs. [16,22,56], among the different possibilities, the relative standard deviation (3.3) performed rather well in tagging the BVCs. Of course, those studies were utilizing only the Voronoi tessellation, which is not ideal for computing gradients. The Delaunay triangulation, on the other hand, provides a more natural framework for the gradient estimation 10 , as will be discussed in the following subsection.

Triangulation wombling from a Delaunay tessellation
The Delaunay triangulation leads to a natural method for computing the gradient known as "triangulation wombling" [34,39]. The starting point is the observation that for each Delaunay triangle D α , the functional values at its three vertices are known from (3.1). Three points are enough to fit a plane, whose slope will provide an estimate of the gradient vector G α to be associated with the Delaunay triangle D α . Recall the duality relation (2.9c) which maps the triangle D α to its three vertices which carry indices i, j and k in the Voronoi tessellation. We can then parametrize the plane defined by D α as where C is some constant. Applying this relation at each vertex, we obtain three independent equations which can be solved for the gradient G α and the constant C as [34,39]    From here, a wombling analysis would typically focus on the magnitude of the gradient and proceed to select Delaunay triangles D α with relatively large values of G α , typically in the top 10 th percentile of all cells, as candidates for boundary elements. However, the straightforward application of this procedure leads to a problem which is illustrated in Fig. 6. In the left panel we show the distributions of gradient magnitudes G α as calculated by the triangulation wombling method just described. We see that the distribution is a very steeply falling function with a long tail -most Delaunay cells have relatively small gradients and only a small fraction populates the large G α tail. Now, if the cells on the tail were predominantly BDTs, the method would have succeeded and there would be no problem, but unfortunately, that is not the case. In the middle panel of Fig. 6 10 A purist might say that, since the two tessellations are dual to each other, strictly speaking there is nothing more to be gained from the Delaunay tessellation that could not have already be obtained from the Voronoi tessellation. While this may be technically correct, we found the Delaunay tessellation useful in hinting at some new techniques and ideas as discussed below. we show a scatter plot of the calculated gradient magnitudes G α versus the horizontal position of the corresponding Delaunay triangle D α as given by its centroid. We notice that the gradients computed for cells in the dense region (x < 0.5) tend to be much larger than the gradients in the sparse region (x > 0.5) (this is to be expected, since statistical fluctuations scale as √ f ). As a result, the cells with large gradients will tend to be more or less uniformly distributed in the dense region, with no relation to the boundary at x = 0.5. This is confirmed in the right panel of Fig. 6, where we identify with red shading the Delaunay triangles whose gradients are in the top 10 th percentile of gradient magnitudes. As anticipated from the result in the middle panel, the red-shaded Delaunay triangles are located almost entirely in the dense region and there is no apparent clustering near the boundary. This means that in order to properly tag the BDTs, we must first pre-process the computed gradients G α in order to mitigate the effect of the statistical noise.

Denoising
In this subsection we outline three different procedures for denoising the computed gradient vectors G α . As we shall demonstrate, each of them has the desired effect, and the optimal approach will be some combination of the three, although finding out the exact proportions is beyond the scope of this paper.

Rescaling of the naive gradients
The first approach is to rescale each calculated gradient magnitude G α as where a i , a j and a k are the areas of the Voronoi cells centered on the three vertices of the Delaunay triangle D α (recall the duality relation (2.9c)). Fig. 7 demonstrates that the rescaling (3.8) does have the desired effect. In the left panel, the distribution of rescaled gradient magnitudesG α still follows the same trend as in Fig. 6, but the numerical values have been reduced by several orders of magnitude. More importantly, the middle panel of   7 confirms that the rescaled gradient magnitudes are now uniformly consistent across the two regions, with a cluster of points with relatively largeG α beginning to emerge near the theoretical boundary. The right panel of Fig. 7 shows the updated plot of the Delaunay triangles falling in the top 10 th percentile ofG α . We observe that, while the Delaunay cells tagged as BDTs are still scattered throughout the field of view, a significant fraction of them appears at the location of the boundary, which further validates the rescaling procedure (3.8). That is why from now on, unless specified otherwise, we shall always work with gradient vectors which have been rescaled as in (3.8).

Lloyd steps uniformization (LSU)
Another method for removing unwanted noise fluctuations in the data was explored in [16] and involved the so called Voronoi relaxation of the data by means of Lloyd's algorithm [67]. A closer inspection of the Voronoi tessellations depicted in Figs. 3 and 4 reveals that after the tessellation is constructed, the generator points can be found pretty much anywhere within the Voronoi polygon -near the center of the cell, close to an edge, or somewhere in between. The idea of the Voronoi relaxation is to make the whole Voronoi structure more uniform by performing several steps (or iterations) of Lloyd's algorithm, where at each iteration, the generator point is moved to the centroid of the corresponding Voronoi cell and the tessellation is redone. The effect of performing such Lloyd step uniformization (LSU) on our data is shown in Figs. 8 and 9. Each figure has 8 panels, depicting the Delaunay tessellation 11 of the data after a certain number of Lloyd iterations, starting with 0 (no Lloyd steps) in the upper left panels and going up to 20 Lloyd steps in the lower right panels. In addition, in Fig. 8 each Delaunay triangle D α is color-coded by the rescaled magnitudeG α of the respective gradient (the gradients are recalculated after each step). Note that the color bar extends only up to 60% of the largest gradient magnitude found in the data 12 ; any cell with a gradient magnitude above that threshold is colored yellow, essentially creating an overflow color bin. This was done in order to minimize the effect of outliers and better visualize  the bulk of the cells with the more typical values. Fig. 9, on the other hand, marks the Delaunay triangles falling in the top 10 th percentile ofG α values, just like the plot in the right panel of Fig. 7.
There are several lessons that can be drawn from Figs. 8 and 9. First, as expected, the Lloyd relaxation causes the Delaunay triangles to become more regularly shaped. For example, notice that a large fraction of the triangles in the original Delaunay tessellation are obtuse (see the right panels in Figs. 3-7). However, within a few Lloyd steps, the fraction of obtuse triangles drops significantly and obtuse triangles are rather rare 13 in the plots in the lower rows of Figs. 8 and 9. More importantly, as shown in Figs. 8, the LSU procedure also tends to wash out the noisy fluctuations in the calculated rescaled gradient magnitudes G α within the bulk regions away from the boundary (note the decreasing range ofG α on the color bars). This further sharpens the contrast between the Delaunay triangles situated near the boundary versus those in the bulk. In particular, notice the gradual emergence of the boundary, which becomes quite pronounced and unmistakable after 5-7 Lloyd steps. However, the figures also show that the number of Lloyd steps should be chosen with care -applying too few may not optimally showcase the boundary, while applying too many may cause the boundary to start disintegrating, as evidenced in the lower right panels after 20 iterations.

Local averaging of gradient vectors
A third approach for smoothing out the local statistical fluctuations in the data is to perform some type of averaging procedure over a region extending beyond the individual Voronoi or Delaunay cells and their immediate neighbors. For example, Ref. [16]  for ease of comparison we reproduce again in the left panel of Fig. 11. The difference is that the color map plots in Figs. 8 and 11 identify cells only by the magnitudeG α of the gradient, while the vector field plots of Fig. 10 include the directional information as well, which is useful in visualizing the spatial patterns and correlations of the gradient vectors. Now, given the rescaled gradient vectors˜ G α shown in the left panel of Fig. 10, there are two ways to perform local averaging of these vectors, depending on whether we want to associate the result from the averaging with a Voronoi cell (i.e., a Delaunay vertex) or with a Delaunay triangle: • Voronoi cell averaging. Recall that according to the duality relation (2.9a), each Voronoi cell V i corresponds to a Delaunay vertex D α 1 α 2 α 3 ...α N i , which is the common vertex of N i Delaunay triangles D α 1 , D α 2 , D α 3 , . . . , D α N i . Thus we can simply define the average gradient G i at any Voronoi cell V i to be 14 • Delaunay cell averaging. Once we have the averaged vectors (3.9) at our disposal, we can go back to each Delaunay triangle and further average the three vectors (3.9) associated with its three vertices.
where the indices i, j and k label the data points at the vertices of D α (recall the duality relation (2.9c)).
The vector fields resulting from the averaging prescriptions in eqs. (3.9) and (3.10) are shown in the middle and right panels of Fig. 10, respectively. One can see that the statistical fluctuations are indeed getting suppressed as a result of the averaging, and furthermore, the directions of the gradient vectors near the boundary are becoming better correlated with each successive iteration. This directional correlation will become important in the next two sections where we shall compare the properties of neighboring cells in the tessellation. For now, in order to demonstrate the benefits from the averaging procedures (3.9) and (3.10) for the purposes of boundary detection, it is sufficient to update the color maps from Fig. 8 using the magnitudes of the averaged gradients instead. This is done in the middle and right panels of Fig. 11, where the individual cells in the tessellation have been colorcoded by the magnitudes of the Voronoi-averaged gradient (3.9) and the Delaunay-averaged gradient (3.10), respectively. By comparing Fig. 11 to Fig. 8, we see that the local averaging procedures produce comparable benefits to Voronoi relaxation, so that we can view the two procedures as alternatives to the other. More specifically, local averaging seems to be at least equivalent to (if not better than) running on the order of 5-7 Lloyd iterations, which seemed to be the optimal choice in Figs. 8 and 9. Of course, the two methods can also be applied simultaneously, so that their benefits can be optimally exploited. In our analysis of Sections 7 and 8, unless specified otherwise, we shall choose to employ the Delaunayaveraged gradient vectors (3.10).

Tagging elements of the tessellation as boundary candidates
Armed with the various estimates of the local gradient vectors discussed in the previous section, we are now ready for the next step of the wombling algorithm, namely, the tagging of Voronoi or Delaunay cells as boundary candidates. The standard approach is to place a lower cut on the relevant variable (typically the magnitude) which measures the size of the gradient. This selection singles out a certain set of candidate boundary cells as shown in Fig. 9. The purpose of this section is to study how effective 15 this selection is and to suggest a potential improvement of the standard approach by utilizing the correlations between gradient vectors computed in neighboring cells. The idea will be to place a premium not just on cells whose own gradient˜ G α has a large magnitude, but on cells where the neighboring gradients˜ G β have both a) large magnitudes and b) correlated directions with G α . A convenient variable which captures the desired correlations between two vectors˜ G α and˜ G β is their dot product,˜ G α ·˜ G β [16]. Fig. 12 shows scatter plots of such dot products of neighboring vectors for the three types of gradients introduced in the previous section:˜ G α ·˜ G β (upper right panel), G i · G j (lower left panel) and G α · G β (lower right panel). In each case, the result is plotted versus the horizontal position (x i + x j )/2 of the midpoint of the respective 16 Delaunay edge 15 Selection efficiency is typically illustrated with ROC curves, where one varies the cut on the selection variable and plots the fraction of signal versus the fraction of background surviving the cut. This was also the procedure used in Refs. [16,22]. Here, however, we prefer to simply show scatter plots of the tagging variable versus the distance to the boundary. In this way, we avoid the need to define what exactly is meant by a "boundary" object versus a "bulk" object. 16 In the case of˜ Gα ·˜ G β and G α · G β , the relevant Delaunay edge is simply D αβ , i.e., the edge separating    .3) versus the horizontal position x i of the corresponding generator point P i . As explained in Sec. 3.1, the relative standard deviationσ i is constructed from the Voronoi tessellation and was found to perform best among several other Voronoi-constructed alternatives [16]. The upper left panel in Fig. 12 confirms that the highest values ofσ i are indeed found for cells near the boundary; in fact, the top 12 highestσ i values belong to such cells. At the same time, we also observe a significant variation in theσ i values for cells dual to the Voronoi edge Vij, see eq. (2.9b). in the bulk; forσ i < 0.7 this starts introducing a certain number of false positives.
The remaining three panels of Fig. 12 demonstrate that the corresponding dot products of gradients computed from the Delaunay tessellation are also efficient in identifying boundary objects (in this case, Delaunay edges). Among the three options illustrated in the plot, the dot product of the Delaunay-averaged gradients seems to perform the best -the top 45 highest dot products of neighboring G α vectors belong to Delaunay edges near the boundary; and there is a well defined cluster of points with large y values in the boundary region 0.45 < x < 0.6. Note the different y-axis range on these three plots -the variation of dot product values is largest for˜ G α ·˜ G β and smallest for G α · G β , further demonstrating the beneficial effects from the averaging procedures discussed in Sec. 3.3.3.
Comparing the top left panel in Fig. 12 to the other three panels, we conclude that the gradient dot products, which take advantage of the correlations between neighboring gradient vectors in terms of both direction and magnitude, are able to identify the boundary better thanσ i and similar variables. As a byproduct of this new method of tagging, we have also automatically built up a network of associations among the Delaunay cells in the triangulation, which is readily available for use in the next step (agglomeration), where one attempts to construct the actual boundary. This is the subject of the next section.

Agglomeration of tagged boundary elements
The previous step (the tagging of boundary elements) typically fails to result in a continuous boundary, especially in case of weak signals. Instead, the algorithm produces a collection of scattered "islands" of tagged cells, as seen in the right panels of Figs. 6 and 7 and in the top panels of Fig. 9. This necessitates the next step of agglomerating the individual tagged cells into subgraphs and evaluating whether the resulting pattern is consistent with a linear boundary [29]. The downside of this approach is that it does not take into further consideration the cells which have failed the tagging cut -those cells are simply ignored from this point on. Another potential drawback is that the tagging and agglomeration steps are done independently from each other, so that the existence of any spatial correlations among neighboring cells is not being used during the tagging. As already mentioned in the previous section, both of these problems are avoided when we use the dot products of neighboring gradients as tagging variables. As illustrated in the last three panels of Fig. 12, each dot product of gradients can be uniquely associated with a Delaunay edge D αβ or with its dual Voronoi edge V ij , see eq. (2.9b). We can then treat the original Voronoi and Delaunay tessellations as weighted networks, where each edge is assigned a weight equal to the dot products of the corresponding two gradients. This weighted network representation is illustrated in Fig. 13, where we superimpose the Voronoi tessellation (red lines) and the Delaunay triangulation (blue lines). The weight of an edge is indicated by the line thickness -thicker lines imply higher weights and vice versa. The three panels show three different ways to compute the weights from the gradient dot products, depending on which set of gradient vectors from Fig. 10 we choose to use:˜ G α ·˜ G β (left panel), G i · G j (middle panel) or G α · G β (right panel). As seen in Fig. 12, a certain fraction of dot products are negative; if that is the case, the corresponding edge is not plotted. Weighted network representations of the Voronoi tessellation (red lines) and the Delaunay triangulation (blue lines). The line thickness is proportional to the weight, which is given by˜ G α ·˜ G β (left panel), G i · G j (middle panel) and G α · G β (right panel). Edges with negative weights are not shown.
The three panels in Fig. 13 can be contrasted with the corresponding results in Fig. 11, where we used just the magnitudes of the individual gradient vectors, without any reference to their neighbors. The boundary seems to be better outlined in Fig. 13, particularly when we make use of the averaging procedures from Sec. 3.3.3. We also note the benefit of plotting the Voronoi and Delaunay tessellations simultaneously -the orientation of the edges with respect to the boundary is random, so whenever a given edge happens to be orthogonal to the boundary, its dual tends to be parallel to it, so taken together, they trace out the correct shape of the boundary.  Fig. 12, where we observed that while many edges situated close to the boundary enjoyed relatively large values of their gradient dot products, there was also a non-negligible fraction of edges near the boundary with rather low values of the gradient dot products. Fig. 13 now reveals that these two populations are spatially correlated -note how the edges with large values of the gradient dot products are linked together, as are their counterparts. This confirms that using the gradient dot products as tagging variables automatically also takes care of the agglomeration.
Until now we have been following the standard steps of a wombling analysis. As already mentioned, the last remaining step is to evaluate the statistical significance of the observed pattern of tagged boundary candidates. Note that the last three figures illustrate tagging procedures for each of the four types of boundary candidate objects defined in Sec. 2.3: Voronoi cells (middle panel in Fig. 11 and upper left panel in Fig. 12), Delaunay cells (left and right panels in Fig. 11), Delaunay edges (upper right and lower panels in Fig. 12 and all three panels in Fig. 13) and Voronoi edges (Fig. 13). Of course, since the Voronoi and Delaunay edges are dual to each other, any procedure which can tag one edge type can also be applied to tag the other.
In the next three sections we shall outline an alternative approach originally proposed in [40], which allows us to perform the tagging, agglomeration and statistical evaluation steps in one go. In order to introduce and illustrate the method, in the next section Sec. 6 we shall start with the case of a continuously defined function f (x, y) and then proceed to analyze the case of point data in Secs. 7 and 8.

Finding wombling boundaries: analytical examples
In order to bypass the tagging and agglomeration steps, Ref. [40] proposed to directly consider various curves C in the (x, y) plane, and to associate a "wombling measure" Γ with each one, so that true wombling boundaries can be identified by their large values of Γ. Since a wombling boundary is supposed to represent a zone of rapid change in the function f , it is natural to define the wombling measure in terms of the local gradient ∇f suitably integrated along C. In particular, Ref. [40] defined Γ to be the total gradient flux through wheren C is a unit vector normal to the curve C and d is the infinitesimal length along C. Additionally, Ref. [40] also considered the average gradient flux where using the total length of the curve as a normalization factor eliminates the unfair advantage of curves which happen to be too long. For this reason, in what follows we shall make use of (6.2) and not (6.1). 17 Without any further constraints on the type of curves C that we are allowed to consider, this method would be rather impractical. To make further progress, two approaches are possible. The first one is the model-dependent route -if we specify exactly what kind of new physics model generates the wombling boundary, then C can be specified by only a handful of parameters (typically the masses of the new particles). Then the problem of maximizing the functional (6.2) over all possible curves C reduces to a simple global maximization problem in the parameter space describing C [23]. Here, however, we would like to stay as model-independent as possible, so we shall not assume any specific parametrization of the boundary. At the same time, we do not want to consider arbitrarily general curves C either.
An intermediate compromise approach is the following. Note that any curve C can locally be approximated by a straight line segment. Therefore, we can perform a scan of the (x, y) plane where at each point P we try a line segment (centered on P ) of fixed length L and arbitrary angular orientation ϕ. Each point in the so-defined 4-dimensional parameter space (x, y, L, ϕ) corresponds to a well-defined line segment, for which the wombling measure 17 An additional variation mentioned in [40] was to consider integrating the flux in absolute value, e.g., in order to avoid cases where a large positive flux over one section of the curve C is cancelled by a large negative flux over another section. However, in our case such cancellations are welcome since the noise fluctuations are random and we would like to allow them to cancel out each other as much as possible. We have confirmed numerically in our examples that the absolute value alternative leads to lower sensitivity.  would then identify (segments of) the wombling boundary. Since this procedure involves optimization in a 4-dimensional parameter space, it will be difficult to illustrate here. This is why from now on we choose to focus on the (L, ϕ) subspace -one can think of this as first zooming in on an interesting region of the (x, y) plane and then testing for the presence of a linear wombling line segment. Our reparametrization of the remaining two degrees of freedom describing the line segment is illustrated in Fig. 14. As before, we retain the unit square as our field of view. We then consider all possible straight lines crossing the unit square -each such line can be identified by the point where it enters the square and the point where it exits the square. We shall identify the locations of those two points by their respective coordinates p in and p out measured along the perimeteter, as shown in the left panel of Fig. 14. Since the perimeter of a unit square is equal to 4, the parameter space (p in , p out ) spans the 4 × 4 square shown in the right panel of Fig. 14 -any point within that 4 × 4 square can be uniquely associated with a straight line crossing the field of view in one of the two possible directions. For example, (p in , p out ) = (0, 2) describes a diagonal line traversed from the lower left corner to the upper right corner, while (p in , p out ) = (2, 0) describes the same diagonal line covered in reverse. In our previous examples, the true boundary was located at x = 0.5, and corresponds to either 18 (p in , p out ) = (0.5, 1.5) or (p in , p out ) = (2.5, 3.5).
In the remainder of this section and in the next Sec. 7, our main goal will be to compute the wombling measureΓ[C] in the (p in , p out ) parameter space and identify the relevant wombling boundary segment(s). First we shall illustrate this procedure with the example of a continuous function f (x, y) before tackling the case of point datasets in the next section.

A straight line boundary
In this subsection we shall revisit the vertical straight line boundary example from the previous sections. However, we shall not use the original distribution (2.5), for two reasons: first, the discontinuity at x = 0.5 would generate an infinite gradient when computed analytically, and second, the distribution (2.5) corresponds to an idealized situation where the effects of particle widths and detector smearing are ignored. In any realistic experimental analysis the sharp step at x = 0.5 will be smeared and the boundary will be characterized by a large but finite gradient. This is illustrated in Fig. 15, where the black dotted lines show (the x-dependence of) the original distribution (2.5) before any smearing, for ρ = 5 (left panel) and ρ = 1.5 (right panel). We then apply Gaussian smearing with σ = 0.2, resulting in the red histograms, which have the typical shapes expected in a realistic experimental analysis. In particular, notice how the gradient at the boundary is significantly reduced as a result of the smearing, making the task of finding the wombling boundary quite challenging. At this point we fit an analytical function to the so obtained smeared distributions. For the fit, we choose to utilize the (unit-normalized) ArcTan sigmoid function whose derivative is maximal (in absolute value) at x = a, the sharpness of the transition being controlled by the parameter k. The remaining parameter ξ is analogous to ρ in the sense that (compare to (2.4)) Since our field of view is limited to x ∈ [0, 1] and the boundary is at x = 0.5, for the actual fit we choose the parametrization f (x, y) = g res (x; a = 0.5, ξ, k) (6.4) and then adjust ξ and k to match the smeared distributions shown by the red histograms. As seen in Fig. 15, the fit reproduces the effects of smearing rather well, so in the rest of this subsection we shall use (6.4) as our analytically defined distribution.
Using the respective fit (6.4) as our proxy, we can now compute the wombling measurē Γ[C] in the (p in , p out ) space of all lines intersecting our unit square. The result for ρ = 5 (ρ = 1.5) is shown in the left (right) panel of Fig. 16. We choose to plot the absolute value ofΓ[C], since the sign ofΓ[C] is determined by the direction in which we traverse the line, and does not have any bearing on whether the line is a wombling boundary or not. Fig. 16 reveals that, as expected, there are two locations with maximal wombling measures: at (p in , p out ) = (0.5, 1.5) and at (p in , p out ) = (2.5, 3.5). Both of those correspond to the same vertical line at x = 0.5 which was the true boundary. This demonstrates that the method is indeed able to find the correct boundary. The significance of these findings, however, is sensitive to the amount of signal present -in the left panel, where ρ = 5, the two winning answers are very clearly identified, while in the right panel (using the same color scale) they appear to be less noticeable, which is a hint that the effect might be in danger of being washed out once we include the statistical fluctuations present in the point data -this issue will be investigated in detail in Sec. 7 below.

A circular boundary line
The example in the previous Sec. 6.1 might appear somewhat contrived since the true boundary was a straight line and at the same time, we also used straight lines in computing the wombling measureΓ [C]. Since the shapes of the lines match, it was inevitable to find a unique best match, as shown in Fig. 16. To be fair, we shall now consider a less trivial example where the true boundary has a different shape from the line segments which we use to test for the presence of a wombling boundary. In particular, we shall revisit the case of a circular boundary introduced in Sec. 2.1, where the probability distribution was given by (2.7). Once again, we shall not rely directly on (2.7), but in order to account for the detector resolution, we shall sample the events according to f (r, ϕ) ∼ rg res (r; a = 0.25, ξ = 2, k = 30), (6.5) where r and ϕ are the polar coordinates in the plane, measured from an origin at the center of the circle, and the smearing function g res was already defined in (6.3). The resulting probability distribution is plotted in the left panel of Fig. 17. Note that the densities on the two sides of the circular boundary differ by no more than a factor of 2, so in this sense this example is analogous to ρ ∼ 2 in our usual notation. The corresponding heat map of |Γ[C]| in the (p in , p out ) parameter space is shown in the right panel of Fig. 17. The most striking difference from the previous results in Fig. 16 is that now we find not just a single wombling boundary candidate, but a whole class of wombling boundaries, identified by the two 19 bright yellow stripes running diagonally across the plot. A careful inspection of the right panel in Fig. 17 reveals that each of the identified wombling boundary candidates is tangential to the circular boundary, which suggests that the dominant contribution to the integral comes from the region in the vicinity of the circular boundary, where the gradient is largest. The slight difference in the brightness along the stripes can then be attributed to the different orientation of the lines, which leads to differences in the (sub-dominant) flux contributions in the regions away from the true boundary.

Finding wombling boundaries: point data examples
Having illustrated the basic idea of Refs. [40,63] with the continuous examples from the previous section, we shall now apply it to point data. Following the outline of Sec. 6, we 19 We obtain two stripes because of the double counting (pin, pout) ←→ (4 − pout, 4 − pin) due to the possibility to traverse a line segment in each of the two opposite directions, as shown in the right panel of Fig. 14. shall first consider the case of a straight line boundary in Sec. 7.1 and then the case of a circular boundary in Sec. 7.2.

A straight line boundary
7.1.1 An example with ρ = 5 Our first straight-line boundary example will be the same point data example which we have been using so far throughout the paper to illustrate the various methods and techniques of a wombling analysis, see Figs. 3, 4, 6, 7, 8, 9, 10, 11, 12 and 13. (As a reminder, we used N = 500 points generated according to the distribution (2.5) with ρ = 5.) In particular, we shall repeat the procedure from Sec. 6 and compute a wombling measureΓ[C] for each possible straight line C crossing the field of view. However, since we are now dealing with discretely sampled point data instead of a continuous function f (x, y), we need to adapt the definition (6.2) as followsΓ where each sum runs over all Delaunay triangles D α which are crossed by the straight line C and, as before, G α is the Delaunay-averaged 20 gradient vector (3.10) associated with D α . Since the Delaunay tessellation gives complete coverage of the field of view, the line C necessarily gets fragmented into individual line segments of length ∆ α , defined so that each segment is contained within a single Delaunay cell D α . Finally,n C is a unit vector orthogonal to the straight line C and therefore, to each individual line segment ∆ α as well, so that an index α on it is unnecessary. Fig. 18 shows a heat map of the wombling measureΓ[C] computed from eq. (7.1) throughout the parameter space (p in , p out ) of all possible straight lines passing through our data. As a pre-processing step, we applied one Lloyd iteration and then used the Delaunayaveraged gradient vectors illustrated in the right panel of Fig. 10. The heat map in Fig. 18 contains four dark squares situated along the diagonal from the upper left corner to the lower right corner. All points within those four squares define line segments C which do not enter the field of view at all, and instead run along one of the edges of the field of view. Clearly, such line segments are irrelevant for our wombling boundary analysis, so they have been assignedΓ[C] = 0 by default and are excluded from further consideration. Fig. 18 reveals that there is a unique line with the largest possible wombling measure -let us denote this winning line with C w : By construction, the winning line C w is the best wombling boundary candidate among the set of all straight-line boundary candidates. Is it the correct wombling boundary though? 20 In principle, we can define the wombling measure (7.1) in terms of the original gradient vectors˜ Gα or in terms of the Voronoi-averaged gradient vectors G i defined in (3.10). However, as argued in Secs. 4 and 5, the Delaunay-averaged vectors offer the best option for our purposes. Figure 18.
Results from applying the wombling procedure of Sec. 6 to the illustrative point data example used in the previous sections (N = 500 points generated from (2.5) with ρ = 5). For concreteness, at the pre-processing stage we applied one Lloyd iteration and then used the Delaunay-averaged gradient vectors which were defined in (3.10). In constructing the heat map, we sampled the (p in , p out ) parameter space on a 200 by 200 grid with step size 0.02 and then computed Γ[C] from eq. (7.1). The four dark squares in the plot correspond to line segments which overlap with one of the edges of the unit square which is our field of view -those segments were assigned Γ[C] = 0 by default and excluded from further consideration. For reference, the true boundary is located at (p in , p out ) = (0.5, 1.5), or equivalently, at (p in , p out ) = (2.5, 3.5).
According to the result from Fig. 18, the answer in this example is yes: C w is found at the exact location (0.5, 1.5) (or equivalently, (2.5, 3.5)) of the true theoretical boundary x = 0.5. This can be verified explicitly in Fig. 19, where we plot C w overlaid on top of the Delaunay-averaged gradient vectors G α from the right panel of Fig. 10. Fig. 19 not only confirms that C w is the correct wombling boundary, but also helps us understand why C w was chosen by the algorithm. Note that for any given line C, the calculation of its wombling measureΓ[C] depends only on the Delaunay cells D α which happen to be crossed by the line C -in Fig. 19 those cells are shaded in red. A careful inspection of Fig. 19 reveals that in the large majority of red-shaded Delaunay cells the gradient vectors are both large and (roughly) orthogonal to the line C w , thus maximizing the average flux (7.1) through it. To better visualize this, we have color-coded the individual line segments ∆ α of C w according to their individual contributions G α ·n C to the total flux, with warm (cold) colors corresponding to large (small) values. We see that C w is predominantly colored with warm colors, indicating large fluxes all the way throughout. On the other hand, it is not Figure 19. The line C w with the largest wombling measureΓ[C] in Fig. 18, plotted over the Delaunay-averaged gradient vector field from the right panel of Fig. 10. The individual line segments ∆ α have been color-coded by their respective individual contributions G α ·n C to the total flux. The red-shaded Delaunay cells are those which are crossed by the line C w and are therefore included in the two summations in (7.1).
difficult to convince oneself that this will not be true for any other randomly chosen line C crossing the field of view -the gradient vectors may happen to be relatively large and perhaps even roughly orthogonal to it purely by chance in some restricted region, but this will not occur consistently along the full length of the line as was the case with C w . In short, the wombling procedure applied here automatically takes into account the spatial correlations of the gradient vectors along a wombling boundary. 7.1.2 An example with ρ = 1.5 We are now ready to tackle a more difficult case, with a smaller signal to background ratio. In this subsection we consider N = 1000 data points, still distributed according to (2.5), but with the much smaller value of ρ = 1.5. This data is shown in the left panel of Fig. 20. Unlike the previous example, this time the wombling boundary is not as easy to identify visually in the data. As already discussed in Ref. [16], the weaker the signal, the more Lloyd steps are needed for optimal results. Correspondingly, here we apply 5 Lloyd iterations at the preprocessing stage and obtain the data shown in the right panel of Fig. 20, on which the subsequent wombling analysis is done.
The end result from our wombling procedure is shown in Fig. 21, which is the analogue of Fig. 18. We notice that the typical values obtained forΓ[C] are now much lower than what we saw in Fig. 18. This is to be expected due to the smaller value of ρ -the boundary is less pronounced, and the magnitudes of the gradient vectors are generally reduced as well. Despite these difficulties, the boundary is still correctly identified -notice the two bright spots in the heatmap located near (0.5, 1.5) and (2.5, 3.5), which is the right answer.  The corresponding winning line C w hasΓ[C w ] = 0.19 and is plotted in Fig. 22, where for illustration we also show a second, rather generic, line C at (p in , p out ) = (0.7, 0.8) which has a more typical value of the wombling parameter,Γ[C] = 0.08. The individual line segments of each of the two lines are color-coded similar to Fig. 19, i.e., proportional 21 to their individual contributions to the total flux. We see that the winner C w is again colored with mostly warm colors, indicating large flux contributions everywhere (except for just a few spots where the gradient vectors happen to be either too small or oriented along the line), while the generic line C is colored with mostly cold colors, confirming that its wombling measureΓ is indeed rather small.
By comparing Figs. 18 and 21, one can notice that in Fig. 21 we have excluded from consideration not only lines which run purely along the perimeter of the field of view but also any line of length less than 0.5; this additional constraint results in the elimination of several quadrant sectors on the plot -one at the lower left corner, one at the upper right corner, and six along the diagonal running from the upper left corner to the lower right corner. Since our wombling measure is the average flux, short lines which literally "cut a corner" of the square field of view, can potentially pick up large gradients due to local fluctuations in the bulk, without an opportunity to cancel those fluctuations elsewhere. In other words, if we encounter two candidate lines with the same value ofΓ, and one is much longer than the other, then we would treat the longer line as the more likely wombling boundary. Thus eliminating very short lines from consideration early on would go a long way in simplifying the significance estimation procedure, see Sec. 8 below.

The results from a wombling analysis with straight line segments
We now proceed with the discrete version of the circular boundary example considered in Sec. 6.2. The point dataset is shown in the left panel of Fig. 23 and consists of N = 1000 points distributed according to the probability distribution with a circular boundary (2.7) with ρ = 5. As a preprocessing step, we then apply a single Lloyd iteration, obtaining the  data shown in the right panel of Fig. 23, on which the subsequent wombling analysis is performed.
The results from the wombling procedure are displayed in Figs. 24 and 25. Fig. 24 is the analogue of Figs. 18 and 21, but for the circular boundary example considered in this subsection. Comparing to those previous figures, we notice that the largest absolute values forΓ obtained here are lower than those in Fig. 18 but higher than those in Fig. 21. The former is due to the fact that we are using straight line segments to test for a curvilinear wombling boundary, and no single line segment can capture the full extent of a circular boundary, while the latter is due to the fact that the value of ρ is higher for the dataset used to produce Fig. 24. Fig. 24 exhibits the typical pattern observed in the continuous version of this example (the right panel of Fig. 17). The locations with the largest values of |Γ| trace out the two stripes seen in Fig. 17, which indicates that in the discrete version of the example it is still the line segments tangential to the circular boundary which tend to have large values of |Γ|. The line C w with the largest wombling measure happens to be at (p in , p out ) = (0.55, 1.9) and is plotted in Fig. 25, together with the set of Delaunay-averaged gradient vectors G α (left panel) or the heatmap of their magnitudes (right panel). As anticipated, C w is tangential to the circular boundary, and the largest contributions to its wombling measure indeed come from the (warm-colored) segments in the vicinity of the boundary. One should keep in mind that the particular line C w shown in Fig. 25 is a very close winner among several other worthy challengers with similar values for the wombling measure -as Fig. 24 showed, there are several locations along the two bright stripes with similarly large values of |Γ|.

Identifying the true shape of the boundary
The analysis from the previous subsection 7.2.1 demonstrated that even in the case of a curvilinear boundary, our wombling procedure produces reasonable results -it is able to identify a class of line segments, each of which already contains a portion of the true boundary. Unfortunately, none of the identified line segments is able to reproduce the full boundary all by itself. In this subsection, we shall therefore address the question of being able to globally reconstruct the wombling boundary, regardless of its shape, from the results presented so far.
In principle, there can be several different approaches to this problem.
• Algorithmic wombling. The standard approach is the algorithmic wombling proce-dure outlined in the introduction [29]. One applies a lower cut (tagging) on the magnitudes of the locally estimated gradient vectors and then suitably connects them (agglomeration). Although this approach has been subject to criticism [40], its modern implementation can perhaps benefit from some of the improvements which we have introduced here, in particular gradient rescaling (Sec. 3 • Use the correct ansatz for the shape of the boundary. At the cost of giving up modelindependence, one could focus on a particular theory model, derive the parametric form of the expected boundary shape, and then use that parametrization to test for the presence of such boundaries in the data. This approach was proved successful in specific event topologies motivated by supersymmetry [22,23,56], but relies on the experimenter being able to make the correct theory model assumption.
• Construct the envelope of the line segments with the largest wombling measures. As shown by the results in Figs. 17 and 24, when probing a curvilinear boundary with straight line segments, we obtain a whole family of wombling boundary candidates which can be tagged with their relatively large values ofΓ [C]. We saw that each of these candidates is tangential to the true boundary, therefore, the task of constructing the true boundary reduces to the task of finding a planar curve which is tangential to each of the tagged straight line candidate segments at some point. The answer to this problem is precisely the envelope curve [68].
• Identify and agglomerate "the best" line segments ∆ α . A specific realization of an approximate piece-wise reconstruction of the envelope curve mentioned above is offered by the following procedure. Note that our analysis not only tags straight lines C with large values of the wombling measure, but it also identifies which individual portions ∆ α of those lines are most likely to be tangential to a true boundary, as indicated by the rainbow-color coding of the line C w in Figs. 19, 22 and 25. This suggests that instead of working with the full line C tagged by the algorithm, we can instead focus our attention on the individual elements ∆ α from it with the largest contributions to the average flux through C. The straightforward application of this method, however, will reintroduce sensitivity to local gradient fluctuations. In order to avoid this, we propose to rank the individual elements ∆ α not by their average flux G α ·n C , as was done in Figs. 19, 22 and 25, but by the rescaled average flux where the power γ is a suitably chosen positive parameter, which interpolates smoothly between "complete locality" (γ = 0) and "complete globality" (γ → ∞). The scaling by a power ofΓ[C] in eq. (7.3) ensures that a given element ∆ α is judged not only by the local flux going through it, but also by its association with a suitable wombling boundary candidate line C. By increasing the power γ, we can suppress the effects of local statistical fluctuations, and in the limit of γ → ∞ we eventually recover our previous results where one would select all the segments ∆ α belonging to the best wombling candidate line C w . However, using finite values of γ allows us to 1) let in "the best" individual segments ∆ α from other candidate lines C which are not C w , but have comparably large values ofΓ[C], and 2) eliminate from consideration those individual segments ∆ α from the winning line C w whose local flux values are too low -presumably such elements belong to C w only because we have used the wrong ansatz for the shape of the boundary. This approach strikes the right balance between the "locality" of the flux through an individual element ∆ α and the "globality" (in the method of calculation) of the wombling measureΓ [C]. The procedure is illustrated in Fig. 26, where we show the individual line segments ∆ α whose rescaled flux values (7.3) are in the top 1 percentile, after scanning the (p in , p out ) parameter space on an 80 × 80 grid. We see that, despite using the wrong ansatz (straight lines), the circular boundary is reconstructed quite well, with only a few stragglers showing up in the bulk.

Significance estimation
In the previous sections we discussed different techniques for identifying wombling boundaries in point datasets. In applications to collider event data in high-energy physics, the presence of a wombling boundary could be indicative of new physics, if its location is in a region of phase space which is unremarkable from the point of view of the SM background. However, before claiming a discovery, one must be confident that such wombling boundaries cannot be accidentally generated by SM data alone. For this purpose, it is necessary to supplement any proposed wombling technique with a corresponding prescription for assessing the significance of any reconstructed wombling boundary. Since previous work [16,22,23] did not address this issue, we shall now do so using a frequentist approach. The end result of the wombling method described in Sec. 7 was the selection of the best possible wombling line candidate C w , together with its corresponding wombling measurē In order to use |Γ w | as our test statistic, we need to know its distribution under the purebackground hypothesis. For this purpose, we generate a number of pseudo-experiments where the point data is generated from the pure-background distribution (2.2), and for each pseudo-experiment, we repeat the wombling analysis from the previous section. A typical result from one such pseudo-experiment is shown in Fig. 27, where, in analogy to Figs. 18, 21 and 24, we show a heat map of |Γ[C]| in the (p in , p out ) parameter space. We see that, as expected, in the absence of a real signal the typical values for the wombling measure are relatively low almost everywhere in the (p in , p out ) parameter space, except at one very special location near (1, 3). It is not difficult to realize that this location corresponds to very short candidate lines C which "cut" the lower right corner of our field of view. We already alluded to this problem at the very end of Sec. 7.1.2, and our proposed solution was simply to exclude such very short lines from consideration 22 . Therefore, when we derive the |Γ w | 0.06 0.08 0.10 0.12 0.14 0.16 | w | 0  Fig. 1, with the points sampled from (2.5) with ρ = 1.5. In either case, the wombling measureΓ[C] was computed throughout the (p in , p out ) parameter space on an 80 × 80 grid and the best wombling line C w of minimum length √ 2/2 was identified following the procedure of Sec. 7.
distributions below, we shall apply a minimum cut on the allowed length of any candidate line C. In order to make sure this problem does not reappear, we shall conservatively increase our previous minimum length cut from 0.5 to √ 2/2, which is the length of the line connecting the midpoints of any two neighboring edges of our field of view.
With those preliminaries, we are ready to compare the distributions of our test statistic |Γ w | for signal and background. The blue histogram in Fig. 28 shows the |Γ w | distribution for 100 pure-background pseudo-experiments with N = 1000 points each, where the data was generated from the pure-background distribution (2.2) as in the left panel of Fig. 1. At the preprocessing stage, we increased the number of Lloyd iterations to 10, since we shall be interested in the case of relatively weak signals (see the related discussion in Sec. 7.1.2 and Ref. [16]). Following the procedure of Sec. 7, we then computed the wombling measurē Γ[C] on an 80 × 80 grid in the (p in , p out ) parameter space and the best wombling line C w (of minimum length √ 2/2) was identified and its wombling measure (8.1) was entered in the histogram. The resulting distribution is relatively concentrated around a mean of 0.08, and extends up to 0.11, which sets the lower limit on the target range for signal detection. For illustration, in Fig. 29 we show results for one typical pure-background pseudo-experiment whose value forΓ w is equal to the mean of the distribution shown in Fig. 28.  might be discoverable. Obviously, the larger the signal component, the more pronounced the wombling boundary. In our conventions, the signal strength was parametrized by the ρ parameter. For example, in Sec. 7.1.1 we saw that for ρ = 5 we obtainedΓ w ∼ 0.7, while the weaker signal with ρ = 1.5 in Sec. 7.1.2 resulted in onlyΓ w ∼ 0.19 (note that in Sec. 7.1.2 the data was preprocessed with only 5 Lloyd steps; adding 5 additional steps as was done in Fig. 28 would further reduce the value ofΓ w slightly). Given the purebackground distribution in Fig. 28, it is clear that in both of those examples the observed effect could not have been attributed to a background fluctuation and would represent a discovery. At the same time, a careful inspection of the left panel in Fig. 20 shows that the ρ = 1.5 example of Sec. 7.1.2 was rather "lucky" due to fortuitous fluctuations in the data near the theoretical boundary. In order to estimate the prospects for a more typical signal scenario, we simulate 100 pseudo-experiments with N = 1000 data points each, generated from the distribution (2.5) with ρ = 1.5. The corresponding distribution of the test statistic |Γ w | for those signal pseudo-experiments is shown with the orange histogram in Fig. 28. We see that, on average, the values of |Γ w | are larger in the presence of a signal -the mean of the orange histogram is shifted to 0.10. Comparing the tails of the two distributions, we find that in 40% of the cases, the signal is discoverable at 2 sigma and in 16% of the cases it is discoverable at 3 sigma. These prospects can probably be further improved by optimizing the different aspects of our wombling algorithm, but such an optimization is outside the scope of this paper.

Conclusions and outlook
In this paper we reviewed and refined the existing procedures for identifying wombling boundaries in point datasets. Our interest in this topic stems from the fact that high energy physics collider data can be viewed as point data in the relevant phase space of the final state signature. For better visual illustration, we considered point data examples in two dimensions, but our technique can be readily generalized to higher dimensional data. We proposed several modifications to the standard algorithm which lead to improved detection efficiency and significance: • We advocated the use of the Delaunay triangulation of the data instead of (or perhaps in addition to) the Voronoi tessellation of the data. We argued that the Delaunay tessellation is the natural framework for computing the local gradient vectors which is the first and most important step of any wombling algorithm.
• We considered three different techniques for reducing the effect of statistical fluctuations: 1. Rescaling the local gradient vectors as in (3.8), see Section 3.3.1.
2. Applying Voronoi relaxation of the data via several Lloyd iterations as a preprocessing step, see Section 3.3.2.
3. Local averaging of the gradient vectors, which can be performed either at the level of a Voronoi cell (3.9) or at the level of a Delaunay triangle (3.10), see Section 3.3.3.
• We studied new tagging variables (dot products of neighboring averaged gradient vectors) for selecting elements of the tessellation marking the location of a wombling boundary. In Section 4 we showed that the new variables have improved selection efficiency, since they take into account the spatial correlations among neighboring gradient vectors along the wombling boundary. In Section 5 we pointed out that the new variables additionally can be used to naturally connect the tagged elements into continuous boundaries.
• In Secs. 6 and 7 we explored the idea of Refs. [40,63] to rank wombling boundary candidates C by a global wombling measure, e.g., Γ[C] from (6.1) orΓ[C] from (6.2). On the basis of several toy examples we showed that this approach is successful in identifying the correct boundary, and with a slight modification (7.3) can be used even when the shape of the boundary is different from the assumed ansatz, see Sec. 7.2.2.
• In Sec. 8 we showed how one can estimate the statistical significance of any detected wombling boundary using a frequentist approach.
The present study complements and further expands the work of Refs. [16,22,23,25,56] in an interesting direction which, while popular in other fields of science, is still rather new to the field of high energy physics. We believe that our investigations here are only scratching the surface of what could be a very promising research thrust. In particular, the approach of treating high energy collider data as point data and studying its geometric properties is complementary to the existing binning techniques and in the long run could prove to be more suitable to the application of modern machine learning techniques [70,71].