Object condensation: one-stage grid-free multi-object reconstruction in physics detectors, graph, and image data

High-energy physics detectors, images, and point clouds share many similarities in terms of object detection. However, while detecting an unknown number of objects in an image is well established in computer vision, even machine learning assisted object reconstruction algorithms in particle physics almost exclusively predict properties on an object-by-object basis. Traditional approaches from computer vision either impose implicit constraints on the object size or density and are not well suited for sparse detector data or rely on objects being dense and solid. The object condensation method proposed here is independent of assumptions on object size, sorting or object density, and further generalises to non-image-like data structures, such as graphs and point clouds, which are more suitable to represent detector signals. The pixels or vertices themselves serve as representations of the entire object, and a combination of learnable local clustering in a latent space and confidence assignment allows one to collect condensates of the predicted object properties with a simple algorithm. As proof of concept, the object condensation method is applied to a simple object classification problem in images and used to reconstruct multiple particles from detector signals. The latter results are also compared to a classic particle flow approach.


Introduction
Accurately detecting a large number of objects belonging to a variety of classes within the same image has triggered very successful developments of deep neural network architectures and training methods [1,2,3,4,5,6,7]. Among these are two-stage detectors, where a first stage generates a set of candidate proposals, comparable to seeds, and a second stage determines the object properties. Even though two-stage approaches yield high accuracy, they are very resource demanding and comparatively slow. One-stage architectures, however, have proven to be just as powerful but with significantly lower resource requirements [5,8,9,10,11]. A large fraction of one- and two-stage detectors uses a grid of anchor boxes to attach object proposals directly to the anchors corresponding to the object in question. Ambiguities are usually resolved in a second step by evaluating the intersection over union score of the bounding boxes [12]. Recent anchor-free approaches replace the anchor boxes with key points, which are tightly coupled to the centre of the object [9,10].
Reconstructing and identifying objects (e.g. particles) from detector hits, for instance in a high-energy physics experiment, is in principle a similar task, in the sense that both rely on a finely grained set of individual inputs (e.g. pixels or detector hits) and infer higher-level object properties from them. However, a detector is made of several detector subsystems, each with its own signal interpretation and granularity. This, and the fact that particles often overlap, even such that certain hits are only fractionally assigned to a certain object, poses additional challenges. The reconstruction of individual particles often starts by identifying seeds, adding remaining hits using a certain class or quality hypothesis, and then assigning such clusters or hits to one or another object, such as in particle flow algorithms [13,14,15,16].
Only after this step are neural network based algorithms applied to each individual object to improve either the momentum resolution (regression) or the identification performance; recent examples are described in Refs. [17,18,19,20,21,22,23,24,25]. However, there is a large overlap in all these steps as far as the requirements on the algorithms are concerned, since all of them rely heavily on identifying the same patterns: the seed finding algorithm needs to employ pattern recognition with high efficiency, as does the segmentation (clustering) algorithm that assigns the right detector signals to the right objects on an object-by-object basis, driven by the seeds; the subsequent identification and momentum improvement algorithms also employ pattern recognition, but with higher-purity thresholds. Every individual step usually comes with a set of thresholds. After each threshold that is applied, the information available to the next step usually decreases. In an ideal case, however, the information should be retained and available until the object with all its properties is fully identified, since it can be valuable for the later steps.
Neural network based algorithms offer the possibility of retaining the information, and furthermore, there is a trend towards employing such algorithms for more tasks in high-energy physics, further towards the beginning of the reconstruction sequence. In this context, graph neural networks [26] are receiving increasing attention because they allow one to directly process detector inputs or particles, which are both sparse and irregular in structure [27,28,29]. However, when attempting to also incorporate the seeding step together with subsequent steps, the above-mentioned methods from computer vision are not directly applicable.
For anchor-based approaches, it has already been shown for image data that the detection performance is very sensitive to the anchor box sizes, aspect ratios and their density [5,3]. For detector signals, these factors are even more pronounced: the high-dimensional physical input space, very different object sizes, overlaps, and the highly variable information density are not well suited for anchor-based neural network architectures. Some shortcomings can be addressed by pixel-based object detection, such as that proposed in Refs. [9,10]; however, these approaches heavily rely on using the object centre as a key point. This key point is required to be well separated from other key points, which is not applicable to detector signals, where two objects that have an identical central point can be well resolvable.
Therefore, edge classifiers have been used so far in particle physics to separate an unknown number of objects in the data [30,31,32]. Here, an object is represented by a set of vertices in a graph that are connected with edges, each carrying a high connectivity score. While this method in principle resolves the issues mentioned above, it comes with stringent requirements on the architecture choice, on the pre-processing of the data, and at the inference step: it strictly requires a graph structure and therefore the corresponding neural networks for processing; at the pre-processing stage, all possibly true edges need to be inserted in the graph, such that they can be classified by the network; and the same connections need to be evaluated once classified to build the object in question. Moreover, the binary nature of an edge classification makes this approach less applicable to situations with large overlaps and fractional assignments.
Edge building and classification can be avoided by adapting a method originally proposed for image or point cloud segmentation in Ref. [33] and extended in Ref. [34]. In principle it already satisfies many requirements, but it focuses solely on segmentation and still relies on object centres. Objects are identified by clustering those pixels or 3-dimensional points belonging to a certain object by learning offsets to minimise the distance between the point and the object centre. The expected spatial extent of the object in the clustering space after applying this offset is also learnt, and it is inferred from the point or pixel with the highest seed score to eliminate ambiguities during inference. This seed score is also learnt and tightly coupled to the predicted distance to the object centre. Even though these methods rely on centre points and the natural space representation of the data (2-dimensional images or 3-dimensional point clouds), the general idea can be adapted to more complex inputs, such as physics detector signals, or other data with a large amount of overlap or only fractional assignment of points or pixels to objects. This paper describes this extension of the ideas summarised in Ref. [33] and Ref. [34] to objects without a clear definition of a centre by interpreting the segmentation in terms of physics potentials in a lower dimensional space than the input space. Moreover, the object condensation method proposed here allows one to simultaneously infer object properties, such as the object class or a particle momentum, by condensing the full information to be determined in one representative condensation point per object. The segmentation strength can be tuned and does not need to be exact. Therefore, the object condensation method can also be applied to overlapping objects without clear spatial boundaries.
Object condensation can be implemented through a dedicated loss function and truth definition as detailed in the following. Since these definitions are mostly independent of the network architecture, this paper focuses on describing the training method in detail and provides an example of an application to object identification and segmentation in an image as proof of concept.

Encoding in neural network training
Object condensation relies on the fact that a reasonable upper bound on the number of objects in an image, point cloud or graph is the number of pixels, points or vertices (or edges), respectively. This means that in this limit an individual pixel, point or vertex can accumulate and represent all features of an entire object. Even with a smaller number of objects, this idea is a central ingredient of object condensation and is used to define the ground truth. At the same time, the number of objects can be as small as one.
To define the ground truth, every pixel, point, edge or vertex (in the following referred to as vertex only) is assigned to exactly one object to be identified. This assignment should be as simple as possible, e.g. a simple pixel assignment for image data, or an assignment by highest fraction for fractional affinity between objects and vertices. Keeping this assignment algorithm simple is crucial for fast training convergence, and more important than assigning a similar number of vertices to each object. In practice, an object in an image that is mostly behind another object might, for example, have just a few vertices assigned to it. The vertices assigned in this way now carry all object properties to be predicted, such as object class, position, bounding box dimensions or shape, etc., in the following referred to as $t_i$ for vertex $i$. The deep neural network should be trained to predict these features, denoted by $p_i$. Subsets of these features might require different loss functions. For simplicity, their combination is generalised as $L_i(t_i, p_i)$ in the following.
Those $N_B$ vertices out of $N$ total vertices that are not assigned to an object are marked as background or noise, with $n_i = 1$ if $i$ is a noise vertex and 0 otherwise. The total number of objects is denoted by $K$, and the total number of vertices associated with an object by $N_F$.
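The truth definition above can be sketched in a few lines of NumPy. This is only an illustration under assumed conventions: the function names, the use of -1 as the noise label, and the dense membership matrix are choices made here and are not taken from the paper.

```python
import numpy as np

def assign_by_highest_fraction(fractions):
    """Assign each vertex to the object with the highest fraction.

    fractions: (N, K) array of fractional affinities (hypothetical input).
    Vertices with no affinity to any object get the label -1 (noise).
    """
    fractions = np.asarray(fractions, dtype=float)
    object_ids = np.argmax(fractions, axis=1)
    object_ids[fractions.sum(axis=1) == 0.0] = -1
    return object_ids

def encode_truth(object_ids):
    """Build the noise flags n_i and the membership matrix M_ik.

    object_ids: (N,) integer labels, -1 marks noise (assumed convention).
    Returns is_noise with shape (N,) and membership with shape (N, K).
    """
    object_ids = np.asarray(object_ids)
    is_noise = (object_ids < 0).astype(float)            # n_i
    labels = np.unique(object_ids[object_ids >= 0])      # K distinct objects
    membership = (object_ids[:, None] == labels[None, :]).astype(float)
    return is_noise, membership

# Usage: six vertices, two objects (labels 0 and 2), one noise vertex
is_noise, membership = encode_truth([0, 0, 2, -1, 2, 2])
n_background = int(is_noise.sum())       # N_B
n_foreground = int(membership.sum())     # N_F
n_objects = membership.shape[1]          # K
```

Each row of the membership matrix contains at most a single 1, which reflects the requirement that every vertex is assigned to exactly one object or to noise.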
To assign a vertex to the corresponding object and aggregate its properties in a condensation point, the network is trained to predict a scalar quantity per vertex $0 \le \beta_i \le 1$, which is a measure of $i$ being a condensation point, mapped through a sigmoid activation. The value of $\beta_i$ is also used to define a charge $q_i$ per vertex $i$ through a function with zero gradient at 0 and monotonically increasing gradient towards a pole at 1. Here, the function is chosen to be
$$q_i = \operatorname{arctanh}^2 \beta_i + q_{\min}. \tag{1}$$
The strictly concave behaviour also assures a well defined minimum for $\beta_i$, which will be discussed later. The scalar $q_{\min} \approx [0.1, 1]$ is a hyperparameter taking the place of a minimum charge. The charge $q_i$ of each vertex belonging to an object $k$ defines a potential $V_{ik}(x) \propto q_i$, where $x$ are coordinates in a learnable clustering space. The force affecting vertex $j$ belonging to an object $k$ can, for example, then be described by
$$q_j \cdot \nabla V_k(x_j) = q_j \cdot \nabla \sum_{i=1}^{N} M_{ik} V_{ik}(x_j, q_i), \tag{2}$$
with $M_{ik}$ being 1 if vertex $i$ belongs to object $k$ and 0 otherwise. In principle, this introduces matrices with $N \times N$ dimensions in the loss, which can easily be very resource demanding. Therefore, the potential of object $k$ is approximated by the potential of the vertex $\alpha$ belonging to object $k$ that has the highest charge:
$$V_k(x) \approx V_{\alpha k}(x, q_{\alpha k}), \quad \text{with} \quad q_{\alpha k} = \max_i q_i M_{ik}. \tag{3}$$
Finally, an attractive ($\breve{V}_k(x)$) and a repulsive ($\hat{V}_k(x)$) potential are defined as
$$\breve{V}_k(x) = \lVert x - x_\alpha \rVert^2 \, q_{\alpha k}, \tag{4}$$
$$\hat{V}_k(x) = \max(0, 1 - \lVert x - x_\alpha \rVert) \, q_{\alpha k}. \tag{5}$$
Here, $\lVert \cdot \rVert$ is the L2 norm. The attractive potential acts on a vertex $i$ belonging to object $k$, while the repulsive potential applies if the vertex does not belong to object $k$. The attractive term ensures a monotonically growing gradient with respect to $\lVert x - x_\alpha \rVert$. The repulsive term is a hinge loss that scales with the charge, avoiding a potential saddle point at $x = x_\alpha$, and creating a gradient up to $\lVert x - x_\alpha \rVert = 1$. By combining both terms, the total potential loss $L_V$ takes the form
$$L_V = \frac{1}{N} \sum_{j=1}^{N} q_j \sum_{k=1}^{K} \left( M_{jk} \breve{V}_k(x_j) + (1 - M_{jk}) \hat{V}_k(x_j) \right). \tag{6}$$
In this form, the potentials ensure that vertices belonging to the same object are pulled towards the condensation point with the highest charge, and vertices not belonging to the object are pushed away up to a distance of 1, until the system is in the state of lowest energy. The property $\breve{V}_k(x) \to \infty$ for $x \to \infty$ allows one to detach the clustering space completely from the input space, since wrongly assigned vertices receive a penalty that increases with the separability of the remaining vertices belonging to the different objects. Furthermore, the interpretation as potentials circumvents class imbalance effects, e.g. from a large contribution of background vertices with respect to foreground vertices. Since both potentials are rotationally symmetric in $x$, the lowest dimensionality for $x$ that ensures a monotonically falling path to the minimum is 2. As illustrated in Figure 1, apart from a few saddle points, the vertex is pulled consistently towards the object condensation point.
Besides its advantages with respect to computational resources, building the potentials from the highest-charge condensation point has another advantage: if instead, for example, the mean of the vertices were used as an effective clustering point, this point would initially be the same for all objects. For large $N$, a local minimum is then given by a ring or hypersphere (depending on the dimensionality of $x$) in which all vertices have the same distance to the centre. This symmetry is immediately broken by focusing on the highest-charge vertex only.
An obvious minimum of $L_V$ is given for $q_i = q_{\min} \; \forall i$, or equivalently $\beta_i = 0 \; \forall i$. To counteract this behaviour and to enforce one condensation point per object, and none for background or noise vertices, the following additional loss term $L_\beta$ is defined:
$$L_\beta = \frac{1}{K} \sum_{k} (1 - \beta_{\alpha k}) + s_B \frac{1}{N_B} \sum_{i=1}^{N} n_i \beta_i, \tag{7}$$
where $s_B$ is a hyperparameter describing the background suppression strength, which needs to be tuned to the dataset. It should be low, for example, in cases where not all objects are correctly labelled as such. The linear scaling of these penalty terms together with Eq. 1 helps to balance the individual loss terms.
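As an illustration of the charge, the two potentials, and the terms $L_V$ and $L_\beta$ described above, the following NumPy sketch evaluates them for a single event. It assumes the charge $q_i = \operatorname{arctanh}^2 \beta_i + q_{\min}$, a quadratic attractive potential, and a hinge-shaped repulsive potential, all scaled by the charge of the highest-charge vertex of each object. The function name, argument names, default values and the clipping constant are choices made for this sketch; an actual training setup would express the same operations in an automatic differentiation framework.

```python
import numpy as np

def condensation_losses(beta, x, membership, is_noise, q_min=0.5, s_b=1.0):
    """Forward evaluation of the potential loss L_V and the beta loss L_beta.

    beta:       (N,)   condensation score per vertex, in [0, 1]
    x:          (N, D) coordinates in the clustering space
    membership: (N, K) 1 if vertex i belongs to object k, else 0
    is_noise:   (N,)   1 for noise vertices, else 0
    """
    beta = np.clip(np.asarray(beta, dtype=float), 0.0, 1.0 - 1e-6)  # keep arctanh finite
    x = np.asarray(x, dtype=float)
    membership = np.asarray(membership, dtype=float)
    is_noise = np.asarray(is_noise, dtype=float)

    q = np.arctanh(beta) ** 2 + q_min                   # charge per vertex

    # Highest-charge vertex alpha of each object (requires q_min > 0)
    alpha = np.argmax(q[:, None] * membership, axis=0)  # (K,)
    q_alpha, x_alpha = q[alpha], x[alpha]

    # Distance of every vertex to every condensation point, shape (N, K)
    dist = np.linalg.norm(x[:, None, :] - x_alpha[None, :, :], axis=-1)
    v_att = dist ** 2 * q_alpha                         # attractive potential
    v_rep = np.maximum(0.0, 1.0 - dist) * q_alpha       # repulsive hinge potential

    per_vertex = np.sum(membership * v_att + (1.0 - membership) * v_rep, axis=1)
    l_v = np.mean(q * per_vertex)

    l_beta = np.mean(1.0 - beta[alpha])                 # one condensation point per object
    n_b = is_noise.sum()
    if n_b > 0:
        l_beta += s_b * np.sum(is_noise * beta) / n_b   # suppress beta for noise
    return l_v, l_beta

# Usage: two well separated objects with one confident vertex each, plus one noise vertex
beta = [0.99, 0.0, 0.99, 0.0, 0.0]
x = [[0, 0], [0, 0], [5, 5], [5, 5], [10, -10]]
membership = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]]
is_noise = [0, 0, 0, 0, 1]
l_v, l_beta = condensation_losses(beta, x, membership, is_noise)
```

Only the highest-charge vertex of each object enters the potentials, so the intermediate arrays have the shape (N, K) rather than (N, N), which is the resource argument made in the text.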
Finally, the loss terms $L_i(t_i, p_i)$ are also weighted by $\operatorname{arctanh}^2 \beta_i$, such that they scale similarly with $\beta$ as the charge:
$$L_p = \frac{1}{\sum_{i=1}^{N} \xi_i} \sum_{i=1}^{N} L_i(t_i, p_i) \, \xi_i, \quad \text{with} \quad \xi_i = (1 - n_i) \operatorname{arctanh}^2 \beta_i. \tag{8}$$
As a consequence of this scaling, a condensation point will form the centre of the object in $x$ through $L_V$ and simultaneously carry the most precise estimate of the object's properties through $L_p$. In practice, individual loss terms might need to be weighted differently, which leads to the total loss
$$L = L_p + s_c (L_\beta + L_V). \tag{9}$$
The terms $L_\beta$ and $L_V$ counterbalance each other through $\beta$, with the exception of the weight $s_B$. This leads to the following hyperparameters:
• The minimum charge $q_{\min}$, which can be used to increase the gradient performing the segmentation, and therefore allows a smooth transition between a focus on predicting object properties (low $q_{\min}$) and a focus on segmentation (high $q_{\min}$).
• The relative weight $s_c$ of the condensation loss with respect to the property loss terms, which is partially correlated with $q_{\min}$.
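The weighting of the property loss and its combination with the condensation terms can be sketched as follows. The sketch assumes that the per-vertex losses $L_i(t_i, p_i)$ are already available as an array, that noise vertices receive zero weight, and that the total loss has the form $L_p + s_c (L_\beta + L_V)$. The function names and the small constant added to the denominator for numerical safety are choices made here.

```python
import numpy as np

def property_loss(per_vertex_loss, beta, is_noise, eps=1e-6):
    """Average of the per-vertex losses, weighted by arctanh^2(beta).

    per_vertex_loss: (N,) values of L_i(t_i, p_i), e.g. a cross-entropy per vertex
    beta:            (N,) condensation score per vertex
    is_noise:        (N,) 1 for noise vertices, which get zero weight
    eps is an assumption of this sketch; it guards the clip and the division.
    """
    per_vertex_loss = np.asarray(per_vertex_loss, dtype=float)
    beta = np.clip(np.asarray(beta, dtype=float), 0.0, 1.0 - eps)
    xi = (1.0 - np.asarray(is_noise, dtype=float)) * np.arctanh(beta) ** 2
    return np.sum(per_vertex_loss * xi) / (np.sum(xi) + eps)

def total_loss(l_p, l_beta, l_v, s_c=1.0):
    """Combine the property loss with the condensation terms, weighted by s_c."""
    return l_p + s_c * (l_beta + l_v)

# Usage: the noise vertex with a large loss value does not contribute
l_p = property_loss([1.0, 3.0, 100.0], beta=[0.5, 0.5, 0.9], is_noise=[0, 0, 1])
loss = total_loss(l_p, l_beta=0.5, l_v=0.25, s_c=2.0)
```

Because the weights grow steeply as $\beta$ approaches 1, the vertices that become condensation points dominate the property loss, which is why they end up carrying the most precise property estimates.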

Inference
During inference, the calculation of the loss function is not necessary. Instead, potential condensation points are identified by considering only vertices with $\beta$ above $t_\beta \approx 0.1$ as condensation point candidates.

Example application to image data
As a proof of concept, the method is applied to image data, aiming to classify objects in a 64 × 64 pixel image. Each image is generated using the skimage package [35] and contains up to 9 objects (circles, triangles, and rectangles). All objects are required to have a maximum overlap of 90%, and to have a width and height between 21 and 32 pixels. For the classification, a standard categorical cross-entropy loss is used. The clustering space is chosen to be 2-dimensional, and all other loss parameters follow the prescription in Section 2.
Since this is a proof of concept example, the architecture of the deep neural network is simple: it consists of 3 main blocks of feature reduction and standard convolutional layers [36]. The reduction branch takes the full image as input and applies a convolutional layer with a kernel of 4 × 4, followed by max pooling over 2 × 2 adjacent pixels. This configuration is repeated 3 or 4 times with a different number of filters for each convolutional layer. At the end of the reduction, the image is upsampled back to its original size. One of the three network blocks consists of the following sequence of layers:
• Batch normalisation
• Concatenate the average of all pixel features to each pixel
This configuration is repeated 3 times, with the kernel size of the convolutional layer increased in each repetition. Finally, the output is fed through 3 dense layers with 128, 64, and 64 nodes, respectively. All layers use elu activation functions [37].
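The step that concatenates the average of all pixel features to each pixel can be illustrated with a small NumPy sketch. The function name and the channels-last layout (height, width, channels) are assumptions made here; in the actual network this operation would be expressed with the layers of the chosen deep learning framework.

```python
import numpy as np

def concat_global_average(features):
    """Append the image-wide mean of every feature to each pixel.

    features: (H, W, C) array of per-pixel features (assumed layout).
    Returns an (H, W, 2C) array: the original features followed by the means.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=(0, 1), keepdims=True)   # shape (1, 1, C)
    tiled = np.broadcast_to(mean, features.shape)      # repeat the mean for every pixel
    return np.concatenate([features, tiled], axis=-1)

# Usage: a 2 x 2 image with 3 features per pixel
out = concat_global_average(np.arange(12).reshape(2, 2, 3))
```

This gives every pixel access to global information about the image in addition to its local features.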
In total, 750,000 training images are generated and the network is trained with a batch size of 200 using the Adam [38] optimiser for 20 epochs with a learning rate of $5 \cdot 10^{-4}$.
The thresholds for the selection of condensation points after inference are chosen as $t_d = 0.8$ and $t_\beta = 0.1$. An example image is shown in Figure 2 with predicted classes, alongside a visualisation of the clustering space spanned by $x$. The individual objects in this proof of principle application are well identified, also for images with a different number of objects than the one shown here. The condensation points are clearly visible and well separated in the clustering space, which underlines the fact that the values of $t_d$ and $t_\beta$ do not require particular fine tuning. As a result, the object segmentation also works very well. It is particularly noteworthy that in none of the cases is the object centre identified as the best condensation point; instead, points at the edges are chosen, some of them with a larger distance to other similar objects.

Summary
The object condensation method described in this paper allows one to detect the properties of an unknown number of objects in an image, point cloud, or particle physics detector without explicit assumptions on the object size or the sorting of the objects. It does not require any anchor boxes, nor a prediction of the cardinality or a permutation of the objects. Moreover, it generalises naturally to point cloud or graph data by using the input structure itself to determine potential condensation points. While the training of a deep neural network might require more resources, the inference algorithm does not add any significant overhead with respect to the deep neural network itself and is therefore also suited for time-critical applications.

Figure 1: Illustration of the effective potential that affects a vertex belonging to the condensation point of the central object, in the presence of 3 other condensation points around it.

Figure 2: Left: input image with prediction overlay. The representative pixels are highlighted, and their colour coding indicates the predicted classification: green (triangle), red (rectangle), blue (circle). Right: clustering space. The object colours are the same as in the left image; the background pixels are coloured in gray. The alpha value indicates $\beta$, with a minimum alpha of 0.05, such that background pixels are visible.
The $\beta$ threshold leaves a similar number of condensation point candidates as objects. These candidates are sorted in descending order in $\beta$. Starting from the highest-$\beta$ vertex, all vertices within a distance of $t_d < 1$ in $x$ are assigned to that condensation point, and the object properties are taken from that condensation point. Each subsequent vertex is considered for the final list of condensation points if it has a distance of at least $t_d$ in $x$ to each vertex that has already been added to this list. The threshold $t_d \approx 1$ is closely coupled to the repulsive potential defined in Eq. 5, which has a sharp gradient turn-on at a distance of 1 with respect to the condensation point. The thresholds $t_\beta$ and $t_d$ do not require a high level of fine tuning, since objects that would be double-counted because $t_\beta$ is set too low are removed by an adequate choice of $t_d$.
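The inference procedure can be written as a short greedy loop. The following NumPy sketch is one possible reading of it: the function name, the return format, and the rule that a vertex already claimed by a higher-$\beta$ condensation point is not reassigned are assumptions of this sketch. The default thresholds are the values quoted for the image example.

```python
import numpy as np

def collect_condensates(beta, x, t_beta=0.1, t_d=0.8):
    """Select condensation points and assign vertices to them.

    beta: (N,)   predicted condensation score per vertex
    x:    (N, D) predicted coordinates in the clustering space
    Returns the vertex indices of the condensation points and, per vertex,
    the index of its condensation point in that list (-1 if unassigned).
    """
    beta = np.asarray(beta, dtype=float)
    x = np.asarray(x, dtype=float)

    condensation_points = []
    for i in np.argsort(-beta):                 # descending order in beta
        if beta[i] <= t_beta:
            break                               # remaining vertices are below threshold
        # Keep the candidate only if it is at least t_d away from all accepted points
        if all(np.linalg.norm(x[i] - x[c]) >= t_d for c in condensation_points):
            condensation_points.append(int(i))

    assignment = np.full(len(beta), -1)
    for idx, c in enumerate(condensation_points):
        dist = np.linalg.norm(x - x[c], axis=1)
        free = (assignment == -1) & (dist < t_d)   # do not reassign claimed vertices
        assignment[free] = idx
    return condensation_points, assignment

# Usage: two clusters in the clustering space and one isolated low-beta vertex
beta = [0.9, 0.2, 0.05, 0.8, 0.3, 0.02]
x = [[0, 0], [0.1, 0], [0, 0.2], [3, 3], [3.1, 3], [10, 10]]
points, assignment = collect_condensates(beta, x)
# points is [0, 3]; vertices 0-2 go to the first point, 3-4 to the second, 5 stays -1
```

The predicted properties of each object would then be read off at the vertex indices returned in the first list.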

• Reduction block with 4 convolutional layers with 32, 48, 64, and 96 filters
• Concatenate the output of the reduction block and the original image
• Convolutional layer with 64 filters and a kernel size of 3 × 3, 4 × 4, or 5 × 5