Weakly Supervised Classification in High Energy Physics

As machine learning algorithms become increasingly sophisticated to exploit subtle features of the data, they often become more dependent on simulations. This paper presents a new approach called weakly supervised classification in which class proportions are the only input into the machine learning algorithm. Using one of the most challenging binary classification tasks in high energy physics - quark versus gluon tagging - we show that weakly supervised classification can match the performance of fully supervised algorithms. Furthermore, by design, the new algorithm is insensitive to any mis-modeling of discriminating features in the data by the simulation. Weakly supervised classification is a general procedure that can be applied to a wide variety of learning problems to boost performance and robustness when detailed simulations are not reliable or not available.


Introduction
With the increasing complexity of theoretical models and computing power, many scientific projects increasingly rely on simulations to design analysis techniques. This is especially true for high energy particle physics, where high fidelity Monte Carlo (MC) simulation is used to model physical processes at distances ranging from $10^{-25}$ meters all the way to the macroscopic dimensions of detectors. However, just as models have become more complex, analysis techniques have also become more sophisticated. Numerous, often subtle, features of events are combined using powerful supervised learning algorithms trained on large simulated (labeled) datasets. Despite these advances, there is no guarantee that techniques highly optimized in simulation are also optimal in nature. One of the most ubiquitous analysis procedures is classifying events as originating from one of two different processes. It is sometimes the case that one knows the proportions of each class better than the properties of each class that are useful for classification. Weakly supervised classification is a new machine learning paradigm for classification where training is performed directly on (unlabeled) data.
The task of training a classifier on multidimensional data based only on class proportions is highly under-constrained. Neural network training is already a non-convex problem, but removing all local information about class labels significantly increases the difficulty of optimization. However, the field of multi-instance learning (MIL) [1] has shown that local information is not necessarily needed for classification. The setup of MIL is a series of sets ('bags') of individual instances without individual labels. Consider the task of distinguishing two classes, called A and B. For the training set, it is known if a bag contains at least one instance of class A. The algorithm is then optimized to identify the presence of at least one instance of class A in an unseen bag. Recent work has extended this procedure to identify the class of individual instances, still only training on bag-level labels [3]. In this paper we make supervision even weaker in that bag-labels are only known on average. In particular, all that is known to the training is the expected fraction of class A in any particular bag. This paradigm is also referred to as Learning with Label Proportions (LLP) [4].
High energy quarks and gluons produced in reactions at the Large Hadron Collider (LHC) result in collimated streams of particles traveling at nearly the speed of light, known as jets. One of the most challenging classification tasks in high energy physics is to distinguish quark-induced jets from gluon-induced jets based on their radiation pattern. There is an extensive literature exploring discriminating observables [5]. However, standard quark and gluon discriminants are known to be poorly modeled by state-of-the-art simulations [6,7]. Despite this, the fraction of quark and gluon jets in a given sample is often well-known. At a fixed order in perturbation theory, the probability for an out-going parton to be a quark or a gluon depends on well-known parton distribution functions and matrix elements. Quark versus gluon discrimination is therefore well suited for weakly supervised classification and is the main example used later in this paper.
This paper is organized as follows. Section 2 formally introduces weakly supervised classification and describes how it is applied in practice. The technique is illustrated using quark versus gluon jet tagging in Sec. 3. The paper ends in Sec. 4 with some concluding remarks. The source code implemented to produce the results presented in this paper is available at DOI: 10.5281/zenodo.322813.

Weakly supervised classification
Given a set of data originating from two classes labeled 0 and 1, the goal of classification is to construct a function $f : \mathbb{R}^n \to \{0, 1\}$, where $n$ is the dimensionality of the feature space used to discriminate the two classes. In the traditional classification paradigm of fully supervised training, the function $f_\text{full}$ is built by minimizing a loss function like the following:

$$f_\text{full} = \underset{f' : \mathbb{R}^n \to \{0,1\}}{\operatorname{argmin}} \sum_{i=1}^{N} \ell\big(f'(\vec{x}_i) - t_i\big), \tag{2.1}$$

where $N$ is the number of labeled data available for training, $\ell$ is a loss function with $\lim_{x\to 0} \ell(x) = 0$, and $t_i$ is the true label of example $i$. A common loss function is the squared error. In order to provide flexibility and stability, one often modifies the original problem to take $f : \mathbb{R}^n \to [0, 1]$ and the output is interpreted as a probability for an event to be in class 0 or 1. The ideal classifier that one tries to approximate with Eq. 2.1 is based on the likelihood ratio $p(\vec{x}|0)/p(\vec{x}|1)$, where $p(\vec{x}|i)$ is the $n$-dimensional probability density for the feature vector $\vec{x}$ for the class $i \in \{0, 1\}$. Weakly supervised classification is a new paradigm in which instead of knowing the $t_i$, all that is known is the proportion of events in either class: $y = \sum_i t_i / N$. Thus, the weakly supervised $f_\text{weak}$ is given by

$$f_\text{weak} = \underset{f' : \mathbb{R}^n \to [0,1]}{\operatorname{argmin}}\; \ell\left(\sum_{i=1}^{N} \frac{f'(\vec{x}_i)}{N} - y\right). \tag{2.2}$$

The argument of Eq. 2.2 is non-convex, with many minima. In particular, the trivial solution $f(\vec{x}) = y$ results in a loss of zero. However, using multiple batches of data with different proportions $y_k$ is sufficient to collapse the solution space, so long as $p(\vec{x}|i; k) = p(\vec{x}|i)$, i.e. the distribution of the discriminating features for a particular class is the same in every batch $k$. To build intuition for why there is any hope to solve this problem, consider a case where there are two batches A and B with proportions $y_A$ and $y_B$. Consider an $n$-dimensional histogram where the $i$th dimension captures a discretized version of the $i$th discriminating feature. If the $i$th dimension has $m_i$ bins, then the total number of bins in the histogram is $M = \prod_{i=1}^{n} m_i$.
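The role of multiple batches can be checked numerically. The following is a minimal sketch, assuming a squared-error choice for the loss so that the loss for one batch is (mean prediction − y)²; the batch size, the two proportions, and the function names are hypothetical choices made for illustration only.

```python
import numpy as np

def weak_loss(pred, y):
    """Squared-error proportion loss for one batch: (mean prediction - y)^2."""
    return (np.mean(pred) - y) ** 2

def total_weak_loss(preds_per_batch, proportions):
    """Sum of the per-batch proportion losses."""
    return sum(weak_loss(p, y) for p, y in zip(preds_per_batch, proportions))

n_events = 1000            # hypothetical batch size
proportions = [0.2, 0.4]   # hypothetical class-1 fractions y_k

# Single batch: the constant output f(x) = y already gives zero loss.
loss_trivial = weak_loss(np.full(n_events, proportions[0]), proportions[0])

# Two batches: one constant cannot match both proportions at once.
constant_preds = [np.full(n_events, 0.3) for _ in proportions]
loss_constant = total_weak_loss(constant_preds, proportions)

# A perfect classifier (output equal to the true label) matches every batch.
perfect_preds = []
for y in proportions:
    n1 = int(round(y * n_events))
    perfect_preds.append(np.concatenate([np.ones(n1), np.zeros(n_events - n1)]))
loss_perfect = total_weak_loss(perfect_preds, proportions)

print(loss_trivial, loss_constant, loss_perfect)
```

With a single batch the constant output is indistinguishable from a perfect classifier, while with two different proportions only the latter reaches zero loss.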
One can always rearrange bins so that instead of an $n$-dimensional histogram with $m_i$ bins in the $i$th dimension, there is a one-dimensional histogram with $M$ bins. As visualizing high dimensional histograms can be cumbersome, let $h_A$ be the one-dimensional histogram with $M$ bins for batch A and $h_B$ be the corresponding histogram for batch B. Then, for each bin $i$, one can write

$$h_{A,i} = y_A\, h_{1,i} + (1 - y_A)\, h_{0,i}, \qquad h_{B,i} = y_B\, h_{1,i} + (1 - y_B)\, h_{0,i}, \tag{2.3}$$

where $h_{X,i}$ is the content of the $i$th bin of the histogram $h_X$. Except for contrived scenarios, Eq. 2.3 will have a unique solution for $h_{0,i}$ and $h_{1,i}$, which are discretized versions of the probability densities $p(\vec{x}|0)$ and $p(\vec{x}|1)$. One can then form an (approximately) optimal classifier from the ratio of histograms with bin contents $h_{0,i}/h_{1,i}$. If the number of dimensions is large, one can add a further step to use machine learning to approximate the optimal classifier from $h_{0,i}$ and $h_{1,i}$. As a result, the problem is completely solvable. Weakly supervised training combines the classification step with the first step and does so without binning. Solving Eq. 2.3 'by-hand' is intractable when $n$ is relatively large or the number of examples is relatively small. It is also complicated when there are more than two batches (overconstrained). These challenges are all naturally handled by the all-in-one machine learning approach of weakly supervised classification, as illustrated below.
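The two-batch histogram argument can be sketched in a few lines. The example assumes that each normalized batch histogram is the mixture h_X = y_X h_1 + (1 − y_X) h_0 of the class-1 and class-0 templates, and solves the resulting 2×2 linear system for all bins at once. The template values and proportions are made-up numbers, and `unmix_histograms` is a hypothetical helper name.

```python
import numpy as np

def unmix_histograms(h_a, h_b, y_a, y_b):
    """Solve h_X = (1 - y_X) * h0 + y_X * h1 for the class templates h0 and h1."""
    mix = np.array([[1.0 - y_a, y_a],
                    [1.0 - y_b, y_b]])   # singular when y_a == y_b
    h0, h1 = np.linalg.solve(mix, np.vstack([h_a, h_b]))
    return h0, h1

# Made-up normalized templates with M = 3 bins.
h0_true = np.array([0.5, 0.3, 0.2])
h1_true = np.array([0.1, 0.3, 0.6])
y_a, y_b = 0.2, 0.4

# Batch histograms as exact mixtures (no statistical fluctuations).
h_a = (1.0 - y_a) * h0_true + y_a * h1_true
h_b = (1.0 - y_b) * h0_true + y_b * h1_true

h0, h1 = unmix_histograms(h_a, h_b, y_a, y_b)
ratio = h0 / h1   # binned likelihood ratio used as the classifier
print(h0, h1, ratio)
```

With finite statistics the extracted templates fluctuate and individual bins can even become negative, increasingly so as y_A approaches y_B, which is one reason the binned solution becomes impractical in many dimensions.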
In the weakly supervised training used in the following examples, f in Eq. 2.2 is parametrized as a three-layer neural network with three inputs, a hidden layer with 30 neurons, and a sigmoid output. We use the Adam optimizer [8] in Keras [9] with a learning rate of 0.009 and train for 25 iterations. As a reference, we consider a traditional classifier where $t_i$ labels the individual instances and f is parametrized as a three-layer neural network with three inputs, a hidden layer with 10 neurons, and a sigmoid output. Minimization is performed with stochastic gradient descent in Keras with a learning rate of 0.01 run for 40 iterations. For each training, both networks are initialized with random weights following a normal distribution.
Figure 1 shows the weakly supervised classifier performance when training with 9 subsets of data with proportions between 0.2 and 0.4 compared with that of the fully supervised one. Three features, labeled 1-3, are constructed so that the distribution of feature $i$ given class $j$ follows a normal distribution with mean $\mu_{ij}$ and standard deviation $\sigma_{ij}$. For reference, the values of $\mu_{ij}$ and $\sigma_{ij}$ used for the example shown in Fig. 1 are in Table 1. Both the traditional and weakly supervised classifiers have the same Receiver Operating Characteristic (ROC) curve and thus have identical classification performance.
Table 1: Mean (µ) and standard deviation (σ) values of the normal distributions for class 0 and 1 of each feature.
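The toy setup can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the Gaussian parameters are placeholders (not the Table 1 values), the batch size and tanh hidden activation are assumptions, and the network is only evaluated with random normal weights, not trained. It shows how 9 batches with proportions between 0.2 and 0.4 feed a squared-error proportion loss.

```python
import numpy as np

# Placeholder Gaussian parameters: rows are classes 0 and 1, columns are features.
MU = np.array([[0.0, 1.0, 2.0], [1.0, 2.0, 3.0]])     # assumed, not Table 1
SIGMA = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])  # assumed, not Table 1

def make_batch(n_events, y, rng):
    """Draw one batch with a fraction y of class-1 examples."""
    n1 = int(round(y * n_events))
    labels = np.concatenate([np.zeros(n_events - n1, dtype=int),
                             np.ones(n1, dtype=int)])
    return rng.normal(MU[labels], SIGMA[labels]), labels

def init_network(rng, n_in=3, n_hidden=30):
    """Random normal weights for a single-hidden-layer network."""
    return {"W1": rng.normal(size=(n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(size=(n_hidden, 1)), "b2": np.zeros(1)}

def forward(params, x):
    """Hidden layer (tanh, an assumed activation) followed by a sigmoid output."""
    hidden = np.tanh(x @ params["W1"] + params["b1"])
    z = hidden @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-z[:, 0]))

rng = np.random.default_rng(1)
proportions = np.linspace(0.2, 0.4, 9)
batches = [make_batch(2000, y, rng) for y in proportions]
params = init_network(rng)

# Proportion loss: only the batch means of the outputs enter, never the labels.
outputs = [forward(params, x) for x, _ in batches]
loss = sum((np.mean(f) - y) ** 2 for f, y in zip(outputs, proportions))
print(loss)
```

Training would minimize `loss` with respect to `params`; the per-example labels returned by `make_batch` are kept only for evaluating performance afterwards.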
Note that the loss for weakly supervised classification is symmetric with respect to swapping the class assignment. The classifier output from a given training can therefore give higher values for class 0, while a different training may give higher values for class 1.
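One practical consequence is that replacing an output f by 1 − f turns an AUC of a into 1 − a, so the orientation has to be fixed before trainings are compared. The sketch below uses a rank-based AUC on made-up scores. The `needs_flip` check is one possible label-free convention, not something described in this paper: it flips the output if its mean is lower in the batch with the larger class-1 fraction.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a class-1 example outscores a class-0 one (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

def needs_flip(scores_low_y, scores_high_y):
    """True if the batch with more class-1 events has the smaller mean output."""
    return np.mean(scores_high_y) < np.mean(scores_low_y)

# Made-up scores and labels for illustration.
labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])

auc_original = auc(scores, labels)
auc_swapped = auc(1.0 - scores, labels)   # same classifier, classes swapped
print(auc_original, auc_swapped)
```

The two AUC values sum to one, so either orientation carries the same discrimination power.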
As with any machine learning algorithm with inherent randomness, the performance of a weakly supervised classifier has a stochastic component. This is quantified by retraining the same network many times with different random number seeds in each iteration. The interquartile range (IQR) over the Area Under the Curve (AUC) values for each training is a measure of the spread due to the inherent randomness. Figure 2 shows the AUC IQR for the toy example with one proportion fixed to 0.2 and the second proportion scanned from 0.2 to 0.7. The stability improves as the difference between the class proportions increases. In addition to the performance varying less as the proportions are further apart, the overall performance quantified by the median AUC (denoted by $\langle\text{AUC}\rangle$) also improves (increases). The improvement in the median AUC is not as dramatic as the reduction in the AUC IQR, but it does suggest that it is (slightly) easier for the machine learning algorithm when the proportions are very different. This makes sense in the context of the two-step intuition-building paradigm given above: the algorithm can spend more attention on the classification task if it is easier to extract the class distributions.
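The two summary statistics can be computed as in the following sketch. The AUC values are made-up placeholders standing in for the results of repeated trainings with different seeds, and `auc_summary` is a hypothetical helper name.

```python
import numpy as np

def auc_summary(auc_values):
    """Return the median AUC and the interquartile range over repeated trainings."""
    q25, q50, q75 = np.percentile(auc_values, [25, 50, 75])
    return q50, q75 - q25

# Placeholder AUC values, one per random seed (not results from this paper).
auc_values = np.array([0.70, 0.72, 0.74, 0.76, 0.78])
median_auc, auc_iqr = auc_summary(auc_values)
print(median_auc, auc_iqr)
```

Both statistics are insensitive to a few outlying trainings, which makes them convenient when some seeds converge poorly.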

Example: quark and gluon jet discrimination
Due to the strength of the strong force, there is a plethora of gluon jets produced at the LHC. However, many processes result in mostly quark jets. Prominent examples include the identification of hadronically decaying W bosons [10,11], jets associated with vector boson fusion [12][13][14], and multiquarks resulting from supersymmetry [15]. The references given here are the small number of public results that mention quark/gluon tagging, but there are many more analyses that would benefit from a tagger if a robust technique existed.
The weakly supervised classification strategy is particularly useful for quark/gluon tagging because the fraction of quark jets for a particular set of events is well-known from parton distribution functions and matrix element calculations, while useful discriminating features have not been computed to high accuracy and simulations often mis-model the data. To illustrate this concrete example, quark and gluon jets are simulated and a weakly supervised classifier is trained on the generated event sample. Unlike in real data, the per-event labels are also known in the simulated sample, and they are used to additionally train a fully supervised classifier. Events with 2 → 2 quark-gluon scattering (dijet events) are simulated using the Pythia 8.18 [16] event generator. Jets are clustered using the anti-$k_t$ algorithm [17] with distance parameter R = 0.4 via the FastJet 3.1.3 [18] package. Jets are classified as quark- or gluon-initiated by considering the type of the highest energy quark or gluon in the full generator event record that is inside a 0.3 radius of the jet axis. For simplicity, one transverse momentum range is considered: 45 GeV < $p_T$ < 55 GeV. Additionally, there is a pseudo-rapidity requirement that mimics the usual detector acceptance for charged particle tracking: |η| < 2.1. Heuristically, gluons have twice as much strong-force charge as quarks, resulting in more constituents and a broader radiation pattern for gluon jets. Therefore, the following variables are useful for quark/gluon discrimination: the number of jet constituents n, the first radial moment in $p_T$ (jet width) w, and the fraction of the jet $p_T$ carried by the leading anti-$k_t$ R = 0.1 subjet, $f_0$. The constituents considered for computing n and w are the hadrons in the jet with $p_T$ > 500 MeV.
Figure 3: Comparison of ROC curves for quark/gluon jet discrimination using a fully supervised classifier or a weakly supervised classifier. In (a), the fully and weakly supervised classifiers are trained on identical simulated data and evaluated on a test sample drawn from the same population. The weakly supervised classifier matches the performance of the fully supervised one. The curves corresponding to the three input observables used as discriminants are shown as reference. In (b), the fully supervised classifier (blue line) is trained on a labeled simulated training sample. The weakly supervised classifier (red line) is trained on an unlabeled pseudo-data training sample. In both cases, the performance is evaluated on the same pseudo-data test sample. The ratios to the performance of a fully supervised classifier trained on a labeled pseudo-data sample are shown in the bottom pad.
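Two of the three input observables can be computed directly from the jet constituents, as in this sketch. It assumes the commonly used width definition w = Σ p_T,i ΔR_i / Σ p_T,i, with ΔR_i the distance of constituent i from the jet axis in (η, φ). The text above only names the observable, so the exact normalization is an assumption, as are the function names and the example constituents. The subjet fraction f_0 requires reclustering and is not shown.

```python
import numpy as np

def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + np.pi) % (2.0 * np.pi) - np.pi

def jet_observables(pt, eta, phi, jet_eta, jet_phi, pt_min=0.5):
    """Constituent multiplicity n and width w for constituents above pt_min (GeV)."""
    pt, eta, phi = (np.asarray(a, dtype=float) for a in (pt, eta, phi))
    keep = pt > pt_min
    pt, eta, phi = pt[keep], eta[keep], phi[keep]
    dr = np.hypot(eta - jet_eta, delta_phi(phi, jet_phi))
    n = int(keep.sum())
    w = float(np.sum(pt * dr) / np.sum(pt)) if n > 0 else 0.0
    return n, w

# Made-up constituents (pT in GeV); the last one fails the 500 MeV threshold.
pt = [30.0, 15.0, 5.0, 0.3]
eta = [0.0, 0.1, -0.2, 0.3]
phi = [0.0, 0.0, 0.0, 0.0]
n, w = jet_observables(pt, eta, phi, jet_eta=0.0, jet_phi=0.0)
print(n, w)
```

Wrapping the azimuthal difference matters for jets near φ = ±π, where a naive subtraction would inflate the width.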
A weakly supervised classifier with one hidden layer of size 30 is trained by considering 12 bins of the distribution of the absolute difference in pseudorapidity between the two jets [19]. The proportion of quark-initiated jets varies between 0.21 and 0.32. Figure 3 shows that, while the individual observables perform differently in the high or low gluon efficiency (true positive rate) regimes, their combination in a NN gives consistently better performance. The weakly supervised classifier matches the performance of the fully supervised NN, despite only knowing sample proportions instead of individual event labels. By construction, the weakly supervised classifier is also robust against a realistic amount of mis-modeling in the input variables. This feature is tested by building a pseudo-data sample where the probability distributions of n and w are distorted in the training sample to emulate the difference in efficiency measured in Ref. [6]. The study in Ref. [6] found that a classifier extracted from simulation is more powerful than one extracted from the data. This is reflected in the results presented in the right plot of Fig. 3. When a fully supervised classifier is trained on a sample generated with the same distribution as the test sample (mimicking training and testing on simulation), it achieves a better performance than when trained on the original sample and tested on the distorted pseudo-data (mimicking training on simulation and testing on data). In contrast, the weakly supervised classifier can be trained directly on the distorted pseudo-data sample (representing the data), so it is insensitive to the mis-modeling of the input variables. The standard procedure thus incurs a 10% bias that is avoided by the weakly supervised classifier. Even larger differences may be expected from this and other classification tasks that utilize even more input features or are more mis-modeled. The weakly supervised classifier is robust and outperforms the standard supervised learning trained on simulation.
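The batch construction described above (12 bins in the absolute pseudorapidity difference of the two jets) can be sketched as follows. The bin edges are not given in the text, so equal-population quantile edges are an assumption, and the pseudorapidities and quark flags are random placeholders. In an analysis on data the per-bin quark fractions would come from theory or simulation rather than from labels.

```python
import numpy as np

def assign_batches(eta1, eta2, n_bins=12):
    """Assign each dijet event to a bin of |eta1 - eta2| (equal-population edges)."""
    d_eta = np.abs(np.asarray(eta1) - np.asarray(eta2))
    edges = np.quantile(d_eta, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.searchsorted(edges, d_eta, side="right") - 1
    return np.clip(idx, 0, n_bins - 1), edges

def batch_proportions(batch, is_quark, n_bins=12):
    """Quark fraction per batch, computed here from (simulated) truth flags."""
    return np.array([is_quark[batch == k].mean() for k in range(n_bins)])

rng = np.random.default_rng(2)
n_events = 1200
eta1 = rng.uniform(-2.1, 2.1, n_events)   # placeholder jet pseudorapidities
eta2 = rng.uniform(-2.1, 2.1, n_events)
is_quark = rng.random(n_events) < 0.25    # placeholder truth flags

batch, edges = assign_batches(eta1, eta2)
counts = np.bincount(batch, minlength=12)
fractions = batch_proportions(batch, is_quark)
print(counts, fractions)
```

Each batch then contributes one proportion term to the weakly supervised loss, with the per-batch fraction playing the role of y_k.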

Conclusions
We have presented a new approach to classification with neural networks in cases where class proportions are known but individual labels are not readily available. This weakly supervised classification has broad applicability and has been demonstrated in one important discrimination task in high energy physics: quark versus gluon jet tagging. In the quark/gluon and related contexts, weakly supervised classification provides a robust and powerful approach because it can be directly trained on examples from (unlabeled) data instead of (labeled, but unreliable) simulation. The examples presented so far have used a small number of input features to illustrate the ideas, but there is no algorithmic limitation on the number of features. Figure 4 is a simple extension of Fig. 1 with 5 features instead of 3; in future work, we will study the extension to many more features (tens to hundreds). This paper has laid the conceptual groundwork for a new classification paradigm that can be applied to a wide variety of learning problems to boost performance and robustness when detailed simulations are not reliable or not available.