Classification without labels: Learning from mixed samples in high energy physics

Modern machine learning techniques can be used to construct powerful models for difficult collider physics problems. In many applications, however, these models are trained on imperfect simulations due to a lack of truth-level information in the data, which risks the model learning artifacts of the simulation. In this paper, we introduce the paradigm of classification without labels (CWoLa) in which a classifier is trained to distinguish statistical mixtures of classes, which are common in collider physics. Crucially, neither individual labels nor class proportions are required, yet we prove that the optimal classifier in the CWoLa paradigm is also the optimal classifier in the traditional fully-supervised case where all label information is available. After demonstrating the power of this method in an analytical toy example, we consider a realistic benchmark for collider physics: distinguishing quark- versus gluon-initiated jets using mixed quark/gluon training samples. More generally, CWoLa can be applied to any classification problem where labels or class proportions are unknown or simulations are unreliable, but statistical mixtures of the classes are available.


Introduction
In the data-rich environment of the Large Hadron Collider (LHC), machine learning techniques have the potential to significantly improve on many classification, regression, and generation problems in collider physics. There has been a recent surge of interest in applying deep learning and other modern algorithms to a wide variety of problems, such as jet tagging. Despite the power of these methods, they all currently rely on significant input from simulations. Existing multivariate approaches for classification used by the LHC experiments all have some degree of mis-modeling by simulations and must be corrected post-hoc using data-driven techniques [22][23][24][25][26][27][28][29][30]. The existence of these scale factors is an indication that the algorithms trained on simulation are sub-optimal when tested on data. Adversarial approaches can be used to mitigate potential mis-modeling effects during training at the cost of algorithmic performance [31]. The only solution that does not compromise performance is to train directly on data. This is often thought to be impossible because data are unlabeled.
In this paper, we introduce classification without labels (CWoLa, pronounced "koala"), a paradigm which allows robust classifiers to be trained directly on data in scenarios common in collider physics. Remarkably, the CWoLa method amounts to only a minor variation on well-known machine learning techniques, since one can apply standard fully-supervised techniques to two mixed samples. As long as the two samples have different compositions of the true classes (even if the label proportions are unknown), we prove that the optimal classifier in the CWoLa framework is the optimal classifier in the fully-supervised case. [...] residual dependence on simulation; indeed, one could even combine adversarial approaches with CWoLa in this case to mitigate simulation dependence [31]. Finally, the CWoLa approach presented here only applies to mixtures of two categories, and further developments would be needed to disentangle multicategory samples.
The remainder of this paper is organized as follows. In Sec. 2, we explain the theoretical foundations of the CWoLa paradigm and contrast it with full supervision and with weak supervision in the style of learning from label proportions (LLP). We illustrate the power of CWoLa with a toy example of two Gaussian random variables in Sec. 3. We then apply CWoLa to the challenge of quark versus gluon jet tagging in Sec. 4, using a dense network of five standard quark/gluon discriminants to highlight the performance of CWoLa on mixed samples. The paper concludes in Sec. 5 with a summary and future outlook.

Machine learning with and without labels
The goal of classification is to distinguish two processes from each other: signal $S$ and background $B$. Let $\vec{x}$ be a list of observables that are useful for distinguishing signal from background, and define $p_S(\vec{x})$ and $p_B(\vec{x})$ to be the probability distributions of $\vec{x}$ for the signal and background, respectively. A classifier $h : \vec{x} \to \mathbb{R}$ is designed such that higher values of $h$ are more signal-like and lower values are more background-like. A classifier operating point is defined by a threshold cut $h > c$; the signal efficiency is then $\epsilon_S = \int d\vec{x}\, p_S(\vec{x})\, \Theta(h(\vec{x}) - c)$ and the background efficiency is $\epsilon_B = \int d\vec{x}\, p_B(\vec{x})\, \Theta(h(\vec{x}) - c)$, for the Heaviside step function $\Theta$. The performance of a classifier $h$ can be described by its receiver operating characteristic (ROC) curve, which is the function $1 - \epsilon_B^{h}(\epsilon_S)$. A classifier $h$ is optimal if for any other classifier $h'$, $\epsilon_B^{h'}(\epsilon_S) \geq \epsilon_B^{h}(\epsilon_S)$ for all possible $\epsilon_S$. By the Neyman-Pearson lemma [39], an optimal classifier is the likelihood ratio: $h_{\text{optimal}}(\vec{x}) = p_S(\vec{x})/p_B(\vec{x})$. Therefore, the goal of classification is to learn $h_{\text{optimal}}$ or any classifier that is monotonically related to it.
In practice, one learns to approximate $h_{\text{optimal}}(\vec{x})$ from a set of signal and background $\vec{x}$ examples (training data). When the dimensionality of $\vec{x}$ is small and the number of examples large, it is often possible to approximate $p_S(\vec{x})$ and $p_B(\vec{x})$ directly by using histograms. When the dimensionality is large, an explicit construction is often not possible. In this case, one constructs a loss function that is minimized using a machine learning algorithm like a boosted decision tree or (deep) neural network. The following section describes three paradigms for learning $h_{\text{optimal}}(\vec{x})$ with different amounts of information available at training time: full supervision, LLP, and CWoLa. The ideas presented here apply to any procedure for constructing $h_{\text{optimal}}(\vec{x})$.

Full supervision
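Fully supervised learning is the standard classification paradigm. Each example $\vec{x}_i$ comes with a label $u_i \in \{S, B\}$. For models trained to minimize loss functions, typical loss functions are the mean squared error:
$$\ell_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \big( h(\vec{x}_i) - \mathbb{I}(u_i = S) \big)^2, \tag{2.1}$$
for the indicator function $\mathbb{I}$, or the cross-entropy:
$$\ell_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ \mathbb{I}(u_i = S) \log h(\vec{x}_i) + \big(1 - \mathbb{I}(u_i = S)\big) \log\big(1 - h(\vec{x}_i)\big) \Big], \tag{2.2}$$
where $N$ is the size of the subset (batch) of the available training data. With large enough training samples, flexible enough model parameterization, and suitable minimization procedure, the learned $h$ should approach the performance of $h_{\text{optimal}}$.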
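As a small illustration of the mean squared error and cross-entropy losses for a fully labeled batch, the following pure-Python sketch evaluates both. It assumes that the classifier output lies in (0, 1) and that the label indicator is encoded as 1 for signal and 0 for background; the function names and the clipping constant `eps` are illustrative choices, not taken from the text.

```python
import math

def mse_loss(h_values, is_signal):
    """Mean squared error between classifier outputs and 0/1 signal indicators."""
    n = len(h_values)
    return sum((h - y) ** 2 for h, y in zip(h_values, is_signal)) / n

def cross_entropy_loss(h_values, is_signal, eps=1e-12):
    """Binary cross-entropy; eps (an assumption) guards against log(0)."""
    n = len(h_values)
    total = 0.0
    for h, y in zip(h_values, is_signal):
        h = min(max(h, eps), 1.0 - eps)  # clip the output into (0, 1)
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / n

# Hypothetical batch of four examples: outputs h(x_i) and indicators I(u_i = S).
h_batch = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 1, 0]
print(mse_loss(h_batch, labels), cross_entropy_loss(h_batch, labels))
```

Both losses are minimized when the outputs match the indicators, which is why either can be used to approximate a monotonic function of the likelihood ratio.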

Learning from label proportions
For weak supervision, one does not have complete and/or accurate label information. Here, we consider the case of accurate labels, but in the context of mixed samples. Consider two processes $M_1$ and $M_2$ that are mixtures of the original signal and background processes:
$$p_{M_1}(\vec{x}) = f_1\, p_S(\vec{x}) + (1 - f_1)\, p_B(\vec{x}), \tag{2.3}$$
$$p_{M_2}(\vec{x}) = f_2\, p_S(\vec{x}) + (1 - f_2)\, p_B(\vec{x}), \tag{2.4}$$
with the signal fractions satisfying $0 \le f_2 < f_1 \le 1$. Instead of having training data labeled as being from $p_S$ or $p_B$, we are now only given examples drawn from $p_{M_1}$ and $p_{M_2}$ with the corresponding $M_1$ and $M_2$ labels. We are however told $f_1$ and $f_2$ ahead of time. The resulting optimization problems are much less constrained than those in Sec. 2.1, but learning is still possible. The key is to use several different mixed samples with sufficiently different fractions in order to avoid trivial failure modes, as discussed in Ref. [34]. One possible loss function is given by:
$$\ell_{\text{LLP}} = \left| \frac{1}{N_{M_1}} \sum_{i \in M_1} h(\vec{x}_i) - f_1 \right| + \left| \frac{1}{N_{M_2}} \sum_{j \in M_2} h(\vec{x}_j) - f_2 \right|, \tag{2.5}$$
where $N_{M_1}$ and $N_{M_2}$ are the number of $M_1$ and $M_2$ examples in the batch. One could extend (and improve) this paradigm by adding in more samples with different fractions, but we consider only two here for simplicity.
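To make the LLP idea concrete, here is a minimal sketch of one possible loss of this kind. It assumes the loss compares the batch-averaged classifier output in each mixed sample with the known signal fraction of that sample; the function name and the example numbers are hypothetical.

```python
def llp_loss(h_m1, h_m2, f1, f2):
    """|mean(h over M1) - f1| + |mean(h over M2) - f2| for one batch.

    h_m1, h_m2: classifier outputs in [0, 1] for the M1 and M2 examples.
    f1, f2: the known signal fractions of the two mixed samples.
    """
    mean1 = sum(h_m1) / len(h_m1)
    mean2 = sum(h_m2) / len(h_m2)
    return abs(mean1 - f1) + abs(mean2 - f2)

# Hypothetical batch: average output 0.75 in M1 and 0.25 in M2.
outputs_m1 = [0.9, 0.8, 0.7, 0.6]
outputs_m2 = [0.1, 0.3, 0.2, 0.4]
print(llp_loss(outputs_m1, outputs_m2, f1=0.8, f2=0.2))
```

Because only sample-level averages enter, many different classifiers can reach the same loss value, which is the sense in which the problem is less constrained than full supervision.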

Classification without labels
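CWoLa is an alternative strategy for weak supervision in the context of mixed samples. Rather than modifying the loss function to accommodate the limited information as in Sec. 2.2, the CWoLa approach is to simply train the model to discriminate the mixed samples $M_1$ and $M_2$ from one another. The classifier $h$ trained to distinguish $M_1$ from $M_2$ (using full supervision) is then directly applied to distinguish $S$ from $B$. An illustration of this technique is shown in Fig. 1. Remarkably, this procedure results in an optimal classifier (as defined in the beginning of Sec. 2) for the $S$ versus $B$ classification problem:
Theorem 1. Given mixed samples $M_1$ and $M_2$ defined in terms of pure samples $S$ and $B$ using Eqs. (2.3) and (2.4) with signal fractions $f_1 > f_2$, an optimal classifier trained to distinguish $M_1$ from $M_2$ is also optimal for distinguishing $S$ from $B$.
Proof. The optimal classifier to distinguish examples drawn from $p_{M_1}$ and $p_{M_2}$ is the likelihood ratio
$$L_{M_1/M_2}(\vec{x}) = \frac{p_{M_1}(\vec{x})}{p_{M_2}(\vec{x})}.$$
Similarly, the optimal classifier to distinguish examples drawn from $p_S$ and $p_B$ is the likelihood ratio
$$L_{S/B}(\vec{x}) = \frac{p_S(\vec{x})}{p_B(\vec{x})}.$$
Where $p_B$ has support, we can relate these two likelihood ratios algebraically:
$$L_{M_1/M_2} = \frac{f_1\, p_S + (1 - f_1)\, p_B}{f_2\, p_S + (1 - f_2)\, p_B} = \frac{f_1\, L_{S/B} + (1 - f_1)}{f_2\, L_{S/B} + (1 - f_2)},$$
which is a monotonically increasing rescaling of the likelihood $L_{S/B}$ as long as $f_1 > f_2$, since $\partial L_{M_1/M_2} / \partial L_{S/B} = (f_1 - f_2)/(f_2\, L_{S/B} + 1 - f_2)^2 > 0$. If $f_1 < f_2$, one obtains the reversed classifier. Therefore, $L_{S/B}$ and $L_{M_1/M_2}$ define the same classifier.
An important feature of CWoLa is that, unlike the LLP-style weak supervision in Sec. 2.2, the label proportions $f_1$ and $f_2$ are not required for training. Of course, this proof only guarantees that the optimal classifier from CWoLa is the same as the optimal classifier from fully-supervised learning. We explore the practical performance of CWoLa in Secs. 3 and 4.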
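The monotonic relation between the two likelihood ratios can also be checked numerically. The sketch below assumes mixtures of the form p_Mk = f_k p_S + (1 - f_k) p_B, so that the mixed-sample ratio is (f1 L + 1 - f1)/(f2 L + 1 - f2) with L the signal-to-background ratio; the grid of L values and the fractions are arbitrary illustrative choices.

```python
def mixed_likelihood_ratio(l_sb, f1, f2):
    """L_{M1/M2} as a function of L_{S/B} for two S/B mixtures with fractions f1, f2."""
    return (f1 * l_sb + (1.0 - f1)) / (f2 * l_sb + (1.0 - f2))

def is_strictly_increasing(values):
    """True if every element is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

# Hypothetical grid of L_{S/B} values spanning several orders of magnitude.
grid = [0.01 * 1.5 ** k for k in range(30)]

# f1 > f2: the mixed-sample ratio should rise with L_{S/B}.
forward = [mixed_likelihood_ratio(l, 0.8, 0.2) for l in grid]
# f1 < f2: the mapping should fall with L_{S/B} (the reversed classifier).
reverse = [mixed_likelihood_ratio(l, 0.2, 0.8) for l in grid]

print(is_strictly_increasing(forward))
print(is_strictly_increasing(reverse[::-1]))
```

Since a strictly monotonic rescaling preserves the ordering of events, any threshold on one ratio corresponds to a threshold on the other, which is all that the ROC curve depends on.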
The problem of learning from unknown mixed samples can be shown to be mathematically equivalent to the problem of learning with asymmetric random label noise, where there have been recent advances [32,40]. The equivalence of these frameworks follows from the fact that randomly flipping the labels of pure samples, possibly with different flip probabilities for signal and background, produces mixed samples. In the language of noisy labels, Ref. [32] argues that even unknown class proportions can be estimated from mixed samples under certain conditions using mixture proportion estimation [41], which may have interesting applications in collider physics. There are also connections between learning from unknown mixed samples and the calibrated classifiers approach in Ref. [42], where measurement of the class proportions from unknown mixtures is also shown to be possible.

Operating points
While the optimal classifier from CWoLa is independent of the mixed sample compositions, some minimal input is needed in order to establish classification operating points. Specifically, to define a cut on the classifier $h$ at a value $c$ to achieve signal efficiency $\epsilon_S$, one requires some degree of label information.
One practical strategy is to use CWoLa to train on two large mixed samples without label or class proportion information, and then benchmark it on two smaller samples where the class proportions $f_1$ and $f_2$ are precisely known. In that case, one can solve a simple system of equations on the smaller samples:
$$\Pr(h(\vec{x}) > c \,|\, M_1) = f_1\, \epsilon_S + (1 - f_1)\, \epsilon_B,$$
$$\Pr(h(\vec{x}) > c \,|\, M_2) = f_2\, \epsilon_S + (1 - f_2)\, \epsilon_B,$$
where the probabilities can be estimated numerically by counting the number of events that pass the classifier cut in some sample, e.g.
$$\Pr(h(\vec{x}) > c \,|\, M_1) \approx \frac{1}{|M_1|} \sum_{\vec{x} \in M_1} \Theta(h(\vec{x}) - c),$$
where $M_1$ is the mixed sample data. Thus with class proportions only, the ROC curve of a classifier can be determined. For the purpose of establishing working points, one might need to rely on simulations to determine the label proportions of the test samples. In many cases, though, label proportions are better known than the details of the observables used to train the classifier. For instance, in jet tagging, the label proportions of kinematically-selected samples are largely determined by the hard scattering process, with only mild sensitivity to effects such as shower mismodeling. In this way, one is sensitive only to simulation uncertainties associated with sample composition, which in most cases are largely uncorrelated with uncertainties associated with tagging performance.
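The operating-point determination can be sketched in a few lines. The code below assumes the two mixed-sample pass probabilities are related to the pure efficiencies by p_k = f_k eps_S + (1 - f_k) eps_B for k = 1, 2, and inverts this 2x2 linear system; the function names and the example numbers are hypothetical.

```python
def pass_fraction(scores, cut):
    """Fraction of events whose classifier value exceeds the cut."""
    return sum(1 for s in scores if s > cut) / len(scores)

def efficiencies_from_mixtures(p1, p2, f1, f2):
    """Solve p_k = f_k * eps_S + (1 - f_k) * eps_B (k = 1, 2) for (eps_S, eps_B).

    p1, p2: pass fractions measured in the two benchmark mixed samples.
    f1, f2: their known signal fractions, which must differ.
    """
    det = f1 - f2  # determinant of [[f1, 1 - f1], [f2, 1 - f2]]
    if det == 0:
        raise ValueError("the two samples must have different signal fractions")
    eps_s = ((1.0 - f2) * p1 - (1.0 - f1) * p2) / det
    eps_b = (f1 * p2 - f2 * p1) / det
    return eps_s, eps_b

# Hypothetical pass fractions at one cut, for samples with f1 = 0.8 and f2 = 0.2.
eps_s, eps_b = efficiencies_from_mixtures(0.58, 0.22, 0.8, 0.2)
print(eps_s, eps_b)
```

Repeating this for a scan of cut values traces out the ROC curve using class proportions only. With finite samples the solved efficiencies can fluctuate slightly outside [0, 1], so in practice one may want to propagate the counting uncertainties.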
To summarize, the CWoLa paradigm does not need class proportions during training, and it only requires a small sample of test data where class proportions are known in order to determine the classifier performance and operating points, with minimal input from simulation.

Illustrative example: Two Gaussian random variables
Before demonstrating the combination of CWoLa with a modern neural network, we first illustrate the various forms of learning discussed in Sec. 2 through a simplified example where the optimal classifier can be obtained analytically. Consider a single observable $x$ for distinguishing a signal $S$ from a background $B$. For simplicity, suppose that the probability distribution of $x$ is a Gaussian with mean $\mu_S$ and standard deviation $\sigma_S$ for the signal and a Gaussian with mean $\mu_B$ and standard deviation $\sigma_B$ for the background. We then consider the mixed samples $M_1$ and $M_2$ from Eqs. (2.3) and (2.4) with signal fractions $f_1$ and $f_2$.
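In this one-dimensional case, the optimal fully-supervised classifier can be constructed analytically:
$$h_{\text{optimal}}(x) = \frac{p_S(x)}{p_B(x)} = \frac{\sigma_B}{\sigma_S} \exp\left[ \frac{(x - \mu_B)^2}{2\sigma_B^2} - \frac{(x - \mu_S)^2}{2\sigma_S^2} \right]. \tag{3.1}$$
[...] we use the area under the curve (AUC) metric to quantify performance. For continuous random variables, the AUC can be defined as $\Pr(h(x|S) > h(x|B))$, the probability that a randomly chosen signal example receives a higher classifier value than a randomly chosen background example. This notion extends well to discrete random variables (indexed by integers) by giving half credit to ties: $\text{AUC} = \sum_{i > j} \Pr(x = i \,|\, S) \Pr(x = j \,|\, B) + \frac{1}{2} \sum_{i} \Pr(x = i \,|\, S) \Pr(x = i \,|\, B)$. One advantage of CWoLa over the LLP approach is that the fractions $f_1$ and $f_2$ are not required for training. In Fig. 3, we demonstrate the impact on the AUC for LLP when the wrong fractions are provided at training time. Here, the true fractions are $f_1 = 80\%$ and $f_2 = 20\%$, but different fractions $f_{1,\text{wrong}} = 1 - f_{2,\text{wrong}}$ are used to calculate Eq. (3.3). For $f_{1,\text{wrong}}$ far from 50%, there is little dependence on the fraction used for training. This insensitivity is likely due to the preservation of monotonicity to the full likelihood with small perturbations in $f$, as discussed in detail in Ref. [38].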
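The pairwise definition of the AUC translates directly into code. The sketch below estimates Pr(h_S > h_B) from finite lists of classifier values and, as an assumption made here to handle discrete values, gives half credit to ties; the function name and the example values are illustrative.

```python
def auc(signal_scores, background_scores):
    """Estimate Pr(h_S > h_B) + 0.5 * Pr(h_S == h_B) over all signal/background pairs."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0   # signal ranked above background
            elif s == b:
                wins += 0.5   # tie: half credit (a convention assumed here)
    return wins / (len(signal_scores) * len(background_scores))

# Hypothetical integer-valued classifier outputs with one repeated value.
signal_values = [3, 2, 2]
background_values = [1, 2, 0]
print(auc(signal_values, background_values))
```

The double loop costs O(N_S N_B) comparisons, which is fine for small checks; a rank-based computation would be preferable for large samples.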
With this one-dimensional example, the estimate for the optimal classifier under each of the three schemes is computable directly. It is often the case that x is highly multidimensional, though, in which case a more sophisticated learning scheme may be required. We investigate the performance of CWoLa in a five-dimensional space in the next section.
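A direct one-dimensional estimate of the CWoLa classifier can be sketched with histograms. All numerical choices below (means, widths, sample sizes, binning, the +1 smoothing of bin counts, and the random seed) are hypothetical and are not the values used for the figures; the point is only that a ratio of mixed-sample histograms, trained without labels or fractions, separates pure signal from pure background almost as well as the analytic optimum.

```python
import math
import random

def draw_mixture(n, f, mu_s, mu_b, sigma, rng):
    """Draw n values of x; each is signal with probability f, otherwise background."""
    return [rng.gauss(mu_s if rng.random() < f else mu_b, sigma) for _ in range(n)]

def histogram(values, lo, hi, n_bins):
    """Bin counts on [lo, hi]; out-of-range values go into the edge bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        k = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[k] += 1
    return counts

def binned_auc(sig_counts, bkg_counts, scores):
    """AUC of a per-bin score on binned pure samples, with half credit for ties."""
    total = 0.0
    for i, s_i in enumerate(sig_counts):
        for j, b_j in enumerate(bkg_counts):
            if scores[i] > scores[j]:
                total += s_i * b_j
            elif scores[i] == scores[j]:
                total += 0.5 * s_i * b_j
    return total / (sum(sig_counts) * sum(bkg_counts))

rng = random.Random(0)
MU_S, MU_B, SIGMA = 1.0, -1.0, 1.0      # hypothetical toy parameters
LO, HI, N_BINS, N = -5.0, 5.0, 40, 20000

# "Training": histogram the two mixed samples; no labels or fractions are used.
h1 = histogram(draw_mixture(N, 0.8, MU_S, MU_B, SIGMA, rng), LO, HI, N_BINS)
h2 = histogram(draw_mixture(N, 0.2, MU_S, MU_B, SIGMA, rng), LO, HI, N_BINS)
scores = [(a + 1.0) / (b + 1.0) for a, b in zip(h1, h2)]  # smoothed M1/M2 ratio

# Evaluation on independent pure samples (f = 1 is pure S, f = 0 is pure B).
sig = histogram(draw_mixture(N, 1.0, MU_S, MU_B, SIGMA, rng), LO, HI, N_BINS)
bkg = histogram(draw_mixture(N, 0.0, MU_S, MU_B, SIGMA, rng), LO, HI, N_BINS)
cwola_auc = binned_auc(sig, bkg, scores)

# Analytic optimum for equal-width Gaussians: Phi((mu_S - mu_B) / (sqrt(2) * sigma)).
optimal_auc = 0.5 * (1.0 + math.erf((MU_S - MU_B) / (2.0 * SIGMA)))
print(round(cwola_auc, 3), round(optimal_auc, 3))
```

The small remaining gap to the analytic value comes from finite binning and statistical noise in sparsely populated bins, both of which shrink with more training data.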

Realistic example: Quark/gluon jet discrimination
Quark- versus gluon-initiated jet tagging [43][44][45][46][47][48][49][50][51] is a particularly important classification problem in high energy physics where training on data would be beneficial. This is because correlations between key observables known to be useful for tagging are not always well-modeled by simulations as they depend on the detailed structure of a jet's radiation pattern [24,52]. Furthermore, even the LLP paradigm proposed in Ref. [34] can be sensitive to the input fractions which are themselves dependent on non-perturbative information from parton distribution functions. In this section, we test the performance of CWoLa in a realistic context where a small number of quark/gluon discriminants are combined into one classifier, similar to the CMS quark/gluon likelihood [25,26].
A key limitation of this study is that we artificially construct mixed samples $M_1$ and $M_2$ from pure "quark" ($S$) and pure "gluon" ($B$) samples. In the practical case of interest at the LHC, one would measure a quark-enriched sample in Z plus jet events and a gluon-enriched sample in dijet events, with more sophisticated selections possible as well [53]. However, the "quark" jet in $pp \to Z + j$ events is not the same as the "quark" jet in $pp \to 2j$ events, since there are soft color correlations with the rest of the event. Jet grooming techniques [54][55][56][57][58][59] can mitigate the impact of soft effects to provide a more universal "quark" jet definition [60,61]. Still, one needs to validate the robustness of quark/gluon classifiers to the possibility of sample-dependent labels, and we leave a detailed study of this effect to future work.
This study is based on five key jet substructure observables which are known to be useful quark/gluon discriminants [37]. The discriminants are combined using a modern neural network employing either CWoLa or fully-supervised learning. We do not show a benchmark curve for LLP since it is difficult to ensure a fair comparison. By contrast, CWoLa and full supervision use the same loss function with the same training strategy, so a direct comparison is meaningful. All of the observables can be written in terms of the generalized angularities [51] (see also [62][63][64]):
$$\lambda^{\kappa}_{\beta} = \sum_{i \in \text{jet}} z_i^{\kappa}\, \theta_i^{\beta}, \qquad z_i = \frac{p_{T,i}}{\sum_{j \in \text{jet}} p_{T,j}}, \qquad \theta_i = \frac{\Delta R_i}{R}, \tag{4.1}$$
where $\Delta R_i$ is the rapidity/azimuth distance to the E-scheme jet axis, $p_{T,i}$ is the particle transverse momentum, and $R$ is the jet radius. The observables used to train the network use $(\kappa, \beta)$ values of:
$$\begin{array}{ccccc} (0, 0) & (2, 0) & (1, 0.5) & (1, 1) & (1, 2) \\ \text{multiplicity} & p_T^D & \text{LHA} & \text{width} & \text{mass} \end{array} \tag{4.2}$$
where the names map onto the well-known discriminants in the quark/gluon literature.
Quark and gluon jets are simulated from the decay of a heavy scalar particle $H$ with $m_H = 500$ GeV in either the $pp \to H \to q\bar{q}$ or $pp \to H \to gg$ channel. Production, decay, and fragmentation are modeled with Pythia 8.183 [70]. Jets are clustered using the anti-$k_t$ algorithm [71] with radius $R = 0.6$ implemented in FastJet 3.1.3 [72]. Only detector-stable hadrons are used for jet finding. Since the gluon color factor $C_A$ is larger than the quark color factor $C_F$ by about a factor of two, gluon jets have more particles and are "wider" on average as measured by the angularities listed above.
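As an illustration of how such observables are evaluated, the sketch below computes a generalized angularity for one jet. It assumes the form sum_i z_i^kappa theta_i^beta with z_i the transverse momentum fraction and theta_i the rapidity/azimuth distance to the jet axis divided by R; the jet axis is passed in by hand rather than computed in the E-scheme, and the three-particle jet is entirely hypothetical.

```python
import math

def generalized_angularity(particles, jet_axis, jet_radius, kappa, beta):
    """Return sum_i z_i**kappa * theta_i**beta for one jet.

    particles: list of (pt, rapidity, phi) tuples for the jet constituents.
    jet_axis: (rapidity, phi) of the jet axis (supplied by the caller here).
    """
    total_pt = sum(pt for pt, _, _ in particles)
    y_jet, phi_jet = jet_axis
    result = 0.0
    for pt, y, phi in particles:
        # Wrap the azimuthal difference into [-pi, pi).
        dphi = (phi - phi_jet + math.pi) % (2.0 * math.pi) - math.pi
        theta = math.hypot(y - y_jet, dphi) / jet_radius
        result += (pt / total_pt) ** kappa * theta ** beta
    return result

# Hypothetical three-particle jet with axis at the origin and R = 0.6.
jet = [(60.0, 0.0, 0.0), (30.0, 0.3, 0.0), (10.0, 0.0, -0.6)]
for kappa, beta in [(0, 0), (2, 0), (1, 0.5), (1, 1), (1, 2)]:
    print((kappa, beta), generalized_angularity(jet, (0.0, 0.0), 0.6, kappa, beta))
```

With (kappa, beta) = (0, 0) every constituent contributes 1, so the observable reduces to the particle multiplicity, while larger beta weights wide-angle radiation more heavily.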
To classify quarks and gluons with either the CWoLa or fully-supervised method, we use a simple neural network consisting of two dense layers of 30 nodes with rectified linear unit (ReLU) activation functions connected to a 2-node output with a softmax activation function. All neural network training was performed with the Python deep learning library Keras [73] with a TensorFlow [74] backend. The data consisted of 200k quark/gluon events, partitioned into 20k validation events, 20k test events, and the remainder used as training event samples of various sizes. He-uniform weight initialization [75] was used for the model weights. The network was trained with the categorical cross-entropy loss function using the Adam algorithm [76] with a learning rate of 0.001 and a batch size of 128.
In Fig. 4, we show the performance of CWoLa training for quark/gluon classification using mixed samples of different purities. These mixed samples of 25k and 150k training events were generated by shuffling the pure samples into two sets in different proportions. Performance is measured in terms of the classifier AUC. The behavior resembles that found in the toy model of Fig. 2, with more training data resulting in increased robustness to sample impurity. It is remarkable that such good performance can be obtained even when the signal/background events are so heavily mixed.
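The construction of mixed samples by shuffling pure samples can be sketched as follows. This is a minimal illustration rather than the procedure actually used for Fig. 4: the function name, the equal sample sizes, and the tagged toy events are assumptions. The events keep a "q"/"g" tag here only so the resulting purities can be checked; a CWoLa training would use the M1/M2 membership as the only label.

```python
import random

def make_mixed_samples(signal, background, f1, f2, n_per_sample, rng):
    """Build two disjoint mixed samples with signal fractions f1 and f2."""
    n_s1, n_s2 = round(f1 * n_per_sample), round(f2 * n_per_sample)
    n_b1, n_b2 = n_per_sample - n_s1, n_per_sample - n_s2
    if n_s1 + n_s2 > len(signal) or n_b1 + n_b2 > len(background):
        raise ValueError("not enough pure events for the requested mixtures")
    sig, bkg = list(signal), list(background)
    rng.shuffle(sig)
    rng.shuffle(bkg)
    # Take non-overlapping slices so that no event appears in both samples.
    m1 = sig[:n_s1] + bkg[:n_b1]
    m2 = sig[n_s1:n_s1 + n_s2] + bkg[n_b1:n_b1 + n_b2]
    rng.shuffle(m1)
    rng.shuffle(m2)
    return m1, m2

rng = random.Random(1)
quarks = [("q", i) for i in range(100)]   # hypothetical pure "signal" events
gluons = [("g", i) for i in range(100)]   # hypothetical pure "background" events
m1, m2 = make_mixed_samples(quarks, gluons, 0.8, 0.2, 100, rng)

# CWoLa training set: the label is the mixed-sample membership, not the truth tag.
training_set = [(event, 1) for event in m1] + [(event, 0) for event in m2]
print(sum(1 for tag, _ in m1 if tag == "q"), sum(1 for tag, _ in m2 if tag == "q"))
```

The resulting `training_set` can be passed to any standard binary classifier, since the mixed-sample membership plays the role of the label.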
In Fig. 5, we show ROC and significance improvement (SI) curves for 150k training events, where SI is a curve of $\epsilon_q/\sqrt{\epsilon_g}$ at different $\epsilon_q$ values [50]. Results are given for the fully-supervised classifier trained on pure samples and the CWoLa classifier trained on mixed samples with $f_1 = 80\%$ and $f_2 = 20\%$, along with the curves of the input observables. Both the fully-supervised and CWoLa dense networks achieve similar performance, with the expected improvement over the individual input observables. This suggests that the CWoLa optimality proven in Theorem 1 is achievable in practice, though many more studies are needed to demonstrate this in a wider range of contexts.
Figure 5. Quark/gluon discrimination performance in terms of (a) ROC curves and (b) SI curves. Shown are results for the dense net trained on 150k pure samples, and then with CWoLa on $f_1 = 80\%$ versus $f_2 = 20\%$ mixed samples, as well as the input observables individually. The classifier trained on the mixed samples achieves similar performance to the classifier trained on the pure samples, with improvement in performance over the input observables.

Conclusions
We introduced the CWoLa framework for training classifiers on different mixed samples of signal and background events, without using true labels or class proportions. The observation that the optimal classifier for mixed samples of signal and background is also optimal for pure samples of signal and background, proven in Theorem 1, could be of tremendous practical use at the LHC for learning directly from data whenever truth information is unknown or uncertain and whenever detailed and reliable simulations are unavailable. We highlight that no new specific code, loss function, or model architecture is needed to implement CWoLa. Any tools for training a classifier using truth information can be directly applied to discriminate mixed samples and thus to train in the CWoLa framework directly on data.
Using a toy example, we found that CWoLa performs as well as LLP (which requires knowledge of the class proportions), suggesting that CWoLa is a robust paradigm for weak supervision. Of course, to determine operating points and classification power for the CWoLa method, some label information is needed, but it can be furnished by a smaller sample of testing data that can be separate from the larger mixed samples used for training. It is also worth remembering that CWoLa assumes that the mixed samples are not subject to contamination or sample-dependent labeling, though one could imagine using data-driven cross-validation with more than two mixed samples to identify and mitigate such effects. More ambitiously, one could try to apply CWoLa to event samples that otherwise look identical, to try to tease out potential subpopulations of events.
As a realistic example, we applied the CWoLa framework to the important case of quark/gluon discrimination, a classification task for which simulations are typically unreliable and true labels are unknown. We showed that the CWoLa method can be successfully used to train a dense neural network for quark/gluon classification on mixed samples with five jet substructure observables as input. Though the realistic example made use of a neural network, the CWoLa paradigm can be used to train many other types of classifiers. While in this study we considered a relatively small network on a small (but important) number of inputs, the same principles apply for any type of model or input. In future work, we plan to study CWoLa in the context of deeper architectures and larger inputs.