1 Introduction

The Standard Model (SM) of particle physics is widely regarded as one of the most successful theories developed to date. However, it is not a complete theory, as several observations point to physics beyond the SM. Among these open problems are the Higgs hierarchy problem and the existence of dark matter, which is well established by the observations reported in [1,2,3].

The supersymmetric extension of the Standard Model (SUSY) [4,5,6,7,8,9,10,11,12] predicts a superpartner for every SM particle, differing from it by half a unit of spin. In the R-parity-conserving minimal supersymmetric extension of the Standard Model (MSSM), the lightest neutralino (\({\tilde{\chi }}_1^0\)) is the lightest supersymmetric particle (LSP); it is stable and weakly interacting and is therefore a dark matter candidate. At the Large Hadron Collider (LHC), the CMS and ATLAS collaborations have been searching for supersymmetric particles, and both experiments place limits on the masses of colored supersymmetric particles. In models with pair production of gluinos decaying via off-shell top and bottom squarks, gluino masses are excluded up to \(\approx 2.4\) TeV and 2.35 TeV, respectively, for a massless LSP [13]. The work [14] combines previously published analyses of the pair production of the supersymmetric partner of the top quark in final states with 0, 1, and 2 leptons [15,16,17] and excludes top squark (\(\tilde{t}\)) masses up to 1325 GeV for a massless neutralino; the strongest exclusion, however, comes from the search [18], which excludes top squark masses below 1.55 TeV for a massless LSP. The mass limits on electroweakly produced charginos and neutralinos are weaker, since these particles have smaller production cross sections at a hadron collider. The work [18] sets limits on the electroweakino masses, excluding chargino (\({\tilde{\chi }}_1^{\pm }\)) and neutralino (\({\tilde{\chi }}_2^0\)) masses below 900 GeV.

The CMS and ATLAS experiments have searched for directly produced sleptons at 8 and 13 TeV in final states with two leptons and the LSP [19,20,21,22,23]. Since the LSP leaves the detector without a trace, it contributes to the missing transverse momentum, an important signal-background discriminator in SUSY searches. Assuming mass splittings of \( \Delta \textrm{M} \le 20 \) GeV and \(\Delta \textrm{M} \le 60 \) GeV between the slepton and the LSP, slepton production was investigated at a collision energy of 14 TeV in events with missing energy, a dilepton pair, and either an initial-state-radiation (ISR) jet or a pair of VBF jets by the works [24,25,26], respectively.

The classic way to search for a SUSY signal is the cut-and-count method, which has served the field of particle physics well. The strategy is to apply cuts that eliminate as much of the SM background as feasible while retaining as many signal events as possible. However, it is limited by our capacity to grasp what we observe, and analysis techniques must keep up with the ever-increasing volume and complexity of the data recorded by the CMS and ATLAS experiments. Machine learning (ML) algorithms can help overcome these limitations. In fact, an ML approach is frequently more efficient than the traditional one, as it can process enormous datasets and return results in a reasonable amount of time. ML techniques can also be better at discriminating background from signal, since they can find patterns in the data that are difficult to detect otherwise. Hence, machine learning approaches enhance our ability to interpret data that are often multi-dimensional and complex. Work [27] investigates SUSY production in the low-mass region with machine learning algorithms and compares the results with the classical cut-and-count method used in [28,29]. The signal processes considered are chargino pair production, the mono-Z process, slepton pair production, and chargino pair production with slepton/neutrino-mediated decay. The results show that better sensitivities can be obtained with machine learning algorithms than with the classical cut-and-count method.

In recent years, there has been an increasing amount of literature on the application of machine learning algorithms in SUSY searches [30,31,32,33,34,35]. In one of these studies [33], the authors used machine learning algorithms to investigate SUSY production in a dilepton plus missing energy final state and neutral Higgs boson production in a final state with a single lepton and at least four jets, respectively. The study used low-level and high-level features, both combined and separately, to assess the classification performance of the model as well as the statistical significance, which is used in high-energy physics to judge whether there is a sign of new physics.

Work [34] uses neural networks to investigate a number of simplified dark matter models in events with a mono-jet and missing transverse energy in the final state. Rather than passing events to the algorithms one by one, as in the classical approach, the data are structured as 2D histograms. The histogrammed dataset is fed separately into a deep neural network (DNN) and a convolutional neural network (CNN) to build a model. It has been demonstrated that the CNN with 2D histograms slightly improves the efficiency compared to the DNN. The primary drawbacks of this type of application, however, are that it requires more data and training time and, consequently, more hardware resources.

Characterizing dark matter at colliders using machine learning techniques has been studied in [35]. The focus was on the monojet plus missing transverse energy (MET) channel, and a set of benchmark models beyond the Standard Model was proposed for the study. Various representations of the data were explored, either in event-by-event form or as imaged versions of the kinematic distributions, which were then fed into a logistic regression algorithm, a fully connected neural network, or deep and convolutional neural networks. All of these benchmarks were compared to each other and to the \(Z+jets\) SM background. It was found that using 2D images of the combined information of multiple events significantly improves the discrimination performance compared to a list of events with kinematical features.

The signal considered in this work is the production of a slepton pair, either left- or right-handed, from \(Z^*\) or \(\gamma ^{*}\) exchange in quark-antiquark or quark-gluon interactions, together with a single extra jet emitted from one of the incident partons (see Fig. 1). The slepton can be a left- or right-handed selectron or smuon. When a slepton pair is produced, both sleptons decay promptly to the LSP and same-flavor leptons (\(e^{+}e^{-}\) or \(\mu ^{+}\mu ^{-}\)). In the compressed mass spectrum scenario, where the mass difference between the slepton and the LSP is small, the final-state leptons are expected to be soft. Consequently, lepton reconstruction becomes a challenge, and the presence of soft decay products renders the signal nearly indistinguishable from SM processes. It is therefore necessary to have a highly energetic ISR jet recoiling against the sleptons, which increases the transverse momentum (\(p_T\)) of the pair-produced sleptons and their decay products, consisting of a \({{\tilde{\chi }}}_1^0\) pair and a same-flavor opposite-sign (SFOS) lepton pair. In other words, the signal is defined by a significant amount of missing energy from the LSPs, an SFOS lepton pair, and one hard jet in the final state (\( p p \rightarrow \tilde{l}^{+} \tilde{l}^{-} j \rightarrow l^{+} l^{-} {\tilde{\chi }}_{1}^{0} {\tilde{\chi }}_{1}^{0} j \)).

Fig. 1 Slepton pair (\({\tilde{\ell }} {\tilde{\ell }}\)) production mechanism along with an ISR jet emitted from one of the incident partons

The background process considered in this study is W-pair production (see Fig. 2 for the production mechanism). WW production becomes a background when each W decays into a lepton and a neutrino, the latter being a source of missing energy, so that the final state mimics the SUSY signal. Signal and background events are generated with the leading-order event generator MadGraph5_aMC@NLO version 2.6.7 [36] and then passed to Pythia 6 [37] for parton showering and hadronization, followed by detector simulation. The detector simulation is carried out with Delphes 3 [38], using the default Delphes card for the CMS detector [39] for the detector response. Signal and background samples are generated with up to two partons, and the MLM matching scheme [40] is applied to avoid double counting. The SUSY spectrum generator SUSY-HIT [41] is used to generate the parameter cards that serve as input to MadGraph for the signal production.

Recently, particle physicists have shown increased interest in the application of machine learning algorithms. However, no previous study has investigated the power of transfer learning (TL) in SUSY searches with compressed spectra. Hence, in this analysis, a binary classifier is built to distinguish the SM background from a mixture of SUSY signal and SM background. Two machine learning algorithms, support vector machines and logistic regression, are trained on features extracted through transfer learning. The training is performed on a signal sample with small mass splitting between the slepton and the lightest neutralino (\( \Delta m \equiv \Delta m(\tilde{l}, {\tilde{\chi }}_{1}^{0})=m(\tilde{l})-m\left( {\tilde{\chi }}_{1}^{0}\right) = 5 \) GeV).

The rest of the paper is organized as follows. Section 2 lays out the proposed method, providing details on the ML algorithms and techniques along with the signal benchmark point; the features analyzed and used to construct the histogrammed dataset are also explained. Section 3 presents the research findings and demonstrates the discrimination power of the technique employed for classifying signal-plus-SM and SM-only histograms. Section 4 briefly summarizes and critiques the findings.

Fig. 2 Representative Feynman diagram of WW pair production followed by their decays to the same flavor leptons and neutrinos

2 Methodology

The cut-and-count technique is the most common way to extract a signal from the background. However, as shown by the distributions in Fig. 3, in some cases the signal remains buried in the background, and applying cuts on specific features does not always result in effective signal extraction. In recent studies, signal and background have been classified either with conventional machine learning algorithms, deep neural networks, or convolutional neural networks. However, training deep neural networks from scratch can be challenging due to several limitations, such as class imbalance hindering the learning process, missing values, or unlabeled data. Moreover, training deep neural networks requires substantial computing resources, which can be costly and time-consuming; models built from scratch may also be less accurate and require more data for a good result [42,43,44]. In this work, instead, machine learning algorithms are used to classify histograms containing only the SM background against histograms mixing SUSY signal and SM background, after extracting features through transfer learning. This approach has several attractive features: less training time and thus less use of computational resources, a smaller amount of training data, and effective feature extraction. The strategy comprises two stages. In the first stage, a set of cuts characterizing the signal is applied. In the second stage, 2D histograms are generated, and a binary classifier is built with ML algorithms to separate the two types of histograms.

Fig. 3 Some kinematical distributions for signal and background obtained after applying the preselection cuts that make the signal stand out against the background

2.1 Logistic regression

Logistic regression, despite its name, is a classification model developed by David Cox in 1958 [45]. It is commonly employed for binary and multi-class classification tasks and performs exceptionally well for linearly separable classes. Classification is done with the logistic function, also known as the sigmoid function, which predicts the probability of the binary outcome; hence, the value returned by the logistic function lies between 0 and 1. The sigmoid function used by logistic regression is shown in Eq. (1). For a given \(x_n\), the probability \(p(x_n)\) determines the predicted target \(y_n\): when \(p(x_n) \ge 0.5\), \(y_n=1\), otherwise \(y_n=0\).

$$\begin{aligned} p =\frac{1}{1+e^{-x}} \end{aligned}$$
(1)
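As a minimal illustration of Eq. (1) in practice, the sketch below (assuming scikit-learn; the toy dataset and all variable names are illustrative, not part of the analysis) fits a logistic regression classifier and applies the 0.5 probability threshold described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy, roughly linearly separable data standing in for the extracted features.
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

clf = LogisticRegression().fit(X, y)

# p(x_n) from the sigmoid of Eq. (1); class 1 is assigned when p >= 0.5,
# which is the same threshold applied internally by clf.predict().
p = clf.predict_proba(X)[:, 1]
y_pred = (p >= 0.5).astype(int)
```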

2.2 SVM

The support vector machine (SVM) is another supervised learning technique that can be applied to classification and regression problems [46]. Its basic premise is to construct an optimal hyperplane in a multidimensional space that divides the classes and can then be used to predict which class a new example belongs to. The best possible hyperplane is obtained by maximizing the distance between the hyperplane and the nearest data points of any class; this hyperplane is also known as the maximum-margin hyperplane.
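A minimal sketch of this idea with scikit-learn is given below; the toy dataset, the RBF kernel, and the value of C are illustrative assumptions rather than the configuration used in the analysis (the tuned hyperparameters are listed in Table 1).

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Toy data; in the analysis the inputs are the PCA-reduced image features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# C balances margin width against misclassification; probability=True enables
# the probability estimates later needed for ROC/AUC curves.
svm = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.0, probability=True))
svm.fit(X, y)
margin_distance = svm.decision_function(X)  # signed distance to the separating hyperplane
```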

2.3 Transfer learning

Transfer learning, also known as transfer of learning, is the process of transferring acquired knowledge and skills from one domain to another. In the context of artificial intelligence and machine learning, the goal of transfer learning is to enhance performance and reduce the amount of computational power required by leveraging knowledge that has been previously learned. In recent years, transfer learning has become ubiquitous in computer vision and pattern recognition, in speech recognition, and in recommendation engines [47,48,49,50,51,52]. Transfer learning is also widely used in the field of high-energy physics. The study [53] investigates the use of transfer learning as a new approach to train emulators for relativistic heavy-ion collision simulations. The findings reveal that transfer learning is remarkably efficient and can substantially reduce the computational cost of building emulators. When training deep neural networks using simulations for a specific task, such as neutrino interaction classification, a significant number of simulated events is often required. Moreover, this can be computationally expensive, and the deep learning algorithm may underperform if sufficient events are unavailable. To address this issue, the study [54] examines the use of transfer learning, where a model pre-trained on generic image recognition tasks is fine-tuned using a set of simulated neutrino images for the specific task. The study used a ResNet18 model pre-trained on photographic images, fine-tuned it on simulated neutrino images, and achieved an F1 score of \(0.896 \pm 0.002\) with 100,000 training events. The paper [55] examines the potential of transfer learning techniques to develop efficient jet taggers from existing models. The primary objective was to investigate the ability of neural networks to learn the fundamental features of QCD and transfer them to a distinct task. Specifically, the study applied transfer learning to top tagging at varying transverse momentum thresholds and to the tagging of boosted objects with two or three prongs, such as top quark and W boson decays. Transfer learning may be effective particularly when the data sample is insufficient to build an image classification model with high accuracy.

2.3.1 Feature extraction

Feature extraction plays a pivotal role in the domain of transfer learning, wherein the knowledge acquired in one task or domain is utilized to enhance performance in another. In particular, models pre-trained on large image datasets, such as ImageNet, can be used as feature extractors for other image-related tasks. This approach uses the convolutional layers of a deep, complex, pre-trained CNN as a fixed feature extractor while removing the fully connected layers. The pre-trained model is then adapted to a new domain or task by adding a new classifier on top of the extracted features. In this way, the model can learn to identify and categorize objects in a novel domain using a relatively limited quantity of annotated data. This approach has been successfully applied in various domains, including medical image analysis, object detection, and natural language processing, and can significantly improve the performance of machine learning models in a wide range of applications [56,57,58,59].
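The sketch below illustrates this feature-extraction pattern, assuming TensorFlow/Keras and the ImageNet weights shipped with Keras; the image batch and its shapes are placeholders, not the analysis dataset.

```python
import numpy as np
import tensorflow as tf

# ResNet-50 without its fully connected (top) layers, so the convolutional
# stack acts as a fixed feature extractor with frozen ImageNet weights.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False

# A batch of images (random numbers here, standing in for histogram images).
images = np.random.rand(8, 224, 224, 3).astype("float32")
features = base.predict(images)              # shape (8, 7, 7, 2048)
flat = features.reshape(len(features), -1)   # flattened input for a new classifier
```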

2.4 Inception-v3

Inception-v3 is the third-generation, TensorFlow-based [60], 48-layer deep Inception model introduced by Google [61]. It has been trained on over a million images from the ImageNet dataset [62] and can be utilized as a feature extractor for machine learning algorithms in computer vision tasks. Inception-v3 follows the structure of Inception-v1; it outputs 1000 image classes and can be used to extract features from an image or as a mask to add objects to other images. The network has a relatively simple structure and is computationally efficient. This kind of feature extraction mainly aims to detect and recognize images faster and with higher accuracy.

2.5 ResNet-50

ResNet-50, a convolutional neural network architecture consisting of 50 layers, was introduced by the authors of [63] as a residual learning framework to address the issue of vanishing gradients in deep networks. Trained on the labeled subset of the ImageNet dataset, comprising 1.2 million images across 1000 classes, ResNet-50 is a valuable pre-trained model for various computer vision tasks. The residual connections in ResNet-50 enable the flow of information from the initial layer to the final layers, facilitating the construction of deeper networks without degrading performance. As a result, ResNet-50 has achieved state-of-the-art performance in tasks such as image classification, object detection, and segmentation. While more recent models have surpassed ResNet-50's performance on the ImageNet classification task, it remains a popular and effective architecture in the field of deep learning.

2.6 Deep learning

Deep learning is an area of machine learning that uses neural networks to develop models capable of learning from data and making predictions. Neural networks are collections of linked nodes that mimic the structure and operation of the human brain. A neural network is composed of multiple layers of nodes, with each layer conducting a unique operation on the data as it traverses the network. The simplest network, a shallow neural network, is composed of an input layer, a hidden layer, and an output layer; when the number of hidden layers is increased, the network is referred to as a deep neural network. Data are taken in by the input layer and sent on to the next processing layer. The output of one layer is used as input to the next, with each layer performing some computation on the output of the layer before it. Finally, the last layer produces the overall prediction or classification. The back-propagation procedure, which adjusts the weights of the connections between nodes, is used in a deep learning model to reduce the prediction error and improve the accuracy during training.

2.7 Convolutional neural networks

Convolutional neural networks are a type of deep learning architecture specifically engineered to process and analyze various forms of data, particularly images and videos, with the help of pattern recognition. CNNs have demonstrated exceptional performance in capturing spatial hierarchies and extracting significant features from input data, leading to their remarkable success in numerous computer vision tasks [64,65,66,67]. Designed to emulate the visual processing mechanism of the human brain, CNNs excel at recognizing and extracting intricate patterns from input data. This is achieved through the integration of multiple layers, including convolutional, pooling, and fully connected layers, which collaboratively perform the complex task of pattern recognition. Convolutional layers apply filters to localize and extract relevant features from the input data, while pooling layers downsample the extracted features while preserving essential information. Fully connected layers integrate the features to generate predictions or perform classifications.

2.8 Benchmark for SUSY signal

Results presented here are for the signal mass point of \(m_{\tilde{l}}=280\) GeV with \( \Delta M = 5\) GeV; however, heavier slepton masses could also be scanned. In order to decouple the production of colored particles and electroweakinos, their masses have been set to 10 TeV, far above the region of interest. Right- and left-handed slepton masses are assumed to be equal, and their branching fraction to same-flavor leptons is 100% \( \left( \tilde{e}^{+} \tilde{e}^{-}\left( {\tilde{\mu }}^{+} {\tilde{\mu }}^{-}\right) \rightarrow e^{+} e^{-}\left( \mu ^{+} \mu ^{-}\right) =100 \%\right) \).

Fig. 4 2D histograms of \( \Delta \phi \left( {\varvec{p}}_{T}^{\text {jet }}, E_{T}^{\text {miss }}\right) /\pi \) as a function of \( M_{T2}-\mu \). Top row, left to right: histogram of 25K SM-only events and histogram mixing SM background and signal at a ratio of \( S/B=0.001 \). Bottom row, left to right: histograms of signal and SM background mixed at ratios of \( S/B=0.01 \) and \( S/B=0.1 \). All distributions contain a total of 25K events

2.9 Proposed method

Since all the attributes have varying value ranges, they do not contribute equally to the model, which is problematic for machine learning algorithms. To resolve this, Scikit-learn's [68] "MinMaxScaler" class was employed to scale all the features to the range between 0 and 1. The following event selection cuts, which characterize the signal, are applied:

  1. Veto on tagged hadronically decaying \(\tau \)
  2. Require two same flavor opposite sign leptons with \( p_{T}>10 \) GeV and \( |\eta |<2.4 \)
  3. Veto events including b-jets with \( p_{T}>30 \) GeV and \( |\eta |<2.4 \)
  4. Require only one hard jet with \( p_{T}>60 \) GeV and reject any events with additional jets having \( p_{T}>30 \) GeV.

From the events surviving these cuts, 2D histograms were produced from pairwise combinations of all the features. However, most combinations provided no information for discriminating the SM-only and SM+signal cases. Since the \(m_{T2}\) variable and the azimuthal angle difference between the jet momentum and the missing transverse energy are highly distinctive for the SUSY signal, this pair of kinematic features provides additional information to separate the SM-only and mixed histograms; therefore, these two are used to construct the 2D histograms. Each histogram in this study was constructed from a total of 25,000 events with \(50\times 50\) bins. While histograms containing only SM background are built with 25K WW events, histograms mixing signal and background are constructed with S/B ratios of 0.001, 0.002, 0.003, 0.004, 0.005, 0.007, 0.0085, 0.01, 0.02, 0.03, 0.05, 0.07, 0.1, and 0.2. For each benchmark point and each class, 1K (10K) histograms are produced (see footnote 1). Each histogram is produced with a data augmentation technique that randomly selects the required number of samples from the full event sample; no histogram is allowed to contain the same event twice. Examples of the generated 2D histograms corresponding to \(S/B=0\), 0.001, 0.01, and 0.1 are shown in Fig. 4.
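A schematic sketch of how such a histogram could be assembled is shown below, assuming NumPy; the feature arrays, the Gaussian toy distributions, and the exact conversion of the S/B ratio into event counts are illustrative assumptions rather than the published procedure.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_histogram(sig, bkg, s_over_b, n_events=25_000, bins=50):
    """Build one 50x50 histogram of the two discriminating features,
    mixing signal and background events at the requested S/B ratio.
    Events are drawn randomly without replacement (data augmentation)."""
    n_sig = int(round(n_events * s_over_b / (1.0 + s_over_b)))
    n_bkg = n_events - n_sig
    chosen_sig = sig[rng.choice(len(sig), n_sig, replace=False)]
    chosen_bkg = bkg[rng.choice(len(bkg), n_bkg, replace=False)]
    mixed = np.vstack([chosen_sig, chosen_bkg])
    hist, _, _ = np.histogram2d(mixed[:, 0], mixed[:, 1], bins=bins)
    return hist

# Toy (m_T2 - mu, dphi(jet, MET)/pi) arrays standing in for the simulated samples.
signal = rng.normal(loc=[60.0, 0.9], scale=[20.0, 0.1], size=(100_000, 2))
background = rng.normal(loc=[10.0, 0.5], scale=[15.0, 0.25], size=(1_000_000, 2))

hist_mixed = make_histogram(signal, background, s_over_b=0.01)  # "signal + SM" class
hist_sm = make_histogram(signal, background, s_over_b=0.0)      # "SM only" class
```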

Fig. 5 The visual feature extraction process

Fig. 6 Distribution of training data mapped on the first three principal components with the highest explained variance ratio (EVR). Principal components were obtained after applying PCA on the features extracted through the Inception-v3 pre-trained model

The second stage of the proposed method is to utilize the pre-trained models Inception-v3 and ResNet-50 as feature extractors for the 2D histograms. Before feeding the images, stored as NumPy arrays, into the models, the pixel values were divided by 255, a vital pre-processing step. This step puts the pixel values in the range [0, 1], enabling the models to operate on a standardized input range and ensuring that each feature's scale is consistent across samples. By leveraging the pre-trained models with these normalized inputs, discriminative features were extracted from the images and used for the downstream tasks.

The visual feature extraction process is depicted in Fig. 5. Since the Inception-v3 and ResNet-50 models were pre-trained on millions of images, they extract features efficiently, resulting in improved overall performance with a shorter training time. The extracted features, of size \(5\times 5\times 2048\) and \(7\times 7\times 2048\) for Inception-v3 and ResNet-50, respectively, are flattened and passed to principal component analysis (PCA) for dimensionality reduction. After employing PCA, 2000 features remain for a single image, denoted by \( f_{PCA} \in {\mathbb {R}}^{1 \times 2000} \). The first three principal components, explaining more than \(23\%\) of the total variance, are plotted pairwise in Fig. 6. The PCA-reduced set is then used as input for the machine learning algorithms SVM and LR.
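The sketch below outlines this second stage under stated assumptions (TensorFlow/Keras, scikit-learn, toy histogram images, and a reduced number of PCA components so the toy sample size suffices); it is not the exact analysis code.

```python
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stack of 50x50 histogram "images" and their class labels.
hists = rng.poisson(5.0, size=(200, 50, 50)).astype("float32")
labels = rng.integers(0, 2, size=200)

# Scale pixel values to [0, 1], replicate to three channels, and resize to a
# size accepted by Inception-v3.
images = np.repeat((hists / 255.0)[..., np.newaxis], 3, axis=-1)
images = tf.image.resize(images, (224, 224)).numpy()

extractor = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                              input_shape=(224, 224, 3))
features = extractor.predict(images, batch_size=32)   # (200, 5, 5, 2048)
flat = features.reshape(len(features), -1)

# Dimensionality reduction; the analysis keeps 2000 components, here fewer
# are kept so the toy sample stays well-defined.
reduced = PCA(n_components=50).fit_transform(flat)

lr = LogisticRegression(max_iter=1000).fit(reduced, labels)
svm = SVC(probability=True).fit(reduced, labels)
```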

Fig. 7 The architecture of the CNN model

In order to optimize the hyperparameters, and thus ensure that the models perform under optimum conditions, the GridSearchCV tuning function from the scikit-learn library is used with fivefold cross-validation for the transfer learning and machine learning algorithms. After the fivefold grid search, the optimum hyperparameters giving the best area under the receiver operating characteristic curve (AUC) scores for logistic regression and support vector machines are selected; the hyperparameters and their optimized values for each algorithm are listed in Table 1. With the obtained optimum parameters, two ML models are built for every benchmark point. To avoid overfitting, the RepeatedKFold function from scikit-learn is used with five splits and five repeats: the whole dataset for each benchmark is split into five parts, four of which are used for training and one for testing, and this procedure is repeated five times for each benchmark with both ML algorithms. In the binary image classification model with the convolutional neural network, various layer configurations with different numbers of neurons were tried to determine the configuration yielding the highest AUC score. The final model consists of several Conv2D layers with rectified linear unit (ReLU) activation, which perform feature extraction from the input images. A \(3\times 3\) filter size was used for the convolution layers to capture spatial information effectively, and MaxPooling layers were employed to downsample the feature maps and reduce the spatial dimensions. Dropout layers with a rate of 0.2 were inserted after each convolution layer and before the dense layers to mitigate overfitting. The flattened feature maps were fed into fully connected dense layers, including a dense layer with 1024 units and ReLU activation; finally, a dense layer with a single unit and sigmoid activation was employed for binary classification. The model has a total of 6,611,969 trainable parameters, and its architecture and configuration are depicted in Fig. 7.
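A Keras sketch of a CNN of this type is given below; the filter counts and the exact layer ordering are illustrative and are not guaranteed to reproduce the quoted 6,611,969 trainable parameters (see Fig. 7 for the actual configuration).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(50, 50, 1)):
    """A CNN of the type described above: Conv2D(3x3, ReLU) blocks with max
    pooling and 0.2 dropout, a 1024-unit dense layer, and a sigmoid output.
    Filter counts are illustrative rather than the exact published configuration."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.2),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dropout(0.2),
        layers.Dense(1024, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

build_cnn().summary()
```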

Table 1 Optimum hyperparameters for SVM and LR

The CNN was trained using the Adam optimizer with a learning rate set to 0.001. Binary cross-entropy was chosen as the appropriate loss function for the binary classification task (Refer to Table 2 for the complete set of hyperparameters). To evaluate the performance of the CNN model, the area under the receiver operating characteristic curve was employed as a metric, which provides a comprehensive measure of classification accuracy. The scikeras library was utilized to wrap the Keras model, enabling seamless integration with scikit-learn’s cross-validation capabilities. For robust evaluation, a k-fold cross-validation approach with 10 splits was employed. The model was trained for 1000 epochs, with a batch size set to 300 during the training process.
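Reusing the build_cnn sketch above, the cross-validated training could look like the following, assuming the scikeras package; the toy images and random seed are placeholders, while the loss, optimizer, learning rate, epoch count, batch size, and number of splits follow the text.

```python
import numpy as np
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy histogram images and labels standing in for the 2D histogram dataset.
rng = np.random.default_rng(1)
X = rng.poisson(5.0, size=(400, 50, 50, 1)).astype("float32") / 255.0
y = rng.integers(0, 2, size=400)

# Wrap the Keras model so scikit-learn's cross-validation utilities can be used;
# scikeras compiles it with the given loss and optimizer settings.
clf = KerasClassifier(
    model=build_cnn,                 # architecture sketched above
    loss="binary_crossentropy",
    optimizer="adam",
    optimizer__learning_rate=0.001,
    epochs=1000,                     # as in the text; reduce for a quick test
    batch_size=300,
    verbose=0,
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC = {auc_scores.mean():.3f}")
```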

Additionally, the performance of a DNN was checked for the purpose of comparison. As the tabular data surviving the cuts mentioned in Sect. 2.9 proved insufficient for training the DNN models, synthetic data generation was employed: approximately 25,000 additional synthetic data points were generated using the Synthetic Minority Over-sampling Technique (SMOTE) [69]. Each class's tabulated data was constructed by randomly selecting background and signal events from the original tabulated dataset at varying ratios. For each ratio, 6000 "Signal+Background" and 6000 "Background Only" tabular data points were generated. Each "Signal+Background" data point corresponds to an ensemble of 1000 background and signal events, while each "Background Only" data point comprises 1000 background events. These ensemble data points were created with varying ratios of signal to background events, resulting in datasets with shapes of (6000, 1000, 12) and (6000, 1000, 2). The datasets were then subjected to PCA to reduce dimensionality while retaining approximately 93% of the original dataset's variance. These data preparation and dimensionality reduction steps ensure that the DNN model operates efficiently and effectively, with a dataset that captures the essential features while reducing the computational complexity. This carefully curated dataset, along with the optimized hyperparameters, facilitated the construction of robust DNN models for varying signal-to-background ratios, enabling a comprehensive performance evaluation.
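A schematic sketch of this step is given below, assuming imbalanced-learn's SMOTE and scikit-learn's PCA; the toy event table and the way events are grouped into ensemble data points are illustrative simplifications of the procedure described above.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Toy event-level table with 12 kinematic features and an under-represented
# signal class, standing in for the events surviving the preselection cuts.
X_bkg = rng.normal(size=(29_000, 12))
X_sig = rng.normal(loc=1.0, size=(1_000, 12))
X = np.vstack([X_bkg, X_sig])
y = np.concatenate([np.zeros(29_000, dtype=int), np.ones(1_000, dtype=int)])

# SMOTE synthesizes additional minority-class (signal) events.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# Ensemble data points: blocks of 1000 events, flattened before PCA.
events_per_point = 1000
n_points = len(X_res) // events_per_point
ensembles = X_res[: n_points * events_per_point].reshape(n_points, -1)

# Keep enough principal components to retain roughly 93% of the variance.
reduced = PCA(n_components=0.93, svd_solver="full").fit_transform(ensembles)
print(reduced.shape)
```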

In the binary classification model with the DNN, a systematic approach was employed for hyperparameter optimization to ensure optimal model performance. For each signal-to-background ratio, various hyperparameters were fine-tuned, including the learning rate, the dropout rate, and the loss function, as listed in Table 2. This tuning was carried out using the keras-tuner library [70], aiming to achieve the best model configuration. Following hyperparameter optimization, a total of five DNN models were constructed using fivefold cross-validation, each incorporating the optimized hyperparameters. To mitigate the risk of overfitting, a significant concern in deep learning, early stopping was implemented, halting training if there was no improvement in validation accuracy after ten epochs; the training process was capped at 120 epochs, ensuring that the models generalize well to unseen data. Subsequently, a comprehensive evaluation was conducted on a dedicated test set using these five models. This fivefold cross-validation procedure was repeated five times for each S/B ratio, providing a robust assessment of the DNN models' performance under varying conditions.
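A sketch of such a search is shown below, assuming the keras-tuner package; the hypermodel, its input shape, and the tunable ranges are illustrative stand-ins for the search spaces listed in Table 2, while the early-stopping patience and the 120-epoch cap follow the text.

```python
import numpy as np
import tensorflow as tf
import keras_tuner as kt

def build_dnn(hp):
    """Hypermodel for tabular inputs; ranges are illustrative placeholders."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(12,)),
        tf.keras.layers.Dense(hp.Int("units", 64, 512, step=64), activation="relu"),
        tf.keras.layers.Dropout(hp.Float("dropout", 0.0, 0.5, step=0.1)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

rng = np.random.default_rng(3)
X = rng.normal(size=(2_000, 12)).astype("float32")
y = rng.integers(0, 2, size=2_000)

tuner = kt.RandomSearch(build_dnn, objective="val_accuracy",
                        max_trials=10, overwrite=True, directory="tuning")
# Stop a trial if validation accuracy does not improve for ten epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10)
tuner.search(X, y, validation_split=0.2, epochs=120, callbacks=[early_stop], verbose=0)
best_model = tuner.get_best_models(num_models=1)[0]
```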

3 Results

The properties of the dataset utilized in this study, as well as the simulation results, are detailed in the sections that follow.

Table 2 Training parameters for the CNN and DNN models. The learning rates correspond to the S/B = 0.01 case for the DNN-12 (DNN-2) models, respectively

3.1 Dataset

The dataset used in this analysis for the production of the 2D histograms comprises two classes, "1" and "0", corresponding to the SUSY signal and the SM background, respectively. The total number of simulated events is 62 million, 8.5 million of which belong to the SUSY signal, while the rest are WW events. Therefore, the dataset can be expressed as follows:

$$\begin{aligned} D=\left\{ \left( x_{n}, y_{n}\right) \mid x_{n} \in {\mathbb {R}}^{1 \times M}, y_{n} \in \{0,1\}\right\} \end{aligned}$$

Here, \(M=2\), \(n = 1\dots N\), and N represents the total number of raw data samples before the application of any pre-selection cuts, which is 62 million. The dataset includes the kinematical variables listed in Table 3 as features. Seven of them are low-level features: the missing transverse momentum and the transverse momenta and pseudorapidities of the two leptons and the jet. The rest are high-level features, such as the azimuthal angle differences between either lepton and the missing transverse momentum, between the two leptons, and between the jet transverse momentum and the missing transverse momentum. The most discriminating feature between signal and background is the difference between the stransverse mass \(m_{T2}\) and the trial mass (\(\mu \)) for a given LSP mass. \(m_{T2}\) provides a lower limit on the mass of the mother particle (the slepton) based on the masses and kinematics of the visible and invisible decay products. To compute the \(m_{T2}\) variable, the bisection algorithm provided by the authors of [71] was employed. It is presumed that the final-state leptons carry the same flavor and originate from same-flavor sleptons.

Table 3 Low-level and High-level features

Reducing the size of the datasets is essential for achieving a more manageable training time as well as preventing the consumption of excessive computing resources during training. Thus, before feeding the data into the machine learning classifiers, the preselection cuts listed in Sect. 2.9 were applied to the original dataset, which also helps the signal stand out against the SM background. Some kinematical distributions obtained after applying these cuts are displayed in Fig. 3.

3.2 Simulation findings

The simulations involving transfer learning and pure SVM/LR are conducted on a Windows 10 PC with 64-bit architecture. The PC is equipped with an i5-8250U CPU running at a clock speed of 1.8 GHz and has 32 GB of RAM. For feature extraction using Inception-v3 and ResNet-50, a dedicated NVIDIA GeForce 940MX GPU with a total of 6 GB of RAM is utilized, while the machine learning algorithms are executed on the CPU. However, the CNN and DNN models are trained on Kaggle.com, since more GPU RAM is required for building these models. For the utilization of the GPU, the NVIDIA CUDA 11.2.2 toolkit is used in conjunction with the NVIDIA CUDA Deep Neural Network library v8.1.0.77. TensorFlow v2.9.1 and Keras v2.9.0 libraries are also imported.

Fig. 8 AUC values for each classifier in relation to the \(S/\sqrt{B}\). The presented results were derived using transfer learning (TL), as well as Deep Neural Networks (DNN) and Convolutional Neural Networks (CNN). The green and pink lines represent DNN and CNN, while the blue and red lines denote Logistic Regression (LR) and Support Vector Machine (SVM) classifiers, respectively, when leveraging the pre-trained ResNet-50 as a feature extractor

Table 4 Performance metrics for each benchmark point evaluated with SVM, LR, CNN, and DNN. The 'DNN-12' column presents results from a model trained on a tabular dataset constructed with all twelve low- and high-level features, while 'DNN-2' corresponds to a model trained on a tabular dataset with the two features also used to construct the 2D histogrammed data

The performance of pure support vector machines and logistic regression, as well as transfer learning, deep neural networks, and convolutional neural networks, was assessed for different signal-to-background ratios. The results of these evaluations are presented in Tables 4, 5 and 6. While the area under the curve (AUC) is used as the performance metric for this study, accuracy, recall, and F1-score for the pre-trained models are also provided for reference. The AUC is the area under the receiver operating characteristic (ROC) curve and ranges between 0 and 1: a value below 0.5 indicates that the model fails at classification, while 1 corresponds to a perfect classifier. A classifier with an AUC value of 0.7 is considered successful, as there is a 70% chance that the model will rank a randomly chosen signal-class example above a randomly chosen background-class example.

Table 5 Performance metrics for every benchmark point after applying transfer learning with Inception-v3
Table 6 Performance metrics for every benchmark point after applying transfer learning with ResNet-50

Using the ROC curve, plotted as the signal efficiency (\(\epsilon _s\)) versus the background rejection (\(1/\epsilon _b\)), a mean AUC score is calculated for each benchmark point and each classifier from fivefold cross-validation with five repetitions (Fig. 8).
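For reference, the sketch below shows how such fold-wise and mean AUC scores can be obtained with scikit-learn; the toy features and the logistic regression stand-in are illustrative, not the actual PCA-reduced inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(2_000, 50))
y = rng.integers(0, 2, size=2_000)

# Mean AUC from fivefold cross-validation repeated five times, as described.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=cv, scoring="roc_auc")
print(f"mean AUC = {aucs.mean():.3f} +/- {aucs.std():.3f}")

# ROC curve for one fitted model: tpr is the signal efficiency (eps_s),
# fpr the background efficiency (eps_b).
clf = LogisticRegression(max_iter=1000).fit(X[:1500], y[:1500])
scores = clf.predict_proba(X[1500:])[:, 1]
fpr, tpr, _ = roc_curve(y[1500:], scores)
print("AUC on held-out events:", roc_auc_score(y[1500:], scores))
```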

The AUC values obtained using the DNN and pure support vector machines or logistic regression in Tables 4, 5, 6 and Fig. 8 demonstrate comparable or higher performance than transfer learning and convolutional neural networks. Nevertheless, at specific benchmark points the performance of the two classifiers using transfer learning differs only minimally, surpassing that of the CNN or falling within the statistical uncertainty. The observed trend confirms the expectation that the models' performance improves as the signal-to-background (S/B) ratio increases. Although both models utilizing transfer learning perform strongly at most benchmark points, neither effectively classifies the \(S/B\approx 0.001\) case. However, an AUC value greater than 0.7 is achieved with transfer learning for \(S/B > 0.003\). Additionally, \(S/\sqrt{B}\) was calculated for each benchmark point and is presented as a function of AUC (see Fig. 8); \(S/\sqrt{B}\) quantifies whether there is evidence for new physics in the histograms for the given numbers of background and signal samples.

Fig. 9 Training and validation loss versus epoch for \(S/B =0.02\) with the DNN-12 model. Both curves indicate the DNN's ability to adapt to the data, showcasing its capacity to capture meaningful patterns and generalize to unseen examples

In addition to transfer learning and other machine learning approaches, in-depth training of a deep neural network was also conducted. The training process was carried out using TensorFlow and Keras on dedicated hardware, making use of a tensor processing unit (TPU) to expedite computations. The performance of the DNN was monitored through the analysis of loss trends during training. The training and validation losses were recorded at each epoch to evaluate the model’s convergence and generalization capabilities. Figure 9 showcases the training and validation loss curves as a function of epoch for the model trained with all twelve features. These plots provide valuable insights into the DNN’s learning behavior and its ability to adapt to the data. The DNN’s performance is assessed using the AUC score. This evaluation employs a five-fold cross-validation approach, where five models are constructed and tested on the previously reserved test data. The AUC scores obtained from each of these five models on the test set are averaged and presented in Table 4.

4 Conclusion

In this work, machine learning models were built to search for new physics using 2D images constructed from a signal and a standard model background mixed at different ratios. While the study is conducted with a single signal benchmark and a single standard model background, it could be expanded by using signals from different models, including more signal benchmark points, and/or adding other standard model backgrounds. The present study is designed to demonstrate that the transfer learning method for signal and background classification is efficient in terms of time and computing resources, yet highly accurate.

The performance of pure SVM/LR, CNN, and DNN was also checked for comparison. DNN and pure SVM/LR seem to yield better results for low S/B ratios; however, their training time and use of computing resources are enormous compared to transfer learning. In certain instances, training models solely with SVM and LR demands extensive computational resources, exceeding 27 GB of RAM. This intensive resource utilization not only renders the application of these algorithms to larger datasets infeasible but also conflicts with the core objective of this study. In addition, building models with pure SVM/LR in some cases took much longer than the proposed approach; specifically, the training time for some pure LR models exceeded that with transfer learning by a factor of 110. One key factor contributing to this disparity is the absence of GPU support for SVM and LR in the scikit-learn library, and the developers of scikit-learn have no plans to implement it in the near future [72]. Thus, these algorithms may not be practical for large amounts of data, as they cannot take advantage of the high computational power of GPUs.

Since there exists a noticeable distinction between this study and previous works [33,34,35] in terms of the signal models, benchmark points, and backgrounds employed, a direct comparison of results is not feasible. However, drawing from those studies, it can be inferred that convolutional neural networks or plain neural networks either demonstrate modest outcomes or necessitate a larger amount of data. In contrast, the methodology presented here, based on transfer learning, offers advantages such as faster implementation and reduced computing resources, avoiding the need to train the model from scratch or to gather additional data.

Starting from the feature extraction step from the images using transfer learning, followed by applying PCA and training the two models, the whole process took between 60 and 200 s with Inception-v3 and between 300 and 1200 s with ResNet-50. The variation in time within each pre-trained model is due to the change in sample size between benchmarks, while the variation between the models is due to their structure. Tables 5 and 6 show that ResNet-50 performs slightly better than Inception-v3 in terms of AUC, F1 score, and recall most of the time, suggesting that deeper networks may perform better than shallower ones. Although the Inception-v3 pre-trained model resulted in slightly worse performance than ResNet-50, it may be worthwhile to start an investigation with Inception-v3, owing to its faster training time, to determine whether any new physics is present before committing to further investigation with better-performing pre-trained models.

Two different machine learning algorithms were employed for the classification task, with AUC chosen as the performance metric. The AUC score is computed for each fold as well as for each benchmark using the ROC curve. Figure 10 illustrates the ROC curve for one of the benchmark points, \(S/B=0.01\), along with the AUC values of each fold and the mean AUC value obtained with ResNet-50. The two machine learning algorithms, SVM and logistic regression, perform almost identically for every benchmark point. Starting from a signal-to-background ratio of 0.003, the model trained with ResNet-50 features begins to show promise in the classification task. With higher collision energy and increased luminosity, the expected number of signal events after the aforementioned cuts will be significantly higher and, as a result, the signal-to-background ratio will be higher. In such cases, the models developed here will, as evident from Fig. 8, exhibit high performance. Consequently, analogous models for different SUSY models and different benchmark points can be built and used with real data in order to discover physics beyond the standard model.

This study comprises an extensive examination of diverse machine learning techniques, encompassing transfer learning and deep neural networks, to address the complex task of signal-background classification in the realm of high-energy physics. The visualization of the training and validation loss trends across increasing epoch numbers, depicted in Fig. 9, offers valuable insights. These curves indicate the DNN’s ability to adapt to the data, showcasing its capacity to capture meaningful patterns and generalize to unseen examples. A consistent reduction in both training and validation losses was observed throughout the training process, underscoring the effectiveness of the DNN approach for the classification of the signal+background class from only the background class.

High-energy physics data often have complex structures, such as particle collision events with multiple particles and interactions. Transfer learning can be used to extract relevant discriminative features or to create new high-level features from existing ones. In this context, the process can be highly valuable for BSM studies, particularly for detecting and separating a signal from SM backgrounds. Although the pre-trained Inception-v3 and ResNet-50 are utilized as feature extractors in this work, the study could also be expanded to other pre-trained models, such as DenseNet [73] and VGG16 [74]. In addition, it may be worthwhile to try fine-tuning the pre-trained models rather than using them purely as fixed feature extractors, not to improve the training time but for the sake of accuracy.

Fig. 10 ROC curves and AUC values corresponding to the \(S/B = 0.01\) benchmark for each fold