1 Introduction

Imbalanced datasets are frequent occurrences in a large spectrum of fields where Machine Learning (ML) has found its applications, including business, finance and banking, as well as bio-medical science. Oversampling approaches are a popular choice to deal with imbalanced datasets (Chawla et al. 2002; Han et al. 2005; Haibo et al. 2008; Bunkhumpornpat et al. 2009; Barua et al. 2014). We here present Localized Randomized Affine Shadowsampling (LoRAS), which produces better ML models for imbalanced datasets compared to state-of-the-art oversampling techniques such as SMOTE and several of its extensions. We use computational analyses and a mathematical proof to demonstrate that drawing samples from a locally approximated data manifold of the minority class can produce balanced classification ML models. We validated the approach with 14 publicly available imbalanced datasets, comparing the performances of several state-of-the-art convex-combination based oversampling techniques with LoRAS. The average performance of LoRAS on all these datasets is better than that of the other oversampling techniques we investigated. In addition, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique, since it provides a better estimate of the local mean of the underlying data distribution in some neighbourhood of the minority class data space.

For imbalanced datasets, the number of instances in one (or more) class(es) is very high (or very low) compared to the other class(es). A class having a large number of instances is called a majority class and one having far fewer instances is called a minority class. This makes it difficult to learn from such datasets using standard ML approaches. Oversampling approaches are often used to counter this problem by generating synthetic samples for the minority class to balance the number of data points for each class. SMOTE is a widely used oversampling technique, which has received various extensions since it was published by Chawla et al. (2002). The key idea behind SMOTE is to randomly sample artificial minority class data points along line segments joining a minority class data point and one of its k nearest minority class neighbours. In other words, SMOTE produces oversamples by generating random convex combinations of two close enough data points.
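As a minimal sketch of this convex-combination step (our own illustration, not the reference SMOTE implementation; the helper name smote_sample and its defaults are ours):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, rng=np.random.default_rng(42)):
    # One SMOTE-style synthetic point: a random convex combination of a
    # minority point and one of its k nearest minority-class neighbours.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    i = rng.integers(len(X_min))
    neighbours = nn.kneighbors(X_min[i:i + 1], return_distance=False)[0][1:]  # drop self
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation coefficient in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])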

The SMOTE algorithm, however, has several limitations; for example, it does not consider the distribution of the minority class and latent noise in a dataset (Hu et al. 2009). It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model (Puntumapon and Waiyamai 2012). Several other limitations of SMOTE are mentioned in Blagus and Lusa (2013). To overcome such limitations, several algorithms have been proposed as extensions of SMOTE. Some focus on improving the generation of synthetic data by combining SMOTE with other oversampling techniques, including the combination of SMOTE with Tomek-links (Elhassan et al. 2016), particle swarm optimization (Gao et al. 2011; Wang et al. 2014), rough set theory (Ramentol et al. 2012), kernel based approaches (Mathew et al. 2015), Boosting (Chawla et al. 2003), and Bagging (Hanifah et al. 2015). Other approaches choose subsets of the minority class data to generate SMOTE samples or cleverly limit the number of synthetic data points generated (Santoso et al. 2017). Some examples are Borderline1/2 SMOTE (Han et al. 2005), ADAptive SYNthetic (ADASYN) (Haibo et al. 2008), Safe Level SMOTE (Bunkhumpornpat et al. 2009), Majority Weighted Minority Oversampling TEchnique (MWMOTE) (Barua et al. 2014), Modified SMOTE (MSMOTE) (Hu et al. 2009), and Support Vector Machine-SMOTE (SVM-SMOTE) (Suh et al. 2017) (see Table 1). Another recent method, G-SMOTE, generates synthetic samples in a geometric region of the input space around each selected minority instance (Douzas and Bacao 2019). Voronoi diagrams have also been used in recent research to improve classification tasks for imbalanced datasets. Because of properties inherent to Voronoi diagrams, the newly proposed V-synth algorithm identifies exclusive regions of the feature space where it is ideal to create synthetic minority samples (Young et al. 2015; Carvalho and Prati 2018).

Related research and novelty A more recent trend in research on imbalanced datasets is to generate synthetic samples aiming to approximate the latent data manifold of the minority class data space. In Bellinger et al. (2018), a general framework for manifold-based oversampling, especially for high dimensional datasets, is proposed. The method has been successfully applied in Bellinger et al. (2016) to deal with gamma-ray spectra classification. It produces a synthetic set S of n instances in the manifold-space by randomly sampling n instances from the PCA-transformed reduced data space. In order to produce unique samples on the manifold, they apply i.i.d. additive Gaussian noise to each sampled instance prior to adding it to the synthetic set S, controlling the distribution of the noise through the Gaussian distribution parameters. The synthetic Gaussian instances are then mapped back to the feature space to produce the final synthetic samples (Bellinger et al. 2018). Another scheme, using auto-encoders to oversample from an approximated manifold, has also been discussed in Bellinger et al. (2018). This approach selects random minority class samples, adds Gaussian noise to them, and, using the auto-encoder framework, first maps them non-orthogonally off the manifold and then maps them back orthogonally onto the manifold (Bellinger et al. 2018). It remains unclear from this research how the approach would perform in terms of improving F1-Scores of imbalanced classification models, as it focuses on relative improvement in the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) as a performance measure. According to Saito and Rehmsmeier (2015), the AUC of the ROC curve might not be informative enough for imbalanced datasets. This issue has also been addressed in Davis and Goadrich (2006). Unlike the work of Bellinger et al. (2018), LoRAS relies on locally approximating the manifold by generating random convex combinations of noisy minority class data points. Our oversampling strategy rather aims at improving the precision-recall balance (F1-Score) and class-wise average accuracy (Balanced accuracy) of the ML models used. The F1-Score measures how well the classification model handles the minority class, whereas Balanced accuracy measures how both majority and minority classes are handled. Thus, these two measures together give us a holistic understanding of classifier performance on a dataset.

Notably, in the pre-SMOTE era of research related to oversampling, there has been work aiming to enrich minority classes of imbalanced datasets by adding Gaussian noise (Lee 2000) and using the noisy data itself as oversampled data. The strategy of generating oversamples as convex combinations of minority class samples is also well known, SMOTE itself being an example of such a strategy. Our oversampling strategy LoRAS leverages a combination of these two strategies. Unlike Lee (2000), we generate Gaussian noise in small neighbourhoods around the minority class samples and create our final synthetic data as convex combinations of multiple noisy data points (shadowsamples), as opposed to SMOTE-based strategies, which combine only two minority class data points. Adding the shadowsamples allows LoRAS to produce a better estimate of the local mean of the latent minority class data distribution.

We also provide a mathematical framework to show that convex combinations of multiple shadowsamples can provide a proper estimate of the local mean of a neighbourhood in the minority class data space. To be specific, a LoRAS oversample is an unbiased estimator of the mean of the underlying local probability distribution followed by a minority class sample (assuming that it is some random variable), such that the variance of this estimator is significantly less than that of a SMOTE-generated oversample, which is also an unbiased estimator of the same mean. In addition to this, LoRAS provides the option of choosing the neighbourhood of a minority class data point by performing prior manifold learning over the minority class using t-Stochastic Neighbourhood Embedding (t-SNE) (van der Maaten and Hinton 2008). t-SNE is a state-of-the-art algorithm for dimension reduction that maintains the underlying manifold structure, in the sense that in a lower dimension t-SNE can cluster points that are close enough in the latent high dimensional manifold. It uses a symmetric version of the cost function of its predecessor technique, Stochastic Neighbourhood Embedding (SNE), and uses a Student-t distribution rather than a Gaussian to compute the similarity between two points in the low-dimensional space. t-SNE employs a heavy-tailed distribution in the low-dimensional space to alleviate both the crowding problem and the optimization problems of SNE (van der Maaten and Hinton 2008; Hinton and Roweis 2003).

To date there are at least eighty-five extension models built on SMOTE (Kovács 2019). Considering the large number of benchmark datasets explored in our study, it was necessary to shortlist certain oversampling algorithms for a comparative study. We found quite a few studies that have applied or explored SMOTE and extensions of SMOTE such as the Borderline1/2 SMOTE models, ADASYN, and SVM-SMOTE (Suh et al. 2017; Ah-Pine and Soriano-Morales 2016; Adiwijaya and Saonard 2017; Chiamanusorn and Sinapiromsaran 2017; Wang et al. 2014; Le et al. 2019). Moreover, all these oversampling strategies are focused on oversampling from the convex hull of small neighbourhoods in the minority class data space, a similarity they share with our proposed approach. Considering these factors, we chose to focus on these five oversampling strategies for a comparative study with our oversampling technique LoRAS.

Table 1 Popular algorithms built on SMOTE

2 LoRAS: localized randomized affine shadowsampling

In this section we discuss our strategy to approximate the data manifold, given a dataset. A typical dataset for a supervised ML problem consists of a set of features \(F=\{f_1, f_2, \ldots \}\) that are used to characterize patterns in the data, and a set of labels or ground truth. Ideally, the number of instances or samples should be significantly greater than the number of features. In order to maintain the mathematical rigor of our strategy, we propose the following definition of a small dataset.

Definition 1

Consider a class or the whole dataset with n samples and |F| features. If \(\log _{10}(\frac{n}{|F|})<1\), then we call the dataset a small dataset.
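As a toy illustration (our own helper, not part of the algorithm), Definition 1 amounts to requiring fewer than ten samples per feature, since \(\log _{10}(\frac{n}{|F|})<1\) is equivalent to \(n<10|F|\):

import numpy as np

def is_small_dataset(n_samples, n_features):
    # Definition 1: log10(n/|F|) < 1, i.e. fewer than 10 samples per feature
    return np.log10(n_samples / n_features) < 1

is_small_dataset(50, 20)    # True:  2.5 samples per feature
is_small_dataset(5000, 20)  # False: 250 samples per feature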

The LoRAS algorithm is designed to learn from a dataset by approximating the underlying data manifold. Assuming that F is the best possible set of features to represent the data and that all features are equally important, we can think of a data oversampling model as a function \(g: \prod _{i=1}^{l} R^{|F|} \rightarrow R^{|F|}\); that is, g uses l parent data points (each with |F| features) to produce an oversampled data point in \(R^{|F|}\).

Definition 2

We define a random affine combination of some arbitrary vectors as the affine linear combination of those vectors such that the coefficients of the linear combination are chosen randomly. Formally, a vector v, \(v=\alpha _1u_1+\cdots +\alpha _mu_m\), is a random affine combination of vectors \(u_1,\ldots ,u_m\) (\(u_j\in R^{|F|}\)) if \(\alpha _1+\cdots +\alpha _m=1\), \(\alpha _j\in R^{+}\), and the coefficients \(\alpha _1,\ldots ,\alpha _m\) of the affine combination are chosen randomly from a Dirichlet distribution.
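A minimal numpy sketch of Definition 2, assuming (as we do later in the paper) that all Dirichlet concentration parameters equal 1:

import numpy as np

rng = np.random.default_rng(0)
m, F = 5, 10
u = rng.normal(size=(m, F))        # m arbitrary vectors in R^{|F|}
alpha = rng.dirichlet(np.ones(m))  # random positive coefficients summing to 1
v = alpha @ u                      # v = alpha_1*u_1 + ... + alpha_m*u_m
assert np.isclose(alpha.sum(), 1.0)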

The simplest way of augmenting a data point would be to take the average (or a random affine combination with positive coefficients, as defined in Definition 2) of two data points as an augmented data point. But when we have |F| features, we can assume that the hypothetical manifold on which our data lies is |F|-dimensional. An |F|-dimensional manifold can be locally approximated by a collection of \((|F|-1)\)-dimensional planes.

Given |F| sample points in general position, we can exactly derive the equation of a unique \((|F|-1)\)-dimensional plane containing these |F| sample points. Note that a small neighbourhood of a dataset can itself be considered a small dataset: a neighbourhood of k points around a data point, for sufficiently small k, satisfies Definition 1, that is, k and |F| satisfy \(\log _{10}(\frac{k}{|F|})<1\). To enrich this small dataset, we create shadow data points, or shadowsamples, from our k parent data points in the minority class data point neighbourhood. Each shadow data point is generated by adding noise from a normal distribution, \(\mathscr {N}(0, h(\sigma _f))\), for every feature \(f \in F\), where \(h(\sigma _f)\) is some function of the sample variance \(\sigma _f\) of the feature f. For each of the k data points we can generate m shadow data points such that \(k \times m \gg |F|\). It is then possible to choose |F| shadow data points from the \(k \times m\) shadow data points even if \(k<|F|\). We choose the |F| shadow data points as follows: we first choose a random parent data point p and then restrict the domain of choice to the shadowsamples generated by the parent data points in \(N_k^p\).
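The shadowsample step can be sketched as follows; the helper name make_shadowsamples is ours, and we treat \(h(\sigma _f)\) as a per-feature noise scale (our experiments in Sect. 3 use the constant 0.005):

import numpy as np

def make_shadowsamples(parents, m, scale=0.005, rng=np.random.default_rng(1)):
    # parents: (k, |F|) array of parent points in one neighbourhood.
    # Each parent is blurred into m shadowsamples by adding feature-wise
    # N(0, scale) noise, giving k*m shadowsamples in total.
    k, f = parents.shape
    noise = rng.normal(0.0, scale, size=(m, k, f))
    return (parents[None, :, :] + noise).reshape(m * k, f)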

For high dimensional datasets, choosing the k-nearest neighbours of a data point using simple Euclidean, Manhattan, or general Minkowski distance measures can be misleading in terms of approximating the latent data manifold. To avoid this, we propose to adopt a manifold learning based strategy. Before choosing the k-nearest neighbours of a data point, we perform a dimension reduction on the data points of the minority class using the well-known dimension reduction and manifold learning technique t-SNE (van der Maaten and Hinton 2008). Once we have a two dimensional t-SNE embedding of the minority class data, we choose the k-nearest neighbours of a particular data point consistent with its k-nearest neighbours (measured with the usual distance metrics) in the 2-dimensional t-SNE embedding of the minority class.
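A sketch of this neighbourhood selection with scikit-learn (our own helper, assuming a 2-D t-SNE embedding of the minority class):

import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

def tsne_neighbourhoods(X_min, k=30, perplexity=30.0):
    # Embed the minority class in 2-D, then find each point's k nearest
    # neighbours in the embedding rather than in the raw feature space.
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=42).fit_transform(X_min)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(emb)
    return nn.kneighbors(emb, return_distance=False)[:, 1:]  # drop self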

Once we choose our neighbourhood and generate the shadowsamples, we take a random affine combination with positive coefficients (a convex combination) of the |F| chosen shadowsamples to create one augmented Localized Random Affine Shadowsample, or LoRAS sample, as defined in Definition 2. Considering the arbitrarily low variance that we can choose for the normal distribution from which we draw our shadowsamples, we assume that our shadowsamples lie in the latent data manifold itself. This is a practical assumption, considering the stochastic factors leading to small measurement errors. Now, there exists a unique \((|F| - 1)\)-dimensional plane that contains the |F| shadowsamples, which we assume to be an approximation of the latent data manifold in that small neighbourhood. Thus, a LoRAS sample is an artificially generated sample drawn from an \((|F| - 1)\)-dimensional plane, which locally approximates the underlying hypothetical |F|-dimensional data manifold. It is worth mentioning here that the effective number of features in a dataset is often less than |F|: in high dimensional data there are often correlated features or features with low variance. Thus, for practical use of LoRAS, one might consider generating convex combinations of a number of shadowsamples equal to the effective number of features, which might be less than |F|.

Algorithm 1 LoRAS oversampling (pseudocode)

In this article, all imbalanced classification problems that we deal with are binary classification problems. For such a problem, there is a minority class \(C_{\text {min}}\) containing relatively few samples compared to a majority class \(C_{\text {maj}}\). We can thus consider the minority class as a small dataset and use the LoRAS algorithm to oversample it. For every data point p, we denote the set of shadowsamples generated from p as \(S_p\). In practice, one can also choose \(2 \le N_{\text {aff}} \le |F|\) shadowsamples for an affine combination and choose a desired number of oversampled points \(N_{\text {gen}}\) to be generated using the algorithm. We describe LoRAS as an oversampling algorithm in Algorithm 1.

The LoRAS algorithm thus described can be used for oversampling of minority classes in the case of highly imbalanced datasets. Note that the input variables for our algorithm are: the number of nearest neighbours per sample \(\texttt {k}\), the number of generated shadow points per parent data point \({|\texttt {S}_{\mathrm{p}}|}\), the list of standard deviations of the normal distributions used for adding noise to every feature and thus generating the shadowsamples \({\texttt {L}_{\sigma }}\), the number of shadowsamples to be chosen for affine combinations \({\texttt {N}_{\mathrm{aff}}}\), the number of generated points for each nearest-neighbour group \({\texttt {N}_{\mathrm{gen}}}\), and the embedding strategy \({\texttt {embedding}}\). There is a conditional input variable \(\texttt {perplexity}\), which takes a positive numerical value if one chooses a t-SNE embedding. The perplexity parameter of the t-SNE algorithm is quite crucial, since it can influence the embedding calculated by the t-SNE algorithm. Several studies address the issue of finding the right perplexity parameter for a given problem (Kobak and Berens 2019). That is why we recommend a careful choice of this parameter in order to leverage more from our algorithm. Another important parameter of our algorithm is \({\texttt {N}_{\mathrm{aff}}}\). For this parameter, an ideal choice would be the number of effective features in a dataset, since this number is a reasonable approximation of the dimension of the underlying data manifold. One could employ a feature selection technique to find a good estimate for it. A simple random grid search is also very helpful to get reasonably good estimates of these parameters. We have listed all the default values of the LoRAS parameters in Algorithm 1, showing the pseudocode for the LoRAS algorithm. As output, our algorithm generates a LoRAS dataset for the oversampled minority class, which can subsequently be used to train an ML model (Fig. 1).
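Putting the pieces together, the following is a compact sketch of Algorithm 1 for illustration; it omits the t-SNE embedding option, uses illustrative names and defaults, and is not the exact implementation used for our experiments:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def loras_oversample(X_min, k=30, n_shadow=40, sigma=0.005,
                     n_aff=None, n_gen=10, rng=np.random.default_rng(7)):
    # For each minority point: find its k nearest minority neighbours,
    # blur each neighbour into n_shadow shadowsamples with Gaussian noise,
    # then emit n_gen Dirichlet convex combinations of n_aff shadowsamples.
    n, f = X_min.shape
    n_aff = n_aff or f  # default: |F| shadowsamples per combination
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    idx = nbrs.kneighbors(X_min, return_distance=False)  # each row includes the point itself
    samples = []
    for p in range(n):
        parents = X_min[idx[p]]             # the neighbourhood N_k^p (k+1 points)
        shadows = (parents[:, None, :] +
                   rng.normal(0.0, sigma, size=(k + 1, n_shadow, f))).reshape(-1, f)
        for _ in range(n_gen):
            chosen = shadows[rng.choice(len(shadows), size=n_aff, replace=False)]
            alpha = rng.dirichlet(np.ones(n_aff))  # convex coefficients, sum to 1
            samples.append(alpha @ chosen)         # one LoRAS sample
    return np.asarray(samples)

In the experiments below, \({\texttt {L}}_{\sigma }\) is the constant 0.005 and \({\texttt {N}_{gen}}\) is chosen so that the oversampled minority class roughly matches the majority class in size.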

Fig. 1

Visualization of the workflow demonstrating a step-by-step explanation of LoRAS oversampling. a Here, we show the parent data points of the minority class \(C_{\text {min}}\). For a data point p, we choose three of its closest neighbors (using knn) to build a neighborhood of p, depicted as the box, b extracting the four data points in the closest neighborhood of p (including p), c drawing shadow points from normal distributions centered at these parent data points, d we randomly choose three shadow points at a time to obtain a random affine combination of them (spanning a triangle). We finally generate a novel LoRAS sample point from the neighborhood of the single data point p

3 Case studies

For testing the potential of LoRAS as an oversampling approach, we designed benchmarking experiments with a total of 14 datasets that are either highly imbalanced, high dimensional, or have a small number of data points. With this number of diverse case studies, we should have a comprehensive idea of the advantages of LoRAS over the other oversampling algorithms of our interest.

3.1 Datasets used for validation

Here we provide a brief description of the datasets and the sources that we have used for our studies.

Scikit-learn imbalanced benchmark datasets The imblearn.datasets package complements the sklearn.datasets package. It provides 27 pre-processed datasets, which are imbalanced. The datasets span a large range of real-world problems from several fields such as business, computer science, biology, medicine, and technology. This collection of datasets was proposed in the imblearn.datasets python library by Lemaître et al. (2017) and benchmarked by Ding (2011). Many of these datasets have been used in various research articles on oversampling approaches (Ding 2011; Saez et al. 2016). A statistically reliable benchmarking analysis of all 27 datasets in a stratified cross validation framework involves a lot of computational effort. We thus choose 11 of these datasets based on two criteria:

  • Highly imbalanced We choose datasets with an imbalance ratio of more than 25:1. This category includes the abalone_19, letter_image, mammography, ozone_level, webpage, wine_quality, and yeast_me2 datasets.

  • High dimensional We choose datasets with more than 100 features. This category includes arrhythmia, isolet, scene, webpage, and yeast_ml8.

Note that the \(\texttt {webpage}\) dataset is common to both criteria, giving us a total of 11 datasets. We choose these two categories because they are of special interest in research related to imbalanced datasets and have received extensive attention in this research area (Anand et al. 2010; Hooda et al. 2018; Jing et al. 2019; Blagus and Lusa 2013).

Credit card fraud detection dataset We obtained the description of this dataset from the website https://www.kaggle.com/mlg-ulb/creditcardfraud: “The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where there are 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions. The dataset contains only numerical input variables, which are the result of a PCA transformation. Feature variables \(f_1, \ldots , f_{28}\) are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount. The labels are encoded in the ‘Class’ variable, which is the response variable and takes value 1 in case of fraud and 0 otherwise” (Pozzolo et al. 2017).

Small datasets We were also interested in checking the performance of LoRAS on small datasets. We obtained two such datasets: ar1 and ar3. Both of these datasets have very few data points and fewer than 10 points in the minority class.

Thus, we benchmark our oversampling algorithm against the existing algorithms on a total of 14 datasets. We provide relevant statistics on these datasets in Table 2.

Table 2 Table showing some statistics for the datasets we study in this article

3.2 Methodology

For every dataset we analyzed, we used a consistent workflow. Given a dataset, for every machine learning model, we judge model performance based on a \(5\times 10\)-fold stratified cross validation framework. However, for the two small datasets ar1 and ar3 we use a \(5\times 3\)-fold stratified cross validation framework, since there are fewer than 10 samples in the minority class. First, we randomly shuffle the dataset. We then split the dataset into ten folds, each one distinct from the others, maintaining the imbalance ratio in every fold. We then train the machine learning models on the dataset without any oversampling with tenfold cross validation. This means that we train and test the model 10 times, each time considering one fold as the test fold and the remaining nine folds as training folds. When training the ML models with oversampled data, however, we oversample only the training folds and leave the test fold as it is for each training session. For each dataset we repeat the whole process five times to reduce stochastic effects as much as possible.
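For illustration, this protocol can be sketched as follows (a simplified version of our workflow; the oversample argument stands for any of the compared oversampling methods, and the classifier shown is just one of those we used):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, balanced_accuracy_score

def cv_with_oversampling(X, y, oversample, n_splits=10, n_repeats=5):
    # Oversample inside each training fold only; test folds keep the
    # original imbalance. oversample(X_tr, y_tr) returns a balanced
    # training set (SMOTE, LoRAS, ...).
    f1s, baccs = [], []
    for r in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=r)
        for tr, te in skf.split(X, y):
            X_tr, y_tr = oversample(X[tr], y[tr])
            clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)
            pred = clf.predict(X[te])
            f1s.append(f1_score(y[te], pred))
            baccs.append(balanced_accuracy_score(y[te], pred))
    return np.mean(f1s), np.mean(baccs)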

For the oversampling algorithms, given a dataset, we chose the same neighbourhood size for every oversampling model. If there were fewer than 100 data points in the minority class, the neighbourhood size was chosen to be 5; otherwise we chose a neighbourhood size of 30. Given the large number of datasets we analyze, we did not customize this for every dataset and rather chose to stick to the above mentioned general rule. For LoRAS oversampling, however, we performed a preliminary study to find customized parameter values for every dataset, since the LoRAS algorithm is highly parametrized in nature. We tried several combinations of the parameters \({\texttt {N}_\mathrm{aff}}\), \(\texttt {embedding}\), and \(\texttt {perplexity}\) employing random grid search. For this initial study involving the parameter optimization of LoRAS, given a dataset, we performed a simple train-test split of the dataset (1:1 train-test split ratio), then applied LoRAS with parameter grids on the training data to oversample, and tested the classifier performances on the test data. The training set is kept relatively small so that the classifier does not gain much experience on the data during parameter estimation and become prone to overfitting. This study was kept completely independent from our main cross-validation based results, so that the samples from the test sets of our cross validation have minimal effect on parameter tuning. For the parameter \({\texttt {N}_\mathrm{aff}}\) the grid interval is [2, |F|], |F| being the number of features. We choose five numbers from this interval while forming a search grid: three of them are randomly chosen, and the numbers 2 and |F| are always included in this set of 5 numbers. For the parameter \(\texttt {embedding}\), the grid values are the two possible entries that the parameter can adopt. For the \(\texttt {perplexity}\) parameter, we used the grid values [0.01, 0.1, 1, 10, 30, 100].

We emphasize here that for all the algorithms, including LoRAS, for a given dataset, we keep the neighbourhood size of every oversampling model fixed. For every oversampling model that we considered, the neighbourhood size is the parameter to which the model is most sensitive, since it contributes the most in determining the distribution of the oversampled minority class. For LoRAS, there are three (out of seven parameters in total) parameters designed to better model/approximate the minority class data manifold (for example, the ones involving the t-SNE on the minority class), which are tuned to show the applicability of manifold approximation to improving convex combination based oversampling. However, as suggested, we keep all parameters related to the original distribution of the minority class fixed for all oversampling models in all comparisons.

However, considering the philosophy of LoRAS and the comparatively large number of parameters it uses, we take the liberty to tune the other parameters for LoRAS, since those parameters are the key to a proper approximation or modelling of the minority class data manifold, which we argue to be the key factor behind the success of LoRAS.

For LoRAS oversampling of every dataset we use a unique value for \({\texttt {N}}_\mathrm{aff}\), as presented in Table 3. For individual ML models we use different settings of the LoRAS parameters \(\texttt {embedding}\) and \(\texttt {perplexity}\), which we mention explicitly in our supplementary materials while presenting the results for each ML model on each dataset. To ensure fairness of comparison, we oversampled such that the total number of augmented samples generated from the minority class was as close as possible to the number of samples in the majority class, as allowed by each oversampling algorithm. As for the other parameters of the LoRAS algorithm, for \({\texttt {L}}_{\sigma }\) we chose a list consisting of a constant value of 0.005 for each dataset, and for the parameter \({\texttt {N}_{gen}}\) we chose the value \(\frac{|C_{\text {maj}}|-|C_{\text {min}}|}{|C_{\text {min}}|}\). We provide a detailed list of the parameter settings used for the oversampling algorithms in Table 3.

Table 3 In this table we present the details of parameter settings for the oversampling algorithms used by us for our experiment

To choose ML models for our study, we first did a pilot study with ML classifiers such as k-nearest neighbors (knn), Support Vector Machine (svm) (linear kernel), Logistic regression (lr), Random forest (rf), and Adaboost (ab). As inferred in Blagus and Lusa (2013), we found that knn was quite effective for the datasets we used. We also noticed that lr and svm performed better than rf and ab in most cases. We thus chose knn, svm, and lr for our final studies. We used the lbfgs solver for our logistic regression models and a linear kernel for our svm models. For our knn models, we chose 10 nearest neighbours for prediction if there were fewer than 100 samples in the minority class and 30 nearest neighbours otherwise. For ‘arrhythmia’, ‘abalone_19’, ‘ar1’, and ‘ar3’, however, we used only 5 nearest neighbours for the knn model, since they have only 25, 32, 9, and 8 minority class samples, respectively. We chose this parameter to be consistent with the neighbourhood size of the oversampling models, since the neighbourhood size directly influences the distribution of the training data and hence the model performance.
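As a sketch, the classifier settings described above can be written as follows (our own helper; the neighbour-count thresholds are an approximate encoding of the rule we state):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def make_classifiers(n_minority):
    # knn neighbours: 5 for the very small minority classes,
    # 10 below 100 minority samples, 30 otherwise.
    k = 5 if n_minority <= 32 else (10 if n_minority < 100 else 30)
    return {
        "knn": KNeighborsClassifier(n_neighbors=k),
        "svm": SVC(kernel="linear"),
        "lr": LogisticRegression(solver="lbfgs", max_iter=1000),
    }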

In our analysis we take special notice of the credit card fraud detection dataset. This dataset is not included in the imblearn.datasets Python library. The main reason why we pay special attention to this dataset is that it is by far the most imbalanced publicly available dataset that we have come across. The extreme imbalance ratio of 577:1 is incomparable to any of the datasets in imblearn.datasets. Also, this dataset has received special attention from researchers attempting to use ML in credit fraud detection (Varmedja et al. 2019). In that article, lr and rf show good prediction accuracies on the dataset; we therefore chose these two ML models for the credit fraud dataset. Varmedja et al. (2019) did not provide a cross validated analysis of their models, while our models have been trained and tested within the usual tenfold cross validation framework as discussed before. Also, for the two small datasets with a critically small minority class, we used only the knn and lr classifiers, with parameter settings as specified before. The reason is that, for all 12 other datasets, svm did not stand out as the best performer in terms of F1-Score for any of them.

For computational coding, we used the scikit-learn (V 0.21.2), numpy (V 1.16.4), pandas (V 0.24.2), and matplotlib (V 3.1.0) libraries in Python (V 3.7.4).

4 Results

For imbalanced datasets there are performance measures more meaningful than Accuracy, including Sensitivity or Recall, Precision, F1-Score (F-Measure), and Balanced accuracy, all of which can be derived from the Confusion Matrix generated while testing the model. For a given class, the different combinations of recall and precision have the following meanings:

  • High Precision & High Recall: The model handled the classification task properly

  • High Precision & Low Recall: The model cannot classify the data points of the particular class properly, but is highly reliable when it does so

  • Low Precision & High Recall: The model classifies the data points of the particular class well, but misclassifies a high number of data points from other classes as the class in consideration

  • Low Precision & Low Recall: The model handled the classification task poorly

The F1-Score is calculated as the harmonic mean of precision and recall and, therefore, balances a model in terms of precision and recall. These measures have been defined and discussed thoroughly by Elrahman and Abraham (2013). Balanced accuracy is the mean of the individual class accuracies and, in this context, is more informative than the usual accuracy score. A high Balanced accuracy ensures that the ML algorithm learns adequately for each individual class.
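For reference, with TP, FP, TN, and FN denoting the true/false positives and negatives for the minority class, the two measures are:

$$\begin{aligned} \text {F1-Score}&=\frac{2\cdot \text {Precision}\cdot \text {Recall}}{\text {Precision}+\text {Recall}},\quad \text {Precision}=\frac{TP}{TP+FP},\quad \text {Recall}=\frac{TP}{TP+FN}\\ \text {Balanced accuracy}&=\frac{1}{2}\left( \frac{TP}{TP+FN}+\frac{TN}{TN+FP}\right) \end{aligned}$$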

In our experiments we noticed an interesting behaviour of the oversampling models in terms of their average F1-Score and Balanced accuracy. Once we present our experimental results, we will discuss why considering F1-Score and Balanced accuracy together can give us a clearer idea of model performance. We use the above mentioned performance measures wherever applicable in this article.

Selected model performances for all datasets We provide the detailed results of our experiments for all machine learning models as supplementary material. To be precise, for every combination of dataset, ML model, and oversampling strategy, we provide the mean and variance of the tenfold cross validation process over 5 repetitions. For judging the performance of the oversampling models we follow this scheme:

  • First, for a given dataset, we choose the ML model trained on that dataset that provides the highest average F1-Score over all the oversampling models and training without oversampling. The F1-Score reflects the balance between precision and recall and is considered a reliable metric for imbalanced classification tasks.

  • We then consider the Balanced accuracy and F1-Score of the chosen model as an evaluation of how well each oversampling model performs on the considered dataset. Following this evaluation scheme, we present our results in Table 4.

Table 4 Table showing balanced accuracy/F1-score for several oversampling strategies (Baseline, SMOTE, SVM-SMOTE, Borderline1 SMOTE, Borderline2 SMOTE, ADASYN, and LoRAS, column-wise) for all 14 datasets of interest, for the ML models producing the best average F1-score over all oversampling strategies and baseline training for the respective datasets

Calculating average performances over all datasets, LoRAS has the best Balanced accuracy and F1-Score. As expected, SMOTE improved Balanced accuracy compared to model training without any oversampling. Surprisingly, it lags behind in F1-Score for quite a few datasets with a high baseline F1-Score, such as letter_image, isolet, mammography, webpage, and credit fraud. Interestingly, the oversampling approaches SVM-SMOTE and Borderline1 SMOTE improved the average F1-Score compared to SMOTE, but compromised with a lower Balanced accuracy. On the other hand, applying ADASYN increased the Balanced accuracy compared to SMOTE, but again compromises on the F1-Score. In contrast, our LoRAS approach produces the best Balanced accuracy on average while maintaining the highest average F1-Score among all oversampling techniques. We want to emphasize that, even considering stochastic factors, LoRAS can improve both the Balanced accuracy and the F1-Score of ML models significantly compared to SMOTE, which makes it unique.

Datasets with high imbalance ratio To verify the performance of LoRAS on highly imbalanced datasets, we present the average of the selected model performances for the datasets with the highest imbalance ratios (among the ones we have tested) in Table 5.

Table 5 Table showing the average balanced accuracy/F1-score of the selected models for datasets with the highest imbalance ratios and high dimensional datasets separately

From our results we observe that LoRAS oversampling can significantly improve model performances for highly imbalanced datasets. LoRAS provides the highest F1-Score and Balanced accuracy among all the oversampling models. The results here show similar properties for SMOTE, Borderline1 SMOTE, SVM-SMOTE, ADASYN, and LoRAS as discussed before. Note that, for the credit fraud dataset, which is the most imbalanced of all, LoRAS has significant success over the other oversampling models in terms of Balanced accuracy. For the webpage dataset as well, it improves the Balanced accuracy significantly, compromising minimally on the baseline F1-Score. The same trend follows for the letter_image dataset. Notably, these three datasets also have the highest numbers of overall samples, implying that with more data LoRAS can significantly outperform the compared convex combination based oversampling models.

High dimensional datasets It is also of interest to us to check how LoRAS performs on high dimensional datasets. We therefore select the five datasets with the highest number of features among our tested datasets and present the performances of the selected ML methods in Table 5. From our results for high dimensional datasets, we observe that LoRAS produces the best F1-Score and the second best Balanced accuracy on average among all oversampling models, as Borderline2 SMOTE beats LoRAS marginally. SMOTE improves the Balanced accuracy with respect to the baseline score here. Borderline1 SMOTE and SVM-SMOTE further increase SMOTE’s performance both in terms of F1-Score and Balanced accuracy. Borderline2 SMOTE, although it improves the Balanced accuracy of SMOTE, compromises on the F1-Score. Note that, even excluding the webpage dataset, where LoRAS has an overwhelming success, LoRAS still has the best average F1-Score and the third highest Balanced accuracy, marginally behind SVM-SMOTE and Borderline2 SMOTE. We thus conclude that for high dimensional datasets LoRAS can outperform the compared oversampling models in terms of F1-Score, while compromising marginally on Balanced accuracy.

Small datasets For the two small datasets (with fewer than 10 samples in the minority class) that we explored, we observed that LoRAS performs reasonably well. For the ‘ar1’ dataset, LoRAS produces the best F1-Score and the third best Balanced accuracy. For the ‘ar3’ dataset, LoRAS produces the best Balanced accuracy and the third best F1-Score. Note that LoRAS also performs quite well for the ‘abalone_19’ and ‘arrhythmia’ datasets, which likewise have a small number of data points in the minority class.

Statistical analysis Following Tarawneh et al. (2020), we use Wilcoxon’s signed rank test to compare LoRAS against the other convex-combination based oversampling algorithms, in terms of both performance measures we have used: F1-Score and Balanced accuracy. Tarawneh et al. (2020) chose this test for comparative studies since it is safer than parametric tests, as it refrains from assuming homogeneity or a normal distribution of the data; therefore, it can be applied to any classifier evaluation measure. Tarawneh et al. (2020) further confirm: ‘The Wilcoxon test aims to find if a null hypothesis is true or not. The null hypothesis \(H_0\) assumes that there is no significant difference between the classification results (observations) obtained from two different methods. We assume that the null hypothesis is rejected if the p-value of the Wilcoxon test is less than \(\alpha =0.05\)’ (Tarawneh et al. 2020).

Table 6 Table showing p-values for comparison of LoRAS against the other oversampling algorithms, in terms of both the performance measures we have used: F1-score and balanced accuracy

From Table 6 we observe that the p-values for all the paired tests are less than 0.05 for the F1-Score; therefore, \(H_0\) is rejected for all the paired tests in the case of the F1-Score. Thus, the F1-Scores LoRAS produces differ enough from those of the other algorithms to be statistically significant. For Balanced accuracy, the algorithms Borderline2 SMOTE and ADASYN do not show a statistically significant difference to LoRAS. However, since F1-Score is a more reliable and widely used metric for imbalanced datasets, we conclude that, overall, the results generated by LoRAS are significantly different from those of the compared oversampling algorithms.

Tarawneh et al. (2020) also remark that the p-value alone is not informative enough, as it does not provide information about the strength of the relationship between variables. The p-values do not reveal whether the results are significantly different in favour of LoRAS or against LoRAS. For that, following Tarawneh et al. (2020), we use the metrics \(W_+\), \(W_-\), and R. These are calculated using the following steps:

  • For each data pair of model predictions (involving LoRAS and some other oversampling algorithm), the difference between both predictions is calculated and stored in a vector D, excluding the zero difference values.

  • The signs of the differences are recorded in a sign vector S.

  • The entries in |D| are ranked, forming a vector \(R^{\prime }\). In case of tied ranks, an average ranking scheme is adopted. This means that the entries of |D| are first ranked using integers and then, in case of ties, the average of the integer ranks is assigned to all the respective tied entries with a specific tied value.

  • The component-wise product of S and \(R^{\prime }\) gives us the vector W of signed ranks. The sum of the absolute values of the positive entries in W is \(W_+\) and the sum of the absolute values of the negative entries in W is \(W_-\). We then define \(W_R=\min \{W_+,W_-\}\)

  • Then the test statistic Z is calculated by the equation

    $$\begin{aligned} Z= \frac{W_R-\frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}-\frac{\Sigma t^3-\Sigma t}{48}}} \end{aligned}$$
    (1)

    where n is the number of components in D and t is the number of times a given tied value occurs in \(R^{\prime }\), the sums being taken over all groups of tied values.

  • Finally, R is calculated as \(R=\frac{|Z|}{\sqrt{N}}\), where N is the total number of datasets compared, which is 14 in our case.

Note that a higher value of \(W_+\) for LoRAS indicates a superior performance of LoRAS, and the value of R indicates how superior (with a higher \(W_+\)) or inferior (with a higher \(W_-\)) the performance of LoRAS is compared to the other oversampling model on the tested datasets. Tarawneh et al. (2020) consider the ranges \(R\le 0.1\), \(0.1 < R \le 0.5\), and \(R>0.5\) to be indicators of a small, medium, and high degree of change (improvement or deterioration) in predictive performance, respectively.
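The computation above can be sketched with scipy as follows (our own helper, not the exact script behind Tables 6 and 7):

import numpy as np
from scipy.stats import rankdata, wilcoxon

def wilcoxon_summary(loras_scores, other_scores, n_datasets=14):
    # W+, W-, the effect size R, and the p-value for paired per-dataset
    # scores, following the steps listed above.
    d = np.asarray(loras_scores, float) - np.asarray(other_scores, float)
    d = d[d != 0]                            # drop zero differences
    r = rankdata(np.abs(d))                  # average ranks for ties
    w_plus, w_minus = r[d > 0].sum(), r[d < 0].sum()
    w_r, n = min(w_plus, w_minus), len(d)
    _, t = np.unique(r, return_counts=True)  # sizes of tied groups
    z = (w_r - n * (n + 1) / 4) / np.sqrt(
        n * (n + 1) * (2 * n + 1) / 24 - (t**3 - t).sum() / 48)
    effect_r = abs(z) / np.sqrt(n_datasets)
    return w_plus, w_minus, effect_r, wilcoxon(loras_scores, other_scores).pvalue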

Table 7 Table showing \(W_+\)/\(W_-\) /R for comparison of LoRAS against the other oversampling algorithms, in terms of both the performance measures we have used: F1-score and balanced accuracy

From Table 7 we note that LoRAS has a higher \(W_+\) value for both F1-Score and Balanced accuracy in comparison to each of the other convex combination based oversampling methods in consideration. Moreover, for the F1-Score measure, the R value is also more than 0.5, indicating a high degree of improvement in F1-Score for LoRAS over the considered oversampling models. Similarly, for Balanced accuracy, we find a high degree of improvement for LoRAS over all considered oversampling models except Borderline2 SMOTE, for which there is a medium degree of improvement. Overall, we thus conclude that LoRAS provides a significant improvement in performance over the compared convex combination based oversampling methods.

5 Discussion

We have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique, since it provides a better estimate of the mean of the underlying local data distribution of the minority class data space. Let \(X=(X_1,\ldots ,X_{|F|}) \in C_{\text {min}}\) be an arbitrary minority class sample. Let \(N^X_k\) be the set of the k-nearest neighbors of X, which we will consider as the neighborhood of X. Both SMOTE and LoRAS focus on generating augmented samples within the neighborhood \(N^X_k\) at a time. We assume that a random variable \(X \in N^X_k\) follows a shifted t-distribution with k degrees of freedom, location parameter \(\mu\), and scaling parameter \(\sigma\). Note that here \(\sigma\) does not refer to the standard deviation but sets the overall scaling of the distribution (Simon 2009), which we choose to be the sample variance in the neighborhood of X. A shifted t-distribution is used to estimate population parameters when there are few samples (usually \(\le\) 30) and/or the population variance is unknown. Since in SMOTE or LoRAS we generate samples from a small neighborhood, we can argue in favour of our assumption that locally a minority class sample X, as a random variable, follows a t-distribution. Following Blagus and Lusa (2013), we assume that if \(X, X^{\prime }\in N^X_k\), then X and \(X^{\prime }\) are independent. For \(X, X^{\prime }\in N^X_k\), we also assume:

$$\begin{aligned} \begin{aligned} {\mathbf{E}}[X]&={\mathbf{E}}[X^{\prime }]\\&=\mu = (\mu _1,\ldots ,\mu _{|F|})\\ \mathrm {Var}[X]&=\mathrm {Var}[X^{\prime }]\\&=\sigma ^2\Big (\frac{k}{k-2}\Big ) \\&=\sigma ^{{\prime }2}=(\sigma ^{{\prime }2}_1,\ldots ,\sigma ^{{\prime }2}_{|F|}) \end{aligned} \end{aligned}$$
(2)

where \({\mathbf{E}}[X]\) and \(\mathrm {Var}[X]\) denote the expectation and variance of the random variable X, respectively. However, the mean has to be estimated by an estimator statistic (i.e. a function of the samples). Both SMOTE and LoRAS can be considered estimator statistics for the mean of the t-distribution that \(X \in C_{\text {min}}\) follows locally.

Theorem 1

Both SMOTE and LoRAS are unbiased estimators of the mean \(\mu\) of the t-distribution that X follows locally. However, the variance of the LoRAS estimator is less than the variance of the SMOTE estimator, given that \(|F|>2\).

Proof

A shadowsample S is a random variable \(S=X+B\) where \(X \in N^X_k\), the neighborhood of some arbitrary \(X \in C_{\text {min}}\) and B follows \(\mathscr {N}(0,\sigma _B)\).

$$\begin{aligned} \begin{aligned} {\mathbf{E}}[S]&= {\mathbf{E}}[X]+ {\mathbf{E}}[B]\\&=\mu \\ \mathrm {Var}[S]&= \mathrm {Var}[X]+\mathrm {Var}[B]\\&=\sigma ^{{\prime }2}+\sigma _B^2 \end{aligned} \end{aligned}$$
(3)

assuming X and B are independent. Now, a LoRAS sample \(L=\alpha _1S^1+\cdots +\alpha _{|F|}S^{|F|}\), where \(S^1,\ldots ,S^{|F|}\) are shadowsamples generated from the elements of the neighborhood \(N^X_k\) of X, such that \(\alpha _1+\cdots +\alpha _{|F|}=1\). The affine combination coefficients \(\alpha _1,\ldots ,\alpha _{|F|}\) follow a Dirichlet distribution with all concentration parameters equal to 1 (assuming all features to be equally important). For arbitrary \(i,j \in \left\{ 1, \ldots , |F| \right\}\),

$$\begin{aligned} {\mathbf{E}}[\alpha _i]&=\frac{1}{|F|}\\ \mathrm {Var}[\alpha _i]&=\frac{|F|-1}{|F|^2(|F|+1)}\\ \mathrm {Cov}(\alpha _i,\alpha _j)&=\frac{-1}{|F|^2(|F|+1)} \end{aligned}$$

where \(\mathrm {Cov}(A,B)\) denotes the covariance of two random variables A and B. Assuming \(\alpha\) and S to be independent,

$$\begin{aligned} {\mathbf{E}}[L] ={\mathbf{E}}[\alpha _1]{\mathbf{E}}[S^1]+\cdots +{\mathbf{E}}[\alpha _{|F|}]{\mathbf{E}}[S^{|F|}] = \mu \end{aligned}$$
(4)

Thus L is an unbiased estimator of \(\mu\). For \(j,k,l \in \left\{ 1, \ldots , |F| \right\}\),

$$\begin{aligned} \begin{aligned} \mathrm {Cov}[\alpha _kS^k_j,\alpha _lS^l_j]&= {\mathbf{E}}[\alpha _kS^k_j\alpha _lS^l_j]-{\mathbf{E}}[\alpha _kS^k_j]{\mathbf{E}}[\alpha _lS^l_j]\\&= {\mathbf{E}}[\alpha _k\alpha _l]\mu _j^2-\frac{\mu _j^2}{|F|^2}\\&=\Big [\mathrm {Cov}(\alpha _k,\alpha _l)+\frac{1}{|F|^2}\Big ]\mu _j^2-\frac{\mu _j^2}{|F|^2}=\mu _j^2\mathrm {Cov}(\alpha _k,\alpha _l) \end{aligned} \end{aligned}$$
(5)

since \(\alpha _k\alpha _l\) is independent of \(S^k_jS^l_j\). For arbitrary j, the j-th component \(L_j\) of a LoRAS sample satisfies

$$\begin{aligned} \begin{aligned} \mathrm {Var}(L_j)&= \mathrm {Var}(\alpha _1S^1_j+\cdots +\alpha _{|F|}S^{|F|}_j)\\&= \mathrm {Var}(\alpha _1S^1_j)+\cdots +\mathrm {Var}(\alpha _{|F|}S^{|F|}_j)+\Sigma _{k=1}^{|F|}\Sigma _{l=1,l \ne k}^{|F|}\mathrm {Cov}(\alpha _kS^k_j,\alpha _lS^l_j)\\&=\frac{\mu _j^2(|F|-1)+2(\sigma ^{{\prime }2}_j+\sigma _{Bj}^2)|F|}{|F|(|F|+1)}-\frac{\mu _j^2(|F|-1)}{|F|(|F|+1)}\\&= \frac{2(\sigma ^{{\prime }2}_j+\sigma _{Bj}^2)}{(|F|+1)} \end{aligned} \end{aligned}$$
(6)

For LoRAS, we take an affine combination of |F| shadowsamples, while SMOTE considers an affine combination of two minority class samples. Note that, since a SMOTE-generated oversample can be interpreted as a random affine combination of two minority class samples, we can consider \(|F|=2\) for SMOTE, independent of the number of features. From Eq. 4, this implies that SMOTE is also an unbiased estimator of the mean of the local data distribution. The variance of a SMOTE-generated sample as an estimator of \(\mu\) is therefore \(\frac{2\sigma ^{{\prime }2}}{3}\) (since \(B=0\) for SMOTE). But for LoRAS as an estimator of \(\mu\), when \(|F|>2\), the variance is less than that of SMOTE. \(\square\)
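Theorem 1 can be checked numerically with a small Monte Carlo simulation (a sketch under the proof's assumptions; one feature component is shown, and the values of k, |F|, \(\mu\), \(\sigma\), and \(\sigma _B\) are illustrative):

import numpy as np

rng = np.random.default_rng(0)
k, F, mu, sigma, sigma_B, trials = 30, 10, 2.0, 1.0, 0.005, 200_000

# Local minority samples: shifted t-distribution with k degrees of freedom
# (columns play the role of independent neighbourhood points).
X = mu + sigma * rng.standard_t(df=k, size=(trials, F))

# SMOTE-type estimator: convex combination of two samples (a Dirichlet
# distribution with two components is uniform on [0, 1]).
lam = rng.random((trials, 1))
smote = (lam * X[:, :1] + (1 - lam) * X[:, 1:2]).ravel()

# LoRAS-type estimator: Dirichlet combination of |F| shadowsamples.
shadows = X + rng.normal(0.0, sigma_B, size=X.shape)
alpha = rng.dirichlet(np.ones(F), size=trials)
loras = (alpha * shadows).sum(axis=1)

print(smote.mean(), loras.mean())  # both approx mu = 2.0 (unbiased)
print(smote.var(), loras.var())    # approx 2*sigma'^2/3 vs 2*(sigma'^2 + sigma_B^2)/(F+1)

With these values, the sample variances come out close to the predicted \(\frac{2\sigma ^{{\prime }2}}{3}\approx 0.71\) for SMOTE and \(\frac{2(\sigma ^{{\prime }2}+\sigma _B^2)}{|F|+1}\approx 0.19\) for LoRAS, while both means stay close to \(\mu\).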

This implies that, locally, LoRAS can estimate the mean of the underlying t-distribution better than SMOTE. To visualize the key aspects of LoRAS oversampling, we provide PCA plots of data oversampled from the ozone_level dataset with several of the oversampling methods we studied in Fig. 2. From Fig. 2 we can observe that SMOTE and ADASYN oversample heavily in the neighbourhood of the outliers, depicted by a blue box in each subplot. While this is somewhat controlled in Borderline1 SMOTE and SVM-SMOTE, they still generate some synthetic samples in this neighbourhood. LoRAS, on the other hand, refrains from doing so, leveraging its strategy of producing a better estimate of the local mean of the underlying data distribution. This enables LoRAS to ignore the outliers and to oversample more uniformly, resulting in a better approximation of the data manifold. Note that the average F1-Scores of the oversampling models, as presented in Table 4, correlate directly with how each oversampling strategy behaves in this neighbourhood. SMOTE and ADASYN generate the lowest F1-Scores and show a tendency to oversample excessively in this neighbourhood. Borderline1 SMOTE and SVM-SMOTE improve the F1-Score compared to SMOTE and ADASYN, again consistent with their behaviour of oversampling less in this neighbourhood. LoRAS has the highest average F1-Score and oversamples very sparsely in this neighbourhood.

Fig. 2

Principal component analysis plots of the ozone_level dataset for the baseline data and for data oversampled with several oversampling strategies. The boxed region in each subplot shows a neighbourhood of outliers and how each oversampling strategy generates synthetic samples in that neighbourhood

6 Conclusions

Oversampling with LoRAS produces, on average, comparatively balanced ML model performances in terms of Balanced accuracy and F1-Score among the compared convex-combination based oversampling techniques. This is due to the fact that, in most cases, LoRAS produces fewer misclassifications on the majority class with a reasonably small compromise in misclassifications on the minority class. From our study we infer that, for tabular, high dimensional, and highly imbalanced datasets, our LoRAS oversampling approach can better estimate the mean of the underlying local distribution of a minority class sample (considering it a random variable) and can improve the Balanced accuracy and F1-Score of ML classification models. However, the scope of such convex combination based strategies, including LoRAS, might be limited for heterogeneous image based imbalanced datasets.

The distribution of both the minority and majority class data points is considered in oversampling techniques such as Borderline1 SMOTE, Borderline2 SMOTE, SVM-SMOTE, and ADASYN (Gosain and Sardana 2017). SMOTE and LoRAS are the only two techniques among the oversampling techniques we explored that deal with the problem of imbalance just by generating new data points, independent of the distribution of the majority class data points. Thus, comparing LoRAS and SMOTE gives a fair impression of the performance of our novel LoRAS algorithm as an oversampling technique that relies just on resampling, without considering any aspect of the distributions of the minority and majority class data points. Other extensions of SMOTE such as Borderline1 SMOTE, Borderline2 SMOTE, SVM-SMOTE, and ADASYN can also be built on the principle of the LoRAS oversampling strategy. According to our analyses, LoRAS already reveals great potential in a broad variety of applications and emerges as a true alternative to SMOTE for processing highly imbalanced datasets.