LoRAS: An oversampling approach for imbalanced datasets

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our LoRAS algorithm on 28 publicly available datasets and show that drawing samples from an approximated data manifold of the minority class is the key to successful oversampling. We compared the performance of LoRAS, SMOTE, and several SMOTE extensions and observed that, for imbalanced datasets, LoRAS on average generates better Machine Learning (ML) models in terms of F1-Score and Balanced Accuracy. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique, since it provides a better estimate of the mean of the underlying local data distribution of the minority class data space.


Introduction
Imbalanced datasets are frequent occurrences in a large spectrum of fields where Machine Learning (ML) has found its applications, including business, finance and banking, as well as medical science. Oversampling approaches are a popular choice to deal with imbalanced datasets (Barua et al., 2014; Bunkhumpornpat et al., 2009; Chawla et al., 2002; Haibo et al., 2008; Han et al., 2005). We here present Localized Randomized Affine Shadowsampling (LoRAS), which produces better ML models for imbalanced datasets compared to state-of-the-art oversampling techniques such as SMOTE and several of its extensions. We use computational analyses and a mathematical proof to demonstrate that drawing samples from an approximated data manifold of the minority class is key to successful oversampling. We validated the approach with 28 imbalanced datasets, comparing the performance of several state-of-the-art oversampling techniques with LoRAS. The average performance of LoRAS on all these datasets is better than that of the other oversampling techniques we investigated. In addition, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique, since it provides a better estimate of the local mean of the underlying data distribution in a neighbourhood of the minority class data space.
In imbalanced datasets, the number of instances in one (or more) class(es) is very high (or very low) compared to the other class(es). A class having a large number of instances is called a majority class, and one having far fewer instances is called a minority class. This imbalance makes it difficult to learn from such datasets using standard ML approaches. Oversampling approaches are often used to counter this problem by generating synthetic samples for the minority class to balance the number of data points per class. SMOTE is a widely used oversampling technique, which has received various extensions since it was published by Chawla et al. (2002). The key idea behind SMOTE is to randomly sample artificial minority class data points along the line segments joining an arbitrary minority class data point and its k nearest minority class neighbors.
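To make the interpolation step concrete, here is a minimal sketch of SMOTE-style interpolation in Python (our own illustration of the key idea, not the reference implementation; the function name smote_sample and the toy data are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, rng=np.random.default_rng(42)):
    """Generate one synthetic point by SMOTE-style interpolation."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own nearest neighbor
    i = rng.integers(len(X_min))                         # pick an arbitrary minority class point
    _, idx = nn.kneighbors(X_min[i].reshape(1, -1))
    j = rng.choice(idx[0][1:])                           # pick one of its k nearest minority neighbors
    lam = rng.random()                                   # random position on the joining line segment
    return X_min[i] + lam * (X_min[j] - X_min[i])

X_min = np.random.default_rng(0).normal(size=(20, 3))    # toy minority class: 20 samples, 3 features
print(smote_sample(X_min))
```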
The SMOTE algorithm, however, has several limitations. For example, it does not consider the distribution of the minority class or latent noise in the dataset (Hu et al., 2009). It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model (Puntumapon and Waiyamai, 2012). Several other limitations of SMOTE are mentioned in Blagus and Lusa (2013). To overcome such limitations, several algorithms have been proposed as extensions of SMOTE. Some focus on improving the generation of synthetic data by combining SMOTE with other oversampling techniques, including the combination of SMOTE with Tomek-links (Elhassan et al., 2016), particle swarm optimization (Gao et al., 2011; Wang et al., 2014), rough set theory (Ramentol et al., 2012), kernel-based approaches (Mathew et al., 2015), Boosting (Chawla et al., 2003), and Bagging (Hanifah et al., 2015). Other approaches choose subsets of the minority class data to generate SMOTE samples or cleverly limit the number of synthetic data points generated (Santoso et al., 2017). Some examples are Borderline1/2 SMOTE (Han et al., 2005), ADAptive SYNthetic (ADASYN) (Haibo et al., 2008), Safe Level SMOTE (Bunkhumpornpat et al., 2009), Majority Weighted Minority Oversampling TEchnique (MWMOTE) (Barua et al., 2014), Modified SMOTE (MSMOTE) (Hu et al., 2009), and Support Vector Machine-SMOTE (SVM-SMOTE) (Suh et al., 2017) (see Table 1). Recent comparative studies have focused on SMOTE, the Borderline1/2 SMOTE models, ADASYN, and SVM-SMOTE (Ah-Pine et al., 2016; Suh et al., 2017), which is why we focus on these five models for a comparison with our newly developed oversampling technique LoRAS. LoRAS allows us to resample uniformly from an approximated data manifold of the minority class data points and thus to create a more balanced and robust model. A LoRAS oversample is an unbiased estimator of the mean of the underlying local probability distribution followed by a minority class sample (considered as a random variable), and the variance of this estimator is significantly smaller than that of a SMOTE-generated oversample, which is also an unbiased estimator of the same mean.

LoRAS: the algorithm

In this section we discuss our strategy to approximate the data manifold, given a small dataset. A typical dataset for a supervised ML problem consists of a set of features $F = \{f_1, f_2, \ldots\}$ that are used to characterize patterns in the data, and a set of labels or ground truth. Ideally, the number of instances or samples should be significantly greater than the number of features. In order to maintain the mathematical rigor of our strategy, we propose the following definition of a small dataset.
Definition 1. Consider a class or the whole dataset with n samples and |F| features. If $\log_{10}(n/|F|) < 1$, then we call the dataset a small dataset. For example, a minority class with n = 50 samples and |F| = 10 features gives $\log_{10}(50/10) \approx 0.7 < 1$ and thus counts as a small dataset.
The LoRAS algorithm is designed to learn from a small dataset by approximating the underlying data manifold. Assuming that F is the best possible set of features to represent the data and all features are equally important, we can think of a data oversampling model as a function $g: \prod_{i=1}^{l} \mathbb{R}^{|F|} \to \mathbb{R}^{|F|}$, that is, g uses l parent data points (each with |F| features) to produce an oversampled data point in $\mathbb{R}^{|F|}$.
Definition 2. We define a random affine combination of some arbitrary vectors as the affine linear combination of those vectors, such that the coefficients of the linear combination are chosen randomly. Formally, a vector $v = \alpha_1 u_1 + \cdots + \alpha_m u_m$ is a random affine combination of the vectors $u_1, \ldots, u_m$ ($u_j \in \mathbb{R}^{|F|}$) if $\alpha_1 + \cdots + \alpha_m = 1$, $\alpha_j \in \mathbb{R}_{+}$, and $\alpha_1, \ldots, \alpha_m$ are chosen randomly from a Dirichlet distribution.
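As a quick illustration of Definition 2, the coefficients can be drawn from a flat Dirichlet distribution, which guarantees that they are non-negative and sum to one (a minimal NumPy sketch; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
U = rng.normal(size=(5, 10))             # m = 5 vectors u_1, ..., u_m in R^{|F|} with |F| = 10
alpha = rng.dirichlet(np.ones(len(U)))   # random affine weights: alpha_j >= 0, sum(alpha) == 1
v = alpha @ U                            # random affine combination of the rows of U
print(v.shape, alpha.sum())              # (10,) 1.0 (up to floating point error)
```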
The simplest way of augmenting a data point would be to take the average (or a random affine combination as defined in Definition 2) of two data points as an augmented data point. But when we have |F| features, we can assume that the hypothetical manifold on which our data lies is |F|-dimensional. An |F|-dimensional manifold can be approximated by a collection of (|F|−1)-dimensional planes.
Given |F| sample points, we could exactly derive the equation of a unique (|F|−1)-dimensional plane containing these |F| sample points. By Definition 1, however, a small dataset satisfies $\log_{10}(n/|F|) < 1$, so it is even possible that n < |F|. To resolve this problem, we create shadow data points or shadowsamples from our n parent data points in the minority class. Each shadow data point is generated by adding noise from a normal distribution $\mathcal{N}(0, h(\sigma_f))$ for every feature $f \in F$, where $h(\sigma_f)$ is some function of the sample variance $\sigma_f$ of the feature f. For each of the n data points we can generate m shadow data points such that $n \times m \gg |F|$. It is then possible to choose |F| shadow data points from the n × m shadow data points even if n < |F|.
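A minimal sketch of the shadowsample construction, assuming for simplicity that $h(\sigma_f)$ is the per-feature sample standard deviation (the function name make_shadow_points is ours):

```python
import numpy as np

def make_shadow_points(X_min, m, sigma_f, rng=np.random.default_rng(0)):
    """For each of the n parent points, draw m shadowsamples by adding
    feature-wise Gaussian noise; returns an (n*m, |F|) array."""
    n, num_f = X_min.shape
    noise = rng.normal(0.0, sigma_f, size=(n, m, num_f))   # sigma_f broadcasts per feature
    return (X_min[:, None, :] + noise).reshape(n * m, num_f)

X_min = np.random.default_rng(1).normal(size=(4, 6))       # n = 4 < |F| = 6: a small dataset
shadows = make_shadow_points(X_min, m=10, sigma_f=X_min.std(axis=0))
print(shadows.shape)                                       # (40, 6): now n*m >> |F|
```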
Since real-life data are mostly nonlinear, we have to localize our strategy to approximate the data manifold effectively. For each parent data point p in a small dataset D, let $N^p_k$ denote the set of k-nearest neighbors of p in D (including p itself). We can always choose m > 0 such that $|N^p_k| \times m \gg |F|$. Each time, we choose |F| shadow data points as follows: we first choose a random parent data point p and then restrict the domain of choice to the shadowsamples generated by the parent data points in $N^p_k$.
We then take a random affine combination of the |F| chosen shadowsamples to create one augmented Localized Random Affine Shadowsample, or LoRAS sample, as defined in Definition 2. Thus, a LoRAS sample is an artificially generated sample drawn from an (|F|−1)-dimensional plane that locally approximates the underlying hypothetical |F|-dimensional data manifold.
Theoretically, we can generate n′ LoRAS samples such that $\log_{10}(n'/|F|) \ge 1$ and use them for training a ML model. In this article, all imbalanced classification problems that we deal with are binary classification problems. For such a problem, there is a minority class $C_{min}$ containing relatively few samples compared to a majority class $C_{maj}$. We can thus consider the minority class as a small dataset and use the LoRAS algorithm to oversample it. For every data point p, we denote the set of shadowsamples generated from p by $S_p$. In practice, one can also choose $2 \le N_{aff} \le |F|$ shadowsamples for an affine combination and choose a desired number $N_{gen}$ of oversampled points to be generated by the algorithm. The LoRAS oversampling procedure is described in Algorithm 1 and can be used to oversample minority classes in highly imbalanced datasets. Note that the input parameters of our algorithm are: the number of nearest neighbors per sample, k; the number of shadow points generated per parent data point, $|S_p|$; the list $L_\sigma$ of standard deviations of the normal distributions used to add noise to every feature when generating the shadowsamples; the number of shadowsamples chosen for each affine combination, $N_{aff}$; and the number of points generated for each neighborhood group, $N_{gen}$.
We have listed the default values of the LoRAS parameters in Algorithm 1, which shows the pseudocode for the LoRAS algorithm.

Algorithm 1: LoRAS oversampling
Constraint: $N_{aff} < k \times |S_p|$
  Initialize loras_set as an empty list
  for each minority class parent data point p in $C_{min}$ do
    neighborhood ← the k-nearest neighbors of p, with p appended
    Initialize neighborhood_shadow_sample as an empty list
    for each parent data point q in neighborhood do
      shadow_points ← draw $|S_p|$ shadowsamples for q, drawing noise from normal distributions whose standard deviations are the corresponding entries of $L_\sigma$ (one per feature)
      Append shadow_points to neighborhood_shadow_sample
    repeat
      selected_points ← select $N_{aff}$ random shadow points from neighborhood_shadow_sample
      affine_weights ← create and normalize random weights for selected_points
      generated_LoRAS_sample_point ← selected_points · affine_weights
      Append generated_LoRAS_sample_point to loras_set
    until $N_{gen}$ resulting points are created
  return loras_set, the set of generated LoRAS data points

One could use a random grid search technique to find appropriate parameter combinations within given ranges of parameters. As an output, our algorithm generates a LoRAS dataset for the oversampled minority class, which can subsequently be used to train a ML model.
The implementation of the algorithm in Python (V 3.7.4) and an example Jupyter Notebook for the credit card fraud detection dataset are provided in the GitHub repository https://github.com/narek-davtyan/LoRAS. In our computational code on GitHub, $|S_p|$ corresponds to num_shadow_points, $L_\sigma$ corresponds to list_sigma_f, $N_{aff}$ corresponds to num_aff_comb, and $N_{gen}$ corresponds to num_generated_points.
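For intuition, the following is a compact, unoptimized sketch of Algorithm 1 (the parameter names mirror those listed above, but the code itself is our illustration rather than the reference implementation from the repository):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def loras_oversample(X_min, k=5, num_shadow_points=40, list_sigma_f=None,
                     num_aff_comb=None, num_generated_points=30, seed=42):
    """Sketch of LoRAS: for each minority point, build shadowsamples in its
    k-neighborhood and return Dirichlet affine combinations of them."""
    rng = np.random.default_rng(seed)
    n, num_f = X_min.shape
    list_sigma_f = X_min.std(axis=0) if list_sigma_f is None else list_sigma_f
    num_aff_comb = num_f if num_aff_comb is None else num_aff_comb
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)    # +1 so that p itself is included
    loras_set = []
    for p in range(n):
        _, idx = nn.kneighbors(X_min[p].reshape(1, -1))
        parents = X_min[idx[0]]                            # neighborhood of p, including p
        # draw |S_p| shadowsamples around every parent in the neighborhood
        noise = rng.normal(0.0, list_sigma_f,
                           size=(len(parents), num_shadow_points, num_f))
        shadows = (parents[:, None, :] + noise).reshape(-1, num_f)
        for _ in range(num_generated_points):              # N_gen points per neighborhood
            chosen = shadows[rng.choice(len(shadows), num_aff_comb, replace=False)]
            weights = rng.dirichlet(np.ones(num_aff_comb)) # affine weights summing to 1
            loras_set.append(weights @ chosen)
    return np.asarray(loras_set)

X_min = np.random.default_rng(3).normal(size=(25, 8))      # toy minority class
print(loras_oversample(X_min).shape)                       # (25 * 30, 8)
```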
Figure 1: Visualization of the workflow, demonstrating a step-by-step explanation of LoRAS oversampling. (a) The parent data points of the minority class $C_{min}$. For a data point p we choose three of its closest neighbors (using knn) to build a neighborhood of p, depicted as the box. (b) Extracting the four data points in the closest neighborhood of p (including p). (c) Drawing shadow points from normal distributions centered at these parent data points. (d) We randomly choose three shadow points at a time to obtain a random affine combination of them (spanning a triangle). We finally generate a novel LoRAS sample point from the neighborhood of a single data point p.

Case studies
To test the potential of LoRAS as an oversampling approach, we designed benchmarking experiments with a total of 28 imbalanced datasets. This diverse collection of case studies gives a comprehensive idea of the advantages of LoRAS over existing oversampling methods.

Datasets used for validation
Here we provide a brief description of the datasets and the sources that we have used for our studies.
Scikit-learn imbalanced benchmark datasets: The imblearn.datasets package complements the sklearn.datasets package. It provides 27 pre-processed imbalanced datasets, spanning a large range of real-world problems from several fields such as business, computer science, biology, medicine, and technology. This collection of datasets was made available in the imblearn.datasets python library by Lemaître et al. (2017) and benchmarked by Ding (2011). Many of these datasets have been used in various research articles on oversampling approaches (Ding, 2011; Saez et al., 2016).

Methodology
For each case study, we split the dataset into 50% training and 50% testing data. We did a pilot study with ML classifiers such as k-nearest neighbors (knn), Support Vector Machine (svm) with a linear kernel, Logistic Regression (lr), Random Forest (rf), and AdaBoost, guided by the observations of Blagus and Lusa (2013). First, we trained the models on the unmodified dataset to observe how they perform without any oversampling. Then, we oversampled the minority class using SMOTE, Borderline1 SMOTE, Borderline2 SMOTE, SVM SMOTE, ADASYN, and LoRAS, and retrained the ML algorithms on the oversampled datasets. We then measured the performance of our models using metrics such as Balanced Accuracy and F1-Score. In our study, we benchmark LoRAS against several other oversampling algorithms on the 27 benchmark datasets. To ensure fairness of comparison, we oversampled such that the total number of augmented samples generated for the minority class was as close to the number of samples in the majority class as each oversampling algorithm allows. For the credit card fraud detection dataset we compared the performance of several oversampling techniques, including LoRAS, across several ML models, ensuring that we build the best possible ML model using customized parameter settings for the respective oversampling techniques. For this case we chose the ML models lr and rf, since their performance was the best.
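A sketch of one benchmarking round under this methodology, using imbalanced-learn's SMOTE as the example oversampler on one of the 27 benchmark datasets (the dataset key "ecoli" and the classifier choice are illustrative):

```python
from imblearn.datasets import fetch_datasets
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

data = fetch_datasets()["ecoli"]                       # one of the 27 imblearn benchmark datasets
X_tr, X_te, y_tr, y_te = train_test_split(             # 50/50 train/test split, as in our methodology
    data.data, data.target, test_size=0.5, stratify=data.target, random_state=42)

# baseline: train without any oversampling
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("baseline:", balanced_accuracy_score(y_te, pred), f1_score(y_te, pred))

# oversample the minority class to (near) parity with the majority class, then retrain
X_os, y_os = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_tr, y_tr)
clf = LogisticRegression(max_iter=1000).fit(X_os, y_os)
pred = clf.predict(X_te)
print("SMOTE:   ", balanced_accuracy_score(y_te, pred), f1_score(y_te, pred))
```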
LoRAS has several parameters (k, $|S_p|$, $L_\sigma$, $N_{aff}$, $N_{gen}$). For a fair comparison with the other models, we chose the same value of k, the number of nearest neighbors of a minority class sample, wherever applicable.

Results
For imbalanced datasets there are more meaningful performance measures than accuracy, such as Sensitivity (Recall), Precision, F1-Score (F-Measure), and Balanced Accuracy, all of which can be derived from the confusion matrix generated while testing the model. The F1-Score is the harmonic mean of precision and recall and therefore balances a model in terms of precision and recall. Balanced Accuracy is the mean of the individual class accuracies and, in this context, is more informative than the usual accuracy score: a high Balanced Accuracy ensures that the ML algorithm learns adequately from each individual class. These measures have been defined and discussed thoroughly by Abd Elrahman and Abraham (2013). We use the above-mentioned performance measures wherever applicable in this article.

Scikit-learn imbalanced datasets: In Table 2 we show the Balanced Accuracies and F1-Scores for the 27 inbuilt datasets in Scikit-learn. Calculating average performance over all datasets, LoRAS has the best Balanced Accuracy and F1-Score. As expected, SMOTE improved both Balanced Accuracy and F1-Score compared to training without oversampling. Interestingly, the oversampling approaches SVM-SMOTE and Borderline1 SMOTE also improved the average F1-Score compared to SMOTE, but compromised with a lower Balanced Accuracy. Between SVM-SMOTE and Borderline1 SMOTE, we noted that SVM-SMOTE improved the F1-Score the most but has the lower Balanced Accuracy. In contrast, our LoRAS approach produces a better average Balanced Accuracy than SMOTE while maintaining the highest average F1-Score among all the oversampling techniques. From Table 3, we see that LoRAS performs best in terms of Balanced Accuracy and F1-Score for 11 and 9 datasets respectively. Thus, LoRAS outperforms the other oversampling algorithms on both Balanced Accuracy and F1-Score for the largest number of datasets. We also observe a trend that the oversampling approaches that perform well in terms of F1-Score have a relatively weaker performance in Balanced Accuracy. Interestingly, LoRAS not only proves to be the best choice for the highest number of datasets but also retains its performance on both measures.

Credit card fraud detection dataset: The credit card fraud detection dataset contains 492 fraud instances among 284,807 transactions. The task is to predict fraudulent transactions. In Table 4, we show the number of samples generated by the several oversampling approaches. For testing, we have 246 fraudulent and 142,158 non-fraudulent transactions in each case. We summarize our results in Table 5, which shows the scores of our models for the performance measures Precision, Recall, F1-Score, and Balanced Accuracy for the lr and rf ML models. From Table 5 we infer that the rf model with LoRAS oversampling has the best F1-Score. Interestingly, LoRAS with both lr and rf produces a Balanced Accuracy higher than 0.85 and an F1-Score higher than 0.8. Other models, such as SVM SMOTE (with both lr and rf) and ADASYN with lr, also produce very good results. Thus, LoRAS produces a better F1-Score with a reasonable compromise on Balanced Accuracy.

Discussion
We have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique, since it provides a better estimate of the mean of the underlying local data distribution of the minority class data space. Let $X = (X^1, \ldots, X^{|F|}) \in C_{min}$ be an arbitrary minority class sample. Let $N^X_k$ be the set of the k-nearest neighbors of X, which we consider as the neighborhood of X. Both SMOTE and LoRAS generate augmented samples within one such neighborhood $N^X_k$ at a time. We assume that a random variable $X \in N^X_k$ follows a shifted t-distribution with k degrees of freedom, location parameter $\mu$, and scaling parameter $\sigma'$. Note that here $\sigma'$ does not refer to the standard deviation but sets the overall scaling of the distribution (Jackman, 2009), which we choose to be the sample variance in the neighborhood of X. A shifted t-distribution is used to estimate population parameters when there are few samples (usually ≤ 30) and/or the population variance is unknown. Since in SMOTE and LoRAS we generate samples from a small neighborhood, we can argue in favour of our assumption that, locally, a minority class sample X, as a random variable, follows a t-distribution. Following Blagus and Lusa (2013), we assume that if $X, X' \in N^X_k$ then X and X′ are independent. For $X, X' \in N^X_k$, we also assume

$$\mathbb{E}[X] = \mathbb{E}[X'] = \mu, \qquad \mathrm{Var}(X) = \mathrm{Var}(X') = \sigma'^2,$$

where $\mathbb{E}[X]$ and $\mathrm{Var}(X)$ denote the expectation and variance of the random variable X respectively. The mean, however, has to be estimated by an estimator statistic (i.e., a function of the samples). Both SMOTE and LoRAS can be considered as estimator statistics for the mean of the t-distribution that $X \in C_{min}$ follows locally.
Theorem 1. Both SMOTE and LoRAS are unbiased estimators of the mean µ of the t-distribution that X follows locally. However, the variance of the LoRAS estimator is smaller than that of the SMOTE estimator, given that |F| > 2.

Proof. A shadowsample S is a random variable S = X + B, where $X \in N^X_k$, the neighborhood of some arbitrary $X \in C_{min}$, and B follows $\mathcal{N}(0, \sigma_B^2)$. Note that

$$\mathbb{E}[S] = \mathbb{E}[X] + \mathbb{E}[B] = \mu + 0 = \mu \quad (1)$$

$$\mathrm{Var}(S) = \mathrm{Var}(X) + \mathrm{Var}(B) = \sigma'^2 + \sigma_B^2 \quad (2)$$

assuming X and B are independent. Now, a LoRAS sample $L = \alpha_1 S_1 + \cdots + \alpha_{|F|} S_{|F|}$, where $S_1, \ldots, S_{|F|}$ are shadowsamples generated from the elements of the neighborhood $N^X_k$ of X, such that $\alpha_1 + \cdots + \alpha_{|F|} = 1$. The affine combination coefficients $\alpha_1, \ldots, \alpha_{|F|}$ follow a Dirichlet distribution with all concentration parameters equal to 1 (assuming all features to be equally important). For arbitrary $i, j \in \{1, \ldots, |F|\}$,

$$\mathbb{E}[\alpha_i] = \frac{1}{|F|}, \qquad \mathrm{Var}(\alpha_i) = \frac{|F|-1}{|F|^2(|F|+1)}, \qquad \mathrm{Cov}(\alpha_i, \alpha_j) = \frac{-1}{|F|^2(|F|+1)} \quad (i \neq j),$$

where Cov(A, B) denotes the covariance of two random variables A and B. Assuming α and S to be independent,

$$\mathbb{E}[L] = \sum_{i=1}^{|F|} \mathbb{E}[\alpha_i]\, \mathbb{E}[S_i] = |F| \cdot \frac{1}{|F|} \cdot \mu = \mu. \quad (3)$$

Thus L is an unbiased estimator of µ. For $j, k, l \in \{1, \ldots, |F|\}$, $\alpha_k \alpha_l$ is independent of $S^j_k S^j_l$ (where $S^j_k$ denotes the j-th component of the shadowsample $S_k$), so $\mathbb{E}[\alpha_k \alpha_l S^j_k S^j_l] = \mathbb{E}[\alpha_k \alpha_l]\, \mathbb{E}[S^j_k S^j_l]$. Hence, for an arbitrary j, the j-th component $L^j$ of a LoRAS sample satisfies

$$\mathrm{Var}(L^j) = \mathrm{Var}(\alpha_1 S^j_1 + \cdots + \alpha_{|F|} S^j_{|F|}) = \frac{2(\sigma'^2 + \sigma_B^2)}{|F|+1}. \quad (4)$$

For LoRAS, we take an affine combination of |F| shadowsamples, whereas SMOTE considers an affine combination of two minority class samples. Note that, since a SMOTE-generated oversample can be interpreted as a random affine combination of two minority class samples, we can set |F| = 2 for SMOTE, independent of the number of features. From Equation 3, this implies that SMOTE is also an unbiased estimator of the mean of the local data distribution, and from Equation 4 the variance of a SMOTE-generated sample as an estimator of µ is $\frac{2\sigma'^2}{3}$ (since B = 0 for SMOTE). But for LoRAS as an estimator of µ, when |F| > 2, the variance is less than that of SMOTE (for a sufficiently small shadowsample noise variance $\sigma_B^2$).
This implies that, locally, LoRAS can estimate the mean of the underlying t-distribution better than SMOTE.
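Theorem 1 can also be checked numerically. The following Monte Carlo sketch (our own illustration, not part of the proof) repeatedly forms SMOTE-style and LoRAS-style estimates from a simulated neighborhood and compares their empirical variances around µ; the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, scale, k, num_f = 5.0, 1.0, 30, 10        # location, t scale, neighborhood size, |F|
trials = 200_000

# simulated neighborhood: k draws from a shifted t-distribution per trial
X = mu + scale * rng.standard_t(df=k, size=(trials, k))

# SMOTE estimate: affine combination of two neighbors with lambda ~ U[0, 1]
lam = rng.random((trials, 1))
smote = (lam * X[:, :1] + (1 - lam) * X[:, 1:2]).ravel()

# LoRAS estimate: flat-Dirichlet affine combination of |F| shadowsamples (small noise B)
shadows = X[:, :num_f] + rng.normal(0.0, 0.05, size=(trials, num_f))
alpha = rng.dirichlet(np.ones(num_f), size=trials)
loras = (alpha * shadows).sum(axis=1)

print("means:", smote.mean(), loras.mean())   # both approach mu = 5 (unbiasedness)
print("vars :", smote.var(), loras.var())     # LoRAS variance is clearly smaller for |F| > 2
```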

Conclusions
Oversampling with LoRAS produces, on average, comparatively balanced ML model performances in terms of Balanced Accuracy and F1-Score. This is due to the fact that, in most cases, LoRAS produces fewer misclassifications in the majority class with a reasonably small compromise of additional misclassifications in the minority class. Moreover, we infer that our LoRAS oversampling strategy can better estimate the mean of the underlying local distribution of a minority class sample (considering it as a random variable).
The distribution of the minority class data points is considered in oversampling techniques such as Borderline1 SMOTE, Borderline2 SMOTE, SVM-SMOTE, and ADASYN (Gosain and Sardana, 2017). SMOTE and LoRAS are the only two techniques among the oversampling techniques we explored that deal with the problem of imbalance just by generating new data points, independent of the distribution of the minority and majority class data points. Thus, comparing LoRAS and SMOTE gives a fair impression of the performance of our novel LoRAS algorithm as an oversampling technique that relies on resampling alone, without considering any aspect of the distributions of the minority and majority classes. Other extensions of SMOTE, such as Borderline1 SMOTE, Borderline2 SMOTE, SVM-SMOTE, and ADASYN, can also be built on the LoRAS oversampling principle. According to our analyses, LoRAS already shows great potential across a broad variety of applications and emerges as a true alternative to SMOTE for processing highly imbalanced datasets.
Availability of code: The implementation of the algorithm in Python (V 3.7.4) and an example Jupyter Notebook for the credit card fraud detection dataset are provided in the GitHub repository https://github.com/narek-davtyan/LoRAS.