1 Introduction

The class imbalance problem arises in various real world applications such as: medical diagnosis (Fotouhi et al., 2019), credit card fraud detection (Li et al., 2021), software testing (Balogun et al., 2020), e-commerce (Wu & Meng, 2016), and stock selection (Atiya et al., 1997). The unbalanced data problem occurs when one class is under-represented in the data (the minority class), while the other class is over-represented (the majority class). The class imbalance could be due to the data collection process. For example, in medical diagnosis, normal cases could far outnumber patients suffering from a certain uncommon disease (Liu et al., 2022; Fotouhi et al., 2019), although the target is to identify the minority class denoting the patients.

Standard machine learning classifiers such as support vector machines (SVM) (Hearst et al., 1998), decision trees (Quinlan, 1996), and K-nearest neighbors (KNN) (Guo et al., 2003) generally assume, at least implicitly, an even class distribution. Thus, applying the standard approaches without handling the class imbalance could dramatically degrade the classification performance, since classifiers would be biased towards the over-represented (majority) class.

There are three major approaches for handling the class imbalance problem: the data level approach (Batista et al., 2004; Chawla et al., 2002; Guzmán-Ponce et al., 2021), the cost sensitive approach (Devi et al., 2022), and the algorithm level approach (Mullick et al., 2018; Buda et al., 2018; Ganaie et al., 2021).

The data level approach is the most prevalent paradigm in handling unbalanced data. Data level algorithms are sampling methods that apply data pre-processing before classification, typically by increasing the number of minority class samples which is known as over-sampling (Chawla et al., 2002; Koziarski et al., 2021). Conversely, some majority class samples could be excluded from the data, which is known as under-sampling (Chennuru & Timmappareddy, 2022; Vuttipittayamongkol & Elyan, 2020). A key advantage of the data level approach is its generality since it can be applied to any classifier.

Over-sampling can be performed using two main approaches. The first approach is replicating the original minority class samples such as: random over-sampling (Abd Elrahman & Abraham, 2013). However, this approach may result in over-fitting by over-emphasizing noisy minority samples. The second approach for increasing the number of minority class samples is to generate new synthetic minority class samples (Abd Elrahman & Abraham, 2013; Chawla et al., 2002; Wan et al., 2017; Goodman et al., 2022).

One of the most popular over-sampling methods is “Synthetic Minority Over-sampling Technique (SMOTE)” developed by Chawla et al. (2002). The SMOTE method generates synthetic data by applying linear interpolation between a minority class point and one of its K nearest neighbors. SMOTE is a powerful over-sampling method that has been widely adopted in many applications (Fernández et al., 2018; Ahsan et al., 2018; Kishor & Chakraborty, 2021). Furthermore, a plethora of SMOTE extensions have been developed such as: Borderline SMOTE (Han et al., 2005), Safe-level SMOTE (Bunkhumpornpat et al., 2009), ADASYN (He et al., 2008), SVM SMOTE (Nguyen et al., 2011), Localized Random Affine Shadowsampling (LoRAS) (Bej et al., 2021), CDSMOTE (Elyan et al., 2021), and Deep SMOTE (Dablain et al., 2022).

Another technique for synthetically generating minority class samples is to estimate the underlying minority class probability distribution, and generate samples from it such as: PDF oversampling (PDFOS) (Gao et al., 2014) and random walk oversampling (Zhang & Li, 2014). However, density estimation in case of scarce data samples would be inaccurate especially for high dimensional data. On the other hand, for high dimensional data such as images, Wan et al. (2017) develop a variational autoencoder for generating similar synthetic samples to the original ones.

In this work, we mainly investigate the SMOTE method due to its popularity and competitive performance. Despite the efficacy of the SMOTE over-sampling algorithm (Chawla et al., 2002), it has some limitations. For example, SMOTE oversamples noisy examples which could magnify the noise impact and degrade the classification performance. In addition, SMOTE could falsely generate synthetic samples in the majority class region misleading the classifier. Furthermore, SMOTE does not consider minority classes composed of several small disjuncts or sub-concepts (Prati et al., 2004). One of the main reasons for all of the aforementioned SMOTE pitfalls is the fact that the SMOTE patterns are not genuine as they are not generated from the original minority class distribution. It is important to establish that they are true representatives of the underlying class, by showing that they obey a similar distribution.

Another concern regarding SMOTE is that it is not sufficiently grounded in a solid mathematical theory. As a step towards this goal, this work aims to establish a mathematical foundation for analyzing the SMOTE algorithm. Specifically, this work derives a mathematical formulation for the probability distribution of the SMOTE synthetically generated samples. The benefit of this analysis is that it allows us to study how relevant the generated samples are, or how close they are in distribution to the true ones. Moreover, better-suited SMOTE extensions could be constructed based on the insights gained from the theoretical analysis. Also, the analysis will shed some light on the other SMOTE extensions.

The main contributions of this work are summarized as follows:

  • In this work, we derive a mathematical formulation for the probability distribution of the SMOTE generated patterns. The presented theoretical formulation is general, and it can be applied to any class-conditional probability distribution. To the best of our knowledge, this is the first theoretical analysis deriving the probability density of the SMOTE generated patterns.

  • As a follow-up test, we illustrate the general theoretical analysis by applying it to some distributions for verification of the main contribution.

The paper is organized as follows: Section 2 presents a literature review. Then, the mathematical derivation of SMOTE probability distribution is introduced in Sect. 3. After that, the experimental results are demonstrated in Sect. 4. Finally, Sect. 5 concludes the paper and presents some potential future research directions.

2 Related work

Handling unbalanced data has been extensively studied in the literature (Japkowicz & Stephen, 2002; Kamalov et al., 2022; Wang et al., 2018; Haixiang et al., 2017; Wang et al., 2021; Kaur et al., 2019). However, only a few studies provide theoretical or empirical analyses of the data sampling methods, in particular SMOTE. In this review, we focus primarily on these works. Elreedy and Atiya (2019) derive the expectation and covariance matrix of the SMOTE generated patterns. However, the analysis we present here is not restricted to the moments, since we develop a mathematical formulation for the density function itself of the SMOTE generated samples. Another key distinction between this work and the work of Elreedy and Atiya (2019) is that the previous work makes some approximations, while the work presented here is an exact formula. Studying the density characteristics of the SMOTE patterns could stimulate the development of density oriented over-sampling methods (Yan et al., 2022; Mayabadi & Saadatfar, 2022). To the best of our knowledge, this is the first analytical formula for the density of SMOTE generated samples. We limit this work to developing the analytical formula, rather than a complete analysis of SMOTE, as the latter is outside the scope of the paper. For a comprehensive analysis of the features of SMOTE, refer to the work of Elreedy and Atiya (2019).

Another theoretical analysis for resampling algorithms is performed by Moniz and Monteiro (2021). In particular, the authors apply no free lunch machine learning theorems to imbalanced learning. In addition, they provide a comparative empirical study for different resampling methods which are: random under-sampling, random over-sampling, importance sampling, SMOTE, and SMOTE combined with random under-sampling. The authors conclude that any two resampling strategies would have the same classification performance given no a priori knowledge or data assumptions.

Several empirical studies have been conducted to inspect sampling methods including SMOTE. For example, Luengo et al. (2011) analyze the behavior of different sampling methods, including SMOTE, one of its extensions called SMOTE-ENN, and an under-sampling method named EUSCHC (García & Herrera, 2009). The authors measure the impact of the different sampling methods on the shape of the processed data after sampling including: the overlapping between the different classes, and class separability and its geometrical properties. However, these measures do not consider the distribution of the generated examples.

Furthermore, the study introduced by Dudjak and Martinović (2020) develops a comparative analysis of the classification performance for diverse SMOTE extensions. The study classifies SMOTE extensions into three different categories according to the interpolation mechanism. The three categories are: SMOTE-like interpolation, range restricted interpolation, and multiple interpolations. SMOTE-like interpolation employs the same interpolation mechanism as SMOTE such as: Modified SMOTE (Hu et al., 2009). The range restricted interpolation elects only particular minority class samples for interpolation such as: Borderline SMOTE (Han et al., 2005). The multiple interpolations method adopts multiple neighbors for the interpolation process like Distance-SMOTE (De La Calleja & Fuentes, 2007).

Another piece of work analyzing resampling methods is introduced by García et al. (2010). This study investigates the impacts of the employed classifier and imbalance ratio on the classification performance. The authors recommend using over-sampling for low and moderate class imbalance ratios. Thabtah et al. (2020) provide another in-depth analysis of the imbalance ratio and its effect on classifier accuracy using large scale experimental analysis. Moreover, Kamalov et al. (2022) attempt to determine the optimal sampling ratio using a large-scale study. The study developed by Dubey et al. (2014) compares different under-sampling methods, over-sampling methods, and combinations of both approaches. Their experimental analysis considers random over-sampling, SMOTE, random under-sampling, and K-Medoids under-sampling (Dubey et al., 2014). Their work shows that sophisticated sampling methods such as SMOTE and K-Medoids surpass random sampling methods.

Bolívar et al. (2022) conduct an empirical analysis evaluating the SMOTE performance on big data. Specifically, the authors consider high dimensional and sparse data. Their results indicate that the sparsity is more influential than dimensionality on the SMOTE performance on big data.

Several contributions have been devoted to determining the optimal over-sampling rate. For example, Weiss and Provost (2003) perform an experimental analysis to find the optimal class ratio from thirteen proposed class distributions by varying the minority class percentage in the training set. They conclude that the optimal balance is not necessarily achieved at full balance, and that it is a function of the underlying dataset. Albisua et al. (2013) extend the analysis developed by Weiss and Provost (2003) by conducting experiments on several sampling methods and different classifiers. Their experiments demonstrate that the optimal class balance depends not only on the data, but also on the employed classifier and the re-sampling method.

These works provide in-depth analyses of the functionality of SMOTE and other resampling approaches. These analyses are very useful for guiding researchers towards better usage of these algorithms. However, much of this analysis is empirical, and there is little theoretical analysis. This paper attempts to fill this void and provides an exact and full characterization of the probability density of SMOTE-generated patterns. This will help researchers understand the functioning of SMOTE and find out how the different factors impact its performance.

3 SMOTE density analysis

3.1 SMOTE algorithm

In this section, we briefly describe the SMOTE over-sampling algorithm developed by Chawla et al. (2002). The SMOTE over-sampling algorithm proceeds as follows:

Algorithm 1: The SMOTE over-sampling algorithm (Chawla et al., 2002). For each synthetic sample, randomly select a minority class sample \(x_0\), randomly choose one of its K nearest minority neighbors \(x_k\), and generate \(z=(1-w)x_0+wx_k\) with \(w\) drawn uniformly from [0, 1].
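
For concreteness, the following is a minimal NumPy sketch of this procedure. It is our own illustrative implementation (the function and parameter names, such as smote_sketch and n_synthetic, are ours), not the reference code of Chawla et al. (2002); in practice, library implementations such as imblearn.over_sampling.SMOTE may be used instead.

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, K=5, rng=None):
    """Generate synthetic minority samples by SMOTE-style interpolation.

    X_min       : (N, d) array of minority class samples.
    n_synthetic : number of synthetic samples to generate.
    K           : number of nearest neighbors considered for interpolation.
    """
    rng = np.random.default_rng(rng)
    N, d = X_min.shape
    synthetic = np.empty((n_synthetic, d))
    for i in range(n_synthetic):
        # randomly select a minority class sample x0
        x0 = X_min[rng.integers(N)]
        # find its K nearest minority neighbors (excluding x0 itself)
        dists = np.linalg.norm(X_min - x0, axis=1)
        neighbors = np.argsort(dists)[1:K + 1]
        # choose one of the K neighbors uniformly at random
        xk = X_min[rng.choice(neighbors)]
        # interpolate: z = (1 - w) x0 + w xk, with w ~ Uniform[0, 1]
        w = rng.uniform()
        synthetic[i] = (1.0 - w) * x0 + w * xk
    return synthetic
```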

Figure 1 demonstrates the SMOTE generation mechanism. It can be noted from the figure that the SMOTE patterns lie on the connection lines between the minority class samples and their K nearest neighbors. The inward positioning of SMOTE patterns makes them more contracted than the original distribution, as inferred by Elreedy and Atiya (2019). This supports our argument that SMOTE patterns do not necessarily follow the original minority class density. In this work, we analytically derive the probability density function of the SMOTE generated patterns.

Fig. 1: The SMOTE interpolation mechanism displaying the original minority samples and the SMOTE generated patterns

3.2 Notation

In this section, we define the notation adopted in the theoretical analysis.

Let \(x_0\) denote a candidate minority class sample for interpolation by SMOTE, and let the original probability density of the minority class be denoted by \(p_X(x)\). The synthetic sample generated by SMOTE is denoted by Z. Let d be the dimension of the patterns.

The total number of minority class samples is defined as N. Let K represent the total number of nearest neighbors considered by SMOTE, where for each synthetic sample a neighbor index k is drawn uniformly at random from 1 to K (Chawla et al., 2002). The Euclidean distance between the minority class sample \(x_0\) and its chosen \(k^{th}\) neighbor is defined as r.

Let \(B(x_0,r)\) define the spherical ball centered at \(x_0\) with radius r, enclosing all neighbors up to the \(k^{th}\) nearest neighbor of \(x_0\). The integral \(I_{B(x_0,r)}\), called the coverage, is the integral of the minority class density over the ball \(B(x_0,r)\). The integral \(I_{B(x_0,r)}\) is computed as follows:

$$I_{B(x_0,r)} = \int_{B(x_0,r)} p_X(x)\, \textrm{d}x.$$
(1)

The incomplete beta function \(B(q; a, b)\) (Dutka, 1981; Al-Sirehy & Fisher, 2013a, b) is defined as:

$$\begin{aligned} B(q;a,b) =\int _{t=0}^q t^{a-1} (1-t)^{b-1}\,dt \end{aligned}$$
(2)
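
Numerically, the non-regularized incomplete beta function of Eq. (2) can be obtained from SciPy, which exposes the regularized form; the small helper below is our own (the name incomplete_beta is ours) and is reused in the sketches that follow.

```python
from scipy.special import betainc, beta

def incomplete_beta(q, a, b):
    """Non-regularized incomplete beta B(q; a, b) = int_0^q t^(a-1) (1-t)^(b-1) dt.

    scipy.special.betainc returns the regularized incomplete beta,
    so we multiply by the complete beta function B(a, b).
    """
    return betainc(a, b, q) * beta(a, b)

# quick sanity check against a closed form: B(q; 1, 1) = q
assert abs(incomplete_beta(0.3, 1.0, 1.0) - 0.3) < 1e-12
```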

3.3 Theoretical analysis of SMOTE density

In this section, we introduce a mathematical analysis for the SMOTE oversampling method developed by Chawla et al. (2002). Specifically, we evaluate the probability density function of the SMOTE generated patterns \(p_Z(z)\) for a general minority class density \(p_X(x)\).

Theorem 1

Let X be a random variable with density \(p_X(x)\), from which the N minority class samples are drawn independently. Let Z be the random variable obtained by SMOTE's random linear interpolation between a randomly selected minority sample and one of its K nearest neighbors, as given in Algorithm 1. Then, the probability density of Z is given by:

$$\begin{aligned} p_Z(z)\;=\;& {} (N-K){{N-1}\atopwithdelims (){K}}\int _{x} p_X(x)\int _{r=\Vert z-x\Vert }^{\infty }\! p_X\Bigg (x+\frac{(z-x)r}{\Vert z-x\Vert }\Bigg ) \Bigg ({\frac{r^{d-2}}{\Vert z-x\Vert ^{d-1}}}\Bigg ) \\{} & {} \times B\Bigg (1-I_{B(x,r)};N-K-1,K\Bigg ) \textrm{d}r \, \textrm{d}x \end{aligned}$$

where \(B\Bigg (1-I_{B(x,r)};N-K-1,K\Bigg )\) is the incomplete beta function (Dutka, 1981) defined in Eq. (2).

Lemma 1

The conditional probability density function of the SMOTE generated patterns given a certain minority class sample \(x_0\), \(p_Z(z\vert x_0)\) is evaluated as:

$$\begin{aligned} p_Z(z\vert x_0)\;=\;& {} (N-K){{N-1}\atopwithdelims (){K}}\int _{r=\Vert z-x_0\Vert }^{\infty }\! p_X\Bigg (x_0+\frac{(z-x_0)r}{\Vert z-x_0\Vert }\Bigg )\Bigg ({\frac{r^{d-2}}{\Vert z-x_0\Vert ^{d-1}}}\Bigg )\\{} & {} \times B\Bigg (1-I_{B(x_0,r)};N-K-1,K\Bigg ) \textrm{d}r \end{aligned}$$

This theorem essentially gives a full analytic solution for the probability density of the SMOTE generated samples. The formulas are exact and do not rely on any assumptions or approximations.
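
As an illustration, the sketch below evaluates the conditional density of Lemma 1 by one-dimensional numerical quadrature for a convenient special case: a standard two-dimensional Gaussian minority density with \(x_0\) at the origin, for which the coverage integral has the closed form \(I_{B(x_0,r)}=1-e^{-r^2/2}\). This is our own illustrative code (assuming NumPy/SciPy), not part of the derivation, and the helper names are ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betainc, beta, comb

def smote_conditional_density(z, N=50, K=3):
    """Lemma 1 density p_Z(z | x0) for p_X = N(0, I_2) with x0 = 0 and d = 2."""
    dist = np.linalg.norm(np.asarray(z, dtype=float))   # ||z - x0|| with x0 = 0
    d = 2

    def integrand(r):
        p_x = np.exp(-r**2 / 2) / (2 * np.pi)       # p_X at distance r from the origin
        jac = r**(d - 2) / dist**(d - 1)             # geometric factor r^(d-2)/||z-x0||^(d-1)
        cov = 1.0 - np.exp(-r**2 / 2)                # coverage I_B(x0, r), closed form here
        # non-regularized incomplete beta B(1 - I_B; N-K-1, K)
        inc_beta = betainc(N - K - 1, K, 1.0 - cov) * beta(N - K - 1, K)
        return p_x * jac * inc_beta

    prefactor = (N - K) * comb(N - 1, K)
    value, _ = quad(integrand, dist, np.inf)
    return prefactor * value

# example: evaluate the conditional density at one point z
print(smote_conditional_density([0.5, 0.3], N=50, K=3))
```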

Proof of Theorem 1 and Lemma 1

The SMOTE algorithm first selects one of the minority samples \(x_0\) randomly. Assume for the moment that this point \(x_0\) is fixed or given. Then we select a neighbor (say \(x_k\)) randomly out of the K nearest neighbors of \(x_0\). Then we pick a point z randomly from the line joining that neighbor \(x_k\) and \(x_0\). This is given by the linear interpolation formula:

$$\begin{aligned} z=(1-w)x_0+wx_k \end{aligned}$$
(3)

where w is a uniform random number in [0, 1].

Consider a two-dimensional case for illustration, and consider a probability mass located at area A in Fig. 2, which denotes an infinitesimal area element around the chosen neighbor \(x_k\), at distance r from \(x_0\). After applying SMOTE, namely Eq. (3), the probability mass is mapped onto the yellow shaded area B, which is an infinitesimal circle sector reaching from \(x_0\) out to the chosen neighbor \(x_k\). Then, the probability mass at area C, which is an infinitesimal area element around the SMOTE generated pattern z (somewhere between \(x_k\) and \(x_0\) according to the value of w), can be evaluated as:

$$\begin{aligned} p(z \in C)=\frac{p(z \in A)}{r \frac{\Vert z-x_0\Vert }{r}} \end{aligned}$$
(4)

where r denotes the Euclidean distance between \(x_0\) and its chosen neighbor \(x_k\), as defined in Sect. 3.2. The division by r is because the mass of A gets spread out over a region of length r, thus diluting the density by that amount. Essentially, the probability mass that is concentrated in area A becomes spread out over the whole circle sector B by the randomized linear interpolation procedure (Eq. 3). Thus, the density is divided by r, due to the target area being bigger by that amount.

Fig. 2: SMOTE density mapping clarification, where \(x_0\) is the minority class pattern on which SMOTE is applied, \(x_k\) denotes the randomly chosen neighbor, and z is the SMOTE generated sample

The division by \( \frac{\Vert z-x_0\Vert }{r}\) is because the area at C is smaller than the area of A by that amount (due to the ratio of arc lengths: the arc length for C is \(\Vert z-x_0\Vert d\theta \) (radius times \(d\theta \)), while for A it is \(r d\theta \) ), so the probability mass gets more concentrated by that amount. This analysis can be generalized to the d-dimensional case as follows:

$$\begin{aligned} p(z \in C)={p(z \in A)}{ \frac{r^{d-2}}{{\Vert z-x_0\Vert }^{d-1}}} \end{aligned}$$
(5)

According to the SMOTE method’s geometry, and as shown in Fig. 2, z lies on the line connecting \(x_0\) and its chosen neighbor x. Then:

$$\begin{aligned} r=\Vert x-x_0\Vert \ge \Vert z-x_0\Vert \end{aligned}$$
(6)

The previous analysis assumes that the \(k^{th}\) neighbor is fixed at a distance r from \(x_0\), i.e. the probability density computed is conditioned on this assumption. The next step is to treat r as random and take the expectation over r:

$$\begin{aligned} p_Z(z \vert x_0,k)=\int _{r=\Vert z-x_0\Vert }^{\infty }\! p_Z(z \vert x_0,r,k)p(r \vert x_0)\textrm{d}r \end{aligned}$$
(7)

Note that we used \(r=\Vert z-x_0\Vert \) as the lower limit of the integral because this is implied in Eq. (6): the distance \(\Vert z-x_0\Vert \) of the interpolated point is always smaller than or equal to the distance of the \(k^{th}\) neighbor r. The term \(p(r \vert x_0)\) represents the probability density that a \(k^{th}\) nearest neighbor of \(x_0\) is located at distance r away from \(x_0\). This term will be evaluated later. Substituting from Eq. (5) into Eq. (7):

$$\begin{aligned} p_Z(z\vert x_0,k)=\int _{r=\Vert z-x_0\Vert }^{\infty }\! \frac{p_X\Bigg (x_0+\frac{(z-x_0)r}{\Vert z-x_0\Vert }\Bigg )}{\int _{S(x_0,r)} p_X(x) \textrm{d}x}\Bigg ({\frac{r^{d-2}}{\Vert z-x_0\Vert ^{d-1}}}\Bigg ) p(r \vert x_0)\textrm{d}r \end{aligned}$$
(8)

where \(S(x_0,r)\) denotes a spherical shell around \(x_0\) with radius r. The first quotient (\(p_X\) divided by the integral over the spherical shell) represents the probability density of a point at the location \(x_k\), given that it occurs at a distance r from \(x_0\). This means that the \(k^{th}\) neighbor (the point that is the target of the interpolation) is at distance r away from \(x_0\). This quotient is obtained by straightforward application of Bayes probability rule, as follows:

$$\begin{aligned} p_X(x \vert x \in {S(x_0,r)})=\frac{p_X(x, x \in {S(x_0,r)})}{p(x \in {S(x_0,r)})}=\frac{p_X(x)}{ {\int _{S(x_0,r)} p_X(x) \textrm{d}x} } \end{aligned}$$
(9)

Note that the point \(x_k\) is written as \(x_0+\frac{(z-x_0)r}{\Vert z-x_0\Vert }\) according to Eq. (3), enforcing the fact that it is a distance r away from \(x_0\) (in other words, that \(x \in {S(x_0,r)}\) according to Eq. (9)). The reason for writing it this way is that it has to be expressed in terms of z. So, \(x_k\) is written as \(x_0\) plus the unit vector in the direction of \(z-x_0\), namely \(\frac{(z-x_0)}{\Vert z-x_0\Vert }\), multiplied by r so that it lands on the shell that is a distance r away from \(x_0\).

In summary, the derivation proceeds in several steps. In the first step, we assume that the \(k^{th}\) neighbor is at a fixed distance r away from \(x_0\), and evaluate the probability density of z given r. Since a neighbor that is a distance r away is located on a spherical shell of radius r, applying Bayes' formula yields the quotient indicated in the formula. In the final step, we take the expectation over r, using the probability density that the \(k^{th}\) neighbor is a distance r away.

Evaluating \(p(r \vert x_0)\) in Eq. (7):

According to the work of Fukunaga and Hostetler (1973), the coverage u of the k nearest neighbors is denoted as:

$$\begin{aligned} u=G(r)=\int _{B(x_0,r)}\! p_X(x) \, \textrm{d}x \end{aligned}$$
(10)

where \({B(x_0,r)}\) is the ball around \(x_0\) enclosing up to the \(k^{th}\) nearest neighbor of \(x_0\) and \(p_X(x)\) represents the probability density function of the underlying distribution (from which the k neighbors are drawn).

Furthermore, according to the work of Fukunaga and Hostetler (1973), u follows a Beta distribution such that \(u \sim Beta(u;k,N-k)\). Then, the density \(p_U(u)\) is defined as:

$$\begin{aligned} p_U(u)=\frac{(N-1)! u^{k-1}(1-u)^{N-k-1}}{(k-1)! (N-k-1)!} \end{aligned}$$
(11)

Using the theory of transformation of random variables (Magdon-Ismail & Atiya, 2002; Venkatesh, 2013), the probability density of r, the distance from \(x_0\) to the \(k^{th}\) neighbor is given by:

$$\begin{aligned} p(r \vert x_0)={p_U(u)}{\big \vert \frac{du}{dr} \big \vert } \end{aligned}$$
(12)

Using Eq. (10), Eq. (12) can be written as:

$$\begin{aligned} p(r \vert x_0)={p_U(u)}{G'(r)} \end{aligned}$$
(13)

where \(G'(r)\) is the first derivative of G(r).

Substituting Eq. (11) into Eq. (13) yields:

$$\begin{aligned} p(r\vert x_0)\;=\;(N-1) {{N-2}\atopwithdelims (){k-1}} G^{k-1}(r){(1-G(r))}^{N-k-1}G'(r) \end{aligned}$$
(14)
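
Equation (14) can be checked by simulation. The sketch below is our own illustrative check (assuming NumPy/SciPy): for a standard two-dimensional Gaussian with \(x_0\) placed at the origin, where \(G(r)=1-e^{-r^2/2}\) and \(G'(r)=r\,e^{-r^2/2}\), it compares the mean \(k^{th}\) nearest neighbor distance predicted by Eq. (14) against a Monte Carlo estimate.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import comb

N, k = 30, 3                          # assumed sample size and neighbor index
rng = np.random.default_rng(0)

def p_r(r):
    """Eq. (14) for a standard 2-D Gaussian with x0 at the origin."""
    G = 1.0 - np.exp(-r**2 / 2)       # coverage G(r)
    Gp = r * np.exp(-r**2 / 2)        # derivative G'(r)
    return (N - 1) * comb(N - 2, k - 1) * G**(k - 1) * (1 - G)**(N - k - 1) * Gp

theoretical_mean, _ = quad(lambda r: r * p_r(r), 0, np.inf)

# Monte Carlo: distance from x0 = 0 to its k-th nearest neighbor among N-1 draws
samples = rng.standard_normal((20000, N - 1, 2))
dists = np.linalg.norm(samples, axis=2)
kth_nn_dist = np.sort(dists, axis=1)[:, k - 1]

print(theoretical_mean, kth_nn_dist.mean())   # the two values should be close
```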

Substituting Eq. (10) into Eq. (14), the conditional probability density of r given a minority sample \(x_0\), \(p(r \vert x_0)\), is evaluated as:

$$\begin{aligned} p(r \vert x_0)\;=\;& {} (N-1) {{N-2}\atopwithdelims (){k-1}} \Bigg (\int _{B(x_0,r)} p_X(x) \textrm{d}x\Bigg )^{k-1}{\Bigg (1- \Bigg (\int _{B(x_0,r)} p_X(x) \textrm{d}x\Bigg )}\Bigg )^{N-k-1} \nonumber \\{} & {} \times \int _{S(x_0,r)} p_X(x) \textrm{d}x \end{aligned}$$
(15)

where \(G'(r)\) equals the integral over the shell of radius r: \(\int _{S(x_0,r)} p_X(x) \textrm{d}x\) because changing r to \(r+dr\) in Eq. (10) will yield this integral over the shell. Substituting from Eq. (15) into Eq. (8) produces the following:

$$\begin{aligned}p_Z(z\vert x_0,k)=(N-1) {{N-2}\atopwithdelims (){k-1}} \int _{r=\Vert z-x_0\Vert }^{\infty }\! \Bigg (\int _{B(x_0,r)} p_X(x) \textrm{d}x\Bigg )^{k-1} \\ \quad\quad\quad\quad\quad\quad\quad\quad\times {\Bigg (1- \Bigg (\int _{B(x_0,r)} p_X(x) \textrm{d}x\Bigg )}\Bigg )^{N-k-1} \Bigg [\int _{S(x_0,r)} p_X(x) \textrm{d}x\Bigg ]\frac{p_X\Bigg (x_0+\frac{(z-x_0)r}{\Vert z-x_0\Vert }\Bigg )}{{\int _{S(x_0,r)} p_X(x) \textrm{d}x}} \\ \quad\quad\quad\quad\quad\quad\quad\quad \times \Bigg ({\frac{r^{d-2}}{\Vert z-x_0\Vert ^{d-1}}}\Bigg )\textrm{d}r\end{aligned}$$
(16)

Then, Eq. (16) can be simplified into:

$$\begin{aligned}p_Z(z \vert x_0,k)=(N-1) {{N-2}\atopwithdelims (){k-1}} \int _{r=\Vert z-x_0\Vert }^{\infty }\! \Bigg (\int _{B(x_0,r)} p_X(x) \textrm{d}x\Bigg )^{k-1} \\ \quad\quad\quad\quad\quad\quad \times {\Bigg (1- \Bigg (\int _{B(x_0,r)} p_X(x) \textrm{d}x\Bigg )}\Bigg )^{N-k-1} {p_X\Bigg (x_0+\frac{(z-x_0)r}{\Vert z-x_0\Vert }\Bigg )}\Bigg ({\frac{r^{d-2}}{\Vert z-x_0\Vert ^{d-1}}}\Bigg )\textrm{d}r\end{aligned}$$
(17)

So far, we have assumed a fixed neighbor index k. Since we select a neighbor at random among the K neighbors, each with probability \(\frac{1}{K}\), the next step is to take the expectation over this random selection of k. This results in the following:

$$\begin{aligned}p_Z(z\vert x_0)=\sum _{k=1}^{K}\frac{N-1}{K} {{N-2}\atopwithdelims (){k-1}} \int _{r=\Vert z-x_0\Vert }^{\infty }\! \Bigg (\int _{B(x_0,r)} p_X(x) \textrm{d}x\Bigg )^{k-1} \nonumber \\ \quad\quad\quad\quad\quad \times {\Bigg (1- \Bigg (\int _{B(x_0,r)} p_X(x) \textrm{d}x\Bigg )}\Bigg )^{N-k-1} p_X\Bigg (x_0+\frac{(z-x_0)r}{\Vert z-x_0\Vert }\Bigg )\Bigg ({\frac{r^{d-2}}{\Vert z-x_0\Vert ^{d-1}}}\Bigg )\textrm{d}r\end{aligned}$$
(18)

Let \(I_{B(x_0,r)} =\int _{B(x_0,r)} p_X(x)dx\). Then:

$$\begin{aligned}p_Z(z \vert x_0)= \frac{N-1}{K}\int _{r=\Vert z-x_0\Vert }^{\infty }\sum _{k=1}^{K} {{N-2}\atopwithdelims (){k-1}} \! I_{B(x_0,r)}^{k-1}\Bigg (1- I_{B(x_0,r)}\Bigg )^{N-k-1} \\ \quad\quad\quad\quad\quad \times p_X\Bigg (x_0+\frac{(z-x_0)r}{\Vert z-x_0\Vert }\Bigg )\Bigg ({\frac{r^{d-2}}{\Vert z-x_0\Vert ^{d-1}}}\Bigg )\textrm{d}r\end{aligned}$$
(19)

Define \(J(x_0,r)\) as follows:

$$\begin{aligned} J(x_0,r)=\sum _{k=1}^{K} {{N-2}\atopwithdelims (){k-1}} \! I_{B(x_0,r)}^{k-1}\Bigg (1- I_{B(x_0,r)}\Bigg )^{N-k-1} \end{aligned}$$
(20)

Equation (20) can be written as:

$$\begin{aligned} J(x_0,r)=\sum _{m=0}^{K-1} {{N-2}\atopwithdelims (){m}} \! I_{B(x_0,r)}^{m}\Bigg (1- I_{B(x_0,r)}\Bigg )^{N-m-2} \end{aligned}$$
(21)

From Eq. (21), \(J(x_0,r)\) is the cumulative probability function of the Binomial distribution \({\mathcal {B}}(N-2,I_{B(x_0,r)})\) evaluated at \(K-1\), i.e. \(F(I_{B(x_0,r)},N-2,K-1)\).

The cumulative probability function F(y, N, k) (Wadsworth, 1960), i.e. the probability that a Binomial \({\mathcal {B}}(N,y)\) random variable does not exceed k, can be expressed as:

$$\begin{aligned} F(y,N,k)= (N-k){{N}\atopwithdelims (){k}}\int _{t=0}^{1-y} t^{N-k-1}(1-t)^{k}\,dt \end{aligned}$$
(22)

Thus, \(F(I_{B(x_0,r)},N-2,K-1)\) is evaluated as follows:

$$\begin{aligned} F(I_{B(x_0,r)},N-2,K-1)= (N-K-1){{N-2}\atopwithdelims (){K-1}}\int _{t=0}^{1-I_{B(x_0,r)}} t^{N-K-2}(1-t)^{K-1}\,dt \end{aligned}$$
(23)

Thus, Eq. (23) can be expressed in terms of the incomplete beta function \(B(1-I_{B(x_0,r)};N-K-1,K)\) (Dutka, 1981), as defined in Eq. (2). From Eq. (23), \(J(x_0,r)\) can be formulated as:

$$\begin{aligned} J(x_0,r)\;=\;& {} F(I_{B(x_0,r)},N-2,K-1)\nonumber \\=\;& {} (N-K-1){{N-2}\atopwithdelims (){K-1}} B\Bigg (1-I_{B(x_0,r)};N-K-1,K\Bigg ) \end{aligned}$$
(24)
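
The equality between the binomial-sum form of Eq. (21) and the incomplete-beta form of Eq. (24) can also be verified numerically; the short sketch below (our own check, assuming SciPy) evaluates both forms for a few coverage values.

```python
import numpy as np
from scipy.special import betainc, beta, comb

def J_sum(I, N, K):
    """Eq. (21): binomial-sum form of J(x0, r), with I = I_B(x0, r)."""
    m = np.arange(K)
    return np.sum(comb(N - 2, m) * I**m * (1 - I)**(N - m - 2))

def J_beta(I, N, K):
    """Eq. (24): incomplete-beta form of J(x0, r)."""
    inc = betainc(N - K - 1, K, 1 - I) * beta(N - K - 1, K)   # B(1-I; N-K-1, K)
    return (N - K - 1) * comb(N - 2, K - 1) * inc

N, K = 30, 3
for I in (0.05, 0.2, 0.6):
    print(J_sum(I, N, K), J_beta(I, N, K))   # the two columns should agree
```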

Substituting Eqs. (20) and (24) into Eq. (19) yields:

$$\begin{aligned} p_Z(z \vert x_0)\;=\;& {} \frac{N-1}{K}(N-K){{N-2}\atopwithdelims (){K-1}} \int _{r=\Vert z-x_0\Vert }^{\infty }\! p_X\Bigg (x_0+\frac{(z-x_0)r}{\Vert z-x_0\Vert }\Bigg )\Bigg ({\frac{r^{d-2}}{\Vert z-x_0\Vert ^{d-1}}}\Bigg ) \nonumber \\{} & {} \times B\Bigg (1-I_{B(x_0,r)};N-K-1,K\Bigg ) \textrm{d}r \end{aligned}$$
(25)

Simplifying Eq. (25) results in:

$$\begin{aligned} p_Z(z\vert x_0)\;=\;& {} (N-K){{N-1}\atopwithdelims (){K}}\int _{r=\Vert z-x_0\Vert }^{\infty }\! p_X\Bigg (x_0+\frac{(z-x_0)r}{\Vert z-x_0\Vert }\Bigg )\Bigg ({\frac{r^{d-2}}{\Vert z-x_0\Vert ^{d-1}}}\Bigg ) \nonumber \\{} & {} \times B\Bigg (1-I_{B(x_0,r)};N-K-1,K\Bigg ) \textrm{d}r \end{aligned}$$
(26)

Accordingly, Eq. (26) proves Lemma 1.

Finally, taking the expectation over \(x_0\) yields the density of the SMOTE generated patterns \(p_Z(z)\):

$$\begin{aligned} p_Z(z)\;=\;& {} (N-K){{N-1}\atopwithdelims (){K}}\int _{x} p_X(x)\int _{r=\Vert z-x\Vert }^{\infty }\! p_X\Bigg (x+\frac{(z-x)r}{\Vert z-x\Vert }\Bigg ) \Bigg ({\frac{r^{d-2}}{\Vert z-x\Vert ^{d-1}}}\Bigg ) \nonumber \\{} & {} \times B\Bigg (1-I_{B(x,r)};N-K-1,K\Bigg ) \textrm{d}r \, \textrm{d}x \end{aligned}$$
(27)

\(\square \)

Consequently, Eq. (27) proves Theorem 1.

4 Experiments and results

In this section, we present the results of the numerical experiments conducted in support of our derived theoretical analysis presented in Sect. 3. To this end, we estimate the SMOTE patterns density p(Z) using the developed theoretical analysis, and also evaluate the SMOTE density p(Z) empirically. Then, we compare the two density estimates for verification.

In our experiments, we adopt two different distributions: a multivariate uniform distribution over a disk, and a multivariate Gaussian distribution. The uniform distribution is taken over a two-dimensional disk centered at the origin. The disk radius is set to 3, so the density equals \(\frac{1}{9 \pi }\) for \(x_1^2 +x_2^2\le 9\) and zero otherwise.

We use a two-dimensional zero mean multivariate Gaussian distribution for the minority class samples, \(\mu =[0,0]\), and we have examined different covariance matrices: the identity matrix \(\Sigma =\mathbb {I}_2\), \(\Sigma = \begin{bmatrix} 1 &{} 0.8 \\ 0.8 &{} 4 \\ \end{bmatrix}\), and \(\Sigma = \begin{bmatrix} 2 &{} 0 \\ 0 &{} 1.5 \\ \end{bmatrix}\).

We perform the empirical density estimation of SMOTE patterns p(Z) according to Algorithm 1 above. We have examined two different values for the number of original patterns used by SMOTE in a single run, specifically \(N=30\) and \(N=50\), to mimic the scarcity of minority class patterns when applying SMOTE. For the K parameter of the K nearest neighbor step applied in SMOTE, we have adopted two values: \(K= 3\) and \(K=5\). The number of generated SMOTE patterns for the empirical density estimation is set to \(M=5 \times 10^7\) in order to obtain an accurate density estimate.
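
As an illustration of this empirical procedure, the sketch below is our own simplified setup (reusing the smote_sketch helper from Sect. 3.1 and far fewer samples than the \(M=5 \times 10^7\) reported above): it accumulates SMOTE samples over many independent runs and bins them into a normalized two-dimensional histogram.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 3
n_runs, per_run = 2000, 100                  # much smaller than the paper's M
cov = np.array([[1.0, 0.8], [0.8, 4.0]])     # one of the covariance matrices above

samples = []
for _ in range(n_runs):
    # draw a fresh set of N minority samples, then over-sample them
    X_min = rng.multivariate_normal([0.0, 0.0], cov, size=N)
    samples.append(smote_sketch(X_min, per_run, K=K, rng=rng))
Z = np.vstack(samples)

# empirical two-dimensional density on a grid (normalized histogram)
hist, xedges, yedges = np.histogram2d(Z[:, 0], Z[:, 1], bins=60, density=True)
```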

To implement the theoretically derived formula, we use a two-dimensional grid for integration. For display, we often hold one feature (dimension) constant and evaluate the density along the other feature. This facilitates the presentation of the theoretical versus empirical densities on the same plots, as it may be hard to visualize two two-dimensional densities in a single plot.

We tested the results of both the Theorem and the Lemma in the experiments. The first experiments, described in this paragraph, verify the accuracy of \(p_Z(z)\), while the next experiments verify the conditional distribution \(p_Z(z \vert x_0)\). Figures 3 and 4 present the marginal density of SMOTE generated patterns \(p_Z(z)\) as given in Theorem 1 for the multivariate uniform distribution over a disk. The figures show the theoretical density estimated using the presented analysis, the empirical density, and the true original minority class density. As mentioned, these are one-dimensional sections of the 2-D densities for visualization purposes. The figures demonstrate the closeness between the theoretical and empirical density estimates, which verifies the proposed analysis. Moreover, it can be noted from Figs. 3 and 4 that both the theoretical and empirical SMOTE densities are close to the original minority class density in the case of the multivariate uniform distribution over a disk. In other words, these results imply that the SMOTE patterns are adequate representatives of the original minority class when the original density is uniform.

Fig. 3: Empirical versus theoretical densities of SMOTE patterns \(p_Z(z)\) for the 2-dimensional disk, for \(z_1=0.2\), using \(N=30\) and \(K=3\)

Fig. 4: Empirical versus theoretical densities of SMOTE patterns \(p_Z(z)\) for the 2-dimensional disk, for \(z_1=0\), using \(N=50\) and \(K=5\)

Figures 5 and 6 depict the SMOTE patterns density given a certain original data point \(x_0\) (i.e. \(p_Z(z \vert x_0)\)), estimated empirically and theoretically, for the multivariate Gaussian distribution (as given in Lemma 1). The figures show the conformity between the theoretical and empirical density estimates, which confirms our introduced theoretical analysis. In these figures, the true density cannot be plotted, as we evaluate the conditional density of SMOTE samples given a particular original minority sample \(x_0\).

Fig. 5: Conditional empirical versus theoretical densities of SMOTE patterns \(p_Z(z \vert x_0)\) for a 2-dimensional Gaussian original distribution, \(x_0=[0.5,1]\) and \(z_2=1.3\), for \(p_X(x) \sim {\mathcal {N}}([0,0],[1\,\,0.8;0.8\,\,4])\) using \(N=50\) and \(K=3\)

Fig. 6: Conditional empirical versus theoretical densities of SMOTE patterns \(p_Z(z \vert x_0)\) for a 2-dimensional Gaussian original distribution, \(x_0=[-0.2,0.2]\) and \(z_1=-0.3\), for \(p_X(x) \sim {\mathcal {N}}([0,0],[2\,\,0;0\,\,1.5])\) using \(N=50\) and \(K=5\)

Figure 7 demonstrates the two-dimensional density for SMOTE patterns estimated using the presented analysis. Similarly, Fig. 8 shows the SMOTE density estimated empirically. In order to obtain smooth results for the empirical estimation of the 2-dimensional density, we adopt the Parzen window density estimation (Parzen, 1962; Rosenblatt, 1956). We use a Gaussian kernel, and the kernel width h is set to 0.05.
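
A minimal sketch of such a Parzen window estimate with an isotropic Gaussian kernel is given below (our own helper, assuming NumPy; the stand-in array Z plays the role of the SMOTE generated samples produced in the earlier sketch).

```python
import numpy as np

def parzen_density(points, Z, h=0.05):
    """Parzen window estimate with an isotropic Gaussian kernel of width h.

    points : (m, d) locations at which to evaluate the density.
    Z      : (M, d) SMOTE generated samples.
    """
    d = Z.shape[1]
    norm = (2 * np.pi * h**2) ** (d / 2) * len(Z)
    out = np.empty(len(points))
    for i, p in enumerate(points):                   # loop keeps memory usage modest
        sq = np.sum((Z - p)**2, axis=1)
        out[i] = np.exp(-sq / (2 * h**2)).sum() / norm
    return out

# stand-in for the SMOTE generated samples Z from the previous sketch
Z = np.random.default_rng(0).standard_normal((2000, 2))

# example: evaluate the estimate on a small grid around the origin
g = np.linspace(-2.0, 2.0, 41)
grid = np.array([[a, b] for a in g for b in g])
density = parzen_density(grid, Z, h=0.05)
```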

It can be observed from Figs. 7 and 8 that the SMOTE density is concentrated around \(x_0\). This is reasonable, as the SMOTE method places the synthetic patterns inwards around the minority class samples, as presented in Fig. 1. These results are consistent with the contraction behavior of the SMOTE oversampling method noted by Elreedy and Atiya (2019). Specifically, the figures show that both the theoretical and empirical SMOTE density estimates are centered around the original minority sample \(x_0\) on which SMOTE is applied.

Fig. 7: Theoretical SMOTE density \(p_Z(z_1,z_2)\) for a 2-dimensional Gaussian original distribution, \(x_0=[0,0]\), for \(p_X(x) \sim {\mathcal {N}}([0,0],\mathbb {I})\) using \(N=50\) and \(K=3\)

Fig. 8: Empirical SMOTE density \(p(Z_1,Z_2)\) for a 2-dimensional Gaussian original distribution using Parzen window density estimation, \(x_0=[0,0]\), for \(p_X(x) \sim {\mathcal {N}}([0,0],\mathbb {I})\) using \(N=50\) and \(K=3\)

In the next experiment we test whether the proposed formulas provide accurate estimates for the case of classification. Of course, obtaining the correct density is the building block of any further classification method, and therefore should guarantee accurate computation of any classification-based outcome such as classification accuracy. We considered a simple two-class classification problem, where the minority class is a uniform distribution over a two-dimensional disk centered at the origin, with the disk radius set to 3, so the density equals \(\frac{1}{9 \pi }\) for \(x_1^2 +x_2^2\le 9\) and zero otherwise. The majority class also has a uniform density over a disk of radius 3, but centered around a mean vector \(\mu =(a, \ a)^T\), where a is a value that we vary. We considered a linear discriminant function classifier and applied our theoretical formulas versus the empirical way of generating SMOTE samples. We computed the geometric mean (G-mean) (Barandela et al., 2003), defined in Eq. (28), for both methods for several values of a. Table 1 shows the G-mean results. One can observe that both theoretical and empirical approaches produce very close numbers, indicating the accuracy of the developed formula.

$$\begin{aligned} Gmean=\sqrt{Sensitivity \times Specificity} \end{aligned}$$
(28)

The sensitivity and specificity metrics are defined in Eqs. (29) and (30), respectively:

$$\begin{aligned} Sensitivity= & {} \frac{True\,\,Positive}{True\,Positive+ False\,Negative} \end{aligned}$$
(29)
$$\begin{aligned} Specificity= & {} \frac{True\,\,Negative}{True\,Negative+ False\,Positive} \end{aligned}$$
(30)
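
These metrics can be computed directly from a confusion matrix; the sketch below is our own helper (assuming scikit-learn, with the minority class encoded as label 1) implementing Eqs. (28)-(30).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity, Eqs. (28)-(30)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)     # true positive rate on the minority class
    specificity = tn / (tn + fp)     # true negative rate on the majority class
    return np.sqrt(sensitivity * specificity)

# toy usage with the minority class labeled 1
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0])
print(g_mean(y_true, y_pred))
```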
Table 1 Theoretical and empirical Gmean values for different values of parameter a determining the mean of the majority class where \(\mu =(a, \ a)^T\)

5 Conclusions and future work

In this paper, we develop a theoretical analysis of one of the most dominant over-sampling methods: the Synthetic Minority Over-sampling TEchnique (SMOTE). The SMOTE algorithm is a very powerful over-sampling method for generating artificial minority class samples in order to balance the class distribution. However, the synthetic data generated by SMOTE may not exactly follow the original minority class distribution, which could impact the classification performance. Thus, this work theoretically analyzes the distribution of the synthetically generated patterns. Specifically, we introduce a full derivation of the probability density function of the SMOTE generated patterns. We applied the developed analysis to some distributions and verified the correctness of the presented theoretical analysis by comparing with empirical density estimates. The goal here has been to focus on deriving a complete and exact formula. Providing a theoretical formula lays the groundwork for further analysis and can guide future modifications of SMOTE in directions that improve its functionality. In this way, it could potentially lead to improving the efficiency of classifiers and to a better understanding of SMOTE. For example, it could lead to optimal formulas that assign different weights to the original samples versus the generated samples, in any classification scheme. Analyzing the behavior of SMOTE, or quantifying the deviation of SMOTE from the true density, could be performed in future work. Another potential future research direction is investigating more complex minority class distributions, such as multi-modal distributions, which present a significant challenge for SMOTE in particular.