Testing for equality of distributions using the concept of (niche) overlap

In this paper, we propose a new non-parametric test for equality of distributions. The test is based on the recently introduced measure of (niche) overlap and its rank-based estimator. As the estimator makes only one basic assumption on the underlying distributions, namely continuity, the test is universally applicable, in contrast to many tests that are restricted to specific scenarios. By construction, the new test is capable of detecting differences in location and scale. It thus complements the large class of rank-based tests that are constructed from the non-parametric relative effect. In simulations, the new test procedure obtained higher power and a lower type I error than two common tests in several settings. Overall, the new procedure shows good performance; together with its simplicity, it can be applied broadly. Supplementary Information The online version contains supplementary material available at 10.1007/s00362-021-01239-y.


Introduction
Analyzing data sets appropriately is of immense importance in any discipline. Not only are the numerical characteristics of the individual data sets of interest, but often also their distributions in comparison to one another. In production, one would like to know which process is more efficient or which product has higher quality; in ecology, one is interested in the overlap of the survival space of two species, to name just a few examples.
For univariate data many tests have been proposed for the comparison of distributions, prominently including those by Kolmogorov (1933) and Smirnov (1939), and the Cramér-von Mises test. The two-sample rank sum test due to Wilcoxon (1945) and Mann and Whitney (1947) assesses whether observations from one distribution tend to be larger than those from the other. The elementary concepts of these tests are likely key to their success, as even without a strong mathematical background it is possible to grasp the main idea behind the statistics. Of course, these classical tests also have their shortcomings, which have sparked plenty of adaptations intended to enhance their performance. However, these adaptations often complicate the tests too much to become well established themselves, or they are only useful in rather special situations. Consider, for example, Khamis (1990), Drezner et al. (2010), or Baringhaus and Kolbe (2015). Other tests only focus on location or scale differences, see for example Marozzi (2012). More recent tests for equality of distributions, for example Ping (2000), Bera et al. (2013), and Wan et al. (2018), have been proposed but have not been able to establish themselves.
In clinical studies the number of participants or patients is often limited, for ethical, budgetary, or other reasons, leading to small sample sizes. With only a small sample it is regularly difficult to assess whether all model assumptions of a test are met and the results are reliable. It is thus of interest to have tests available with as few requirements as possible and yet good performance, i.e. reliable results. Likewise, a substantial body of literature shows that non-parametric or quantile-based methods are generally appealing for new test methods, compare for example Al-Mutairi (2017), Hassler (2018), Soni et al. (2019), Zamanzade (2019), or Jokiel-Rokita and Topolnicki (2019).
In this paper, we propose a new and easily motivated non-parametric test with competitive performance and straightforward interpretation. The test is built on the non-parametric relative effect, a quantity that has received renewed attention lately due to its favorable properties, see Brunner et al. (2018) or Dastbaravarde and Zamanzade (2017). Using the approach of Parkinson et al. (2018), we propose a fully non-parametric testing method that can easily be performed using ranks. The original intent of the cited paper was to measure the overlap of two distributions representing ecological niches, where the overlap can be considered a measure of similarity between two species. As those results are not only applicable to the quantification of niches but to arbitrary data sets, this measure of overlap is an adequate basis for a test statistic for testing whether two data sets are drawn from the same distribution or not. Instead of considering only a certain location parameter, such as the mean or median, the niche overlap measure takes the full distributions into consideration. This results in a test for equality of distributions which is based on the full set of quantiles.
In Sect. 2 we present the necessary results of Parkinson et al. (2018) and describe the test procedure. Intensive simulations on the performance of the new testing method, also in comparison to other tests, are presented in Sect. 3 followed by a short conclusion and discussion in Sect. 4.

Mathematical background and theoretical results
In the following we propose a test for equality of distributions based on the niche overlap value as defined and estimated by Parkinson et al. (2018). The notation in this paper is identical to theirs. We will first state the relevant results of that paper before introducing the new testing method.
Consider two groups of observations X_1, . . . , X_n and Y_1, . . . , Y_m. We will assume that X_1, . . . , X_n are independent, identically distributed samples drawn from a continuous distribution function F, while Y_1, . . . , Y_m are independent, identically distributed samples drawn from a continuous distribution function G. The empirical estimators of F and G are denoted by F̂_n and Ĝ_m, respectively.
In order to quantify how each of the two distributions is "contained" within the other, the statistical functional I_2 = ∫ (1 − |2F(x) − 1|) dG(x) was considered, as well as I_1, with the roles of F and G switched.
To explain how I_1 and I_2 quantify the inclusion of F in G and vice versa, consider four random variables X^(1), X^(2), Y^(1), and Y^(2) which can be constructed from X and Y. Denote the distribution of F below its median by F_1 and above its median by F_2, such that

F_1(x) = min(2F(x), 1),   F_2(x) = max(2F(x) − 1, 0).   (1)

Then one can construct the random variables X^(1) ∼ F_1 and X^(2) ∼ F_2 by conditioning on X ≤ F^{−1}(0.5) and X ≥ F^{−1}(0.5), respectively, and similarly for Y^(1) and Y^(2). Now I_2 = P(X^(1) ≤ Y ≤ X^(2)) and I_1 = P(Y^(1) ≤ X ≤ Y^(2)) hold; for a proof we refer to Parkinson et al. (2018).
The following important properties hold for I_1 and I_2. As I_1 and I_2 can be interpreted as probabilities, it holds that I_1 + I_2 ∈ [0, 1] for F and G absolutely continuous. If G is continuous and F = G, then I_1 = I_2 = 1/2. Additionally, I_1 and I_2 are invariant under strictly monotone, continuous transformations, where the same transformation φ is applied to both F and G. For further properties and the proofs of the stated properties, please consult Lemma 2.2 of Parkinson et al. (2018).
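For intuition, the property I_1 = I_2 = 1/2 under F = G can be checked numerically. The sketch below integrates the functional I_2 = ∫ (1 − |2F(x) − 1|) dG(x), a form consistent with the probability representation I_2 = P(X^(1) ≤ Y ≤ X^(2)); the paper's own computations use R, so this Python translation is purely illustrative.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def i2_numeric(F, g, lo=-10.0, hi=10.0, steps=20000):
    """Trapezoidal approximation of I_2 = ∫ (1 - |2 F(x) - 1|) g(x) dx,
    where g is the density of G.  The integrand is an assumed form
    consistent with I_2 = P(X^(1) <= Y <= X^(2))."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0  # trapezoid endpoint weights
        total += w * (1.0 - abs(2.0 * F(x) - 1.0)) * g(x)
    return total * h
```

For F = G = N(0, 1) the integral evaluates to 1/2, as the property above requires; shifting G away from F shrinks the overlap value.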
In order to construct an estimator for I_2, all observations of both groups are ranked. Without loss of generality we will assume that X_1 < X_2 < · · · < X_n (to simplify notation). The X-observations below their median are X_1, . . . , X_K, where K is the largest integer with K ≤ (n + 1)/2. Their ranks within both groups combined will be denoted by R_1^{X<}, . . . , R_K^{X<}, the remaining ranks by R_{K+1}^{X>}, . . . , R_n^{X>}. In case of ties we will use midranks.

Lemma 2 (Lemma 2.11, Parkinson et al. 2018) For R_·^{X>} and R_·^{X<} defined as above, a consistent estimator for I_2 is given by

Î_2 = (2/(mn)) ( Σ_{i=K+1}^{n} R_i^{X>} − Σ_{i=1}^{K} R_i^{X<} ) + c,

with c = −n/m for n even and c ≈ −n/m for n odd, and n and m large. A similar consistent estimator for I_1 can be provided.
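The rank-sum estimator of Lemma 2 can be cross-checked against a direct plug-in version that counts, for each Y_j, how many lower-half and how many upper-half X-observations lie below it. The following Python sketch implements this counting form of the empirical I_2 = P(X^(1) ≤ Y ≤ X^(2)); it is an illustrative re-derivation, not the paper's exact rank-sum formula, and may differ from it by a finite-sample constant.

```python
def i2_hat(x, y):
    """Plug-in sketch of the overlap estimator I_2-hat: for each Y_j, count
    lower-half X-observations below Y_j minus upper-half X-observations
    below Y_j, then normalize by 2/(m*n)."""
    x = sorted(x)
    n, m = len(x), len(y)
    k = (n + 1) // 2                  # largest K with K <= (n + 1)/2
    lower, upper = x[:k], x[k:]       # X-observations below / above the median
    total = 0
    for yj in y:
        total += sum(xi <= yj for xi in lower) - sum(xi <= yj for xi in upper)
    return 2.0 * total / (m * n)
```

For two perfectly interleaved samples the estimator returns 1/2, matching I_1 = I_2 = 1/2 under F = G, and it returns 0 when the Y-sample lies entirely outside the range of X.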

Theorem 1 Let F and G be continuous distribution functions with F^{−1}(1/2) = G^{−1}(1/2). Then the estimators of I_1 and I_2 as given in Lemma 2 are biased. More precisely, E[Î_1 + Î_2] < 1.

Proof To show that the estimators are biased, we show that the expectation of the sum of the two estimators lies systematically below one, which implies that at least one of the two estimators must be biased. Since I_1 and I_2 enter symmetrically, the bias arises from the combination of the two estimators. Without loss of generality, assume X_1 < X_2 < · · · < X_n and Y_1 < Y_2 < · · · < Y_m, with K and L denoting the indices of the respective medians of the X and Y samples. As the underlying distributions are continuous, ties between observations occur with probability zero and need not be considered.
We consider the expression of the estimators through indicator functions. Rearranging this expression yields two lines, which we now treat separately, ignoring the factor 2/(mn) for the time being.

In the second line the indicators are complementary, as −1{Y_j < X_i} = 1{X_i ≤ Y_j} − 1 in the absence of ties; the second line therefore reduces to the constant expression (2). For the first line, we immediately consider its expectation; rearranged, it yields (3). As we want to show that the expectation of the sum of the estimators remains systematically below 1, it suffices to show that even the maximal value of (3) keeps the sum below 1. An indicator function can only take the values 0 and 1, so its expectation lies in the interval [0, 1]. Hence (3) is maximized if the first expectation equals 1 and the second equals 0, which gives (4). Combining (2) and (4) yields an upper bound for E[Î_1 + Î_2]. To simplify this bound further, assume that m and n are even, so that m = 2L and n = 2K; the bound is then strictly smaller than 1. Similarly, it can be shown that E[Î_1 + Î_2] is strictly smaller than 1 if n, m, or both are odd. Thus we conclude that Î_1 + Î_2 is a biased estimator for I_1 + I_2 whenever the medians of F and G coincide.

Remark 1 The bias of Î_1 + Î_2 depends on the two probabilities P(X_K < Y_{L+1}) and P(Y_L < X_{K+1}): the larger these two are, the smaller the bias will be.

Lemma 3 Under certain criteria that are specified in detail in Parkinson et al. (2018), (n + m)^{1/2}(Î_2 − I_2) converges to a normal distribution with expectation zero.
Based on the above lemmas, I_2 can be consistently estimated, and asymptotic inference for I_2 is possible based on the rank statistics, provided the variance of (n + m)^{1/2}(Î_2 − I_2) can also be consistently estimated. For a special case, we can provide a consistent rank-based variance estimator. In the following, assume G^{−1}(0.5) = F^{−1}(0.5) holds, that is, the distributions F and G are assumed to have the same median. In this case, I_2 can be expressed through four random variables that can be constructed from X and Y. This result, along with an approach introduced by Konietschke (2009), provides the following way to construct a variance estimator.
Analogously to the X-sample earlier, without loss of generality assume that Y_1 < Y_2 < · · · < Y_m and divide them into two groups at the index L, with L the largest integer for which L ≤ (m + 1)/2. Additionally, without loss of generality, split the sample of X at the index K such that X_1, . . . , X_K are below and X_{K+1}, . . . , X_n are above the median. We then compare the samples of X and Y which are below their respective medians with each other, and analogously those that are above their respective medians, denoting the ranks within each of the two combined groups accordingly. Finally, the observations are additionally ranked only within each of these four groups, and these within-group ranks enter the variance estimator below.

Lemma 4 (Lemma 2.20, Parkinson et al. 2018) Let F and G be continuous distribution functions and assume G^{−1}(1/2) = F^{−1}(1/2) holds true. Further, define s²_{X_1} and s²_{X_2} from the within-group ranks of the lower and upper X-half, and s²_{Y_1} and s²_{Y_2} analogously. A consistent variance estimator s²_2 for Î_2 is then obtained by combining these four terms. Here, consistency is to be understood as Var(Î_2)/s²_2 → 1 in probability.
The same variance estimator is also used for Î_1.
Theorem 2 Let I_1 and I_2 be defined as before, and let Î_1 and Î_2 denote the rank estimators as introduced above. Further, let s²_2 be the variance estimator as given in Lemma 4. Consider the null hypothesis H_0: F ≡ G versus H_1: F ≠ G and the test statistics NO_i = (n + m)^{1/2}(Î_i − 1/2)/s_2, i = 1, 2. Under the null hypothesis, each NO_i is asymptotically distributed according to a standard normal distribution.
Remark 2 Instead of the standard normal distribution, one may also use a t-distribution with n + m − 1 degrees of freedom as an approximation. Simulations showed that, especially for small sample sizes, the t-distribution provided a better approximation to the sampling distribution of NO_i.
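Taken together, Theorem 2 and the normal reference distribution suggest the following decision sketch. The studentized form NO_i = (n + m)^{1/2}(Î_i − 1/2)/s_2 is an assumption consistent with the statements above (under H_0, I_1 = I_2 = 1/2); the variance estimate s_2 is taken as a given input, computed per Lemma 4. A Python sketch (Remark 2's t-reference would replace the normal CDF for small samples):

```python
import math

def no_statistic(i_hat, s2, n, m):
    """Studentized overlap statistic NO_i; the form
    sqrt(n + m) * (I_hat - 1/2) / s2 is assumed from Theorem 2."""
    return math.sqrt(n + m) * (i_hat - 0.5) / s2

def two_sided_p(z):
    """Two-sided p-value against the standard normal reference
    (a t(n + m - 1) reference is preferable for small samples)."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi)
```

For example, an estimate Î_i exactly equal to 1/2 yields NO_i = 0 and a p-value of 1, i.e. no evidence against equality of distributions.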
Using the results of Theorem 2 we can design a test for equality of distributions.

Remark 3 Due to the way the variance estimator is constructed, it can be zero. This can occur if the assumption of equal medians is so strongly violated in the data set that the proportions of Y^(1)-observations below the median of X and of X^(2)-observations above the median of Y, or vice versa, are (close to) zero. In these situations it intuitively appears justified to reject the null hypothesis.
In such cases only an upper bound for the p-value can be provided. Following the idea given in Chapter 3.5.3 of Brunner et al. (2018), the data sets are shifted until the variance estimator is strictly greater than zero. The resulting p-value of the test statistic based on the shifted data is an upper limit of the true p-value.
Remark 4 As shown earlier, Î_1 and Î_2 are biased under the null hypothesis and under all alternatives where the underlying distribution functions have equal medians. To reduce this bias we propose a modification of the procedure: when calculating Î_2, we add an extra observation to the data set Y, namely the median of the X-observations, and vice versa for Î_1. With this modification it is possible to obtain Î_1 + Î_2 > 1, which reduces the bias of the estimators.
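The modification of Remark 4 amounts to a one-line augmentation of the samples before estimation; a minimal sketch (the helper name is ours, not the paper's):

```python
import statistics

def augment_samples_for_i2(x, y):
    """Remark 4 modification (sketch): before computing I_2-hat, append the
    median of the X-sample as an extra observation to the Y-sample.
    For I_1-hat the roles of the two samples are reversed."""
    return list(x), list(y) + [statistics.median(x)]
```

The augmented Y-sample is then passed to the I_2 estimator in place of the original one.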

General settings
In order to investigate the small sample properties of the test statistics concerning Type I and Type II error, we performed simulation studies in R (R version 3.2.3, R Core Team 2017). Additionally, simulations to check the correctness and robustness of the new test procedure were run. To put the error rates into context, the Kolmogorov-Smirnov test and the Wilcoxon-rank-sum test were also applied to the same data. All simulations for our testing method used the adaptation proposed in Remark 4; further simulations without the adaptation, which are omitted here, showed an inflated Type I error for small sample sizes.

Empirical confirmation of Theorem 2
In this part we confirm the limiting distribution given in Theorem 2. Consider data distributed with F = G = Exp(0.5). The sample sizes in all four settings are equal and given by 20, 35, 50, and 100. For each setting we ran 1000 simulations.
The Q-Q plots of the empirical quantiles of the test statistic NO_1 versus the t-distribution percentiles can be seen in Fig. 1. For all sample sizes the middle quantiles agree well, but bigger differences can be noticed for the outer quantiles. For larger sample sizes the empirical distribution agrees very well with the theoretically justified t-distribution. The corresponding Q-Q plots for NO_2 can be found in the supplementary material and show similar results.

Visual power analysis
For the second set of simulations we analyzed the power of the three tests, keeping one distribution fixed and varying the parameters of the second distribution. The sample sizes were n = m = 50 in all settings. The results were then visualized for a first graphical comparison.

Setting 1
In the first setting the X-observations were drawn from a normal distribution with mean 0 and variance 1. The Y-observations were drawn from normal distributions with parameters ranging from −1 to 1 for the mean and from 0.25 to 2.5 for the variance. The power for each combination is visualized in Fig. 2 for the new test, in Fig. 3 for the Kolmogorov-Smirnov test, and in Fig. 4 for the Wilcoxon-rank-sum test.
Comparing the three plots, one can quite easily see some key differences between the tests. While the newly proposed test had lower power for detecting small differences in the means of the two distributions, it had the highest power of the three for detecting small differences in the variances. The Kolmogorov-Smirnov test showed almost the opposite behavior, uncovering small differences in the mean but not in the variance. Due to its reliance on the non-parametric relative effect, the Wilcoxon-rank-sum test is only capable of detecting location differences in a location-scale model.

Setting 2
In the second setting the X-observations were drawn from a beta distribution with parameters a = 2 and b = 3. The Y-observations were drawn from beta distributions with parameters ranging from 0.5 to 4 for a and from 0.5 to 5.5 for b. The power for each combination is visualized in Fig. 5 for our test, in Fig. 6 for the Kolmogorov-Smirnov test, and in Fig. 7 for the Wilcoxon-rank-sum test. In this setting both parameters influence the expectation and the variance. Along the red line all three tests obtained only low power. The Kolmogorov-Smirnov test quickly gained power when moving away from the scenario of equal expectations, see Fig. 6. Similar behavior could be observed in the power plot of the Wilcoxon-rank-sum test, Fig. 7. Fig. 5 shows bigger differences: moving along the red line away from the true parameters, i.e. through scenarios of equal expectations but differing variances, the newly introduced test increased its power faster than the other two tests; moving away from the red line, however, its power increased more slowly.

Error analysis based on sample size
In this part we calculated the Type I and Type II error for several different combinations of F and G and for different sample sizes. For all settings we ran 1000 simulations.

Type I error
In the following, the observations were drawn from the same distribution F = G. We considered the distributions F_1 = N(0, 1), F_4 = U(0, 1), F_5 = χ²_1, and F_6 = t_{10} for sample sizes ranging from 10 to 100 in steps of 10. The sample sizes were equal in all settings.
As one can see in Table 1, the new test attained the nominal level of α = 0.05 in all settings for very small sample sizes (n = m = 10, n = m = 20), whereas for larger sample sizes it was slightly conservative. The Kolmogorov-Smirnov test was too conservative for small sample sizes; for moderate to larger sample sizes it was closer to the nominal level but remained conservative. In the settings considered, the Wilcoxon-rank-sum test maintained the nominal level well for all sample sizes.

Type II error
The combination of distributions we analyzed can be found in Table 2. Again we considered sample sizes ranging from 10 to 100 in steps of 10. The sample sizes were equal in all settings.
The simulation settings were chosen in a manner that the differences were hard to detect, using combinations of distributions with equal expectation, variance or both. This led to high Type II errors, especially for small sample sizes, but showed the differences between the tests more accurately. For bigger differences between the two distributions, all tests obtained low Type II errors and thus those simulations are omitted in this paper. The results of the simulations can be found in Table 3.
Looking first at settings A, B, and C, where both groups had the same expectation, we already notice great differences between the performances. In all three scenarios the Wilcoxon-rank-sum test, as expected, failed completely, even for sample sizes of n = m = 100. The other two methods both struggled with setting A, where the distributions came from the same family, but improved with increasing sample sizes. In settings B and C both picked up the differences quite well, with low Type II error rates.

In settings D and E the expectation differed but the variance was the same. Here the Wilcoxon-rank-sum test and the Kolmogorov-Smirnov test outperformed the new test. Even though the (niche) overlap test was able to pick up the differences, the required sample sizes, especially in setting D, were higher. In general, scenario D was difficult for all the tests; however, the Wilcoxon-rank-sum test kept the Type II error at a reasonable level for sample sizes of n = m = 50 and higher. In setting E the Wilcoxon-rank-sum test performed slightly better than the Kolmogorov-Smirnov test.
Settings F and G correspond to scenarios where both expectation and variance were equal, which makes the detection of differences between the groups rather hard. Again the Wilcoxon-rank-sum test failed to detect the differences, even for sample sizes of n = m = 100. In both settings the small-sample performance of the NO-Test was better than that of the Kolmogorov-Smirnov test. On the other hand, especially in setting F, the Kolmogorov-Smirnov test obtained lower Type II error rates for large sample sizes.

Robustness
Finally, we investigated how the tests deal with outliers. In this scenario we had sample sizes of n = m = 50 and successively added up to 15 outliers. Both original data sets were drawn from the same distribution, namely F = G = N(0, 1); the outliers came from a N(0, 10) distribution and were added to the second data set. For all settings 1000 repetitions were performed.
It is desirable that tests are robust, so that single outliers do not affect the test results. However, if several data points stray from the data set, they may no longer be outliers, and a robust test should still pick up on this.
With only a few outliers all three tests stayed at the nominal level, as desired. As the number of outliers increased, the new test method was the first to start rejecting the null hypothesis, once roughly 10% of the sample size had been added as outliers. The Wilcoxon-rank-sum test appeared rather insensitive to the outliers: even when more than 20% of the sample consisted of outliers, it stuck to the nominal level. For the Kolmogorov-Smirnov test a slight inflation of the rejection rate could be observed, but it was hardly noticeable.

Additional simulations
Additional supporting information may be found online in the Supporting Information section at the end of the article. This includes Web Figure 1, referenced in Sect. 3, and Web Appendix A, which examines the behavior for very small and unequal sample sizes.

Data example
To apply the new test procedure to a real-life data set, we chose an epilepsy treatment data set. The data considered here come from clinical records of people prescribed perampanel in routine practice. While Rohracher et al. (2018) pooled observational data across Europe, we only analyze the subset of data collected at the Department of Neurology, Christian Doppler Medical Centre, Paracelsus Medical University, Salzburg, Austria. Most of these data were already used in Rohracher et al. (2016), where the study design and data sources are described.
The data set was split into two groups: one containing the people who were still on perampanel at the 12-month follow-up, the other containing those who had discontinued for various reasons (e.g., adverse events). The considered variable was the duration of epilepsy, in years, before perampanel initiation. Of the 98 patients in the perampanel group, 8 observations were missing, leaving 90 observations for the analysis. In the second group only one of 65 observations was missing.
The null hypothesis was that there is no difference in the duration of epilepsy between the two groups. None of the three tests rejects the null hypothesis. The p-value of the new test procedure was 0.908 after application of the Bonferroni-Holm procedure; the two individual p-values were 0.454 and 0.645. The Kolmogorov-Smirnov test had a p-value of 0.211 and the Wilcoxon-rank-sum test one of 0.350. Hence no significant difference between the two groups could be detected.
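The reported overall p-value of 0.908 is consistent with Holm's step-down procedure applied to the two individual p-values of the statistics NO_1 and NO_2, where the smaller p-value is doubled and capped at 1. A minimal sketch, assuming this is the combination rule used:

```python
def holm_overall_p(p1, p2):
    """Overall p-value for two individual tests under Holm's procedure
    (sketch): with two hypotheses, the smaller p-value is multiplied by 2
    and capped at 1; H0 is rejected overall iff this value is below alpha."""
    return min(1.0, 2.0 * min(p1, p2))
```

With the p-values from the data example, holm_overall_p(0.454, 0.645) reproduces the reported 0.908.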
These findings agree with the original findings of Rohracher et al. (2018). The higher p-value of the new test procedure compared to the other p-values is in line with the simulation results.

Discussion
In this paper, we have introduced a new test statistic for testing equality of distributions based on the concept of overlap. The newly introduced testing method showed overall good behavior in the simulations and, compared with standard methods, some advantages.
The presented test procedure is easy to understand and interpret, and fast to calculate. Its wide application area together with its straightforward interpretation could make it a useful alternative to existing tests in medicine and several other disciplines. Even though its performance could potentially be improved via a continuity correction, this would come at the cost of its comprehensibility. Additional simulations would be necessary to determine whether a continuity correction would bring a significant improvement of the test procedure.
The possibilities opened up by the idea of the newly introduced test procedure are plentiful. One option would be to derive a goodness-of-fit test for the one-sample problem. As the ranks cannot be calculated there, due to the lack of a second sample, another way would need to be found to estimate I_1 and I_2 as well as the variance. One possibility would be to draw several samples from the fixed distribution and use those for the estimation of the ranks, akin to a bootstrap procedure. However, this option as well as other approaches should be considered and compared in simulations to find a well-suited procedure for the one-sample case.