A Two-sample Nonparametric Test for Circular Data– its Exact Distribution and Performance

A nonparametric test labelled ‘Rao Spacing-frequencies test’ is explored and developed for testing whether two circular samples come from the same population. Its exact distribution and performance relative to comparable tests such as the Wheeler-Watson test and the Dixon test in small samples, are discussed. Although this test statistic is shown to be asymptotically normal, as one would expect, this large sample distribution does not provide satisfactory approximations for small to moderate samples. Exact critical values for small samples are obtained and tables provided here, using combinatorial techniques, and asymptotic critical regions are assessed against these. For moderate sample sizes in-between i.e. when the samples are too large making combinatorial techniques computationally prohibitive but yet asymptotic regions do not provide a good approximation, we provide a simple Monte Carlo procedure that gives very accurate critical values. As is well-known, the large number of usual rank-based tests are not applicable in the context of circular data since the values of such ranks depend on the arbitrary choice of origin and the sense of rotation used (clockwise or anti-clockwise). Tests that are invariant under the group of rotations, depend on the data through the so-called ‘spacing frequencies’, the frequencies of one sample that fall in between the spacings (or gaps) made by the other. The Wheeler-Watson, Dixon, and the proposed Rao tests are of this form and are explicitly useful for circular data, but they also have the added advantage of being valid and useful for comparing any two samples on the real line. Our study and simulations establish the ‘Rao spacing-frequencies test’ as a desirable, and indeed preferable test in a wide variety of contexts for comparing two circular samples, and as a viable competitor even for data on the real line. Computational help for implementing any of these tests, is made available online “TwoCircles” R package and is part of this paper.

On the circle, one may start with any one of the observations as the "origin" and consider either sense of rotation (clockwise or anticlockwise) and define the frequency of Y j 's in the arc-lengths formed by the sample X (i) 's, i = 1, . . . , m. These S k 's are referred to as spacing frequencies, and their distribution theory remains the same.
Unlike the ranks which are not well-defined for circular data, such spacing frequencies have the rotational invariance property and play a prominent role in comparing two samples on the circle. They are equally useful on the line as they represent rank-differences (see Gatto and Rao, 2015 for details). Large sample theory for families of nonparametric tests of the form m k=1 h(S k ) which are symmetric in these spacing-frequencies as well as for the more general statistics of the form k h k (S k ) are studied in Holst and Rao (1980). But as shown in Gatto and Rao (2015), symmetric statistics based on spacing-frequencies have the rotational invariance property and hence are appropriate in the circular context.
In this paper, we consider three such nonparametric tests based on spacing frequencies for comparing any two circular samples, and these are listed in Table 1 below.
Note that, under the null hypothesis, the expected value of S k is given by n/m. While the Dixon statistic (1940) looks at the squared L 2 norm of (S k −E[S k ]) (which can be seen as also equivalent to S 2 k ), the Rao spacingfrequencies statistic given above, looks at the L 1 norm of (S k − E[S k ]). As we show below, it is equal to the simple sum 2 m k=1 max S k − n m , 0 . This is because This statistic may be seen as the two-sample analog of the frequently used "Rao's spacings test" for testing uniformity or isotropy for a single sample of circular data (see Rao, 1976). The exact distribution theory, critical values, and the relative performance of this statistic T 1 are being studied in some detail here for the first time. The Wheeler-Watson test (see Wheeler and Watson, 1964) is based on what are called uniform scores and can be written in the above form in terms of spacing frequencies.
Remark A. One may construct tests of the same functional form based on the "dual" spacing-frequencies, namely the number of X-observations that fall in between the spacings made by the Y-observations, but as discussed in some detail in Gatto and Rao (2015) there is a one-to-one correspondence between these two sets, and one may proceed either way to obtain comparable conclusions. This duality may be seen as somewhat analogous to choosing the ranks of one of the two samples in the combined sample, in the theory of rank tests. Rao and Murthy (1981) consider a statistic based on the squared frequencies of both kinds and show that in large samples, this does not add to further efficiency compared to squaring just one set of frequencies as does the Dixon statistic mentioned below. Given this, the authors suggest taking the smaller of the two samples as the X's to make up the spacings, and the larger sample as the Y's for obtaining the frequencies in order to avoid many empty cells with zero frequencies (in analogy with having more balls than cells).

Asymptotic Distributions
The asymptotic distributions of the Rao statistic, as well as the Dixon statistic, can be obtained using the general results of Holst and Rao (1980), in particular their Theorem 4.1, stated below.
Theorem 1 (Holst and Rao, 1980). Let T be a symmetric statistic of the form m k=1 h(S k ). Let η be a geometric(ρ) random variable where Then, under the null hypothesis H 0 that the two populations are identical, the asymptotic distribution of T is given by We conclude this section with the special case when the sample sizes from the two populations are equal, just in order to explore the divergence between this large sample result and the exact values, which we do in Fig. 1 later. In this case, Eq. 3.2 has a very simple expression for both these tests, as shown in the following corollary.
Corollary 1. When the sample sizes from the two populations are equal, the limiting null distribution of the Rao statistic is given by

S. R. Jammalamadaka et al. S144
while the asymptotic distribution of the Dixon statistic is given by Proof. From Eq. 3.1, when the sample sizes are equal, η denotes a random variable with a Geometric (1/2) distribution, and therefore we have Thus for the Rao statistic with h(x) = |x − 1|, we can check that Similarly for the Dixon statistic = 10, from which the result follows. Mirakhmedov et al. (2014) provide improvements to the limiting normal approximation given in Theorem 1 through Edgeworth expansions. Gatto and Rao (1999) discuss the use of saddle-point approximations which lead to accurate numerical approximations that compare well with Monte Carlo simulations. However these are still attempts to get closer to the exact distributions, which is the primary focus of this paper.

Exact Distributions
The exact distributions of these test statistics can be found by considering the joint distribution of S = (S 1 , S 2 , . . . , S m ). Since under the null hypothesis the X i 's and Y i 's come from the same distribution, all possible permutations of the X i 's and Y i 's are equally likely. Hence the distribution of S is found by looking at the number of ways Y i 's can be distributed among the spaces between the order statistics of X. Since there are m spaces generated by the X (k) 's and n objects are to be inserted into these m cells, which is the classical combinatorial problem, and can be done in n+m−1 m−1 equally likely ways. Hence each possible configuration of S = (S 1 , S 2 , . . . , S m ) has probability n+m−1 m−1 −1 . In order to derive the probability for any specific value of a given statistic, we need to add the probabilities of all combinations of S that correspond to this value. This involves increasingly complex combinatorial computations, for which we provide an R package.

Computation of the Exact Critical Regions
We now consider the small sample case and provide the exact critical values for these test statistics, corresponding to commonly used significance levels of α = 0.10 and α = 0.05. Since these test statistics are all discrete, for any given signifi-cance level α, we sandwich such an α between upper tail probabilities p 1 and p 2 with the corresponding critical values c 1 and c 2 as follows. For each value of α, we give a pair of bracketing points (c 1 , c 2 ) and corresponding upper tail probabilities (p 1 , p 2 ), where c 1 < c 2 and p 1 > p 2 . Using c i , i = 1, 2 as the critical value and [c i , ∞) as the critical region so that any observed value of T larger than or equal to c i leads to H 0 being rejected, we have a test of significance level p i , i = 1, 2. The points (c 1 , c 2 ) are chosen to be successive values of T so that P (T ∈ (c 1 , c 2 )|H 0 ) = 0 and The upper 10% and 5% critical values for the Rao, Dixon, and the Wheeler-Watson tests are given in Tables 3 to 5 in the Appendix, for small samples (values of m ≤ 12, n ≤ 11). For other choices of m and n, such critical values can be computed using the "TwoCircles" R package 1 that we make available.

Exact Versus Asymptotic Distributions
Such exact null distributions can now be compared to the asymptotic results given by Corollary 1. Figure 1 presents such a comparison for the Rao and Dixon statistics, taking for illustration m = 9 and n = 10,which are nearly equal size samples. This illustrates, as expected, that for small sample sizes the asymptotic distribution provides a very poor approximation to the exact one, especially for the Dixon statistic.
We further illustrate the deficiency of using the asymptotic critical values instead of using the exact values obtained by the combinatorial methods. We do this by comparing the finite sample type I errors for the Dixon and Rao statistics obtained from the exact and the asymptotic rejection regions. These results are obtained by taking equal sample sizes (i.e. n = m − 1) with n = 2, 3, . . . , 12 and at level α = 0.1. The results are presented on the left panel of Fig. 2. As expected, the exact rejection regions lead to type I errors which are systematically closer to α than the ones obtained using asymptotic results. In the right panel of Fig. 2 we investigate by Monte Carlo simulations (under H 0 : X i ∼ N (0, 1) and Y i ∼ N (0, 1), i = 1, . . . , n) the relationship between the sample size n and the empirical type I error obtained from asymptotic rejection regions. The results suggest that a sample size of a several thousands are needed to obtain a correct type I error.
Overall conclusion is that the asymptotic distribution differs considerably from the exact distribution even for moderately large sample sizes. This calls for better approximations for dealing with moderate sample sizes, which is what we explore next. The relationship between the sample size n and the type I error obtained from asymptotic rejection regions. The error bars correspond to the standard errors obtained by resampling techniques For moderately large values of m and n, the calculation of the exact sampling distribution through exhaustive enumeration of all possible combinations may be too computationally intensive and, consequently, infeasible. However, these values may at the same time be too small for the asymptotic distribution to deliver a reliable approximation of the sampling distribution. Such cases may typically arise when min(n, m) > 20 and max(n, m) < 1000. To address this issue, we consider a simple and computationally efficient Monte Carlo approach to compute fairly accurate critical regions and pvalues. This approach tends to reproduce the idea behind random permutations in general non-parametric context, while taking advantage of the known properties of the spacing-frequencies S k under the null hypothesis.
This approach can be summarized in the following algorithm which is based on B Monte Carlo replications:

Draw the following two samples {X
3. Compute the spacing-frequencies based on two samples obtained from the previous step, i.e.
4. Compute the test statistic of interest based on the S * k 's previously obtained and define T * b as follows: 5. If b < B go to step 2 and define b = b + 1, otherwise end the procedure.
As a result of this procedure, the empirical distribution T * 1 , ..., T * B can be used to estimate the bracketing points (c * 1 , c * 2 ) together with the pair of significance levels (p * 1 , p * 2 ). Naturally, this approach can also be employed to calculate empirical p-values, which are simply defined as (see e.g. Davison and Hinkley, 1997) Moreover, the precision of the above approximations can be assessed using different techniques such as e.g. applying the standard non-parametric bootstrap on the T b 's, allowing to select a value of B that leads to the desired level of precision. The algorithm described in this section is implemented in the "TwoCircles" R package 2 that we make available.

Relative Performance of the Three Tests-Simulated Powers
Holst and Rao (1980) provide a comprehensive treatment on the asymptotic distribution theory and asymptotic efficiencies for tests based on spacingfrequencies. However, to calculate the exact distributions of these test statistics under alternatives, one needs the probability distribution of the vector S under the alternatives, which is quite complex and is given by the following with n 1 + n 2 + · · · n m = n.
In view of this complexity, in this section, we investigate the finite sample performance of the Rao test, compared with the Dixon and the Wheeler-Watson tests via simulations, focusing on circular alternatives since these are primarily meant for circular situations. Different scenarios are considered, each corresponding to a different type of departure from the null hypothesis H 0 . Simulation 1 considers the case when the two samples come from two von Mises distributions with the same concentration, but as the difference in their mean directions increases.
Simulation 1. We consider the following setting: An equally spaced grid of 20 values for μ was employed ranging from 0 to π. Two combinations of sample sizes were used for m and n. The empirical power curves (based on 10 4 Monte Carlo replications) of the different tests considered in Table 1 are presented in Fig. 3. It can be observed that the Wheeler-Watson test appears superior in this simulation to both the Rao and Dixon tests.
In the next simulation we consider again the case when the two samples come from two von Mises distributions with the same mean (at 0), but as the difference in their concentration increases.
Simulation 2. In this simulation we consider the following setting: An equally spaced grid of 20 values for δ was employed ranging from 0 to 20. Two combinations of sample sizes were used for m and n. The empirical power curves (based on 10 4 Monte Carlo replications) are presented in Fig. 4. In this simulation, the Rao test appears to provide a better power than Wheeler-Watson and Dixon tests.
Next, we examine the performance of these tests when one of the samples is from a Uniform distribution on [0, 2π) (which is the same as a vM with b a Simulation 3. We consider the following setting: An equally spaced grid of 20 values for δ was employed ranging from 0 to 20. Two combinations of values were used for m and n such that the sum of the sample sizes was kept to 28. The empirical power curves (based on 10 4 Monte Carlo replications) are presented in Fig. 5  In the following simulation, we consider a different case of von Mises mixtures Simulation 4. We consider the following setting: An equally spaced grid of 20 values for δ was employed ranging from 0 to 20. Two combinations of values were used for m and n such that the sum of the sample sizes was kept to 28. The empirical power curves (based on 10 4 Monte Carlo replications) are presented in Fig. 6. Similar to the previous simulation, the Rao test clearly outperforms both the Wheeler-Watson and Dixon tests.
The examples presented here are but a sample of exhaustive small sample power study of these test statistics that we have undertaken. Based on these as well as the cases presented here, we can draw several useful conclusions. Among these 3 tests, the Wheeler-Watson test appears to perform better than the other two in detecting location shifts as indicated in Simulation 1. The Rao test appears preferable when one suspects that one of the samples may be from a mixture-distribution i.e. when one sample is transformed to be uniform on the circle, the other sample comes close to any symmetric bimodal or multimodal alternative such as a mixture of 2 von Mises distributions, as is the case in Simulations 3 and 4. This conclusion coincides with similar results obtained by Gatto and Rao (2015). The Dixon and Rao tests which are omnibus tests, perform well even in such situations. Although it is shown in Theorem 4.2 of Holst and Rao (1980) that the Dixon statistic is asymptotically Locally Most Powerful among symmetric functions of spacing-frequencies, we see that its performance in small samples is not as good as that of Rao test in all our simulations. The inevitable conclusion seems to be that one should use this Rao spacing-frequencies test if one suspects multimodal alternatives, or if the sample sizes are small to moderate -the case for which this paper provides tables and necessary R code.

A Comparison with a Test on the Real Line
As stated earlier, let X (i) 's, i = 1, . . . , m − 1 denote the order statistics of X i on the line with the notation X (0) ≡ −∞ and X (m) ≡ ∞, and S k , k = 1, . . . , m denote the number of Y j 's in the interval [X (k−1) , X (k) ).
In order to make a simple comparison with two-sample nonparametric tests on the line, we consider one of the most commonly used tests namely the Wilcoxon test (see Wilcoxon (1945)) as a proxy. As shown in the lemma below, the Wilcoxon statistic is also a simple linear function of the spacing frequencies and belongs to this general class of tests. However it is not symmetric in these spacing-frequencies and hence takes different values depending on the choice of the origin. Because of this lack of rotational invariance, it cannot be used as such, for comparing two circular samples.
Lemma 1. The Wilcoxon rank sum test statistic W can be written as: Proof. If R i denotes the rank of X (i) in the combined sample, the Since m k=1 S k = n, we have m k=1 (m − k)S k = mn − m k=1 kS k and the assertion follows.
Remark C. A centered version of W is sometimes used, namely the Mann-Whitney statistic U = W − m(m−1) 2 , which can be written as To briefly assess how the three tests considered in this paper fare with respect to competitors on the line, using the Wilcoxon test as a proxy, we consider a simple simulation, viz. Simulation 5.
An equally spaced grid of 20 values for p was employed ranging from 0 to 1. Two combinations of values were used for m and n. The empirical power curves (based on 10 4 Monte Carlo replications) of the tests considered in Table 1 as well as the Wilcoxon rank test, is presented in Fig. 7. Similar to the results presented in Section 6, Rao test clearly outperforms Wheeler-Watson, Dixon and even the Wilcoxon rank test. It can also be observed that the Wilcoxon rank test, and to a lesser extent the Wheeler-Watson test, provide a particularly poor performance in detecting departure from H 0 . This may be be partly explained by the fact that the distribution remains symmetric under the alternative.

An Illustrative Example and Use of Tables
For small sample sizes (values of m ≤ 12, n ≤ 11), Appendix provides tables of critical values for these three test statistics, by giving bracketing values (closest critical value on either side) for α = 0.05, and α = 0.1. More extensive tables as well p-values can be obtained using R code available in the R package "TwoCircles" 3 . We illustrate the use of this code through the following example: > library(TwoCircles) > get_critical_values(n = 6,m = 8, test = "rao", alpha = 0.05) Bracketing values (c1, c2) corresponding to significance levels (p1, p2) for Rao test based on the significance level 0.05 c1 = 9 (p1 = 0.0862) c2 = 10.5 (p2 = 0.0047) We now consider how these circular tests perform and their exact and asymptotic p-values, by applying them on a classic data set on homing pigeons. Homing pigeons are selectively bred pigeons that possess the ability to find their way home over extremely long distances, and are used in experiments on animal navigation. There is considerable literature on this subject (see e.g. Walcott, 1996 and the references therein). Among the early studies on homing pigeons, Schmidt-Koenig (1958) evaluated the difference in orientation abilities of a group of control and experimental birds. The experimental birds had their internal clock reset by six hours clockwise and were expected to deviate by about 90 • counterclockwise with respect to the control birds upon release (see Schmidt-Koenig, 1958 for details). The release flight direction of the pigeons from the two groups were collected in different exper-b a Figure 7: a Power curve for an equally spaced grid of 20 values for p (see Simulation 5) ranging from 0 to 1 for the tests considered in Table 1 as well as the Wilcoxon rank test. The values m = 5 and n = 18 were considered in this case. b Similar to point (a) but for the values m = 5 and n = 24 iments and one of them is presented in Fig. 8. It can be observed that mean directions of the two groups appear to significantly differ. The experimental and control birds have a mean direction of respectively 18.9 • and 104.0 • , which is close to 90 • (counterclockwise) expected difference.
A hypothesis of interest here is whether the observed directions of the two groups are from the same circular distribution. The three tests discussed in Section 2 can be used to perform such analysis. Their asymptotic and exact p-values are reported in Table 2. These results can be replicated using our package as illustrated through the following example:  Figure 22 of Schmidt-Koenig (1958) The differences between the asymptotic and exact p-values are noticeable, in particular for the Dixon test where the asymptotic and exact p-values are 12.16% and 0.19%, respectively. In such case, one would actually be likely not to reject the null hypothesis simply due to the poor approximation of the asymptotic distribution for such a small sample size.
For the sample sizes of m = 9 and n = 10 as in this example, Fig. 1 can be used to read off the asymptotic and exact p-values for the Dixon and Rao tests. The poor asymptotic approximation in the case of Dixon test is particularly apparent from this figure, which confirms the large difference reported in Table 2.

Concluding Remarks
A straightforward combinatorial approach is used to derive the exact distributions of several tests based on spacing frequencies. This class of tests for comparing two circular samples includes the newly introduced Rao test,  the Dixon test, and the Wheeler-Watson test. For all these statistics our method provides the exact critical values and tables. Although the asymptotic theory of these statistics has been well-studied, such results often lead to a poor approximation in small to moderate samples, as we demonstrate. The comparative power performance of these various tests in small samples has been studied through extensive simulations, some of which are presented here. Based on these simulations, we find that omnibus tests like the Rao test and the Dixon test are preferable to the Wheeler-Watson test if one suspects symmetric multimodal alternatives. Again in small to moderate samples, Rao spacing frequencies test often outperforms the Dixon test. An illustrative comparison with the Wilcoxon test which is commonly used on the real line for comparing two samples, is presented to demonstrate that these tests may also be effectively used in two-sample comparisons on the real line. hájek, j., sidak, z. and sen, p.k (1999). Theory of Rank Tests. Academic Press, Cambridge.