Permutation testing for thick data when the number of variables is much greater than the sample size: recent developments and some recommendations

In many scientific disciplines datasets contain many more variables than observational units (so-called thick data). A common hypothesis of interest in this setting is the global null hypothesis of no difference in multivariate distribution between different experimental or observational groups. Several permutation-based nonparametric tests have been proposed for this hypothesis. In this paper we investigate the potential differences in performance between different methods used to test thick data. In particular we focus on an extension of the Nonparametric combination procedure (NPC) proposed by Pesarin and Salmaso, a rank-based approach by Ellis, Burchett, Harrar and Bathke, and a distance-based approach by Mielke. The effect of different combining procedures on the NPC is also explored. Finally, we illustrate the use of these methods on a real-life dataset.


Introduction
Thick data refers to datasets in which the number of variables exceeds the number of observations. In many modern research scenarios (e.g. engineering, manufacturing, genetics, neuroscience and other medical fields) the number of variables can be much larger than the number of observations. Classic parametric methods (e.g. generalized linear models) are underdetermined when the number of parameters exceeds the number of observations and are therefore unable to provide valid inference. Nonparametric permutation tests for a suitable global null hypothesis in one-factorial designs have been proposed, for example a distance-based approach by Mielke and Berry (2007), a rank-based approach (Harrar and Bathke 2008a, b; Liu et al. 2011) implemented in the R package npmv (Ellis et al. 2017), and the Nonparametric combination procedure (NPC) by Pesarin and Salmaso (2010). Such methods differ fundamentally in their approach: the distance and rank methods use a top-down approach, first calculating a global test statistic and relying on post-hoc tests to test marginal univariate hypotheses. The NPC starts with test statistics for the marginal hypotheses and combines them (hence the name nonparametric combination procedure) to obtain p-values for global tests. Combining functions for p-values have a long history, with some of the best known being Fisher's (Fisher 1932), Stouffer's (Stouffer et al. 1949) and Tippett's (Tippett 1931).
One has to differentiate between p-value combination in a meta-analytic sense and p-value combination in a multivariate sense. The first concerns itself with testing the same hypothesis on different datasets, the second with testing multiple hypotheses on the same dataset. In the first scenario, the one for which p-value combination was originally developed, there are theoretical works on the optimality of combining functions for different alternative hypotheses, for example Birnbaum (1954) and Heard and Rubin-Delanchy (2018). In the second case, there is much less guidance regarding the optimal choice of combining function, especially in the context of the NPC. Pesarin and Salmaso (2010, p. 133) essentially advise using Tippett's when we expect only one or a few, but not all, sub-alternatives to occur, using Stouffer's when we expect all sub-alternatives to be jointly true, and using Fisher's when no specific knowledge is available.
After a quick review of each method, we compare their performance in different simulation settings. We focus on settings in which the number of variables is larger than the sample size and the ratio of informative to uninformative variables is small. We concentrate on the NPC and the influence of different combining functions on its performance, using the other two methods as competitors where suitable. Lastly, we demonstrate the use and usefulness of multivariate testing versus simple univariate testing with multiplicity control using a real-life dataset.

Assume we observe $p$-dimensional data $X^{(k)}_{ij}$, where $i = 1, \dots, a$ is the group index, $j = 1, \dots, n_i$ is the subject index, and $k = 1, \dots, p$ is the variable index. We then use $\mathbf{X} = \{X(i) \mid i = 1, \dots, n\}$ to denote the dataset, where $n = \sum_{i=1}^{a} n_i$ and $X(i) = (X^{(1)}_i, \dots, X^{(p)}_i)$, such that the first $n_1$ observations belong to the first group, the next $n_2$ to the second group, and so on, as illustrated in Table 1 (the so-called unit-by-unit representation, in which each column holds the $p$-dimensional observation vector of one unit). Because we want to exchange columns of this table, we need multivariate exchangeability under the null hypothesis. A suitable choice for $H_0$ is therefore
$$H_0: X_1 \overset{d}{=} X_2 \overset{d}{=} \cdots \overset{d}{=} X_a,$$
where $X_i$ denotes a random vector with the same distribution as the one from which the observations of the $i$-th group were drawn and $\overset{d}{=}$ denotes equality in distribution. The core principle underlying the NPC is to first test a number of sub-hypotheses; a combined test is then conducted for the intersection hypothesis.
A natural choice in our case is the hypotheses of equality of the marginal distributions, $H_0^{(k)}: X_1^{(k)} \overset{d}{=} \cdots \overset{d}{=} X_a^{(k)}$ for $k = 1, \dots, p$. Note that $H_0$ implies the intersection hypothesis $H_0^{mar} = \bigcap_{k=1}^{p} H_0^{(k)}$, but the reverse is not necessarily true. Thus, some alternatives in $H_0^{C}$ might not be in the consistency region of the test.
The multivariate distributions are stochastically comparable if, for all $i, i' = 1, \dots, a$, either $X_i \overset{d}{\le} X_{i'}$ or $X_{i'} \overset{d}{\le} X_i$, where $X_i \overset{d}{\le} X_{i'}$ means that $E[f(X_i)] \le E[f(X_{i'})]$ for all bounded and increasing functions $f: \mathbb{R}^p \to \mathbb{R}$. It is worth pointing out that, under the assumption that the multivariate distributions are stochastically comparable, the equality between $H_0$ and $H_0^{mar}$ does hold (see Baccelli and Makowski 1989). This makes the NPC especially suitable for testing hypotheses of stochastic dominance, as we shall see later.
Let $\mathbf{X}^*_b$, $b = 1, \dots, B$, denote $B$ random permutations of the dataset, and let $T_{kb}$ be the value of the $k$-th test statistic computed on the $b$-th permutation (see Table 3). Let $\varphi$ be a p-value combining function as described below. We then compute the original test statistics $T_k^o$, $k = 1, \dots, p$, over $\mathbf{X}$ (see Table 2).
Next, the centered version of the empirical cumulative distribution function of each of the $p$ test statistics among the $B$ permutations is calculated as
$$\hat{L}_k(t) = \frac{\sum_{b=1}^{B} \mathbb{1}(T_{kb} \ge t) + \tfrac{1}{2}}{B + 1},$$
and a p-value for each of the $p$ tests is defined as $\hat{\lambda}_k = \hat{L}_k(T_k^o)$. In other words, $p$ partial p-values are obtained together with their simulated permutation distributions $\{\hat{L}_{kb} = \hat{L}_k(T_{kb}),\ b = 1, \dots, B,\ k = 1, \dots, p\}$, as illustrated in Table 4. For each permutation and for the original data we then combine the p-value statistics into $T^o := \varphi(\hat{\lambda}_1, \dots, \hat{\lambda}_p)$ and $T_b := \varphi(\hat{L}_{1b}, \dots, \hat{L}_{pb})$. We again calculate the empirical cumulative distribution function and finally obtain the global p-value
$$\hat{\lambda} = \frac{\sum_{b=1}^{B} \mathbb{1}(T_b \ge T^o) + \tfrac{1}{2}}{B + 1}.$$
The choice of combining function $\varphi(\cdot)$ affects the definition, and potentially the power, of the combined test used to assess the global problem of interest; it is therefore a critical step.
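The authors worked in R; as an illustrative sketch of the procedure just described (function names, the one-sided per-variable sum statistic and the choice of Fisher's combiner are ours), the NPC can be written in Python as:

```python
import numpy as np

def npc(data, labels, stat_fn, combine_fn, B=1000, rng=None):
    """Nonparametric combination: permute group labels, compute the p
    partial test statistics, convert them to partial p-value statistics
    via their permutation distributions, and combine into a global test."""
    rng = np.random.default_rng(rng)
    p = data.shape[1]
    T = np.empty((B + 1, p))
    T[0] = stat_fn(data, labels)              # observed statistics
    for b in range(1, B + 1):
        T[b] = stat_fn(data, rng.permutation(labels))
    # Significance-level transform: for each variable, the share of
    # permutation values at least as large (large statistic = extreme).
    L = np.empty_like(T)
    for k in range(p):
        L[:, k] = ((T[1:, k][None, :] >= T[:, k][:, None]).sum(axis=1)
                   + 0.5) / (B + 1)
    Tc = combine_fn(L)                        # row-wise combined statistic
    global_p = ((Tc[1:] >= Tc[0]).sum() + 0.5) / (B + 1)
    return global_p, L[0]                     # global and partial p-values

# Example partial statistic: per-variable sum within group 1 (one-sided).
def group1_sums(data, labels):
    return data[labels == 1].sum(axis=0)

# Fisher's combiner applied row-wise: large values are significant.
def fisher_rows(L):
    return -2.0 * np.log(L).sum(axis=1)
```

Any per-variable statistic for which large values indicate departure from the null can be plugged in as `stat_fn`.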
According to Pesarin and Salmaso (2010), a combining function $\varphi(\cdot)$ should satisfy a few fundamental properties:
- $\varphi(\cdot)$ should be non-increasing in each argument, i.e. $\varphi(\dots, \lambda_k, \dots) \ge \varphi(\dots, \lambda'_k, \dots)$ whenever $\lambda_k < \lambda'_k$
- $\varphi(\cdot)$ should reach its supremum value $\varphi_{sup}$ even when only a single partial p-value attains 0
- for each significance level $\alpha$, the related critical value $T_\alpha$ should be finite and lower than $\varphi_{sup}$
- the rejection region of the combined test achieved using $\varphi(\cdot)$ should be convex
In this paper we consider some of the most popular combining functions which satisfy these properties and investigate, by means of a simulation study, how the choice of combining function affects the power of the combined tests under several different scenarios.
Let us now introduce these combining functions, a special case of alternative hypothesis, and the aforementioned competing nonparametric methods.

P-value combining functions
Five different combining functions are considered in this paper and they are briefly described in this section.

Fisher and truncated Fisher
Fisher's combining function is defined as
$$\varphi_F = -2 \sum_{i=1}^{p} \log(\lambda_i),$$
or equivalently, as a test statistic, in terms of the product of the p-values. This approach was extended by Zaykin et al. (2002) to the so-called truncated product method (TPM). Here only those p-values below a certain prespecified threshold $\tau$ are considered, that is
$$\varphi_{TPM} = -2 \sum_{i=1}^{p} \log(\lambda_i)\, \mathbb{1}(\lambda_i \le \tau).$$
Truncation generally helps gain power, especially with highly dependent data (Arboretti Giancristofaro et al. 2016).
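The two functions above can be sketched in Python (illustrative helpers, names ours; large values indicate significance):

```python
import numpy as np

def fisher_combine(pvals):
    """Fisher's combining function: -2 * sum of log p-values."""
    pvals = np.asarray(pvals, dtype=float)
    return -2.0 * np.sum(np.log(pvals))

def tpm_combine(pvals, tau=0.2):
    """Truncated product method: only p-values at or below tau contribute."""
    pvals = np.asarray(pvals, dtype=float)
    kept = pvals[pvals <= tau]
    if kept.size == 0:
        return 0.0  # no component falls below the truncation threshold
    return -2.0 * np.sum(np.log(kept))
```

With `tau=0.2` (the value used later in the simulation study), the TPM reduces to Fisher's function applied only to the small p-values.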

Stouffer
Stouffer's method, also sometimes attributed to Liptak, is based on the quantile function $\Phi^{-1}$ of the standard normal distribution:
$$\varphi_S = \sum_{i=1}^{p} \Phi^{-1}(1 - \lambda_i).$$
Here we choose the version with $1 - \lambda_i$ instead of $\lambda_i$ as the argument of $\Phi^{-1}$ because, for the NPC, large values should be significant.

Tippett
Tippett's method looks at the smallest p-value (or, in the case of the NPC, at one minus the smallest p-value):
$$\varphi_T = 1 - \min_{1 \le i \le p} \lambda_i.$$
If the global null is true, the values of the Tippett combining function obtained during the combination step of the NPC are distributed as the maximum of $p$ different values. The use of this combining function is recommended when only a low number of informative variables is expected (see Pesarin and Salmaso 2010, p. 133).

Logistic combining function
The logistic combining function (LCF) is similar to Fisher's function and is defined as:
$$\varphi_L = -\sum_{i=1}^{p} \log\!\left(\frac{\lambda_i}{1 - \lambda_i}\right).$$
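Stouffer's, Tippett's and the logistic combining functions described above can be sketched together in Python (illustrative helpers, names ours; in each case large values indicate significance):

```python
import math
from statistics import NormalDist

def stouffer_combine(pvals):
    """Sum of standard normal quantiles of 1 - p_i."""
    inv = NormalDist().inv_cdf
    return sum(inv(1.0 - p) for p in pvals)

def tippett_combine(pvals):
    """One minus the smallest partial p-value."""
    return 1.0 - min(pvals)

def logistic_combine(pvals):
    """Logistic combining function, similar in spirit to Fisher's."""
    return -sum(math.log(p / (1.0 - p)) for p in pvals)
```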

Iterative combination
Different combining functions will of course lead to different p-values. This provides the researcher with a significant degree of freedom and raises the already mentioned question of which combining function is 'the best'. One way to address this problem is to formulate specific advice based on the performance of different combining functions under different scenarios. Another is to sidestep the question entirely by using the so-called iterative combination. Assume one has conducted the NPC as described above on a given dataset using $f$ different combining functions $\varphi_1, \dots, \varphi_f$. Instead of one vector consisting of a p-value and the $B$ values of the empirical cumulative function as above, we get $f$ different vectors, one for each combining function (see Table 5). This is exactly the same situation as with the traditional NPC before the combination step, except that instead of $p$ rows resulting from the $p$ different test statistics, we now have $f$ rows resulting from the $f$ different combining functions. We can simply apply the combining functions again to obtain second-iteration p-values and values of the empirical cumulative functions. Our simulation study suggests that a two-fold convergence occurs if one repeats this iterative procedure: the p-values resulting from the $f$ combining functions converge towards each other, and they all converge together to a certain fixed point. One could define this point as the ultimate global p-value. Of course, the researchers retain some degree of freedom, since they do not have to settle on a single combining function; however, the number of possible sets of combining functions is even larger than the number of individual combining functions.
In the simulation study we used an iterative procedure based on Fisher's, Stouffer's and the logistic combining function. Convergence was considered to have occurred when the difference between the largest and smallest of the three p-values was less than 0.01. In this case the result from Fisher's combining function at the iteration at which convergence was achieved was returned as the iterated p-value. If convergence did not occur after 10 iterations, a missing value was returned.

Stochastic ordering
A special example of an alternative hypothesis for the comparison of more than two groups is the so-called stochastic ordering alternative. It tests for a certain trend in the population that puts an order relation on the groups. Assume we observe univariate data $X_{ij}$, with $i = 1, \dots, a$ the group index and $j = 1, \dots, n_i$ the subject index. Again we use the unit-by-unit representation and write $\mathbf{X} = \{X(i) \mid i = 1, \dots, n\}$, where $n = \sum_{i=1}^{a} n_i$, and assume the first $n_1$ observations belong to group 1, the next $n_2$ to group 2, and so on. We want to test the hypothesis
$$H_0: X_1 \overset{d}{=} \cdots \overset{d}{=} X_a \quad \text{against} \quad H_1^{\le}: X_1 \overset{d}{\le} X_2 \overset{d}{\le} \cdots \overset{d}{\le} X_a,$$
with at least one inequality being strict. For this purpose one can partition the null hypothesis into $a - 1$ sub-hypotheses corresponding to pairwise comparisons. To do so, use $Y_{1c} = X_1 \cup \cdots \cup X_c$ to define the artificial group of observations that are within the first $c$ original groups, and use $Y_{2c}$ to define the observations that do not belong to the first $c$ groups. The original testing problem then becomes (for $H_1^{\le}$; the case of $H_1^{\ge}$ works equivalently, of course)
$$H_{0c}: Y_{1c} \overset{d}{=} Y_{2c} \quad \text{against} \quad H_{1c}: Y_{1c} \overset{d}{\le} Y_{2c}, \qquad c = 1, \dots, a - 1.$$
This approach can easily be extended to a multivariate setting. Under the assumption that the multivariate distributions are stochastically comparable to one another, equality of the marginal distributions implies equality of the multivariate distributions (the reverse is of course always true). In this case, therefore, the global null can simply be written as an intersection of local null hypotheses of equality of the marginal distributions, and the NPC can be applied again.
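The partition into the $a-1$ pooled two-sample comparisons can be sketched as follows (an illustrative helper; the function name is ours):

```python
import numpy as np

def ordering_splits(data, labels, groups):
    """For ordered groups (g_1, ..., g_a), yield the a-1 pooled two-sample
    splits Y_1c = groups 1..c versus Y_2c = groups c+1..a that partition a
    stochastic-ordering test into pairwise comparisons."""
    for c in range(1, len(groups)):
        low = np.isin(labels, groups[:c])  # units in the first c groups
        yield data[low], data[~low]
```

Each yielded pair is then tested with a two-sample one-sided statistic, and the resulting partial tests are combined via the NPC.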

Competing methods
The distance-based approach by Mielke and Berry (2007) and the rank-based approach by Ellis et al. (2017) are also considered in this study and are briefly described in this section.

Rank-based approach
Assume we observe realizations of random variables $X^{(k)}_{ij}$, where $i = 1, \dots, a$ is the group index, $j = 1, \dots, n_i$ is the subject index, and $k = 1, \dots, p$ is the variable index. Set $N := \sum_{i=1}^{a} n_i$ as the total number of observational units and $X_{ij} = (X^{(1)}_{ij}, \dots, X^{(p)}_{ij})'$ as the observation vector for each observational unit. We assume that all $X_{ij}$ are independent and within each group are identically distributed as $X_{ij} \sim F_i$. We want to test the global null hypothesis $H_0: F_1 = \cdots = F_a$. Let $R^{(k)}_{ij}$ denote the rank (midrank in case of ties) of $X^{(k)}_{ij}$ among all $N$ observations of the $k$-th variable, and let $R_{ij} = (R^{(1)}_{ij}, \dots, R^{(p)}_{ij})'$ be the vector of these ranks associated with each observational unit. Then define the within-group means $\bar{R}_{i\cdot} := \frac{1}{n_i} \sum_{j=1}^{n_i} R_{ij}$ and the total unweighted mean $\bar{R}_{\cdot\cdot} := \frac{1}{a} \sum_{i=1}^{a} \bar{R}_{i\cdot}$. Finally, let $H$ and $G$ denote the between- and within-group sums of squares and cross-products matrices of the rank vectors. The ANOVA-type test statistic used is then defined as $T := \operatorname{tr}(H)/\operatorname{tr}(G)$, and its sampling distribution can be approximated by an F-distribution. Note that Ellis et al. (2017) define more than just this test statistic in their paper; however, this is the only one described that can be used for data with a higher number of variables than observations. The abovementioned approximation of the sampling distribution is derived in the works cited above, which also contain simulations showing excellent small-sample performance. For the simulations presented here, we do not use the approximation by the F-distribution but rather a permutation version of $T$, whose p-values are therefore calculated as
$$\hat{p} = \frac{\sum_{b=1}^{B} \mathbb{1}(T^*_b \ge T) + \tfrac{1}{2}}{B + 1},$$
where $T$ is the test statistic from the original sample and $T^*_b$, $b = 1, \dots, B$, is the test statistic for the $b$-th permuted sample.
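An illustrative Python sketch of this ANOVA-type permutation test (the original is implemented in the R package npmv; the $n_i$-weighting of the between-group trace follows one common variant, and all names are ours):

```python
import numpy as np

def midranks(x):
    """Ranks of x, with tied values receiving their midrank."""
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    sx = x[order]
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # midrank of the tie block
        i = j + 1
    return ranks

def anova_type_stat(data, labels):
    """T = tr(H) / tr(G): between- vs within-group rank dispersion."""
    N, p = data.shape
    R = np.column_stack([midranks(data[:, k]) for k in range(p)])
    groups = np.unique(labels)
    a = len(groups)
    ns = np.array([(labels == g).sum() for g in groups])
    means = np.array([R[labels == g].mean(axis=0) for g in groups])
    grand = means.mean(axis=0)  # unweighted mean of the group means
    tr_h = (ns * ((means - grand) ** 2).sum(axis=1)).sum() / (a - 1)
    tr_g = sum(((R[labels == g] - means[i]) ** 2).sum()
               for i, g in enumerate(groups)) / (N - a)
    return tr_h / tr_g

def perm_pvalue(data, labels, B=1000, rng=None):
    """Permutation p-value for T, on the same scale as the NPC p-values."""
    rng = np.random.default_rng(rng)
    t0 = anova_type_stat(data, labels)
    ge = sum(anova_type_stat(data, rng.permutation(labels)) >= t0
             for _ in range(B))
    return (ge + 0.5) / (B + 1)
```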

Distance-based approach
The basis of this approach (see Mielke and Berry 2007) is the standard analysis of variance (ANOVA). We observe realizations of real-valued variables $X_{ij}$, where $i = 1, \dots, a$ is the group index and $j = 1, \dots, n_i$ is the subject index, and consider the usual ANOVA F statistic (1), the ratio of the between-group to the within-group mean squares. The key observation is that the F statistic is a monotone decreasing function of the average within-group spread $\delta$; indeed, $\delta = 2\,MSE_{Within}$ (see Mielke and Berry 2007, p. 52). Since all the other expressions in (1) are fixed properties of a given dataset, one only needs to compute $\delta$ in a permutation approach: permute the data $B$ times, obtaining $\delta^*_1, \dots, \delta^*_B$, and estimate the p-value as the proportion of permutations with $\delta^*_b \le \delta$. In this paper we estimated the p-value as
$$\hat{p} = \frac{\sum_{b=1}^{B} \mathbb{1}(\delta^*_b \le \delta) + \tfrac{1}{2}}{B + 1}$$
for a fair comparison with the NPC and the rank-based method.
This approach can be turned into a multivariate approach with a clever twist: when observing multivariate data, the within-group spread is computed with respect to a particular metric $d$. Thus, in the multivariate setting, the statistic $\delta$ is computed as a weighted average of the mean within-group pairwise distances,
$$\delta = \sum_{i=1}^{a} C_i \binom{n_i}{2}^{-1} \sum_{j < l} d(X_{ij}, X_{il}),$$
with group weights $C_i$ summing to one, e.g. $C_i = n_i/N$. If the variables are measured in different units, they first need to be standardized (see Mielke and Berry 2007, p. 53). Assume we observe $X^{(k)}_j$, $j = 1, \dots, N$, in the $k$-th variable and want to use the Euclidean metric for the test statistic $\delta$; we can then standardize this variable by dividing it by a suitable measure of its scale.
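A sketch of the multivariate permutation test on $\delta$ with the Euclidean metric; the group weights $n_i/N$ and all names are our illustrative choices (every group is assumed to contain at least two units):

```python
import numpy as np

def mrpp_delta(data, labels):
    """Weighted average of the mean within-group pairwise Euclidean
    distances; tight groups give a small delta."""
    N = len(labels)
    delta = 0.0
    for g in np.unique(labels):
        X = data[labels == g]
        n = len(X)
        # mean pairwise distance within the group (n >= 2 assumed)
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        xi = d[np.triu_indices(n, k=1)].mean()
        delta += (n / N) * xi            # group weight C_i = n_i / N
    return delta

def mrpp_pvalue(data, labels, B=1000, rng=None):
    """Small observed delta is evidence against the null, so the p-value
    counts permutations with delta at most as large as observed."""
    rng = np.random.default_rng(rng)
    d0 = mrpp_delta(data, labels)
    le = sum(mrpp_delta(data, rng.permutation(labels)) <= d0
             for _ in range(B))
    return (le + 0.5) / (B + 1)
```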

Simulation study
Before presenting the simulation scenarios, it is worth noting that while the NPC can easily accommodate tests for one-sided hypotheses, the competitor procedures as described above only test two-sided hypotheses. We decided to focus our investigation on behavior in relation to one-sided tests; in a stochastic dominance situation this is the only possible hypothesis. However, to provide a fair comparison, we also conducted a two-sided version of the tests. With this in mind, we explored the simulation scenarios discussed below.

Scenario descriptions
In this section we present the scenarios considered in the simulation study. For all scenarios we generated 1000 Monte Carlo samples. B was set to 1000 and we used the same permutations for each method. For the truncated product method we set τ = 0.2. Since there were always 30 informative variables under the alternative, we did not consider Tippett's combining function, because its use is recommended for cases with only few such variables, as mentioned in the corresponding section. For all simulations we used the statistical software package R (R Core Team 2021). t-distributed variables were created using the package mvtnorm (Genz et al. 2009, 2021). Ordinal variables were created using the package GenOrd (Barbiero and Ferrari 2015), which allows for the creation of ordinal variables with a prespecified correlation structure and marginal distributions.

Two groups-continuous-uncorrelated-one-sided alternative
In this scenario there are two groups, each with a sample size of 10. All variables in this scenario follow a t-distribution with two degrees of freedom and are independent of each other. We created blockwise response variables, one block consisting of 30 variables. In the alternative scenario the variables in the first block have mean 0.25 in the first group and mean zero in the second group. In the null hypothesis scenario all variables in the first block have mean zero. We then added between 0 and 3 blocks of uninformative variables, i.e. blocks where all variables have mean zero, thereby increasing the number of variables from 30 to 60, 90 and 120 respectively. We did this for both the alternative and null hypothesis scenario. For example, adding two blocks of uninformative variables results in three blocks (90 variables) for each scenario: one informative and two uninformative blocks in the alternative scenario and three uninformative blocks in the null hypothesis scenario.
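For illustration, data for this scenario can be generated as follows (a sketch only; the authors used R with mvtnorm, and the helper below with its names and defaults is ours):

```python
import numpy as np

def make_scenario(n_per_group=10, n_blocks=2, block_size=30,
                  shift=0.25, df=2, rng=None):
    """Two groups of independent t(df) variables in blocks of size
    block_size; only the first block is informative, and only in group 1,
    where it is shifted by `shift`. Remaining blocks are uninformative."""
    rng = np.random.default_rng(rng)
    n = 2 * n_per_group
    p = n_blocks * block_size
    X = rng.standard_t(df, size=(n, p))
    X[:n_per_group, :block_size] += shift   # informative block, group 1
    labels = np.repeat([1, 2], n_per_group)
    return X, labels
```

Setting `shift=0.0` yields the corresponding null-hypothesis scenario.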
For the NPC we tested the hypothesis $H_0: X_1 \overset{d}{=} X_2$ against the one-sided alternative $H_1: X_1 \overset{d}{\ge} X_2$, where $X_i$ is a random vector distributed as the observations from the $i$-th group. As a test statistic for each variable we used the sum of the first 10 observations.

Two groups-continuous-uncorrelated-two-sided alternative
Here everything is exactly the same as in the previous scenario, except that we tested the two-sided alternative $H_1: X_1 \overset{d}{\ne} X_2$. Here we also tested the rank- and distance-based methods. For the distance-based approach we used the Euclidean and the Manhattan metric. Since the variances are equal for all variables, we did not standardize.

Two groups-mixed-uncorrelated-one-sided alternative
Again there are two groups of 10 observations each, and blockwise variables are created in the same way as in the two groups-continuous-uncorrelated scenarios. Each block consists of 15 multivariate t-distributed variables with two degrees of freedom and 15 discrete variables with uniform distribution on {1, 2, 3, 4, 5} under the null hypothesis. Under the alternative, the first block of the first group was shifted by 0.25 for the t-distributed variables, and the distribution of the discrete variables was shifted towards larger values. The test statistic used for the NPC in this scenario was the sum of the first 10 observations for the continuous variables and the one-sided Anderson-Darling test statistic for the discrete variables, which compares the empirical distribution functions $\hat{F}_1$ and $\hat{F}_2$ of the two groups over the common support, weighted by $\sqrt{\bar{F}(t)(1 - \bar{F}(t))}$, where $\bar{F}(t) = (n_1 \hat{F}_1(t) + n_2 \hat{F}_2(t))/N$ and $N = n_1 + n_2$.
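Under our reading of this statistic (the exact weighting used by the authors may differ), a discrete one-sided Anderson-Darling-type comparison can be sketched as:

```python
import numpy as np

def ad_one_sided(x1, x2):
    """One-sided AD-type statistic over the discrete support: positive
    when group 2 tends to take larger values than group 1 (F2 lies below
    F1). The top support point is dropped, where the weight vanishes."""
    n1, n2 = len(x1), len(x2)
    N = n1 + n2
    support = np.unique(np.concatenate([x1, x2]))[:-1]
    F1 = np.searchsorted(np.sort(x1), support, side="right") / n1
    F2 = np.searchsorted(np.sort(x2), support, side="right") / n2
    Fbar = (n1 * F1 + n2 * F2) / N          # pooled empirical cdf
    w = np.sqrt(Fbar * (1.0 - Fbar))        # AD-type weight
    return ((F1 - F2) / w).sum()
```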

Two groups-mixed-correlated-one-sided alternative
This scenario is the same as the previous one, except that we introduced a dependency structure. The dependency holds only among variables of the same type (continuous or discrete) within the same block, via a common correlation/scale matrix.

Three groups-mixed-correlated-two-sided alternative
Here there are 3 groups with 10 observations each. The setup was the same as in the previous scenario, i.e. the alternative influenced only group 1, the other two groups being unaffected. For the NPC we tested for equality of distribution versus any kind of stochastic ordering. The test statistic used by the NPC for the continuous variables was the standard ANOVA F statistic. For the discrete variables we used the a-sample version of the Anderson-Darling test statistic.
Here we also tested the rank and distance-based methods. For the distance-based approach we again used the Euclidean and Manhattan metric. Since we have a mixture of different variable types, we also standardized them using the Euclidean metric.

Three groups-mixed-correlated-stochastic ordering alternative
The setup is the same as in the previous scenario but the alternative is different. The test statistics used by the NPC are the same as in the two groups-mixed-uncorrelated-one-sided and two groups-mixed-correlated-one-sided alternative scenarios. Here we tested for a prespecified stochastic order, that is, $H_1 = \{X_1 \overset{d}{\ge} X_2 \overset{d}{\ge} X_3\}$ with at least one inequality being strict.

Results
In this section we present the results of the simulation study for each scenario. Figure 1a shows the rejection rate at α = 0.05 under the null and alternative hypotheses for all tests. Figure 1b, c illustrate the actual rejection rate under different significance levels for the null and alternative hypotheses respectively.

Two groups-continuous-uncorrelated-one-sided alternative
There are some differences between the combining functions; the Fisher and TPM functions perform worse than the others when there are no uninformative variables, but perform similarly when the number of uninformative variables is high. All NPC variants suffer greatly in terms of performance when the number of uninformative variables increases. All methods kept the nominal level.

Two groups-continuous-uncorrelated-two-sided alternative

As can be seen, the power of the NPC variants is much smaller than in the two groups-continuous-uncorrelated-one-sided alternative scenario, which is to be expected. Within the NPC methods, the Stouffer and Logistic methods seem to be mildly inferior to the other three. Again, a reduction in power is seen when adding uninformative blocks; however, after one block, power is already very close to the null scenario, such that no substantial differences can be seen between one, two or three blocks of uninformative variables. The distance-based approaches are very similar in power to the NPC variants, whereas the rank-based method seems to be slightly more powerful. All methods kept the nominal level. Figure 5a shows the rejection rate at α = 0.05 under the null and alternative hypotheses for all tests. Figure 5b, c illustrate the actual rejection rate for different α-levels under the null and alternative hypotheses respectively.

Two groups-mixed-uncorrelated-one-sided alternative
The power to detect the alternative is much larger here, probably because the effect of the discrete variables is picked up more easily than that of the continuous variables. There does not seem to be any great difference between the combining functions. Fisher's combining function and the iterative method perform slightly better than the others when a large number of uninformative variables is present. All tests kept the nominal level.

Two groups-mixed-correlated-one-sided alternative

Figure 6a shows the rejection rate at α = 0.05 under the null and alternative hypotheses for all tests. Figure 6b, c show the actual rejection rate for different significance levels under the null and alternative hypotheses respectively. It is clear from these results that introducing a strong correlation between variables greatly reduces the power of the tests. No large difference is present between the combining functions. Fisher's combining function and the iterative method, this time jointly with the truncated product method, outperform the others by a small amount when a large number of uninformative variables is present. All methods kept the nominal level. Figure 7 shows the rejection rate at α = 0.05 under the null and alternative hypotheses for all tests. Figure 8 illustrates the actual rejection rate for different α-levels under the null hypothesis and Fig. 9 does the same for the alternative hypothesis.

Three groups-mixed-correlated-two-sided alternative
The overall power of the tests is increased somewhat in comparison to the previous scenario, likely due to the higher overall sample size. Again the Stouffer and Logistic methods are slightly inferior to the others. The rank-based method is essentially identical in power to the better performing NPC methods. Under this scenario, the distance-based methods show superior performance to the rank-based method. Indeed, the power of the rank-based method appears to be negatively affected by the presence of mixed variable types. All methods kept the nominal level.

Three groups-mixed-correlated-stochastic ordering alternative

Figure 10a shows the rejection rate at α = 0.05 under the null and alternative hypotheses for all tests. Figure 10b, c illustrate the actual rejection rate for different significance levels under the null and alternative hypotheses respectively. The overall power is again increased compared to the previous scenario due to the larger effect size. The inferiority of the Stouffer and Logistic methods is even more evident now, while the others are comparable in power. All methods kept the nominal level.

Application to a real case study
Finally, we applied the NPC procedure (using all the aforementioned combining functions) and the competing methods to a real case study in the medical field to further compare their performances.

Description of the dataset
The dataset we used to illustrate the different procedures is a small random sample taken from a much larger dataset gathered by the Departments of Internal Medicine I, Geriatrics, Neurology, and Pneumology and the Institute of Medical and Chemical Laboratory Diagnostics of the Paracelsus Medical University, Salzburg, Austria in the context of the Paracelsus 10,000 project (see https://salk.at/16707.html). The goal of the project is to collect epidemiological data and establish a biobank containing samples of 10,000 inhabitants of the city of Salzburg and neighboring areas. For the purposes of this study, we extracted the laboratory data of ten male and ten female participants.

Statistical analysis
We transformed the non-numeric variables by ordering the observed values and coding them from 1 to l, where l is the number of unique observed values. The null hypothesis was equality of the multivariate distributions of the observed variables between men and women. For the NPC we used the absolute value of the difference between the sum of the first 10 observations and the sum of the second 10 observations as the test statistic for the continuous variables. For the ordinal variables we used the two-sided Anderson-Darling test statistic.

Results
The global null was rejected by all methods. The p-values (rounded to four digits) can be found in Table 6. All of them allow for a rejection of the null hypothesis at α = 0.05. The different combining functions of the NPC create a range of p-values from 0.0171 for the iterative combination to 0.0365 for Stouffer's combining function. The rank-based approach and the Euclidean-distance-based approach generate p-values at the lower end of the NPC spectrum, and the Manhattan-distance-based approach produced the smallest p-value of all. One should be careful not to read too much into the differences in p-values in this individual case; however, we would like to point out that the order of the p-values corresponds roughly to the power of the different combining functions in the two groups-mixed-correlated-one-sided alternative scenario, which is also the scenario that is probably closest to the structure of these data. We would also like to point out that when using only the univariate p-values produced by the NPC and correcting them with the Bonferroni-Holm procedure, no significant difference can be detected in any variable (see Table 7).

Conclusions
Testing hypotheses of equality in distribution in a multivariate context when the number of variables is much greater than the sample size is a challenging task. Several nonparametric tests have been proposed in the literature to address this type of problem and among them are the distance-based approach by Mielke and Berry (2007), the rank-based approach by Ellis et al. (2017), and the Nonparametric Combination procedure by Pesarin and Salmaso (2010). In this paper we conducted a simulation study to evaluate the performances of these methods under several different scenarios and then applied them to a real case study, placing particular emphasis on the NPC.
The choice of combining function is a critical step in the NPC procedure which can affect its performances. For this reason, in our study, we also tried to evaluate the impact of this choice on the power of this testing procedure with the aim of providing practitioners with a few guidelines to help them make this decision.
In those simulation scenarios in which we could compare different methods, and not simply different variants of the NPC, we observed the following: 1. The rank-based approach is similar in performance to the best performing NPC variants. 2. The distance-based approach was slightly inferior with only two groups present, but outperformed all others with three groups present.
When comparing the NPC methods with one another, the following statements can be made: 1. Differences between combining functions are not very large but do exist. 2. The truncated product method performs worse than others when no correlation between variables is present, but is among the best performing methods when such a correlation is introduced. 3. Overall, Fisher's combining function and the iterative combining function have a favourable profile, regardless of dependency structure.
The advice we can give, therefore, is that in general the NPC with Fisher's combining function or an iteration of several combining functions, as well as the rank-based approach, performed robustly across all scenarios, never being among the worst. While the performance of the distance-based methods was not as good as that of the other procedures in the two groups-continuous-uncorrelated-two-sided alternative scenario, they outperformed all others in the three groups-mixed-correlated-two-sided alternative scenario. One might suspect that either the nature of the variables (continuous vs mixed), the number of groups (two vs three) or the correlation structure (uncorrelated vs correlated) is responsible for this; our conjecture is that it is the difference in correlation structure. We can thus state, with some reservation, that the rank-based approach or the NPC with the Fisher or iterative combining function is a solid and reliable choice. When a strong correlation between variables is suspected, or a very large number of uninformative variables is believed to be present, the truncated product method also performs well.
Unfortunately, all methods suffer from a large loss in power when the number of uninformative variables is increased. We should also point out that, due to computational costs, the maximum ratio of uninformative to informative variables in our simulations is 3:1. It is of course possible that one of the investigated methods would show better performance were this ratio increased.
In the Application section we can see that all the presented methods are valuable tools for multivariate data analysis. While a univariate approach, with the necessary adjustment for multiple testing, failed to show any difference between groups, the difference was clearly significant with all multivariate methods.
Funding Open access funding provided by Universitá degli Studi di Padova within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.