Abstract
The notion of testing for equivalence of two treatments is widely used in clinical trials, pharmaceutical experiments, bioequivalence and quality control. It is traditionally handled within the intersection–union principle (IU). According to this principle the null hypothesis is stated as the set of effect differences \(\delta\) that lie outside a suitable equivalence interval and the alternative as the set of \(\delta\) that lie inside it. In the related literature, solutions are essentially based on likelihood techniques, which in turn are rather difficult to deal with. A recently published paper goes beyond most likelihood limitations by using the IU approach within permutation theory. A further paper, based on Roy’s union–intersection principle (UI) within permutation theory, goes beyond some limitations of traditional two-sided tests. The UI approach, effectively a mirror image of IU, assumes a null hypothesis where \(\delta\) lies inside the equivalence interval and an alternative where it lies outside. Testing for equivalence can rationally be analyzed by both principles, but, as the two differ in the mirror-like roles assigned to the hypotheses under study, they are not strictly comparable. The present paper’s main goal is to look into these problems and to provide a comparative analysis of the two approaches, highlighting their requirements, properties, limitations, difficulties and pitfalls, so as to get practitioners properly acquainted with their correct use in practical contexts.
Introduction and motivation
Testing for equivalence (Eq) of two treatments is widely used in clinical trials, pharmaceutical experiments, bioequivalence, quality control, etc. Taking bioequivalence as an example, a potential risk arises if the bioequivalence of products is not well regulated and guaranteed. This paper addresses the crucial methodological steps in testing for Eq and provides a unified framework for nonparametric testing within the permutation approach.
In the current literature there are two different, albeit dual or mirror-like, approaches to testing for Eq. The first, commonly adopted especially in bioequivalence and pharmacostatistics (Anderson-Cook and Borror 2016; Berger 1982; Berger and Hsu 1996; D’Agostino et al. 2003; Hirotsu 2007; Hung and Wang 2009; Lakens 2017; Mehta et al. 1984; Patterson and Jones 2017; Richter and Richter 2002; Wellek 2010), is derived from the intersection–union principle (IU) and its analysis is based mainly on likelihood techniques, which in turn are rather difficult to deal with, or even unavailable outside the regular exponential family (Lehmann 1986). As far as we know, the only paper on IU based on permutation methods is Arboretti et al. (2018). The other approach (Arboretti et al. 2017; Pesarin et al. 2014, 2016) is based on Roy’s (1953) union–intersection principle (UI), which is also difficult to deal with using likelihood techniques (Sen 2007; Sen and Tsai 1999). The two approaches essentially differ in terms of the roles assigned to the null and alternative hypotheses. In this paper we start with a simple description of both, before introducing the related permutation solutions. We then provide a few sampling inspection plans and an application to a bioequivalence case study (two further case studies are provided in the Supplementary Material). In the final sections, after exploring the limiting behavior of the permutation solutions, we discuss the most important requirements and pitfalls of both the parametric and the nonparametric permutation-based approaches, before drawing our conclusions. The main aim of this paper is to provide the reader with methodological insights and suggestions for making the most suitable choices in Eq testing, so as to deal with any underlying population distribution, any sample sizes and any margins.
On intersection–union and union–intersection approaches
With reference to one endpoint variable X and a two-sample design, to draw inferences on the substantial Eq of a comparative treatment A to a new treatment B, the IU approach consists in checking whether the effect \(\delta _{A}\) of A lies in a clinically, biologically or technically unimportant interval around \(\delta _{B}\) of B, i.e. testing the non-equivalence (NEq) null \(H:[(\delta _{A}\le \delta _{B}-\varepsilon _{I}) {\textstyle \bigcup } (\delta _{A}\ge \delta _{B}+\varepsilon _{S})]\) versus (vs) the Eq alternative \(K:(\delta _{B}-\varepsilon _{I}<\delta _{A}<\delta _{B}+\varepsilon _{S}),\) where \(\varepsilon _{I}>0\) and \(\varepsilon _{S}>0\) are the inferior (lower) and superior (upper) margins for the difference \(\delta =\delta _{A}-\delta _{B}\), respectively; such margins are established by biological, clinical, pharmacological, technical or regulatory arguments and not by purely statistical considerations. Focusing on the multi-aspect nature of the problem (Berger 1982; Berger and Hsu 1996; Schuirmann 1981, 1987), these hypotheses can be equivalently stated as \(H\equiv H_{I}\bigcup H_{S}\) and \(K\equiv K_{I}\bigcap K_{S}\), where \(H_{I}:\delta \le -\varepsilon _{I},\) \(K_{I}:\delta >-\varepsilon _{I},\) \(H_{S}:\delta \ge \varepsilon _{S},\) and \(K_{S}:\delta <\varepsilon _{S}\) are the partial one-sided sub-hypotheses into which H and K are equivalently broken down. In actual fact, H is true if one and only one of \(H_{I}\) and \(H_{S}\) is true; K is true when both sub-alternatives \(K_{I}\) and \(K_{S}\) are jointly true. Accordingly, H is retained if at least one of two suitable partial test statistics, \(T_{I}\) for \(H_{I}\) vs \(K_{I}\) and \(T_{S}\) for \(H_{S}\) vs \(K_{S},\) retains the respective sub-null. The alternative K is retained if and only if both sub-alternatives \(K_{I}\) and \(K_{S}\) are jointly retained.
So, the overall (global) solution, \(T_{G}\) say, has to be based (Berger 1982; Schuirmann 1981) on a suitable combination of two one-sided tests (TOST).
The UI approach considers the Eq null \({\tilde{H}}:(-\varepsilon _{I}\le \delta \le \varepsilon _{S})\) that \(\delta\) lies inside the Eq interval and the alternative NEq hypothesis \({\tilde{K}}:[(\delta <-\varepsilon _{I}) {\textstyle \bigcup } (\delta >\varepsilon _{S})]\) that \(\delta\) lies outside it. By using \({\tilde{H}}_{I}:\delta \ge -\varepsilon _{I}\) vs \({\tilde{K}}_{I}:\delta <-\varepsilon _{I}\) and \({\tilde{H}}_{S}:\delta \le \varepsilon _{S}\) vs \({\tilde{K}}_{S}:\delta >\varepsilon _{S}\) to denote the two one-sided sub-hypotheses into which the problem can be broken down, according to Roy (1953) we may equivalently state \({\tilde{H}}\equiv {\tilde{H}}_{I}\bigcap {\tilde{H}}_{S}\) and \({\tilde{K}}\equiv {\tilde{K}}_{I}\bigcup {\tilde{K}}_{S}\). That is, the null \({\tilde{H}}\) is true if both one-sided sub-null hypotheses \({\tilde{H}}_{I}\) and \({\tilde{H}}_{S}\) are jointly true, and \({\tilde{K}}\) is true if one and only one of the two sub-alternatives \({\tilde{K}}_{I}\) and \({\tilde{K}}_{S}\) is true. It is worth noting that UI, having inverted the roles of null and alternative, is effectively a mirrored formulation of IU. Of course, the global UI solution \({\tilde{T}}_{G}\) implies a suitable combination of two partial test statistics \({\tilde{T}}_{I}\) and \({\tilde{T}}_{S}\). In Arboretti et al. (2017, 2018), Pesarin et al. (2016), Sen (2007) and Wellek (2010) it is seen that both combinations, of \(T_{I}\) and \(T_{S}\) for IU and of \({\tilde{T}}_{I}\) and \({\tilde{T}}_{S}\) for UI, are the crucial methodological points at issue for obtaining proper solutions (Pesarin 2001, 2015, 2016; Pesarin and Salmaso 2010; Sen 2007; Sen and Tsai 1999; see also the Supplementary Material).
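To make the mirror-image relation between the two formulations concrete, the following minimal sketch (in Python, with illustrative helper names not taken from the paper) classifies a difference \(\delta\) against both pairs of hypotheses:

```python
# Dual IU / UI hypothesis formulations for the difference
# delta = delta_A - delta_B with margins eps_I, eps_S > 0.
# Function names are illustrative only.

def iu_hypotheses(delta, eps_I, eps_S):
    """IU: the null H states non-equivalence, the alternative K equivalence."""
    H = delta <= -eps_I or delta >= eps_S   # H = H_I  union  H_S
    K = -eps_I < delta < eps_S              # K = K_I  intersect  K_S
    return H, K

def ui_hypotheses(delta, eps_I, eps_S):
    """UI: the null H~ states equivalence, the alternative K~ non-equivalence."""
    H_t = -eps_I <= delta <= eps_S          # H~ = H~_I  intersect  H~_S
    K_t = delta < -eps_I or delta > eps_S   # K~ = K~_I  union  K~_S
    return H_t, K_t

# For any delta exactly one member of each pair holds, and away from the
# boundary points the IU null coincides with the UI alternative:
for d in (-0.8, -0.3, 0.0, 0.3, 0.8):
    H, K = iu_hypotheses(d, 0.5, 0.5)
    H_t, K_t = ui_hypotheses(d, 0.5, 0.5)
    assert H != K and H_t != K_t
    assert H == K_t and K == H_t   # holds since d avoids the margins +-0.5
```

Note that only the boundary points \(\delta =-\varepsilon _{I}\) and \(\delta =\varepsilon _{S}\) belong to both nulls; everywhere else each formulation's null is the other's alternative.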
It is important to highlight that, in order to obtain a valid global solution \(T_{G}\), the IU approach requires the researcher to set its maximum type I error rate no larger than \(\alpha\) and its maximum type II error rate \(\beta\) no larger than \(1-\alpha\); i.e.

\[ \mathrm{a)}\quad \max _{\delta \in H}{\mathbf {E}}_{F}(\phi _{G},\delta )\le \alpha \qquad \text{and}\qquad \mathrm{b)}\quad \max _{\delta \in K}\,[1-{\mathbf {E}}_{F}(\phi _{G},\delta )]\le 1-\alpha , \]
where \(\phi _{G}\) is the indicator function of the rejection region of the global test \(T_{G}\) and \({\mathbf {E}}_{F}(\cdot )\) the mean value of \((\cdot )\) with respect to the underlying data distribution F. Correspondingly, with clear meanings of the symbols, to obtain a valid UI global test \({\tilde{T}}_{G}\) the researcher must set its maximum type I error rate and maximum type II error rate as:

\[ \mathrm{{\tilde{a}})}\quad \max _{\delta \in {\tilde{H}}}{\mathbf {E}}_{F}({\tilde{\phi }}_{G},\delta )\le \alpha \qquad \text{and}\qquad \mathrm{{\tilde{b}})}\quad \max _{\delta \in {\tilde{K}}}\,[1-{\mathbf {E}}_{F}({\tilde{\phi }}_{G},\delta )]\le 1-\alpha . \]
These conditions, which concern inferential unbiasedness, are necessarily required of any test statistic (Lehmann 1986).
In the literature on the subject matter, almost all authors apparently assume that regulatory agencies (e.g. FDA, EMEA, etc.) consider only the IU approach for testing Eq. For instance, the ICH E9 glossary (Food and Drug Administration 1998) defines Eq clinical trials as: “A trial with the primary objective of showing that the response to two or more treatments differs by an amount which is clinically unimportant. That is usually demonstrated by showing that the true treatment difference is likely to lie between a lower and an upper equivalence margin of clinically acceptable differences.” This definition, however, does not contain sufficiently precise methodological indications as to which of the two formulations, the IU (H, K) or the UI \(({\tilde{H}},{\tilde{K}})\), is to be chosen, since there are circumstances where one or the other is rationally suitable for the testing problem at hand. We will see that the two share the same asymptotic behavior. This is not the case for finite sample sizes, where quite important differences arise, as will be seen in this paper.
Consequently, in any practical situation the researcher must choose which of (H, K) and \(({\tilde{H}},{\tilde{K}})\) is most suitable for the proper analysis of his/her problem. We think that such a choice, although not well emphasized in the literature on hypothesis testing, is common to almost all testing situations. A simple example clarifies this point: let us consider the classic problem of two simple hypotheses \(\theta _{A}\ne \theta _{B}\), say with \(\theta _{A}<\theta _{B}\). According to the Neyman–Pearson lemma, the best test for \(H:\theta =\theta _{A}\) vs \(K:\theta =\theta _{B}\) rejects H when \(T\ge T_{\alpha },\) where the critical value \(T_{\alpha }\) is determined by the distribution of the likelihood ratio under H. On the other hand, the best test for \({\tilde{H}}:\theta =\theta _{B}\) vs \({\tilde{K}}:\theta =\theta _{A}\) rejects \({\tilde{H}}\) when \({\tilde{T}}\le {\tilde{T}}_{\alpha },\) where \({\tilde{T}}_{\alpha }\) is determined by the likelihood ratio distribution under \({\tilde{H}}.\) So, the duality between the two alternative formulations is evident. The researcher is therefore required to explicitly decide between (H, K) and \(({\tilde{H}},{\tilde{K}})\); i.e. he/she has to justify which is given the role of null hypothesis, and the maximum rejection rate \(\alpha\) when it is true, so as to strictly control both type I and type II inferential errors with \(\beta \le 1-\alpha .\) We believe that no researcher can escape this central necessity. Since both ways are rationally appropriate for Eq testing, this notion supports our purpose of providing a sort of weak comparative (parallel) analysis of both, highlighting their respective requirements, properties, limitations, difficulties, pitfalls and inferential costs.
It has to be stated, however, that the two dual formulations reverse the roles of the respective inferential risks: what acts as the type I error for (H, K) has the role, not just the related numerical value, of the type II error for \(({\tilde{H}},{\tilde{K}})\), and vice versa.
Some authors emphasize the general problem that any traditional two-sided consistent test rejects a point null hypothesis with probability close to one for sufficiently large sample sizes, even for practically negligible violations of the null. For instance, Nunnally (1960) says: “To minimize type II errors, large samples are recommended. In psychology, practically all null hypotheses are claimed to be false for sufficiently large samples so (...) it is nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis”. According to Pantsulaia and Kintsurashvili (2014) the same concept is expressed by more than 200 authors. Clearly, to go beyond such a limitation of two-sided tests, this suggests considering a null hypothesis made up of an interval of substantially equivalent points, rather than only one point. As a result, the hypotheses for any traditional two-sided testing problem are written as \(({\tilde{H}},{\tilde{K}})\). Such a formulation then has its own specific merits, in spite of the fact that it is not adequately considered in the general literature (Wellek 2010, pp. 355–358, considers some likelihood-based hints). Up to now we have dedicated two papers to it: Arboretti et al. (2017) for the one-dimensional setting and Pesarin et al. (2016) for the multi-dimensional setting. It should, however, be emphasized that finding proper workable solutions requires going beyond the limitations of likelihood ratio approaches and thus staying within a nonparametric approach, specifically within permutation theory and the nonparametric combination (NPC) of dependent permutation tests.
It could be argued that the widespread use of the IU approach is due, rather than to rational analysis, to the fact that under a set of very stringent conditions (Lehmann 1986; Romano 2005) a uniformly most powerful unbiased test, \(T_{G}^{opt}\), exists, and this result is merely extended, by simple analogy, to all Eq problems. It will be seen that such an extension outside those conditions might give rise to several quite severe and intriguing consequences.
Intersection–union and union–intersection permutation tests
Without loss of generality and for the sake of simplicity, we illustrate the proposed methodology with reference to a two-sample design and a one-dimensional endpoint variable X. To stay within permutation theory and the nonparametric combination of dependent permutation tests (NPC), let us assume that a sample of \(n_{1}\) IID data related to treatment A is drawn from \(X_{1}\) and, independently, \(n_{2}\) IID data related to treatment B are drawn from \(X_{2}\). This setting is generally obtained when \(n_{1}\) units out of \(n=n_{1}+n_{2}\) are randomly assigned to A and the remaining \(n_{2}\) to B. We define the responses as \(X_{1}=X+\delta _{A}\) and \(X_{2}=X+\delta _{B},\) where the underlying variable X, whose distribution is F, is common to both populations. Hence, \({\mathbf {X}}_{1}=(X_{11},\ldots ,X_{1n_{1}})\) are the data of sample A and \({\mathbf {X}}_{2}=(X_{21},\ldots ,X_{2n_{2}})\) those of sample B. So, the pooled data set is \({\mathbf {X}}=({\mathbf {X}}_{1},{\mathbf {X}}_{2})=\{X(i),i=1,\ldots ,n;n_{1},n_{2}\},\) where in the last notation it is intended that the first \(n_{1}\) data in the list are from the first sample and the rest from the second. Moreover, we assume that, possibly after suitable data transformations to obtain quasi-symmetry of the data [e.g. \(\log (\cdot ),\) \(\sqrt{(\cdot )},\) \(Rank(\cdot )\), etc.; see also point UI.3 in Sect. 7.3], the variable X has a finite mean value, i.e. \({\mathbf {E}}_{F}(X)<\infty ,\) so as to use consistent permutation tests based on the comparison of sample means (Sen 2007; Pesarin 2015; Pesarin and Salmaso 2013).
It is assumed that the two effects \(\delta _{A}\) and \(\delta _{B}\) are fixed and the data are homoscedastic. In further research we will extend our permutation theory to random effects, that is, to a condition compatible with important forms of heteroscedasticity, as frequently met in most experimental and observational problems when a treatment, together with the mean, can also modify the dispersion or even other aspects of a distribution.
In this context, both the IU and UI approaches are in practice worked out by considering two partial tests each: one for \(H_{I}\) vs \(K_{I}\) and one for \(H_{S}\) vs \(K_{S}\) for IU; one for \({\tilde{H}}_{I}\) vs \({\tilde{K}}_{I}\) and one for \({\tilde{H}}_{S}\) vs \({\tilde{K}}_{S}\) for UI.
The two IU partial tests we consider have the (non-standardized) form:

\[ T_{I}=({\bar{X}}_{2}+\varepsilon _{I})-{\bar{X}}_{1}\qquad \text{and}\qquad T_{S}={\bar{X}}_{1}-({\bar{X}}_{2}-\varepsilon _{S}), \]
and correspondingly, the two UI partial tests are:

\[ {\tilde{T}}_{I}={\bar{X}}_{1}-({\bar{X}}_{2}+\varepsilon _{I})=-T_{I}\qquad \text{and}\qquad {\tilde{T}}_{S}=({\bar{X}}_{2}-\varepsilon _{S})-{\bar{X}}_{1}=-T_{S}, \]
where, as usual, \({\bar{X}}_{j}=\sum _{1\le i\le n_{j}}X_{ji}/n_{j},\) \(j=1,2,\) are the sample averages. It is worth noting that \(T_{I}=-{\tilde{T}}_{I}\) and \(T_{S}=-{\tilde{T}}_{S}\) and that large values of each test are evidence for the respective sub-alternative. Also worth noting is that the IU pair \((T_{I},T_{S})\), as well as the UI pair \(({\tilde{T}}_{I},{\tilde{T}}_{S}),\) are functions of essentially the same data \({\mathbf {X}}\), so the two tests in each pair are negatively dependent (Pesarin 2016; Pesarin et al. 2016).
One major problem related to both IU and UI, that also arises when several test statistics are functions of the same data, is what to do with such a multiplicity of dependent partial tests. In this regard, a meaningful warning by Sen (2007) relating to UI says: “However, computational and distributional complexities may mar the simple appeal of the UI to a certain extent. (...) The crux of the problem is however to find the distribution theory for the maximum of these possibly correlated statistics. Unfortunately, this distribution depends on the unknown F, even under the null hypothesis. (...) An easy way to eliminate this impasse is to take recourse to the permutation distribution theory (...)”. The same warning applies to the IU.
We partially disagree with this warning. The greatest obstacle to achieving suitable working solutions is finding a general method to cope with the overly complex dependence structure of the two partial tests \((T_{I},T_{S})\) for IU and \(({\tilde{T}}_{I},{\tilde{T}}_{S})\) for UI. They are negatively dependent, and their dependence coefficients depend on the underlying F, the data \({\mathbf {X}}\) and the margins \((\varepsilon _{I},\varepsilon _{S})\). Indeed, such a dependence runs from correlation \(\rho =-1\), for margins \(\varepsilon _{I}=\varepsilon _{S}=0\), to almost practical independence for sufficiently large margins. Quite a general solution can be validly obtained when it is possible to deal with that dependence nonparametrically.
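The extreme case of that dependence is easy to check empirically. The following sketch (plain Python; function names are ours, not the paper's) recomputes the two partial statistics over random permutations of the shifted pooled data sets and estimates their correlation; with \(\varepsilon _{I}=\varepsilon _{S}=0\) the two statistics are exact negatives of one another, so \(\rho =-1\):

```python
import random

def mean(v):
    return sum(v) / len(v)

def pearson(x, y):
    # Plain Pearson correlation coefficient.
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def perm_correlation(X1, X2, eps_I, eps_S, n_perm=500, seed=3):
    """Correlation of (T_I*, T_S*) over random permutations of the unit
    labels, each statistic computed on its own shifted pooled data set."""
    rng = random.Random(seed)
    n1, n = len(X1), len(X1) + len(X2)
    pooled_I = X1 + [x + eps_I for x in X2]      # X_I = (X_1, X_2 + eps_I)
    pooled_S = X1 + [x - eps_S for x in X2]      # X_S = (X_1, X_2 - eps_S)
    tI, tS = [], []
    for _ in range(n_perm):
        u = rng.sample(range(n), n)              # one random permutation u*
        tI.append(mean([pooled_I[i] for i in u[n1:]])
                  - mean([pooled_I[i] for i in u[:n1]]))   # T_I*
        tS.append(mean([pooled_S[i] for i in u[:n1]])
                  - mean([pooled_S[i] for i in u[n1:]]))   # T_S*
    return pearson(tI, tS)

X1 = [0.1, 0.4, -0.2, 0.3]
X2 = [0.2, 0.0, 0.5, -0.1]
rho_zero = perm_correlation(X1, X2, 0.0, 0.0)    # exactly -1 at zero margins
```

For positive margins the two statistics are computed on different shifted pooled data sets and are no longer exact negatives, so the dependence attenuates as the text describes.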
Moreover, in multidimensional problems, such a dependence is much more complex than pairwise linear. So it seems impossible to deal with it by proper estimates of all associated dependence coefficients, the number and type of which are typically unknown. Thus, this dependence must be worked out nonparametrically within a wellsuited theory. This requires adopting the conditionality principle of inference by conditioning on data \({\mathbf {X}}\) (which under the null are always sufficient), i.e. by the permutation testing principle (Pesarin 2015) and, more importantly, by the NPC of dependent permutation tests (Pesarin 1990, 1992, 2001, 2015, 2016; Pesarin and Salmaso 2010, see also the Supplementary Material).
It is worth noting that, to stay within permutation theory, i.e. by permuting the n-dimensional data \({\mathbf {X}}\), we have to consider permuted data associated with permutations \({\mathbf {u}}^{*}=(u_{1}^{*},\ldots ,u_{n}^{*})\) of the unit labels \({\mathbf {u}}=(1,\ldots ,n)\). Thus, all test statistics are calculated on the corresponding data permutations \({\mathbf {X}}^{*}=\{X(u_{i}^{*}),i=1,\ldots ,n;n_{1},n_{2}\},\) where the two permuted samples are \({\mathbf {X}}_{1}^{*}=\{X(u_{i}^{*}),i=1,\ldots ,n_{1}\}\) and \({\mathbf {X}}_{2}^{*}=\{X(u_{i}^{*}),i=n_{1}+1,\ldots ,n\}\), respectively.
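In code, the label-permutation step amounts to reshuffling the pooled data and splitting them back into two samples of the original sizes. A minimal sketch in plain Python (helper names are ours):

```python
import random

def permute_samples(X, n1, rng=random):
    """One random permutation u* of the unit labels: the pooled data X
    are reshuffled and split back into samples of sizes n1 and n - n1."""
    Xs = X[:]                # copy of the pooled data set X = (X_1, X_2)
    rng.shuffle(Xs)          # random permutation of labels 1, ..., n
    return Xs[:n1], Xs[n1:]  # X*_1, X*_2

def mean(v):
    return sum(v) / len(v)

# Example: the partial statistic T_I recomputed on one permutation of the
# eps_I-shifted pooled data set X_I = (X_1, X_2 + eps_I) of this section.
random.seed(1)
X1 = [0.1, 0.4, -0.2, 0.3]                  # sample A
X2 = [0.2, 0.0, 0.5, -0.1]                  # sample B
eps_I = 0.3
XI = X1 + [x + eps_I for x in X2]           # modified pooled data X_I
XI1_star, XI2_star = permute_samples(XI, len(X1))
T_I_star = mean(XI2_star) - mean(XI1_star)  # one value of T_I*
```

Repeating the last two lines over many random permutations yields the permutation distribution of \(T_{I}^{*}\) conditional on \({\mathbf {X}}\).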
Our proposal is to separately test, albeit simultaneously, \(H_{I}\) vs \(K_{I}\) and \(H_{S}\) vs \(K_{S}\) for IU, and \({\tilde{H}}_{I}\) vs \({\tilde{K}}_{I}\) and \({\tilde{H}}_{S}\) vs \({\tilde{K}}_{S}\) for UI.
To test \(H_{I}\) vs \(K_{I}\), let us consider the statistic \(T_{I}={\bar{X}}_{I2}-{\bar{X}}_{I1},\) where the data \({\mathbf {X}}_{2}\) of sample B are modified to \({\mathbf {X}}_{I2}={\mathbf {X}}_{2}+\varepsilon _{I}\) while those of sample A are retained as they are, i.e. \({\mathbf {X}}_{I1}={\mathbf {X}}_{1}.\) Correspondingly, to test \(H_{S}\) vs \(K_{S}\) we use the statistic \(T_{S}={\bar{X}}_{S1}-{\bar{X}}_{S2},\) where \({\mathbf {X}}_{S1}={\mathbf {X}}_{1}\) and \({\mathbf {X}}_{S2}={\mathbf {X}}_{2}-\varepsilon _{S}\). Thus, the global test is given by one of their IU-NPC solutions, the simplest and most effective of which is:
where \(\lambda _{h}\) is the so-called p value statistic for \(T_{h},\) \(h=I,S\).
Correspondingly, to test \({\tilde{H}}_{I}\) vs \({\tilde{K}}_{I}\) and \({\tilde{H}}_{S}\) vs \({\tilde{K}}_{S}\) we use the two statistics \({\tilde{T}}_{I}={\bar{X}}_{I1}-{\bar{X}}_{I2}=-T_{I}\) and \({\tilde{T}}_{S}={\bar{X}}_{S2}-{\bar{X}}_{S1}=-T_{S},\) and so \({\tilde{T}}_{G}\) is given by their UI-NPC:
According to the general theory (Lehmann 1986; Romano 2005; Wellek 2010), for \(T_{G}\) to be unbiased with the IU-NPC it is required that conditions a) and b) are both satisfied; thus the partial critical values must be calibrated so that the global test \(T_{G}\) attains \(\alpha\) at both extremes of K, that is

\[ {\mathbf {E}}_{F}(\phi _{G},-\varepsilon _{I})={\mathbf {E}}_{F}(\phi _{G},\varepsilon _{S})=\alpha ; \]
analogously, for \({\tilde{T}}_{G}\) to be unbiased with the UI-NPC (Arboretti et al. 2018), it is required that conditions \({\tilde{a}}\)) and \({\tilde{b}}\)) are satisfied; thus the partial critical values must be calibrated so that \({\tilde{T}}_{G}\) attains \(\alpha\) at both extremes of \({\tilde{H}},\) that is

\[ {\mathbf {E}}_{F}({\tilde{\phi }}_{G},-\varepsilon _{I})={\mathbf {E}}_{F}({\tilde{\phi }}_{G},\varepsilon _{S})=\alpha , \]
where \(\phi _{h},\) \({\tilde{\phi }}_{h}\), \(\phi _{G},\) \({\tilde{\phi }} _{G},\) are the indicator functions of rejection regions of concerned tests.
It is worth noting that the partial critical values \(C_{I\alpha }\) and \(C_{S\alpha }\) of parametric tests, which depend on the distribution F, the sample size n and the margins \((\varepsilon _{I},\varepsilon _{S}),\) are, according to Lehmann (1986) and Wellek (2010), to be determined numerically (see also UI.3, Sect. 7.1). Essentially, these values can coincide only asymptotically with the standard critical values (e.g. \(z_{\alpha }\) or \(t_{\alpha },\) etc.) in use with traditional two-sided tests. Thus, in our terminology, they too must be calibrated.
In some of the literature, the non-calibrated (naive) IU-TOST solution \({\ddot{T}}_{G}\) is often considered (e.g. Anderson-Cook and Borror 2016; Berger and Hsu 1996; Lakens 2017; Pardo 2014; Patterson and Jones 2017; Richter and Richter 2002). This solution satisfies condition a) but not b); thus it is far from being unbiased, unless the sample sizes and/or margins are sufficiently large (e.g. Sect. 5 and the Supplementary Material).
When optimal likelihood solutions \(T_{G}^{opt}\) and \({\tilde{T}}_{G}^{opt}\) are available then for divergent sample sizes, under their conditions, we have \(T_{G}\rightarrow T_{G}^{opt}\) and \({\tilde{T}}_{G}\rightarrow {\tilde{T}} _{G}^{opt}\) at quite a high rate (Hoeffding 1952).
Computational details and related algorithms are in Arboretti et al. (2018) for the IU-NPC and in Pesarin et al. (2016) for the UI-NPC (see also the Supplementary Material). Of course, by using \(T_{G}^{ob}=T_{G}({\mathbf {X}})\) and \({\tilde{T}}_{G}^{ob}={\tilde{T}}_{G}({\mathbf {X}})\) to denote the observed values of the test statistics \(T_{G}\) and \({\tilde{T}}_{G},\) respectively, if the p value statistic of the IU-NPC test \(T_{G}\) satisfies \(\lambda _{T_{G}}=\Pr \{T_{G}^{*}\ge T_{G}^{ob}\mid {\mathbf {X}}\}\le \alpha ^{c},\) then the NEq hypothesis H is rejected at significance level \(\alpha\) (the naive IU-TOST \({\ddot{T}}_{G}\) rejects H if \(\lambda _{T_{G}}\le \alpha\); so its true type I error remains unknown, depending on F, the data \({\mathbf {X}}\) and the margins \(\varepsilon _{I}\), \(\varepsilon _{S}\)). Correspondingly, if the UI-NPC test \({\tilde{T}}_{G}\) gives \(\lambda _{{\tilde{T}}_{G}}=\Pr \{{\tilde{T}}_{G}^{*}\ge {\tilde{T}}_{G}^{ob}\mid {\mathbf {X}}\}\le {\tilde{\alpha }}^{c},\) then the Eq hypothesis \({\tilde{H}}\) is rejected at significance level \(\alpha\). In practice, p value statistics are estimated, at any desired confidence rate, by a conditional Monte Carlo procedure as \({\hat{\lambda }}_{h}=\#[T_{h}({\mathbf {X}}^{*})\ge T_{h}^{ob}\mid {\mathbf {X}}]/R,\) where \(T_{h}\) stands for \(T_{I},T_{S},T_{G},{\tilde{T}}_{I},{\tilde{T}}_{S},{\tilde{T}}_{G}\) and R is the number of random permutations.
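As a concrete illustration of the conditional Monte Carlo step \({\hat{\lambda }}_{h}=\#[T_{h}({\mathbf {X}}^{*})\ge T_{h}^{ob}\mid {\mathbf {X}}]/R\), here is a minimal sketch in plain Python. The helper names are ours; the final rule shown is the naive, uncalibrated TOST decision (both partial p values \(\le \alpha\)), while the calibrated NPC combination of the actual algorithms is left to the cited references:

```python
import random

def mean(v):
    return sum(v) / len(v)

def cmc_pvalue(X1, X2, stat, R=2000, seed=0):
    """Conditional Monte Carlo estimate of the p value statistic
    lambda_hat = #[ stat(X*) >= stat_ob | X ] / R over R permutations."""
    rng = random.Random(seed)
    pooled, n1 = X1 + X2, len(X1)
    t_ob = stat(X1, X2)                      # observed value on X
    count = 0
    for _ in range(R):
        Xs = pooled[:]
        rng.shuffle(Xs)                      # one permutation X*
        if stat(Xs[:n1], Xs[n1:]) >= t_ob:
            count += 1
    return count / R

def tost_naive(X1, X2, eps_I, eps_S, alpha=0.05, R=2000):
    """Naive (uncalibrated) IU-TOST: reject the NEq null iff both
    partial permutation p values are <= alpha."""
    lam_I = cmc_pvalue(X1, [x + eps_I for x in X2],
                       lambda a, b: mean(b) - mean(a), R, seed=1)  # T_I
    lam_S = cmc_pvalue(X1, [x - eps_S for x in X2],
                       lambda a, b: mean(a) - mean(b), R, seed=2)  # T_S
    return max(lam_I, lam_S) <= alpha, (lam_I, lam_S)
```

Replacing \(\alpha\) by a calibrated \(\alpha ^{c}\) is the direction the paper takes; the exact NPC combining functions and calibration loops are given in the cited algorithms and the Supplementary Material.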
NPC limiting behavior for IU and UI
Let us assume that the population mean \({\mathbf {E}}_{F}(X)\) is finite, so that \({\mathbf {E}}({\bar{X}}^{*}\mid {\mathbf {X}})\) is also finite for almost all sample data \({\mathbf {X}},\) where \({\bar{X}}^{*}\) is the sample mean of a without-replacement random sample of \(n_{1}\) or \(n_{2}\) elements from \({\mathbf {X}},\) taken as a finite population.
To find the limiting behavior of the IU-NPC, let us first consider the partial test \(T_{S}^{*}(\delta )={\bar{X}}_{S1}^{*}-{\bar{X}}_{S2}^{*},\) where its dependence on the effect \(\delta\) is emphasized. In Sen (2007) and Pesarin and Salmaso (2013), based on the law of large numbers for strictly stationary dependent sequences, such as those generated by without-replacement random sampling (any random permutation is simply a without-replacement random sample from \({\mathbf {X}}_{S}\)), it is proved that as \(\min (n_{1},n_{2})\rightarrow \infty\) the permutation distribution of \(T_{S}^{*}(\delta )\) weakly converges to \({\mathbf {E}}_{F}({\bar{X}}_{S1}-{\bar{X}}_{S2})=(\varepsilon _{S}-\delta )\).
Thus, for any \(\delta <\varepsilon _{S}\) the rejection rate of \(T_{S}(\delta )\) converges to one: \({\mathbf {E}}_{F}(\phi _{T_{S}},\delta )\rightarrow 1\). Moreover, for any \(\delta >\varepsilon _{S}\) that rejection rate converges to zero. At the extreme of \(H_{S},\) i.e. \(\delta =\varepsilon _{S},\) since \(T_{S}(\varepsilon _{S})\) rejects with probability \(\alpha\) for any sample sizes \((n_{1},n_{2}),\) its limiting rejection rate is also \(\alpha\).
The behavior of \(T_{I}(\delta )\) mirrors that of \(T_{S}(\delta )\). That is, its limiting rejection rate: i) for \(\delta =-\varepsilon _{I}\) is \(\alpha ;\) ii) for \(\delta <-\varepsilon _{I}\) is zero; iii) for \(\delta >-\varepsilon _{I}\) is one.
In the global alternative \(K:(-\varepsilon _{I}<\delta <\varepsilon _{S}),\) since both permutation tests \(T_{I}\) and \(T_{S}\) are jointly consistent, the global test \(T_{G}\) is consistent too (Pesarin 2001, 2016; Pesarin and Salmaso 2010), that is \({\mathbf {E}}_{F}(\phi _{T_{G}},\delta )\rightarrow 1\). Correspondingly, for every \((\delta <-\varepsilon _{I}) {\textstyle \bigcup } (\delta >\varepsilon _{S})\) the limiting rejection rate is \({\mathbf {E}}_{F}(\phi _{T_{G}},\delta )\rightarrow 0.\) Moreover, at the extreme points of H, when \(\delta\) is either \(-\varepsilon _{I}\) or \(\varepsilon _{S}\) (one and only one of which can hold if at least one margin differs from zero), the limiting rejection rate of \(T_{G}\) is \(\alpha .\) Finally, if \(\varepsilon _{I}=\varepsilon _{S}=0,\) this rejection rate is not defined for any sample size.
To find the UI-NPC’s limiting behavior, let us analogously consider \({\tilde{T}}_{S}^{*}(\delta )={\bar{X}}_{S2}^{*}-{\bar{X}}_{S1}^{*}\). Since \(\min (n_{1},n_{2})\rightarrow \infty\) implies that the permutation distribution of \({\tilde{T}}_{S}^{*}(\delta )\) weakly converges to \({\mathbf {E}}_{F}({\bar{X}}_{S2}-{\bar{X}}_{S1})=(\delta -\varepsilon _{S}),\) for any \(\delta >\varepsilon _{S}\) the rejection rate of \({\tilde{T}}_{S}(\delta )\) converges to one. Moreover, for any \(\delta <\varepsilon _{S}\) its rejection rate converges to zero. At the right extreme \(\delta =\varepsilon _{S}\), since for any sample sizes \({\tilde{T}}_{S}(\varepsilon _{S})\) rejects with probability \(\alpha ,\) its limiting rejection rate is also \(\alpha\).
The behavior of \({\tilde{T}}_{I}(\delta )\) mirrors that of \({\tilde{T}}_{S}(\delta )\). That is, the limiting rejection rate: i) for \(\delta =-\varepsilon _{I}\) is \(\alpha ;\) ii) for \(\delta >-\varepsilon _{I}\) is zero; iii) for \(\delta <-\varepsilon _{I}\) is one.
In the global alternative \({\tilde{K}}:(\delta <-\varepsilon _{I}) {\textstyle \bigcup } (\delta >\varepsilon _{S}),\) since one and only one of \({\tilde{T}}_{I}\) and \({\tilde{T}}_{S}\) is consistent, \({\tilde{T}}_{G}\) is consistent too (Pesarin 2001; Pesarin and Salmaso 2010; Pesarin et al. 2016).
A simple analysis
The calibrated values \(\alpha ^{c}\) and \({\tilde{\alpha }}^{c}\), giving global type I error rate \(\alpha\) for the IU-NPC \(T_{G}\) and the UI-NPC \({\tilde{T}}_{G},\) can, if the underlying data distribution F is completely known, be determined via Monte Carlo simulations, as is done in Arboretti et al. (2018) for \(T_{G}\) and in Pesarin et al. (2016) for \({\tilde{T}}_{G}\).
The algorithms for the IU-NPC and UI-NPC used to determine the calibrated \(\alpha ^{c}\) and \({\tilde{\alpha }}^{c}\) (see also the Supplementary Material) can even be used to establish the designs \(n_{1}=n_{2}\) and \({\tilde{n}}_{1}={\tilde{n}}_{2}\) such that the maximum powers are \(W_{T_{G}}(0;n,\varepsilon )=p\) and \(W_{{\tilde{T}}_{G}}(\pm 2\varepsilon ;{\tilde{n}},\varepsilon )=p\) at standardized margins \(\varepsilon _{I}=\varepsilon _{S}=\varepsilon\) on the calibrated \(\alpha ^{c}\) and \({\tilde{\alpha }}^{c}\), respectively. The choice to consider designs at \(\delta =0\) for \(T_{G}\) and at \(\delta =\pm 2\varepsilon\) for \({\tilde{T}}_{G}\) resides in the fact that these values are equally far away from H and \({\tilde{H}},\) respectively, and so their power behaviors are comparable (Wellek 2010, Chapter 11).
Assuming \(X\sim N(0,1)\) [\(\sigma\) unknown], \(\alpha =0.05\) and \(p=(0.80,~0.50),\) Table 1 contains a few designs obtained by \(MC=5000\) Monte Carlo runs, each with \(R=2500\) random permutations, for both the IU-NPC and the UI-NPC.
Taking the point \(\varepsilon =0.60\) as a pivot for the approximate sample sizes at \(p=0.80\), the IU-NPC designs for any intermediate margin \(\varepsilon ^{\prime }\) approximately follow the empirical rule \(n(\varepsilon ^{\prime })\approx 48.28\cdot (0.6/\varepsilon ^{\prime })^{2}\), obtained by interpolating the simulation results. It is worth noting that these IU-NPC designs are very close to those obtained with the naive IU-TOST \({\ddot{T}}_{G}\) approach, as reported in Lakens (2017). Such a practical coincidence is mostly due to the facts that: i) the calibrated \(\alpha ^{c}\) coincides with the non-calibrated \(\alpha\) for an interval length, adjusted with sample sizes, of about \((\varepsilon _{I}+\varepsilon _{S})\sqrt{n_{1}n_{2}/n\sigma ^{2}}>5.4\), and ii) permutation tests converge at a high rate to the corresponding parametric solutions (Hoeffding 1952). On the other hand, for the UI-NPC the related empirical rule for intermediate margins \(\varepsilon ^{\prime }\) is \({\tilde{n}}(\varepsilon ^{\prime })\approx 35.33\cdot (0.6/\varepsilon ^{\prime })^{2}.\) The similar approximate rules for \(p=0.50\) are \(n(\varepsilon ^{\prime })\approx 30.25\cdot (0.6/\varepsilon ^{\prime })^{2}\) and \({\tilde{n}}(\varepsilon ^{\prime })\approx 16.03\cdot (0.6/\varepsilon ^{\prime })^{2}\), for the IU-NPC and UI-NPC respectively. It is worth observing that, to reach reasonable power, the Eq testing process requires quite large sample sizes, especially when the margins are small [see also point IU.2 in Sect. 7.1].
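These pivot-based interpolation rules are easy to package for quick use at the design stage. A small sketch (function names are ours; the constants are exactly the ones quoted above, valid for normal data with \(\alpha =0.05\)):

```python
# Empirical design-size rules interpolated from the simulation results
# (normal data, alpha = 0.05, pivot at eps = 0.60). The constants are the
# interpolation coefficients quoted in the text; names are illustrative.

IU_PIVOTS = {0.80: 48.28, 0.50: 30.25}   # IU-NPC, power p at delta = 0
UI_PIVOTS = {0.80: 35.33, 0.50: 16.03}   # UI-NPC, power p at delta = +-2*eps

def iu_design_size(eps, p=0.80):
    """Approximate IU-NPC design size n(eps') for margin eps'."""
    return IU_PIVOTS[p] * (0.6 / eps) ** 2

def ui_design_size(eps, p=0.80):
    """Approximate UI-NPC design size n~(eps') for margin eps'."""
    return UI_PIVOTS[p] * (0.6 / eps) ** 2

# Halving the margin quadruples the required size:
n_small = iu_design_size(0.3)                        # 48.28 * 4 = 193.12
# Sample-size ratio n / n~ at eps = 0.60, p = 0.80:
eff = iu_design_size(0.6) / ui_design_size(0.6)
```

The quadratic dependence on \(0.6/\varepsilon ^{\prime }\) makes explicit why small margins demand very large samples.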
From these results we may derive a sort of relative efficiency rate of the UI-NPC with respect to the IU-NPC. For instance, at \(\varepsilon =0.60\) and \(p=0.80\) the ratio of sample sizes is \(n/{\tilde{n}}\approx 1.36\); for \(p=0.50\) it is \(n/{\tilde{n}}\approx 1.88\); and for \(p=0.30\) (details not reported) it is \(n/{\tilde{n}}\approx 2.57\). In practice, the relative efficiency rates depend mostly on the power p and are almost \(\varepsilon\)-invariant.
Table 2 reports, for standard normal data (\(\sigma\) unknown) with \(n_{1}=n_{2}=12\) and \(\varepsilon =(4/5,3/5,2/5,1/3,1/5,1/10)\), the calibrated \(\alpha ^{c}\) and \({\tilde{\alpha }}^{c},\) the rejection rates of H at \(\delta =0\) and \(\delta =\pm 2\varepsilon\) for the IU-NPC \(T_{G}\) and the naive IU-TOST \({\ddot{T}}_{G}\), and those of \({\tilde{H}}\) for the UI-NPC \({\tilde{T}}_{G},\) all obtained with \(MC=5000\) and \(R=2500.\)
In order to clarify how to read Table 2, let us consider the line \(\varepsilon =0.40\): calibrated \(\alpha ^{c}=0.185\), \(W_{T_{G}}(0)=0.076\), \(W_{T_{G}}(\pm 0.8)=0.987\); \(W_{{\ddot{T}}_{G}}(0)=0.001\), \(W_{{\ddot{T}}_{G}}(\pm 0.8)=1.000\); \({\tilde{\alpha }}^{c}=0.049\); \(W_{{\tilde{T}}_{G}}(0)=0.991\), and \(W_{{\tilde{T}}_{G}}(\pm 0.8)=0.249\); and so on. In particular, the naive IUTOST \({\ddot{T}}_{G}\) appears dramatically conservative, since its maximum power \(W_{{\ddot{T}}_{G}}(0)=0.001\) is much smaller than \(\alpha =0.05\). Since its power is much smaller than \(\alpha\), the naive \({\ddot{T}}_{G}\) cannot seriously be considered a practical way to test for Eq. Moreover, where a comparison can be made, \(W_{T_{G}}(0)=0.076\) against \(W_{{\tilde{T}}_{G}}(\pm 0.8)=0.249\) shows that the UINPC is considerably more efficient than the IUNPC at detecting its respective comparable alternative.
From these results we can see that the IUNPC appears mostly focused on NEq as the main assertion under testing, i.e. the one to be falsified if not true, and so exhibits an intrinsic propensity to retain H even when it is not true. Thus its applications are mostly to problems where rejection of a true Eq has relatively smaller costs than its acceptance when NEq is true, while keeping the related global errors under strict control. This is typically the case in the areas of bioequivalence and pharmacostatistics, where it is considered ethical to retain A (the "old drug") unless there is empirical evidence that B (the "competitor") is Eq to it. On the other hand, the UINPC appears mostly focused on Eq, and so exhibits a relatively larger propensity to retain \({\tilde{H}}\) when it is true. Thus its applications are mostly to problems where rejection of a true Eq has relatively greater costs than acceptance of a false NEq, again while keeping the related global errors under strict control. This generally occurs when the testing aim is to go beyond traditional two-sided procedures, as for instance in quality control. It is also important to emphasize that for \(\varepsilon \le 0.333\) the maximum probability for the naive IUTOST \({\ddot{T}}_{G}\) to retain Eq, when it is true, is zero [see also points ÏÜ.3, ÏÜ.5 and ÏÜ.6 in Sect. 7.2], resulting in pure costs without any inferential benefit.
Table 3 reports, for data from N(0, 1) (\(\sigma\) unknown), the minimal sample sizes \(\ddot{n}_{1}=\ddot{n}_{2}\), in terms of \(\varepsilon =\varepsilon _{I}=\varepsilon _{S}\), for the naive IUTOST \({\ddot{T}}_{G}\) such that conditions a) and b) are satisfied, i.e. such that it is unbiased at \(\ddot{\alpha }_{G}=0.05\), together with the maximum probability (i.e. the power) of accepting Eq [\({\ddot{W}}(Eq)\)] at \(\delta =0\).
If \(\sigma\) were known, the results would be essentially the same. It is proved that when \((n_{1},n_{2})\) and/or \(\varepsilon _{I}+\varepsilon _{S}\) are not sufficiently large (Wellek 2010, p. 5), the naive IUTOST \({\ddot{T}}_{G}\), as frequently used in the literature, can be unacceptably biased (Sect. 7.2).
A bioequivalence application
Let us consider data from Hirotsu (2017, p. 108) on the endpoint variable Log \(C_{\max }\) (the log of maximum blood concentration of a drug), related to \(n_{1}=20\) Japanese subjects and \(n_{2}=13\) Caucasians after prescription of a standard dose of a drug. The data come from a bridging study conducted to investigate bioequivalence between two populations; the test is thus to see whether the two populations can be considered bioequivalent with respect to that variable. Data are reported in Table 4.
The basic statistics are: \({\bar{X}}_{Jap}=1.518\); \({\hat{\sigma }}_{Jap}=0.0813\); \({\bar{X}}_{Cau}=1.457\); \({\hat{\sigma }}_{Cau}=0.0951\); pooled \(\hat{\sigma }=0.0869\). By first using the permutation test \(T^{\prime }={\bar{X}}_{J}-{\bar{X}}_{C}\) for the point null hypothesis \(H^{\prime }:X_{J}\overset{d}{=}X_{C}\) versus the two-sided alternative \(K^{\prime }:X_{J}\overset{d}{\ne }X_{C}\), with \(R=100\,000\) we obtain the p value statistic \(\hat{\lambda }^{\prime }=0.0535\). There is thus no evidence of non-equality between the two data sets at \(\alpha =0.05\), although \({\bar{X}}_{Jap}\) appears to be slightly larger than \({\bar{X}}_{Cau}\) (Student's \(t=1.991\), 31 df, \(\lambda _{t}^{\prime }>0.05\)).
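The two-sample permutation test just applied can be sketched as follows. The data used below are synthetic stand-ins (the actual Log \(C_{\max }\) values are those of Table 4), and `perm_pvalue` is a hypothetical helper for illustration, not the authors' software:

```python
import random

def perm_pvalue(x1, x2, R=10_000, seed=0):
    """Two-sided permutation p value for T' = mean(x1) - mean(x2),
    estimated from R random re-assignments of the pooled data."""
    rng = random.Random(seed)
    n1 = len(x1)
    pooled = list(x1) + list(x2)
    t_obs = abs(sum(x1) / n1 - sum(x2) / len(x2))
    count = 0
    for _ in range(R):
        rng.shuffle(pooled)                    # random re-assignment to groups
        g1, g2 = pooled[:n1], pooled[n1:]
        t = abs(sum(g1) / n1 - sum(g2) / len(g2))
        if t >= t_obs:
            count += 1
    return (count + 1) / (R + 1)               # add-one keeps the estimate > 0

# Synthetic example mimicking the bridging-study setting: a small mean
# shift (1.52 vs 1.46) with spreads close to the reported sigma-hats.
rng = random.Random(1)
xj = [1.52 + rng.gauss(0, 0.080) for _ in range(20)]
xc = [1.46 + rng.gauss(0, 0.095) for _ in range(13)]
p = perm_pvalue(xj, xc)
```

The statistic and the conditioning on the pooled data are as in the text; only the random-permutation Monte Carlo (in place of complete enumeration) and the data are our illustrative choices.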
Let us consider the IUNPC \(T_{G}\) and the UINPC \({\tilde{T}}_{G}\) with the list of margins \(\varepsilon _{I}=\varepsilon _{S}=(0.058,\ 0.071,\ 0.109,\ 0.125)\), corresponding to studentized values (in terms of \({\hat{\sigma }}\)) of \((2/3,\ 0.82,\ 1.25,\ 1.44)\), respectively.
The results, with \(R=100\,000\) on data X, are in Table 5.
The results in brackets refer to midrank data transformations. In our opinion, due to some apparently irregular data concentrations and some ties, the midrank results can be slightly more reliable than those on the plain data X (IU.3, Sect. 7.1, and UI.1, Sect. 7.3). At the \(\varepsilon\) such that \(\ddot{\alpha }(\varepsilon )=\alpha _{G}=0.05\), i.e. \(\varepsilon \approx 0.071\) (corresponding to \(\approx 0.82~{\hat{\sigma }}\)), the type I error rates of the naive IUTOST \({\ddot{T}}_{G}(\varepsilon )\) and of the IUNPC \(T_{G}\) approximately coincide. Of course, this coincidence also holds for larger sample sizes and margins. With the data of the example, the Eq of the two data sets is retained by the IUNPC \(T_{G}\) for margins \(\varepsilon _{I}=\varepsilon _{S}\gtrsim 0.12\approx 1.38~{\hat{\sigma }}\). For margins \(\varepsilon _{I}=\varepsilon _{S}<0.12\), we could state that there is not enough information to conclude that the two populations are equivalent. In any case, it is also well known in the literature that the IUNPC approach suffers from a lack of power (e.g., Berger and Hsu 1996; Wellek 2010).
On the other hand, the UINPC \({\tilde{T}}_{G}\) retains Eq for all margins, including \(\varepsilon _{I}=\varepsilon _{S}=0\) (in such a case \({\tilde{\alpha }}^{c}=\alpha /2\) and the related p value statistic is \({\hat{\lambda }}_{{\tilde{G}}}=\hat{\lambda }^{\prime }=0.0535>0.025)\).
Warnings and good practices for NPC equivalence
The IUNPC solution
The IUNPC approach, as well as the likelihoodbased IU, presents some pitfalls, as is evident from previous results (see also the Supplementary Material). The most important requirements and pitfalls are:

IU.1) It does not admit any solution when \(\varepsilon _{I}=\varepsilon _{S}=0\), i.e. when the null hypothesis is \(H:[(\delta \le 0) {\textstyle \bigcup } (\delta \ge 0)]\), in which case the alternative K becomes logically impossible since it is empty, \(K=\varnothing\) say.

IU.2) When the \(T_{G}\) measure of \(\varepsilon _{I}+\varepsilon _{S}\) is small, there remain difficulties in retaining Eq when it is true. This difficulty is well recognized in the literature. For instance, Wellek (2010, p. 5) says: "...the sample sizes required in an equivalence test in order to achieve a reasonable power typically tend to be considerably larger than in an ordinary one- or two-sided testing procedure ...unless the range of tolerance deviations ...is chosen so wide that even distributions exhibiting pronounced dissimilarities would be declared 'equivalent' ...". The same difficulties are confirmed by simulation results (Arboretti et al. 2018).

IU.3) Using Monte Carlo to obtain the IUNPC calibrated \(\alpha ^{c}\) requires complete knowledge of the underlying distribution F of the variable X, including all its nuisance parameters. When a central limit theorem holds for the partial test distributions, the calibrated \(\alpha ^{c}\) can be approximately assessed by a simulation algorithm as in Arboretti et al. (2018; see also the Supplementary Material), where the unknown standard deviation \(\sigma _{X}\) is replaced by its sampling estimate \({\hat{\sigma }}_{X}\). Assessing \(\alpha ^{c}\) thus involves two sources of approximation. This difficulty is particularly delicate when the sample sizes \((n_{1},n_{2})\) and/or the Eq interval length \(\varepsilon _{I}+\varepsilon _{S}\) are small. However, if the data X have finite second moment, the IUNPC is little influenced by misspecification of the data distribution F [see also UI.3, Sect. 7.3]. When no assumption on the underlying F is made, the midrank transformation of the numeric data \({\mathbf {X}}\) and of the margins \((\varepsilon _{I},\varepsilon _{S})\) may provide a reliable evaluation of the calibrated \(\alpha ^{c}\), provided the normal approximation for Wilcoxon-Mann-Whitney statistics holds, i.e. for sample sizes of about 10 or larger.

IU.4) According to the results in Arboretti et al. (2018), and based on the limiting behavior of permutation tests as stated in Hoeffding (1952), the IUNPC test \(T_{G}=\min (T_{I},T_{S})\) quickly converges to \(T_{G}^{opt}\) under the conditions for the latter.

IU.5) Unless \(\min (n_{1},n_{2})\) or \(\varepsilon _{I}+\varepsilon _{S}\) is very large, once Eq is rejected the application of a Bonferroni-like rule for establishing which \(H_{h}\), \(h=I,S\), is active is generally difficult, if not impossible, since the calibrated \(\alpha ^{c}\) lies in the half-open interval \([\alpha ,~(1+\alpha )/2)\).

IU.6) In practice, to analyze a given data set \(({\mathbf {X}}_{1},{\mathbf {X}}_{2})\), with sample sizes \((n_{1},n_{2})\), margins \((\varepsilon _{I},\varepsilon _{S})\) and significance level \(\alpha\), one has first to establish or estimate \(\alpha ^{c}\) via Monte Carlo as in point IU.3; only then can one proceed with the IUNPC analysis. This implies using two computing algorithms.

IU.7) Only within the IUNPC permutation approach is it possible, while using any kind of ranks, to express margins in the same physical measurement units as the variable X. The rank solutions discussed in Wellek (2010) and Janssen and Wellek (2010) express margins in terms of rank transformations so as to mimic solutions based on normal settings. However, this amounts to considering something similar to random margins, the meaning of which becomes doubtful, or at least questionable and too difficult to justify (Arboretti et al. 2015; Hirotsu 2007). The same difficulty arises when monotonic data transformations such as \(X=\varphi (Y)\) are necessary and margins are expressed in terms of the transformed values X. In any case, provided the margins are clearly justified, the IUNPC can be correctly applied whether they are expressed in terms of the original data Y or of the transformed data X.

IU.8) The multidimensional extension of the IU approach by likelihood methods is far from satisfactory, especially outside normal distributions. We think this extension can easily be carried out under the NPC, and we intend to pursue it in future research.

IU.9) Calibrated reference values under the parametric likelihood ratio approach are obtained by numerical calculation (Wellek 2010; Lehmann 1986) only for population distributions lying within the regular exponential family, and only if the invariance property for nuisance parameters (if any) holds. Outside this family, only approximate solutions can be obtained [IU.3]. The parametric IU approach is therefore extremely demanding. Moreover, whenever the minimal sufficient statistic under the null hypothesis is the whole n-dimensional data set \({\mathbf {X}}\), only nonparametric permutation solutions can be set up correctly (Pesarin 2015, 2016; Pesarin and Salmaso 2010).
The naive IUTOST solution
The naive IUTOST solution, \({\ddot{T}}_{G}=\min (T_{I},T_{S})\) say, as frequently considered in the literature (Anderson-Cook and Borror 2016; Berger 1982; Berger and Hsu 1996; Pardo 2014; Patterson and Jones 2017; Richter and Richter 2002; Schuirmann 1987; Wellek 2010), corresponds to the non-calibrated version that rejects the global H at type I error rate \(\alpha\) when both partial tests reject at the same rate \(\alpha\) in place of the calibrated \(\alpha ^{c}\), i.e. when \(\ddot{\alpha }_{I}=\ddot{\alpha }_{S}=\alpha\). This naive \({\ddot{T}}_{G}\) solution has several further specific pitfalls:

ÏÜ.1) It satisfies condition a) but not b) in Sect. 2; however, it trivially satisfies Theorem 1 in Berger (1982) and Berger and Hsu (1996).

ÏÜ.2) When the \(T_{G}\) measure of \(\varepsilon _{I}+\varepsilon _{S}\) is very large, the non-calibrated naive \({\ddot{T}}_{G}\), whose partial type I error rates are \(\ddot{\alpha }_{I}=\ddot{\alpha }_{S}=\alpha ^{c}=\alpha\), and the calibrated IUNPC \(T_{G}\) coincide, and so both are consistent (Sect. 4).

ÏÜ.3) The naive IUTOST \({\ddot{T}}_{G}\) can be dramatically conservative, and its maximum rejection probability can be much smaller than \(\alpha\), even exactly zero (Arboretti et al. 2018); see ÏÜ.5 and the results in Sect. 5 (see also the Supplementary Material).

ÏÜ.4) Theorem 2 in Berger (1982) and Berger and Hsu (1996) essentially states that margins \((\varepsilon _{I},\varepsilon _{S})\) exist such that the power under K of the naive test \({\ddot{T}}_{G}\) is not smaller than \(\alpha\). Since the standardized length of the Eq interval diverges at the rate \([n_{1}n_{2}/(n_{1}+n_{2})]^{1/2}\) when \(\min (n_{1},n_{2})\) diverges, \({\ddot{T}}_{G}\) is consistent, and such an existence corresponds to consistency. However, it is important to underline that this condition is not constructive, and so it does not help in finding practical solutions. Indeed, in any real problem margins are established, on technical, biological or regulatory grounds, before the experiment for data collection is conducted. Hence, since it is unknown whether the \({\ddot{T}}_{G}\) measure of \((\varepsilon _{I}+\varepsilon _{S})\) for the actual sample data is large enough that \(\ddot{\alpha }=\alpha ^{c}=\alpha\), naive \({\ddot{T}}_{G}\) solutions do not guarantee the minimal requirements needed to be considered valid test statistics.

ÏÜ.5) Paradoxically, when the Eq interval length \(\varepsilon _{I}+\varepsilon _{S}\) is small in terms of the \({\ddot{T}}_{G}\) distribution, the maximum probability for the naive IUTOST \({\ddot{T}}_{G}\) of finding a drug equivalent to itself can be exactly zero. This generally occurs when the two partial rejection regions have no common points, i.e. when \(\phi _{S}\bigcap \phi _{I}=\emptyset\), where \(\phi _{I}\) and \(\phi _{S}\) are the \(\alpha\)-rejection regions of \(T_{I}\) and \(T_{S}\) respectively, so that rejection becomes an impossible event. For instance, with \(n_{1}=n_{2}=12\), \(\varepsilon _{I}=\varepsilon _{S}=0.25\), \(X\sim N(0,1)\) and \(\ddot{\alpha }_{I}=\ddot{\alpha }_{S}=0.05\), a simulation with \(MC=5000\) and \(R=2500\) gives a type I error for \({\ddot{T}}_{G}\) of \(\ddot{\alpha }_{G}\approx 0.000\) and, much worse, a maximum estimated power \({\hat{W}}_{{\ddot{T}}_{G}}(0)\approx 0.000\). Interestingly, the calibrated IUNPC \(\alpha ^{c}\) is about 0.293 (that of the UINPC \({\tilde{T}}_{G}\) is \({\tilde{\alpha }}^{c}\approx 0.047\); see also the Supplementary Material). In this respect it is easy to see that for normal data, with known \(\sigma\) and \(\varepsilon _{I}=\varepsilon _{S}=\varepsilon\), the maximum probability to retain Eq is exactly zero up to \(n_{1}=n_{2}=\lfloor 2(z_{\alpha }\sigma /\varepsilon )^{2}\rfloor\), with \(\lfloor \cdot \rfloor\) the integer part and \(z_{\alpha }\) the \(\alpha\)-quantile of N(0, 1).
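The empty-rejection-region phenomenon is easy to verify numerically. The sketch below (our illustration, not the paper's algorithm) uses the known-\(\sigma\) z-test version of the naive TOST with \(n_{1}=n_{2}=12\), \(\varepsilon _{I}=\varepsilon _{S}=0.25\) and \(\alpha =0.05\); it also evaluates the closed-form bound \(\lfloor 2(z_{\alpha }\sigma /\varepsilon )^{2}\rfloor\) quoted above:

```python
import math
import random
from statistics import NormalDist

alpha, eps, n1 = 0.05, 0.25, 12
z = NormalDist().inv_cdf(1 - alpha)      # one-sided critical value, ~1.645
se = math.sqrt(1 / n1 + 1 / n1)          # std. error of mean diff., sigma = 1

# Naive TOST with known sigma: retain Eq iff BOTH one-sided z tests reject,
# which requires the observed mean difference d to satisfy
#     -eps + z*se < d < eps - z*se.
lower, upper = -eps + z * se, eps - z * se
region_empty = lower >= upper            # True here: the Eq region is empty

# Monte Carlo confirmation at delta = 0 (a drug compared with itself)
rng = random.Random(0)
accepts = 0
for _ in range(5000):
    d = rng.gauss(0, se)                 # simulated mean difference
    if lower < d < upper:
        accepts += 1                     # never happens: region is empty

# Largest per-group size for which the Eq region is still empty,
# i.e. the bound floor(2 * (z * sigma / eps)**2) given in the text.
n_max_empty = math.floor(2 * (z / eps) ** 2)   # 86 for these values
```

At \(n_{1}=n_{2}=87\) the interval \((-\varepsilon +z\,se,\ \varepsilon -z\,se)\) first becomes non-empty, which matches the bound computed above.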

ÏÜ.6) As a consequence, naive IUTOST \({\ddot{T}}_{G}\) tests are not members of the set of test statistics that satisfy conditions a) and b) in Sect. 2. Moreover, as the global type I error rate and the power can both be considerably smaller than \(\alpha\) for small sample sizes and/or small Eq interval length, we may state that the naive \({\ddot{T}}_{G}\) testing procedure rests on an incorrect methodology: it can happen that the true type I error rate is \(\alpha (\pm \varepsilon ,n)\ll \alpha\) and the maximal power, at \(\delta =0\), is \(W_{{\ddot{T}}_{G}}(0,n,\varepsilon )\ll \alpha\), conditions that do not meet the minimal requirements for any test (Nunnally 1960; Sect. 2). Thus, in our opinion, unless sample sizes and/or the Eq interval length are sufficiently large, there is no reason to take the naive \(\ddot{T}_{G}\) into consideration for Eq testing. Essentially, this is our basic criticism of the widespread use of the naive IUTOST method (e.g. Anderson-Cook and Borror 2016; Pesarin 1990, 1992; among many others). We think this intrinsic defect remains hidden to most practitioners because the naive IUTOST sounds intuitively plausible.

ÏÜ.7) A direct consequence of the former two points is that, for the naive IUTOST \({\ddot{T}}_{G}\), the cumulation of inferences from independent studies can be unsuitable. For instance, if there are \(m\ge 2\) analyses, each based on insufficiently large sample sizes, as is common in some meta-analyses and multicenter studies, their combination might always reject that a drug is Eq to itself. Indeed, a valid combination requires all m partial tests to be unbiased (i.e. minimal power \(\ge \alpha\), Sect. 2). In fact, if for study h, \(h=1,\ldots ,m\), \(\phi _{Sh}\bigcap \phi _{Ih}=\emptyset\), i.e. the joint rejection region of the two partial tests \(T_{Sh}\) and \(T_{Ih}\) is empty, the p value related to \({\ddot{T}}_{Gh}\) is 1, and so the p value of any combination of them is also 1, hence always providing for NEq, true or not.
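The degeneracy can be seen concretely with Fisher's combining function, one of the standard NPC combiners: a study with p value 1 contributes nothing to the statistic \(-2\sum _{h}\ln \lambda _{h}\), so if every study has an empty TOST rejection region the combined p value is again 1. A minimal sketch (the closed-form chi-square survival function used here is valid because the degrees of freedom, 2m, are even):

```python
import math

def fisher_combined_p(pvals):
    """Fisher's combining function: statistic -2 * sum(log p_h), referred
    to a chi-square with 2m df (closed-form survival for even df)."""
    m = len(pvals)
    stat = -2.0 * sum(math.log(p) for p in pvals)
    half = stat / 2.0
    # Survival function of chi-square with 2m df:
    # exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    term, s = 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        s += term
    return math.exp(-half) * s

# Three studies, each with an empty TOST rejection region, hence p = 1:
p_degenerate = fisher_combined_p([1.0, 1.0, 1.0])   # exactly 1.0
# whereas informative (unbiased) studies combine as usual:
p_informative = fisher_combined_p([0.04, 0.20, 0.10])
```

The first call illustrates the point in ÏÜ.7: however many such studies are pooled, the combination can never declare Eq.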
The UINPC solution
The most important requirements and pitfalls of the UINPC are:

UI.1) Using Monte Carlo to establish the calibrated \({\tilde{\alpha }}^{c}\) requires complete knowledge of the underlying distribution F of the endpoint variable X, including all its nuisance parameters [same as IU.3, Sect. 7.1]. When a central limit theorem holds for the partial test distributions, the calibrated \({\tilde{\alpha }}^{c}\) can be approximately determined according to Arboretti et al. (2018), since the Eq interval length \(\varepsilon _{I}+\varepsilon _{S}\) can be measured in terms of the underlying standard error \(\sigma _{X}[n_{1}n_{2}/(n_{1}+n_{2})]^{1/2}\). As in practice \(\sigma _{X}\) is unknown, substituting its sampling estimate \({\hat{\sigma }}_{X}\) implies that \({\tilde{\alpha }}^{c}\) can be assessed only approximately. It is worth noting, however, that the related degree of approximation is generally negligible in practice because: i) the true value lies in the closed interval \([\frac{1}{2}\alpha ,~\alpha ]\), so the maximum approximation error is bounded by \(\alpha /2\); and ii) for any given Eq interval, the calibrated \({\tilde{\alpha }}^{c}\) quickly converges to \(\alpha\) for increasing sample sizes, provided the population mean \({\mathbf {E}}_{F}(X)\) is finite. When the population mean is assumed not to be finite, the midrank transformation of the numeric data \({\mathbf {X}}\) and of the margins \((\varepsilon _{I},\varepsilon _{S})\) can provide well-approximated evaluations of the calibrated \({\tilde{\alpha }}^{c}\), provided the normal approximation for Wilcoxon-Mann-Whitney statistics holds.
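Under a normal approximation with known \(\sigma\), the calibration of \({\tilde{\alpha }}^{c}\) can be sketched analytically rather than by Monte Carlo: the size of the UI z-test is maximal at the boundary \(\delta =\pm \varepsilon\) of the Eq interval, where it equals \(a+\Phi (-(\varepsilon _{I}+\varepsilon _{S})/se-z_{1-a})\) for per-test level a, and a bisection on a recovers \({\tilde{\alpha }}^{c}\in [\alpha /2,\alpha ]\). The following is our normal-theory analogue for illustration, not the permutation algorithm itself; it reproduces a value close to the \({\tilde{\alpha }}^{c}\approx 0.047\) quoted in ÏÜ.5:

```python
import math
from statistics import NormalDist

nd = NormalDist()

def boundary_size(a, eps_len, se):
    """Type I error of the UI z-test at the boundary of the Eq interval
    when each one-sided test is run at per-test level a."""
    z = nd.inv_cdf(1 - a)
    return a + nd.cdf(-eps_len / se - z)

def calibrate(alpha, eps_len, se, tol=1e-10):
    """Bisection for the calibrated per-test level in [alpha/2, alpha]
    such that the boundary size equals alpha (boundary_size is
    increasing in a, so bisection applies)."""
    lo, hi = alpha / 2, alpha
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if boundary_size(mid, eps_len, se) > alpha:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# Example: n1 = n2 = 12, sigma = 1, eps_I = eps_S = 0.25 (interval length 0.5)
se = math.sqrt(2 / 12)
a_c = calibrate(0.05, 0.50, se)   # ~0.048, close to the ~0.047 reported
```

As the Eq interval length shrinks to zero the boundary size tends to \(2a\), forcing \(a_{c}\rightarrow \alpha /2\), which matches the limiting value stated in the text.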

UI.2) Similarly to point IU.6 (Sect. 7.1), to analyze a given data set \(({\mathbf {X}}_{1},{\mathbf {X}}_{2})\), with sample sizes \((n_{1},n_{2})\), margins \((\varepsilon _{I},\varepsilon _{S})\) and significance level \(\alpha\), one has first to establish or estimate \({\tilde{\alpha }}^{c}\) via Monte Carlo as at point UI.1; then one can proceed with the UINPC analysis. This too implies using two computing algorithms, but with much less impact than for the IUNPC, because \({\tilde{\alpha }}^{c}\in [\frac{1}{2}\alpha ,~\alpha ]\), a much smaller range than \([\alpha ,~(1+\alpha )/2)\). Indeed, a similar five-entry table would require much smaller numbers of sample sizes and margins.

UI.3) Once NEq is retained at significance level \(\alpha\), identifying which of the two arms is mostly responsible for that result using a Bonferroni-like rule implies that the related type I error lies in \([\frac{1}{2}\alpha ,~\alpha ]\), and so the related type I error rate is not less than \(\alpha /2\). Indeed, it is close to \(\alpha\) even for moderate sample sizes and small Eq interval length since, in practice, the UINPC is intrinsically robust against misspecification of the underlying F, possibly after data transformations achieving near symmetry. In this regard, with data from a Student's t distribution with 2 df (zero mean and infinite variance), \(n_{1}=n_{2}=12\) and \(\varepsilon _{I}=\varepsilon _{S}=0.321\), corresponding to margins of about 0.25 for standard normal data since \(\Pr \{-0.25\le N(0,1)\le 0.25\}=\Pr \{-0.321\le t_{2}\le 0.321\}\), we have \(\alpha ^{c}\approx 0.375\) and \({\tilde{\alpha }}^{c}\approx 0.047\). Compared with the values active under standard normal data, as in ÏÜ.5 (Sect. 7.2), \(\alpha ^{c}\) proves to be much larger than 0.293, and so the IUNPC appears not to be robust against F; instead \({\tilde{\alpha }}^{c}\) coincides to the third figure with 0.047, confirming that the UINPC is at least approximately invariant with respect to F, provided near symmetry of the data is achieved. The robustness properties of the IUNPC and the UINPC will be considered in further research.

UI.4) When \(\varepsilon _{I}=\varepsilon _{S}=0\), i.e. for a sharp null and two-sided alternatives, unless the underlying data distribution is symmetric it is well known to be difficult to find unbiased tests based on comparisons of sample averages (Cox and Hinkley 1974; Lehmann 1986). Within the UINPC, however, the test \({\tilde{T}}_{G}=\max [({\bar{X}}_{1}-{\bar{X}}_{2}),\ ({\bar{X}}_{2}-{\bar{X}}_{1})]\) is always at least unbiased at \(\alpha /2\).

UI.5) Similarly to IU.9 (Sect. 7.1), calibrated reference values under the parametric likelihood ratio approach are obtained by numerical calculation only for population distributions lying within the regular exponential family, and only if the invariance property for nuisance parameters (if any) holds (Ferguson 1967). So, like the IU, the parametric UI approach is also quite demanding. In contrast, when no parametric UI solution is available, approximations within the UINPC generally suffice for most practical applications [UI.1].
Concluding remarks
The present paper provides a sort of comparative analysis of two nonparametric permutation approaches to Eq testing problems. In accordance with the majority of the literature on the subject, one is based on the IU principle; the other is based on the UI principle. Both are rationally suitable for such testing but, since they entail different evaluations of the inferential errors, they are not strictly comparable. As such, rather than a proper comparison, we have proposed a sort of weak comparative (parallel) analysis. We believe that neither can be considered uniformly best for all possible problems. Thus our analysis is mostly concerned with highlighting their respective requirements, properties, difficulties, inferential costs, limitations and pitfalls.
One important point we took into consideration is that part of the literature uses the IU solution with so-called non-calibrated reference critical values; we called this the naive IUTOST solution. In this regard, we showed (see ÏÜ.5 and ÏÜ.6, Sect. 7.2) that since its type I error rate and power can both be zero for relatively small margins and/or sample sizes, implying rejection of Eq, true or not, with probability close to one, the related testing process can become absolutely useless, resulting in pure costs without any inferential benefit. This rather erroneous feature may lead, for instance, to the unacceptable conclusion that "the probability to find that a drug is Eq to itself by the naive IUTOST can be zero".
A further aspect we would like to consider is a comparison between the IU and the UI with respect to so-called point null hypotheses. A point null is equivalent to taking \(\varepsilon _{I}=\varepsilon _{S}=0\), i.e. an equivalence interval of length zero. On the one hand, the UI way coincides with the traditional two-sided solution plus one more feature: once the null has been rejected, its p value \(\lambda =\min (\lambda _{I},\lambda _{S})\) satisfies Bonferroni's rule (UI.3, Sect. 7.3) and allows us to infer which arm is active: e.g. if \(\lambda =\lambda _{I}\) then \(\delta <0\), at type I rate \({\tilde{\alpha }}^{c}=\alpha /2\) (similarly for \(\delta >0\)). On the other hand, the IU way admits no solution at all, so in this formulation a point null cannot be considered as a degenerate null interval. This too shows that the two formulations are essentially different.
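The UI decision rule for the point null takes only a few lines. The function below is our hypothetical illustration of the rule just described (global p value \(\min (\lambda _{I},\lambda _{S})\), Bonferroni threshold \(\alpha /2\), direction read from the smaller partial p value):

```python
def ui_point_null(lam_I, lam_S, alpha=0.05):
    """UI decision for the point null (eps_I = eps_S = 0): compare the
    global p value min(lam_I, lam_S) with alpha/2 (Bonferroni); on
    rejection, the smaller partial p value names the active arm."""
    lam = min(lam_I, lam_S)
    if lam >= alpha / 2:
        return "retain H (Eq)"
    return "delta < 0" if lam == lam_I else "delta > 0"

# With the bridging-study p value 0.0535 > 0.025, Eq is retained,
# reproducing the conclusion reached in Sect. 6.
decision = ui_point_null(0.0535, 0.20)
```

This makes explicit the "plus one more" feature: unlike a plain two-sided test, a rejection here also carries a directional claim at rate \(\alpha /2\).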
A problem faced by any researcher is finding guidance for choosing between the two approaches. Our point of view is that if one considers that rejection of Eq when it is true has relatively smaller costs than its acceptance when NEq is true, as is typically the case in bioequivalence and pharmacostatistics, then the IUNPC is the correct choice. Correspondingly, if one considers that rejection of Eq when it is true has relatively greater costs than its acceptance when NEq is true, as is typically the case in traditional two-sided testing (quality control, etc.), then the UINPC is the correct choice.
In the usual literature on the subject, both the IU and the UI parametric approaches are essentially worked out with likelihood techniques. These approaches, which in any case imply approximate solutions, are rather difficult to deal with since they require quite severe validity conditions, such as population distributions lying within the regular exponential family and enjoying the invariance property for nuisance parameters, if any: conditions that are generally quite difficult to meet and/or justify. Our IUNPC and UINPC permutation solutions are also approximate. However, when an optimal parametric solution exists, its NPC counterpart converges to it asymptotically at a high rate. When a likelihood ratio solution is not invariant with respect to one or more nuisance parameters, it cannot be worked out unless those nuisance parameters are completely known. Our IUNPC and UINPC solutions, since they work conditionally on a set of sufficient statistics under one point of the null hypothesis, do not require any knowledge of nuisance parameters, and so are flexible enough to cope with most practical problems.
References
Anderson-Cook C, Borror C (2016) The difference between “equivalent” and “not different”. Qual Eng 28:249–262
Arboretti R, Carrozzo E, Caughey D (2015) A rankbased permutation test for equivalence and noninferiority. Ital J Appl Stat 25:81–92
Arboretti R, Carrozzo E, Pesarin F, Salmaso L (2017) A multivariate extension of union–intersection permutation solution for two-sample testing. J Stat Theory Pract 11:436–448
Arboretti R, Carrozzo E, Pesarin F, Salmaso L (2018) Testing for equivalence: an intersection–union permutation solution. Stat Biopharm Res 10:130–138
Berger R (1982) Multiparameter hypothesis testing and acceptance sampling. Technometrics 24:295–300
Berger RL, Hsu JC (1996) Bioequivalence trials, intersection–union tests and equivalence confidence sets. Stat Sci 11:283–319
Cox D, Hinkley D (1974) Theoretical statistics. Chapman and Hall, London
D’Agostino RB, Massaro JM, Sullivan LM (2003) Noninferiority trials: design concepts and issues—the encounters of academic consultants in statistics. Stat Med 22:169–186
Ferguson TS (1967) Mathematical statistics, a decision theoretic approach. Academic Press, New York
Food and Drug Administration (1998) Guidance for industry: E9 statistical principles for clinical trials. Food and Drug Administration, Rockville
Hirotsu C (2007) A unifying approach to noninferiority, equivalence and superiority tests via multiple decision processes. Pharm Stat 6:193–203
Hirotsu C (2017) Advanced analysis of variance. Wiley, Hoboken
Hoeffding W (1952) The largesample power of tests based on permutations of observations. Ann Math Stat 23:169–192
Hung H, Wang S (2009) Some controversial multiple testing problems in regulatory applications. J Biopharm Stat 19:1–11
Janssen A, Wellek S (2010) Exact linear rank tests for two-sample equivalence problems with continuous data. Stat Neerl 64:482–504
Lakens D (2017) Equivalence tests: a practical primer for t tests, correlations, and meta-analyses. Soc Psychol Pers Sci 8:355–362
Lehmann E (1986) Testing statistical hypotheses. Wiley, New York
Mehta CR, Patel NR, Tsiatis AA (1984) Exact significance testing to establish treatment equivalence with ordered categorical data. Biometrics 40:819–825
Nunnally J (1960) The place of statistics in psychology. Educ Psychol Meas 20:641–650
Pantsulaia G, Kintsurashvili M (2014) Why is the null hypothesis rejected for ‘almost every’ infinite sample by some hypothesis testing of maximal reliability. J Stat Adv Theory Appl 11:45–70
Pardo S (2014) Equivalence and noninferiority tests for quality, manufacturing and test engineers. Chapman & Hall/CRC, Boca Raton
Patterson S, Jones B (2017) Bioequivalence and statistics in clinical pharmacology, 2nd edn. Chapman & Hall/CRC, Boca Raton
Pesarin F (1990) On a nonparametric combination method for dependent permutation tests with applications. Psychother Psychosom 54:172–179
Pesarin F (1992) A resampling procedure for nonparametric combination of several dependent tests. J Ital Stat Soc 1:87–101
Pesarin F (2001) Multivariate permutation tests, with applications in biostatistics. Wiley, Chichester
Pesarin F (2015) Some elementary theory of permutation tests. Commun Stat Theory Methods 44:4880–4892
Pesarin F (2016) Permutation test: multivariate. In: Wiley StatsRef: statistics reference online. Wiley, Hoboken
Pesarin F, Salmaso L (2010) Permutation tests for complex data, theory, applications and software. Wiley, Chichester
Pesarin F, Salmaso L (2013) On the weak consistency of permutation tests. Commun Stat Simul Comput 42:1368–1397
Pesarin F, Salmaso L, Carrozzo E, Arboretti R (2014) Testing for equivalence and noninferiority: IU and UI tests within a permutation approach. JSM 2014, Section on Nonparametric Statistics
Pesarin F, Salmaso L, Carrozzo E, Arboretti R (2016) Unionintersection permutation solution for twosample equivalence testing. Stat Comput 26:693–701
Richter S, Richter C (2002) A method for determining equivalence in industrial applications. Qual Eng 14:375–380
Romano J (2005) Optimal testing of equivalence hypotheses. Ann Stat 33:1036–1047
Roy S (1953) On a heuristic method of test construction and its use in multivariate analysis. Ann Math Stat 24:220–238
Schuirmann D (1987) A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm 15:657–680
Schuirmann DL (1981) On hypothesis testing to determine if the mean of a normal distribution is contained in a known interval. Biometrics 37:617
Sen P (2007) Union–intersection principle and constrained statistical inference. J Stat Plan Inference 137:3741–3752
Sen P, Tsai M (1999) Two-stage likelihood ratio and union–intersection tests for one-sided alternatives multivariate mean with nuisance dispersion matrix. J Multivar Anal 68:264–282
Wellek S (2010) Testing statistical hypotheses of equivalence and noninferiority. Chapman & Hall/CRC, Boca Raton
Acknowledgements
The authors wish to thank the Editor, Associate Editor and Referees for their help in improving the manuscript.
Funding
Open access funding provided by Università degli Studi di Padova within the CRUI-CARE Agreement.
Cite this article
Arboretti, R., Pesarin, F. & Salmaso, L. A unified approach to permutation testing for equivalence. Stat Methods Appl (2020). https://doi.org/10.1007/s10260-020-00548-0
Keywords
 Intersection–union principle
 Multi-aspect testing
 Nonparametric combination
 Permutation tests
 Testing equivalence
 Two one-sided tests (TOST)
 Union–intersection principle