1 Introduction and motivation

Testing for equivalence (Eq) of two treatments is widely used in clinical trials, pharmaceutical experiments, bioequivalence, quality control, etc. In bioequivalence, for example, a potential risk arises if the bioequivalence of products is not well regulated and guaranteed. This paper addresses the crucial methodological step in testing for Eq and provides a unified framework for nonparametric testing within the permutation approach.

In the current literature there are two different, albeit dual or mirror-like, approaches to testing for Eq. The first, commonly adopted especially in bioequivalence and pharmacostatistics (Anderson-Cook and Borror 2016; Berger 1982; Berger and Hsu 1996; D’Agostino et al. 2003; Hirotsu 2007; Hung and Wang 2009; Lakens 2017; Mehta et al. 1984; Patterson and Jones 2017; Richter and Richter 2002; Wellek 2010), is derived from the intersection–union principle (IU) and its analysis is based mainly on likelihood techniques, which in turn are rather difficult to deal with, or even unavailable outside the regular exponential family (Lehmann 1986). As far as we know, the only paper on IU based on permutation methods is Arboretti et al. (2018). The other approach (Arboretti et al. 2017; Pesarin et al. 2014, 2016) is based on Roy’s (1953) union–intersection principle (UI), which is also difficult to deal with using likelihood techniques (Sen 2007; Sen and Tsai 1999). The two approaches essentially differ in terms of the roles assigned to the null and alternative hypotheses. In this paper we start with a simple description of both, before introducing the related permutation solutions. We then provide a few sampling inspection plans and an application to a bioequivalence case study (two further case studies are provided in the Supplementary Material). In the final sections, after exploring the limiting behavior of the permutation solutions, we discuss the most important requirements and pitfalls of both parametric and nonparametric permutation-based approaches, before drawing our conclusions. The main aim of this paper is to provide the reader with methodological insights and suggestions for making the most suitable choices in Eq testing, whatever the underlying population distribution, the sample sizes and the margins.

2 On intersection–union and union–intersection approaches

With reference to one endpoint variable X and a two-sample design, to draw inferences on the substantial Eq of a comparative treatment A to a new treatment B, the IU approach consists in checking whether the effect \(\delta _{A}\) of A lies in a clinically, biologically or technically unimportant interval around \(\delta _{B}\) of B, i.e. testing the non-equivalence (NEq) null \(\ H:[(\delta _{A}\le \delta _{B}-\varepsilon _{I}) {\textstyle \bigcup } (\delta _{A}\ge \delta _{B}+\varepsilon _{S})]\) versus (V.s) the Eq alternative \(K:(\delta _{B}-\varepsilon _{I}<\delta _{A}<\delta _{B} +\varepsilon _{S}),\) where \(\varepsilon _{I}>0\) and \(\varepsilon _{S}>0\) are the inferior (lower) and superior (upper) margins for the difference \(\delta =\delta _{A}-\delta _{B}\), respectively; these margins are established by biological, clinical, pharmacological, technical or regulatory arguments and not by purely statistical considerations. Focusing on the multi-aspect nature of the problem (Berger 1982; Berger and Hsu 1996; Schuirmann 1981, 1987), these hypotheses can be equivalently stated as \(\ H\equiv H_{I}\bigcup H_{S}\) and \(\ K\equiv K_{I}\bigcap K_{S}\), where \(\ H_{I}:\delta \le -\varepsilon _{I},\) \(\ K_{I}:\delta >-\varepsilon _{I},\) \(\ H_{S}:\delta \ge \varepsilon _{S},\) and \(\ K_{S}:\delta <\varepsilon _{S}\) are the partial one-sided sub-hypotheses into which H and K are equivalently broken down. In actual fact, H is true if one and only one of \(H_{I}\) and \(H_{S}\) is true; K is true when both sub-alternatives \(K_{I} \) and \(K_{S}\) are jointly true. Accordingly, H is retained if at least one of two suitable partial test statistics, \(T_{I}\) for \(H_{I}\) V.s \(K_{I}\) and \(T_{S}\) for \(H_{S}\) V.s \(K_{S},\) retains the respective sub-null. The alternative K is retained if and only if the two sub-alternatives \(K_{I}\) and \(K_{S}\) are jointly retained. So, the overall (global) solution, \(T_{G}\) say, has to be based (Berger 1982; Schuirmann 1981) on a suitable combination of two one-sided tests (TOST).

The UI approach considers the Eq null \(\ {\tilde{H}}:(-\varepsilon _{I} \le \delta \le \varepsilon _{S})\) that \(\delta\) lies inside the Eq interval and the NEq alternative \(\ {\tilde{K}}:[(\delta <-\varepsilon _{I}) {\textstyle \bigcup } (\delta >\varepsilon _{S})]\) that \(\delta\) lies outside it. By using \({\tilde{H}}_{I}:\delta \ge -\varepsilon _{I}\) V.s \({\tilde{K}}_{I}:\delta <-\varepsilon _{I}\) and \({\tilde{H}}_{S}:\delta \le \varepsilon _{S}\) V.s \({\tilde{K}}_{S}:\delta >\varepsilon _{S}\) to denote the two one-sided sub-hypotheses into which the problem can be broken down, according to Roy (1953) we may equivalently state \({\tilde{H}}\equiv {\tilde{H}}_{I}\bigcap \tilde{H}_{S}\) and \({\tilde{K}}\equiv {\tilde{K}}_{I}\bigcup {\tilde{K}}_{S}\). That is, the null \({\tilde{H}}\) is true if both one-sided sub-null hypotheses \({\tilde{H}}_{I}\) and \({\tilde{H}}_{S}\) are jointly true, and \({\tilde{K}}\) is true if one and only one of the two sub-alternatives \({\tilde{K}}_{I}\) and \({\tilde{K}} _{S}\) is true. It is worth noting that UI, having inverted the roles of null and alternative, is effectively a mirrored formulation of IU. Of course, the global UI solution \({\tilde{T}}_{G}\) implies a suitable combination of two partial test statistics \({\tilde{T}}_{I}\) and \({\tilde{T}}_{S}\). In Arboretti et al. (2017, 2018), Pesarin et al. (2016), Sen (2007) and Wellek (2010) it is seen that the combinations of \(T_{I}\) and \(T_{S}\) for IU and of \({\tilde{T}}_{I}\) and \({\tilde{T}}_{S}\) for UI are the crucial methodological points at issue for obtaining proper solutions (Pesarin 2001, 2015, 2016; Pesarin and Salmaso 2010; Sen 2007; Sen and Tsai 1999; see also the Supplementary Material).

It is important to highlight that in order to obtain a valid global solution \(T_{G}\), the IU approach requires the researcher to set its maximum type I error rate no larger than \(\alpha\) and the maximum type II error rate \(\beta\) no larger than \(1-\alpha\); i.e.

$$\begin{aligned} \mathrm{a)}\;\;{\mathbf {E}}_{F}(\phi _{G},\delta )\le \alpha \;\;\forall \,\delta \in H\quad {\mathrm{and}}\quad \mathrm{b)}\;\;1-{\mathbf {E}}_{F}(\phi _{G},\delta )\le 1-\alpha \;\;\forall \,\delta \in K, \end{aligned}$$

where \(\phi _{G}\) is the indicator function for the rejection region of the \(T_{G}\) global test and \({\mathbf {E}}_{F}(\cdot )\) the mean value of \((\cdot )\) with respect to the underlying data distribution F. Correspondingly, with clear meanings of the symbols, to obtain a valid UI global test \({\tilde{T}}_{G}\) the researcher must set its maximum type I error rate and maximum type II error rate as:

$$\begin{aligned} {\tilde{a}})\;\;{\mathbf {E}}_{F}({\tilde{\phi }}_{G},\delta )\le \alpha \;\;\forall \,\delta \in {\tilde{H}}\quad {\mathrm{and}}\quad {\tilde{b}})\;\;1-{\mathbf {E}}_{F}({\tilde{\phi }}_{G},\delta )\le 1-\alpha \;\;\forall \,\delta \in {\tilde{K}}. \end{aligned}$$

These conditions, which concern inferential unbiasedness, are necessarily required of any test statistic (Lehmann 1986).

In the literature on the subject almost all authors apparently assume that regulatory agencies (e.g. FDA, EMEA, etc.) consider only the IU approach for testing Eq. For instance, the ICH-E9 glossary (Food and Drug Administration 1998) defines an Eq trial as: “A trial with the primary objective of showing that the response to two or more treatments differs by an amount which is clinically unimportant. That is usually demonstrated by showing that the true treatment difference is likely to lie between a lower and an upper equivalence margin of clinically acceptable differences.” This definition, however, does not contain sufficiently precise methodological indications as to which of the two formulations, the IU \((H,K)\) or the UI \(({\tilde{H}},{\tilde{K}})\), is to be chosen, since there are circumstances where one or the other is rationally suitable for the testing problem at hand. We will see that the two share the same asymptotic behavior; this is not the case for finite sample sizes, where quite important differences emerge, as will be seen in this paper.

Consequently, in any practical situation the researcher must choose which of \((H,K)\) and \(({\tilde{H}},{\tilde{K}})\) is most suitable for the proper analysis of his/her problem. We think that such an option, although not well emphasized in the literature on hypothesis testing, is common to almost all testing situations. A simple example clarifies this point: let us consider the classic problem of two simple hypotheses, with point values \(\theta _{A}\ne \theta _{B}\), say \(\theta _{A}<\theta _{B}\). According to the Neyman-Pearson lemma, the best test for \(H:\theta =\theta _{A}\) V.s \(K:\theta =\theta _{B}\) rejects H when \(T\ge T_{\alpha },\) where the critical value \(T_{\alpha }\) is determined by the distribution of the likelihood ratio under H. On the other hand, the best test for \({\tilde{H}}:\theta =\theta _{B}\) V.s \({\tilde{K}}:\theta =\theta _{A}\) rejects \({\tilde{H}}\) when \({\tilde{T}}\le -{\tilde{T}}_{\alpha },\) where \({\tilde{T}}_{\alpha }\) is determined by the likelihood ratio distribution under \({\tilde{H}}.\) So, the duality between the two alternative formulations is evident. The researcher is therefore required to explicitly decide between \((H,K)\) and \(({\tilde{H}},{\tilde{K}})\); i.e. he/she has to justify which formulation is given the role of null hypothesis, together with its maximum rejection rate \(\alpha\) when true, so as to strictly control both type I and type II inferential errors with \(\beta \le 1-\alpha .\) We believe that no researcher can escape this central necessity. Since both ways are rationally appropriate for Eq testing, this supports our purpose of providing a sort of weak comparative (parallel) analysis of both, highlighting their respective requirements, properties, limitations, difficulties, pitfalls and inferential costs. It has to be stated, however, that the two dual formulations reverse the roles of the respective inferential risks: what acts as the type I error for \((H,K)\) plays the role, not just takes the related numerical value, of the type II error for \(({\tilde{H}},{\tilde{K}})\), and vice versa.

Some authors emphasize the general problem that any traditional two-sided consistent test rejects a point null hypothesis with a probability close to one for sufficiently large sample sizes, even for practically negligible violations of the null. For instance, Nunnally (1960) says: “To minimize type II errors, large samples are recommended. In psychology, practically all null hypotheses are claimed to be false for sufficiently large samples so (...) it is nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis”. According to Pantsulaia and Kintsurashvili (2014) the same concept is expressed by more than 200 authors. Clearly, to go beyond this limitation of two-sided tests, the suggestion is to consider a null hypothesis made up of an interval of substantially equivalent points, rather than a single point. As a result, the hypotheses of any traditional two-sided test are written as \((\tilde{H},{\tilde{K}})\). Such a formulation then has its own specific merits, in spite of the fact that it is not adequately considered in the general literature (Wellek 2010, pp. 355–358, considers some likelihood-based hints). Up to now we have dedicated two papers to this formulation: Arboretti et al. (2017) for the one-dimensional setting and Pesarin et al. (2016) for the multidimensional setting. It should, however, be emphasized that finding proper workable solutions requires going beyond the limitations of likelihood ratio approaches and working within a nonparametric framework, specifically permutation theory and the nonparametric combination (NPC) of dependent permutation tests.

It could be argued that the widespread use of the IU approach is due not so much to rational analysis as to the fact that, under a set of very stringent conditions (Lehmann 1986; Romano 2005), a uniformly most powerful unbiased test exists, namely \(T_{G}^{opt}\), and this result is merely extended, by simple analogy, to all Eq problems. It will be seen that such an extension outside those conditions may have several quite severe and intriguing consequences.

3 Intersection–union and union–intersection permutation tests

Without loss of generality and for the sake of simplicity, we illustrate the proposed methodology with reference to a two-sample design and a one-dimensional endpoint variable X. To stay within the permutation theory and the nonparametric combination of dependent permutation tests (NPC), let us assume that a sample of \(n_{1}\) IID data related to treatment A is drawn from \(X_{1}\) and, independently, a sample of \(n_{2}\) IID data related to treatment B is drawn from \(X_{2}\). This setting is generally obtained when \(n_{1}\) units out of \(n=n_{1}+n_{2}\) are randomly assigned to A and the remaining \(n_{2}\) to B. We define the responses as \(X_{1}=X+\delta _{A}\) and \(X_{2}=X+\delta _{B},\) where the underlying variable X,  whose distribution is F,  is common to both populations. Hence, \({\mathbf {X}}_{1}=(X_{11},\ldots ,X_{1n_{1}})\) are the data of sample A and \({\mathbf {X}}_{2}=(X_{21},\ldots ,X_{2n_{2}})\) those of sample B. So, the pooled data set is \({\mathbf {X}}=({\mathbf {X}}_{1} ,{\mathbf {X}}_{2})=\{X(i),i=1,\ldots ,n;n_{1},n_{2}\},\) where the last notation means that the first \(n_{1}\) data in the list are from the first sample and the rest from the second. Moreover, we assume that, possibly after suitable data transformations to obtain quasi symmetry of the data [e.g. \(\log (\cdot ),\) \(\sqrt{(\cdot )},\) \(Rank(\cdot )\), etc.; see also point UI.3 in Sect. 7.3], variable X has a finite mean value, i.e. \({\mathbf {E}}_{F}(|X|)<\infty ,\) so that consistent permutation tests based on comparisons of sample means can be used (Sen 2007; Pesarin 2015; Pesarin and Salmaso 2013).
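As a purely illustrative sketch of this data model (the function name, the generator and the default standard normal choice for F are our assumptions, not part of the original design), the fixed additive-effects setting can be simulated as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_two_sample(n1, n2, delta_A, delta_B, draw_X=rng.standard_normal):
    """Sketch of the fixed additive-effects model X1 = X + delta_A, X2 = X + delta_B,
    with a common underlying variable X whose distribution F is here standard
    normal by default (an assumption made only for illustration)."""
    x1 = draw_X(n1) + delta_A          # responses of the n1 units assigned to A
    x2 = draw_X(n2) + delta_B          # responses of the n2 units assigned to B
    pooled = np.concatenate([x1, x2])  # pooled data set X = (X1, X2)
    return x1, x2, pooled
```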

It is assumed that the two effects \(\delta _{A}\) and \(\delta _{B}\) are fixed and that the data are homoscedastic. In further research we will extend our permutation theory to random effects, that is to a condition compatible with important forms of heteroscedasticity, as frequently met in experimental and observational problems where a treatment can modify not only the mean but also the dispersion or other aspects of the distribution.

In this context, both IU and UI approaches are in practice worked out by considering two partial tests for each way: one for \(H_{I}\) V.s \(K_{I}\) and one for \(H_{S}\) V.s \(K_{S}\) for IU; one for \({\tilde{H}}_{I}\) V.s \({\tilde{K}}_{I}\) and one for \({\tilde{H}}_{S}\) V.s \({\tilde{K}}_{S}\) for UI.

The two IU partial tests we consider have the (non-standardized) form:

$$\begin{aligned} T_{I}=({\bar{X}}_{2}+\varepsilon _{I})-{\bar{X}}_{1} \;\,{\mathrm{and}}\;\;T_{S} ={\bar{X}}_{1}-({\bar{X}}_{2}-\varepsilon _{S}), \end{aligned}$$

and correspondingly, the two UI partial tests are:

$$\begin{aligned} {\tilde{T}}_{I}={\bar{X}}_{1}-({\bar{X}}_{2}+\varepsilon _{I})\;\,{\mathrm{and}}\;\;{\tilde{T}}_{S}=({\bar{X}}_{2}-\varepsilon _{S})-{\bar{X}}_{1}, \end{aligned}$$

where, as usual, \({\bar{X}}_{j}=\sum _{1\le i\le n_{j}}X_{ji}/n_{j},\) \(j=1,2,\) are the sample averages. It is worth noting that \(T_{I}=-{\tilde{T}}_{I}\) and \(T_{S}=-{\tilde{T}}_{S}\) and that large values of each test are evidence for their respective sub-alternatives. Also worth noting is that the IU pair \((T_{I},T_{S})\), as well as the UI pair \(({\tilde{T}}_{I},{\tilde{T}}_{S}),\) are functions of essentially the same data \({\mathbf {X}}\), so the two tests in each pair are negatively dependent (Pesarin 2016; Pesarin et al. 2016).
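To fix ideas, a minimal sketch (function and variable names are ours) computing the four non-standardized partial statistics on given samples, and exhibiting the sign relations \(T_{I}=-{\tilde{T}}_{I}\) and \(T_{S}=-{\tilde{T}}_{S}\), could read:

```python
import numpy as np

def partial_statistics(x1, x2, eps_I, eps_S):
    """IU partial statistics T_I, T_S and UI partial statistics ~T_I, ~T_S
    for samples x1 (treatment A), x2 (treatment B) and margins eps_I, eps_S > 0."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    t_I = (x2.mean() + eps_I) - x1.mean()   # T_I: large values support K_I
    t_S = x1.mean() - (x2.mean() - eps_S)   # T_S: large values support K_S
    return t_I, t_S, -t_I, -t_S             # ~T_I = -T_I and ~T_S = -T_S
```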

One major problem related to both IU and UI, that also arises when several test statistics are functions of the same data, is what to do with such a multiplicity of dependent partial tests. In this regard, a meaningful warning by Sen (2007) relating to UI says: “However, computational and distributional complexities may mar the simple appeal of the UI to a certain extent. (...) The crux of the problem is however to find the distribution theory for the maximum of these possibly correlated statistics. Unfortunately, this distribution depends on the unknown F, even under the null hypothesis. (...) An easy way to eliminate this impasse is to take recourse to the permutation distribution theory (...)”. The same warning applies to the IU.

We partially disagree with this warning. The greatest obstacle to achieving suitable working solutions is finding a general method to cope with the overly complex dependence structure of two partial tests \((T_{I},T_{S})\) for IU and \(({\tilde{T}}_{I},{\tilde{T}}_{S})\) for UI. They are negatively dependent and their dependence coefficients depend on underlying F,  data \({\mathbf {X}}\) and margins \((\varepsilon _{I},\varepsilon _{S})\). Indeed, such a dependence runs from correlation \(\rho =-1\), for margins \(\varepsilon _{I}=\varepsilon _{S}=0\), to almost practical independence for sufficiently large margins. Quite a general solution can be validly obtained when it is possible to deal with that dependence nonparametrically.

Moreover, in multidimensional problems, such a dependence is much more complex than pair-wise linear. So it seems impossible to deal with it by proper estimates of all associated dependence coefficients, the number and type of which are typically unknown. Thus, this dependence must be worked out nonparametrically within a well-suited theory. This requires adopting the conditionality principle of inference by conditioning on data \({\mathbf {X}}\) (which under the null are always sufficient), i.e. by the permutation testing principle (Pesarin 2015) and, more importantly, by the NPC of dependent permutation tests (Pesarin 1990, 1992, 2001, 2015, 2016; Pesarin and Salmaso 2010, see also the Supplementary Material).

It is worth noting that to stay within the permutation theory, i.e. by permuting the n-dimensional data \({\mathbf {X}}\), we have to consider permuted data associated with permutations \({\mathbf {u}}^{*}=(u_{1}^{*} ,\ldots ,u_{n}^{*})\) of the unit labels \({\mathbf {u}}=(1,\ldots ,n)\). Thus, all test statistics are calculated on the corresponding data permutations \({\mathbf {X}}^{*}=\{X(u_{i}^{*}),i=1,\ldots ,n;n_{1},n_{2}\},\) where the two permuted samples are \({\mathbf {X}}_{1}^{*}=\{X(u_{i}^{*}),i=1,\ldots ,n_{1}\}\) and \({\mathbf {X}}_{2}^{*}=\{X(u_{i}^{*}),i=n_{1}+1,\ldots ,n\}\), respectively.

Our proposal is to separately test, albeit simultaneously, \(H_{I}\) V.s \(K_{I}\) and \(H_{S}\) V.s \(K_{S}\) for IU, and \({\tilde{H}}_{I}\) V.s \({\tilde{K}}_{I}\) and \({\tilde{H}}_{S}\) V.s \({\tilde{K}}_{S}\) for UI.

To test for \(H_{I}\) V.s \(K_{I}\) let us consider the statistic \(T_{I}=\bar{X}_{I2}-{\bar{X}}_{I1},\) where the data \({\mathbf {X}}_{2}\) of sample B are modified to \({\mathbf {X}}_{I2}={\mathbf {X}}_{2}+\varepsilon _{I}\) while those of sample A are retained as they are, i.e. \({\mathbf {X}}_{I1}={\mathbf {X}}_{1}.\) Correspondingly, to test for \(H_{S}\) V.s \(K_{S}\) we use the statistic \(T_{S}={\bar{X}}_{S1}-{\bar{X}}_{S2},\) where \({\mathbf {X}}_{S1}={\mathbf {X}}_{1}\) and \({\mathbf {X}}_{S2}={\mathbf {X}}_{2}-\varepsilon _{S}\). Thus, the global test is given by one of the IU-NPC combinations, the simplest and most effective of which is:

$$\begin{aligned} T_{G}=\min (T_{I},T_{S})\equiv \max (\lambda _{I},\lambda _{S}), \end{aligned}$$

where \(\lambda _{h}\) is the so-called p value statistic for \(T_{h},\) \(h=I,S\).

Correspondingly, to test for \({\tilde{H}}_{I}\) V.s \({\tilde{K}}_{I}\) and \({\tilde{H}}_{S}\) V.s \({\tilde{K}}_{S}\) we use the two statistics \({\tilde{T}} _{I}={\bar{X}}_{I1}-{\bar{X}}_{I2}=-T_{I}\) and \({\tilde{T}}_{S}={\bar{X}}_{S2}-\bar{X}_{S1}=-T_{S},\) and so \({\tilde{T}}_{G}\) is given by their UI-NPC:

$$\begin{aligned} {\tilde{T}}_{G}=\max ({\tilde{T}}_{I},{\tilde{T}}_{S})\equiv \min ({\tilde{\lambda }}_{I},{\tilde{\lambda }}_{S}). \end{aligned}$$

According to the general theory (Lehmann 1986; Romano 2005; Wellek 2010), for \(T_{G}\) to be unbiased with the IU-NPC, it is required that conditions a) and b) are both satisfied, thus partial critical values must be calibrated so that the global test \(T_{G}\) satisfies \(\alpha\) at both extremes of K,  that is

$$\begin{aligned} \alpha ^{c}={\mathbf {E}}_{F}(\phi _{h},\delta =\varepsilon _{h}) \;\; {\mathrm{s.t.}}\;\;{\ }{\mathbf {E}}_{F}(\phi _{G},\delta =\varepsilon _{h} )=\alpha ,\ \;\;{\mathrm{at}}\;\; \varepsilon _{h}=-\varepsilon _{I},\varepsilon _{S}, h=I,S; \end{aligned}$$

analogously for \({\tilde{T}}_{G}\) to be unbiased with the UI-NPC (Arboretti et al. 2018), it is required that conditions ã) and \({\tilde{b}}\)) are satisfied, thus partial critical values must be calibrated so that \({\tilde{T}}_{G}\) satisfies \(\alpha\) at both extremes of \({\tilde{H}},\) that is

$$\begin{aligned} {\tilde{\alpha }}^{c}={\mathbf {E}}_{F}({\tilde{\phi }}_{h},\delta =\varepsilon _{h}) \quad {\mathrm{s.t.}}\quad \ {\mathbf {E}}_{F}({\tilde{\phi }}_{G},\delta =\varepsilon _{h})=\alpha ,\,\, {\mathrm{at}}\,\, \varepsilon _{h}=-\varepsilon _{I},\varepsilon _{S},\,\, h=I,S, \end{aligned}$$

where \(\phi _{h},\) \({\tilde{\phi }}_{h}\), \(\phi _{G},\) \({\tilde{\phi }} _{G},\) are the indicator functions of rejection regions of concerned tests.

It is worth noting that the partial critical values \(C_{I\alpha }\) and \(C_{S\alpha }\) of parametric tests, which depend on the distribution F,  the sample size n and the margins \((\varepsilon _{I},\varepsilon _{S}),\) according to Lehmann (1986) and Wellek (2010) have to be numerically determined (see also Sect. 7.1). Essentially, these values can coincide only asymptotically with the standard critical values (e.g. \(z_{\alpha }\) or \(t_{\alpha }\)) in use with traditional two-sided tests. Thus, in our terminology, they too must be calibrated.

In some literature, the non-calibrated IU-TOST (naive) solution \({\ddot{T}}_{G}\) is often considered (e.g. Anderson-Cook and Borror 2016; Berger and Hsu 1996; Lakens 2017; Pardo 2014; Patterson and Jones 2017; Richter and Richter 2002). This solution satisfies condition a) but not b), thus it is far from being unbiased, unless sample sizes and/or margins are sufficiently large (e.g. Sect. 5 and Supplementary Material).

When optimal likelihood solutions \(T_{G}^{opt}\) and \({\tilde{T}}_{G}^{opt}\) are available, then under their conditions and for diverging sample sizes we have \(T_{G}\rightarrow T_{G}^{opt}\) and \({\tilde{T}}_{G}\rightarrow {\tilde{T}} _{G}^{opt}\) at quite a high rate (Hoeffding 1952).

Computational details and related algorithms are given in Arboretti et al. (2018) for the IU-NPC and in Pesarin et al. (2016) for the UI-NPC (see also the Supplementary Material). Of course, by using \(T_{G}^{ob}=T_{G}({\mathbf {X}})\) and \({\tilde{T}}_{G}^{ob}={\tilde{T}}_{G}({\mathbf {X}})\) to denote the observed values of the test statistics \(T_{G}\) and \({\tilde{T}}_{G},\) respectively, if the p value statistic of the IU-NPC test \(T_{G}\) satisfies \(\lambda _{T_{G}}=\Pr \{T_{G}^{*}\ge T_{G}^{ob}|{\mathbf {X}}\}\le \alpha ^{c},\) then the NEq hypothesis H is rejected at significance level \(\alpha\) (the naive IU-TOST \({\ddot{T}}_{G}\) rejects H if \(\lambda _{T_{G}} \le \alpha\); so its true type I error remains unknown, depending on F,  the data \({\mathbf {X}}\) and the margins \(\varepsilon _{I}\), \(\varepsilon _{S}\)). Correspondingly, if the UI-NPC test \({\tilde{T}}_{G}\) gives \(\lambda _{\tilde{T}_{G}}=\Pr \{{\tilde{T}}_{G}^{*}\ge {\tilde{T}}_{G}^{ob}|{\mathbf {X}} \}\le {\tilde{\alpha }}^{c},\) then the Eq hypothesis \({\tilde{H}}\) is rejected at significance level \(\alpha\). In practice, p value statistics are estimated, at any desired confidence rate, by a conditional Monte Carlo procedure as: \({\hat{\lambda }}_{h}=\#[T_{h}({\mathbf {X}}^{*})\ge T_{h}^{ob}\mid {\mathbf {X}}]/R,\) where \(T_{h}\) stands for \(T_{I},T_{S},T_{G},{\tilde{T}}_{I},{\tilde{T}}_{S},{\tilde{T}}_{G}\) and R is the number of random permutations.
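As an illustration only (function and variable names are ours, and the data-shifting construction follows the description above), the following sketch estimates the partial and global p value statistics by conditional Monte Carlo, using the combining forms \(T_{G}\equiv \max (\lambda _{I},\lambda _{S})\) and \({\tilde{T}}_{G}\equiv \min ({\tilde{\lambda }}_{I},{\tilde{\lambda }}_{S})\) stated earlier; the calibrated thresholds \(\alpha ^{c}\) and \({\tilde{\alpha }}^{c}\) must still be obtained separately (Sect. 5).

```python
import numpy as np

rng = np.random.default_rng(1)

def npc_equivalence_pvalues(x1, x2, eps_I, eps_S, R=2500):
    """Conditional Monte Carlo sketch of the IU-NPC and UI-NPC p value statistics.

    x1, x2       : samples for treatments A and B
    eps_I, eps_S : lower and upper equivalence margins (> 0)
    R            : number of random permutations of the unit labels
    """
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    # Shifted pooled data sets, one per one-sided sub-problem (Sect. 3)
    xI = np.concatenate([x1, x2 + eps_I])   # used for T_I and its UI counterpart
    xS = np.concatenate([x1, x2 - eps_S])   # used for T_S and its UI counterpart

    def t_I(z):  # T_I = mean of (shifted) sample B minus mean of sample A
        return z[n1:].mean() - z[:n1].mean()

    def t_S(z):  # T_S = mean of sample A minus mean of (shifted) sample B
        return z[:n1].mean() - z[n1:].mean()

    tI_obs, tS_obs = t_I(xI), t_S(xS)
    tI_star, tS_star = np.empty(R), np.empty(R)
    for r in range(R):
        u = rng.permutation(n1 + n2)        # random permutation u* of the unit labels
        tI_star[r] = t_I(xI[u])
        tS_star[r] = t_S(xS[u])

    # Partial p value statistics: lambda_h = #[T_h(X*) >= T_h^ob | X]/R
    lam_I = np.mean(tI_star >= tI_obs)
    lam_S = np.mean(tS_star >= tS_obs)
    lam_I_ui = np.mean(tI_star <= tI_obs)   # since ~T_I = -T_I
    lam_S_ui = np.mean(tS_star <= tS_obs)   # since ~T_S = -T_S

    lam_G_iu = max(lam_I, lam_S)            # IU-NPC: reject NEq H if <= alpha^c
    lam_G_ui = min(lam_I_ui, lam_S_ui)      # UI-NPC: reject Eq ~H if <= alpha~^c
    return lam_G_iu, lam_G_ui
```

With the naive IU-TOST, the same `lam_G_iu` would simply be compared with \(\alpha\) instead of with the calibrated \(\alpha ^{c}\).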

4 NPC limiting behavior for IU and UI

Let us assume that population mean \({\mathbf {E}}_{F}(X)\) is finite, so that \({\mathbf {E}}({\bar{X}}^{*}|{\mathbf {X}})\) is also finite for almost all sample data \({\mathbf {X}},\) where \({\bar{X}}^{*}\) is the sample mean of a without replacement random sample of \(n_{1}\) or \(n_{2}\) elements from \({\mathbf {X}},\) taken as a finite population.

To find the limiting behavior of IU-NPC let us firstly consider the partial test \(T_{S}^{*}(\delta )={\bar{X}}_{S1}^{*}-{\bar{X}}_{S2}^{*},\) where its dependence on effect \(\delta\) is emphasized. In Sen (2007) and Pesarin and Salmaso (2013), based on the law of large numbers for strictly stationary dependent sequences, such as are those generated by the without replacement random sampling (any random permutation is simply a without replacement random sample from \({\mathbf {X}} _{S})\), it is proved that as \(\min (n_{1},n_{2})\rightarrow \infty \) the permutation distribution of \(T_{S}^{*}(\delta )\) weakly converges to \({\mathbf {E}}_{F}({\bar{X}}_{S1}-{\bar{X}}_{S2})=(\varepsilon _{S}-\delta )\).

Thus, for any \(\delta <\varepsilon _{S}\) the rejection rate of \(T_{S}(\delta )\) converges to one: \({\mathbf {E}}_{F}(\phi _{T_{S}},\delta )\rightarrow 1\). Moreover, for any \(\delta >\varepsilon _{S}\) that rejection rate converges to zero. At the extreme point of \(H_{S},\) \(\delta =\varepsilon _{S}\) say, since \(T_{S} (\varepsilon _{S})\) rejects with probability \(\alpha\) for any sample sizes (\(n_{1},n_{2}),\) its limiting rejection rate is also \(\alpha\).

The behavior of \(T_{I}(\delta )\) mirrors that of \(T_{S}(\delta )\). That is, its limiting rejection rate: i) for \(\delta =-\varepsilon _{I}\) is \(\alpha ;\) ii) for \(\delta <-\varepsilon _{I}\) is zero; iii) for \(\delta >-\varepsilon _{I}\) is one.

In the global alternative \(K:(-\varepsilon _{I}<\delta <\varepsilon _{S}),\) since both permutation tests \(T_{I}\) and \(T_{S}\) are jointly consistent, the global test \(T_{G}\) is consistent too (Pesarin 2001, 2016; Pesarin and Salmaso 2010), that is \({\mathbf {E}} _{F}(\phi _{T_{G}},\delta )\rightarrow 1\). Correspondingly, for every \((\delta <-\varepsilon _{I}) {\textstyle \bigcup } (\delta >\varepsilon _{S})\) the limiting rejection rate is \({\mathbf {E}}_{F} (\phi _{T_{G}},\delta )\rightarrow 0.\) Moreover, at the extreme points of H,  i.e. when \(\delta\) equals either \(-\varepsilon _{I}\) or \(\varepsilon _{S}\) (only one of which can hold, provided at least one margin differs from zero), the limiting rejection rate of \(T_{G}\) is \(\alpha .\) Finally, if \(\varepsilon _{I}=\) \(\varepsilon _{S}=0,\) this rejection rate is not defined for any sample size.

To find the limiting behavior of the UI-NPC, let us analogously consider \({\tilde{T}}_{S}^{*}(\delta )={\bar{X}}_{S2}^{*}-{\bar{X}}_{S1}^{*}\). Since \(\min (n_{1},n_{2})\rightarrow \infty\) implies that the permutation distribution of \({\tilde{T}}_{S}^{*}(\delta )\) weakly converges to \({\mathbf {E}}_{F}({\bar{X}} _{S2}-{\bar{X}}_{S1})=(\delta -\varepsilon _{S}),\) for any \(\delta >\varepsilon _{S}\) the rejection rate of \({\tilde{T}}_{S}(\delta )\) converges to one. Moreover, for any \(\delta <\varepsilon _{S}\) its rejection rate converges to zero. At the right extreme \(\delta =\varepsilon _{S}\), since for any sample sizes \({\tilde{T}}_{S}(\varepsilon _{S})\) rejects with probability \(\alpha ,\) its limiting rejection rate is also \(\alpha\).

The behavior of \({\tilde{T}}_{I}(\delta )\) mirrors that of \({\tilde{T}}_{S} (\delta )\). That is, the limiting rejection rate: i) for \(\delta =-\varepsilon _{I}\) is \(\alpha ;\) ii) for \(\delta >-\varepsilon _{I}\) is zero; iii) for \(\delta <-\varepsilon _{I}\) is one.

In the global alternative \({\tilde{K}}:(\delta <-\varepsilon _{I}) {\textstyle \bigcup } (\delta >\varepsilon _{S})\) since one and only one of \({\tilde{T}}_{I}\) and \({\tilde{T}}_{S}\) is consistent, then \({\tilde{T}}_{G}\) is consistent too (Pesarin 2001; Pesarin and Salmaso 2010; Pesarin et al. 2016).

5 A simple analysis

If the underlying data distribution F is completely known, the calibrated values \(\alpha ^{c}\) and \({\tilde{\alpha }}^{c}\) that yield a global type I error rate of \(\alpha\) for the IU-NPC \(T_{G}\) and the UI-NPC \(\tilde{T}_{G}\) can be determined via Monte Carlo simulations, as is done in Arboretti et al. (2018) for \(T_{G}\) and in Pesarin et al. (2016) for \({\tilde{T}}_{G}\).
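As a rough sketch only (the boundary-simulation shortcut and the quantile step below are our reading of the calibration condition in Sect. 3; the reference algorithms are those of Arboretti et al. 2018 and the Supplementary Material), a Monte Carlo determination of \(\alpha ^{c}\) for a completely specified F, here standard normal, could proceed as follows; an analogous sketch applies to \({\tilde{\alpha }}^{c}\).

```python
import numpy as np

rng = np.random.default_rng(2)

def calibrate_alpha_c_iu(n1, n2, eps, alpha=0.05, MC=5000, R=2500):
    """Monte Carlo sketch of the calibrated level alpha^c for the IU-NPC test T_G,
    assuming X ~ N(0,1) and symmetric margins eps_I = eps_S = eps (our assumptions).

    Data are generated at the boundary delta = eps, where the rejection rate of the
    global test must equal alpha; alpha^c is then the threshold on
    lambda_G = max(lambda_I, lambda_S) attaining that rate."""
    lam_G = np.empty(MC)
    for m in range(MC):
        x1 = rng.normal(eps, 1.0, n1)    # treatment A shifted to the boundary delta = eps
        x2 = rng.normal(0.0, 1.0, n2)    # treatment B
        # reuses npc_equivalence_pvalues from the sketch in Sect. 3
        lam_G[m], _ = npc_equivalence_pvalues(x1, x2, eps, eps, R)
    return np.quantile(lam_G, alpha)     # boundary rejection rate equals alpha at this cutoff
```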

The algorithms for IU-NPC and UI-NPC used to determine the calibrated \(\alpha ^{c}\) and \({\tilde{\alpha }}^{c}\) (see also the Supplementary Material) can also be used to establish the designs \(n_{1}=n_{2}\) and \({\tilde{n}}_{1}={\tilde{n}}_{2}\) such that the maximum power satisfies \(W_{T_{G}}(0;n,\varepsilon )=p\) and \(W_{{\tilde{T}}_{G} }(\pm 2\varepsilon ;{\tilde{n}},\varepsilon )=p\) at standardized margins \(\varepsilon _{I}=\varepsilon _{S}=\varepsilon\), based on the calibrated \(\alpha ^{c}\) and \({\tilde{\alpha }}^{c}\), respectively. The choice of designs at \(\delta =0\) for \(T_{G}\) and at \(\delta =\pm 2\varepsilon\) for \(\tilde{T}_{G}\) lies in the fact that these values are equally far from H and \({\tilde{H}},\) respectively, and so their power behaviors are comparable (Wellek 2010, Chapter 11).

Assuming \(X\sim N(0,1)\) [\(\sigma\) unknown], \(\alpha =0.05,\) \(p=(0.80,~0.50),\) Table 1 contains a few designs obtained by \(MC=5000\) Monte Carlo runs, each with \(R=2500\) random permutations, for both IU-NPC and UI-NPC.

Table 1 Calculations of sample sizes for IU and UI

Taking the point \(\varepsilon =0.60\) as a pivot for the approximate sample sizes at \(p=0.80\), the IU-NPC designs for any intermediate margin \(\varepsilon ^{\prime }\) approximately agree with the empirical rule \(n(\varepsilon ^{\prime })\approx 48.28\cdot (0.6/\varepsilon ^{\prime })^{2}\), obtained by interpolating the simulation results. It is worth noting that these IU-NPC designs are very close to those obtained within the naive IU-TOST \({\ddot{T}}_{G}\) approach as reported in Lakens (2017). Such a practical coincidence is mostly due to the fact that: i) the calibrated \(\alpha ^{c}\) coincides with the non-calibrated \(\alpha\) for standardized interval lengths of about \((\varepsilon _{I}+\varepsilon _{S})\sqrt{n_{1}n_{2}/(n\sigma ^{2})}>5.4,\) and ii) permutation tests converge at a high rate to the corresponding parametric solutions (Hoeffding 1952). On the other hand, for UI-NPC the related empirical rule for intermediate margins \(\varepsilon ^{\prime }\) is \(\tilde{n}(\varepsilon ^{\prime })\approx 35.33\cdot (0.6/\varepsilon ^{\prime })^{2}.\) Similar approximate rules for \(p=0.50\) are \(n(\varepsilon ^{\prime } )\approx 30.25\cdot (0.6/\varepsilon ^{\prime })^{2}\) and \(\tilde{n}(\varepsilon ^{\prime })\approx 16.03\cdot (0.6/\varepsilon ^{\prime })^{2}\), for IU-NPC and UI-NPC respectively. It is worth observing that to reach reasonable power the Eq testing process requires quite large sample sizes, especially when margins are small [see also point IU.2 in Sect. 7.1].
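For instance, our own reading of these empirical rules at an intermediate margin \(\varepsilon ^{\prime }=0.30\) with \(p=0.80\) gives approximately

$$\begin{aligned} n(0.30)\approx 48.28\cdot (0.6/0.30)^{2}\approx 193 \;\;{\mathrm{and}}\;\; {\tilde{n}}(0.30)\approx 35.33\cdot (0.6/0.30)^{2}\approx 141. \end{aligned}$$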

From these results we may derive a sort of relative efficiency rate of UI-NPC with respect to IU-NPC. For instance, at \(\varepsilon =0.60\) and \(p=0.80\) the rate of sample sizes is \(n/{\tilde{n}}\) \(\approx 1.36\), for \(p=0.50\) it is \(n/{\tilde{n}}\) \(\approx 1.88,\) and for \(p=0.30\) (details not reported) it is \(n/{\tilde{n}}\) \(\approx 2.57\). In practice, relative efficiency rates are mostly dependent on power p and are almost \(\varepsilon\)-invariant.

Table 2 reports, for standard normal data (\(\sigma\) unknown) with \(n_{1} =n_{2}=12\) and \(\varepsilon =(4/5,3/5,2/5,1/3,1/5,1/10)\), calibrated \(\alpha ^{c},\) \({\tilde{\alpha }}^{c},\) rejection rates of H at \(\delta =0\) and \(\delta =\pm 2\varepsilon\) for the IU-NPC \(T_{G}\) and the naive IU-TOST \({\ddot{T}}_{G}\), and of \({\tilde{H}}\) for the UI-NPC \({\tilde{T}}_{G},\) all obtained with \(MC=5000\) and \(R=2500.\)

Table 2 Power behavior of IU and UI with NPC versus TOST

In order to clarify how to read Table 2, let us consider the line \(\varepsilon =0.40\): calibrated \(\alpha ^{c}=0.185,\) \(W_{T_{G}}(0)=0.076,\) \(W_{T_{G}}(\pm 0.8)=0.987;\) \(W_{{\ddot{T}}_{G}} (0)=0.001,\) \(W_{{\ddot{T}}_{G}}(\pm 0.8)=1.000;\) \({\tilde{\alpha }}^{c}=0.049;\) \(W_{{\tilde{T}}_{G}}(0)=0.991,\) and \(W_{{\tilde{T}}_{G}}(\pm 0.8)=0.249\), and so on. In particular, the naive IU-TOST \({\ddot{T}}_{G}\) appears to be dramatically conservative, since its maximum power \(W_{{\ddot{T}}_{G}}(0)=0.001\) is much smaller than \(\alpha =0.05;\) hence the naive \({\ddot{T}}_{G}\) cannot be seriously considered as a practical way to test for Eq. Comparing \(W_{T_{G}}(0)=0.076\) with \(W_{{\tilde{T}}_{G}}(\pm 0.8)=0.249,\) where a comparison can be stated, shows that the UI-NPC is considerably more efficient than the IU-NPC in detecting the respective comparable alternative.

From these results we can see that the IU-NPC appears to be mostly focused on NEq as the main assertion under testing, i.e. the one to be falsified if not true, so exhibiting an intrinsic propensity to retain H even when it is not true. Thus, its applications are mostly to problems where rejection of true Eq has relatively smaller costs than acceptance of Eq when NEq is true, while keeping the related global errors under strict control. This is typically the case in the areas of bioequivalence and pharmacostatistics, where it is considered ethical to retain A (the “old drug”) unless there is empirical evidence that B (the “competitor”) is Eq to it. On the other hand, the UI-NPC appears to be mostly focused on Eq, so exhibiting a relatively larger propensity to retain \({\tilde{H}}\) when it is true. Thus, its applications are mostly to problems where rejection of true Eq has relatively greater costs than acceptance of a false NEq, again while keeping the related global errors under strict control. This generally occurs when the testing aim is to go beyond traditional two-sided procedures, as for instance in quality control. It is also important to emphasize that for \(\varepsilon \le 0.333\) the maximum probability for the naive IU-TOST \({\ddot{T}}_{G}\) to retain Eq, when it is true, is zero [see also points ÏÜ.3, ÏÜ.5, ÏÜ.6 in Sect. 7.2], so resulting in pure costs without any inferential benefits.

Table 3 reports, for data from N(0, 1) (\(\sigma\) unknown), the minimal sample sizes \(\ddot{n}_{1}=\ddot{n}_{2},\) in terms of \(\varepsilon =\varepsilon _{I}=\varepsilon _{S},\) for which the naive IU-TOST \({\ddot{T}}_{G}\) satisfies conditions a) and b), i.e. is unbiased at \(\ddot{\alpha } _{G}=0.05,\) together with the maximum probability (i.e. the power) of accepting Eq [\({\ddot{W}}(Eq)\)] at \(\delta =0\).

Table 3 Minimal sample sizes for unbiasedness of naive IU-TOST

If \(\sigma\) were known, we would obtain essentially the same results. It is proved that when \((n_{1},n_{2})\) and/or \(\varepsilon _{I}+\varepsilon _{S}\) are not sufficiently large (Wellek 2010, p. 5), the naive IU-TOST \({\ddot{T}}_{G},\) as frequently used in the literature, can be unacceptably biased (Sect. 7.2).

6 A bioequivalence application

Let us consider the data from Hirotsu (2017, p. 108) on the end-point variable Log\(\ C_{\max }\) (Log of the maximum blood concentration of a drug), related to \(n_{1}=20\) Japanese subjects and \(n_{2}=13\) Caucasians, after prescribing a standard dose of a drug. The data concern a bridging study conducted to investigate bioequivalence between the two populations. So, the test is to see whether the two populations can be considered bioequivalent with respect to that variable. The data are reported in Table 4.

Table 4 Data from Hirotsu (2007)

The basic statistics are: \({\bar{X}}_{Jap}=1.518;\) \({\hat{\sigma }}_{Jap}=0.0813;\) \({\bar{X}}_{Cau}=1.457;\) \({\hat{\sigma }}_{Cau}=0.0951;\) pooled \(\hat{\sigma }=0.0869.\) By firstly using the permutation test \(T^{\prime }=|{\bar{X}}_{J} -{\bar{X}}_{C}|\) for the point null hypothesis \(H^{\prime }:X_{J}\overset{d}{=}X_{C}\) versus the two-sided alternative \(K^{\prime }:X_{J}\overset{d}{\ne }X_{C}\), with \(R=100\,000\) we obtain the p value statistic \(\hat{\lambda }^{\prime }=0.0535\). There is no evidence of non-equality between the two data sets at \(\alpha =0.05\), although \({\bar{X}}_{Jap}\) appears to be slightly larger than \({\bar{X}}_{Cau}\) (Student’s \(t=1.991,\) 31 df, \(\lambda _{t}^{\prime }>0.05\)).
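A minimal sketch of this preliminary two-sided permutation test follows (array and function names are ours; the Table 4 values themselves are not reproduced here, so `x_jap` and `x_cau` are placeholders for those data).

```python
import numpy as np

rng = np.random.default_rng(3)

def two_sided_perm_test(x_jap, x_cau, R=100_000):
    """Permutation test of H': X_J =d X_C against the two-sided K': X_J !=d X_C,
    using T' = |mean(X_J) - mean(X_C)| as in the text (a sketch only)."""
    x_jap, x_cau = np.asarray(x_jap, float), np.asarray(x_cau, float)
    n1 = len(x_jap)
    pooled = np.concatenate([x_jap, x_cau])
    t_obs = abs(x_jap.mean() - x_cau.mean())
    exceed = 0
    for _ in range(R):
        z = rng.permutation(pooled)
        exceed += abs(z[:n1].mean() - z[n1:].mean()) >= t_obs
    return exceed / R   # estimated p value statistic (0.0535 is reported above)
```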

Let us consider the IU-NPC \(T_{G}\) and the UI-NPC \({\tilde{T}}_{G}\) with the list of margins \(\varepsilon _{I} =\varepsilon _{S}=(0.058,\) 0.071,  0.109,  0.125),  corresponding to studentized values (in terms of \({\hat{\sigma }})\) of (2/3,  \(0.82,\ 1.25,\) 1.44 ) , respectively.

The results, with \(R=100\,000\) on data X, are in Table 5.

Table 5 Analysis with IU and UI NPC

The results in brackets are related to mid-rank data transformations. In our opinion, due to some apparently irregular data concentrations and some ties, the mid-rank results can be slightly more reliable than those on the plain data X (see IU.3 in Sect. 7.1 and UI.1 in Sect. 7.3). At the \(\varepsilon\) such that \(\ddot{\alpha }(\varepsilon )=\alpha _{G}=0.05\), i.e. \(\varepsilon \approx 0.071\) (corresponding to \(\approx 0.82~{\hat{\sigma }}),\) the type I error rates of the naive IU-TOST \({\ddot{T}}_{G}(\varepsilon )\) and of the IU-NPC \(T_{G}\) approximately coincide. Of course, this coincidence also remains for larger sample sizes and margins. With the data of the example, the Eq of the two data sets is retained by the IU-NPC \(T_{G}\) for margins \(\varepsilon _{I}=\varepsilon _{S} \gtrsim 0.12\approx 1.38~{\hat{\sigma }}\). For margins \(\varepsilon _{I}=\varepsilon _{S}<0.12\), we could state that there is not enough information to conclude that the two populations are equivalent. Anyway, it is also well known in the literature that the IU approach suffers from a lack of power (e.g., Berger and Hsu 1996; Wellek 2010).

On the other hand, the UI-NPC \({\tilde{T}}_{G}\) retains Eq for all margins, including \(\varepsilon _{I}=\varepsilon _{S}=0\) (in such a case \({\tilde{\alpha }}^{c}=\alpha /2\) and the related p value statistic is \({\hat{\lambda }}_{{\tilde{G}}}=\hat{\lambda }^{\prime }=0.0535>0.025)\).

7 Warnings and good practices for NPC equivalence

7.1 The IU-NPC solution

The IU-NPC approach, as well as the likelihood-based IU, presents some pitfalls, as is evident from previous results (see also the Supplementary Material). The most important requirements and pitfalls are:

  • IU.1) It does not admit any solution when \(\ \varepsilon _{I}=\varepsilon _{S}=0,\) i.e. when the null hypothesis is \(H:[(\delta \le 0) {\textstyle \bigcup } (\delta \ge 0)],\) in which case the alternative K becomes logically impossible since it is empty, \(K=\varnothing\) say.

  • IU.2) When \(\varepsilon _{I}+\varepsilon _{S}\) is small in terms of the \(T_{G}\) distribution, difficulties still remain in retaining Eq when it is true. This difficulty is well recognized in the literature. For instance, Wellek (2010, p. 5) says “...the sample sizes required in an equivalence test in order to achieve a reasonable power typically tend to be considerably larger than in an ordinary one- or two-sided testing procedure ...unless the range of tolerance deviations ...is chosen so wide that even distributions exhibiting pronounced dissimilarities would be declared ‘equivalent’ ...”. The same difficulties are confirmed by simulation results (Arboretti et al. 2018).

  • IU.3) Using Monte Carlo to obtain the IU-NPC calibrated \(\alpha ^{c}\) requires complete knowledge of the underlying distribution F of variable X, including all its nuisance parameters. When a central limit theorem is working for the partial test distributions, the calibrated \(\alpha ^{c}\) can be approximately assessed by a simulation algorithm as in Arboretti et al. (2018; see also the Supplementary Material), where the unknown standard deviation \(\sigma _{X}\) is replaced by its sampling estimate \({\hat{\sigma }}_{X}\). Thus, the assessment of \(\alpha ^{c}\) suffers from two sources of approximation. This difficulty is particularly intriguing when the sample sizes \((n_{1},n_{2})\) and/or the Eq interval length \(\varepsilon _{I}+\varepsilon _{S}\) are small. However, if the data X have a finite second moment, the IU-NPC is little influenced by mis-specification of the data distribution F [see also UI.3, Sect. 7.3]. When no assumption on the underlying F is undertaken, then a mid-rank transformation of the numeric data \({\mathbf {X}}\) and of the margins \((\varepsilon _{I},\varepsilon _{S})\) may provide a reliable evaluation of the calibrated \(\alpha ^{c}\), provided that the normal approximation for the Wilcoxon-Mann-Whitney statistic applies, i.e. for sample sizes of about 10 or larger.

  • IU.4) According to the results in Arboretti et al. (2018) and based on the limiting behavior of permutation tests as stated in Hoeffding (1952), the IU-NPC test \(T_{G} =\min (T_{I},T_{S})\) quickly converges to \(T_{G}^{opt}\) under the conditions for the latter.

  • IU.5) Unless \(\min (n_{1},n_{2})\) or \(\ \varepsilon _{I}+\varepsilon _{S}\) is very large, once Eq is rejected, the application of a Bonferroni-like rule for establishing which \(H_{h},\) \(h=I,S,\) is active, if not impossible, is generally difficult, since the calibrated \(\alpha ^{c}\) lies in the half-open interval \([\alpha ,~(1+\alpha )/2).\)

  • IU.6) In practice, to analyze a given data set \(({\mathbf {X}}_{1},{\mathbf {X}}_{2})\), with sample sizes \((n_{1},n_{2})\) and margins \((\varepsilon _{I},\varepsilon _{S}),\) at significance level \(\alpha ,\) one first has to establish or estimate \(\alpha ^{c}\) via Monte Carlo as in point IU.3; then one can proceed with the IU-NPC analysis. This implies using two computing algorithms.

  • IU.7) When any kind of ranks is used, only within the IU-NPC permutation approach is it possible to express margins in terms of the same physical measurement units as variable X. The rank solutions discussed in Wellek (2010) and Janssen and Wellek (2010) express margins in terms of rank transformations so as to mimic solutions based on normal settings. However, this implies considering something similar to random margins, the meaning of which becomes doubtful, or at least questionable and too difficult to justify (Arboretti et al. 2015; Hirotsu 2007). The same difficulty is also met when monotonic data transformations, such as \(X=\varphi (Y),\) are necessary and the margins are expressed in terms of the transformed values X. In any case, provided that the margins are clearly justified, the IU-NPC can be correctly applied whether these are expressed in terms of the original data Y or in terms of the transformed data X.

  • IU.8) The multidimensional extension of the IU approach by likelihood methods is far from satisfactory, especially outside normal distributions. We think this extension can easily be done under the NPC and we intend to do it in future research.

  • IU.9) Calibrated reference values under the parametric likelihood ratio approach are obtained by numerical calculations (Wellek 2010; Lehmann 1986) only for population distributions lying within the regular exponential family, provided the invariance property with respect to nuisance parameters (if any) holds. Outside this family, only approximate solutions can be obtained [IU.3]. So the IU parametric approach is extremely demanding. Moreover, whenever the minimal sufficient statistic under the null hypothesis is the whole n-dimensional data set \({\mathbf {X}}\), only nonparametric permutation solutions can be set up correctly (Pesarin 2015, 2016; Pesarin and Salmaso 2010).

7.2 The naive IU-TOST solution

The naive IU-TOST solution, \({\ddot{T}}_{G}=\min (T_{I},T_{S})\) say, as frequently considered in the literature (Anderson-Cook and Borror 2016; Berger 1982; Berger and Hsu 1996; Pardo 2014; Patterson and Jones 2017; Richter and Richter 2002; Schuirmann 1987; Wellek 2010), corresponds to the non-calibrated version that rejects the global H at type I error rate \(\alpha\) when both partial tests reject, each at the same rate \(\alpha\) in place of the calibrated \(\alpha ^{c},\) i.e. when \(\ddot{\alpha }_{I}=\ddot{\alpha }_{S}=\alpha .\) This naive \({\ddot{T}}_{G}\) solution has several further specific pitfalls:

  • ÏÜ.1) It satisfies condition a) but not b) in Sect. 2; however, it trivially satisfies Theorem 1 in Berger (1982) and Berger and Hsu (1996).

  • ÏÜ.2) When \(\varepsilon _{I}+\varepsilon _{S}\) is very large in terms of the \(T_{G}\) distribution, the non-calibrated naive \({\ddot{T}}_{G},\) whose partial type I error rates are \(\ddot{\alpha }_{I}=\ddot{\alpha } _{S}=\alpha ^{c}=\alpha ,\) and the calibrated IU-NPC \(T_{G}\) coincide, and so they are both consistent (Sect. 4).

  • ÏÜ.3) The naive IU-TOST\(\ {\ddot{T}}_{G}\) can be dramatically conservative and its maximum rejection probability can be much smaller than \(\alpha ,\) even exactly zero (Arboretti et al. 2018), see ÏÜ.5 and results in Sect. 5 (see also the Supplementary Material).

  • ÏÜ.4) Theorem 2 in Berger (1982) and Berger and Hsu (1996) essentially states that margins \((\varepsilon _{I},\varepsilon _{S})\) exist such that the power under K of the naive test \({\ddot{T}}_{G}\) is not smaller than \(\alpha .\) Since the standardized length of the Eq interval diverges at the rate \([n_{1}n_{2}/(n_{1}+n_{2})]^{1/2}\) if \(\min (n_{1} ,n_{2})\) diverges, such an existence essentially corresponds to consistency of \({\ddot{T}}_{G}\). However, it is important to underline that such a condition is not constructive and so is not beneficial to finding practical solutions. Indeed, in any real problem, based on technical or biological or regulatory considerations, the margins are established before the experiment for data collection is conducted. So, since it is unknown whether \((\varepsilon _{I} +\varepsilon _{S})\), measured in terms of the \({\ddot{T}}_{G}\) distribution with the actual sample data, is sufficiently large that \(\ddot{\alpha }=\alpha ^{c}=\alpha ,\) naive \({\ddot{T}}_{G}\) solutions do not guarantee the minimal requirements for being considered valid test statistics.

  • ÏÜ.5) Paradoxically, when the Eq interval length \(\varepsilon _{I}+\varepsilon _{S}\) is small in terms of the \({\ddot{T}}_{G}\) distribution, the maximum probability for the naive IU-TOST \({\ddot{T}}_{G}\) of finding a drug equivalent to itself can be exactly zero. This generally occurs when the two partial rejection regions have no common points, i.e. when \(\phi _{S}\bigcap \phi _{I}=\emptyset\), so leading to impossible events, where \(\phi _{I}\) and \(\phi _{S}\) are the \(\alpha\)-rejection regions of \(T_{I}\) and \(T_{S},\) respectively. For instance, with: \(n_{1}=n_{2}=12\), \(\varepsilon _{I}=\varepsilon _{S}=0.25\), \(X\sim N(0,1)\) and \(\ddot{\alpha }_{I} =\ddot{\alpha }_{S}=0.05,\) by a simulation with \(MC=5000\) and \(R=2500\) the type I error for \({\ddot{T}}_{G}\) is \(\ddot{\alpha }_{G}\approx 0.000\) and, much worse, the maximum estimated power is \({\hat{W}}_{{\ddot{T}}_{G}}(0)\approx 0.000\). Interestingly, the calibrated IU-NPC \(\alpha ^{c}\) is about 0.293 (that of the UI-NPC \({\tilde{T}}_{G}\) is \({\tilde{\alpha }}^{c}\approx 0.047;\) see also the Supplementary Material). In this respect it is easy to see that for normal data, with known \(\sigma\) and \(\varepsilon _{I}=\varepsilon _{S}=\varepsilon\), the maximum probability to retain Eq is exactly zero up to \(n_{1} =n_{2}=\lfloor 2(z_{\alpha }\sigma /\varepsilon )^{2}\rfloor ,\) with \(\lfloor (\cdot )\rfloor\) the integer part of (\(\cdot\)) and \(z_{\alpha }\) the \(\alpha\)-quantile of N(0, 1) (a worked numerical instance is given at the end of this list).

  • ÏÜ.6) As a consequence, naive IU-TOST \({\ddot{T}} _{G}\) tests are not members of the set of test statistics that satisfy conditions a) and b) in Sect. 2. Moreover, as the global type I error rate and power can both be considerably smaller than \(\alpha\) for small sample sizes and/or small Eq interval lengths, we may state that the naive \({\ddot{T}}_{G}\) testing procedure is based on an incorrect methodology, meaning that the true type I error can result in \(\alpha (\pm \varepsilon ,n)\ll \alpha\) and the maximal power, at \(\delta =0,\) in \(W_{{\ddot{T}}_{G}}(0,n,\varepsilon )\ll \alpha\), conditions that do not agree with the minimal requirements for any test (Nunnally 1960; Sect. 2). Thus, in our opinion, unless sample sizes and/or the Eq interval length are sufficiently large, there is no reason for taking the naive \(\ddot{T}_{G}\) into consideration in Eq testing. Essentially, this is our basic criticism regarding the widespread use of the naive IU-TOST method (e.g. Anderson-Cook and Borror 2016; Pesarin 1990, 1992; among the many). We think that this intrinsic defect remains hidden to most practitioners because the naive IU-TOST apparently does not sound counter-intuitive.

  • ÏÜ.7) A direct consequence of the former two points is that, for the naive IU-TOST \({\ddot{T}}_{G},\) the accumulation of inferences from independent studies could be unsuitable. For instance, if there are \(m\ge 2\) analyses, each based on insufficiently large sample sizes, as is common in some meta-analyses and multicenter studies, their combination might always reject that a drug is Eq to itself. Indeed, for a valid combination it is required that all m partial tests are unbiased (i.e. minimal power \(\ge \alpha\), Sect. 2). In fact, if for study h,  \(h=1,\ldots ,m,\) \(\phi _{Sh}\bigcap \phi _{Ih}=\emptyset ,\) i.e. the joint rejection region of the two partial tests \(T_{Sh}\) and \(T_{Ih}\) is empty, the p value related to \({\ddot{T}}_{Gh}\) is 1 and so the p value of any of its combinations is also 1,  hence always providing for NEq, true or not.
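As a purely numerical illustration of point ÏÜ.5 (our own arithmetic, under the stated assumptions \(\sigma =1\), \(\alpha =0.05\), so that \(|z_{0.05}|=1.645\), and \(\varepsilon _{I}=\varepsilon _{S}=0.25\)), the two \(\alpha\)-rejection regions remain disjoint, and the naive IU-TOST therefore has exactly zero probability of retaining Eq, true or not, for all equal sample sizes up to

$$\begin{aligned} n_{1}=n_{2}=\lfloor 2(1.645\cdot 1/0.25)^{2}\rfloor =\lfloor 86.6\rfloor =86, \end{aligned}$$

which is consistent with the zero estimated power found above for \(n_{1}=n_{2}=12\).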

7.3 The UI-NPC solution

The most important requirements and pitfalls of the UI-NPC are:

  • UI.1) Using Monte Carlo to establish the calibrated \({\tilde{\alpha }}^{c}\) requires complete knowledge of the underlying distribution F of the endpoint variable X, including all its nuisance parameters [same as IU.3, Sect. 7.1]. When a central limit theorem is working for the partial test distributions, the calibrated \({\tilde{\alpha }}^{c}\) can be approximately determined according to Arboretti et al. (2018), since the Eq interval length \(\varepsilon _{I}+\varepsilon _{S}\) can be measured in terms of the underlying standard error \(\sigma _{X}[n_{1}n_{2}/(n_{1}+n_{2})]^{-1/2}.\) As in practice \(\sigma _{X}\) is unknown, substitution by its sampling estimate \({\hat{\sigma }}_{X}\) implies that \({\tilde{\alpha }}^{c}\) can be assessed only approximately. It is worth noting, however, that the related degree of approximation is generally negligible in practice because: i) its true value lies in the closed interval \([\frac{1}{2}\alpha ,~\alpha ]\) and so the maximum approximation error is bounded by \(\alpha /2\); ii) for any given Eq interval, the calibrated \({\tilde{\alpha }}^{c}\) quickly converges to \(\alpha\) for increasing sample sizes, provided that the population mean \({\mathbf {E}}_{F}(X)\) is finite. When the population mean cannot be assumed finite, a mid-rank transformation of the numeric data \({\mathbf {X}}\) and of the margins \((\varepsilon _{I},\varepsilon _{S})\) can provide well-approximated evaluations of the calibrated \({\tilde{\alpha }}^{c},\) provided that the normal approximation for the Wilcoxon-Mann-Whitney statistic applies.

  • UI.2) Similarly to point IU.6 (Sect. 7.1), to analyze a given data set \(({\mathbf {X}}_{1},{\mathbf {X}}_{2})\), with sample sizes \((n_{1} ,n_{2})\) and margins \((\varepsilon _{I},\varepsilon _{S}),\) at significance level \(\alpha ,\) one first has to establish or estimate \({\tilde{\alpha }}^{c}\) via Monte Carlo as at point UI.1; then one can proceed with the UI-NPC analysis. This too implies using two computing algorithms, but with much less impact than with the IU-NPC, because \({\tilde{\alpha }}^{c}\in [\frac{1}{2}\alpha ,~\alpha ],\) which is a much smaller range than \([\alpha ,~(1+\alpha )/2).\) Indeed, a similar five-entry table would require much smaller numbers of sample sizes and margins.

  • UI.3) Once NEq is retained at significance level \(\alpha ,\) identifying which of the two arms is mostly responsible for that result using a Bonferroni-like rule implies that the related type I error rate lies in \([\frac{1}{2}\alpha ,~\alpha ]\), and so is never less than \(\alpha /2.\) Indeed, it is close to \(\alpha\) even for moderate sample sizes and small Eq interval lengths since, in practice, the UI-NPC is intrinsically robust against mis-specification of the underlying F,  possibly after data transformations achieving near symmetry. In this regard, with data from a Student’s t distribution with 2 df (zero mean and infinite variance), \(n_{1}=n_{2}=12\), \(\varepsilon _{I}=\varepsilon _{S}=0.321,\) corresponding to margins of about 0.25 for standard normally distributed data since \(\Pr \{-0.25\le N(0,1)\le 0.25\}=\Pr \{-0.321\le t_{2} \le 0.321\},\) we have \(\alpha ^{c}\approx 0.375\) and \({\tilde{\alpha }}^{c} \approx 0.047\). Compared to the values that are active under standard normal data, as in ÏÜ.5 (Sect. 7.2), \(\alpha ^{c}\) proves to be much larger than 0.293, and so the IU-NPC appears not to be robust against F; instead \({\tilde{\alpha }}^{c}\) coincides to the third decimal place with 0.047,  confirming that the UI-NPC is at least approximately invariant with respect to F,  provided that near symmetry of the data is achieved. The robustness properties of the IU-NPC and UI-NPC will be considered in further research.

  • UI.4) When \(\varepsilon _{I}=\varepsilon _{S}=0,\) i.e. for a sharp null and two-sided alternatives, unless the underlying data distribution is symmetric, it is well known that it is difficult to find unbiased tests based on comparisons of sample averages (Cox and Hinkley 1974; Lehmann 1986). Within the UI-NPC, however, the test \({\tilde{T}}_{G}=\max [({\bar{X}}_{1}-{\bar{X}}_{2}),\) \((\bar{X}_{2}-{\bar{X}}_{1})]\) is always unbiased at least at level \(\alpha /2.\)

  • UI.5) Similarly to IU.9 (Sect. 7.1), calibrated reference values under the parametric likelihood ratio approach are obtained by numerical calculations only for population distributions lying within the regular exponential family, provided the invariance property with respect to nuisance parameters (if any) holds (Ferguson 1967). So, like the IU, the UI parametric approach is also quite demanding. By contrast, when no parametric UI solution is available, approximations within the UI-NPC generally suffice for most practical applications [UI.1].

8 Concluding remarks

The present paper provides a sort of comparative analysis of two nonparametric permutation approaches for Eq testing problems. In accordance with the majority of the literature on the subject matter, one is based on the IU principle. The other is based on the UI principle. Although they entail different evaluations of inferential errors, both are rationally suitable for such testing and so they are not strictly comparable. As such, rather than a proper comparison, we have proposed a sort of weak comparative (parallel) analysis. However, we believe that neither can be considered uniformly the best to be used for all possible problems. Thus, our analysis is mostly concerned with highlighting their respective requirements, properties, difficulties, inferential costs, limitations and pitfalls.

One important point we took into consideration is that in some of the literature the IU solution is used with the so-called non-calibrated reference critical values. We called this the naive IU-TOST solution. In this regard, we showed (see ÏÜ.5 and ÏÜ.6, Sect. 7.2) that its type I error rate and power can both be zero for relatively small margins and/or sample sizes, thus implying rejection of Eq, true or not, with a probability close to one; its related testing process can therefore become absolutely useless, resulting in pure costs without any inferential benefits. This rather erroneous feature may lead, for instance, to the unacceptable conclusion that “the probability of finding that a drug is Eq to itself with the naive IU-TOST can be zero”.

A further aspect we would like to consider is a sort of comparison between the IU and the UI with respect to so-called point null hypotheses. A point null is equivalent to considering \(\varepsilon _{I}=\varepsilon _{S}=0,\) i.e. an equivalence interval of length zero. On the one hand, the UI way coincides with the traditional two-sided solution plus one more: once the null has been rejected, its p value \({\tilde{\lambda }}=\min ({\tilde{\lambda }}_{I},{\tilde{\lambda }}_{S})\) satisfies Bonferroni’s rule (UI.3, Sect. 7.3) and allows us to make an inference on which is the active arm: e.g. if \({\tilde{\lambda }}={\tilde{\lambda }}_{I}\) then \(\delta <0\), at type I rate \({\tilde{\alpha }}^{c}=\alpha /2\) (and similarly for \({\tilde{\lambda }}_{S}\)). On the other hand, the IU way cannot have any solution, so in this formulation a point null cannot be considered as a null interval. This too shows that the two formulations are essentially different.

A problem faced by any researcher is finding guidance on how to choose between the two approaches. Our point of view is that if he/she considers that rejection of Eq when it is true has relatively smaller costs than its acceptance when NEq is true, as is typically the case with bioequivalence and pharmacostatistics, then the IU-NPC is the correct choice. Correspondingly, if he/she considers that rejection of Eq when it is true has relatively greater costs than its acceptance when NEq is true, as is typically the case with traditional two-sided testing (quality control, etc.), then the UI-NPC is the correct choice.

In the usual literature on the subject, both the IU and the UI parametric approaches are essentially worked out within likelihood techniques. These approaches, which in any case imply approximate solutions, are rather difficult to deal with since they require quite severe conditions of validity, such as population distributions lying within the regular exponential family and enjoying the invariance property with respect to nuisance parameters, if any; such conditions are generally quite difficult to meet and/or justify. Our IU-NPC and UI-NPC permutation solutions are also approximate. However, when a parametric optimal solution exists, its NPC counterpart converges to it asymptotically at a high rate. When a likelihood ratio solution is not invariant with respect to one or more nuisance parameters, it cannot be worked out unless these nuisance parameters are completely known. Our IU-NPC and UI-NPC solutions, since they work conditionally on a set of statistics that are sufficient at one point of the null hypothesis, do not require any knowledge of nuisance parameters, and so are flexible enough to cope with most practical problems.