1 Introduction

A tolerance region is a multidimensional set within which a pre-specified proportion \(\beta \) of sampled responses from “normal" subjects will be covered with a prefixed probability \(\gamma \). The choice of \(\beta \) and \(\gamma \) will depend on the practical application of the tolerance region. It is important to point out that the term normal here is intended to denote a control part of the population (i.e. reference subjects; see Tukey 1947; Solberg 1993; Gräsbeck 2004) and to represent typical responses found in those subjects. We will use the term “Gaussian" to imply that values are distributed in accordance with the conventional bell-shaped curve.

Multivariate tolerance regions (TRs) developed under Gaussianity assumptions are well advanced (e.g. John 1963; Siotani 1964). They appear to be particularly appropriate for statistical populations in which a preponderant majority of subjects are concentrated within a narrow central range. It is claimed that, especially in large samples, Gaussian TRs exhibit robustness properties that extend their range of application.

Nonetheless, Helsel and Hirsch (2002) correctly observe that distributional assumptions are often employed because there are insufficient data to compute tolerance bounds with the precision required. Adding information from an assumed distribution will increase the precision of the estimate as long as the distributional assumption is a reasonable one; but when the assumed distribution does not fit the data well, the resulting estimates are less accurate and more misleading than if nothing had been assumed.

Regrettably, the situations in which the assumption of a known distribution, such as the Gaussian model, is most needed, for example when sample sizes are small, are the same as those in which it is difficult to determine whether or not the data follow the hypothesized distribution. For these reasons, we will not discuss the construction of tolerance regions in the form of Gaussian hyper-ellipsoids. In the following, we restrict ourselves to distribution-free tolerance regions.

A possible procedure for constructing distribution-free TRs is one based on the properties of statistically equivalent blocks, which has received attention from a number of researchers (e.g. Fraser 1953; Gessaman 1970; Ackermann 1983). Di Bucchinico et al. (2001) note that this method depends on an arbitrary ordering function and that it is essentially a one-dimensional procedure. The TRs so obtained are exact, but they are generally not asymptotically minimal. In addition, the shape of the regions has a multi-planar profile, which can give rise to a multiplicity of subintervals for one or more variables that are not easy to handle. Another procedure for constructing TRs is based on density estimation and data depth (see Chatterjee and Patra 1980; Li and Liu 2008). However, according to Di Bucchinico et al. (2001), this method is very conservative and requires some regularity conditions to be satisfied. For a recent review of tolerance intervals and regions, see Krishnamoorthy and Mathew (2009), Wellek (2011), and Young and Mathew (2020).

This article advocates, to some extent, a return to the roots of TRs, that is, a return to the concept of an accept-reject region derived from order statistics, which has been well recognized for a long time but has not attracted much attention. Indeed, in spite of their good properties, an extensive examination of the literature found only a small number of bivariate applications of TRs, and none with three or more variables. This is rather unexpected because of the benefits that distribution-free TRs can provide in multivariate analysis. We explain the reluctance of researchers and practitioners to use distribution-free TRs by underlining the fact that the shape of the regions obtained by the most established methods can be quite complicated and, hence, not only expensive in higher dimensions, but also not particularly easy to use for someone who is not familiar or comfortable with devices which minimize human involvement. Cho and Ng (2021) offer a comprehensive overview of this topic.

The aim of this paper is to develop an innovative approach which allows the user to calculate efficiently a reference hyper-rectangle for a multidimensional character, so that the expected probability of a given vector of responses falling within the region established for normal subjects can be readily established. In addition, it will be simple to identify the variable or variables that have caused a given vector to be considered aberrant. The implicit idea is that, if a subject lies outside the given bounds in one or more variables, some action should be taken. These goals will be achieved by using a single sample of observations drawn from a common, but unknown, joint probability distribution, which we only assume to be continuous.

The organization of the paper is driven by these aims. In the next section we discuss the one-dimensional case. The third section is concerned with multidimensional distribution-free TRs. Since they are centered around order statistics, it is troublesome to extend their concept to higher dimensions. In this respect, an innovative technique is proposed and illustrated. Moreover, we suggest a solution to the important practical problem concerning the order in which cuts are made to segment the data set, which has remained unsolved since Wald (1943). Numerical examples and simulations are used to illustrate the characteristics of the proposed technique in Section 4. The final section contains some concluding remarks and perspectives for further research.

2 One-dimensional tolerance intervals

Univariate tolerance intervals (TIs) are well known and their properties have been investigated in detail. Nonetheless, in contrast to other intervals commonly applied in statistical inference, TIs are rarely employed. One reason is that their theoretical basis and computation are more complicated than those of conventional confidence and prediction intervals. In addition, they are not always available in statistical software packages. In this section we briefly review the relevant details of TIs.

Let us consider a random variable Y with continuous density function f(y) and distribution function F(y). Let \({\mathbf {y}}\!=\!(y_1, y_2,\ldots , y_n)\) be a random sample obtained from F. A distribution-free \(\beta \)-content TI with confidence level \(\gamma \) solves the problem of finding two functions of the sample, say \(L({\mathbf {y}})<U({\mathbf {y}})\), such that \([L({\mathbf {y}}), U({\mathbf {y}})]\) has a probability \(\gamma \) of including at least a fraction \(\beta \) of the items or subjects of the sampled population, that is,

$$\begin{aligned} Pr\left\{ \int _{L({\mathbf {y}})}^{U({\mathbf {y}})}f\left( y\right) dy\geqslant \beta \right\} =\gamma \qquad \text{ with }\quad 0\leqslant \beta , \gamma \leqslant 1 \end{aligned}$$
(1)

We call \(\gamma \) the confidence level and \(\beta \) the coverage or tolerance proportion (in the sense that it is the minimum proportion of normal subjects that can be considered tolerable to cover). The confidence level \(\gamma \) represents the probability of accepting a valid region, so that \((1\!-\!\gamma )\) is the risk of rejecting one. The quantities L and U are the lower and upper bounds of the interval. Incidentally, it may well be that L is not finite, so a TI can be proposed in the form \((-\infty ,U)\) or (a, U) where a is the smallest value which Y can assume. If \(U=\infty \) the interval becomes \((L,\infty )\) or (L, b) where b is the largest value which Y can assume. These cases are called one-sided TIs.

Let \(y_{(1)}, y_{(2)}, \ldots , y_{(n)}\) denote the order statistics of the sample \({\mathbf {y}}\) and set \(L\!=\!y_{(r)},U\!=\!y_{(s)}\) where r and s are two integers with \(1\leqslant r\!<\!s\leqslant n\) (possibly including \(y_{(0)}=-\infty \) and \(y_{(n+1)}=\infty \)). We assume that the recording of data is performed in such a way that no two observations are equal. In this regard, any effort made to improve the accuracy of each single entry in the sample is worthwhile. If ties cannot otherwise be avoided, small uniform random numbers can be added to the data.
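A minimal sketch of such a tie-breaking device (the helper name and the jitter size eps are ours; eps should be well below the measurement resolution):

```r
## Break possible ties by adding a tiny uniform jitter, small enough not to
## alter the ordering of genuinely distinct values.
break_ties <- function(y, eps = 1e-4) y + runif(length(y), min = 0, max = eps)
```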

It turns out that the value of the integral in (1) does not depend on f(y) if and only if \(L({\mathbf {y}})\) and \(U({\mathbf {y}})\) are order statistics of the sample: \(L({\mathbf {y}})=y_{(r)}, U({\mathbf {y}})=y_{(s)}\). See Wilks (1941) and Robbins (1944). Relationship (1) can thus be written as

$$\begin{aligned} Pr\left\{ F[y_{(s)}]-F[y_{(r)}] \geqslant \beta \right\} =Pr\left\{ W_{r,s} \geqslant \beta \right\} \end{aligned}$$
(2)

where \(W_{r,s}\!=\!F[y_{(s)}]\!-\!F[y_{(r)}]\). Noting that \(F[y_{(s)}]\) and \(F[y_{(r)}]\) are order statistics from a uniform distribution in (0, 1), probability (2) is

$$\begin{aligned} Pr\left\{ W_{r,s} \geqslant \beta \right\} =1-I_{\beta }\left( k, n-k+1\right) , \qquad \text{ with }\quad k=s-r \end{aligned}$$
(3)

where \(I_{\beta }\) is the incomplete beta function. In general, expression (3) cannot be satisfied exactly for an arbitrary confidence level \(\gamma \), but the indices r and s can be chosen so that \(Pr[W_{r,s} \geqslant \beta ]={\widehat{\gamma }}\), where \({\widehat{\gamma }}\) is the lowest attainable confidence level greater than or equal to \(\gamma \). Let \(k^*\) be the index such that

$$\begin{aligned} k^*=\min \left\{ k\in \{1,2,\ldots ,n\}: 1-I_{\beta }\left( k, n-k+1\right) \geqslant \gamma \right\} ,\qquad {\widehat{\gamma }}=1-I_{\beta }\left( k^*, n-k^*+1\right) \end{aligned}$$
(4)

Then, any interval \([y_{(r)},y_{(s)}]\) such that \(s\!-\!r\!=\!k^*\) will be a valid choice. It surprised us to find that most of the published work assumes that the order statistics defining the bounds are fixed in advance. By contrast, as will be explained in the next section, our approach searches for an optimal combination of pairs \((r_1,s_1), (r_2,s_2), \ldots ,(r_m,s_m)\).
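As an illustration, the search in (4) only requires scanning the regularized incomplete beta function, which in base R is pbeta. The following sketch (the function name np_k_star is ours) returns the smallest spread \(k^*=s-r\) and the achieved confidence level for given n, \(\beta \) and \(\gamma \):

```r
## Minimal sketch in base R: find the smallest spread k* = s - r such that the
## distribution-free interval [y_(r), y_(s)] covers a proportion beta of the
## population with confidence at least gamma, as in (3)-(4).
np_k_star <- function(n, beta = 0.90, gamma = 0.95) {
  k    <- 1:(n - 1)                          # both bounds inside the sample
  conf <- 1 - pbeta(beta, k, n - k + 1)      # Pr{W_{r,s} >= beta}, eq. (3)
  ok   <- which(conf >= gamma)
  if (length(ok) == 0) return(NA)            # n too small for this (beta, gamma)
  k_star <- min(ok)
  c(k_star = k_star, gamma_hat = conf[k_star])
}

np_k_star(n = 255, beta = 0.90, gamma = 0.95)
## any pair (r, s) with s - r = k_star is a valid two-sided TI; the symmetric
## choice r = (n - k_star) %/% 2, s = r + k_star is a natural candidate
```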

The availability of two R packages, tolerance (Young 2010) and EnvStats (Millard 2013), along with detailed user guides, visualization tools and worked examples, has helped the diffusion of tolerance intervals in statistical research. As an example, we compute distribution-free TIs for the systolic blood pressure readings of \(n\!=\!85\) subjects reported in Bland and Altman (1999). In that study, simultaneous measurements were made by each of two experienced observers using a sphygmomanometer and by a semi-automatic blood pressure monitor.

Three sets of readings were made in quick succession. We have aggregated the three measurements into a single vector of \(n\!=\!255\) observations and set \(\beta \!=\!0.90\) and \(\gamma \!=\!0.95\). The solution obtained by using \(\nu \!=\!125\) is [96, 228] for human observer no. 1, with actual confidence of 95.6% and actual coverage of 90.2%. The results for human observer no. 2 are [98, 226], 95.6%, 90.2%, which are very similar to those of observer no. 1. The solution for the semi-automatic blood pressure monitor is [111, 228] with actual confidence 95.6% and actual coverage 90.21%. Bland and Altman (1999) emphasize that the decision about what is an acceptable agreement is a clinical one; statistics alone cannot answer the question. It seems, however, that the machine gives values which tend to be higher than those recorded by the human observers, at least in the lower part of the interval.
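A comparable interval can also be obtained directly from the tolerance package mentioned above. A sketch for observer no. 1, assuming the \(n\!=\!255\) aggregated readings are stored in a numeric vector sbp1 (a hypothetical name; the data are tabulated in Bland and Altman 1999) and using the argument names of recent versions of the package:

```r
## Distribution-free two-sided TI based on order statistics (Wilks' method),
## with coverage P = 0.90 and confidence 1 - alpha = 0.95.
library(tolerance)
nptol.int(x = sbp1, alpha = 0.05, P = 0.90, side = 2, method = "WILKS")
```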

3 Distribution-free tolerance regions

Let \({\mathbf {y}}_i\!=\!\left\{ y_{i,1}, y_{i,2},\ldots , y_{i,m}\right\} , i\!=\!1,2,\ldots ,n\) be a random sample of size n from a multivariate distribution \(F(Y_1, Y_2,\ldots , Y_m)\) with joint density function \(f(Y_1, Y_2,\) \(\ldots , Y_m)\). A tolerance region \(H_m({\mathbf {y}}_1,{\mathbf {y}}_2,\ldots ,{\mathbf {y}}_n)\) at level \(\gamma \) is such that

$$\begin{aligned} Pr\left\{ Pr\left[ H_m({\mathbf {y}}_1,{\mathbf {y}}_2,\ldots ,{\mathbf {y}}_n)\right] \geqslant \beta \right\} \geqslant \gamma \end{aligned}$$
(5)

where

$$\begin{aligned} Pr\left[ H_m({\mathbf {y}}_1,{\mathbf {y}}_2,\!\cdots ,\!{\mathbf {y}}_n)\right] \!=\!\int _{L_m}^{U_m}\int _{L_{m-1}}^{U_{m-1}}\!\cdots \!\int _{L_1}^{U_1}f\left( y_1, y_2,\!\cdots ,\! y_m\right) dy_1dy_2\cdots dy_m \end{aligned}$$
(6)

The simplest multivariate extension is the construction of one-dimensional intervals for each variable separately. The intersection of the various intervals then delimits the TR within which normal items will fall. This method, however, is unsatisfactory from both a formal and an operational point of view. First, it is valid only in the case of independence of the variables. Second, by combining several univariate TIs one actually inflates the probability of rejecting a normal subject, because a subject is declared “not normal" as soon as at least one of its values falls outside the corresponding univariate decision limits.
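As a rough illustration of the size of this distortion (assuming, for the sake of argument, independent variables), if every marginal interval is built with content \(\beta _0\), the Cartesian product of the m intervals has content \(\beta _0^m\), so that attaining a joint content \(\beta \) would require

$$\begin{aligned} \beta _0=\beta ^{1/m},\qquad \text{ e.g. }\quad \beta =0.90,\; m=3\;\Longrightarrow \;\beta _0\approx 0.965, \end{aligned}$$

that is, each univariate interval would have to be built with a content appreciably higher than the nominal level; with dependent variables the required adjustment is unknown.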

The problems described above may be alleviated by constructing TRs with respect to all the variables simultaneously. The construction of TRs is, nevertheless, still troublesome and can become even more problematic when the number of variables is large (cf. Wald 1943; Chatterjee and Patra 1980; Di Bucchinico et al. 2001; Li and Liu 2008; Frey 2010).

3.1 Hyper-rectangle based segmentation (HRBS)

To retain the simplicity of one-dimensional intervals, TRs should be constructed as cuboids that can be decomposed into easily interpretable univariate intervals for the individual components. The idea is to find m pairs of random variables \([L_j({\mathbf {y}}_1,{\mathbf {y}}_2,\ldots ,{\mathbf {y}}_n), U_j({\mathbf {y}}_1,{\mathbf {y}}_2,\ldots ,{\mathbf {y}}_n)], j=1,2,\ldots ,m\), such that

$$\begin{aligned} Pr\left\{ H_m\geqslant \beta \right\} \!=\!\gamma \qquad \text{ with } \quad H_m\!=\![L_1, U_1]\!\times \![L_2,U_2]\!\times \!\cdots \!\times \![L_m, U_m] \end{aligned}$$
(7)

Let \({\widetilde{\beta }}\) be the proportion of sample points actually included in the hyper-rectangle \(H_m\). One desirable property of a tolerance region is that its content is as close to \(\beta \) as possible. Since a region for which \({\widetilde{\beta }}\geqslant \beta \) may have a content significantly greater than \(\beta \), it is advisable to keep \({\widetilde{\gamma }}\), the probability of accepting the region which has produced \({\widetilde{\beta }}\), as near as is practicable to the threshold \(\gamma \), because of the trade-off between them. These considerations are clearly important in deciding the direction of the search for the best tolerance region.

The HRBS algorithm is as follows (a simplified code sketch of a single pass is given after the listed steps).

  1. Set \(\beta _{opt}\!=\!1\) and \(\gamma _{opt}\!=\!1\) and rearrange the observed vectors in a \((n\times m)\) data matrix \({\mathbf {Y}}\). Furthermore, let \(\nu _j, j=1,2,\ldots ,m\) be the maximum length of the starting and ending sequences of indices that are examined to find the order statistics which act as tolerance bounds. Unless otherwise stated, we set \(\nu _j\!=\!\nu \!-\!j\!+\!1\) with \(\nu =\lceil \pi \sqrt{n}/m\rceil \).

  2. Set \(h=1\), \(n_h\!=\!n\), \({\mathbf {Y}}^{(h)}\!=\!{\mathbf {Y}}\).

  3. Sort the rows of matrix \({\mathbf {Y}}^{(h)}\) in increasing order of the h-th column. It is useful to underline that distribution-free TRs are functions of the order statistics only and, hence, are heavily conditioned by the sample size and are generally not very satisfactory for small samples.

  4. Select a pair of indices \((r_h, s_h)\), with \(r_h\!<\!s_h\) and \(s_h\!-\!r_h\!>\!2\), where \(r_h\!\in \!\{1,2,\ldots ,\nu _h\}\) and \(s_h\!\in \!\{n_h\!-\!\theta _h\!+\!1,\ldots ,n_h\}\), respectively. Here, \(\theta _h\!=\!\min \!\left\{ \nu _h,n_h^*\right\} \) where \(n_h^*\!=\!(n_h\!+\!1)/2\) if \(n_h\) is odd and \(n_h^*\!=\!n_h/2\) if \(n_h\) is even. If \(n_h\) is sufficiently small relative to the number of variables, then it might be convenient to consider \(s_h\!\in \!\{\theta _h\!+\!1,\ldots ,n_h\}\). However, when n is large the search is constrained to pairs satisfying the symmetrical constraint

    $$\begin{aligned} s=n-r+1,\quad r< \Big \lfloor \dfrac{n}{2}\Big \rfloor \end{aligned}$$
    (8)

    which should prevent us from being thrown into the black hole of combinatorial complexity.

  5. Compute the bounds: \(L_h\!=\!y_{r_h,h}\) and \(U_h\!=\!y_{s_h,h}\).

  6. If \(h\!=\!m\), go to step 10.

  7. Delete the rows above \(s_h\) and below \(r_h\) (extremes included). This deletion is necessary to ensure that \(y_{r_h+1,h+1}, y_{r_h+2,h+1}, \ldots ,y_{s_h-1,h+1}\) can be considered \((s_h\!-\!r_h\!-\!1)\) independent observations on the random variable \(Y_{h+1}\). See Wald (1943).

  8. Set \(h\!=\!h\!+\!1\), set \(n_h\!=\!s_{h-1}\!-\!r_{h-1}\!-\!1\), and set \({\mathbf {Y}}^{(h)}\) equal to the \((n_h\times m)\) matrix obtained in the preceding step.

  9. If \(h\!\leqslant \!m\), then restart from step 3 and continue the process of defining sub-matrices and computing bounds until the hyper-rectangle \(H_m\) has been completed.

  10. Wald (1943) showed that \(H_m\) has a distribution given by the incomplete beta function with parameters \((k_m, n_m\!-\!k_m\!+\!1)\), with \(k_m\!=\!s_m\!-\!r_m\). As in the univariate case, let \(k^*\) be the integer that solves

    $$\begin{aligned} k^*=\min \left\{ k\in \{1,2,\ldots ,n_m\}: 1-I_{\beta }\left( k, n_m-k+1\right) \geqslant \gamma \right\} ,\qquad n_m>1 \end{aligned}$$

    then \({\widetilde{\gamma }}\!=\!1\!-\!I_{\beta }\left( k_m, n_m\!-\!k_m\!+\!1\right) \) is the confidence level of \(H_m\).

  11. If \({\widetilde{\gamma }}<\gamma \), then go to step 2.

  12. If \((\beta \leqslant {\widetilde{\beta }}<\beta _{opt})\cap (\gamma \leqslant {\widetilde{\gamma }}\leqslant \gamma _{opt})\) then \(\beta _{opt}\) is replaced by \({\widetilde{\beta }}\) and \(\gamma _{opt}\) by \({\widetilde{\gamma }}\). If, instead, \((\beta \leqslant {\widetilde{\beta }}=\beta _{opt})\cap (\gamma \leqslant {\widetilde{\gamma }}<\gamma _{opt})\), then \(\gamma _{opt}\) is replaced by \({\widetilde{\gamma }}\).

  13. Definition (5) implies something which has so far passed unnoticed, namely that there may be equally valid alternative solutions. If \((\beta \leqslant {\widetilde{\beta }}\!=\!\beta _{opt})\cap (\gamma \!\leqslant \!{\widetilde{\gamma }}\!=\!\gamma _{opt})\), then the preferable cuboid is the one whose perimeter, in relative terms,

    $$\begin{aligned} \delta =m^{-1}\left[ \sum _{j=1}^{m}\left( \dfrac{U_j-L_j}{R_j}\right) \right] ,\qquad R_j=y_{\left( n\right) ,j}-y_{\left( 1\right) ,j} \end{aligned}$$

    is as “short" as possible is clearly preferable.

  14. Restart from step 2 until all prefixed possibilities have been examined.
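To fix ideas, the following sketch (our own R code, not the implementation mentioned in Sect. 4) performs a single pass of steps 3-10 for one fixed symmetric choice of \((r_h, s_h)\) per dimension; the search over all admissible pairs, the optimality bookkeeping of steps 11-14 and the variable reordering of Sect. 3.2 are omitted.

```r
## One HRBS pass (steps 3-10) for a numeric (n x m) data matrix Y with no ties,
## using a single symmetric pair r_h, s_h = n_h - r_h + 1 in each dimension;
## the vector r gives the chosen r_h (so that s_h - r_h > 2 must hold).
hrbs_one_pass <- function(Y, r = rep(2, ncol(Y)), beta = 0.90) {
  m <- ncol(Y)
  L <- U <- numeric(m)
  Yh <- Y
  for (h in 1:m) {
    n_h <- nrow(Yh)
    Yh  <- Yh[order(Yh[, h]), , drop = FALSE]     # step 3: sort on column h
    r_h <- r[h]
    s_h <- n_h - r_h + 1                          # symmetrical constraint (8)
    L[h] <- Yh[r_h, h]                            # step 5: tolerance bounds
    U[h] <- Yh[s_h, h]
    if (h < m)                                    # step 7: keep interior rows only
      Yh <- Yh[(r_h + 1):(s_h - 1), , drop = FALSE]
  }
  n_m <- nrow(Yh)                                 # size of the last sub-matrix
  k_m <- s_h - r_h
  conf  <- 1 - pbeta(beta, k_m, n_m - k_m + 1)    # step 10: Wald's beta result
  cover <- mean(apply(Y, 1, function(x) all(x >= L & x <= U)))
  list(L = L, U = U, confidence = conf, sample_coverage = cover)
}

## Example: n = 200 observations on m = 3 simulated variables
set.seed(1)
Y <- matrix(rnorm(600), ncol = 3)
hrbs_one_pass(Y, r = c(3, 2, 2), beta = 0.90)
```

The full method then repeats such passes over many candidate index pairs and retains the region selected by the criteria of steps 11-13.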

The algorithm described above converges to the hyper-rectangle with the lowest confidence level greater than or equal to \(\gamma \) and the lowest actual tolerance proportion greater than or equal to \(\beta \) (restricted, of course, to the combinations of indices that have been explored). Naturally, it is not possible to ensure that the final region of HRBS is the best solution that could be obtained by examining the entire data set. If, for example, \(n\!=\!100\) and \(m\!=\!5\), then the number of different hyper-rectangles to be examined is of the order of thousands of trillions, which would not only be a challenge for computers, but would also waste resources by exploring combinations of little or no interest from a practical point of view. On the other hand, setting \(\nu _j\!=\!6\) reduces the order of magnitude to a more approachable level of hundreds of thousands. An analogous simplification is possible if we adopt the symmetrical condition (8).

Although the number of points examined in practice may appear relatively small, it should first be noted that the required coverage is usually high, leading to the selection of values near the endpoints of the intervals. Secondly, the search is conducted among the most extreme values of the data set, where the chance of finding a feasible solution is high. The prominent role of the extreme values of the sample in defining TRs may, however, make the bounds susceptible to the influence of outliers, thereby producing larger bounds than necessary. Before applying HRBS, it is therefore useful to detect (and, if necessary, discard) any outlying observations.

3.2 Reordering of the variables

The decision limits \((L_j, U_j), j=1,2,\ldots ,m\), obtained through the method proposed in the previous section, depend on the sequence in which the variables enter the construction of the hyper-rectangle. If the lengths \(\nu _j\) are small, the dependence might be expected to be negligible, so that any one of the m! orderings of the variables could be used interchangeably. However, experiments with many data sets suggest that this is not so, or at least not entirely so. Consequently, it is desirable to impose a priority ordering on the variables.

To this end, we predetermine the succession of the inclusions on the basis of the “principal variables", a subset of the original variables that preserves, as well as possible, the structure and information carried by the full set. The idea behind the principal variables is to compute the principal components of the data matrix while simultaneously reducing the number of variables to be examined. However, the purpose here is not to confine the essential information contained in the original data matrix within a low-dimensional space, but to establish the order of precedence of the variables while keeping the information content unchanged.

Let us consider the principal components of the correlation matrix of the variables currently under consideration, ranked according to the percentage of explained variance from most to least. We identify the principal variables using the technique B4 described in Jolliffe (2002, Sec. 6.3). The B4 method first selects the variable that has the highest correlation coefficient (in absolute terms) with the first principal component. This variable constitutes the first dimension of the HRBS method, with the understanding that greater variability is associated with better discriminatory capability. This is followed by the variable that has the highest absolute correlation coefficient with the second principal component. Selections subsequent to the first are limited to the variables that were not retained in the preceding selections. The procedure continues until we are left with only one candidate.
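A sketch of this selection rule, assuming the data are stored in a numeric matrix Y (the helper name b4_order is ours; prcomp with scale. = TRUE operates on the correlation matrix):

```r
## B4-type ordering of the variables (cf. Jolliffe 2002, Sec. 6.3): for each
## principal component, taken in decreasing order of explained variance, pick
## the not-yet-selected variable with the largest absolute correlation with it.
b4_order <- function(Y) {
  m   <- ncol(Y)
  pc  <- prcomp(Y, center = TRUE, scale. = TRUE)   # PCs of the correlation matrix
  ord <- integer(0)
  for (j in 1:m) {
    load <- abs(cor(Y, pc$x[, j]))                 # |correlation| with the j-th PC
    load[ord] <- -Inf                              # exclude variables already chosen
    ord <- c(ord, which.max(load))
  }
  ord                                              # column order to be used by HRBS
}

## The columns of Y are then passed to HRBS in the order b4_order(Y)
```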

4 Numerical examples

In this section, we simulate the behavior of our distribution-free tolerance regions in a controlled situation which should reveal any possible incongruities. Moreover, to demonstrate the applicability of the proposed tolerance regions, we present three examples that illustrate the main features of the new approach.

The R code implementing the techniques described in section 3 is available from the author on request.

4.1 Simulation

To assess the performance and accuracy of the HRBS method, we compare the estimated coverage \({\widetilde{\beta }}\) with the probability content \({\widehat{\beta }}\) of the cuboid \({\widetilde{H}}_m\!=\![{\widetilde{L}}_1, {\widetilde{U}}_1]\!\times \![{\widetilde{L}}_2,{\widetilde{U}}_2]\!\times \!\cdots \!\times \![{\widetilde{L}}_m, {\widetilde{U}}_m]\) under a multivariate Gaussian density, namely,

$$\begin{aligned} {\widehat{\beta }}&=Pr\left[ \left( y_1,y_2,\ldots ,y_m\right) \in {\widetilde{H}}_m\right] \nonumber \\&=\dfrac{1}{(2\pi )^{m/2}\sqrt{|\varvec{\Sigma }|}}\int _{{\widetilde{L}}_m}^{{\widetilde{U}}_m}\int _{{\widetilde{L}}_{m-1}}^{{\widetilde{U}}_{m-1}}\!\cdots \!\int _{{\widetilde{L}}_1}^{{\widetilde{U}}_1}e^{-\frac{1}{2}\left( {\mathbf {y}}-\varvec{\mu } \right) ^t\varvec{\Sigma }^{-1}\left( {\mathbf {y}}-\varvec{\mu }\right) }dy_1dy_2\cdots dy_m \end{aligned}$$
(9)

where \(\varvec{\mu }\) and \(\varvec{\Sigma }\) are known and \([{\widetilde{L}}_j, {\widetilde{U}}_j]\) are the bounds determined by the HRBS method. In more detail, N random samples are generated from the Gaussian distribution \(N_m(\varvec{\mu }, \varvec{\Sigma })\) for various sample sizes, and the HRBS bounds and \({\widehat{\beta }}\) are computed for each sample. The quantity \(\psi _1=N^{-1}\sum _{i=1}^{N}|{\widetilde{\beta }}_i\!-\!{\widehat{\beta }}_i |\) is used to measure the bias due to the distribution-free approximation of the pertinent portion of the Gaussian ellipsoid. There are many approaches to computing multivariate Gaussian probabilities. For a thorough study of the problem, see Genz and Bretz (2009). In our simulations, (9) is computed using the algorithm of Dunnett (1989), which is fast and specific to the correlation structure assumed in our experiments.
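For readers who wish to reproduce this step, the probability in (9) can also be evaluated with the general-purpose routine pmvnorm from the mvtnorm package. A minimal sketch with illustrative (not actual) HRBS bounds for the uncorrelated case:

```r
## Gaussian probability content of a cuboid, cf. (9).  L and U stand for the
## HRBS bounds (hypothetical values here); mu and Sigma are the known
## parameters of the simulated population.
library(mvtnorm)

mu    <- rep(0, 3)
Sigma <- diag(3)                    # uncorrelated case of the experiment
L     <- c(-1.9, -2.1, -2.0)        # illustrative lower bounds
U     <- c( 2.0,  1.9,  2.1)        # illustrative upper bounds

beta_hat <- pmvnorm(lower = L, upper = U, mean = mu, sigma = Sigma)
beta_hat
```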

It would also be of great interest to compare the confidence level \({\widetilde{\gamma }}\), determined by HRBS, with the probability

$$\begin{aligned} {\widehat{\gamma }}=\min _{{\widehat{\gamma }} \in \left( \gamma ,1\right) }\left\{ Pr\left[ {\widehat{\beta }}\geqslant \beta \right] \geqslant \gamma \right\} \end{aligned}$$
(10)

achieved in a Gaussian context. Since the estimated tolerance proportion \({\widehat{\beta }}\) varies across the samples simply because of the sampling variability of the bounds, it should be possible to calculate \({\widehat{\gamma }}\) directly from the sampling distribution of \({\widehat{\beta }}\), provided that this distribution is known and manageable. However, this is not the case and an approximation is needed.

To build the empirical distribution of \({\widehat{\beta }}\), we employ a Monte Carlo technique. Krishnamoorthy and Mathew (1999) remind us that it is not possible to estimate the tolerance proportion and its confidence level in a single simulation. Therefore, to simulate the sampling fluctuations of the i-th estimate \({\widehat{\beta }}_i\), we generate \(N^*\) additional samples from the multivariate Gaussian \(N_m(\bar{{\mathbf {x}}}_i, {\mathbf {S}}_i)\), where \(\bar{{\mathbf {x}}}_i\) estimates the mean vector and \({\mathbf {S}}_i\) the covariance matrix of sample i. Finally, we compute (9) for all the new samples, with \([{\widetilde{L}}_j, {\widetilde{U}}_j], j=1,2,\ldots ,m\), as limits of integration. Let \({\widehat{\beta }}_{i,j},j=1,2,\ldots ,N^*\), be the estimated coverages of the cuboids that circumscribe the m-variate Gaussian ellipsoids. The Monte Carlo estimate of the confidence level in the i-th experiment is

$$\begin{aligned} {\widehat{\gamma }}_i=1-\dfrac{1}{N^*}\sum _{j=1}^{N^*} I\!\left[ {\widehat{\beta }}_{i,j}\geqslant {\widehat{\beta }}_i\right] \end{aligned}$$
(11)

where I() is the indicator function that takes the value 1 if the argument is true and 0 otherwise. Similarly to what has been proposed for \(\beta \), the quantity \(\psi _2=N^{-1}\sum _{i=1}^{N}|{\widetilde{\gamma }}_i\!-\!{\widehat{\gamma }}_i |\) can be used to quantify the bias in the estimates of \(\gamma \).
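A sketch of this resampling step for a single experiment i, assuming the HRBS bounds L and U, the sample estimates xbar and S, the sample size n and the content beta_hat_i from (9) are already available (rmvnorm and pmvnorm are from the mvtnorm package; the function name is ours):

```r
## Monte Carlo estimate (11) of the confidence level for experiment i.
library(mvtnorm)

gamma_hat_i <- function(L, U, xbar, S, n, beta_hat_i, Nstar = 1200) {
  beta_star <- replicate(Nstar, {
    Ynew <- rmvnorm(n, mean = xbar, sigma = S)     # parametric resample
    pmvnorm(lower = L, upper = U,
            mean  = colMeans(Ynew),                # content (9) evaluated with the
            sigma = cov(Ynew))                     # resample's mean and covariance
  })
  1 - mean(beta_star >= beta_hat_i)                # eq. (11)
}
```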

In Table 1, we show simulation results for the tolerance proportion and confidence level obtained by averaging the values obtained in \(N\!=\!1000\) samples (and \(N^*\!=\!1200\) additional samples for each of the former samples). Thanks to the adoption of the symmetrical condition (8), we are able to consider starting and ending sequences (column headed \(\nu \)) of satisfactory length. Two correlation structures are considered:

  1. Uncorrelated random variables distributed as \(N_3({\mathbf {0}}_3,{\mathbf {I}}_3)\) where \({\mathbf {0}}_3\) is a \(3 \times 1\) vector of zeros and \({\mathbf {I}}_3\) is the identity matrix of order 3.

  2. Equally correlated random variables distributed as \(N_3({\mathbf {0}}_3,\varvec{\Sigma })\) with \(\sigma _{i,j}=0.9\) if \(i\ne j\) and \(\sigma _{i,i}=1\).

Table 1 Monte Carlo experiment with \(\beta \!=\!90\%\) and \(\gamma \!=\!95\%\)

The findings provide evidence that the HRBS method respects, at least formally, both the tolerance proportion \(\beta \) and the confidence level \(\gamma \) for any sample size (columns 3 and 6 of Table 1). However, when the sample size is less than or equal to 160, the estimated tolerance proportion \({\widetilde{\beta }}\) tends to be larger than the level of \({\widehat{\beta }}\) expected under a multivariate Gaussian density. This indicates a problem of under-coverage: in the case of Gaussian data, the lower and upper bounds determined by HRBS delimit a probability content that is lower than the sample coverage suggests. However, as the sample size increases, the coverages \({\widetilde{\beta }}\) and \({\widehat{\beta }}\) tend to become similar.

Another aspect that should be noted is that the tolerance proportion \({\widetilde{\beta }}\), estimated in a distribution-free manner, is more liberal than the nominal level of 90%. This fact can be easily explained by the symmetrical constraint (8) that was necessary to impose in order to keep computer time within reasonable limits, but which has had the effect of drastically reducing the number of possible combinations for searching for the optimum solution.

A final consideration regards the estimated confidence level \({\widetilde{\gamma }}\), which is lower than the level \({\widehat{\gamma }}\) achievable when the sampled population is a multivariate Gaussian. This raises some concerns about the possible liberality of the HRBS procedure when applied to Gaussian data. Additionally, the degree of liberality is noticeably greater for larger samples.

All the caveats described above apply even more strongly when variables are correlated.

The results presented in Table 1 are consistent with what can be expected from a hyper-rectangle with sides parallel to the coordinate axes, which operates mainly in the peripheral zones of the Gaussian spheroid or ellipsoid, whereas the probability mass is concentrated in the central zone. We argue that the increase in the volume of rejection of normal subjects is compensated for by the ease of interpretation and application of distribution-free bounds.

4.2 Application to real data sets

Three kinds of examples were used to check the implementation of the HRBS method outlined in Sect. 3. In particular, it is shown that the proposed method works well in separating normal subjects, whose values are included within the tolerance region, from abnormal subjects, which exceed the tolerance bounds for one or more of the variables used in the study.

4.2.1 Gender differences in elderly patients

Boyd et al. (1998) study gender differences in elderly patients with regard to calcium, phosphorus, and alkaline phosphatase levels. Patient data are available for 178 subjects (91 males and 87 females) over the age of 65. The covariates involved are Age in years, Gender, Alkaline phosphatase in international units/liter (Alkp), Laboratory, Calcium in mmol/L (Calm), and Inorganic phosphorus in mmol/L (Iphm).

One of the objectives of this research was to propose a reference range of values that may be considered “statistically normal" for calcium, inorganic phosphorus, and alkaline phosphatase. To exclude the effect of laboratories, we only consider data from the laboratory Metpath, that is, 88 observations with \(n\!=\!59\) males and \(n\!=\!29\) females. Furthermore, to bypass problems relating to duplicate rows, we added a random number in the range [0.0001, 0.0002] to each entry in the data matrix. In both data sets we set \(\nu _j\!=\!(n+1)/2\!-\!j\!+\!1, j\!=\!1,2,3\).

Table 2 shows the results obtained with Alkaline phosphatase, Calcium and Inorganic phosphorus. The columns headed \({\widehat{\gamma }}\) and \({\widehat{\beta }}\) present the actual confidence levels and coverages achieved by the various regions. The row titled “Ind.” includes the tolerance limits obtained through individual one-dimensional procedures.

Table 2 Trivariate tolerance intervals for \(\beta =80\%\) and \(\gamma =95\%\)

We first observe that both the lower and the upper bounds for female patients are systematically higher than those for male patients. This is in line with the results of Boyd et al. (1998), who found a statistically significant difference between the mean levels of men and women of 65 years of age and older. The differences are mainly concentrated at the upper bounds of each interval. It is useful to note that the order of inclusion of the covariates is different for elderly females and males, but alkaline phosphatase is always the first choice.

It is difficult to determine the relevance and credibility of these results in terms of medical statistics, but it can be seen that HRBS regions work well, at least in comparison with the intersection of the three univariate tolerance intervals, which ignores the multi-dimensionality of gender differences. Moreover, while it is true that the univariate TIs are narrower, this advantage is only apparent because it is compromised by a much lower confidence level.

4.2.2 Data on renal function

We re-analyze a data set listed in Appendix 4.2 of Harris and Boyd (1995). The data frame has 596 rows and \(m\!=\!5\) columns: alanine transaminase (ALT), aspartate transaminase (AST), urea nitrogen (BUN), serum creatinine (SCR) and uric acid (UA). The first two variables refer to liver enzymes while the remaining three are related to renal function. We have identified two records with outliers (rows 62 and 418), which are excluded from the data set. Although the number of aberrant values is small (\(2/596\approx 0.34\%\)), their deletion substantially decreases the range of the variables involved (ALT and SCR). The number of cases is now \(n\!=\!594\). Note that \(AST\!=\!176\) is a clear outlier in the AST measurements. However, inspection of the data reveals no reason to reject this observation as an experimental error. Indeed, there is some suggestion from the histogram that the distribution of aspartate transaminase measurements is long-tailed, in which case it is correct to include the suspect value.

We used the data set in two ways. First, two TRs are computed, one for each group of analytes: liver enzymes and renal variables. Second, the five analytes are used together. Table 3 shows the results with \(\beta =0.90\) and \(\gamma =0.95\). The leading sequences are \(\nu _j=x-j+1\), with \(x=297, 73, 29\) respectively for the liver enzymes, the renal variables, and all the variables together. In all of the cases, we use pairs of indices that are equidistant from the extreme positions.

Table 3 Joint and separate intervals for \(\beta =90\%\) and \(\gamma =95\%\)

As can be seen, confidence levels and tolerance proportions do not change very much across data set specifications. Even the bounds for the renal variables BUN, SCR and UA do not seem to be affected by the different formulations; indeed, here we can observe only modest changes, which can easily be attributed to sampling fluctuations. Things are different in the case of the liver enzymes, because the addition of the renal variables causes a noticeable change in the bounds of ALT and AST: the former show an increase, and the latter a decrease in magnitude. We do not have sufficient information to provide an explanation for this diversity of behavior, although we can exclude the idea that correlation effects are responsible for it, given the low values of the Pearson coefficients (a maximum of 0.32 is observed between variables of the two groups).

4.2.3 Calcium oxalate crystals data

Multivariate TRs can be used for screening patients in routine clinical practice. We describe this possibility by using a data set reported in Andrews and Herzberg (1985, pp. 249-252). This is a data frame with \(m\!=\!6\) characteristics of 79 urine specimens, which were analyzed in an effort to determine whether certain physical characteristics of the urine might be related to the formation of calcium oxalate crystals. The variables are Cx (a binary indicator 0/1 of the presence of calcium oxalate crystals), Gravity (specific gravity of the urine relative to water), Ph (pH reading of the urine), Osmo (osmolarity of the urine), Conduct (conductivity of the urine), Urea (urea concentration in millimoles per litre) and Calc (calcium concentration in millimoles per litre). Cases 1 and 55 have missing values, which we replaced with the median of the non-missing responses of the corresponding variable. The data set is divided into absence (Cx=0, \(n\!=\!45\)) and presence (Cx=1, \(n\!=\!34\)) of calcium oxalates. Table 4 reports the bounds obtained using \(\nu _j\!=\!23-j+1\) for the first data set and \(\nu _j\!=\!17-j+1\) for the second. The estimated coverage, the confidence level and the average relative range are given in the last rows of the table.

Table 4 Simultaneous tolerance intervals for \(\gamma \!=\!90\%\) and \(\beta \!=\!75\%\)

It is self-evident that Gravity and Ph have limited discriminatory power for the absence/presence of oxalates. The warning limits are better indicators, but not strikingly so, because of the overlap between the two groups of subjects.

5 Concluding remarks

Tolerance regions are enclosure sets which cover at least a given proportion \(\beta \) of subjects, supposedly normal within the population, with a degree of certainty of no less than \(\gamma \). In this paper, we introduce an innovative method for constructing a hyper-cuboidal tolerance region in m-dimensional space when nothing is known about the distribution of the random vectors except that it is continuous. The method, indicated as HRBS (hyper-rectangle based segmentation), consists of finding an interval of normal values for each variable in turn. The first variable is the one which has the largest correlation with the first principal component and the last variable is the one which correlates most strongly with the m-th principal component. The dividing lines between normal and abnormal subjects in each dimension are not predetermined, but rather arise from a supervised combinatorial algorithm that searches with a specific aim, namely to find the hyper-rectangle with the lowest confidence level greater than or equal to \(\gamma \) and the lowest actual tolerance proportion greater than or equal to \(\beta \), supported by a careful selection of separate order statistics for each variable. We have experimented with three data sets and with simulated samples of different sizes from a trivariate Gaussian distribution. The results indicate that the HRBS method provides tolerance regions that are valid even when the distributions of the variables are skewed and outliers are present. Nonetheless, our empirical findings discourage application of the proposed procedure to Gaussian data because, in this case, the confidence levels estimated by HRBS, as was entirely predictable, might be affected by an excess of liberality. The real limitation of the HRBS method, however, is its execution time, which is generally high. Our belief is that a far more effective algorithm could be designed in future studies, perhaps utilizing the potential of parallel computing.