According to EN ISO/IEC 17025 [1] and EA-4/18 [2], accredited laboratories should assure the quality of test results by participating in proficiency testing programs. Where proficiency tests are lacking because of, for example, the technical characteristics of the measurement or the low number of existing laboratories in the sector, other methods of assuring quality are accepted. However, interlaboratory comparisons (ILCs) are preferred by accreditation bodies. This is the reason why interlaboratory comparisons are often organized even if there are no reasonable methods of assessing the results.

Typical methods of assessment of ILC results are described in standards EN ISO/IEC 17043 [3] and ISO 13528 [4]. Most are based on a known assigned value (value attributed to a particular quantity and accepted [4]) and its uncertainty. This knowledge comes from preparing special samples for the purpose of the ILC, using certified reference materials (CRMs) or testing the samples at expert laboratories before the ILC. For some statistics used in the assessment of laboratory proficiency, reference laboratories are involved. When it is not possible to apply the above methods, consensus values calculated from participant results using robust analysis are recommended for the estimation of an assigned value. However, for a limited number of participating laboratories, when statistical methods become increasingly unreliable, the available literature prefers schemes based on CRMs [5].

However, it is sometimes not possible to apply a recommended method of assessment of ILC results. The assigned value is unknown. There are no reference materials, and consensus values cannot be established owing to a lack of experts. It is impossible to repeat a test on the same test item because it is destroyed during testing. The homogeneity of tested items is unknown. Moreover, tests and/or test items are expensive or unique, and thus only a small number of test results are available.

Such situations are frequent in the mechanical testing of construction products conducted to find a characteristic (type) of an unknown product [6]. An example is the mechanical testing of doors, windows, walls, panels, lintels, and small wastewater treatment systems, where both the tests and test items are often expensive. Additionally, in these situations, it is important to assure the quality of the test result because the result can directly affect safety or health. The above problem can also be encountered in laboratories that conduct chemical tests of substances/elements that are rarely encountered or expensive and in the medical testing of human tissue.

Performing an ILC test on simplified samples is one of many solutions (e.g., a laboratory that tests the load-bearing capacity of small wastewater treatment systems having tanks of about 3 m³ may take part in an ILC of the compressive strength of concrete blocks with a size of the order of dm³), but it does not provide the laboratory and its customers with a sense of security.

Technical requirements specified in EN ISO/IEC 17043 [3] and ISO 13528 [4] or the IUPAC Technical Report [5] concerning proficiency testing and ILC schemes are generally not applicable in the situation of interest; i.e., the situation of comparing a small number of laboratories and a small number of samples with no knowledge of the assigned value, when statistical criteria for the ILC can only be based on an assigned value and/or standard deviation (SD) taken from the participants. Commonly used statistical tests of consistency, such as F and t tests, seem to be useless in this case because of the high critical values for small samples, which entail a risk of false acceptance. Other statistics (e.g., χ²) are unsuitable because they require a predetermined value of variance.

The present paper addresses the question: Is it possible to show the reliability of test results and competence of laboratories in an interlaboratory comparison for a small number of possible tests, limited number of participants, no determined assigned value, and no determined permissible uncertainty? Moreover, are statistical assessments of ILC results reliable and rational? This paper considers an ILC for two laboratories, each having three samples. This issue has not been considered previously.

Common methods of assessing the consistency of test results

There are three general methods of assessing test results in an ILC:

  • assessing the difference between each result and a “true value,”

  • comparing laboratory variance (or uncertainty) with predicted, required, or known variance, and

  • assessing the comparability of laboratory results.

The last method is the most promising for our purposes because it does not require knowledge of a “true value” or predicted variance.

Typical simple methods of ILC result assessment are described in ISO 13528 [4]. In our case, there is no possibility of establishing a reference laboratory, and thus the E n number cannot be used; the z score (z) and zeta score (ζ X in this paper) should be employed instead. These are defined as

$$z = \frac{x - X}{{\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma } }},$$
$$\zeta_{X} = \frac{x - X}{{\sqrt {u_{\text{lab}}^{2} + u_{\text{av}}^{2} } }},$$

where x is the participant result, X is the assigned value, \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }\) is the SD for proficiency assessment, u lab is the combined standard uncertainty of a participant’s result, and u av is the standard uncertainty of the assigned value.

According to Eqs. (1) and (2), both z and ζ X scores are based on an assigned value (X) and the SD for proficiency assessment (\(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }\)) or standard uncertainty of the assigned value (u av ). However, Eq. (2) can be used only if x and X are independent, and therefore, X should not be calculated from the results of participants. Thus, among the statistics listed, only the z score is adopted in this work.
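The two scores defined above can be sketched in a few lines of Python. This is only an illustration: the function names, argument names, and numerical inputs are assumptions made for this example and are not taken from the standards or from the paper's data.

```python
import math


def z_score(x, assigned_value, sigma_pt):
    """Eq. (1): z = (x - X) / sigma_hat, with sigma_hat the SD for proficiency assessment."""
    return (x - assigned_value) / sigma_pt


def zeta_score(x, assigned_value, u_lab, u_av):
    """Eq. (2): zeta_X = (x - X) / sqrt(u_lab^2 + u_av^2).

    Valid only if x and X are independent (X not derived from participant results).
    """
    return (x - assigned_value) / math.sqrt(u_lab ** 2 + u_av ** 2)


# Illustrative numbers only (hypothetical, not from the paper).
z = z_score(x=5.8, assigned_value=5.0, sigma_pt=0.4)
zeta_x = zeta_score(x=5.8, assigned_value=5.0, u_lab=0.3, u_av=0.4)
print(f"z = {z:.2f}, zeta_X = {zeta_x:.2f}")
```

With these made-up inputs the same deviation from the assigned value yields different scores, because each score uses a different denominator.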

If we assume that the values of X and/or \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }\) cannot be determined by any method that is not related to the current comparison, then according to ISO 13528, they should be determined from participant results through robust analysis. It is recommended that Algorithm A [4, 7] be used to obtain robust values of the assigned value and SD. However, the question arises whether this algorithm might be used for the estimation of X and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }\) in the case under consideration, because the intention is not to use the algorithm for a small population of test results.

Robust estimators for small samples were studied by Rousseeuw et al. [8]. Obviously, robustness is not possible for n equal to 1 or 2 (where n is the number of results). When n = 3 and the location and scale are unknown, it is recommended that the location be estimated as the sample median, but there is no robust scale estimator. For n ≥ 4, the authors propose that the location be estimated using the M-estimator with a smooth ψ function, with the median absolute deviation MAD n as the auxiliary scale, and analogously, that the scale be estimated by the M-estimator with a smooth ρ function, with med n (median) as the auxiliary location. In contrast to Algorithm A, the functions used for location and scale estimation are monotonic. The question is whether employing these analyses for the estimation of X and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }\) solves the problem of assessing an ILC for a small number of tests and laboratories.

The estimation of X and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }\) could be avoided using methods of assessment that do not consider an assigned value.

Kacker et al. [9–11] and Kessel et al. [12] considered a discrepancy measure that can be used to check the agreement of test results. They discussed the Birge test, which is a classical test that was developed for checking the consistency of interlaboratory test results, specifically whether measured values might be considered as realizations of a normal probability density function with unknown expected values but known variance [9, 10]. Kacker et al. [11] showed that the Birge test is not consistent with the philosophy of the Guide to the Expression of Uncertainty in Measurement (GUM) [13]. The concept of the metrological compatibility of results consistent with VIM3 [14] and GUM has been discussed [11, 12]. According to the VIM3 definition restated in [12], two metrologically comparable results [x 1, u(x 1)] and [x 2, u(x 2)] for the same measurand are said to be metrologically compatible if

$$\zeta (x_{1} - x_{2} ) = \frac{{\left| {x_{1} - x_{2} } \right|}}{{u(x_{1} - x_{2} )}} \le \kappa ,$$

where [x i , u(x i )] denotes the measured quantity value and its standard uncertainty, and κ is the chosen threshold (conventionally having a value of two). The function ζ may be used as a measure of the significance of the difference between two results, [x 1 , u(x 1 )] and [x 2 , u(x 2 )]. Such a concept of metrological compatibility is consistent with the GUM.

If we assume that measurements of [x i , u(x i )] are uncorrelated and their weights are the same, then

$$u^{2} (x_{1} - x_{2} ) = u^{2} (x_{1} ) + u^{2} (x_{2} ),$$

and thus,

$$\zeta (x_{1} - x_{2} ) = \frac{{\left| {x_{1} - x_{2} } \right|}}{{\sqrt {u^{2} (x_{1} ) + u^{2} (x_{2} )} }}.$$
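As a brief worked illustration of this formula, take the hypothetical values x 1 = 5.0 with u(x 1) = 0.3 and x 2 = 5.8 with u(x 2) = 0.4 (assumed for this example only, not taken from the simulation):

$$\zeta (x_{1} - x_{2} ) = \frac{{\left| {5.0 - 5.8} \right|}}{{\sqrt {0.3^{2} + 0.4^{2} } }} = \frac{0.8}{0.5} = 1.6 \le 2.$$

With the conventional threshold κ = 2, these two results would be regarded as metrologically compatible. Had the second result been x 2 = 6.5 with the same uncertainties, the value would be 1.5/0.5 = 3.0, which exceeds κ.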

On the above basis, two functions are employed in this paper for the analysis of results of ILC for a small number of laboratories and small number of samples: the ζ function given by Eq. (5) and the z function given by Eq. (1). For the calculation of z, X and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }\) values are determined from participant results through robust analysis. Algorithm A according to ISO 13528 and ISO 5725-5 is used for the calculation of robust X = X A and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma } = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }_{\text{A}}\), and z A is calculated as

$$z_{\text{A}} = \frac{{x - X_{\text{A}} }}{{\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }_{\text{A}} }}.$$

Another algorithm, referred to as Algorithm B in this paper and based on robust analysis for small samples following Rousseeuw et al. [8], is employed for the calculation of X = X B and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma } = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }_{\text{B}}\), and z B is calculated as

$$z_{\text{B}} = \frac{{x - X_{\text{B}} }}{{\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }_{\text{B}} }}.$$

Parameters ζ, z A, and z B are then compared in terms of detecting the inconsistency of test results for two laboratories, each testing three samples.

Simulation of interlaboratory comparison

To compare the effectiveness of parameters ζ, z A, and z B for small samples, it is considered that two laboratories participate in ILC, and each laboratory performs three tests. During testing, test items are destroyed, and it is thus not possible to repeat a test for the same sample. Three samples of the same product are tested at each laboratory.

This paper takes a single repetition x ij (for i = laboratory 1 or 2 and j = repetition 1, 2, or 3 for each laboratory) as the test result. A relatively wide dispersion of results is assumed. Sources of this dispersion are discussed in the next section.

Simulation of interlaboratory tests is carried out using Excel Data Analysis Tool: Random Number Generation. The tool is used to generate 12 sets, with each set containing six random numbers drawn from a normal distribution with mean µ = 5 and SD σ = 1. Such a ratio between the mean and SD is typical for the example of mechanical tests of large items. Each set of six values is then divided into two parts. Each part represents simulated test results (x ij ) of one of the two laboratories LAB i , where i = 1, 2.

A discrepancy between results is created by introducing d = 1 or 2 outliers in the LAB2 results. The value of an outlier is given by

$$o = x_{ 2,j} + b,$$

where x 2,j is the jth result of laboratory 2 and b = 2, 3, 4, 5, 10 is the bias value added to x 2,j .

In the case of three “outliers,” which means that all LAB2 results differ from the results of LAB1, three random numbers (LAB2 test results) are drawn from a normal distribution with µ 2  = 5 + b and σ = 1. The results of LAB1 are unchanged. Additionally, to simulate two tests performed in each of two laboratories, the same sets of data are used but with the exclusion of the third result of each laboratory.
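The simulation scheme described above can be approximated in Python as follows. This sketch swaps the Excel random number tool for Python's standard `random` module, so it will not reproduce the paper's actual data sets. The function name `simulate_ilc`, its parameters, and the choice to place the outliers in the first d positions of LAB2 are assumptions made for illustration.

```python
import random


def simulate_ilc(seed, bias=0.0, n_outliers=0, mu=5.0, sigma=1.0, n_tests=3):
    """Return (lab1, lab2): simulated single-test results for two laboratories."""
    rng = random.Random(seed)
    lab1 = [rng.gauss(mu, sigma) for _ in range(n_tests)]
    if n_outliers >= n_tests:
        # All LAB2 results are discrepant: draw them from N(mu + b, sigma).
        lab2 = [rng.gauss(mu + bias, sigma) for _ in range(n_tests)]
    else:
        lab2 = [rng.gauss(mu, sigma) for _ in range(n_tests)]
        # Outlier o = x_2j + b; which results receive the bias is an assumption.
        for j in range(n_outliers):
            lab2[j] += bias
    return lab1, lab2


# Same seed with and without one outlier of bias b = 5.
clean1, clean2 = simulate_ilc(seed=1)
lab1, lab2 = simulate_ilc(seed=1, bias=5.0, n_outliers=1)
print(lab1, lab2)
```

Reusing the seed keeps the LAB1 results unchanged between the clean and the biased data set, which mirrors the statement that only LAB2 results are modified.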

The following three sections present the methods used to assess the simulated results of laboratories.

Method I of assessing the ILC results using the ζ function of the compatibility of test results

Function ζ defined in Eq. (5) requires only knowledge of probability density functions represented by the results of the laboratories [x 1, u(x 1)], [x 2, u(x 2)] and not knowledge of an assigned value. The result for a laboratory conducting n tests (repetitions) is

$$x_{i} = \frac{{\sum\nolimits_{j} {x_{i,j} } }}{n},$$

for j = 1, 2, …, n.

To simplify the problem, we assume that there are the three following main sources of uncertainty u(x i ).

  • The characteristic (accuracy) of measuring instruments. Uncertainty is evaluated using data provided by calibration certificates.

  • Variability due to repeatability and reproducibility of the test method. Factors affecting this variability depend on the method. In most cases, it is not possible to assess the effect of an individual factor on uncertainty and it is common to use the Type A [13] evaluation of standard uncertainty from the statistical distribution of the values obtained from a series of measurements.

  • Variability due to the tested product and its inhomogeneity. The repeatability of the test item is not dependent on the laboratory but on the type of product and its production process.

If it is possible to perform tests on items of known homogeneity or on reference materials, then it is possible to separate variability due to the laboratory from variability due to the tested product. However, in the cases considered here, there is no reasonable way of separating the effects of the tested product and test method on the variability of test results. All historical data concern a small number of tests of different products (tests are expensive, and the sample is destroyed during the test). The SD values taken from results obtained in the same laboratory differ appreciably for different types of product, and knowledge of the SD that could be assigned to laboratory uncertainty is thus unavailable. Uncertainty u(x i ) can be estimated only on the basis of the current sample. As this is the only available option in such a case, it seems justified to use the sample SD of the current results as an approximation of uncertainty u(x i ) in this article. Hence, in the ζ function (Eq. 5) used as a measure of the difference between the results of two laboratories, we use the mean of the results for laboratory i as x i and the sample SD of the results for laboratory i as u(x i ).
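A minimal Python sketch of Method I, as described in the paragraph above, is given here. The function name and the data are hypothetical. The two LAB2 data sets are constructed only to show how the sample SD enters the denominator.

```python
import math
import statistics


def zeta_from_samples(lab1, lab2):
    """zeta of Eq. (5) with x_i = laboratory mean and u(x_i) = sample SD."""
    x1, x2 = statistics.mean(lab1), statistics.mean(lab2)
    u1, u2 = statistics.stdev(lab1), statistics.stdev(lab2)
    return abs(x1 - x2) / math.sqrt(u1 ** 2 + u2 ** 2)


# Hypothetical data, chosen for easy arithmetic.
lab1 = [4.0, 5.0, 6.0]
lab2_one_outlier = [4.0, 5.0, 16.0]   # one result shifted by 10
lab2_all_shifted = [14.0, 15.0, 16.0]  # every result shifted by 10

zeta_one = zeta_from_samples(lab1, lab2_one_outlier)
zeta_all = zeta_from_samples(lab1, lab2_all_shifted)
print(f"one outlier: {zeta_one:.2f}, all shifted: {zeta_all:.2f}")
```

In this made-up example the single shifted result inflates the LAB2 sample SD, so ζ stays near 0.5, whereas shifting all three results leaves the SD unchanged and gives ζ of about 7.1.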

Method II of assessing ILC results using the z score and a robust estimator of the assigned value obtained in Algorithm A according to ISO 13528

To use the z function (Eq. 1), information on the assigned value X and its standard uncertainty is needed. Because there is no reference value and there are no expert laboratories, the calculation of the assigned value has to be based on robust estimation from participant results.

According to Algorithm A, recommended by ISO 13528, the first evaluation of the location \(X^{*}\) and scale \(s^{*}\) estimator is:

$$X^{*} = {\text{med}}\left( {x_{i} } \right)$$
$$s^{*} = 1.483\;{\text{med}} \left| {x_{i} - X^{*} } \right|$$

where i = 1, 2,…p, with p being the number of test results.

Next, estimators are derived through an iterative calculation of \(X^{*}\) and \(s^{*}\):

$$X^{*} = \sum {\frac{{x_{i}^{*} }}{p}} ,\quad {\text{where}}\quad x_{i}^{*} = \begin{cases} X^{*} - 1.5s^{*} & {\text{if}}\quad x_{i} < X^{*} - 1.5s^{*} \\ X^{*} + 1.5s^{*} & {\text{if}}\quad x_{i} > X^{*} + 1.5s^{*} \\ x_{i} & {\text{otherwise}} \end{cases},$$
$$s^{*} = 1.134\sqrt {\frac{{\sum {(x_{i}^{*} - X^{*})^{2} } }}{p - 1}} .$$

An iterative calculation according to ISO 13528 is performed until there is no change from one iteration to the next in the third significant figure of \(s^{*}\) and the equivalent in \(X^{*}\). Equation (6) is then used for the calculation of z A, where X A = \(X^{*}\) and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }_{\text{A}}\) = \(s^{*}\).

The ISO 13528 standard takes the average of all participant measurements of the test material as “result” x i . In our case, we have only two results, x 1 and x 2, referring to LAB1 and LAB2, for the calculation of \(X^{*}\) and \(s^{*}\). Using Algorithm A for p = 2 items of data, we always obtain the same absolute value of z A (ca. 0.62), regardless of the values of x 1 and x 2, which is of course useless for the assessment of laboratory performance. For this reason, in our calculations, the results of all tests performed by the two laboratories, x ij (i = 1, 2, j = 1, 2, 3), replace x i in Eqs. (10)–(13) used to estimate \(X^{*}\) and \(s^{*}\) (we then have p = 6 values of test results).
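The iteration described in this section, and the constant value of about 0.62 obtained for p = 2, can be checked with a short Python sketch. It follows the formulas given above, but the function names and the convergence tolerance (a relative tolerance rather than the third-significant-figure rule of the standard) are assumptions of this example.

```python
import math
import statistics


def algorithm_a(values, tol=1e-6, max_iter=100):
    """Robust location X* and scale s* by iterated winsorization at X* +/- 1.5 s*."""
    p = len(values)
    x_star = statistics.median(values)
    s_star = 1.483 * statistics.median([abs(v - x_star) for v in values])
    for _ in range(max_iter):
        lo, hi = x_star - 1.5 * s_star, x_star + 1.5 * s_star
        clipped = [min(max(v, lo), hi) for v in values]
        new_x = sum(clipped) / p
        new_s = 1.134 * math.sqrt(sum((c - new_x) ** 2 for c in clipped) / (p - 1))
        converged = (abs(new_s - s_star) <= tol * new_s
                     and abs(new_x - x_star) <= tol * new_s)
        x_star, s_star = new_x, new_s
        if converged:
            break
    return x_star, s_star


def z_a_for_two(x1, x2):
    """z_A of the second result when Algorithm A is fed only p = 2 values."""
    x_a, s_a = algorithm_a([x1, x2])
    return (x2 - x_a) / s_a


print(round(z_a_for_two(3.0, 7.0), 3), round(z_a_for_two(1.0, 100.0), 3))

# Pooled use with p = 6 single results (hypothetical data).
x_a, s_a = algorithm_a([4.0, 5.0, 6.0, 4.0, 5.0, 16.0])
```

For two values separated by d, no value is ever winsorized, so X* is their mean and s* = 1.134·d/√2; the score is therefore (d/2)/s* = 1/(1.134·√2) ≈ 0.62 whatever the data, which is why the pooled single results are used instead.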

Method III of assessing ILC results using the z score and a robust estimator of the assigned value obtained in Algorithm B

Following Rousseeuw et al. [8], for the estimation of X B and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }_{\text{B}}\) we use the M-estimator of location T n that is described by

$$\frac{1}{n}\sum\limits_{i = 1}^{n} {\psi \left( {\frac{{x_{i} - T_{n} }}{{S_{n} }}} \right)} = 0$$
$$\psi (x) = \frac{{{\text{e}}^{x} - 1}}{{{\text{e}}^{x} + 1}} = \tanh \left( {\frac{x}{2}} \right),$$

where T n is the location estimator and S n is the scale estimator.

By analogy with Method II, all test results obtained by the two laboratories x ij (i = 1, 2, j = 1, 2, 3) are used as x i in Eq. (14). The first evaluation X 1 of the location estimator is

$$X_{1} = {\text{med}}\;(x_{i,j}).$$

Next, T n is calculated iteratively using Eq. (14). As recommended by Rousseeuw, T n is computed using a Newton–Raphson algorithm, the code for which was developed by the author of this work (shown in “Appendix 1”).

The scale estimator (median absolute deviation) is calculated as

$$S_{n} = c_{n} \cdot 1.483 \cdot {\text{med}}\left| {x_{i} - {\text{med}}(x_{i} )} \right|,$$

where c n is a small sample correction factor, dependent on n, which ensures that the median absolute deviation is unbiased [15].

z B is then calculated according to Eq. (7), where x = x i is the participant result according to Eq. (9), X B = T n, and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\sigma }_{\text{B}}\) = S n .
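The calculation described in this section can be sketched in Python as shown below. This is not the author's Appendix 1 code but an independent illustration that solves the estimating equation for T n with ψ(x) = tanh(x/2) by Newton–Raphson steps, starting from the median. The correction factor c n is left as a parameter with a placeholder default of 1.0, because its tabulated values [15] are not reproduced in this text; all names and data are hypothetical.

```python
import math
import statistics


def psi(u):
    """Smooth psi function: (e^u - 1) / (e^u + 1) = tanh(u / 2)."""
    return math.tanh(u / 2.0)


def psi_prime(u):
    """Derivative of tanh(u / 2) with respect to u."""
    return 0.5 * (1.0 - math.tanh(u / 2.0) ** 2)


def mad_scale(values, c_n=1.0):
    """S_n = c_n * 1.483 * med|x_i - med(x_i)|; c_n = 1.0 is a placeholder."""
    med = statistics.median(values)
    return c_n * 1.483 * statistics.median([abs(v - med) for v in values])


def m_location(values, scale, tol=1e-10, max_iter=50):
    """Solve sum(psi((x_i - T) / S)) = 0 for T by Newton-Raphson."""
    t = statistics.median(values)  # first evaluation X_1
    for _ in range(max_iter):
        u = [(v - t) / scale for v in values]
        step = scale * sum(psi(ui) for ui in u) / sum(psi_prime(ui) for ui in u)
        t += step
        if abs(step) < tol * scale:
            break
    return t


# Hypothetical pooled results of two laboratories (p = 6).
lab1, lab2 = [4.0, 5.0, 6.0], [4.0, 5.0, 16.0]
pooled = lab1 + lab2
s_n = mad_scale(pooled)
t_n = m_location(pooled, s_n)
z_b_lab2 = (statistics.mean(lab2) - t_n) / s_n
print(t_n, s_n, z_b_lab2)
```

Because ψ is bounded, the extreme value 16.0 pulls T n only slightly above the median, far less than it pulls the arithmetic mean. The sketch assumes S n is non-zero, which fails if more than half of the values are identical.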

Results and discussion

Appropriate interpretation of ζ, z A, and z B is necessary to confirm agreement or to alert laboratories to a discrepancy after an ILC. It is assumed [3, 4] that z scores above 2.0 (or below −2.0) indicate discrepancy. The same critical value is commonly used for ζ [12]. If ζ has a value above 2.0, the difference between test results is deemed significant in view of their standard uncertainties.

Critical values for ζ and z scores should in practice depend on the type of test, the tested product, the aim of the test, and other risk factors. They could be derived, for example, from z-based and t-based uncertainty estimators or an unbiased uncertainty estimator (z/c 4), as has been recommended by Huang even for small samples [16]. However, the choice of threshold is not the subject of this article. The main question is whether the parameters ζ, z A, and z B are effective enough in detecting discrepancy between laboratories.

Figure 1 shows the values of ζ, z A, and z B obtained for biases b = 0, …, 10 added to the results of LAB2 (according to the described method of simulation). For one outlier introduced in LAB2, only a few values of ζ are greater than 1 and none is greater than 2, even for a bias of 10 (i.e., 10 times the SD σ). In other words, in this case, ζ is ineffective in detecting discrepancies. For each of the biases b = 0, 2, and 3, one ζ value exceeds 1, but for b = 0 this should be interpreted as a false signal.

Fig. 1 Dependence of ζ, z A, and z B for the second laboratory on the value of bias for (a) one outlier, (b) two outliers, and (c) three outliers introduced for the second laboratory. Data within a given range of the b value are arranged in ascending order by z B for panels a and b and by ζ for panel c, simply for easier visualization.

Better results concerning detection of discrepancies are obtained for z A and z B parameters.

A similar situation occurs for two outliers introduced in LAB2 results, but ζ becomes more effective and z A and z B a little less effective.

There is a notable change in the case of three outliers (Fig. 1c; Table 1). In this case, only ζ detects the discrepancy of the test results, while z A and z B do not. In fact, three outliers in LAB2 correspond to the situation in which all the results of LAB2 are incompatible with the results of LAB1, which means that the laboratories obtain completely inconsistent results.

Table 1 Effectiveness of the detection of incorrect results, expressed in numbers of ζ, z A, and z B values calculated for LAB2 that are greater than or equal to 1

The numbers of ζ, z A, and z B values that are greater than 1 are given in Table 1.

For a smaller number of results (i = 2 laboratories, j = 2 results), the effects are similar.

It appears that there is very high positive correlation between z A and z B, particularly for one and two outliers in the case that each laboratory performs tests on three samples and for one outlier in the case of two samples. Pearson product–moment correlation coefficients for z A versus z B are given in Table 1.

This high correlation is not an encouraging finding. z B is based on methods of robust location and scale estimation dedicated specifically to small samples [8], whereas Algorithm A is not intended for small samples. The location estimator T n and scale estimator S n show monotonicity, in contrast to estimators \(X^{*}\) and \(s^{*}\) obtained using Algorithm A. It turns out that this does not matter for the evaluation of ILC results using parameters such as the z score.

In the present experiment, very large discrepancies between results are introduced. Bias values are 2 to 10 times the SD σ and 40 % to 200 % of the mean µ. However, the effectiveness of the proposed ζ, z A, and z B parameters in detecting incorrect results is very low. The experiment clearly shows the difference between the types of detected discrepancies of test results, which of course results from the nature of the parameter. The ζ parameter is more effective in detecting differences between laboratories, whereas z A and z B are better for detecting a laboratory with outliers.

The findings of this experiment are not optimistic, because no statistically reliable parameter has been found for assessing the agreement of results obtained by two laboratories from a small number of tests. Does this mean that such a comparison should not be carried out? In our opinion, such a comparison definitely should be performed. The test method should provide the laboratory customer with confidence that the laboratory has a useful tool for the assessment of the conformity of the tested item with specified requirements. Decision making using sample-based location and scale estimators for very small samples is uncertain and may be different for two different laboratories. However, even if no reliable methods of interlaboratory comparison exist, such comparisons give both the laboratory and its client a slightly higher sense of security. Sometimes in such cases, the “researcher’s eye” is more useful than statistics. Given two sets of results, an experienced laboratory worker would immediately find doubtful results.

It is sometimes possible to establish simple criteria for ILC, which are harmonized with criteria for the tested product. There are many possibilities for such criteria. For example, to establish criteria that refer to the suitability of the test method for conformity assessment, one may rely on the permissible product tolerance for the tested product:

$$\frac{{U_{\text{SL}} - L_{\text{SL}} }}{\sigma } \le \kappa ,$$

where U SL and L SL are the upper and lower specification limits for the tested item, respectively, and σ is the sample SD for all LAB1 and LAB2 results. κ should of course be dependent, as mentioned earlier, on a number of factors and should help to minimize the risk of a different assessment of the tested product at two different laboratories.

Sometimes conformity assessment of a product is based on a value declared by the producer. In such a case, the best solution is to use arbitrarily established criteria based on experience of the test method and its suitability for conformity assessment. An example of such a criterion is that the SD σ (defined as above) should not be greater than, e.g., 10 % of the test result. As a test result we can use, for example, the robust value X B calculated in Algorithm B. This idea is based on the maximum permissible variance of the test results, which will allow for a meaningful assessment of the product conformity.
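The example criterion mentioned above (pooled SD not greater than 10 % of the test result) could be checked as in the following sketch. The function name and data are hypothetical, and the pooled median is used here merely as a simple stand-in for the robust value X B.

```python
import statistics


def relative_sd_ok(lab1, lab2, reference, max_ratio=0.10):
    """Return (passed, sd_all): is the SD of all results <= max_ratio * reference?"""
    sd_all = statistics.stdev(list(lab1) + list(lab2))
    return sd_all <= max_ratio * abs(reference), sd_all


# Hypothetical data sets; the median is an assumed stand-in for X_B.
close1, close2 = [4.9, 5.0, 5.1], [5.0, 5.1, 5.2]
ok_close, sd_close = relative_sd_ok(close1, close2,
                                    statistics.median(close1 + close2))

far1, far2 = [4.0, 5.0, 6.0], [4.0, 5.0, 16.0]
ok_far, sd_far = relative_sd_ok(far1, far2, statistics.median(far1 + far2))
print(ok_close, round(sd_close, 3), ok_far, round(sd_far, 3))
```

The 10 % limit is the arbitrary example value from the text; in practice it would be set from experience with the test method and the requirements for the product.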

It should be noted that test methods of this type most commonly lack precision data. Unfortunately, in the process of method development, even by standard committees, exhaustive validation, which would be a source of knowledge about the properties of the test method, is often lacking. If this were not the case, the data regarding precision (e.g., the SDs of repeatability and reproducibility) could simply be used to establish criteria for the ILC. Even if the assigned value is unknown, knowledge about the precision of the test method presents the possibility of developing a simple criterion based, for example, on values of the repeatability and reproducibility limits published in standards; e.g., the difference between laboratory results should not be greater than the reproducibility limit.

Conclusions

Requirements and rules concerning the organization of proficiency testing or ILCs and the analysis of the data obtained are not applicable to some kinds of tests, when the numbers of laboratories and tests are small and no reference values are available.

It might seem justified in such a situation to abandon actions aimed at ensuring the quality of tests through interlaboratory comparisons and to focus on other aspects, such as the high competence of personnel and the suitability of equipment. However, laboratories, particularly those responsible for carrying out tests of products that affect health and safety, tend to be concerned about the correctness of their test results. An interlaboratory comparison could help them to assess whether differences between laboratories are significant and to have more confidence in their results.

The use of statistical methods turns out to carry a high risk, particularly a high risk of falsely accepting results. The z score parameters z A and z B, based on an assigned value, are more effective in detecting a laboratory having outlier results. The ζ parameter, which is based on the difference between the results of the laboratories and their SDs as described in this article, is better for detecting differences between laboratories. The combination of the two methods (using ζ and z A or ζ and z B) can reduce the risk that one of the types of discrepancy is overlooked. However, neither of these methods, nor their combination, guarantees proper assessment, and they should not be used for the main assessment of laboratory performance in such interlaboratory comparisons. It was also shown that methods dedicated to the robust estimation of scale and location in small samples do not improve the efficiency of the “z score”-type parameter in detecting discrepancies of test results.

In our opinion, the best option is to use arbitrarily defined criteria based on the experience of laboratories, suitable for the requirements of the tested product, the aim of the tests, and other known risk factors.

Simultaneously with this work (and unexpectedly for the authors), a new version of ISO 13528 [17] was published. Its informative Annex D1 presents some conclusions on procedures for small numbers of participants. For small numbers of participants, the standard prefers external criteria that are independent of the participants’ results. The standard also underlines the unreliability of some performance-evaluation procedures when the number of participants is too small. Thus, our conclusions are consistent with the information given in the new standard.

Assessing the reliability of small populations of test results is a difficult problem that needs to be solved, not only for ILCs but also for the conformity assessment of tested products, and it will be the subject of further work by the authors of this article.