Is the assessment of interlaboratory comparison results for a small number of tests and limited number of participants reliable and rational?

Szewczak, Ewa; Bondarzewski, Adam

doi:10.1007/s00769-016-1195-y

General Paper
Open access
Published: 23 February 2016

Volume 21, pages 91–100, (2016)

Accreditation and Quality Assurance

Abstract

Tests and/or test items can sometimes be expensive, unique, or only performed in a few laboratories. There can be cases where assigned values are unknown and there is no information, or only poor information, on the probability density function attributed to the test result. Sometimes there are neither reference materials nor the ability to establish consensus values due to a lack of experts. It can be impossible to repeat a test on the same item because it is destroyed during the test itself, or the homogeneity of tested items is unknown and no criteria can be established. Specified technical requirements concerning proficiency testing and interlaboratory comparison schemes are generally not applicable in this situation. However, interlaboratory comparison could allow laboratories to have more confidence in their results. The present paper discusses three statistical methods of assessing interlaboratory comparison results obtained in such conditions. Two methods are based on an assigned value determined from participant results through robust analysis. The third is based on the compatibility of results assessed using the ζ parameter. This paper focuses on an interlaboratory comparison for two laboratories, each testing three samples. The use of statistical methods turns out to be high risk, particularly in terms of falsely accepting results. Additionally, it is shown that methods dedicated to small samples are also not efficient in detecting discrepancies of test results.

Introduction

According to EN ISO/IEC 17025 [1] and EA-4/18 [2], accredited laboratories should assure the quality of test results by participating in proficiency testing programs. Where proficiency tests are lacking because of, for example, the technical characteristics of the measurement or the low number of existing laboratories in the sector, other methods of assuring quality are accepted. However, interlaboratory comparisons (ILCs) are preferred by accreditation bodies. This is why interlaboratory comparisons are often organized even if there are no reasonable methods of assessing the results.

Typical methods of assessment of ILC results are described in standards EN ISO/IEC 17043 [3] and ISO 13528 [4]. Most are based on a known assigned value (value attributed to a particular quantity and accepted [4]) and its uncertainty. This knowledge comes from preparing special samples for the purpose of ILC, using certified reference materials (CRMs) or testing the samples at expert laboratories before the ILC. For some statistics used in the assessment of laboratory proficiency, reference laboratories are involved. When it is not possible to apply the above methods, consensus values calculated from participant results using robust analysis are recommended for the estimation of an assigned value. But for a limited number of participating laboratories, when statistical methods become increasingly unreliable, schemes based on CRMs are preferred in the available literature [5].

However, it is sometimes not possible to apply a recommended method of assessment of ILC results. The assigned value is unknown. There are neither reference materials nor the possibility of establishing consensus values, owing to a lack of experts. It is impossible to repeat a test on the same test item because it is destroyed during tests. The homogeneity of tested items is unknown. Moreover, tests and/or test items are expensive or unique, and thus, only a small number of test results are available.

Such situations are frequent in the mechanical testing of construction products conducted to find a characteristic (type) of an unknown product [6]. An example is the mechanical testing of doors, windows, walls, panels, lintels, and small wastewater treatment systems, where both the tests and test items are often expensive. Additionally, in these situations, it is important to assure the quality of the test result because the result can directly affect safety or health. The above problem can also be encountered in laboratories that conduct chemical tests of substances/elements that are rarely present or expensive and in the medical testing of human tissue.

Performing an ILC test on simplified samples is one of many solutions (e.g., a laboratory that tests the load bearing capacity of small wastewater treatment systems having tanks of about 3 m³ may take part in an ILC of the compressive strength of concrete blocks of the size order of dm³), but it does not provide the laboratory and its customers with a sense of security.

Technical requirements specified in EN ISO/IEC 17043 [3] and ISO 13528 [4] or IUPAC Technical Report [5] concerning proficiency testing and ILC schemes are generally not applicable in the situation of interest; i.e., the situation of comparing a small number of laboratories and a small number of samples with no knowledge of the assigned value, when statistical criteria for ILC can only be based on an assigned value and/or standard deviation (SD) taken from the participant results. There are commonly used statistical tests of consistency, such as F and t tests, but such statistics seem to be useless in this case because of the high critical values for small samples, which entail a risk of false acceptance. Other statistics (e.g., χ²) are unsuitable because of the need to know a predetermined value of variance.

The present paper addresses the question: Is it possible to show the reliability of test results and competence of laboratories in an interlaboratory comparison for a small number of possible tests, limited number of participants, no determined assigned value, and no determined permissible uncertainty? Moreover, are statistical assessments of ILC results reliable and rational? This paper considers ILC for two laboratories, each having three samples. This issue has not been considered previously.

Common methods of assessing the consistency of test results

There are three general methods of assessing test results in an ILC:

assessing the difference between each result and a “true value,”
comparing laboratory variance (or uncertainty) with predicted, required, or known variance, and
assessing the comparability of laboratory results.

The last method is the most promising for our purposes because it does not require knowledge of a “true value” or predicted variance.

Typical simple methods of ILC result assessment are described in ISO 13528 [4]. In our case, there is no possibility of establishing a reference laboratory, and thus, the E _n number is useless and the z score (z) and zeta score (ζ _X in this paper) should be employed instead. These are defined as

$$z = \frac{x - X}{\hat{\sigma}},$$

(1)

$$\zeta_{X} = \frac{x - X}{{\sqrt {u_{\text{lab}}^{2} + u_{\text{av}}^{2} } }},$$

(2)

where x is the participant result, X is the assigned value, $\hat{\sigma}$ is the SD for proficiency assessment, u _lab is the combined standard uncertainty of a participant’s result, and u _av is the standard uncertainty of the assigned value.

According to Eqs. (1) and (2), both z and ζ _X scores are based on an assigned value (X) and the SD for proficiency assessment ($\hat{\sigma}$) or standard uncertainty of the assigned value (u _av ). However, Eq. (2) can be used only if x and X are independent, and therefore, X should not be calculated from the results of participants. Thus, among the statistics listed, only the z score is adopted in this work.

If we assume that the values of X and/or $\hat{\sigma}$ cannot be determined by any method that is not related to the current comparison, then according to ISO 13528, they should be determined from participant results through robust analysis. It is recommended that Algorithm A [4, 7] be used to obtain robust values of the assigned value and SD. However, the question arises whether this algorithm might be used for the estimation of X and $\hat{\sigma}$ in the case under consideration, because the algorithm is not intended for a small population of test results.

Robust estimators for small samples were studied by Rousseeuw et al. [8]. Obviously, robustness is not possible for n equal to 1 or 2 (where n is the number of results). When n = 3 and the location and scale are unknown, it is recommended that the location be estimated as the sample median, but there is no robust scale estimator. For n ≥ 4, the authors propose that the location be estimated using the M-estimator with a smooth ψ function, using the median absolute deviation MAD_n as the auxiliary scale, and analogously, that the scale be estimated by the M-estimator with a smooth ρ function, using med_n (median) as the auxiliary location. In contrast to Algorithm A, the functions used for location and scale estimation are monotonic. The question is whether employing these analyses for the estimation of X and $\hat{\sigma}$ solves the problem of assessing ILC results for a small number of tests and laboratories.

The estimation of X and $\hat{\sigma}$ could be avoided using methods of assessment that do not consider an assigned value.

Kacker et al. [9–11] and Kessel et al. [12] considered a discrepancy measure that can be used to check the agreement of test results. They discussed the Birge test, which is a classical test that was developed for checking the consistency of interlaboratory test results, specifically whether measured values might be considered as realizations of a normal probability density function with unknown expected values but known variance [9, 10]. Kacker et al. [11] showed that the Birge test is not consistent with the philosophy of the Guide to the Expression of Uncertainty in Measurement (GUM) [13]. The concept of the metrological compatibility of results consistent with VIM3 [14] and GUM has been discussed [11, 12]. According to the VIM3 definition restated in [12], two metrologically comparable results [x ₁, u(x ₁)] and [x ₂, u(x ₂)] for the same measurand are said to be metrologically compatible if

$$\zeta (x_{1} - x_{2} ) = \frac{{\left| {x_{1} - x_{2} } \right|}}{{u(x_{1} - x_{2} )}} \le \kappa ,$$

(3)

where [x _i, u(x _i)] denotes the measured quantity value and its standard uncertainty, and κ is the chosen threshold (conventionally having a value of two). ζ is a function that may be used as a measure of the significance of the difference between two results, [x ₁ , u(x ₁)] and [x ₂ , u(x ₂)]. Such a concept of metrological compatibility is consistent with the GUM.

If we assume that measurements of [x _i , u(x _i )] are uncorrelated and their weights are the same, then

$$u^{2} (x_{1} - x_{2} ) = u^{2} (x_{1} ) + u^{2} (x_{2} ),$$

(4)

and thus,

$$\zeta (x_{1} - x_{2} ) = \frac{{\left| {x_{1} - x_{2} } \right|}}{{\sqrt {u^{2} (x_{1} ) + u^{2} (x_{2} )} }}.$$

(5)
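As a minimal sketch, Eqs. (3) and (5) can be written in a few lines of Python. The function names `zeta` and `compatible` and the numerical inputs are illustrative assumptions, not part of the original work; the default threshold κ = 2 is the conventional value mentioned above.

```python
import math

def zeta(x1, u1, x2, u2):
    """Eq. (5): |x1 - x2| / sqrt(u(x1)^2 + u(x2)^2) for uncorrelated results."""
    return abs(x1 - x2) / math.sqrt(u1 ** 2 + u2 ** 2)

def compatible(x1, u1, x2, u2, kappa=2.0):
    """Eq. (3): results are metrologically compatible if zeta <= kappa."""
    return zeta(x1, u1, x2, u2) <= kappa

# Illustrative values only (not data from the paper)
print(zeta(10.0, 3.0, 15.0, 4.0))        # 5 / sqrt(9 + 16) = 1.0
print(compatible(10.0, 3.0, 15.0, 4.0))  # True, because 1.0 <= 2.0
```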

On the above basis, two functions are employed in this paper for the analysis of results of ILC for a small number of laboratories and small number of samples: the ζ function given by Eq. (5) and the z function given by Eq. (1). For the calculation of z, X and $\hat{\sigma}$ values are determined from participant results through robust analysis. Algorithm A according to ISO 13528 and ISO 5725-5 is used for the calculation of robust X = X _A and $\hat{\sigma} = \hat{\sigma}_{\text{A}}$, and z _A is calculated as

$$z_{\text{A}} = \frac{{x - X_{\text{A}} }}{{\hat{\sigma}_{\text{A}} }}.$$

(6)

Another algorithm, referred to as Algorithm B in this paper and based on robust analysis for small samples following Rousseeuw et al. [8], is employed for the calculation of X = X _B and $\hat{\sigma} = \hat{\sigma}_{\text{B}}$, and z _B is calculated as

$$z_{\text{B}} = \frac{{x - X_{\text{B}} }}{{\hat{\sigma}_{\text{B}} }}.$$

(7)

Parameters ζ, z _A, and z _B are then compared in terms of detecting the inconsistency of test results for two laboratories, each testing three samples.

Simulation of interlaboratory comparison

To compare the effectiveness of parameters ζ, z _A, and z _B for small samples, it is considered that two laboratories participate in ILC, and each laboratory performs three tests. During testing, test items are destroyed, and it is thus not possible to repeat a test for the same sample. Three samples of the same product are tested at each laboratory.

This paper takes a single repetition x _ij (for i = laboratory 1 or 2 and j = repetition 1, 2, or 3 for each laboratory) as the test result. A relatively wide dispersion of results is assumed. Sources of this dispersion are discussed in the next section.

Simulation of interlaboratory tests is carried out using Excel Data Analysis Tool: Random Number Generation. The tool is used to generate 12 sets, with each set containing six random numbers drawn from a normal distribution with mean µ = 5 and SD σ = 1. Such a ratio between the mean and SD is typical for the example of mechanical tests of large items. Each set of six values is then divided into two parts. Each part represents simulated test results (x _ij) of one of the two laboratories LAB_i, where i = 1, 2.

A discrepancy between results is created by introducing d = 1 or 2 outliers in the LAB₂ results. The value of an outlier is given by

$$o = x_{ 2,j} + b,$$

(8)

where x _2,j is the jth result of laboratory 2 and b = 2, 3, 4, 5, 10 is the bias value added to x _2,j.

In the case of three “outliers,” which means that all LAB₂ results differ from the results of LAB₁, three random numbers (LAB₂ test results) are drawn from a normal distribution with µ ₂ = 5 + b and σ = 1. The results of LAB₁ are unchanged. Additionally, to conduct a simulation of two tests performed in two laboratories, the same sets of data are used but with the exclusion of the third result of each laboratory.
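The simulation scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the original simulation used the Excel random number tool, the function name `simulate_ilc` and the seed are hypothetical, and the choice of which LAB₂ results receive the bias (here, the first ones) is not specified in the text.

```python
import random

def simulate_ilc(bias, n_outliers, n_per_lab=3, mu=5.0, sigma=1.0, rng=None):
    """Simulate one ILC data set for two laboratories; returns (lab1, lab2)."""
    if rng is None:
        rng = random.Random()
    # One set of 2 * n_per_lab values from N(mu, sigma), split between the labs
    values = [rng.gauss(mu, sigma) for _ in range(2 * n_per_lab)]
    lab1, lab2 = values[:n_per_lab], values[n_per_lab:]
    if n_outliers >= n_per_lab:
        # All LAB2 results differ: redraw them from N(mu + bias, sigma)
        lab2 = [rng.gauss(mu + bias, sigma) for _ in range(n_per_lab)]
    else:
        # Eq. (8): o = x_2j + b, applied here to the first n_outliers results
        lab2 = [x + bias if j < n_outliers else x for j, x in enumerate(lab2)]
    return lab1, lab2

# Example: one outlier with bias b = 3 (the seed is an arbitrary assumption)
lab1, lab2 = simulate_ilc(bias=3, n_outliers=1, rng=random.Random(2016))
print(lab1, lab2)
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when the same base data set is to be reused with several bias values.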

The following three sections present the methods used to assess the simulated results of laboratories.

Method I of assessing the ILC results using the ζ function of the compatibility of test results

Function ζ defined in Eq. (5) requires only knowledge of probability density functions represented by the results of the laboratories [x ₁, u(x ₁)], [x ₂, u(x ₂)] and not knowledge of an assigned value. The result for a laboratory conducting n tests (repetitions) is

$$x_{i} = \frac{{\sum\nolimits_{j} {x_{i,j} } }}{n},$$

(9)

for j = 1, 2…n.

To simplify the problem, we assume that there are the three following main sources of uncertainty u(x _i).

The characteristic (accuracy) of measuring instruments. Uncertainty is evaluated using data provided by calibration certificates.
Variability due to repeatability and reproducibility of the test method. Factors affecting this variability depend on the method. In most cases, it is not possible to assess the effect of an individual factor on uncertainty and it is common to use the Type A [13] evaluation of standard uncertainty from the statistical distribution of the values obtained from a series of measurements.
Variability due to the tested product and its inhomogeneity. The repeatability of the test item is not dependent on the laboratory but on the type of product and its production process.

If it is possible to perform tests on items of known homogeneity or on reference materials, then it is possible to separate variability due to the laboratory from variability due to the tested product. However, in the cases considered here, there is no reasonable way of separating the effects of the tested product and test method on the variability of test results. All historical data concern a small number of tests of different products (tests are expensive, and the sample is destroyed during the test). The SD values taken from results obtained in the same laboratory differ appreciably for different types of product, and knowledge of the SD that could be assigned to laboratory uncertainty is thus unavailable. Uncertainty u(x _i) can be estimated only on the basis of the current sample. As this is the only available option in such a case, it seems justified to use the sample SD of current results as an approximation of uncertainty u(x _i) in this article. Hence, in the ζ function (Eq. 5) used as a measure of the difference between the results of two laboratories, we use the mean of the results for laboratory i as x _i and the sample SD of results for laboratory i as u(x _i).
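Method I as described above can be sketched as follows, assuming Python and its standard `statistics` module. The function name `zeta_from_samples` and the data are illustrative assumptions, not values from the simulation.

```python
import math
import statistics

def zeta_from_samples(lab1, lab2):
    """Eq. (5) with x_i = laboratory mean (Eq. 9) and u(x_i) = sample SD."""
    x1, x2 = statistics.mean(lab1), statistics.mean(lab2)
    u1, u2 = statistics.stdev(lab1), statistics.stdev(lab2)
    return abs(x1 - x2) / math.sqrt(u1 ** 2 + u2 ** 2)

lab1 = [4.0, 5.0, 6.0]
# One result of LAB2 shifted by b = 10: the outlier inflates the sample SD
print(zeta_from_samples(lab1, [4.0, 5.0, 16.0]))    # about 0.5
# All three LAB2 results shifted by b = 10: the SD is unaffected
print(zeta_from_samples(lab1, [14.0, 15.0, 16.0]))  # about 7.1
```

The two calls illustrate why a single outlier is hard to detect with this approach: the outlier enlarges u(x ₂) together with the difference of means, whereas a shift of all results changes only the difference.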

Method II of assessing ILC results using the z score and a robust estimator of the assigned value obtained in Algorithm A according to ISO 13528

To use the z function (Eq. 1), information on the assigned value X and its standard uncertainty is needed. Because there is no reference value and there are no expert laboratories, the calculation of the assigned value has to be based on robust estimation from participant results.

According to Algorithm A, recommended by ISO 13528, the initial estimates of the location $X^{*}$ and scale $s^{*}$ are

$$X^{*} = {\text{med}}\left( {x_{i} } \right)$$

(10)

$$s^{*} = 1.483\;{\text{med}} \left| {x_{i} - X^{*} } \right|$$

(11)

where i = 1, 2,…p, with p being the number of test results.

Next, estimators are derived through an iterative calculation of $X^{*}$ and $s^{*}$:

$$X^{*} = \sum {\frac{{x_{i}^{*} }}{p}} ,\quad {\text{where}}\quad x_{i}^{*} = \begin{cases} X^{*} - 1.5s^{*} & {\text{if}}\;\; x_{i} < X^{*} - 1.5s^{*} \\ X^{*} + 1.5s^{*} & {\text{if}}\;\; x_{i} > X^{*} + 1.5s^{*} \\ x_{i} & {\text{otherwise}} \end{cases}$$

(12)

$$s^{*} = 1.134\sqrt {\frac{{\sum {(x_{i}^{*} - X^{*})^{2} } }}{p - 1}} .$$

(13)

The iterative calculation according to ISO 13528 is repeated until there is no change from one iteration to the next in the third significant figure of $s^{*}$ and the equivalent figure of $X^{*}$. Equation (6) is then used for the calculation of z _A, where X _A = $X^{*}$ and $\hat{\sigma}_{\text{A}}$ = $s^{*}$.

The ISO 13528 standard takes the average of all participant measurements of the test material as “result” x _i. In our case, we have only two results x ₁ and x ₂, referring to LAB₁ and LAB₂, for the calculation of $X^{*}$ and $s^{*}$. Using Algorithm A for p = 2 items of data, we always obtain the same z _A (ca. 0.62), regardless of the values of x ₁ and x ₂, which is of course useless for the assessment of laboratory performance. For this reason, in our calculation, the results of all tests performed by the two laboratories, x _ij (i = 1, 2, j = 1, 2, 3), replace x _i in Eqs. (10)–(13) used to estimate $X^{*}$ and $s^{*}$ (we then have p = 6 values of test results).
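A minimal Python sketch of Eqs. (10)–(13) is given below; it also reproduces the observation that p = 2 always yields z _A ≈ 0.62. The function name `algorithm_a` is an assumption, and the stopping rule uses a small absolute tolerance `tol` as a simplification of the "third significant figure" criterion of ISO 13528.

```python
import math
import statistics

def algorithm_a(values, tol=1e-6, max_iter=100):
    """Robust location X* and scale s* following Eqs. (10)-(13)."""
    p = len(values)
    x_star = statistics.median(values)                                    # Eq. (10)
    s_star = 1.483 * statistics.median([abs(x - x_star) for x in values]) # Eq. (11)
    for _ in range(max_iter):
        delta = 1.5 * s_star
        # Eq. (12): pull values outside X* +/- 1.5 s* back to the limits
        clipped = [min(max(x, x_star - delta), x_star + delta) for x in values]
        new_x = sum(clipped) / p
        # Eq. (13)
        new_s = 1.134 * math.sqrt(sum((c - new_x) ** 2 for c in clipped) / (p - 1))
        converged = abs(new_x - x_star) <= tol and abs(new_s - s_star) <= tol
        x_star, s_star = new_x, new_s
        if converged:
            break
    return x_star, s_star

# With p = 2 the score does not depend on the data at all
x_a, s_a = algorithm_a([4.0, 6.0])
print(round((6.0 - x_a) / s_a, 2))  # 0.62
```

For p = 2 neither value is ever clipped, so X* is the mean and s* = 1.134·|x ₁ − x ₂|/√2, which gives z _A = 1/(1.134·√2) ≈ 0.62 for any pair of results.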

Method III of assessing ILC results using the z score and a robust estimator of the assigned value obtained in Algorithm B

Following Rousseeuw et al. [8], for the estimation of X _B and $\hat{\sigma}_{\text{B}}$ we use the M-estimator of location T _n, which is described by

$$\frac{1}{n}\sum\limits_{i = 1}^{n} {\psi \left( {\frac{{x_{i} - T_{n} }}{{S_{n} }}} \right)} = 0$$

(14)

$$\psi (x) = \frac{{{\text{e}}^{x} - 1}}{{{\text{e}}^{x} + 1}} = \tanh \left( {\frac{x}{2}} \right),$$

(15)

where T _n is the location estimator and S _n is the scale estimator.

By analogy with Method II, all test results obtained by the two laboratories x _ij (i = 1, 2, j = 1, 2, 3) are used as x _i in Eq. (14). The first evaluation X ₁ of the location estimator is

$$X_{1} = {\text{med}}\;(x{}_{i,j}).$$

(16)

Next T _n is iteratively calculated using Eq. (14). As recommended by Rousseeuw, T _n is computed using a Newton–Raphson algorithm, the code of which was developed by the author of this work (shown in “Appendix 1”).

The scale estimator (median absolute deviation) is calculated as

$$S_{n} = c_{n} \cdot 1.483 \cdot {\text{med}}\left| {x_{i} - {\text{med}}(x_{i} )} \right|,$$

(17)

where c _n is a small sample correction factor, dependent on n, which ensures that the median absolute deviation is unbiased [15].

z _B is then calculated according to Eq. (7), where x = x _i is the participant result according to Eq. (9), X _B = T _n, and $\hat{\sigma}_{\text{B}}$ = S _n.
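The following Python sketch illustrates Eqs. (14)–(17). It is not the code of Appendix 1 but an independent illustration under stated assumptions: the names `mad_scale` and `m_location` are hypothetical, the Newton–Raphson update is derived directly from Eq. (14) with ψ(x) = tanh(x/2), and the small sample correction factor c _n must be supplied by the user (the default of 1.0 is only a placeholder, not a value taken from [15]).

```python
import math
import statistics

def mad_scale(values, c_n=1.0):
    """Eq. (17): S_n = c_n * 1.483 * med|x_i - med(x_i)| (c_n supplied by user)."""
    med = statistics.median(values)
    return c_n * 1.483 * statistics.median([abs(x - med) for x in values])

def m_location(values, scale, tol=1e-9, max_iter=50):
    """Solve Eq. (14) for T_n by Newton-Raphson with psi(x) = tanh(x / 2)."""
    t = statistics.median(values)  # starting value, Eq. (16)
    for _ in range(max_iter):
        psi = [math.tanh((x - t) / (2 * scale)) for x in values]
        dpsi = [0.5 * (1 - v * v) for v in psi]  # derivative of tanh(r / 2)
        step = scale * sum(psi) / sum(dpsi)      # Newton-Raphson update of T_n
        t += step
        if abs(step) < tol:
            break
    return t

# Illustrative pooled data for two laboratories (not from the paper)
lab1, lab2 = [4.0, 5.0, 6.0], [4.0, 5.0, 16.0]
pooled = lab1 + lab2
s_n = mad_scale(pooled)
t_n = m_location(pooled, s_n)
z_b = [(statistics.mean(lab) - t_n) / s_n for lab in (lab1, lab2)]  # Eq. (7)
print(t_n, s_n, z_b)
```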

Results and discussion

Appropriate interpretation of ζ, z _A, and z _B is necessary to confirm agreement or to alert laboratories of discrepancy after ILC. It is assumed [3, 4] that z scores above 2.0 (or below −2.0) indicate discrepancy. The same critical value is commonly used for ζ [12]. If ζ has a value above 2.0, the difference between test results is deemed significant in view of their standard uncertainties.

Critical values for ζ and z scores should in practice depend on the type of test, tested product, the aim of the test, and other risk factors. They could be derived, for example, from z-based and t-based uncertainty estimators or an unbiased uncertainty estimator (z/c ₄), as has been recommended by Huang even for small samples [16]. However, the choice of threshold is not the subject of this article. The main question is whether the parameters ζ, z _A, and z _B are effective enough in detecting discrepancy between laboratories.

Figure 1 shows the values of ζ, z _A, and z _B obtained for biases b = 0,…10 added to the results of LAB₂ (according to the described method of simulation). For one outlier introduced in LAB₂, only a few values of ζ are greater than 1 and none is greater than 2, even for a bias of 10 (i.e., 10 times the SD σ). In other words, in this case, ζ is ineffective in detecting discrepancies. For each bias b = 0, 2, and 3, one ζ value exceeds 1, but for b = 0 this should be interpreted as a false signal.

Better results concerning detection of discrepancies are obtained for z _A and z _B parameters.

A similar situation occurs for two outliers introduced in LAB₂ results, but ζ becomes more effective and z _A and z _B a little less effective.

There is a notable change in the case of three outliers (Fig. 1c; Table 1). In this case, only ζ detects the discrepancy of the test results, while z _A and z _B do not. In fact, three outliers in LAB₂ correspond to the situation in which all the results of LAB₂ are incompatible with the results of LAB₁, which means that the laboratories obtain completely inconsistent results.

Table 1 Effectiveness of the detection of incorrect results, expressed in numbers of ζ, z _A, and z _B values calculated for LAB₂ that are greater than or equal to 1

The numbers of ζ, z _A, and z _B values that are greater than 1 are given in Table 1.

For a smaller number of results (i = 2 laboratories, j = 2 results), the effects are similar.

It appears that there is very high positive correlation between z _A and z _B, particularly for one and two outliers in the case that each laboratory performs tests on three samples and for one outlier in the case of two samples. Pearson product–moment correlation coefficients for z _A versus z _B are given in Table 1.

This strong correlation is not an advantage. z _B is based on methods of robust location and scale estimation dedicated specifically to small samples [8], whereas Algorithm A is not intended for small samples. The location estimator T _n and scale estimator S _n show monotonicity, in contrast to estimators $X^{*}$ and $s^{*}$ obtained using Algorithm A. It turns out that this does not matter for the evaluation of ILC results using parameters such as the z score.

In the present experiment, very large discrepancies between results are introduced. Bias values are 2,…10 times the SD σ and 40,…200 % of mean µ. However, the effectiveness of proposed ζ, z _A, and z _B parameters in detecting incorrect results is very low. The experiment clearly shows the difference between the types of detected discrepancies of test results, which of course results from the nature of the parameter. The ζ parameter is more effective in detecting differences between laboratories, whereas z _A and z _B are better for detecting a laboratory with outliers.

The findings of this experiment are not optimistic, because no statistically reliable parameter for assessing the compliance of results, obtained by two laboratories and for a small number of test results, has been found. Does this mean that such a comparison should not be carried out? In our opinion, such a comparison definitely should be performed. The test method should provide the laboratory customer with confidence that the laboratory has a useful tool for the assessment of the conformity of the tested item with specified requirements. Decision making using sample-based location and scale estimators for very small samples is uncertain and may be different for two different laboratories. However, even if no reliable methods of interlaboratory comparison exist, such comparisons give both the laboratory and its client a slightly higher sense of security. Sometimes in such cases, the “researcher’s eye” is more useful than statistics. If we take two sets of results, an experienced laboratory worker would immediately find doubtful results.

It is sometimes possible to establish simple criteria for ILC, which are harmonized with criteria for the tested product. There are many possibilities for such criteria. For example, to establish criteria that refer to the suitability of the test method for conformity assessment, one may rely on the permissible product tolerance for the tested product:

$$\frac{{U_{\text{SL}} - L_{\text{SL}} }}{\sigma } \le \kappa ,$$

(18)

where U _SL and L _SL are the upper and lower specification limits for the tested item, respectively, and σ is the sample SD for all LAB₁ and LAB₂ results. κ should of course be dependent, as mentioned earlier, on a number of factors and should help to minimize the risk of a different assessment of the tested product at two different laboratories.
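The left-hand side of Eq. (18) can be computed as in the short Python sketch below. The function name `tolerance_ratio`, the specification limits, and the data are illustrative assumptions; the resulting ratio is then compared with the chosen κ as stated in Eq. (18).

```python
import statistics

def tolerance_ratio(lab1, lab2, usl, lsl):
    """Left-hand side of Eq. (18): (U_SL - L_SL) / sigma.

    sigma is the sample SD of all LAB1 and LAB2 results taken together.
    """
    sigma = statistics.stdev(list(lab1) + list(lab2))
    return (usl - lsl) / sigma

# Illustrative values only (not data from the paper)
ratio = tolerance_ratio([4.0, 5.0, 6.0], [4.0, 5.0, 6.0], usl=7.0, lsl=3.0)
print(ratio)  # to be compared with the chosen kappa
```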

Sometimes conformity assessment of a product is based on a value declared by the producer. In such a case, the best solution is to use arbitrarily established criteria based on experience of the test method and its suitability for conformity assessment. An example of such a criterion is that the SD σ (defined as above) should not be greater than, e.g., 10 % of the test result. As a test result we can use, for example, the robust value X _B calculated in Algorithm B. This idea is based on the maximum permissible variance of the test results, which will allow for a meaningful assessment of the product conformity.

It should be noted that test methods of this type most commonly lack precision data. Unfortunately, in the process of method development, even by standard committees, exhaustive validation, which would be a source of knowledge about the properties of the test method, is often lacking. If it were not so, the data regarding precision (e.g., the SDs of repeatability and reproducibility) could simply be used to establish criteria for the ILC. Even if the assigned value is unknown, knowledge about the precision of the test method presents the possibility of developing a simple criterion based, for example, on values of the repeatability and reproducibility limits published in standards; e.g., the difference between laboratory results should not be greater than the reproducibility limit.
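Where a reproducibility limit is published for the test method, the last criterion could be checked as in the following Python sketch. The function name `within_reproducibility_limit`, the use of laboratory means as the "laboratory results", and the numerical limit are assumptions for illustration; the limit itself must be taken from the relevant standard.

```python
import statistics

def within_reproducibility_limit(lab1, lab2, limit_r):
    """True if the laboratory means differ by no more than limit_r."""
    return abs(statistics.mean(lab1) - statistics.mean(lab2)) <= limit_r

# Illustrative values only; limit_r = 1.5 is a hypothetical published limit
print(within_reproducibility_limit([4.0, 5.0, 6.0], [5.0, 6.0, 7.0], 1.5))     # True
print(within_reproducibility_limit([4.0, 5.0, 6.0], [14.0, 15.0, 16.0], 1.5))  # False
```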

Conclusions

Requirements and rules concerning the organization of proficiency testing or ILC and the analysis of data obtained are not applicable for some kinds of tests, when the numbers of laboratories and tests are small and no reference values are available.

It seems to be justified in such a situation to abandon actions aimed at ensuring the quality of tests by conducting interlaboratory comparisons and to focus on other aspects, such as the high competence of personnel and the suitability of equipment. However, laboratories, particularly those responsible for carrying out tests of products that affect health and safety, tend to be concerned about the correctness of their test results. An interlaboratory comparison could help them assess whether differences between laboratories are significant and have more confidence in their results.

The use of statistical methods turns out to have high risk, particularly a high risk of falsely accepting results. The z score parameters z _A and z _B, based on an assigned value, are more effective in detecting a laboratory having outlier results. The ζ parameter, which is based on the difference in results of laboratories and their SDs as described in this article, is better for detecting differences between laboratories. The combination of the two methods (using ζ and z _A or ζ and z _B) can reduce the risk that one of the types of discrepancy is overlooked. However, neither of these methods nor their combination guarantees proper assessment, and they should not be used for the main assessment of laboratory performance in such interlaboratory comparisons. It was also shown that methods dedicated to the robust estimation of scale and location in small samples do not improve the efficiency of the “z score”-type parameter in detecting discrepancies of test results.

In our opinion, the best option is to use arbitrarily defined criteria based on the experience of laboratories, suitable for the requirements of the tested product, the aim of the tests, and other known risk factors.

Simultaneously to this work (unexpectedly for authors of the paper), new version of ISO 13528 [17] has been published. In informative Annex D1 some conclusions on procedures for small numbers of participants has been shown. The external criteria independent of the participants’ results are preferred in ISO for small number of participants. Also unreliability of some procedures used for the performance evaluation for too small number of participants has been underlined in the standard. Thus, our conclusions are consistent with information given in the new standard.

Assessing the reliability of small populations of test results is a difficult problem, but one that must be solved, not only for ILC but also for the conformity assessment of tested products. It will be the subject of further work by the authors of this article.

References

1. ISO/IEC 17025 (2005) General requirements for the competence of testing and calibration laboratories. International Organization for Standardization/International Electrotechnical Commission, Geneva
2. EA-4/18 (2010) Guidance on the level and frequency of proficiency testing participation. http://www.european-accreditation.org/publication/ea-4-18-inf-rev00-june-2010
3. ISO/IEC 17043 (2010) Conformity assessment: general requirements for proficiency testing. International Organization for Standardization/International Electrotechnical Commission, Geneva
4. ISO 13528 (2005) Statistical methods for use in proficiency testing by interlaboratory comparisons. International Organization for Standardization, Geneva
5. Kuselman I, Fajgelj A (2010) IUPAC/CITAC Guide: Selection and use of proficiency testing schemes for limited number of participants - chemical analytical laboratories (IUPAC Technical Report). Pure Appl Chem 82(5):1099–1135
6. Regulation (EU) No 305/2011 of the European Parliament and of the Council of 9 March 2011 laying down harmonised conditions for the marketing of construction products and repealing Council Directive 89/106/EEC
7. ISO 5725-5 (1998) Accuracy (trueness and precision) of measurement methods and results. Part 5: alternative methods for the determination of the precision of a standard measurement method. International Organization for Standardization, Geneva
8. Rousseeuw PJ, Verboven S (2002) Robust estimation in very small samples. Comput Stat Data Anal 40:741–758
9. Kacker RN, Forbes AB, Kessel R, Sommer K-D (2008) Bayesian posterior predictive p-value of statistical consistency in interlaboratory evaluations. Metrologia 45:512–523
10. Kacker RN, Forbes A, Kessel R, Sommer K-D (2008) Classical and Bayesian interpretation of the Birge test of consistency and its generalized version for correlated results from interlaboratory evaluation. Metrologia 45:257–264
11. Kacker RN, Forbes A, Kessel R, Sommer K-D (2010) Assessing differences between results determined according to the guide to the expression of uncertainty in measurement. J Res Natl Inst Stand Technol 115:453–459
12. Kessel R, Kacker RN (2011) Combining results from multiple evaluations of the same measurand. J Res Natl Inst Stand Technol 116:809–820
13. JCGM 100 (2008) Evaluation of measurement data: guide to the expression of uncertainty in measurement (GUM). http://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf
14. JCGM 200 (2012) International vocabulary of metrology: basic and general concepts and associated terms (VIM), 3rd edn. http://www.bipm.org/vim
15. Brugger RM (1969) Note on unbiased estimation of the standard deviation. Am Stat 23(4):32
16. Huang H (2015) Optimal estimator for uncertainty-based measurement quality control. Accred Qual Assur 20:97–106
17. ISO 13528 (2015) Statistical methods for use in proficiency testing by interlaboratory comparisons. International Organization for Standardization, Geneva

Author information

Authors and Affiliations

Instytut Techniki Budowlanej, Filtrowa 1, 00-611, Warszawa, Poland
Ewa Szewczak & Adam Bondarzewski

Corresponding author

Correspondence to Ewa Szewczak.

Appendix 1: code for calculation of location and scale estimators according to Algorithm B.

(MATLAB language, MATLAB R2014a (8.3.0.532) by MathWorks, Inc.)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Szewczak, E., Bondarzewski, A. Is the assessment of interlaboratory comparison results for a small number of tests and limited number of participants reliable and rational?. Accred Qual Assur 21, 91–100 (2016). https://doi.org/10.1007/s00769-016-1195-y

Received: 10 November 2015
Accepted: 22 January 2016
Published: 23 February 2016
Issue Date: April 2016
DOI: https://doi.org/10.1007/s00769-016-1195-y
