Inter-laboratory proficiency testing of the measurement of gypsum parameters with small numbers of participants

Conducting proficiency tests with large numbers of participants is not a problem because the ISO standard 13528:2015 describes many indicators that allow for proper assessment. However, some proficiency testing (PT) schemes involve only a few participants. In such situations, difficulties arise in selecting proper indicators and criteria for assessing the participants' achievements. The Institute for Chemical Processing of Coal in Zabrze, Poland organised a round of PT for the determination of gypsum parameters. The results from six participants from Polish research entities and independent laboratories and the methodology for organising this PT are presented. The performance evaluation criteria were determined from the participants' results because a metrologically valid procedure could not be used. The performance of the participants was evaluated using z′ and zeta scores.


Introduction
The use of fossil fuels as a raw material is associated with the emission of pollutants into the environment, including sulphur dioxide, nitrogen oxides, and dust. Most of the power units of Polish power plants were built between 1970 and 1980. Poland's admission into the European Union (EU) forced Polish law to be updated to accord with EU guidelines, which resulted in tougher emission limits. To fulfil these obligations in the years from 1994 to 2013, 12 exhaust power plants were equipped with flue gas cleaning installations (IOS). Most commonly, these installations use wet flue gas desulphurisation, and a by-product of this treatment is synthetic gypsum. In 2017, the best available techniques (BAT) conclusions in the field of energy combustion were published. The emission limits set out in this document will become effective in 2021. To meet the requirements, a number of investments have been made in relation to the launches of new power units to replace depleted units. Along with the construction of blocks, new IOS installations are being built, and existing installations are being modernised.
It is estimated that, in 2020, the gypsum production capacity will increase by 1.25 tonne of Mg. The analytical laboratories of power plants that had previously focused on testing solid fuels and circulating waters were forced to establish the equipment base and research methodology for the field of limestone and gypsum testing. Most commonly, upon the purchase of IOS technology, the provider implements this research methodology in the form of procedures that are used to supervise the proper execution of the desulphurisation process. For this reason, the research methodology has been optimised in terms of the time required to obtain results, and this optimisation is associated with a decrease in the accuracy of the method. The final stage of the implementation and validation of these methods is interlaboratory proficiency testing (PT), which later becomes an invaluable tool for the early detection of anomalies and the identification of potential problems. The present PT research is directed towards institutions and scientists who seek to confirm their competence in the field of gypsum parameter investigations. The results obtained will help laboratories assess the quality of their tests and their proficiency. The current work can also be used to verify the qualifications of a laboratory to perform gypsum tests and thus allow that laboratory to gain the trust of the customers who use its services [1,2]. The PT involved determining the mass fractions of the following oxides: Al₂O₃, CaO, Fe₂O₃, MgO, and SiO₂.

The preparation of sufficiently homogeneous PT items and the homogeneity and stability investigations were conducted by a subcontracting laboratory. The Proficiency Testing Centre implemented a management system in accordance with the requirements of PN-EN ISO/IEC 17043:2010 [1] as confirmed by a certificate issued by the Polish Centre for Accreditation [3]. The first PTs were conducted at the Institute for Chemical Processing of Coal in Zabrze, Poland (IChPW) in 1999. Since 2008, a proficiency testing provider (PTP) has been included in the structure of the IChPW, and in 2017, the IChPW organised its first PT in the area of gypsum research. This was also the first gypsum PT in Poland. Other gypsum PTs have been conducted, for example, in the USA in 2006. This latter PT concerned the fire resistance of gypsum [4]. The German BAM Federal Institute for Materials Research and Testing (Bundesanstalt für Materialforschung und -prüfung) in collaboration with the German Metrology Institute (PTB) also organised an Africa-wide PT scheme for the testing of cement in 2014 [5]. This proficiency test addressed construction material testing laboratories in Africa. In recent years, PTs related to agricultural gypsum analysis have been conducted in South Africa. Moreover, PTs that particularly focused on examining the abilities of structural elements or systems to withstand fire have been conducted in the USA. These types of PTs are conducted by NIST (National Institute of Standards and Technology) [6].
As stated in the ISO 13528:2015 standard, for PTs with small numbers of participants, an overall value should be calculated and assigned according to a metrological procedure that is independent of the number of obtained results [7]. The most recommended procedure is a bilateral comparison in which expert judgements or criteria based on fitness for the desired purpose are applied. Another recommended procedure involves certified reference materials (CRMs) with known and stable parameters that can be investigated by the participants [8]. For PT in relation to gypsum, the use of CRMs is onerous due to cost issues. Each participant requires a sample of approximately 0.5 kg; thus, the fees for PT participation would be too high. Additionally, the present work involved the first round of testing, so the use of parameters from a previous round was impossible.
When ideal conditions cannot be met, the assigned values and their standard uncertainties must be calculated from the participants' results. While this approach is feasible, the results can be unreliable when the number of participants is too small, and this fact must be accounted for. Therefore, the procedure for identifying outliers is crucial because the quantity of analysed data depends on it. Standard robust statistics are strongly recommended for outlier-contaminated populations; however, such statistics are not advisable for very small data sets. In such situations, outlier analyses based on the mean or standard deviation may be preferable (point D.1.2 [7]). When the sample number N is small, extreme results cannot typically be identified as outliers with known statistical tests. The metrological approach for small samples makes outlier handling less important because the assigned values are not calculated by consensus, and scores are not expected to be based on the observed standard deviation. It should be noted that statistical tests for outliers are informative but should not be used exclusively to eliminate data from a data set. Rather, the source of any suspected outlier should be examined with the intent to learn its cause. If a cause cannot be found, the data might be analysed with and without the suspected outlier to learn whether its presence in the data set is critical to the outcome. Moreover, the criterion presented in the aforementioned ISO standard regarding the uncertainty of the assigned value should also be met for small sets of participants whenever possible, i.e.:

$$u(x_{pt}) < 0.3\,\sigma_{pt} \qquad (1)$$

where $u(x_{pt})$ = the standard uncertainty of the assigned value $x_{pt}$, and $\sigma_{pt}$ = the standard deviation for proficiency assessment.
When the number of participants is below 12, this presented criterion is not met. Moreover, Algorithm A (ISO 13528:2015, Annex C, point C3) should be used for assessments involving at least 18 participants [7,9]. In such cases, the participants' results should be evaluated with the z′-score estimator.
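The participant thresholds quoted above can be motivated by a short calculation. The sketch below is illustrative only and rests on two assumptions that the text does not state explicitly: that the consensus standard deviation of the participants' results is itself used as $\sigma_{pt}$, and that the standard uncertainty of the assigned value is $s/\sqrt{p}$ for a plain mean and $1.25\,s^*/\sqrt{p}$ for a robust mean such as that from Algorithm A. Under these assumptions the criterion $u(x_{pt}) < 0.3\,\sigma_{pt}$ reduces to a condition on the number of participants $p$ alone. The function name `min_participants` is hypothetical.

```python
import math


def min_participants(factor: float, limit: float = 0.3) -> int:
    """Smallest number of participants p for which factor / sqrt(p) < limit.

    `factor` is 1.0 for u = s / sqrt(p) and 1.25 for u = 1.25 * s* / sqrt(p);
    both forms are assumptions made for this illustration.
    """
    p = 1
    while factor / math.sqrt(p) >= limit:
        p += 1
    return p


if __name__ == "__main__":
    print(min_participants(1.0))   # plain mean
    print(min_participants(1.25))  # robust mean with the 1.25 factor
```

With six participants, as in this round, neither form of the criterion can be satisfied under these assumptions, which is consistent with the use of the z′ score.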
As presented in the previously cited standard [7], estimators such as the median can be applicable down to p = 2. When p = 2, the mean should be applied, and in the range 3 ≤ p ≤ 5, the median has more advantages than the mean. For small groups of participants, the problem of estimating dispersion must be treated very carefully. Moreover, for small data sets, the highest available efficient dispersion estimate should be considered. The recommendations for robust estimates of dispersion in cases of small datasets are presented in Table 1.
In cases involving small groups of results, one of the most important features is the ability of the estimator to resist any bias caused by a minority group of discrepant results, i.e., the resistance to a minor mode. Based on ISO 13528:2015, high resistance is characteristic of some estimators, including the nIQR, MADe, Algorithm A, and Q/$Q_n$ [7]. Therefore, the authors decided to apply the estimators summarised in Table 2 to the small set of PT data. However, as stated in point 8.2 of the aforementioned standard [7], the obtained standard deviation can be set as a value that is regulated by a technical expert in PT. For the presented gypsum PT data, technical experts compared the calculated $\sigma_{pt}$ values with values that resulted from subjective standards and the perceptions of the experts, and they then established consensus values.

Table 1 Recommendations for robust estimates of dispersion in cases of small datasets [7] (column headings: Number of participants; Standard deviation estimator; M-estimate of the standard deviation)
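As an illustration of why resistant estimators matter for small data sets, the following sketch computes two of the estimators named above, MADe and nIQR, next to the classical standard deviation. It is a minimal example rather than the procedure used in this PT: the scaling constants 1.483 and 0.7413 are the usual normal-consistency factors, the quartile convention (`method="inclusive"`) is an assumption, and the data are invented. $Q_n$ is left out because its small-sample correction factors are not given in the text.

```python
import statistics


def made(values):
    """Scaled median absolute deviation: 1.483 * median(|x_i - median(x)|)."""
    med = statistics.median(values)
    return 1.483 * statistics.median([abs(v - med) for v in values])


def niqr(values):
    """Normalised interquartile range: 0.7413 * (Q3 - Q1).

    The quartile definition is an assumption; other conventions give
    slightly different values for small data sets.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 0.7413 * (q3 - q1)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 100.0]  # invented data with one discrepant value
    print("s    =", statistics.stdev(data))
    print("MADe =", made(data))
    print("nIQR =", niqr(data))
```

In this invented example the single discrepant value inflates the classical standard deviation by more than an order of magnitude, whereas MADe and nIQR stay close to the spread of the remaining four values.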

Materials and preparation of the gypsum samples
The materials were prepared in an accredited subcontracting laboratory. The packaging and labelling of the materials for testing were performed in accordance with the instruction of the Proficiency Testing Centre (I/OBB/4.6/02/D "Sample packaging").

Homogeneity and stability assessments
The method described in the ISO 13528:2015 standard was applied for the assessment of the homogeneity of the gypsum samples. According to this standard, the following activities were performed:
(a) Three steps were used to ensure the correctness of the data for analysis:
• The results for each test sample were examined in the order of measurement collection to identify the presence of any trends over time;
• The PT item averages were examined according to production order;
• The Cochran test was applied to reject outliers.
(b) The within-sample standard deviation $s_w$ and the between-sample standard deviation $s_s$ (based on 13 measurement results) were estimated according to point B.3 of Annex B of ISO 13528:2015 [7].
(c) The following criterion was examined:

$$s_s \le \sqrt{c} \qquad (2)$$

with $c = F_1\,(0.3\,\sigma_{pt})^2 + F_2\,s_w^2$, $F_1 = \chi^2_{g-1;\,0.95}/(g-1)$ and $F_2 = (F_{g-1,\,g;\,0.95} - 1)/2$,

where $F_1$ and $F_2$ are constants from statistical tables that are derived from the $\chi^2$ and Snedecor's F distributions ($\chi^2_{g-1;\,0.95}$ is the value exceeded with probability 0.05 by a chi-squared random variable with g − 1 degrees of freedom, and $F_{g-1,\,g;\,0.95}$ is the value exceeded with probability 0.05 by a random variable with an F-distribution with g − 1 and g degrees of freedom), $c$ is the critical value for the homogeneity test, and $s_w$ is the within-sample standard deviation. The samples are sufficiently homogeneous if the criterion is satisfied (significant at the 95 % confidence level). Because the PT based on the gypsum samples was organised for the first time herein, the homogeneity assessment was performed based on a $\sigma_{pt}$ value that was calculated from the repeatability and reproducibility standard deviations from a previous collaborative study in a subcontracting laboratory.
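The homogeneity steps (b) and (c) can be sketched as follows for items measured in duplicate. This is a hedged illustration, not the organisers' software: it assumes the usual duplicate-measurement formulas of ISO 13528 Annex B ($s_w$ from the within-item differences, $s_s$ from the standard deviation of the item averages) and the expanded criterion $s_s \le \sqrt{F_1 (0.3\sigma_{pt})^2 + F_2 s_w^2}$. The function name, the data, and the constants passed for $F_1$ and $F_2$ are assumptions; the values 1.75 and 0.80 are the tabulated figures assumed here for g = 13 and should be checked against the standard before use.

```python
import math
import statistics


def homogeneity_check(duplicates, sigma_pt, f1, f2):
    """Return (s_w, s_s, critical value, passed) for g items tested in duplicate.

    duplicates: list of (result_1, result_2) pairs, one pair per PT item.
    f1, f2: tabulated constants for the number of items g (caller-supplied).
    """
    g = len(duplicates)
    item_means = [(a + b) / 2 for a, b in duplicates]
    s_x = statistics.stdev(item_means)  # SD of the item averages
    # Within-sample SD from the differences between duplicates
    s_w = math.sqrt(sum((a - b) ** 2 for a, b in duplicates) / (2 * g))
    # Between-sample SD; negative variance estimates are set to zero
    s_s = math.sqrt(max(0.0, s_x ** 2 - s_w ** 2 / 2))
    sigma_allow = 0.3 * sigma_pt
    critical = math.sqrt(f1 * sigma_allow ** 2 + f2 * s_w ** 2)
    return s_w, s_s, critical, s_s <= critical


if __name__ == "__main__":
    # Invented CaO mass fractions (%) for 13 items, each tested in duplicate
    cao = [(32.1, 32.3), (32.2, 32.2), (32.0, 32.3), (32.4, 32.2),
           (32.1, 32.1), (32.3, 32.2), (32.2, 32.0), (32.1, 32.4),
           (32.3, 32.3), (32.2, 32.1), (32.0, 32.2), (32.3, 32.1),
           (32.2, 32.3)]
    print(homogeneity_check(cao, sigma_pt=0.5, f1=1.75, f2=0.80))
```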
The second-most important PT sample parameter is the stability. The stability provides information about effects that may influence the sample parameters, for example, external conditions. The assessment criterion for the stability of the gypsum samples was taken from ISO 13528:2015 [7] as follows:

$$|y_1 - y_2| \le 0.3\,\sigma_{pt} \qquad (5)$$

where $y_1$ is the average value of the parameter obtained in the homogeneity check, $y_2$ is the average value of the parameter obtained during the stability test, and $\sigma_{pt}$ is the standard deviation used in the PT.
The parameters used to investigate the sample homogeneity and stability were the mass fractions of Al₂O₃, CaO, Fe₂O₃, MgO, and SiO₂. First, the material was collected and adequately conditioned. Following this preparation, 13 analytical samples, each weighing 500 g, were separated and placed in boxes. The homogeneity assessment was performed based on the results of the determinations of the Al₂O₃, CaO, Fe₂O₃, MgO, and SiO₂ contents of the 13 samples, with each item tested in duplicate. Table 3 contains the raw data used for the homogeneity calculation, and Table 4 contains the calculated values that were used for the homogeneity determination. As mentioned above, a very important step in a homogeneity assessment is the Cochran test for duplicate results among all tested samples, which is used to reject outliers. No outliers were detected in this investigation. Consequently, all PT item test results were included in the assessment of homogeneity. The homogeneity requirement expressed in Eq. (2) was satisfied. The stability of the gypsum samples was investigated 7 weeks after the preparation of the materials for testing. In agreement with the technical expert, the least favourable method of storing the samples was assumed. The samples were stored on a worktop in the laboratory conducting the research. The conditions in the laboratory were recorded continuously. The air temperature ranged from 22 °C to 26 °C, and the humidity ranged from 33 % to 42 %. The sample containers were opaque. Four samples were investigated. The criterion for the stability of the investigated samples is expressed in Eq. (5). During the stability testing, the aforementioned oxide contents were also examined. The standard deviations for proficiency assessment $\sigma_{pt}$ for the oxide contents were calculated as presented above in the Introduction.
Six participants investigated the Al₂O₃, CaO, Fe₂O₃, MgO, and SiO₂ contents; thus, the preferred standard deviation estimator was $Q_n$. The calculated $\sigma_{pt}$ values for each oxide content are presented in Table 4. Due to the small number of participants, the calculated $\sigma_{pt}$ values were judged by technical experts in this round of testing. The calculated values were compared with values that were determined based on the experience of the authors' laboratory and the experience of the experts (point 8.2 in [7]). Larger values were selected for further calculations. Raw data used to calculate the stability are presented in Table 5. The values of the parameters required to verify the stability of the gypsum samples are presented in Table 6. The stability criterion was satisfied.
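The two decisions described above, taking the larger of the calculated and expert-derived standard deviations and then checking stability, can be sketched in a few lines. The sketch assumes the stability criterion $|y_1 - y_2| \le 0.3\,\sigma_{pt}$ of ISO 13528:2015; the function names and all numerical values are invented for illustration and are not taken from Tables 4 to 6.

```python
import statistics


def select_sigma_pt(calculated: float, expert: float) -> float:
    """Keep the larger of the calculated and the expert-derived value."""
    return max(calculated, expert)


def is_stable(homogeneity_results, stability_results, sigma_pt: float) -> bool:
    """Check |y1 - y2| <= 0.3 * sigma_pt, where y1 and y2 are the mean results
    of the homogeneity check and of the stability test, respectively."""
    y1 = statistics.mean(homogeneity_results)
    y2 = statistics.mean(stability_results)
    return abs(y1 - y2) <= 0.3 * sigma_pt


if __name__ == "__main__":
    sigma_pt = select_sigma_pt(calculated=0.12, expert=0.20)  # invented values
    homogeneity = [0.52, 0.55, 0.50, 0.53]  # invented MgO mass fractions (%)
    stability = [0.54, 0.56, 0.51, 0.55]    # four samples after storage
    print(sigma_pt, is_stable(homogeneity, stability, sigma_pt))
```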

Proficiency testing
The organisation of the PT of the gypsum analysis proceeded according to ISO/IEC 17043:2010 [1]. All instructions and relevant information for the participants were available at the IChPW web site. After registration in accordance with the rules of confidentiality, each PT participant received a [...]. All collected results were calculated and are presented on a dry-weight basis, in accordance with the instructions given to the participants. Additional information regarding the applied techniques and instruments was also obtained. The participants performed the determinations according to the test procedures described by the equipment manufacturers, other standards (e.g., the ICP-OES method according to PN-EN ISO 11885 and the X-ray fluorescence method according to PN-EN 15309), or their own procedures. Examples of the X-ray fluorescence spectrometers used in the determinations include the Bruker S8 TIGER and the Rigaku ZSX Primus. Additionally, for the ICP-OES method, a Leeman DRE spectrometer, a Thermo iCAP6500 DUO spectrometer, and a PerkinElmer Optima 800 spectrometer were used.

Outliers
Before the statistical analyses, a preliminary appraisal of the results provided by the participants was performed to examine the correctness and consistency of the data. The assigned values of the measurands $x_{pt}$, the corresponding uncertainties $u(x_{pt})$, and the standard deviations $\sigma_{pt}$ were determined based on the results obtained by the participants. These estimators were calculated according to the equations presented in Table 2. The expanded uncertainty of the assigned value $U(x_{pt})$ was calculated as follows:

$$U(x_{pt}) = k \cdot u(x_{pt})$$

where $k$ is a coverage factor that was determined with the Student's t distribution and the corresponding degrees of freedom and confidence level for each measurand (k = 2).
As mentioned above, because of the small sample size, an outlier analysis based on the mean ($\bar{x}$) and standard deviation ($s$) was performed before the performance statistics were applied. The calculated data are presented in Table 7.
The numbers in bold lie above $\bar{x} + s$ or below $\bar{x} - s$. At the 68 % confidence level, two of the six results for the Fe₂O₃ content were outside these boundaries. Regarding the Al₂O₃ content, two of the five results were outside them. Regarding the MgO and SiO₂ contents, three of the four results for each compound were within one standard deviation of the mean. These findings indicated that the statistical analyses would have needed to be performed on four, three, and two results. ISO 13528:2015 [7] does not recommend the use of the 99 % confidence level for outlier rejection; thus, the PT organisers and the technical experts decided to use the 95 % confidence level. With the 95 % confidence level ($\bar{x} \pm 2s$), no outliers were identified. Therefore, all of the results qualified for the statistical analyses.
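The screening described above amounts to flagging results outside $\bar{x} \pm k s$ for k = 1 (about 68 %) and k = 2 (about 95 %). The sketch below is a minimal illustration with invented data; the function name `flag_outside` is hypothetical. It also helps to keep in mind a general property of the sample standard deviation: no value in a set of n results can lie further than $(n-1)/\sqrt{n}$ standard deviations from the mean, which is about 1.50 for n = 4, 1.79 for n = 5, and 2.04 for n = 6, so for five or fewer results the $\bar{x} \pm 2s$ rule cannot flag anything.

```python
import statistics


def flag_outside(values, k: float):
    """Return the values lying outside mean +/- k * s (sample SD)."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * s]


if __name__ == "__main__":
    results = [1.0, 2.0, 3.0, 4.0, 100.0]  # invented data, n = 5
    print(flag_outside(results, k=1))  # flagged at about the 68 % level
    print(flag_outside(results, k=2))  # nothing can be flagged for n = 5
```

Even the grossly discrepant value in this invented set is not flagged at k = 2, which illustrates why the text stresses expert judgement over purely statistical rejection for very small data sets.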

Normality
Because no hypothesis regarding the normality of the PT result distributions was taken into account, the compatibility of the results (as a group) should have been tested based on nonparametric statistics as stated in the IUPAC/CITAC Guide (Annex A, paragraph 4) [8]. However, due to the specifics of the investigated material and the practical obstacles to testing CRMs, the compatibility of the results was not checked in this investigation.

Determining the assigned value and the standard uncertainty of the assigned value
As mentioned above, the assigned values and standard uncertainties of the assigned values were calculated according to equations presented in Table 2. After these calculations, the results were judged by technical experts, and the calculated values were compared with values that were determined according to the experiences of experts (Point 8.2 in [7]). The larger values were selected for the further calculations. The obtained and final results are presented in Table 8.

Calculation of the performance statistic
According to ISO 13528:2015 [7] and PN-EN ISO/IEC 17043:2011 [1] and accounting for the uncertainty condition (Eq. (1)), the achievements of the participants were calculated with the z′-score, which is expressed as follows:

$$z'_i = \frac{x_i - x_{pt}}{\sqrt{\sigma_{pt}^2 + u^2(x_{pt})}}$$

where $x_i$ = the average value for a given parameter obtained by the participant, $x_{pt}$ = the assigned value, $\sigma_{pt}$ = the standard deviation for proficiency assessment, and $u(x_{pt})$ = the uncertainty of the assigned value.
The following interpretation of the test results obtained by the participants was applied: |z′| ≤ 2, satisfactory; 2 < |z′| < 3, questionable; |z′| ≥ 3, unsatisfactory. The calculated results are presented in Table 9.

ζ score calculations
The ζ score can be useful for evaluating a participant's ability to produce results that are close to the assigned values. Knowledge of the uncertainty of the measurement and the results of research are very important for laboratories, their clients, and all institutions that use such results to make key decisions. The use of ζ scores allows for direct assessments of whether laboratories are able to deliver correct results [7]. According to ISO 13528:2015, the ζ score should be used to calculate $x_{pt}$ from the participants' results to assess the correctness of the uncertainty value estimation because the ζ score encompasses the uncertainty of the assigned value and the uncertainty with which the participant performed the analysis. The ζ score is expressed in Eq. (8):

$$\zeta_i = \frac{x_i - x_{pt}}{\sqrt{u^2(x_i) + u^2(x_{pt})}} \qquad (8)$$

where $x_i$ = the average value of a given parameter obtained by the participant, $x_{pt}$ = the assigned value, $u(x_i)$ = the uncertainty with which the participant performed the analysis, and $u(x_{pt})$ = the uncertainty of the assigned value.
The test results obtained by the participants were interpreted as follows: |ζ| ≤ 2, satisfactory; 2 < |ζ| < 3, questionable; |ζ| ≥ 3, unsatisfactory. The results for the participants who provided information about their uncertainties are presented in Table 10.

Fig. 1 [...] delimit the assigned intervals $x_{pt} \pm U(x_{pt})$ (k = 2) and the dotted lines delimit the target intervals $x_{pt} \pm 3\sigma_{pt}$ (k = 2), and the solid blue lines delimit the target interval ($x_{pt} \pm 3\sigma_{pt}$), which is relevant for z′ score evaluation.

Based on Fig. 1, the qualitative measurands and estimated measurement uncertainties were evaluated. If the measurands were determined correctly, their values should remain in the region defined by $x_{pt} \pm 2\sigma_{pt}$. In the present work, the participants achieved satisfactory z′ scores.
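The two performance statistics used in this work can be sketched as follows. The formulas follow the variable definitions given above for the z′ and ζ scores; the classification limits of 2 and 3 are the conventional ISO 13528 limits and are stated here as an assumption. All names and numerical values are invented for illustration and do not reproduce Tables 9 or 10.

```python
import math


def z_prime(x_i: float, x_pt: float, sigma_pt: float, u_xpt: float) -> float:
    """z' score: deviation scaled by sigma_pt combined with u(x_pt)."""
    return (x_i - x_pt) / math.sqrt(sigma_pt ** 2 + u_xpt ** 2)


def zeta(x_i: float, x_pt: float, u_xi: float, u_xpt: float) -> float:
    """Zeta score: deviation scaled by the participant's own standard
    uncertainty u(x_i) combined with u(x_pt)."""
    return (x_i - x_pt) / math.sqrt(u_xi ** 2 + u_xpt ** 2)


def classify(score: float) -> str:
    """Conventional limits (assumed): |score| <= 2 satisfactory,
    2 < |score| < 3 questionable, |score| >= 3 unsatisfactory."""
    magnitude = abs(score)
    if magnitude <= 2:
        return "satisfactory"
    if magnitude < 3:
        return "questionable"
    return "unsatisfactory"


if __name__ == "__main__":
    # Invented example: assigned value 10.0, sigma_pt 0.3, u(x_pt) 0.4
    z = z_prime(x_i=10.5, x_pt=10.0, sigma_pt=0.3, u_xpt=0.4)
    # The same result, reported with a small standard uncertainty of 0.05
    zt = zeta(x_i=10.5, x_pt=10.0, u_xi=0.05, u_xpt=0.1)
    print(round(z, 2), classify(z))
    print(round(zt, 2), classify(zt))
```

The same deviation from the assigned value can give a satisfactory z′ score and an unsatisfactory ζ score when the participant reports an uncertainty that is too small, which is the situation the ζ score is intended to reveal.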

Discussion
The qualitative measurands and estimated measurement uncertainties were evaluated based on Fig. 1. If the measurands are determined correctly, their values should remain in the region defined by $x_{pt} \pm 3\sigma_{pt}$. In the present work, the participants achieved satisfactory z′-scores. Additionally, in accordance with ISO 13528 [7], the qualitative parameters and estimated measurement uncertainties were properly determined if their values covered the area defined by $x_{pt} \pm U(x_{pt})$. As observable in the figure, not all of the participants performed valid calculations of the expanded uncertainties of their results. Moreover, in some cases, the uncertainties were correct, but the values were too high (for example, the code 8 participant in the case of the CaO content). Some of the participants estimated the measurement means correctly, but the uncertainties of the investigated parameters were too high or too low (for example, the code 7 participant in the case of the Fe₂O₃ content).
The calculated z′ and ζ values are presented in Tables 9 and 10. Regarding the MgO content, one unsatisfactory result was obtained (participant code 1). Regarding the Al₂O₃ and Fe₂O₃ contents, two participants (codes 8 and 10, respectively) obtained questionable results. These findings indicate that the participants with codes 1, 8, and 10 should be advised to check their measurement procedures based on these warning signals.
Analysis of the ζ parameter revealed that one participant (code 8) obtained an unsatisfactory result for the Al₂O₃ determination. Three questionable values were also detected (for the CaO and MgO contents obtained from participant code 8 and the MgO content obtained from participant code 7). The use of ζ scores allows participants to identify problems with measurement uncertainties and provides information about the abilities of laboratories to deliver correct results.

Conclusions
Currently, the ability of a laboratory to conduct investigations with a high level of quality is among the most desirable skills from the customer's perspective. The opinion of the laboratory often plays a decisive role, and therefore, reliable results are the most sought-after product. Thus, it is often necessary to conduct PT among small groups of specialised laboratories, which, for accreditation purposes, are required to participate in such testing. The need to implement PT among small numbers of participants is necessarily associated with the acceptance of certain restrictions. The first such restriction is the need to adjust the scope of the research according to the costs of organising the PT. An excessively wide range of tests may result in a drastic increase in the costs of preparing large samples and conducting homogeneity tests for a wide range of parameters. Another limitation is the need for very careful analyses of the produced data. In situations with small numbers of participants, the rejection of outliers may result in drastic changes in the assigned values. Therefore, the analysis of the results in such situations requires the very close cooperation of statisticians and technical experts in the field. Additionally, for topics of investigation for which prior PT has not been performed, the use of CRMs is not possible, and cooperation with technical experts is the only option for conducting PT. Moreover, when it is not possible to compare results across laboratories due to the highly specialised nature of a laboratory, PT involving the close cooperation of experts is the only option for confirming laboratory skills.
In the present analysis, six participants investigated the oxide contents of gypsum. Because the different oxides were examined by a small number of participants, the standard uncertainties were calculated for the assigned values and judged by technical experts. Based on the analyses of the obtained results, the experts decided to increase the standard uncertainties of the assigned values. The obvious conclusion is that all of the participants achieved satisfactory results. However, the measurement procedures of one participant require examination. A second investigation examined the ζ scores. The ζ score is strictly associated with the participant's uncertainty. Unfortunately, only four participants reported the information required for ζ score analysis. Despite this limitation, it was possible to compare the participants' uncertainties, and it can be concluded that the participants, especially the participant with code 8, should analyse their sources of uncertainty.