Is the z score sufficient to assess participants’ performance in proficiency testing? The hidden corrective action

Proficiency testing providers, accreditation bodies and testing laboratories should be aware that a laboratory participating in a proficiency testing round might have reported a biased result despite a satisfactory performance indicated by an assessment based on the z score alone. A complementary performance evaluation, based on the ζ score and the assessment of the measurement uncertainty, is therefore highly recommended. This work presents an intuitive graphical tool (the Naji2 plot) that combines z and ζ scores together with the reported measurement uncertainties. This tool allows a comprehensive assessment of the laboratory performance and helps to identify the need for corrective actions. The concerned laboratory should then perform a root cause analysis and investigate its bias and/or its measurement uncertainty evaluation.


Introduction
The international standard ISO/IEC 17025:2017 [1] requires testing laboratories to use validated measurement procedures, certified reference materials (when available) or appropriate quality control materials, and to participate in interlaboratory comparisons such as proficiency testing (PT) rounds, in order to ensure and/or demonstrate the validity of their results. In addition, every laboratory is expected to estimate the measurement uncertainty (MU) associated with the reported result in order to comply with the basic demand that "a measurement result is generally expressed as a single measured quantity value and a measurement uncertainty" [2]. Similarly, ISO/IEC 17043:2010 [3] requires PT providers, among other requirements, to estimate the associated measurement uncertainties when determining the assigned values.
Most of the European Union official laboratories performing control activities in the Single Market, for customs or environmental surveillance, are accredited according to ISO/IEC 17025 [1]. During the accreditation audits, the technical assessors check the compliance of such testing against the requirements set in the standard. They often rely on the laboratory performance demonstrated through regular participation in PT schemes.
While all PT providers require laboratories to report a value for each investigated measurand, only a few of them request participants to report the associated MU. Hence, the common proof of satisfactory performance relies only on the z score. This score may confirm that the method of analysis applied is fit for the intended purpose, but it does not allow any conclusion on the presence or absence of a potential bias in the reported value, or on the appropriateness of the estimated MU associated with that value. Hence, laboratories may miss the hidden corrective action needed.
Robouch et al. presented a simple graphical tool for the evaluation of PT results at an international workshop on "Data analysis of key-comparisons" in 2003 [4]. The original Naji plot allowed the simultaneous assessment of various performance criteria of PT participants. Two alternative visualisation tools have been subsequently developed, namely the PomPlot [5] and the Kiri plot [6].
This paper revisits the approach of the original Naji plot concept, combining several assessment criteria recommended by ISO/IEC 17043 and ISO 13528 [3,7]. It introduces several modifications to obtain the "Naji2 plot", a simpler and more intuitive tool. The main goal of this work is to raise the awareness of PT providers of the importance of requesting that participants report their analytical results together with the respective measurement uncertainties, in order to deliver a comprehensive assessment of the laboratory performance.

Performance scoring
ISO/IEC 17043:2010 [3] and ISO 13528:2015 [7] recommend, among others, the following three performance scores to assess the reported results in the frame of a proficiency testing round.
The relative deviation (D %,i ), used to estimate the deviation of the reported result (x i ) from the assigned value (x pt ), is expressed as a percentage of x pt and calculated as:

$$D_{\%,i} = 100 \cdot \frac{x_i - x_{pt}}{x_{pt}} \tag{1}$$

Similarly, the deviation may be normalised using the standard deviation for proficiency assessment (σ pt ) to calculate the z score:

$$z = \frac{x_i - x_{pt}}{\sigma_{pt}} \tag{2}$$

Alternatively, the deviation can be normalised combining the uncertainty reported by the laboratory (u(x i )) and the uncertainty associated with the assigned value (u(x pt )) to compute the zeta (ζ) score:

$$\zeta = \frac{x_i - x_{pt}}{\sqrt{u^2(x_i) + u^2(x_{pt})}} \tag{3}$$

According to ISO 13528 [7], a laboratory performance is considered as satisfactory (SP), questionable (QP) or unsatisfactory (UP) when the absolute value of the z or ζ score is: |score| ≤ 2; 2 < |score| < 3; and |score| ≥ 3, respectively.
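The three scores and the classification limits quoted above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the example result (85 ± 4 mg kg −1 ) is invented for illustration. It is evaluated against x pt = 100 mg kg −1 , u(x pt ) = 3 mg kg −1 and σ pt = 10 mg kg −1 , the values used in the hypothetical case study of this paper.

```python
import math


def relative_deviation(x_i, x_pt):
    """Relative deviation D%, in percent of the assigned value."""
    return 100.0 * (x_i - x_pt) / x_pt


def z_score(x_i, x_pt, sigma_pt):
    """Deviation normalised by the standard deviation for proficiency assessment."""
    return (x_i - x_pt) / sigma_pt


def zeta_score(x_i, u_i, x_pt, u_pt):
    """Deviation normalised by the combined standard uncertainty."""
    return (x_i - x_pt) / math.sqrt(u_i ** 2 + u_pt ** 2)


def classify(score):
    """SP if |score| <= 2, QP if 2 < |score| < 3, UP if |score| >= 3."""
    magnitude = abs(score)
    if magnitude <= 2:
        return "SP"
    if magnitude < 3:
        return "QP"
    return "UP"


# Invented example result (assumption, not taken from Table 2).
x_pt, u_pt, sigma_pt = 100.0, 3.0, 10.0
x_i, u_i = 85.0, 4.0

z = z_score(x_i, x_pt, sigma_pt)
zeta = zeta_score(x_i, u_i, x_pt, u_pt)
print(relative_deviation(x_i, x_pt), z, classify(z), zeta, classify(zeta))
```

For this invented result, z = −1.5 is classified as satisfactory while ζ = −3.0 is classified as unsatisfactory, which is the kind of discrepancy the paper is concerned with.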
In addition, ISO 13528 [7] suggests assessing the reported measurement uncertainties (MU). The systematic approach implemented by the European Union Reference Laboratories (EURLs) managed by the Joint Research Centre is described in the PT reports to participants, see for instance [8,9]. The reported relative MU (u rel (x i ) = u(x i )/x i ) is compared to the relative uncertainty of the assigned value (u rel (x pt ) = u(x pt )/x pt ) and to the relative standard deviation for proficiency assessment (σ pt,rel = σ pt /x pt ). Hence, MU is considered as realistic (case "a"), probably underestimated (case "b") or likely to be overestimated (case "c") when: "a" u rel (x pt ) ≤ u rel (x i ) ≤ σ pt,rel ; "b" u rel (x i ) < u rel (x pt ); and "c" u rel (x i ) > σ pt,rel , respectively.
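The three MU cases can be expressed as a small decision function. The sketch below is only an illustration of the comparison described above: the function name `assess_mu` and the three example uncertainties are assumptions made for this example, and a reported value of zero would need separate handling because the relative uncertainty is undefined.

```python
def assess_mu(x_i, u_i, x_pt, u_pt, sigma_pt):
    """Return case 'a' (realistic), 'b' (underestimated) or 'c' (overestimated).

    The comparison uses relative quantities:
    u_rel(x_i) = u(x_i)/x_i, u_rel(x_pt) = u(x_pt)/x_pt, sigma_pt,rel = sigma_pt/x_pt.
    """
    u_rel_i = u_i / x_i
    u_rel_pt = u_pt / x_pt
    sigma_pt_rel = sigma_pt / x_pt
    if u_rel_i < u_rel_pt:
        return "b"
    if u_rel_i > sigma_pt_rel:
        return "c"
    return "a"


# Invented examples (assumption): one reported value, three reported uncertainties.
x_pt, u_pt, sigma_pt = 100.0, 3.0, 10.0
for u_i in (2.0, 4.0, 10.0):
    print(u_i, assess_mu(85.0, u_i, x_pt, u_pt, sigma_pt))
```

With u rel (x pt ) = 0.03 and σ pt,rel = 0.10, the three invented uncertainties fall into cases "b", "a" and "c", respectively.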
Knowing that there are three performance categories (satisfactory, questionable and unsatisfactory) for each of the z and ζ scores, and three uncertainty evaluation cases (underestimated, realistic and overestimated) for the MU, a total of 27 theoretical possibilities can be identified, as shown in Table 1. However, a few possibilities, marked by an asterisk in Table 1, may be unrealistic. For example, it is not possible to have simultaneously a satisfactory performance according to z, an unsatisfactory performance according to ζ and an overestimated MU (see possibility "I" in Table 1).

The Naji2 Plot: a graphical solution
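Assuming that the absolute value of the ζ score is smaller than a certain performance limit "P L " (equal to 2 or 3, as set in [3,7]), and combining Eqs. 2 and 3, one can derive the following relations:

$$|\zeta| = \frac{|z|\,\sigma_{pt}}{\sqrt{u^2(x_i) + u^2(x_{pt})}} \le P_L \tag{4}$$

$$z^2\,\sigma_{pt}^2 \le P_L^2 \left( u^2(x_i) + u^2(x_{pt}) \right) \tag{5}$$

Equivalent to:

$$\left( \frac{u(x_i)}{\sigma_{pt}} \right)^2 \ge \frac{z^2}{P_L^2} - \left( \frac{u(x_{pt})}{\sigma_{pt}} \right)^2 \tag{6}$$

At the limit |ζ| = P L , Eq. 6 describes a parabola, as used in the original Naji plot [4], when plotting (u(x i )/σ pt )² (y-axis) as a function of the z score (x-axis). These graphs were used in several publications related to proficiency testing rounds organised in a broad variety of fields [10][11][12][13][14][15][16][17]. Furthermore, this graphical representation is currently implemented in the commercial software "PROLab™" developed by Quodata GmbH (Dresden, Germany) for the interpretation of results reported in the frame of interlaboratory comparisons [18].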
An alternative transformation of Eq. 5 is further investigated:

$$u^2(x_i) \ge \left( \frac{\sigma_{pt}\, z}{P_L} \right)^2 - u^2(x_{pt}) \tag{7}$$

Equivalent to:

$$u(x_i) \ge \sqrt{ \left( \frac{\sigma_{pt}\, z}{P_L} \right)^2 - u^2(x_{pt}) } \tag{8}$$

At the limit |ζ| = P L , Equation 8 describes two sets of hyperbolas obtained when plotting u(x i ) versus the z score, for the two acceptance criteria (P L = 2 or 3). This graphical representation constitutes the "Naji2 plot". Each hyperbola presents the following characteristics: (i) it is symmetric around the y-axis; (ii) it is always positive, or equal to zero when z = ±P L u(x pt )/σ pt ; and (iii) it increases steadily with increasing |z| to approach the linear asymptote (y = ±σ pt z/P L ). However, Eq. 8 is not valid for z values ranging between "−P L u(x pt )/σ pt " and "+P L u(x pt )/σ pt ", because they would lead to a negative value under the square root.
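The hyperbola that separates |ζ| ≤ P L from |ζ| > P L follows directly from Eq. 3 with x i − x pt = z σ pt : setting |ζ| = P L gives u(x i ) = sqrt((σ pt z/P L )² − u²(x pt )). The sketch below evaluates this boundary; the function names are hypothetical, and the numerical values are those of the hypothetical case study (u(x pt ) = 3 mg kg −1 , σ pt = 10 mg kg −1 ).

```python
import math


def zeta_boundary_u(z, p_l, u_pt, sigma_pt):
    """u(x_i) on the |zeta| = P_L hyperbola for a given z score.

    Returns None where the radicand is negative, i.e. for
    |z| < P_L * u(x_pt) / sigma_pt, where the boundary does not exist.
    """
    radicand = (sigma_pt * z / p_l) ** 2 - u_pt ** 2
    if radicand < 0:
        return None
    return math.sqrt(radicand)


def z_zero_crossing(p_l, u_pt, sigma_pt):
    """|z| at which the hyperbola touches the x-axis (u(x_i) = 0)."""
    return p_l * u_pt / sigma_pt


u_pt, sigma_pt = 3.0, 10.0
for p_l in (2, 3):
    print("P_L =", p_l, "touches the x-axis at |z| =", z_zero_crossing(p_l, u_pt, sigma_pt))
    for z in (0.5, 1.0, 1.5, 2.0, 3.0):
        print("  z =", z, "u(x_i) =", zeta_boundary_u(z, p_l, u_pt, sigma_pt))
```

A reported uncertainty below the returned value at a given z corresponds to |ζ| > P L , and one above it to |ζ| < P L . Because the boundary depends only on z², evaluating positive z values is sufficient.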

Assessment of measurement uncertainties
In order to include the assessment of measurement uncertainties (cases "a", "b", "c") in the Naji2 plots, the relation between the relative standard uncertainty (u rel (x i )) and the z score is described hereafter:

$$u(x_i) = u_{rel}(x_i) \cdot x_i = u_{rel}(x_i) \cdot \left( x_{pt} + z\,\sigma_{pt} \right) \tag{9}$$

Equation 9 is a linear relation of u(x i ) as a function of the z score. Two specific lines delimit the range of "realistic" measurement uncertainties (case "a"), when u rel (x i ) is equal to u rel (x pt ) or to σ pt,rel . While each line crosses the y-axis (z = 0) at u(x i ) = u(x pt ) or at u(x i ) = σ pt , they both cross the x-axis (u(x i ) = 0) at z = −x pt /σ pt .
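The two straight lines that delimit the "realistic" MU range can be checked numerically. The sketch below assumes only the definitions u rel (x i ) = u(x i )/x i and x i = x pt + z σ pt (from the z score definition); the function name `mu_line` is hypothetical and the values are those of the hypothetical case study.

```python
def mu_line(z, u_rel, x_pt, sigma_pt):
    """u(x_i) for a constant relative uncertainty u_rel at a given z score."""
    return u_rel * (x_pt + z * sigma_pt)


x_pt, u_pt, sigma_pt = 100.0, 3.0, 10.0
u_rel_pt = u_pt / x_pt          # lower limit of the realistic range
sigma_pt_rel = sigma_pt / x_pt  # upper limit of the realistic range

for z in (-10.0, -3.0, 0.0, 3.0):
    lower = mu_line(z, u_rel_pt, x_pt, sigma_pt)
    upper = mu_line(z, sigma_pt_rel, x_pt, sigma_pt)
    print("z =", z, "realistic u(x_i) between", lower, "and", upper)
```

At z = 0 the two limits equal u(x pt ) and σ pt , and both vanish at z = −x pt /σ pt (here z = −10), as stated above.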

The bias boundaries
No bias can be identified (null hypothesis H 0 ) when the distribution of a measurement result and the interval of the assigned value and its expanded uncertainty (both assumed to follow a normal or approximately normal distribution) overlap with a probability (α) of rejecting a true H 0 of 5%, and a probability (β) of accepting a false H 0 of 5% [19]. Consequently, a value (x i ) lower than the assigned value (x pt ) would be negatively biased (z < 0) if the condition of Eq. 10 is fulfilled, where 1.64 is the inverse of the standard normal cumulative distribution for a probability of 95% (using the built-in spreadsheet function "NORM.S.INV(0.95)"). Combining Eqs. 2 and 10, one derives a linear relation between u(x i ) and z (Eq. 11). A similar linear relation is obtained in the case of "positive bias" (z > 0) (Eq. 12). The two lines defined by Eqs. 11 and 12 delimit the "upper range" of the bias boundaries.
The ranges delimited by these lines are similar to those defined by the two green hyperbolas (|ζ| > 2) and comply with the criteria for bias set by ISO Guide 33 [20]. Such large ζ scores may be caused either by a too large numerator, indicating a significant deviation (bias) of the reported value from the assigned value (x i − x pt ), or by a too small denominator, caused by an underestimated reported MU, assuming that the uncertainty associated with the assigned value is realistic. The concerned laboratory should then perform a root cause analysis and investigate its bias and/or its measurement uncertainty evaluation, in order to resolve the identified poor performance expressed as a ζ score. Although many PT results correspond to a satisfactory performance when expressed by the z score alone (|z| ≤ 2), some of them are located below the bias boundaries (with |ζ| > 2) and may require corrective actions by the laboratory.

A hypothetical case study
In order to demonstrate the Naji2 plot concept, a hypothetical proficiency test round attended by 40 participants was constructed. A PT provider processed a commercially available food commodity to produce a test material with adequate homogeneity and stability, according to the recommendations of ISO/IEC 17043 [3]. The PT provider also determined the total mass fraction of a specific analyte (x pt ) of 100 mg kg −1 and the associated standard uncertainty (u(x pt )) of 3 mg kg −1 . In addition, a σ pt of 10 mg kg −1 was set to assess the measurement capabilities of the laboratories to comply with some legal requirements. Since u(x pt ) ≤ 0.3 σ pt , the test item was considered fit-for-purpose and the z score could be applied for performance assessment. Figure 1 presents the Naji2 plot generated based on the above-mentioned predefined criteria (x pt , u(x pt ), and σ pt ). This plot consists of: (i) four vertical lines (|z| = 2 and 3); (ii) two green hyperbolas (|ζ| = 2, Eq. 8); (iii) two red hyperbolas (|ζ| = 3, Eq. 8); (iv) two straight lines delimiting the range of "realistic" MU (Eq. 9, see dotted and dashed lines in Fig. 1); and (v) two bias boundaries (Eqs. 11 and 12, see double blue lines in Fig. 1).
The four vertical lines (|z| = 2 and 3), the four hyperbolas (|ζ| = 2 and 3) and the two straight lines (u rel (x i ) = u rel (x pt ) or u rel (x i ) = σ pt,rel ) delimit all the possible Naji2 plot areas identified by the letters "A" to "AA" in Table 1 and Fig. 1. The areas denoted by the letters with an asterisk in Table 1 could not be represented in Fig. 1, since they represent unrealistic possibilities.
The spreadsheet function "NORMINV(RAND(),m,u)" was used to generate two sets of 40 values, where m is the mean value and u the standard uncertainty. The first set of data simulated the reported results (x i ) normally distributed around the assigned value (m = 100 mg kg −1 ) with u = 25 mg kg −1 (see Table 2, 2nd column). The second set simulated a broad range of reported standard measurement uncertainties (u(x i )), with m = 6 mg kg −1 and u = 4 mg kg −1 (see Table 2, 4th column). The standard uncertainties were multiplied by a coverage factor (k = 2) to derive the expanded uncertainties shown in Fig. 2. Table 2 summarises the synthetic data set, the relative standard uncertainty u rel (x i ) (to be compared to u rel (x pt ) and to σ pt,rel ) and the outcome of the four assessments (D %,i , z score, ζ score, and MU). Similarly, Fig. 2 presents the data set in a particular order, sorted by the reported result in increasing order first, then by the performance categories (SP, QP and UP), expressed as z scores and ζ scores. This allows the identification of 11 laboratories (from L25 to L38, see x-axis of Fig. 2) having "satisfactory" performances expressed as z scores but "questionable" or even "unsatisfactory" performances when expressed as ζ scores. Figure 3 presents the Naji2 plot applied to the synthetic data set (Table 2). This graphical presentation allows an intuitive and direct assessment of every data point, related to all the assessment criteria under investigation, namely the z score, the ζ score or the bias boundaries, and the appropriateness of the reported MU. However, this graphical presentation is only accessible to PT providers collecting measurand values including measurement uncertainties from their participants.
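A comparable synthetic data set can be drawn outside a spreadsheet. The sketch below uses Python's standard `random.gauss` in place of "NORMINV(RAND(),m,u)"; the seed is arbitrary, and clipping negative draws of u(x i ) to zero is an assumption made here, since the text does not say how such draws were handled. The draws will not reproduce Table 2; the sketch only shows how laboratories with a satisfactory z score but |ζ| > 2 can be listed.

```python
import math
import random

rng = random.Random(2021)  # arbitrary seed, chosen only for reproducibility

x_pt, u_pt, sigma_pt = 100.0, 3.0, 10.0
n_labs = 40

# Reported results: normal distribution with m = 100 and u = 25 (mg/kg).
results = [rng.gauss(100.0, 25.0) for _ in range(n_labs)]
# Reported standard uncertainties: m = 6 and u = 4 (mg/kg).
# Assumption: negative draws are clipped to 0 (treated as "MU not reported").
uncertainties = [max(0.0, rng.gauss(6.0, 4.0)) for _ in range(n_labs)]

hidden = []  # satisfactory z score, but questionable or unsatisfactory zeta score
for lab, (x_i, u_i) in enumerate(zip(results, uncertainties), start=1):
    z = (x_i - x_pt) / sigma_pt
    zeta = (x_i - x_pt) / math.sqrt(u_i ** 2 + u_pt ** 2)
    if abs(z) <= 2 and abs(zeta) > 2:
        hidden.append(f"L{lab:02d}")

print(len(hidden), "laboratories with |z| <= 2 and |zeta| > 2:", hidden)
```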
One can easily notice that: (i) most of the results are in the central part of the plot, representing satisfactory z and ζ scores; (ii) seven "overestimated" (case "c") and (iii) seven seemingly "underestimated" (case "b") MU were reported, while four laboratories did not report their MU (u(x i ) = 0 for L02, L04, L13 and L27). Seventeen results are below the "bias boundaries", of which six are significantly biased with |z| > 2 and |ζ| > 3 (L03, L04, L09, L14, L27, L33). The remaining eleven laboratories have "satisfactory" z scores, and "questionable" or "unsatisfactory" ζ scores (see empty circles in Fig. 3). These laboratories may consider two types of improvement actions: either an increase in their MU or a reduction of their bias.
Fig. 1 The Naji2 plot, presenting the reported uncertainty (u(x i )) versus z score. Four vertical lines represent z = ±2 or ±3. The green hyperbolas delimit |ζ| = 2, while the red ones delimit |ζ| = 3. The dotted and dashed lines delimit the "realistic" uncertainty range (between u rel (x i ) = u rel (x pt ) and u rel (x i ) = σ pt,rel , respectively), while the two blue lines define the bias boundaries. The letters "A" to "AA" denote various Naji2 plot "areas" described in Table 1. This graph was constructed using the following values: x pt = 100 mg kg −1 ; u(x pt ) = 3 mg kg −1 ; and σ pt = 10 mg kg −1
Table 2 Synthetic data set of 40 laboratories having reported measurement results (x i ) and expanded uncertainties (U(x i )). The standard MU (u(x i )) and the relative standard uncertainty (u rel (x i )) are calculated accordingly. The performance scores (D %,i , z and ζ scores) and the uncertainty assessment are based on the criteria set by the PT provider (x pt = 100 mg kg −1 ; u(x pt ) = 3 mg kg −1 and σ pt = 10 mg kg −1 ). The grey and black cells represent "questionable" and "unsatisfactory" performance scores, respectively. The last column refers to the MU assessment (cf. Case "a" realistic, "b" underestimated or "c" overestimated)

Measurement uncertainty evaluation
Since 2000, the PT providers of the Joint Research Centre in Geel have been assessing the participants' performance using the z and ζ scores for laboratories participating in various IMEP, REIMEP, NUSIMEP PTs or for PT schemes organised by JRC's European Union Reference Laboratories for contaminants in food. The organisers have noticed that extremely large reported MU lead to |ζ| ≤ 2 (satisfactory performance), even when |z| ≥ 3 (unsatisfactory performance). Hence, an additional evaluation criterion was introduced related to the expected maximum and minimum measurement uncertainties. It was assumed that realistic uncertainty estimations should not be larger than the standard deviation for proficiency assessment (u(x i ) max = σ pt ) or smaller than the uncertainty of the assigned value (u(x i ) min = u(x pt )). This concept was later adopted in the standard ISO 13528:2015 [7], where the example of the IMEP-111 is specifically mentioned. Let us consider, for example, the results submitted by L14 and L19 (62.2 ± 9.0 mg kg −1 /0.14/; and 127.6 ± 11.5 mg kg −1 /0.09/, expressed as x i ± u(x i )/u rel (x i )/, see Table 2), in the frame of the hypothetical PT round (where x pt = 100 mg kg −1 ; u(x pt ) = 3 mg kg −1 ; σ pt = 10 mg kg −1 ; u rel (x pt ) = 0.03; and σ pt,rel = 0.10). According to the above-mentioned criteria, the absolute MU reported by L14 would be considered as "realistic" (9 < 10, in mg kg −1 ), while the one of L19 would be flagged as potentially "overestimated" (11.5 > 10, in mg kg −1 ).
However, according to Ellison and Williams [21,22], the relative uncertainty of chemical measurements can be taken as constant for a range of measurand values. Based on this, the alternative approach described in this paper has been implemented by several EURLs of the JRC. Instead of comparing u(x i ) to u(x pt ) and to σ pt , the reported relative measurement uncertainty (u rel (x i ) = u(x i )/x i ) is compared to the relative standard uncertainty of the assigned value (u rel (x pt ) = u(x pt )/x pt ) and to the relative standard deviation for proficiency assessment (σ pt,rel = σ pt /x pt ) over the whole range of reported mass fractions (e.g. from 60 to 140 mg kg −1 or |z| ≤ 4 in Figs. 1 and 3).
Fig. 2 Reported results and expanded uncertainties of the synthetic data set (see Table 2). The solid line represents the assigned value (x pt = 100 mg kg −1 ), while the blue and red dashed lines represent the "assigned range" (x pt ± 2 u(x pt )) and the "acceptance range" (x pt ± 2 σ pt ), respectively. The green, yellow and red bars in the lower part of the figure indicate the "satisfactory", "questionable" or "unsatisfactory" performance, expressed as z and ζ scores
Knowing that a constant relative uncertainty u rel (x i ) implies a proportional increase of u(x i ) with x i , one may have for z > 0 (x i > x pt ) a realistic u(x i ) larger than u(x pt ), or inversely, a realistic u(x i ) smaller than u(x pt ) for z < 0 (x i < x pt ). Hence, according to the new criteria and unlike the assessment presented in ISO 13528, the MU statement of L14 is flagged as potentially "overestimated" (0.14 > 0.10), while the one of L19 is considered as "realistic" (0.09 < 0.10).
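The difference between the two sets of criteria can be reproduced for L14 and L19 with a short sketch. The reported values are those quoted above from Table 2; the function names `assess_absolute` and `assess_relative` are hypothetical labels for the absolute criteria (u(x pt ) ≤ u(x i ) ≤ σ pt ) and the relative criteria (u rel (x pt ) ≤ u rel (x i ) ≤ σ pt,rel ), respectively.

```python
def assess_absolute(u_i, u_pt, sigma_pt):
    """Absolute criteria: compare u(x_i) with u(x_pt) and sigma_pt."""
    if u_i < u_pt:
        return "b"  # probably underestimated
    if u_i > sigma_pt:
        return "c"  # likely overestimated
    return "a"      # realistic


def assess_relative(x_i, u_i, x_pt, u_pt, sigma_pt):
    """Relative criteria: the same comparison applied to relative quantities."""
    return assess_absolute(u_i / x_i, u_pt / x_pt, sigma_pt / x_pt)


x_pt, u_pt, sigma_pt = 100.0, 3.0, 10.0
reported = {"L14": (62.2, 9.0), "L19": (127.6, 11.5)}  # x_i, u(x_i) in mg/kg

for lab, (x_i, u_i) in reported.items():
    print(lab,
          "absolute:", assess_absolute(u_i, u_pt, sigma_pt),
          "relative:", assess_relative(x_i, u_i, x_pt, u_pt, sigma_pt))
```

The absolute criteria give case "a" for L14 and case "c" for L19, whereas the relative criteria give the opposite outcome, in line with the discussion above.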
These new criteria for the evaluation of reported measurement uncertainties may be taken into consideration to replace those prescribed in the current ISO 13528 standard [7].

Naji2 plot applied to real cases
In 2018, the European Union Reference Laboratory for Food Contact Materials (EURL-FCM) organised a PT round (FCM-18-02, [23]) for the determination of the mass fractions of total zinc and other metals in food simulant B (acetic acid, 3% w/v). The test items were prepared gravimetrically, which resulted in an accurate assigned value (x pt = 5.024 mg kg −1 ) with a small associated standard measurement uncertainty (u(x pt ) = 0.033 mg kg −1 ). Based on the expert opinion, σ pt,rel was set to 0.12. A total of 46 National Reference Laboratories (NRLs) and Official Control Laboratories (OCLs) reported results. The outcome of this PT is presented in Fig. 4a, where the Naji2 plot includes the "bias boundaries". Most of the laboratories reported accurate values and realistic measurement uncertainty estimations (case "a"), while some of them reported overestimated MUs (case "c"). However, according to ISO Guide 33 [20] six laboratories have to investigate their significant bias (with |ζ|> 2), of which three have a |z|≤ 2 (see empty circles in Fig. 4a).
In 2019, the European Union Reference Laboratory for Genetically Modified Food and Feed (EURL GMFF) organised a PT round (GMFF-19/02, [24]) for the determination of GMOs in food and feed materials to support Regulation (EU) 2017/625 on official controls [25]. One of the test items consisted of a pig feed spiked with a genetically modified "40-3-2 soybean". The EURL characterised the mass fraction of the GM event applying the EURL validated method and used the following values for the performance assessment of the participants: x pt = 1.014 m/m %; u(x pt ) = 0.061 m/m %, and σ pt,rel = 0.25. A total of 67 NRLs and OCLs reported results. Figure 4b shows the outcome of this PT round, where the Naji2 plot presents the set of hyperbolas. Most of the laboratories reported accurate values and realistic measurement uncertainty estimations, while some of them reported overestimated MUs (case "c"). However, according to ISO Guide 33 [20] sixteen laboratories have to investigate their significant bias (with |ζ| > 2), of which twelve have a |z| ≤ 2 (see empty circles in Fig. 4b). It is worth noting that, in this particular case, the realistic range of MU (defined by Eq. 9) goes to zero at z = −4 (= −x pt /σ pt = −1/σ pt,rel ).

Conclusions
The Naji2 plot is an intuitive graphical representation that allows the simultaneous assessment of the three performance evaluations (z, ζ scores and the MU assessment) and the identification of potential biases. This comprehensive assessment may indicate to participants the need for an appropriate corrective action that would otherwise have been hidden by a satisfactory performance expressed as a z score alone. Similarly to the original Naji plot, the Naji2 plot can be applied to all analytes, matrices, concentration/content levels and proficiency testing schemes. It can be used to summarise the outcome of a PT round presenting the results of all laboratories for a given measurand, or to present a historical overview of a laboratory participation in different rounds of a specific PT scheme.
This tool is accessible to PT providers that request the laboratories participating in their PT schemes to report a measurement result as stipulated by ISO/IEC 17025, i.e. including the associated measurement uncertainty. Unfortunately, this is still rarely the case. The situation may improve once the next edition of ISO/IEC 17043 requires PT providers to assess laboratory performances based on the measured value and the corresponding measurement uncertainty.
Fig. 3 The Naji2 plot including the synthetic data set presented in Table 2
Furthermore, the authors consider that the evaluation of the reported measurement uncertainties described in Sect. 9.8 of ISO 13528:2015 may be reviewed to take into account the relative uncertainties as an assessment criterion.