Introduction

The Comité International des Poids et Mesures (CIPM) is responsible for the conduct of international key comparison (KC) studies that enable national metrology institutes (NMIs) and related organizations to document measurement capabilities relevant to international trade and environmental-, health-, and safety-related decision making. The technical supplement to the 1999 Mutual Recognition Arrangement (CIPM MRA) [1] establishes the process by which NMIs demonstrate the “degree of equivalence” (DEq) of national measurement standards. The CIPM MRA states that (1) KCs lead to reference values, (2) a key comparison reference value (KCRV) is expected to be a good indicator of an international system of units (SI) value, (3) DEqs refer to the degree to which a national measurement standard is consistent with the KCRV, and (4) DEqs for measurement standards are expressed quantitatively by the deviation from the KCRV and the uncertainty of this deviation at a 95 % level of confidence.

The Working Groups of the Consultative Committee for Amount of Substance—Metrology in Chemistry (CCQM) are responsible for selecting and overseeing the operation of KCs that address chemical (and biochemical) measurements. Few such measurements directly realize an SI unit: a mole of one chemical analyte may have no physicochemical properties in common with a mole of another beyond containing the same number of entities. Further, with a few exceptions such as atmospheric ozone [2], the higher order chemically related measurements made by an NMI do not reflect “national measurement standards” but rather the organization’s measurement capabilities at a given time. However, until recently most CCQM-sponsored KCs have attempted to keep as closely as possible to the philosophy of the CIPM MRA as described above by estimating a separate DEq for each reported result in each KC.

Recognizing the impossibility of conducting separate KCs for all important chemically related analytes in all important sample matrices (and the ever-increasing resource burdens placed on the world’s NMIs by attempting to address even a tiny subset of these measurands), several of the Working Groups within the CCQM are now using KCs to evaluate a series of critical or “core” measurement competencies. While continuing to provide DEqs for the results reported in individual KCs, the overall assessment of an NMI’s measurement capabilities may require combining DEqs for several different measurands that may be estimated in different KCs and at separate times.

The KCs conducted by the CCQM Electrochemical Analysis Working Group (EAWG) and two regional metrology organizations (RMOs) on primary pH-related measurements are an excellent, and prescient, model for such studies. The series was initiated in 1999; to date, results are publicly available for 11 KCs involving five buffer systems, with all but one of these systems characterized at 15 °C, 25 °C, and 37 °C (see Table 1). While individual NMIs routinely if informally assess their primary pH measurement capabilities by qualitative comparison of the various DEqs for different temperatures and buffers, no formal mechanism currently exists for quantitatively summarizing such results.

Table 1 pH-related key comparisons

We here propose quantitative data analysis methods for combining individual DEqs from multiple KCs to estimate an NMI’s measurement capabilities for particular measurement areas. We will show that the various primary pH measurements can be combined to document the expected measurement performance for primary pH measurements from pH 1 to pH 11 and from 15 °C to 37 °C. These data analysis methods represent a first step in the development of tools for assessing NMI measurement capabilities from less coherent evidence.

Data

Sources

The data used in this study are the results of primary method pH measurements as provided in the published Final Reports [3–13] of the KCs listed in Table 1. All of the primary pH measurement data given in these reports are listed in Tables S1.a to S5.a of the electronic supplementary material (ESM), with the exception of values that (1) were identified in the KC’s final report as technically flawed and as such were excluded from the reference value (RV) estimation process for that KC and (2) are not the most recent primary pH measurement in that buffer system for the NMI that submitted the excluded result. Table 2 lists the number of DEq estimates available for each NMI for each buffer system. As the focus of this report is the process of combining results rather than particular outcomes for these data, each NMI is designated as a single-letter alphabetical code.

Table 2 Participation history

The 11 KCs considered include five “root” comparisons of pH measurements made in different buffer systems: CCQM-K9 (phosphate), CCQM-K17 (phthalate), CCQM-K18 (carbonate), CCQM-K19 (borate), and CCQM-K20 (tetroxalate). These root KCs were activities of the EAWG. The remaining studies, formally differentiated as “Subsequent KCs” and “Regional KCs” but here referred to as “successor” KCs, are each linked to one or another of the roots through the use of in-common measurement protocols and qualitatively similar buffer solutions. The four successor studies CCQM-K9.1, -K9.2, -K18.1, and -K19.1 were activities of the EAWG; the integer part of the label designates the root KC and the decimal designates the temporal order of the successor KC relative to its root. The APMP.QM-K9 and EUROMET.QM-K17 (also termed EUROMET Project 696) KCs were activities of the Asia Pacific Metrology Programme and the European Collaboration in Measurement Standards RMOs, respectively, both in collaboration with the EAWG. All of the successor studies were designed to enable additional NMIs to demonstrate newly acquired pH measurement capabilities and/or to allow participants in earlier studies to document improved capabilities.

The KCs examined in this study, all with completion dates ranging from 1999 to 2010, constitute the initial cycle of primary pH KCs. The recently completed CCQM-K91 (phthalate) [14] is the first KC of the second cycle and is not included in this study. CCQM-K91 and the other pH studies currently in progress or planned are designed as fresh root comparisons rather than maintaining linkages to the earlier studies.

Primary method pH measurements

All of the data considered here are the primary pH measurements reported by KC participants for a buffer solution prepared and distributed by the coordinator of each KC. The direct result of the primary measurement itself is pa 0, the acidity function at zero added chloride. Depending on the KC design, pa 0 determinations were made at one or more specified temperatures. The metrological basis for the primary measurement of pH is discussed in detail elsewhere [15–17]. In essence, the pa 0 is a function of the potential of a specified type of electrochemical cell, commonly referred to as the Harned cell.

The pH is obtained from pa 0 by adding a constant term, defined by the Bates–Guggenheim convention, specific for a given buffer and temperature [15, 18]. Since the value of this term is invariant among the participants of each KC, all measurement-specific factors that affect the pa 0 affect the corresponding pH values (as well as any KCRV calculated from them) to the same extent. The uncertainty [15] of the Bates–Guggenheim convention is excluded from the reported uncertainties for the pH KCs. This exclusion avoids inflating the reported uncertainties for the pH KCs and ensures that the reported uncertainties relate to the measurement capabilities per se of the participants.

Measurements for the carbonate, borate, and tetroxalate buffer KCs are recorded in the Final Reports as the reported pa 0 values. Measurement results for some of the phosphate and phthalate buffer system KCs were recorded as pH values. We consider the recorded values for all of these KCs as being of the same kind: “primary pH”.

Note that primary pH is a procedurally defined kind-of-quantity [19]. Since primary pH cannot be determined except through the measurement process itself, the KCRV for a primary pH KC must be estimated from the measurement results even though the study materials are prepared quantitatively from materials of established composition. This is in contrast to some chemical systems (such as synthetic gas mixtures and organic and inorganic calibration solutions) where materials can be prepared to have well-defined compositions that, with suitable verification, provide KCRVs that are independent of results reported by the study’s participants.

Computation

All calculations used in this study were performed in a spreadsheet environment using a modern desktop computer. Purpose-built programs in languages native to this environment were used to automate repetitive computations. Versions of these tools are available on request from the corresponding author.

Results and discussion

“National standard” degrees of equivalence as currently estimated

As defined by the CIPM MRA, the DEq, d, for a particular KC result is estimated as

$$d = x - V_{\text{KC}}$$
(1)

where x is the reported value and V KC is the KCRV, a close realization of an SI value as assigned by the sponsoring Working Group and approved by the Consultative Committee.

Using formal variance propagation, the uncertainty associated with d should be estimated as [20],

$$u\left( d \right) = \sqrt {u^{2} \left( x \right) + u^{2} \left( {V_{\text{KC}} } \right) - 2\rho \left( {x,V_{\text{KC}} } \right)u\left( x \right)u\left( {V_{\text{KC}} } \right)}$$
(2)

where u(x) is the standard uncertainty associated with x, u(V KC) is the standard uncertainty of the V KC, and ρ(x,V KC) is the correlation between the reported value and the KCRV. Within at least the CCQM, except when the KCRV has been assigned using the Graybill–Deal estimator [21, 22], the ρ(x,V KC) term has generally been ignored, effectively asserting that ρ(x,V KC) = 0.

Since the MRA requires that uncertainties are to be specified at the 95 % level of confidence, standard uncertainties must usually be estimated from reported expanded uncertainties

$$u(x) = \frac{{U_{95} (x)}}{{k_{95} }};\quad u(V_{\text{KC}} ) = \frac{{U_{95} (V_{\text{KC}} )}}{{k_{95} }}$$
(3)

where k 95 is the coverage factor expected to yield an expanded uncertainty such that the interval x ± k 95 u(x) includes the true value with a 95 % level of confidence. The desired 95 % level of confidence expanded uncertainty on d, U 95(d), is likewise typically estimated as

$$U_{95} \left( d \right) = k_{95} \cdot u\left( d \right).$$
(4)

Again, within at least the CCQM, k 95 has generally been asserted to be 2 regardless of how the various quantities are actually estimated.

“Measurement capability” degrees of equivalence for a given buffer

Given N individual d ± U 95(d) estimates for a particular NMI and assuming that they are independently drawn from an approximately normal distribution, a combined “measurement capability” DEq, D ± U 95(D), for that NMI can be estimated from the mean of the d, the standard deviation of the d, and the pooled U 95(d)

$$\begin{gathered} D = \sum_{i = 1}^{N} d_{i} / N;\quad u(D) = \sqrt{\bar{u}^{2}(d) + s^{2}(d)};\quad U_{95}(D) = 2\,u(D) \\ \bar{u}^{2}(d) = \sum_{i = 1}^{N} \left( \frac{U_{95}(d_{i})}{2} \right)^{2} / N \\ s^{2}(d) = \begin{cases} 0 & N = 1 \\ \sum_{i = 1}^{N} (d_{i} - D)^{2} / (N - 1) & N > 1 \end{cases} \end{gathered}$$
(5)

where i indexes over the individual estimates. The U 95(D) estimated in this manner can be considered conservatively large since the among-temperature variability, estimated from the standard deviation of the d i, includes contributions from the within-temperature variability, estimated as the pooled U 95(d i)/2. However, these U 95(D) will always be at least as large as the expected within-temperature U 95(d i) and will closely approach 2·s(d) as between-temperature differences become dominant. Note that u(D) is not scaled by √N since D ± U 95(D) is intended to be characteristic of individual measurement processes rather than an estimate of the central tendency of N processes. The variance propagation results for all five buffer systems are listed in Tables S1.b to S5.b of the ESM, along with the d and u(d) recalculated from the reported results as listed in Tables S1.a to S5.a.
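As a concrete illustration, Eq. 5 reduces to a few lines of code. The sketch below is not the authors’ spreadsheet implementation; the function name `combined_deq` and the Python setting are ours:

```python
import math

def combined_deq(d, U95):
    """Combine per-temperature DEqs d_i +/- U95(d_i) for one NMI (Eq. 5).

    Returns (D, U95_D). Note that u(D) pools the mean reported variance
    with the among-result variance and is NOT scaled by 1/sqrt(N), since
    D is meant to characterize a single measurement process.
    """
    N = len(d)
    D = sum(d) / N                                   # mean of the d_i
    u_bar_sq = sum((U / 2.0) ** 2 for U in U95) / N  # pooled (U95/2)^2
    if N == 1:
        s_sq = 0.0                                   # no among-result term
    else:
        s_sq = sum((di - D) ** 2 for di in d) / (N - 1)
    u_D = math.sqrt(u_bar_sq + s_sq)
    return D, 2.0 * u_D                              # k95 asserted to be 2
```

For example, three d values of 0.001, −0.001, and 0.000 pH units, each with U 95(d) = 0.004, combine to D = 0.000 with U 95(D) ≈ 0.0045.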

Of course, the fact that the d ± U 95(d) can be mathematically combined does not address whether combining them is chemically reasonable. Figure 1 displays the d ± U 95(d) for all NMIs that reported primary pH results in the CCQM-K9, -K9.1, -K9.2, and APMP.QM-K9 studies of the phosphate buffer system along with the combined D ± U 95(D). These d ± U 95(d) estimates are taken directly from the Final Reports or calculated using the data and formulae provided in those reports. The coherence of the d ± U 95(d) over the three temperatures for nearly all of the NMIs suggests that combining the individual estimates is reasonable. If the validity of the combination is accepted, then the D ± U 95(D) provides a snapshot of the NMI’s phosphate buffer primary pH measurement capabilities from 15 °C to 37 °C.

Fig. 1
figure 1

Dot-and-bar plot of degrees of equivalence estimated by variance propagation for all participants in CCQM-K9, -K9.1, -K9.2, or APMP.QM-K9 who reported primary method pH results. The vertical axis displays degrees of equivalence, D ± U 95(D) and d ± U 95(d). The horizontal axis is used to separate the NMIs. The filled circles and thick vertical lines represent the combined D ± U 95(D) for each NMI as estimated from Eq. 5. The NMIs are sorted in order of increasing D within each KC; the KC is identified above the results for the participant with the lowest-valued D within that KC. The open symbols and thin vertical lines represent d ± U 95(d) for measurements made at 15 °C (diamond), 25 °C (triangle), and 37 °C (square) as specified in the KC Final Reports. The thick horizontal line represents zero bias; the thin horizontal lines are visual guides

Revisiting the estimation of degrees of equivalence

Since estimating D ± U 95(D) is outside the scope of the CIPM MRA’s “measurement standard” paradigm, the question arises whether even more informative estimates could be achieved using data analysis approaches that do more than just propagate reported summary estimates.

Key comparison reference value, V KC

While many location estimators have been proposed for evaluating a KCRV, and recent guidance has been provided for choosing and calculating estimators appropriate to particular circumstances [23], all of the KCs considered here have used either the median when there was significant between-result variance, \(s_{\text{b}}^{2}\), or the Graybill–Deal weighted mean [21], x GD, when \(s_{\text{b}}^{2}\) was considered insignificant. The x GD is defined as

$$x_{\text{GD}} = \left. \sum_{i = 1}^{n} \frac{x_{i}}{u^{2}(x_{i})} \right/ \sum_{i = 1}^{n} \frac{1}{u^{2}(x_{i})}$$
(6)

where i indexes over all the accepted results in a KC and n is the number of such results. Three of the root KCs (CCQM-K9, -K17, and -K20) used x GD as their KCRV estimate for all temperatures studied.
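For reference, the Graybill–Deal estimator of Eq. 6 is a simple inverse-variance weighted mean. This sketch is illustrative (the function name is ours):

```python
def graybill_deal(x, u):
    """Graybill-Deal inverse-variance weighted mean (Eq. 6).

    x: accepted reported values; u: their standard uncertainties.
    Each result is weighted by 1/u^2(x_i), normalized by the weight sum.
    """
    w = [1.0 / ui ** 2 for ui in u]
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
```

A result with half the standard uncertainty of another receives four times its weight, which is precisely the sensitivity to overly optimistic u(x) discussed below.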

It is now better appreciated that use of x GD is justified only in the unusual case where \(s_{\text{b}}^{2}\) is truly zero and all of the u(x) are credible. For situations where \(s_{\text{b}}^{2}\) is appreciable but the x follow an approximately unimodal symmetric distribution and the u(x) are at least plausible, the DerSimonian–Laird (DL) [24] weighted mean, x DL, is more appropriate [25]. Commonly used in clinical meta-analysis, x DL is identical to x GD when \(s_{\text{b}}^{2}\) is zero but approaches the arithmetic mean, x mean, as \(s_{\text{b}}^{2}\) becomes large relative to the u(x i ). The x DL is defined as

$$\begin{gathered} x_{\text{DL}} = \left. \sum_{i = 1}^{n} \frac{x_{i}}{s_{\text{b}}^{2} + u^{2}(x_{i})} \right/ \sum_{i = 1}^{n} \frac{1}{s_{\text{b}}^{2} + u^{2}(x_{i})} \\ s_{\text{b}}^{2} = \text{MAX}\left[ 0,\; \left. \left( \sum_{i = 1}^{n} \frac{(x_{i} - x_{\text{GD}})^{2}}{u^{2}(x_{i})} - n + 1 \right) \right/ \left( \sum_{i = 1}^{n} \frac{1}{u^{2}(x_{i})} - \left. \sum_{i = 1}^{n} \frac{1}{u^{4}(x_{i})} \right/ \sum_{i = 1}^{n} \frac{1}{u^{2}(x_{i})} \right) \right] \end{gathered}$$
(7)

where “MAX” is the function “return the largest value of the arguments.” Since x DL asymptotically approaches x mean, it is as sensitive as x mean itself to the presence of discordant results and is only appropriately used after any and all such results have been identified, reviewed by the submitting NMI, and excluded if a cause for the discordance is identified.
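The DL computation of Eq. 7, together with the standard uncertainty u(x DL) given later in Eq. 11, can be sketched as follows. The function name and structure are illustrative, not the authors’ implementation:

```python
import math

def dersimonian_laird(x, u):
    """DerSimonian-Laird weighted mean (Eq. 7) and its usual standard
    uncertainty (Eq. 11). Returns (x_DL, u(x_DL), s_b^2)."""
    n = len(x)
    w = [1.0 / ui ** 2 for ui in u]
    sw = sum(w)
    x_gd = sum(wi * xi for wi, xi in zip(w, x)) / sw        # Eq. 6
    q = sum(wi * (xi - x_gd) ** 2 for wi, xi in zip(w, x))  # weighted sum of squares
    denom = sw - sum(wi ** 2 for wi in w) / sw
    s_b2 = max(0.0, (q - (n - 1)) / denom)                  # truncated at zero
    w_dl = [1.0 / (s_b2 + ui ** 2) for ui in u]
    x_dl = sum(wi * xi for wi, xi in zip(w_dl, x)) / sum(w_dl)
    u_dl = math.sqrt(1.0 / sum(w_dl))                       # Eq. 11
    return x_dl, u_dl, s_b2
```

When s b² is estimated as zero the weights collapse to the Graybill–Deal weights; as s b² grows the weights approach equality and x DL approaches x mean, as described above.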

Due to what was considered appreciable \(s_{\text{b}}^{2}\), the CCQM-K18 and -K19 studies used the median of the accepted x, x median, to estimate the KCRV at each temperature studied. While appropriate for any distribution and robust to minority populations of discordant values, x median is not a very efficient estimate of location (that is, it is more variable than x mean when applied to normally distributed data) and does not make use of any information provided by the u(x) even when they are quite informative [26].

Standard uncertainty of the key comparison reference value, u(V KC )

The three root KCs that used x GD as their KCRV estimates regarded that estimator’s usual standard uncertainty, u(x GD),

$$u(x_{\text{GD}}) = \sqrt{ 1 \left/ \sum_{i = 1}^{n} \frac{1}{u^{2}(x_{i})} \right. } ,$$
(8)

as too small for use as the u(V KC). Instead, a weighted standard deviation estimated using the same inverse-variance weighting used to define x GD was used to provide estimates that take non-zero s b into account. While sometimes referred to as the “external consistency” uncertainty [3, 7, 27], this estimate is more simply termed the “Graybill–Deal weighted standard deviation” and is defined as

$$u_{\text{GD}}(x_{\text{GD}}) = \sqrt{ \left. \sum_{i = 1}^{n} \frac{1}{u^{2}(x_{i})} \frac{(x_{i} - x_{\text{GD}})^{2}}{n - 1} \right/ \sum_{i = 1}^{n} \frac{1}{u^{2}(x_{i})} }$$
(9)

While providing more chemically reasonable u(V KC) for these studies than does u(x GD), this approach does not address the x GD’s bias towards x that have very small u(x).
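The contrast between Eqs. 8 and 9 can be made concrete with a short sketch (function names are ours):

```python
import math

def u_gd_internal(u):
    """Usual standard uncertainty of x_GD (Eq. 8): depends only on the
    reported u(x_i), not on the scatter of the x_i."""
    return math.sqrt(1.0 / sum(1.0 / ui ** 2 for ui in u))

def u_gd_external(x, u):
    """Graybill-Deal weighted standard deviation (Eq. 9): the inverse-
    variance weighted scatter of the x_i about x_GD, reflecting non-zero
    between-result variability."""
    n = len(x)
    w = [1.0 / ui ** 2 for ui in u]
    sw = sum(w)
    x_gd = sum(wi * xi for wi, xi in zip(w, x)) / sw
    return math.sqrt(sum(wi * (xi - x_gd) ** 2 / (n - 1)
                         for wi, xi in zip(w, x)) / sw)
```

When the results scatter more than their reported uncertainties suggest, Eq. 9 yields the larger, more chemically reasonable, u(V KC).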

The two studies that estimated the KCRV values as the x median used a scaled version of the robust median absolute deviation from the median (MAD) dispersion estimate to estimate u(V KC):

$$u_{\text{MAD}} (V_{\text{KC}} ) = {\text{MAD}}(x)\frac{1.858}{{\sqrt {n - 1} }} \equiv {\text{MEDIAN}}\left( {\left| {x - x_{\text{median}} } \right|} \right)\frac{1.858}{{\sqrt {n - 1} }}$$
(10)

where MEDIAN is the function “find the median value of the specified list of values” and the scaling factor of 1.858/√(n − 1) adjusts the estimate to (1) have approximately the same coverage as a standard deviation for normally distributed data, (2) compensate for the lower efficiency of x median relative to x mean, and (3) compensate for the relatively small n. While robust to the inclusion of discordant values, the MAD is inefficient compared to the standard deviation when applied to normally distributed data.
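The scaled-MAD estimate of Eq. 10 is likewise only a few lines; this sketch (the function name is ours) uses the scaling factor described above:

```python
import math
import statistics

def u_mad(x):
    """Scaled-MAD standard uncertainty for a median KCRV (Eq. 10)."""
    n = len(x)
    med = statistics.median(x)
    mad = statistics.median([abs(xi - med) for xi in x])  # MAD(x)
    return mad * 1.858 / math.sqrt(n - 1)                 # scaling per Eq. 10
```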

While various approaches for estimating uncertainties for weighted means have been proposed that provide more efficient coverage intervals [28, 29], the original estimate associated with x DL is [24]

$$u(x_{\text{DL}}) = \sqrt{ 1 \left/ \sum_{i = 1}^{n} \frac{1}{s_{\text{b}}^{2} + u^{2}(x_{i})} \right. } .$$
(11)

Linkages between studies

The CIPM MRA does not specify how results from successor KCs are to be linked to those of a root KC; however, it does mandate [1] that “The results of the RMO key comparisons are linked to key comparison reference values established by CIPM key comparisons by the common participation of some institutes in both CIPM and RMO comparisons. The uncertainty with which comparison data are propagated depends on the number of institutes taking part in both comparisons and on the quality of the results reported by these institutes.” The CCQM has chosen to link successor and RMO KCs using the same general methods.

When a successor or RMO KC uses materials and methods that are sufficiently similar to those used in a root (as is the case for the primary pH studies considered here), the studies can be directly linked through results provided by one or more “anchor” NMIs who successfully participated in a prior KC. For example, results in the successor CCQM-K9.1 are linked to the KCRV of the root CCQM-K9 through results provided by one anchor who made full sets of measurements in both studies, CCQM-K9.2 is linked to CCQM-K9 through the results of two such anchors, and APMP.QM-K9 is linked through results of one anchor from CCQM-K9, one from CCQM-K9.1, and one from CCQM-K9.2. The linkages for all of the pH studies considered here are detailed in Tables S1.a to S5.a of the ESM.

To date, degrees of equivalence for participants in a successor pH KC have been estimated using a “National standard” paradigm assuming that DEq are unchanging over time and samples:

$$\begin{gathered} d = x - V_{\text{KC}} + V_{\text{R}} - V_{\text{S}} \\ \\ u\left( d \right) = \sqrt {u^{2} \left( x \right) + u^{2} \left( {V_{\text{KC}} } \right) + u^{2} \left( {V_{\text{R}} } \right) + u^{2} \left( {V_{\text{S}} } \right) - 2\rho \left( {V_{\text{R}} ,V_{\text{KC}} } \right)u\left( {V_{\text{R}} } \right)u\left( {V_{\text{KC}} } \right)} \\ \\ U_{95} \left( d \right) = k_{95} u\left( d \right) \\ \end{gathered}$$
(12)

where V R is a reference value estimated from the results of the anchor participants in previous studies, u(V R) is its estimated standard uncertainty, V S is a reference value estimated from the results of the anchor participants in the successor KC, u(V S) is its estimated standard uncertainty, and ρ(V R, V KC) is the correlation between the prior studies’ reference values and the KCRV. Although V R has (nearly) always been estimated from a subset of the participants in the root KC, the other quantities are estimated from different data sets and so are not expected to be strongly correlated with one another. As with the d ± U 95(d) estimated for the participants in the root KC, ρ(V R, V KC) has typically been ignored and k 95 asserted to be 2.

In the successor studies involving two or more anchor participants, V R and V S have been estimated from x mean; the standard deviation, s(x); and pooled uncertainty of the anchor participants’ results, \(\bar{u}\)(x). The V S and its standard uncertainty, u(V S), are readily estimated:

$$\begin{gathered} V_{\text{S}} = \sum_{j = 1}^{n} x_{\text{S}j} / n;\quad u(V_{\text{S}}) = \sqrt{ \frac{\bar{u}^{2}(x_{\text{S}}) + s^{2}(x_{\text{S}})}{n} } \\ \bar{u}^{2}(x_{\text{S}}) = \sum_{j = 1}^{n} u^{2}(x_{\text{S}j}) / n \\ s^{2}(x_{\text{S}}) = \begin{cases} 0 & n = 1 \\ \sum_{j = 1}^{n} (x_{\text{S}j} - V_{\text{S}})^{2} / (n - 1) & n > 1 \end{cases} \end{gathered}$$
(13)

where j indexes over the anchors, n is the number of anchors, x S are the results for the anchors in the successor KC, and u(x S) are the standard uncertainties for the anchor values.
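The anchor-based reference value of Eq. 13, together with the measurement-capability DEq of Eq. 15 below, can be sketched as follows. This is illustrative only; the function names are ours:

```python
import math

def anchor_reference(x_s, u_s):
    """V_S and u(V_S) from anchor results in a successor KC (Eq. 13).
    Unlike u(D) in Eq. 5, u(V_S) IS divided by n: it is the uncertainty
    of a mean, not of an individual measurement process."""
    n = len(x_s)
    V_s = sum(x_s) / n
    u_bar_sq = sum(ui ** 2 for ui in u_s) / n
    s_sq = 0.0 if n == 1 else sum((xi - V_s) ** 2 for xi in x_s) / (n - 1)
    return V_s, math.sqrt((u_bar_sq + s_sq) / n)

def capability_deq(x, u_x, V_s, u_Vs, k95=2.0):
    """Measurement-capability DEq for a non-anchor participant (Eq. 15):
    independent of the root KC results."""
    d = x - V_s
    u_d = math.sqrt(u_x ** 2 + u_Vs ** 2)
    return d, k95 * u_d
```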

When all anchors successfully participated in the same prior KC, the estimation process for the prior reference value, V R, is analogous to the above with the x S replaced by x R. However, when some of the anchors participated in different studies (as in APMP.QM-K9), the “national standard” paradigm re-centers all of the anchor values to have the value they “should have had”:

$$x_{\text{adj}} = x_{\text{R}} + d_{\text{R}};\quad u\left( x_{\text{adj}} \right) = u\left( d_{\text{R}} \right)$$
(14)

where x adj designates a re-centered value, x R is the value in the most recent KC in which the anchor successfully participated, d R is the DEq in that KC, and u(d R) is its standard uncertainty. The uncertainty associated with x R, u(x R), is not included in the calculation of \(u\left( {x_{\text{adj}} } \right)\) since it is already included in u(d R).

The measurement capability paradigm suggests a much simpler calculation. If a participant’s result does not reflect the fixed bias of a national standard, successful participation in a prior KC implies only that all anchor participants are expected to routinely realize true values within their assessed uncertainties. The DEq for the non-anchor participants in the successor KC is thus independent of results in the root KC:

$$\begin{gathered} d = x - V_{\text{S}} \\ \\ u\left( d \right) = \sqrt {u^{2} \left( x \right) + u^{2} \left( {V_{\text{S}} } \right)} ;\quad U_{95} \left( d \right) = k_{95} u\left( d \right) \\ \end{gathered}$$
(15)

When there is only one anchor participant, the k 95 expansion factor in Eqs. 12 and 15 must be assigned by expert judgment.

Reference value estimators

When there is more than one anchor participant in a successor KC, using Eq. 13, i.e., estimating V S as x mean, does not make efficient use of the information provided by the reported u(x). As in the estimation of the KCRV, estimating V S as x DL (Eq. 7) and u(V S) as u(x DL) (Eq. 11) makes more complete use of the available information. Further, use of the same estimators for V KC and V S provides a philosophically consistent approach to the analysis of the successor KCs.

Leave-one-out reference values

Estimating a KCRV using all accepted results can be considered to provide the closest realization of an SI unit that can be estimated using a consensus process. However, using that KCRV to estimate the d ± U 95(d) for an x ± U 95(x) used in the determination of the KCRV may result in non-negligible values for the often-ignored ρ(x,V KC) term in Eq. 2. This can be avoided by estimating each d ± U 95(d) relative to a reference value that is independent of the associated x ± U 95(x). At the cost of additional calculations and an \(\sqrt {{n \mathord{\left/ {\vphantom {n {\left( {n - 1} \right)}}} \right. \kern-0pt} {\left( {n - 1} \right)}}}\) increase in the estimated uncertainty, the same estimator used for V KC can provide individual reference values for the d ± U 95(d) for each x ± U 95(x) using all of the accepted results except itself. This leave-one-out (LOO) approach is a routine tool for assessing the predictive utility of regression models [30]. LOO is a particularly useful tool for identifying the influence of particular values on consensus summaries and the consequences of such inclusion on the other values [31].
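The LOO scheme is estimator-agnostic: for each accepted result, the chosen location estimator is simply applied to all the other results. A minimal sketch (names are ours):

```python
def loo_reference_values(x, estimator):
    """Leave-one-out reference values: for each accepted result x[i],
    apply `estimator` (e.g., a GD or DL weighted mean, or the median)
    to every result EXCEPT x[i]."""
    return [estimator(x[:i] + x[i + 1:]) for i in range(len(x))]
```

With the arithmetic mean as the estimator, for example, `loo_reference_values([1.0, 2.0, 3.0], ...)` returns a distinct reference value for each result, none of which depends on the result it will be compared against.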

When the measurement capability linkage of Eq. 15 is used, the LOO-estimated DEq for participants in a root KC does not impact the DEq estimated for participants in successor studies since these are linked only to the KCRV of the root and the measurements made by the anchor participants in the successor KC itself. In any case, eliminating the potential distortion from ignoring non-zero ρ(x,V KC) places the U 95(d) estimates for root and successor KC participants on more equal footing.

Use of corrected and imperfect results

It can happen that an NMI recognizes computational oversights only after the results of a KC have been revealed. While the DEq for such an NMI must be estimated from the originally reported results, when the error results from miscalculation the Working Group may choose to use a transparently corrected result in determining the KCRV. In CCQM-K9, the NMI who reported the errant result had to demonstrate its capability in a successor KC. In these circumstances, an issue arises when the NMI is an anchor in a later successor: which result should be used as the link? The approach used by the EAWG has been to link to the result from the successor KC. However, since the KCRV of the root KC is based in part on that NMI’s corrected result, linkage through the corrected result shortens the linkage chain for later participants without further compromise. As this shortening does not benefit the anchor participant but impacts only those NMIs that are linked through that anchor, “measurement capability” DEq should be based on the most direct valid linkage.

Occasionally, too, results are reported that are valid in their own right but that are excluded from the formal evaluation of the KC and so cannot be used to estimate a national standard DEq. Such exclusions include, but are not limited to, measurements made at conditions slightly different from the KC’s design conditions and values submitted without an accompanying uncertainty budget. Given that the proposed process for combining results is already well outside the scope of the CIPM MRA’s paradigm, it seems reasonable to try to make use of such data after conservative adjustment. For example, (1) measurements made at an off-target temperature could be interpolated to the target if the approximate temperature dependence of the measurements can be estimated or (2) missing uncertainties could be estimated as the “worst case” of previously supplied complete data, assuming that sufficient such data were available. While it would be inappropriate to base critical decisions primarily on resurrected data, ignoring available information is inefficient.

Parametric Bootstrap Monte Carlo analysis

The DEq uncertainty estimates detailed above generally follow the conventional propagation rules, except that degrees of freedom and known correlation issues are routinely ignored. Given the relatively small number of data available for estimating a V KC or V S, the assumption that k 95 = 2 provides an approximately 95 % level of confidence coverage interval about the true value is difficult to justify. Further, while the correlation between a given location estimate and a datum used in its estimation can be determined, the functional relationship can be fairly complex.

Parametric Bootstrap Monte Carlo (PBMC) analysis is one approach that provides a relatively simple and convenient method for estimating coverage intervals directly from just the reported data. Assuming that all of the reported x ± U 95(x) credibly specify N(x, (U 95(x)/2)²) normal kernel distributions, empirical posterior distributions for all d values estimated from Eq. 1 or 15 can be obtained by (1) repetitively sampling all of the input values within their distributions, providing one PBMC sample per reported result for each set, (2) estimating V KC and V S for each of the PBMC sets, and (3) estimating and storing the d (call them d MC) for all of the resampled results in each set. This methodology is closely related to the methods described in [32] and to empirical Bayesian analysis [33].

While not particularly efficient in terms of computer resources, PBMC can be readily implemented in any computational environment that supports user definition of programs for the evaluation of specialized functions (e.g., x DL) and for the storage of intermediate results. Since spreadsheets can provide a familiar working environment that simplifies the definition and maintenance of the linkages between root and successor KCs, PBMC analysis within a spreadsheet environment can be quite efficient in terms of analysts’ resources when appropriate care is taken in their design.
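The resampling loop just outlined can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the EAWG's actual implementation: the reference value is taken as an unweighted mean rather than the DerSimonian–Laird estimator, and the function name is hypothetical.

```python
import random
import statistics

def pbmc_deviations(x, U95, n_mc=1000, seed=1):
    """Parametric bootstrap sketch: draw each reported result from
    N(x_i, (U95_i/2)^2), recompute the reference value for every draw,
    and store the deviations d_MC per participant.  The unweighted mean
    stands in for whatever KCRV estimator is actually used (e.g., x_DL)."""
    rng = random.Random(seed)
    d_mc = [[] for _ in x]                       # one draw list per participant
    for _ in range(n_mc):
        draws = [rng.gauss(xi, Ui / 2) for xi, Ui in zip(x, U95)]
        kcrv = statistics.mean(draws)            # stand-in reference value
        for i, di in enumerate(draws):
            d_mc[i].append(di - kcrv)            # one d_MC per participant
    return d_mc
```

The stored per-participant draw lists are then all that is needed for the percentile summaries of Eqs. 16 to 21.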

Assuming that a suitably large number of PBMC samplings, N MC, is available, d can be estimated from the empirical 50th percentile of the stored PBMC results:

$$d = {\text{PTILE}}\left( {50,d_{\text{MC}} } \right) \equiv {\text{MEDIAN}}\left( {d_{\text{MC}} } \right)$$
(16)

where “PTILE” is the function “return the pth percentile of the specified values”, which for p = 50 is identical to the median. Credible uncertainty intervals about d can be estimated in the same manner, with the 95 % level of confidence interval estimated from the 2.5 and 97.5 percentiles: PTILE(2.5,d MC) and PTILE(97.5,d MC). If the ratio (d − PTILE(2.5,d MC))/(PTILE(97.5,d MC) − d) is about 1, then the usual symmetric 95 % confidence interval on d can be estimated as

$$U_{95} \left( d \right) = \left( {{\text{PTILE}}\left( {97.5,d_{\text{MC}} } \right) - {\text{PTILE}}\left( {2.5,d_{\text{MC}} } \right)} \right)/2$$
(17)

However, if the ratio is far from 1 then the interval can either be reported as asymmetric,

$${}^{ - }U_{95} \left( d \right) = d - {\text{PTILE}}\left( {2.5,d_{\text{MC}} } \right),\quad {}^{ + }U_{95} \left( d \right) = {\text{PTILE}}\left( {97.5,d_{\text{MC}} } \right) - d$$
(18)

or as the larger of the two half-intervals,

$$U_{95} \left( d \right) = {\text{MAX}}\left( {{}^{ - }U_{95} \left( d \right),{}^{ + }U_{95} \left( d \right)} \right)$$
(19)

Asymmetric intervals are the narrowest intervals that provide the stated coverage; however, the familiar symmetric form may be more convenient for use in further calculations. While the symmetric estimates of Eq. 19 are conservative, they increasingly overestimate the length of the interval as asymmetry increases.
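The percentile bookkeeping of Eqs. 16 to 19 can be made concrete with a short sketch; the `asym_tol` threshold quantifying "ratio about 1" is an illustrative choice of ours, not a value taken from the reports.

```python
def ptile(p, values):
    """Linear-interpolation percentile (the spreadsheet PERCENTILE rule)."""
    v = sorted(values)
    k = (p / 100) * (len(v) - 1)
    lo = int(k)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (k - lo) * (v[hi] - v[lo])

def deq_from_pbmc(d_mc, asym_tol=0.25):
    """Eqs. 16-19: median d and a 95 % interval from the stored d_MC.
    Returns (d, U95) using the symmetric half-width (Eq. 17) when the two
    half-intervals are comparable, else the larger half-interval (Eq. 19)."""
    d = ptile(50, d_mc)                        # Eq. 16
    lo95, hi95 = ptile(2.5, d_mc), ptile(97.5, d_mc)
    minus, plus = d - lo95, hi95 - d           # Eq. 18 half-intervals
    if plus > 0 and abs(minus / plus - 1) <= asym_tol:
        return d, (hi95 - lo95) / 2            # Eq. 17, symmetric
    return d, max(minus, plus)                 # Eq. 19, conservative
```

A production version would report the asymmetric pair of Eq. 18 as well, rather than silently collapsing it to the larger half-interval.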

Using the same sets of PBMC d MC used to estimate the d in Eq. 16, the “measurement capability” DEq that combines results for all temperatures in a given buffer (the D for a given NMI of Eq. 5) can be estimated as

$$D = {\text{PTILE}}\left( {50,\bigcup\limits_{t = 1}^{T} {d_{{{\text{MC}}t}} } } \right)$$
(20)

where t indexes over the temperatures, T is the number of temperatures, d MCt are the PBMC values for the given temperature, \(\bigcup{d_{{{\text{MC}}t}} }\) is the union of all the PBMC d MC for a given NMI, and the number of d MC is the same for all temperatures. The U 95(D) can be estimated using the same approaches and decision criteria detailed in Eqs. 17–19:

$$\begin{gathered} U_{95} \left( D \right) = \left( {{\text{PTILE}}\left( {97.5,\bigcup\limits_{t = 1}^{T} {d_{{{\text{MC}}t}} } } \right) - {\text{PTILE}}\left( {2.5,\bigcup\limits_{t = 1}^{T} {d_{{{\text{MC}}t}} } } \right)} \right)/2 \\ {\text{or}} \\ {}^{ - }U_{95} \left( D \right) = D - {\text{PTILE}}\left( {2.5,\bigcup\limits_{t = 1}^{T} {d_{{{\text{MC}}t}} } } \right);\quad {}^{ + }U_{95} \left( D \right) = {\text{PTILE}}\left( {97.5,\bigcup\limits_{t = 1}^{T} {d_{{{\text{MC}}t}} } } \right) - D \\ {\text{or}} \\ U_{95} \left( D \right) = {\text{MAX}}\left( {{}^{ - }U_{95} \left( D \right),{}^{ + }U_{95} \left( D \right)} \right) \\ \end{gathered}$$
(21)
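The union in Eqs. 20 and 21 amounts to concatenating the per-temperature draw sets before taking percentiles. A minimal sketch, assuming balanced draw counts as the text requires (the function name is ours):

```python
import statistics

def capability_D(d_mc_by_temp):
    """Eq. 20 sketch: pool the PBMC draws from every temperature and take
    the median.  Equal draw counts per temperature give each temperature
    equal weight in the pooled percentiles."""
    counts = {len(draws) for draws in d_mc_by_temp.values()}
    if len(counts) != 1:
        raise ValueError("draw counts must be balanced across temperatures")
    pooled = [v for draws in d_mc_by_temp.values() for v in draws]
    return statistics.median(pooled)
```

U 95(D) follows by applying the same 2.5/97.5 percentile machinery of Eq. 21 to the pooled list.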

Figure 2 displays the PBMC-estimated d ± U 95(d) and D ± U 95(D) for all NMIs that provided results for primary pH measurements in phosphate buffer. All of the expanded uncertainties are estimated conservatively as the maximum of the two half-intervals. At graphical resolution, the differences between the national standard estimates of Fig. 1 and the measurement capability estimates of Fig. 2 are quite small.

Fig. 2

Dot-and-bar plot of PBMC estimated degrees of equivalence for the CCQM-K9, -K9.1, -K9.2, and APMP.QM-K9 participants. The graphical format is identical to that of Fig. 1

Figure 3 provides a high-resolution comparison between the DEq as reported in the Final Reports and those estimated using PBMC and the several estimation and linkage modifications proposed above. All of the pH differences are small with none larger than 0.003 and most less than 0.001, but the pattern of changes attributable to specific modifications may be of interest. Figure 3a visualizes the differences in d, U 95(d), D, and U 95(D) attributable to the PBMC estimation method itself. The d are essentially unaffected; the D are mostly unaffected except for those NMIs where the distribution of the combined d MC is not well described as symmetric and unimodal. For these NMIs, the PBMC-estimated median d MC is somewhat closer to the ideal zero D than the arithmetic average. The PBMC-estimated U 95(d) for the CCQM-K9 participants are somewhat smaller than the reported values. The PBMC-estimated U 95(d) for the participants in successor studies are either unchanged or somewhat larger, depending on which KC is considered. The U 95(D) are essentially unaffected, again except for the NMIs where the combined d MC distribution is significantly asymmetric.

Fig. 3

Differences between the degrees of equivalence and their expanded uncertainties for the CCQM-K9, -K9.1, -K9.2, or APMP.QM-K9 participants as reported and as estimated using the proposed modified approaches. The panels display differences due to the use of a the PBMC estimation process, b DerSimonian–Laird weighted mean to estimate all reference values, c linking to a corrected value in CCQM-K9 rather than to its replacement in CCQM-K9.1, d leave-one-out evaluation of CCQM-K9 results, e measurement capability paradigm linkage, and f the combination of all the proposed modifications. For all panels, the horizontal axis displays differences in absolute d and D; the vertical axis displays differences in U 95(d) and U 95(D). Negative values along either axis indicate that the reported values are further from the ideal zero than those estimated using the proposed modification. Small open symbols represent temperature-specific differences in d and U 95(d); large solid symbols differences in the estimated D and U 95(D); circles estimates from CCQM-K9, triangles CCQM-K9.1, diamonds CCQM-K9.2, and squares APMP.QM-K9. The bars on all symbols represent 95 % level of confidence intervals on the PBMC estimates, based on 9 sets of 1000 random draws

Figure 3b depicts the changes attributable to the use of x DL for the reference values in CCQM-K9, -K9.2, and APMP.QM-K9. None of the d and D are changed by more than about ±0.0005. The U 95(d) and U 95(D) are on average very slightly smaller than the values provided in the reports or estimated from them. Figure 3c depicts the change resulting from linking CCQM-K9.2 to the KCRV using the corrected value reported in CCQM-K9 by one of the anchor NMIs rather than that NMI’s official DEq estimated in CCQM-K9.1. The change only affects the APMP.QM-K9 participants. Figure 3d depicts the change resulting from using LOO evaluation for the CCQM-K9 participants, where the d and D become on average about 0.0002 farther from zero and the U 95(d) and U 95(D) become uniformly about 0.0003 larger. These small changes have virtually no effect on the DEq estimated for the participants in the successor KCs.

Figure 3e depicts the change in linkage from the “national standard” paradigm of Eq. 12 to the “measurement capability” paradigm of Eq. 15. The d and D for the participants in the successor KCs are changed by up to ±0.002, reflecting the elimination of the V R bias-correction resulting in a small majority of the DEq becoming closer to the ideal zero. The U 95(d) and U 95(D) for these NMIs rather uniformly become about 0.0005 shorter, reflecting the elimination of the u(V R) uncertainty component.

Figure 3f depicts the whole of the proposed modifications. The great majority of the observed changes are attributable to use of (1) the measurement capability paradigm, (2) PBMC analysis, (3) x DL as the estimator for the reference values in both the root and successor KCs, and (4) LOO analysis of the DEq for participants in the root KC. Note that each of these modifications can have very different effects on the participants in the root and in the successor KCs, and the magnitude of the changes observed with the CCQM-K9, -K9.1, -K9.2, and APMP.QM-K9 studies may not predict their relative impact on other measurement systems.

The PBMC results for all five buffer systems are listed in Tables S1.c to S5.c of the ESM.

“Measurement capability” degrees of equivalence for all buffers

While each buffer system has its unique attributes, the d ± U 95(d) estimates for most NMIs in other buffers where measurements were reported at 15 °C, 25 °C, and 37 °C are about as self-consistent as they are in the phosphate buffer discussed above. Given that all D ± U 95(D) within-buffer estimates appear to “make chemical sense”, it remains to explore how results can be combined across the buffers—and whether such combinations are chemically informative.

To meaningfully combine across the buffer systems, the magnitudes and distributions of the quantities combined must be similar. Figure 4 displays the standard deviation, s(x), the DerSimonian–Laird between-NMI component of variation, s b, and the pooled (see Eq. 7) measurement uncertainties, \(\bar{u}\)(x), estimated from the accepted results in the five root KCs. The \(\bar{u}\)(x) are strikingly similar for all five buffers, indicating that the participating NMIs regarded the measurement processes as being of similar complexity. However, the reported measurement uncertainties do not fully account for the observed between-NMI variability in any of the buffer systems. The magnitude of the unexplained between-NMI variability is about the same and rather small in four of the buffers. Only in the carbonate system investigated in CCQM-K18 and -K18.1 is the unexplained variability significant—and it can be entirely attributed to a reproducible offset in the measurement results reported by two NMIs. While not yet completely understood, this offset is believed to be related to the procedures used to account for slow loss of CO2 from the buffer into the hydrogen flow in the Harned cell.
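For reference, the between-NMI component s b can be obtained with the standard DerSimonian–Laird method-of-moments formula; the sketch below uses reported values and standard uncertainties, and is a generic implementation rather than the exact routine used in the reports.

```python
def dl_between_sd(x, u):
    """DerSimonian-Laird between-participant standard deviation s_b from
    reported values x and standard uncertainties u: method of moments on
    Cochran's Q, with negative moment estimates truncated to zero."""
    w = [1.0 / ui ** 2 for ui in u]                       # inverse-variance weights
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw        # fixed-effect mean
    q = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))  # Cochran's Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(x) - 1)) / c)               # between variance
    return tau2 ** 0.5
```

When the reported uncertainties fully explain the dispersion, Q falls near its degrees of freedom and s b is truncated to zero, which is exactly the behavior described for four of the five buffers.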

Fig. 4

Uncertainty components for the pH-related measurement results reported in the CCQM-K9, -K17, K-18, -K19, and -K20 key comparisons. The horizontal axis displays the KCRVs as estimated from the 15, 25, and 37 °C results accepted for use in estimating the KCRV. The vertical axis displays estimates of variability for these results. The open triangles represent the standard deviations, s, for the reported x in each of these root KCs; the dashed horizontal line the pooled value of the s. The asterisk represents the pooled uncertainty, \(\bar{u}\), of the reported u(x); the thick horizontal line their pooled value. The solid circles represent the DerSimonian–Laird estimate of between-NMI variability, s b; the thin horizontal line their pooled value. The horizontal and vertical lines represent PBMC-estimated 95 % coverage intervals, based on 9 sets of 1000 random draws

The carbonate buffer KCs are also unique in that, owing to the time required for measurement at each temperature, the KC protocol only involved measurements at 25 °C. It is plausible that primary pH measurements in this system may not be comparable to those in the other four buffers. However, the variability of the DEq in the carbonate system is not so much greater than that in the others as to preclude attempting to combine them with those for the other buffers and evaluating the resulting combined values for chemical plausibility.

The number of temperatures evaluated in the EAWG’s pH KCs differs from study to study; further, KC participants do not always report results for all of the temperatures included in the KC design. To provide an “all buffer” DEq summary, Ð ± U 95(Ð), for each NMI, this potential imbalance in the number of temperature-specific d ± U 95(d) available in different buffers requires modification of the single-buffer approaches for combining DEq. This is trivial for the propagation approach, requiring only that the d ± U 95(d) in Eq. 5 be replaced by the summary D ± U 95(D):

$${\text{Ð}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {D_{i} } ,\quad U_{95} \left( {\text{Ð}} \right) = \frac{1}{N}\sqrt {\sum\limits_{i = 1}^{N} {U_{95}^{2} \left( {D_{i} } \right)} }$$
(22)

where i now indexes over the buffers and N is number of buffer systems for which the NMI provided results.
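Assuming Eq. 22 takes the equal-weight mean-and-root-sum-of-squares form implied by replacing d with D in Eq. 5 (an assumption, since Eq. 5 is not reproduced here), the propagation version is a one-liner:

```python
def all_buffer_deq(D, U95):
    """Hypothetical sketch of Eq. 22: unweighted mean of the buffer-specific
    D with U95 propagated as an independent root sum of squares."""
    n = len(D)
    return sum(D) / n, sum(u ** 2 for u in U95) ** 0.5 / n
```

The independence assumption is optimistic (the buffer-specific D share linkage anchors), which is one motivation for checking the propagation result against the PBMC estimate below.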

Estimating Ð ± U 95(Ð) is only a bit more complicated for the PBMC approach of Eq. 16. Using the same sets of PBMC d MC used to estimate d, U 95(d), D, and U 95(D):

$${\text{Ð}} = {\text{PTILE}}\left( {50,\bigcup\limits_{t = 1}^{\sum T} {d_{{{\text{MC}}t}} } } \right)$$
(23)

where t now indexes over all temperatures in all buffers. The U 95(Ð) can be estimated using the analogous modifications to Eq. 21, again using the decision criteria discussed for Eqs. 17–19.

To ensure that each of the five buffer systems has equal influence on the all-buffer Ð ± U 95(Ð) estimates, the total number of d MC should be the same for all buffers: e.g., if 1000 PBMC d MC values are generated for each of the three results reported in the phosphate buffer system, there should be 3000 d MC for the carbonate buffer’s single result. While just a bookkeeping detail, having balanced numbers of d MC is necessary for the PBMC process to yield equal-weighted estimates.
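This bookkeeping step can be done by replicating each buffer's draw list up to a common total; a sketch (function name ours), assuming the replication factors divide evenly:

```python
def balance_draws(d_mc_by_buffer):
    """Replicate per-buffer d_MC lists so that every buffer contributes the
    same total number of draws to the all-buffer pool (e.g., a one-result
    carbonate list is repeated three times against a three-result buffer)."""
    target = max(len(draws) for draws in d_mc_by_buffer.values())
    balanced = {}
    for buf, draws in d_mc_by_buffer.items():
        reps, rem = divmod(target, len(draws))
        if rem:
            raise ValueError(f"{buf}: target is not a multiple of the draw count")
        balanced[buf] = draws * reps               # replicate, do not resample
    return balanced
```

Replication rather than fresh resampling keeps every buffer's empirical distribution exactly as generated while equalizing its weight in the pooled percentiles.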

Figure 5 displays the variance propagation and PBMC-generated Ð ± U 95(Ð) estimates for all NMIs reporting any primary pH result in any of the pH KCs listed in Table 1, with the U 95(D) and U 95(Ð) conservatively estimated as the maximum half-interval. Figure 5 uses the same dot-and-bar format used in Fig. 1, but with the thin lines representing the buffer-specific D ± U 95(D) rather than the within-buffer temperature-specific d ± U 95(d). At graphical resolution, the two methods provide very similar estimates; numeric values of the estimates are listed in Table S6 of the ESM. Figure S6 displays the PBMC results using symmetric and asymmetric U 95(D) and U 95(Ð) intervals.

Fig. 5

Dot-and-bar plots of degrees of equivalence for all NMIs that reported primary method pH results in any of the 11 KCs listed in Table 1. a Variance propagation estimates, b PBMC estimates. The graphical format is similar to that of Fig. 1 with the exception that the large solid circles and thick bars represent the all-temperature all-buffer Ð ± U 95(Ð) summaries, and the smaller symbols and thin lines represent the all-temperature D ± U 95(D) buffer-specific summaries. The smaller solid circles represent results for phosphate buffer (CCQM-K9, -K9.1, -K9.2, APMP.QM-K9), times phthalate buffer (CCQM-K17, EUROMET-K17), solid triangles carbonate buffer (CCQM-K18, -K18.1), plus borate buffer (CCQM-K19, -K19.1), and solid diamonds tetroxalate buffer (CCQM-K20)

The D ± U 95(D) for the carbonate buffer do not appear to be systematically different from those of the other buffer systems. For the large majority of NMIs, the DEq in different buffers are quite coherent. The reproducible and relatively large offset for the NMI coded as “T” has been previously noted and identified as the result of using a somewhat different electrochemical cell design than that used by most other NMIs.

Conclusion

The very similar values of the temperature-specific d ± U 95(d) for the primary pH measurement results reported by most KC participants in each of the five buffer systems suggest that combining them into buffer-specific D ± U 95(D) summaries provides chemically useful information—at least for the measurements made over the range of temperatures evaluated in that buffer. Likewise, the very similar values for the buffer-specific D ± U 95(D) for most NMIs suggest that combining them into the buffer-independent Ð ± U 95(Ð) summaries may usefully summarize the primary pH measurement capabilities of the KC participants—at least for the five buffer systems and 15 °C–37 °C temperature range considered in this study.

While not essential to reaching the above conclusions, we propose a number of modifications to the methods usually used for CIPM MRA degrees of equivalence that may contribute to providing more representative estimates. The most significant of these are use of (1) “measurement capability” linkages between root and successor KCs, (2) Monte Carlo (PBMC and others) methods for evaluating the consequences of different distributional assumptions on the estimation of credible coverage intervals, (3) comparison of leave-one-out (LOO) degrees of equivalence estimates with those using the traditional approach to evaluate the influence of correlation, and (4) a modified dot-and-bar graphic for displaying summary estimates such as D ± U 95(D) and Ð ± U 95(Ð).

The primary pH measurement results provided by the NMI participants in these pH-related KCs were chosen for study for a number of reasons, but chief among them is the remarkable agreement among the participant results over all of the solutions and evaluation temperatures thus far studied by the EAWG. If the degrees of equivalence for these measurements could not have been meaningfully combined, it would be highly unlikely that the results for less well understood and controlled measurement systems could be meaningfully combined. That the primary pH results can be combined using relatively simple analysis and display methods thus does not ensure that similarly meaningful summaries can be devised for other measurement systems, but it provides the incentive to attempt to do so.