Accreditation and Quality Assurance

2011, 16:483

Dark uncertainty


  • M. Thompson
    • School of Science, Birkbeck University of London
  • Stephen L. R. Ellison
    • LGC Ltd
General Paper

DOI: 10.1007/s00769-011-0803-0

Cite this article as:
Thompson, M. & Ellison, S.L.R. Accred Qual Assur (2011) 16: 483. doi:10.1007/s00769-011-0803-0


Standard uncertainties obtained by the GUM approach for a range of analytical methods are compared directly and indirectly with estimates of reproducibility standard deviation for the same methods. Results were obtained from both routine analysis and international key comparisons. A general tendency for the uncertainty to be substantially less than the reproducibility standard deviation was found.


Keywords: Uncertainty; Dark uncertainty; Reproducibility conditions; GUM


Uncertainty is ‘dark’ if it is not visible in the uncertainty budget, the cause-and-effect model of the measurement procedure that is the metrological approach for estimating uncertainty, but appears only as a visible effect in a population of measurement results. All that is needed for the existence of dark uncertainty is one or more missing terms in the budget. We can use the term ‘dark uncertainty’ by analogy with the ‘dark matter’ that seems to exist in galaxies: its gravitational effect is evident but the total mass of visible stars and interstellar dust cannot account for it.

How could we show that dark uncertainty exists? Well, we could look for its effect in studies where the metrological cause-and-effect analysis is compared with a statistical study of replicated results. Any dark uncertainty would appear as an unexpectedly large dispersion of values. In this short paper, we examine evidence for this occurrence and show that dark uncertainty seems to be prevalent in analytical measurement systems. First, we consider some aspects of the metrological and statistical approaches.

Metrological and statistical approaches to the estimation of uncertainty in chemical analysis

In 1995, the ISO document “Guide to the expression of uncertainty in measurement” (GUM) [1] set the scene for an ongoing discussion about different approaches to the estimation of uncertainty. Many scientists considered that only a complete operational model of the measurement process described in GUM, with inputs traceable to international standards, could provide an accurate estimate of uncertainty.¹ Right from the launch of GUM, however, numerous authors have called into question the validity of this absolute standpoint when applied to chemical measurement. Notable among these are the following: the Analytical Methods Committee [2]; Horwitz [3, 4]; Maroto et al. [5]; and Hund et al. [6]. The essence of these arguments is twofold, namely: (a) that chemical measurement is usually too complex to be represented adequately by an a priori model and (b) that user-defined fitness for purpose in analysis seldom demands a relative standard uncertainty smaller than 0.02. The implication of the former point is that the GUM approach might tend to underestimate uncertainty in analysis. The second point implies that uncertainties associated with traceability to SI units are, thanks to a sound physical measurement infrastructure, hardly ever an issue in “real-life” chemical analysis: we can transfer the mole, the kilogramme and the cubic centimetre to the analyst’s bench with far better relative uncertainty than 0.02.

Analytical methods need to be validated before use. Validation traditionally comprises inter alia the estimation of precision by replicating the measurement under various conditions. Reproducibility precision is regarded as most useful but costly to address. Analytical chemists have argued that, with due caution, reproducibility standard deviation is a good indication of uncertainty in many instances. The Eurachem Guide [7] recommends reproducibility standard deviation as a basis for uncertainty provided that bias and contributions associated with traceability (usually calibration uncertainties) are taken into account. (In a strict sense, we should speak of validating the analytical measurement procedure in combination with an appropriately defined class of test material, rather than validating the “method” as such. With this stipulation, bias resulting from a change in test material is eliminated as a separate contribution to the uncertainty).

Potential problems with precision studies

Metrologists justifiably reject replication as an unqualified method of estimating uncertainty, and it is easy to see why. Replication alone could be appropriate only under two conditions, namely:
  • the replication of the measurement is able to explore all of the variable space with equal probability, that is, to explore all of the scope for variation in the analytical measurement procedure;

  • the analytical method is known to be effectively unbiased; that is, systematic effects are negligible in relation to the precision.

We can see immediately that repeatability replication does not fulfil the first requirement. It does not explore the effect of the variation between conditions in different laboratories. For example, if part of the procedure specified drying the test material for 1 h at 110 °C, we might reasonably expect variations in timing between 50 and 70 min and in temperature between 105 and 115 °C on different occasions or in different laboratories. Repeatability results would not reflect this potential variation, so the repeatability standard deviation \( \sigma_{\text{r}} \) would be too small an estimate of the uncertainty contribution.

Measurements under inter-laboratory reproducibility conditions, however, are much better able to explore the sample space for variation, because conditions in one laboratory will be effectively independent of those in another. There are obvious “in-principle” objections to this assertion; for example, different analysts in different laboratories, by virtue of similar training, might still tend to do the same thing in a similar way and thus not visit all parts of the variable space. For example, if the written procedure specified a heating time of 60 min ± 10 min, it might happen that analysts (in different laboratories) tended to use a time of 60 min ± 2 min and thus not sample the full range allowed. This would give rise to a reproducibility standard deviation \( \hat{\sigma }_{\text{R}} \) that was smaller than that appropriate for the method as written. More generally, the use of the same measurement procedure by different laboratories implies that any bias inherent in the procedure will be replicated. In other words, the dispersion underestimates the uncertainty by ignoring systematic effects particular to the procedure. Reproducibility standard deviation will not include this contribution to uncertainty. Some metrologists make the further point that precision estimates make no proper reference to traceability; in our view, this point is not distinct from the issue of bias.

These objections, however, imply that \( \sigma_{\text{R}} \) tends to underestimate uncertainty. It follows that if we find that uncertainties reported by laboratories are too small to account for the observed between-laboratory dispersion, we have evidence of the existence of dark uncertainty to at least the extent by which \( \sigma_{\text{R}} \) is greater than expected from the reported uncertainties alone. It, therefore, remains appropriate to compare observed dispersion with that expected on the basis of reported uncertainties.
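The size of the shortfall can be illustrated with a small calculation. The sketch below assumes, as the text does not state explicitly, that the reported and the dark contributions are independent and combine in quadrature; the function name `dark_uncertainty` and the example numbers are hypothetical.

```python
import math

def dark_uncertainty(s_obs: float, u_reported: float) -> float:
    """Lower-bound estimate of the unaccounted ('dark') standard uncertainty.

    Assumption (not from the paper): reported and dark contributions are
    independent, so their variances add.
    """
    excess_variance = s_obs**2 - u_reported**2
    # A non-positive excess means the reported uncertainty already covers
    # the observed dispersion, so no dark contribution is indicated.
    return math.sqrt(excess_variance) if excess_variance > 0 else 0.0

# Hypothetical numbers: observed dispersion 5.0, reported uncertainty 3.0.
print(dark_uncertainty(5.0, 3.0))  # 4.0
```

Because the reproducibility standard deviation itself tends to underestimate uncertainty, a value obtained this way is best read as a minimum.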

A meta-analysis of some experimental studies

To summarise, we might expect, on at least two separate grounds, that a standard deviation estimated from reproducibility replication will tend to underestimate the GUM uncertainty u of a result. Few reports of direct comparisons are available, but other reported experiments allow an indirect comparison. A meta-analysis of several studies is summarised below.

For a considered assessment of these experiments, however, two factors have to be borne in mind.
  • If the true reproducibility standard deviation \( \sigma_{\text{R}} \) and some ‘best estimate’ u of uncertainty happened to be equal, we would expect roughly equal numbers of instances where the observed ratio \( u/\hat{\sigma}_{\text{R}} \) falls above and below unity. (Note: in this paper, u denotes a standard uncertainty reported by a single laboratory and \( \hat{\sigma}_{\text{R}} \) an appropriate observed, interpolated or otherwise estimated reproducibility standard deviation.)

  • As both \( \hat{\sigma}_{\text{R}} \) and reported uncertainties will usually be estimated with a (statistically) small number of effective degrees of freedom, the confidence limits for \( u/\hat{\sigma}_{\text{R}} \) will be wide. For instance, with 10 degrees of freedom on each estimate and \( u = \sigma_{\text{R}} = 1 \), the estimated ratio will have 95% confidence limits of about 0.5 and 1.9.

For both reasons, individual values of \( u/\hat{\sigma}_{\text{R}} \) convey little information about the overall situation; we require at least a medium-sized sample of values for a useful inference from any one type of comparison.
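The width of these limits can be checked by simulation. The following is a minimal sketch, assuming both estimates derive from normally distributed data so that each squared estimate behaves as a scaled chi-square variate; the function name and the simulation settings are illustrative choices, not taken from the paper.

```python
import math
import random

def ratio_limits(dof: int = 10, n_sim: int = 100_000, seed: int = 1):
    """Simulate the ratio of two SD estimates whose true values are both 1.

    Each squared estimate is modelled as chi-square(dof) / dof.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_sim):
        # chi-square(dof) is a gamma variate with shape dof/2 and scale 2.
        var_u = rng.gammavariate(dof / 2, 2.0) / dof
        var_r = rng.gammavariate(dof / 2, 2.0) / dof
        ratios.append(math.sqrt(var_u / var_r))
    ratios.sort()
    # Empirical 2.5% and 97.5% quantiles of the simulated ratios.
    lower = ratios[int(0.025 * n_sim)]
    upper = ratios[int(0.975 * n_sim)]
    return lower, upper

lower, upper = ratio_limits()
print(f"95% limits for the ratio: {lower:.2f} to {upper:.2f}")
```

The simulated limits should fall near the values of about 0.5 and 1.9 quoted above, which is why a single ratio on its own says little.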

Barwick and Ellison (1998) [8]

This study covered a wide range of analyte types, concentration ranges and physical measurement principles. Given the authors’ involvement with the preparation of authoritative guides relating to the estimation of uncertainty in chemical measurement [7], it is a good presumption that uncertainties estimated in this study by cause-and-effect are appropriate. Unfortunately, the scope for valid comparison with other estimates is limited, as only five examples provided independent values of \( \hat{\sigma}_{\text{R}} \) that could be interpolated from collaborative trials. For these examples, the median value of \( u/\hat{\sigma}_{\text{R}} \) was 1.08, suggesting good overall agreement.

A further comparison can be made by using a reproducibility standard deviation \( \sigma_{\text{R}}^{\prime } \) calculated from the modified Horwitz function [9], which describes closely the trend of \( \sigma_{\text{R}} = f\left( c \right) \) for collaborative trials in the food sector, although it does not, of course, predict individual values exactly. In this instance, the comparison of \( u_{i} \) with \( \sigma_{\text{R}}^{\prime } \) is questionable for the following reasons.
  • The majority of the analyses were not conducted on foodstuffs but on simpler matrices. In these instances \( \sigma_{\text{R}}^{\prime} \) would tend to overestimate uncertainty and might therefore give rise to atypically low values of \( u/\sigma_{\text{R}}^{\prime} \).

  • The Horwitz function underestimates \( \sigma_{\text{R}} \) for results obtained near detection limits of individual analytical methods. Analytes present at concentrations near their detection limits would therefore give rise to atypically high values of \( u/\sigma_{\text{R}}^{\prime} \). No information is provided in the paper that would enable such results to be identified.
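The paper cites the modified Horwitz function [9] without restating it. The sketch below uses the commonly quoted three-segment form, with the concentration c expressed as a mass fraction; the constants are recalled from the general literature rather than taken from this text and should be checked against reference [9] before use. The function names are hypothetical.

```python
def modified_horwitz_sd(c: float) -> float:
    """Predicted reproducibility SD (as a mass fraction) at mass fraction c.

    Three-segment form as commonly quoted; verify constants against [9].
    """
    if c <= 0:
        raise ValueError("mass fraction must be positive")
    if c < 1.2e-7:
        return 0.22 * c            # constant 22% relative SD at trace levels
    if c <= 0.138:
        return 0.02 * c ** 0.8495  # the classical Horwitz segment
    return 0.01 * c ** 0.5         # high-concentration segment

def ratio_to_horwitz(u: float, c: float) -> float:
    """Ratio of a reported standard uncertainty to the Horwitz prediction."""
    return u / modified_horwitz_sd(c)

# Relative SD predicted at a mass fraction of 1e-6 (1 ppm).
print(f"{modified_horwitz_sd(1e-6) / 1e-6:.3f}")
```

A ratio from `ratio_to_horwitz` well below 1 for a simple matrix, or well above 1 near a detection limit, would be subject to the two caveats listed above.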

The comparison is shown in Fig. 1, which distinguishes between values of \( u/\sigma_{\text{R}}^{\prime} \) and \( u/\hat{\sigma}_{\text{R}} \). The overall median value of the ratio is 0.65.

Fig. 1 Ratios of standard uncertainty to reproducibility standard deviation, either from collaborative trials (solid circles) or estimated from the Horwitz function (open circles). (Barwick and Ellison data [8])

Thompson et al. (2002) [10]

This paper reports an experiment that compared uncertainties \( \hat{u}_{\text{rug}} \) estimated from the results of specially designed ruggedness tests with reproducibility standard deviations from published collaborative trials for the same methods. The 11 analytical methods studied were in the food analysis sector but covered a wide range of analyte types, concentrations and physical measurement principles. The magnitudes of the perturbations to the standard methods were selected by four experts to represent the maximum variations in the method thought likely to occur in different laboratories. Each of the 20 executions of a method was a new random combination of the perturbed conditions, and the outcome was the standard deviation of the set of 20 results. Estimated reproducibility standard deviations \( \hat{\sigma }_{\text{R}} \) at the appropriate concentration were derived by interpolation from the results of collaborative trials of the same method on appropriate test materials.
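The design of such a ruggedness experiment can be sketched in code. Everything specific here is an assumption for illustration: the factor names, their perturbation sizes, the linear response model and the choice of two-level (low or high) settings are hypothetical and are not taken from the study.

```python
import random
import statistics

# Hypothetical factors: (nominal value, maximum perturbation, sensitivity of
# the result per unit change). None of these numbers come from the paper.
FACTORS = {
    "drying_time_min": (60.0, 10.0, 0.02),
    "drying_temp_C": (110.0, 5.0, 0.05),
    "extraction_pH": (7.0, 0.2, 1.5),
}

def run_once(rng, true_value=100.0, repeatability_sd=0.5):
    """One execution with every factor set at random to its low or high level."""
    result = true_value + rng.gauss(0.0, repeatability_sd)
    for nominal, delta, sensitivity in FACTORS.values():
        setting = nominal + rng.choice((-delta, delta))
        # Assumed linear effect of the deviation from the nominal setting.
        result += sensitivity * (setting - nominal)
    return result

def ruggedness_sd(n_runs=20, seed=0):
    """Standard deviation of n_runs randomly perturbed executions."""
    rng = random.Random(seed)
    return statistics.stdev(run_once(rng) for _ in range(n_runs))

print(ruggedness_sd())
```

Any source of variation left out of `FACTORS` cannot contribute to the resulting standard deviation, which is the mechanism by which an expert-designed test can still fall short of the reproducibility standard deviation.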

The outcome of the experiment can be summarised by a dot plot of the ratio \( \hat{u}_{\text{rug}}/\hat{\sigma}_{\text{R}} \) for each analyte (Fig. 2). In two instances, there was insufficient collaborative trial information to provide a value for \( \hat{\sigma}_{\text{R}} \), and an estimate derived from the Horwitz equation was used instead. Despite one high value, the tendency is for a ratio of <1.0, with an overall median value of 0.55, clearly showing that even expert assessment of the likely sources of variation was insufficient to identify all of the actual sources. There were, clearly, ‘dark uncertainties’ operating.

Fig. 2 Ratios of standard uncertainty estimated from ruggedness experiments to reproducibility standard deviation, either from collaborative trials (solid circles) or estimated from the Horwitz function (open circles). (Thompson et al. data [10])

Populaire and Giménez (2006) [11]

These authors studied the determination of analytes occurring in food from sub-ppm concentrations to about 10% mass fraction. They estimated uncertainties by the GUM method and compared them (inter alia) to collaborative data obtained among the laboratories participating in an internal company exercise. The ratios \( u/\hat{\sigma}_{\text{R}} \) found are shown in Fig. 3. There are two high values but the clear tendency is for a ratio <1.0, with a median value of 0.75.

Fig. 3 Ratios of standard uncertainty estimated from a metrological model to reproducibility standard deviation from collaborative data. (Populaire and Giménez data [11])

Ellison and Mathieson (2008) [12]

This study of performance evaluation strategies was based on results reported with uncertainties to a food analysis proficiency test. Ninety-eight different analytes were involved, of which most appeared only once. The range of concentrations was wide, encompassing mass fractions between \( 10^{-11} \) and 0.1. For assessment, the i-th result \( x_{i} \), together with its uncertainty \( u_{i} \), was converted to a ‘zeta score’ as \( \zeta_{i} = (x_{i} - x_{A})/\sqrt{u_{i}^{2} + u_{A}^{2}} \). Assuming correct reported uncertainties, and unbiased assigned values \( x_{A} \) with uncertainties \( u_{A} \), the variance of \( \zeta \) should have an expectation of unity (1.0). However, the whole data set provided a robust standard deviation of 1.24, significantly greater than unity, showing ‘strong evidence of uncertainty underestimation among at least some respondents’. This outcome was only to be expected as many of the respondents stated that they were using repeatability precision as the basis for the uncertainty estimate.
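The behaviour of the zeta score can be illustrated with a short simulation. This is a sketch under stated assumptions: results and assigned values are drawn from normal distributions around a common true value, all numerical values are hypothetical, and the ordinary sample standard deviation is used in place of the robust estimator applied in the study.

```python
import math
import random
import statistics

def zeta_score(x, u, x_assigned, u_assigned):
    """Zeta score of one result against the assigned value."""
    return (x - x_assigned) / math.sqrt(u**2 + u_assigned**2)

def sd_of_zeta(understatement=1.0, n_results=5000, seed=2):
    """SD of zeta scores when laboratories report u_true / understatement."""
    rng = random.Random(seed)
    true_value, u_true, u_assigned = 10.0, 0.5, 0.05  # hypothetical values
    scores = []
    for _ in range(n_results):
        x = rng.gauss(true_value, u_true)
        x_assigned = rng.gauss(true_value, u_assigned)
        u_reported = u_true / understatement
        scores.append(zeta_score(x, u_reported, x_assigned, u_assigned))
    return statistics.stdev(scores)

print(sd_of_zeta(1.0))  # close to 1 when reported uncertainties are correct
print(sd_of_zeta(2.0))  # well above 1 when reported uncertainties are halved
```

A standard deviation of zeta scores above unity therefore points to reported uncertainties that are too small on average, as long as the assigned values are unbiased.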

A direct examination of \( u_{i}/s_{\text{R}} \) was not practicable in this study, but the authors recalculated the zeta scores as \( \zeta_{i}^{\prime} = (x_{i} - x_{A})/\sqrt{\sigma_{\text{R}}^{\prime 2} + u_{A}^{2}} \) by using a reproducibility standard deviation \( \sigma_{\text{R}}^{\prime} \) estimated from the modified Horwitz function. The value of the ratio \( \text{sd}(\zeta^{\prime})/\text{sd}(\zeta) \) should therefore tend towards \( u/\sigma_{\text{R}}^{\prime} \) for small \( u_{A} \) (usually the case in proficiency tests), being larger than 1 where the uncertainties are generally larger than the modified Horwitz prediction and <1 where the modified Horwitz prediction is the larger. The individual values of \( \text{sd}(\zeta^{\prime})/\text{sd}(\zeta) \) were not available in the published paper, but the mean value observed was 0.86, consistent with previous findings. The authors’ comment on this is worth quoting in full:

…the number of unacceptable zeta scores is lower using any of the reproducibility models than by relying on reported uncertainties—a quite extraordinary observation if it is assumed that the laboratory’s own data and uncertainty evaluation should provide the most accurate uncertainty estimates.

BIPM international key comparisons [13]

A compelling demonstration of the prevalence of dark uncertainty comes from the results of the BIPM International Key Comparisons, in which a group of national reference laboratories from different countries analyse a variety of materials blind. Each laboratory conducts the measurement with the best possible accuracy and reports the result to BIPM together with an uncertainty estimated by the GUM method. The purpose of these comparisons is to ensure that important measurements are equivalent internationally.

If the individual uncertainty estimates are correct, they should account for all of the variation among the participants’ reported results. However, if there is dark uncertainty present, the dispersion of the participants’ results would be greater, on average, than that estimated from the uncertainties of individual participants. This was tested by the examination of statistics from 28 key comparison data sets from 24 key comparisons, as follows (CCQM-K number and analyte): 1 (O₃); 2 (Cd, Pb); 24 (Cd); 27 (ethanol); 28 (tributyltin); 29 (Cl, PO₄); 29a (Cl); 30 (Pb); 33 (Cr, Mn, Ni, Mo); 38 (five PAHs); 48 (K); 52 (CO₂); 59 (NO₃, NO₂); 63b (progesterone). The data sets used were selected from those available to cover a range of analytes, concentrations and measurement methods. Data sets with <7 participants were excluded but no other selection criterion was applied.

The observed variation of the reported results was characterised as a simple standard deviation estimate \( s_{\text{obs}} \). The probability density function of the standard deviation (\( s_{\exp } \)) expected on the basis of the individual uncertainties was estimated by 1,000 Monte Carlo simulations, assuming a normal distribution for the individual contributions. In that way \( s_{\text{obs}} \) could be compared with the mean (\( \bar{s}_{\exp } \)) and quantiles of \( s_{\exp } \). Despite the unsupported assumption of normality, the procedure would be expected to give a reasonable account of the situation.
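The procedure just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes every participant measures the same true value (set to zero), draws each result from a normal distribution with that participant's reported standard uncertainty, and uses hypothetical uncertainties and a hypothetical observed standard deviation.

```python
import random
import statistics

def expected_sd_distribution(uncertainties, n_sim=1000, seed=3):
    """Mean and 97.5% quantile of the SD expected from reported uncertainties."""
    rng = random.Random(seed)
    simulated = []
    for _ in range(n_sim):
        # One simulated comparison: each lab's error is N(0, u_i).
        results = [rng.gauss(0.0, u) for u in uncertainties]
        simulated.append(statistics.stdev(results))
    simulated.sort()
    return statistics.mean(simulated), simulated[int(0.975 * n_sim)]

# Hypothetical reported standard uncertainties for eight participants.
reported_u = [0.8, 1.0, 1.2, 0.9, 1.1, 1.0, 0.7, 1.3]
s_obs = 2.4  # hypothetical observed SD of the reported results

mean_s_exp, q975 = expected_sd_distribution(reported_u)
print(f"ratio of expected to observed SD: {mean_s_exp / s_obs:.2f}")
print(f"observed SD exceeds 97.5% quantile: {s_obs > q975}")
```

An observed standard deviation above the upper quantile indicates dispersion that the reported uncertainties cannot plausibly explain.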

In the event, it was found that in 21/28 instances, \( s_{\text{obs}} \) exceeded the value expected from the individual uncertainties (\( \bar{s}_{\exp} \)), and in 18/28 instances it exceeded the 97.5th percentile of \( s_{\exp} \) (Fig. 4). The median of all 28 values of the ratio \( \bar{s}_{\exp}/s_{\text{obs}} \) was 0.45. This demonstrates that the GUM method of uncertainty estimation, even in expert hands, is susceptible to underestimation because of dark uncertainty.

Fig. 4 Ratio of standard deviation of results expected from uncertainty estimates to that observed from results. (BIPM International Key Comparisons data)


Despite the logical conclusion that a reproducibility standard deviation \( \sigma_{\text{R}} \) will tend to underestimate the true uncertainty because of its failure to account for bias, there seems to be a general tendency in chemical measurement for the GUM approach to offer an even lower and therefore less accurate estimate, even 15 years after the implementation of the GUM. On the basis of current data, this seems to be as true of studies involving national measurement laboratories as of routine test laboratories; in every case, at least some of the laboratories must be omitting important contributions from their formal uncertainty calculations, whether by underestimating recognised uncertainty sources, by omitting important effects from the model used or for other reasons.

This is evidence for the prevalence of dark uncertainty in much of chemical measurement. Moreover, the reproducibility standard deviation is thus vindicated as an important benchmark for judging the relevance of uncertainty estimates.


¹ The term ‘estimate’ is used in relation to measurement uncertainty in this paper because, although the GUM does not treat measurement uncertainties as estimates, statements of uncertainty are invariably based on observed dispersions or judgements that are themselves estimates of a population parameter. Further, stated uncertainties may be over- or, more commonly, under-stated owing, for example, to incomplete models, and therefore in some sense subject to error.


Copyright information

© Springer-Verlag 2011