3.1 Introduction

As presented in the previous chapter, the characterization of measurement as a process specified by a procedure is given in terms of a set of necessary conditions. Precisely because it is so fundamental, this characterization provides a very abstract picture of what measurement actually is, and it now needs to be expanded and placed in a more concrete context. This is the purpose of the present chapter, which develops along three parallel lines.

The first line of development starts from the acknowledgment that the empirical nature of the properties to be measured and of the process of measuring them prevents exact (i.e., perfectly certain and specific) evaluations. On this basis, we elaborate the requirement that measurement should produce information about both values of the measurand and the quality of such an evaluation. This examination of measurement quality must begin with the features of the measuring instrument itself: the empirical nature of the instrument implies unavoidably non-ideal behavior that affects the quality of the results it generates; the lack of complete accuracy of the instrument is a key (though not the only) source of the errors and uncertainties in the results.

The second line of development starts from the acknowledgment that measurement is a relational process—in the usual case for physical quantities it compares the measurand and the chosen unit—and embeds it into the scientific, technical, and organizational system that is the foundation for this relationship. This is made possible through the definition and the dissemination of measurement standards embodying the relevant reference properties. Measuring instruments are calibrated by means of such standards.

The third line of development complements this operational context with a conceptual one, by tracing from a historical perspective the common position that only quantitative properties are measurable. The outcome of our analysis is that the Euclidean basis of this assumption is not sufficient as such for maintaining the constraint: this will pave the way for a further analysis of measurability.

3.2 The Quality of Measurement and Its Results

In Chap. 2, we introduced the distinction between a measurement procedure and a measurement process: the former is the description that specifies how the latter, i.e., the process, must be performed. Even if the specifications are exactly fulfilled, two measurement processes implementing the same procedure on the same object may produce different results. This calls for an explanation.

In principle, two situations might obtain. The property with which the measuring instrument is designed to interact either

  • has changed, so that different measurement results may correctly report the fact that the object under measurement modified its state in the interval between the interactions, or

  • has not changed, but the behavior of the measuring instrument has been affected by changes in the state of the environment or of the measuring instrument itself, so that different measurement results incorrectly report a difference that is not related to the measurand.Footnote 1

From the perspective of measurement as such, the first case is not problematic: the property under measurement may actually change as the result of its dependence on other properties, of the object or the environment. We call any property whose changes produce a change in the measured property an affecting property. For example, if the measured property is the length of an iron rod, then due to thermal expansion, the temperature of the environment is an example of an affecting property. If the measured property is the reading comprehension ability of a student, an example of an affecting property might be the intensity of distracting noises from the environment, insofar as such noise could negatively affect the student’s ability to comprehend a text they are attempting to read. However, there can be properties other than the measurand which alter the behavior of the measuring instrument and therefore generate the second case; these are called influence properties (JCGM, 2012: 2:52). In the example of the measurement of the thickness of a rod by means of a caliper, an example of an influence property is the parallelism of the jaws, whereas in the example of the measurement of the reading comprehension ability of a student by means of a test with multiple choice questions, an example of an influence property might be human error in the scoring of the item responses. The role of affecting properties and influence properties in a measurement is depicted in Fig. 3.1.

Fig. 3.1 Black-box model of the empirical behavior of a measuring instrument (block diagram: the property, the instrument, and the indication in series, with the affecting and influence properties acting on them)

In a typical context, the two situations just discussed are not mutually exclusive, and in fact they might co-occur to generate the multiplicity of results mentioned above. Under the principled hypothesis that the influences described in the second situation can be identified, the problem arises of how to deal with the incorrect results that are obtained. Two basic strategies can be envisaged.

An empirical strategy aims at improving the behavior of the measuring instrument by reducing its sensitivity to the influence properties and therefore the misleading variability of its results. This is a positive outcome, generally obtained at the price of additional resources (including money, competencies, time, etc.) devoted to the measurement. Of course, this is not always feasible.

The fact that measurement is both an empirical and an informational process makes possible a complementary, informational strategy: if the undesired variability cannot be completely removed, it can at least be modeled, evaluated, and formally expressed. The fundamental outcome is the acknowledgment that only in the simplest cases is the information acquired on a measurand by means of a measurement entirely conveyed by a single measured value. Generally, a structurally more complex result has to be reported instead. This is why the International Vocabulary of Metrology (VIM) defines <measurement result> as a “set of quantity values being attributed to a measurand together with any other available relevant information” (JCGM, 2012: 2.9). This complexity is justified under the assumption that “when reporting the result of a measurement […], it is obligatory that some quantitative indication of the quality of the result be given so that those who use it can assess its reliability”Footnote 2 (JCGM, 2008a: 0.1; emphasis added). In fact, one may even take as a definitional condition of measurement that its results include some information about their quality, as proposed for example by the U.S. National Institute of Standards and Technology (NIST) (Possolo, 2015: p. 12).

The subject is intermingled with the development of statistics and the theory of probability (see, e.g., Hacking, 1975, 1990, and Rossi, 2014), and the diversity of interpretations of probability is reflected in the diversity of understandings of the role of probability in measurement. In particular, according to the VIM, there are two “philosophies and descriptions of measurement”, identified as the “Error Approach (sometimes called Traditional Approach or True Value Approach)” and the “Uncertainty Approach” (JCGM, 2012: Introduction). While for some purposes this might be too rough a classification, and some cases may be intermediate (see, e.g., Giordani & Mari, 2014), or even of a different kind (see, e.g., Ferrero & Salicone, 2006), we adopt this distinction to introduce the informational strategy and therefore the pivotal concepts of measurement error and measurement uncertainty.Footnote 3 However, before discussing such concepts, we first introduce a framework in which they may be understood.

3.2.1 A Sketch of the Framework

“Measurement is essentially a production process, the product being numbers” (Speitel, 1992). While we adopt a more encompassing view of what is produced by a measurement (i.e., not only numbers), we agree that measurement is a production process, and that the quality of the products and the quality of the process are in principle distinct (though plausibly related), so that each deserves some consideration. It is clear that users focus on the quality of what is produced: in general, users would like to have trustworthy, useful information on the measurands in which they are interested. Were it possible to disentangle the quality of the process from the quality of the products, the former would be immaterial. But, as with any production process, the quality of the products—measurement results in this case—depends on the quality of the process, which is why both need to be taken into account.

The quality of a process of measurement has to do with the features of the experimental setup, which includes the measuring instrument(s) and everything that is exploited to control the environment with the aim of reducing its effects on the behavior of the instrument(s), so that in the best case, the output of the instrument conveys information only about the measurand and nothing else. As mentioned in Sect. 2.3, measurement is sometimes modeled as a black box that transduces an input property, i.e., the property being measured, to an output property, i.e., the instrument indication, under the acknowledgment that the transformation is usually affected by some influence properties. Such a transformation is modeled as an empirical transduction function, whose informational inverse is the instrument calibration function, which maps values of the instrument indication to values of the measurand.

On this basis, a wealth of models and accompanying parameters have been developed to characterize the behavior of measuring instruments. A preliminary distinction needs to be maintained between the “nominal” behavior of an instrument, as shown in the controlled conditions of instrument manufacturer laboratories, and its “field” behavior in actual measurements, the former typically being the limit, best case of the latter. The characterization of the nominal behavior of an instrument is supposed to be performed in such a way that the values of the relevant properties, both input and output, are known independently of the instrument under test; this makes it clear that such a characterization is not a measurement, though it may exploit previous measurement results. In what follows, the values of the property being measured are designated as xi, the corresponding values of the instrument indication as yj (where both the property being measured and the instrument indication are assumed to be scalar for the sake of simplicity), and the values of the (vector of) influence properties affecting the indication as zk, so that the transduction performed by the instrument is modeled as a function f of the form yj = f(xi, zk), that is, qualitatively, output = f(sought input, spurious input). Among the several parameters that characterize the behavior of a measuring instrument, and therefore its transduction function f, we focus on:

  • sensitivity, the ability of instruments to produce distinct outputs in response to distinct expected inputs;

  • selectivity, the ability of instruments to produce outputs unaffected by spurious inputs;

  • stability, the ability of instruments to produce the same outputs in response to the same expected inputs even at different instants of time;

  • resolution, the ability of instruments to detect distinct expected inputs.

Together with other parameters not discussed further, like discrimination threshold (JCGM, 2012: 4.16) and dead band (JCGM, 2012: 4.17), these “low-level” features of measuring instruments contribute to the “high-level” features of precision, trueness, and finally accuracy, as discussed in the rest of this section. It is remarkable that these parameters are structural features of measuring instruments, and as such are in principle independent of whether what the instruments measure are physical or psychosocial properties, and of whether such properties are quantitative or not. As a matter of fact, the tradition of physical instrumentation for quantitative properties is far more developed on this subject, and that is why we provide here definitions that apply to quantities, thus including differences, ratios, etc., with a simple running example of a spring dynamometer, which transduces the input weight force (the property being measured) to an output spring elongation (the instrument indication) (extensions to non-quantitative properties are indeed possible, as discussed in Mencattini & Mari, 2015).
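The black-box transduction y = f(x, z) can be sketched for the running spring dynamometer example. This is only an illustrative model: the spring constant, the temperature coefficient of the stiffness, and the reference temperature below are invented assumptions, not values from the text.

```python
# Hypothetical black-box model of a spring dynamometer: the indication y
# (spring elongation, in metres) is a function f of the measured property x
# (applied force, in newtons) and an influence property z (environmental
# temperature, in kelvin). All constants are assumptions for illustration.

K = 200.0        # spring constant, N/m (assumed)
ALPHA = 1e-5     # fractional change in stiffness per kelvin (assumed)
Z_REF = 293.15   # reference temperature, K

def f(x, z):
    """Transduction function y = f(x, z): elongation in metres."""
    k_eff = K * (1.0 - ALPHA * (z - Z_REF))  # stiffness drifts with temperature
    return x / k_eff

# At the reference temperature, a 10 N force yields 10/200 = 0.05 m elongation.
y = f(10.0, Z_REF)
```

The parameters discussed below (sensitivity, selectivity, stability, resolution) can all be read as properties of such a function f.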

Despite the formal possibility of applying these parameters in psychosocial measurements, in practice they are not common in the related literature. The main reason is the requirement that the values of the input property be known, whereas for most psychosocial contexts there are no “controlled conditions of instrument manufacturer laboratories”, as was noted above. Nevertheless, these parameters are indeed sensible ones and would be useful if available, even in somewhat approximate and/or different guises; hence, we comment on such possibilities in the paragraphs below. We use somewhat different specific examples for different parameters, as distinct possibilities arise more commonly in some contexts than others, but will nevertheless try to incorporate the running example of reading comprehension ability (RCA) introduced earlier.

The sensitivity of a measuring instrument, according to the VIM (JCGM, 2012: 4.12, adapted), is the “quotient of the change in an indication value of a measuring instrument and the corresponding change in a value of a property being measured”, under the supposition that the influence properties remain constant; hence,

$$sens(instrument,x,z) = \frac{f(x + \Delta x, z) - f(x, z)}{\Delta x} = \frac{\Delta y_{z}}{\Delta x}$$

for a sufficiently small change ∆x. The sensitivity of the spring is a value in the appropriate unit (e.g., metres per newton in the SI), describing how much the spring elongates in response to a change of the applied force while all influence properties remain constant; it may be a function of the measured value x and, in turn, also of the constant values of the influence properties. An instrument such that the indication value y depends linearly on the measured value x has constant sensitivity, and an instrument whose sensitivity is zero for a given set of forces is useless for measuring a force in that set. However, the sensitivity of an instrument may be different for different measured values, and this is modeled with a sensitivity function, which for some physical transducers is logarithmic, with decreasing sensitivity as the measured value increases.
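The sensitivity quotient above can be estimated numerically by a finite difference. The linear Hooke's-law model and the spring constant below are assumptions for illustration; for such a linear instrument the sensitivity is constant, as stated in the text.

```python
# Finite-difference estimate of sens(instrument, x, z) = [f(x+dx, z) - f(x, z)] / dx
# for a hypothetical linear spring dynamometer (assumed model: f(x, z) = x / K,
# with the influence properties z held constant and having no effect here).

K = 200.0  # spring constant, N/m (assumed)

def f(x, z):
    return x / K  # elongation in metres

def sensitivity(x, z, dx=1e-6):
    """Sensitivity at input x, in metres per newton."""
    return (f(x + dx, z) - f(x, z)) / dx

# A linear instrument has constant sensitivity 1/K = 0.005 m/N,
# independent of the measured value x.
s = sensitivity(10.0, 293.15)
```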

In the context of psychosocial measurements,Footnote 4 sensitivity can be qualitatively ascertained up to a limit. In a situation where there is a broadly accepted and recognized difference between, say, two readers in terms of their RCAs, then one could ascertain the ability of a certain RCA test to distinguish them, say, by an expert panel of judges of readers’ RCAs. This qualitative observation could then be used in a more controlled way to establish an ordinal categorization of multiple such judgments, such as, for say, typical grade 2 students, typical grade 3 students. This could then be used to describe, again in a qualitative way, the sensitivity of a certain RCA test to distinguish students with those RCAs. Thus, the logic pertains in the psychosocial setting, even if the realization of the parameter is not precise.

The selectivity of a measuring instrument is basically its insensitivity to influence properties. While the property being measured is held constant, the selectivity of an instrument with respect to a given influence property can be evaluated as

$$sel(instrument,z,x) = \frac{\Delta z}{f(x, z + \Delta z) - f(x, z)}$$

for a sufficiently small change ∆z. The selectivity of the spring with respect to temperature is a value in the appropriate unit (kelvin per metre in the SI), describing how large a change in the environmental temperature is needed to produce a given change in the spring elongation, while the applied force and all other influence properties remain constant: the greater the selectivity, the better the instrument.Footnote 5
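The selectivity quotient can likewise be estimated by a finite difference over the influence property. The temperature-dependent stiffness model and its constants are invented for illustration; a large value of the quotient means a large temperature change is needed to disturb the indication, i.e., a highly selective instrument.

```python
# Finite-difference estimate of sel(instrument, z, x) = dz / [f(x, z+dz) - f(x, z)]
# for a hypothetical spring dynamometer whose stiffness drifts with temperature.
# All constants are assumptions for illustration.

K, ALPHA, Z_REF = 200.0, 1e-5, 293.15  # assumed spring model constants

def f(x, z):
    return x / (K * (1.0 - ALPHA * (z - Z_REF)))  # elongation in metres

def selectivity(z, x, dz=1e-3):
    """Selectivity with respect to temperature, in kelvin per metre."""
    return dz / (f(x, z + dz) - f(x, z))

# For this model, roughly K / (x * ALPHA) kelvin are needed to shift the
# indication by one metre: about 2e6 K/m at x = 10 N.
sel = selectivity(Z_REF, 10.0)
```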

An example of an influence property for a mathematics test is the RCA of the student taking the test. Thus, the selectivity of the test with respect to the RCA of the reader is the ratio of a given change in the RCA of the reader to the corresponding change in the test outcomes. As noted above, this ratio may vary depending on the value of the reader’s mathematics knowledge, and hence, an estimate of the derivative with respect to RCA may yield a selectivity function.

The stability of a measuring instrument is basically its insensitivity to time. Thus, under the acknowledgment that f may depend on time, and while both the property being measured and the influence properties are held constant, the stability of an instrument can be evaluated as

$$stab(instrument,\Delta t,x,z) = \frac{\Delta t}{f(x, z, t + \Delta t) - f(x, z, t)}$$

for a sufficiently small change ∆t. The stability of the spring is a value in the appropriate unit (seconds per metre in the SI), describing how much time must elapse for the spring elongation to change, while the applied force and all influence properties do not change (stability considered in short intervals of time is also called repeatability): the greater the stability, the better the instrument.Footnote 6
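The stability quotient can be sketched for an instrument whose indication drifts over time. The linear drift rate below is an invented assumption; a large quotient means much time must elapse before drift moves the indication appreciably, i.e., a highly stable instrument.

```python
# Finite-difference estimate of stab(instrument, dt, x, z) =
# dt / [f(x, z, t+dt) - f(x, z, t)] for a hypothetical spring dynamometer
# whose indication drifts linearly with time. Constants are assumptions.

K = 200.0      # spring constant, N/m (assumed)
DRIFT = 1e-9   # drift of the indication, metres per second (assumed)

def f(x, z, t):
    return x / K + DRIFT * t  # indication slowly drifts over time

def stability(x, z, t, dt=1.0):
    """Stability, in seconds per metre: greater is better."""
    return dt / (f(x, z, t + dt) - f(x, z, t))

# With a drift of 1e-9 m/s, the stability is 1/DRIFT = 1e9 s/m.
st = stability(10.0, 293.15, 0.0)
```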

In theory, for an RCA test, stability could be observed by retesting the students after a suitably small period. However, in the case of RCA, as with many psychosocial tests, it is unlikely that stability could be evaluated with a single test, as the readers would almost certainly recognize the passages and the questions, and recall their own answers—hence the property under measurement would be confounded with the readers’ memories of the first test, and this “theoretical” approach is not useful. An instrument design that would be relevant here is a test bank: students can be observed using different tests drawn from the bank, with no overlapping items, so that the stability of RCA results over different choices from the bank could be observed.

The resolution of a measuring instrument, according to the VIM (JCGM, 2012: 4.14, adapted), is the “smallest change in a property being measured that causes a perceptible change in the corresponding indication”, thus again under the supposition that the influence properties remain constant; hence,

$$res(instrument,x,z) = \min\,\Delta x\ \text{such that}\ f(x + \Delta x, z) - f(x, z) = \Delta y \ne 0$$

The resolution of the spring is a value in the appropriate unit (newton in the SI), describing the minimum change of applied force that changes the spring elongation, while all influence properties remain constant: the smaller the resolution, the better the instrument.
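Resolution can be sketched for an instrument whose indication is quantized, e.g., a dynamometer with a digital display. The spring constant and the display step below are invented assumptions; input changes that move the indication by less than one display step go undetected.

```python
# Sketch of res(instrument, x, z): the smallest input change that produces a
# nonzero change in a quantized indication. The spring constant and the
# display quantization step are assumptions for illustration.

K = 200.0     # spring constant, N/m (assumed)
STEP = 1e-4   # indication quantization step, metres (assumed)

def indication(x):
    """Quantized indication: elongation rounded to the display step."""
    return round((x / K) / STEP) * STEP

def resolution(x, dx_grid):
    """Smallest dx in dx_grid that visibly changes the indication, else None."""
    y0 = indication(x)
    for dx in sorted(dx_grid):
        if indication(x + dx) != y0:
            return dx
    return None

# With a 1e-4 m display step and sensitivity 1/K = 0.005 m/N, force changes
# much below roughly STEP * K = 0.02 N leave the indication unchanged.
r = resolution(10.0, [0.001, 0.004, 0.03, 0.05])
```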

In the case of psychosocial measurements, the difficulty of obtaining the values x of the property, as noted above, makes this parameter difficult to calculate. Nevertheless, some insight into the resolution can be gained, based on the discrete nature of the items in the instrument. For an RCA test, the minimum detectable change in the indication would be a change where one of the set of items changed from being incorrect to correct (assuming that the items were dichotomous). This amount (i.e., the change) would depend on (a) the RCAFootnote 7 of the reader and (b) (in some situations) which item changed. As in the general definition, the resolution would then be the minimum of these values.

What follows is a summary characterization of these lower-level parameters in terms of two higher-level parameters, which are illustrated in Fig. 3.2.

Fig. 3.2 Visual metaphor for precision, trueness, and accuracy (four dartboards arranged on axes of precision versus trueness: when both are high, the dots cluster at the center); note that precision is independent of the presence of the bullseye, thus emphasizing that precision can be evaluated even if no reference values are known and the instrument is not calibrated, while trueness and then accuracy cannot

The precision of a measuring instrument (by adapting the VIM: JCGM, 2012: 2.15, modified in reference to ISO, 1994: 3.12) is the closeness of agreement between indication values or measured values obtained by repeated independent measurements on the same or similar objects under specified conditions. If measurement is modeled as affected by errors (see Sect. 3.2.2), precision is inversely related to the random component of errors, i.e., the one that is reduced by increasing the size of the sample of values. Hence, precision is evaluated by means of statistics of dispersion, such as the standard deviation, and, if evaluated on samples of indication values, does not require the instrument to be calibrated. In fact, the precision of a measuring instrument may be effectively estimated by the precision of its results, obtained in test conditions. In psychosocial measurement, precision is usually termed “reliability”, which is evaluated as a degree, from 0 to 1: for an RCA test, precision would typically be estimated by one or more reliability coefficients, including internal consistency reliability (consistency of the results across the items within the test), test–retest reliability (consistency of test results across different administrations), and interrater reliability (for item formats that require human judgments, consistency of different raters, either at the item or the test level).

The trueness of a measuring instrument (by adapting the VIM: JCGM, 2012: 2.14, modified in reference to ISO, 1994: 3.7) is the closeness of agreement between the average value of a large series of measured values, obtained by replicate independent measurements on the same or similar objects under specified conditions, and an accepted reference value. If measurement is modeled as affected by errors, trueness is inversely related to the systematic (i.e., non-random) component of errors, i.e., the one that does not depend on the size of the sample of values. The trueness of a measuring instrument may be effectively evaluated as the inverse of measurement bias (JCGM, 2012: 2.18), with respect to a reference value taken from an available calibrated measurement standard in the process of metrological confirmation of the instrument (JCGM, 2012: 2.44).Footnote 8 Trueness is not commonly used in the human sciences, though it bears some similarity to early definitions of validity, discussed further in Sect. 4.3.

As defined, precision and trueness are complementary features of an instrument: when the property being measured does not change, precision concerns the capability of the instrument to produce values that are close to one another, and trueness concerns its capability to produce values that are on average close to the target reference value.

Of course, a measuring instrument is expected to have both good precision and good trueness. By combining precision and trueness, the highest-level parameter for characterizing the quality of a measuring instrument is obtained:

The accuracy of a measuring instrument (by adapting the VIM: JCGM, 2012: 2.13, modified in reference to ISO, 1994: 3.6) is the closeness of agreement between a measured value and an accepted reference value. If measurement is modeled as affected by errors, accuracy is inversely related to errors. The accuracy of a measuring instrument is evaluated in test conditions, by somehow combining its precision and trueness with respect to a reference value taken from an available calibrated measurement standard in the process of metrological confirmation of the instrument.Footnote 9 An accurate instrument is exactly what we would like to have: an instrument that produces trustworthy values.
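The relation between the three higher-level parameters can be illustrated numerically: precision as a statistic of dispersion, trueness as the inverse of bias, and accuracy as a combination of the two (here, sketched as the root-mean-square error with respect to the reference value, one common way of "somehow combining" them). The measured values and the reference value below are invented for the example.

```python
# Illustrative evaluation of precision, trueness, and accuracy from repeated
# measured values against an accepted reference value. The sample and the
# reference are invented; the statistics mirror the definitions in the text.

import statistics

measured = [9.98, 10.03, 10.01, 9.97, 10.06]  # repeated measured values (assumed)
reference = 10.00                              # accepted reference value (assumed)

mean = statistics.mean(measured)
dispersion = statistics.stdev(measured)  # precision: smaller dispersion = more precise
bias = mean - reference                  # trueness: smaller |bias| = truer

# One way to combine the two components: root-mean-square error with respect
# to the reference, reflecting both random and systematic contributions.
rmse = (sum((v - reference) ** 2 for v in measured) / len(measured)) ** 0.5
```

Note that `dispersion` can be computed without any reference value, whereas `bias` and `rmse` cannot, matching the remark in the caption of Fig. 3.2.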

As is discussed at further length in Sect. 4.3, in the human sciences one encounters the term “validity”. In its earliest usages, which are still common in some areas of the literature, validity refers to closeness of agreement between measured values and a true value, sometimes operationalized in terms of an accepted reference value, as discussed in Sect. 4.3.1. Thus, from this perspective, the early version of validity is similar to accuracy.

Given this sketch of a framework for characterizing the quality of the process of measuring, the issue arises of whether the quality of measurement results is entirely and solely determined by the quality of the measurement that produced them.

3.2.2 The Error Approach (or: True Value Approach)

Appropriate applications of the empirical strategy mentioned in the opening of Sect. 3.2 generally result in improvements of the behavior of measuring instruments and consequently reductions in the variability of their results. Extrapolating from this, one might suppose that, if this improvement process were to continue indefinitely, a single, definite value would actually be obtained: “by analyzing the meaning of the obtained results of measurement, the experimenter ponders on the true value, the value that the best possible instrument would have generated” (translated from Idrac, 1960, emphasis added). Indeed, this hypothesis has a simple statistical basis. To see this, assume that the measurement is repeatable—that is, that (a) the environment and the measuring instrument are sufficiently stable and that (b) the observed variability between interactions is only due to the influence of small, independent causes. Then, the repeated interaction of the instrument with the measurand generates a sample of values whose distribution becomes more and more stable as the number of values increases. More specifically, if the values can be averaged in a meaningful way (hence, this obviously does not apply to nominal and ordinal properties), and s is the sample standard deviation, it is well known that the standard deviation of the distribution of the sample mean (the so-called standard error) is estimated by \(s/\sqrt{n}\), where n is the sample size. By increasing the sample size, i.e., the number of interactions of the measuring instrument with the measurand, the variability of the sample mean converges to zero, showing that the total influence of the aforementioned “small causes” is progressively reduced. Reasonably, then, the mean of a sufficiently large sample estimates “the mean value that the best possible instrument would have generated”.
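The \(s/\sqrt{n}\) behavior can be illustrated with simulated repeated readings. The Gaussian noise model and its parameters below are assumptions for the sake of the sketch: the spread s of individual readings stays roughly constant, while the standard error of the sample mean shrinks as n grows.

```python
# Sketch of the s/sqrt(n) effect: the standard error of the sample mean shrinks
# with sample size, even though the per-reading dispersion s does not. The
# "best instrument" value and the noise level are invented assumptions.

import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible
true_mean, s = 10.0, 0.1  # assumed underlying value and per-reading spread

def standard_error(n):
    """Estimated standard error of the mean of n simulated readings."""
    sample = [random.gauss(true_mean, s) for _ in range(n)]
    return statistics.stdev(sample) / n ** 0.5

# Many more readings -> much smaller standard error of the mean.
se_small, se_large = standard_error(10), standard_error(10_000)
```

As the text goes on to note, this reduction concerns only the random component: a constant calibration bias added to every reading would survive any increase in n.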

However, in order to make this concept of “best possible instrument” operational, a second condition must be fulfilled: an instrument is expected to maintain its calibration over time, so that, together with the calibration information, its indication values are sufficient to obtain appropriate values of the measurand. If the calibration information were not updated, the measurement results produced by the no-longer-calibrated instrument would be systematically biased and, critically, this bias would not be revealed by the repeated application of the instrument itself, as it is unaffected by sample size. In other words, the convergence to a target distribution is necessary but not sufficient to obtain the value that would be generated by “the best possible instrument”.

This is a delicate point: the quality of the information produced by measurement is hindered by causes traditionally treated as belonging to two distinct and independent kinds, called random and systematic respectively. This distinction can be functionally characterized: the observed variability in repeated measurements is treated as being due to random causes, whereas systematic causes generate a bias which remains constant across repeated measurements, whose results are then affected in the same way by such causes. The consequence is that, while the assumption of repeatability could be sufficient to assess the presence of random errors, revealing the presence of systematic errors requires performing the measurement of the same measurand by means of different and independently calibrated instruments. Furthermore, the consideration that the effects of these two kinds of causes manifest themselves in statistical versus non-statistical ways led the authors of the GUM to conclude that they “were to be combined in their own way and were to be reported separately (or when a single number was required, combined in some specified way)” (JCGM, 2008a: E.1.3).

On this basis, a conceptualization was developed that considers the value generated by “the best possible instrument” to be an intrinsic feature of the measurand, traditionally called the true value of the measurand and defined as “the value which characterizes a quantity perfectly defined, in the conditions which exist when that quantity is considered” (ISO, 1984: 1.18). The designation “Error Approach” has this origin: due to its experimental component, measurement is unavoidably affected by errors, understood as the difference between the measured value and the true value. Under the assumption of the unknowability of true values, but with the aim of maintaining the operational applicability of the framework, this sharp characterization has sometimes been weakened by instead considering “conventional true values” (of course, the very concept of conventional truth is questionable, to say the least) (see, e.g., ISO, 1984: 3.10) or “reference values” (see, e.g., JCGM, 2012: 2.16), where a reference value “can be a true quantity value of a measurand, in which case it is unknown, or a conventional quantity value, in which case it is known” (JCGM, 2012: 5.18, Note 1). The philosophical justification of the claim of the very existence of a true value of an empirical property is complex and controversial, and we do not discuss it here further.

The important point here is the acknowledgment that “every measurement is tainted by imperfectly known errors, so that the significance which one can give to the measurement must take account of this uncertainty” (BIPM, 1993: Foreword). While errors generate uncertainty in measurement, nothing in principle precludes the possibility that uncertainty has other causes as well: this suggests that measurement uncertainty is an encompassing concept and justifies the current trend of moving away from the Error Approach and toward the Uncertainty Approach witnessed in both the VIM and the GUM.

3.2.3 The Uncertainty Approach

Like the Error Approach, the Uncertainty Approach can be characterized primarily as a framework that provides functional solutions to implement what we have called an informational strategy to cope with the observed variability of measurement results and secondarily as a conceptualization that can be included as a background justification for the way in which uncertainty is understood and discussed.Footnote 10

The starting point is that, even when the measurement is not repeated, the information available in the context of the measurement may allow the measurer to acknowledge that the obtained results have a limited quality, due in particular to the quality of the measuring instrument and of the available information on the instrument calibration and on the influence properties. Compared to the Error Approach, the focus here is less on the experimental errors themselves and more on the state of partial knowledge of the measurer, who designs and performs the measurement for the explicit purpose of gaining information on the measurand, with the acknowledgment that “complete” information (whatever this may actually mean) cannot be obtained even by the best possible measurement.Footnote 11 As related to measurement results, the concept <measurement uncertainty> emphasizes this incompleteness, and the standardization of the methods for identifying sources of uncertainty and formalizing the quantitative evaluation of their contributions and their combination provides an even more solid common ground for measurement (JCGM, 2008a: 0.3):

A worldwide consensus on the evaluation and expression of uncertainty in measurement would permit the significance of a vast spectrum of measurement results in science, engineering, commerce, industry, and regulation to be readily understood and properly interpreted. In this era of the global marketplace, it is imperative that the method for evaluating and expressing uncertainty be uniform throughout the world so that measurements performed in different countries can be easily compared.

Such an information-oriented standpoint is the basis for a recommendation issued in 1980 by a working group promoted by the International Bureau of Weights and Measures (BIPM) and approved in 1981 by the International Committee of Weights and Measures (CIPM). The traditional classification of kinds of (causes of) errors as random or systematic is replaced here with a distinction about the methods of evaluating measurement uncertainty (JCGM, 2008a: 0.7):

The uncertainty in the result of a measurement generally consists of several components which may be grouped into two categories according to the way in which their numerical value is estimated: A. those which are evaluated by statistical methods, B. those which are evaluated by other means.

This change is the premise for the two key parts of the recommendation.

  • First, uncertainties shall be formalized as standard deviations not only when a statistical sample is available, i.e., for “the components in category A”, but also in all other cases, i.e., for “the components in category B”,Footnote 12 for which the standard deviation is “based on the degree of belief that an event will occur” (JCGM, 2008a: 3.3.5). The list of these components—each then made of a description of the evaluation method and the related standard deviation, called “standard uncertainty” in this contextFootnote 13—is included in the uncertainty budget (JCGM, 2012: 2.33).

  • Second, as a consequence of this formalization, the information provided by all uncertainty components in an uncertainty budget can be synthesized in a single outcome: “The combined uncertainty should be characterized by the numerical value obtained by applying the usual method for the combination of variances. The combined uncertainty and its components should be expressed in the form of standard deviations”. (JCGM, 2008a: 0.7).
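The two parts of the recommendation can be sketched numerically. The following is an illustrative fragment only: the thermometer readings and the Type B resolution figure are invented for the example.

```python
import math
import statistics

# Type A: standard uncertainty from repeated readings, formalized as the
# standard deviation of the mean (hypothetical thermometer readings, deg C).
readings = [36.61, 36.58, 36.63, 36.60, 36.59]
u_a = statistics.stdev(readings) / math.sqrt(len(readings))

# Type B: standard uncertainty from non-statistical knowledge, here a display
# resolution of 0.05 deg C modeled as a rectangular distribution of half-width a.
a = 0.025
u_b = a / math.sqrt(3)

# Combined standard uncertainty: "the usual method for the combination of variances".
u_c = math.sqrt(u_a**2 + u_b**2)

print(round(statistics.mean(readings), 3), round(u_c, 4))
```

The rectangular-distribution device used for the Type B component is one of the standard ways suggested by the GUM (4.3.7) for converting non-statistical knowledge into a standard deviation.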

Of course, such a position might be considered a pragmatic means of solving the problem of separately reporting statistical and non-statistical components of uncertainty, while simply sidestepping the traditional problem of separately identifying random and systematic causes of errors, as maintained for example by Rabinovich (2005: p. 286).

The subject is complex and widely analyzed in the literature on the science and philosophy of measurement, though for the present purposes we need not discuss it further here.

Box 3.1 The logic of error/uncertainty propagation

What traditionally has been called the law of propagation of errors can be exemplified by the measurement of human body temperature by means of a mercury thermometer. Several possible sources of error can be identified in this measurement, including the finite resolution of the instrument (which is not able to discriminate among temperatures closer to one another than a given threshold), the effect of the temperature and the atmospheric pressure of the environment on the instrument (so that the instrument output changes when the temperature under measurement does not change), and the alteration of the temperature under measurement due to the interaction between the body and the instrument. Under the hypotheses that (1) each of these errors, whether its source is of a statistical nature or not, can be formalized as a standard deviation (thus consistently with the CIPM recommendation mentioned above, as then implemented by the GUM), (2) such errors are statistically uncorrelated, and (3) their contribution to the total error is not analytically known, the simplest case of the law of propagation of errors is obtained, which prescribes computing the total error as the square root of the sum of the squares of these standard deviations (i.e., of the variances associated with the errors).

The underlying logic is as follows. For each component Xi, it is assumed that a measured value xi, computed as a sample mean value, and an error, formalized as the standard deviation s(xi) of the mean, are known. The measurand Y is assumed to be a function of the components, Y = f(X1, …, Xn) (in the case of indirect measurement—see Sect. 2.3 and Chap. 7—f could be the function that computes the measurand Y from the input quantities Xi), so that the measured value y of Y is, as usual, computed as y = f(x1, …, xn). Under the supposition that the errors are sufficiently small and that f is differentiable and can be linearly approximated around the n-dimensional point (x1, …, xn), the total error s(y), in turn formalized as a standard deviation, is computed by the first-order approximation of the Taylor series of f, which in the simplest case in which the quantities Xi are not correlated corresponds to

$$s^{2} \left( y \right) = \mathop \sum \limits_{i = 1}^{n} \left( {\left. {\frac{\partial f}{{\partial X_{i} }}} \right|_{{X_{i} = x_{i} }} } \right)^{2} s^{2} \left( {x_{i} } \right)$$

In the case in which f is not known (hypothesis (3) above), all partial derivatives—each modeling the relative weight of the component Xi in the total error—are assumed to be equal to 1, thus leading to the quadratic sum as in the previous example.
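The propagation just described can be sketched in code. This is an illustrative fragment only: the function f and all numbers are invented, and the partial derivatives are approximated by central finite differences.

```python
import math

def propagate(f, x, s, h=1e-6):
    """First-order (uncorrelated) propagation: s2(y) = sum_i (df/dXi)^2 s2(xi),
    with each partial derivative approximated by a central difference at x."""
    y = f(x)
    var = 0.0
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        dfdx = (f(xp) - f(xm)) / (2 * h)
        var += dfdx**2 * s[i]**2
    return y, math.sqrt(var)

# Hypothetical indirect measurement: the area of a rectangle from two lengths.
area = lambda x: x[0] * x[1]
y, s_y = propagate(area, x=[2.0, 3.0], s=[0.01, 0.02])
print(y, s_y)  # the partials are x2 = 3.0 and x1 = 2.0
```

When f is not known (hypothesis (3)), setting every derivative to 1 reduces this to the plain quadratic sum of the standard deviations.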

By reinterpreting the errors s(xi) as standard uncertainties, the GUM has taken this traditional result and assumed that it is systematically applicable also to uncertainties evaluated by non-statistical (i.e., “Type B”) methods (for an expanded explanation, see the GUM itself, JCGM, 2008a, or, in particular, Lira, 2002; Rossi, 2014).

3.2.4 Basic Components of Measurement Uncertainty

The structure of the measurement process introduced in Sect. 2.2.4 is reflected in a classification of the components of measurement uncertainty.Footnote 14 By reinterpreting the abstract structure in Fig. 2.8, still in reference to quantities with unit for the sake of simplicity, some basic components can be identified, as depicted in Fig. 3.3.

Fig. 3.3

Basic components of measurement uncertainty as related to the abstract structure of measurement (in the case of quantities)

  • Measurand definitional uncertainty. Measurement is an evaluation of a property of an object, and therefore of an individual property, which is an instance of a general property. Hence, in order to design a measurement process some knowledge of the general property is usually presupposed, and on its basis, a model is adopted for the measurand, which in principle should be defined to guarantee that the obtained Basic Evaluation Equation

    $$measurand = measured\,value\,of\,a\,property$$

    conveys meaningful information. This is by no means a trivial assumption: in defining such a model, it may be admitted that the measurand is not completely identified and characterized, thus acknowledging the presence of a non-null measurand definitional uncertainty, “resulting from the finite amount of detail in the definition of a measurand” (JCGM, 2012: 2.27) and therefore being “the practical minimum measurement uncertainty achievable in any measurement of a given measurand” (JCGM, 2012: 2.27 Note 1).Footnote 15 A classic example of definitional uncertainty from the physical sciences comes from the measurement of temperature: by defining the measurand as the temperature of a given body, the body itself is (implicitly) modeled as thermally homogeneous, and actual differences of temperature between parts of the body contribute to definitional uncertainty (see also Box 2.3). The human sciences, arguably, regularly contend with many forms of definitional uncertainty. In the case of reading comprehension ability, for example, (a) there might be a lack of clarity regarding the definition of the object that bears the property of reading comprehension ability, insofar as it could be (and, historically, most commonly is) a single human being in isolation, or, in the context of contemporary sociocultural theories (e.g., Mislevy, 2018), it might be a group of people or a single person with a given set of sociocultural resources, which could have practical implications for issues such as whether the examinee should be allowed access to the Internet as they read and attempt to comprehend a text; (b) there might be a lack of clarity regarding the definition of reading comprehension ability in general (e.g., involving questions such as: does it include reading speed? are orthographic fluency and morphemic awareness subcomponents of reading comprehension ability, or causes of it? does reading comprehension ability include metacognitive abilities like critical analysis of textual information, or just “direct” comprehension? and so forth); (c) there might be a lack of clarity regarding how the general property of reading comprehension ability is instantiated in a given human being, if, for example, an individual is blind and uses a text-to-speech program (is this still considered “reading”?), and so forth. We return to a discussion of special issues faced in the human sciences in Chap. 8. The quantification of measurand definitional uncertainty is an issue that depends on the specific context: we are not aware of any generally applicable technique on this matter.

  • Unit definitional uncertainty. Measurement is performed as a (direct or indirect) comparison of the quantity of an object with a unit (the more general case of non-quantitative properties is discussed in Chap. 6). Complementary to measurand definitional uncertainty, the quantity unit also needs to be defined before it can enter a comparison process, and nothing in principle prevents such a definition from involving unit definitional uncertainty. In the 2019 edition of the International System of Units (SI), the definitional strategy adopted is to deduce the definitions of the units from the numerical values assumed for some constant quantities, where “the numerical values of the seven defining constants have no uncertainty” (BIPM, 2019: p. 128; see also Sect. 6.3.4): in this case, unit definitional uncertainty is zero.

  • Calibration uncertainty. The relation between the unit, as defined, and the value selected for the measurand is guaranteed by the calibration of the measuring instrument, which generally cannot be assumed to be perfectly stable, resulting in a non-null calibration uncertainty. For example, in measuring the temperature of a body the thermometer could have been calibrated long before the measurement, and in the period since the calibration its structure might have changed, so that the obtained temperature no longer corresponds to the correct one. In the case of reading comprehension ability, if different texts were being used for different students, and these texts had different reading difficulties, then any process that ignored these differences would introduce calibration uncertainty.

  • Instrumental uncertainty. Again in reference to the Basic Evaluation Equation

    $$measurand = measured\,value\,of\,a\,property$$

    referring the obtained value to the measurand is justified by the quality of the measuring instrument, which is expected to generate an output that is stable in the case of repeated interactions with the object under measurement and specifically depends on the measurand and not on other properties, i.e., the influence properties. The fact that in this sense the measuring instrument has a limited quality is acknowledged in terms of a non-null instrumental uncertainty (JCGM, 2012: 4.24), which is inversely related to the instrument’s accuracy (see Sect. 3.2.1). For example, the thermometer used to measure the temperature of a body could be sensitive also to the temperature of the environment and therefore could produce an indication affected by instrumental uncertainty due to its dependence on properties other than the measurand. In the case of reading comprehension ability, if different students were asked questions by different judges, and these judges expressed the same questions in different ways, this would be an example of instrumental uncertainty.

  • Interaction uncertainty. Finally, the interaction between the object under measurement and the measuring instrument can alter the state of the object itself. This may occur when acquiring information on physical properties—the so-called loading effect—and it is even more usual for psychosocial properties, as for example in most cases of interviews, in which a respondent may be prompted by interview questions to consider issues in a new way. This is acknowledged in terms of a non-null interaction uncertainty. In the case of temperature measurement, a sufficiently small body might change its temperature due to its interaction with an initially colder or warmer thermometer, thus corresponding to an interaction uncertainty. In the case of educational testing, examinees who are asked to respond to a given set of test questions arranged from most to least difficult might conceivably perform worse, on average, than examinees asked to respond to the exact same set of questions arranged from least to most difficult, if their confidence is affected by their experience with the first few questions. Another well-known example in human science measurement relates to “stereotype threat”, where people from different sociocultural groups, who may have different assumptions regarding the overall likelihood of success of individuals from their own group on the instrument, tend to respond in ways that are sensitive to those beliefs, especially if their identity as members of the relevant group is made psychologically salient (see, e.g., Steele & Aronson, 1995).

While this classification offers a rich, multidimensional perspective on measurement uncertainty, the aim of providing an overall indication of the quality of the information produced by the measurement requires such components eventually to be combined.

3.2.5 Measurement Uncertainty and Measurement Results

As construed in the Uncertainty Approach and specified by the GUM, measurement uncertainty is a quantity associated with measurement results and inversely related to the quality of the information they convey: the greater the uncertainty, the lower the quality.Footnote 16 There is an open debate about what, specifically, is uncertain when measurement uncertainty is stated (the measured value? the measurement result? the estimate of the true value of the measurand? the trustworthiness of the acquired information? …: see, e.g., the mention in JCGM, 2008a: 2.2.4), but the general agreement seems to be that measurement uncertainty is an encompassing entity aimed at summarizing the quality (and quantity) of information acquired through the measurement. The components discussed above summarize the quality-related aspects of a measurement system and, independently of the way they are evaluated, by either statistical or non-statistical methods, they can be in turn summarized into a single, overall combined standard measurement uncertainty (JCGM, 2012: 2.31).

The model proposed by the GUM on this matter can be first considered as a black box. Quoting again the BIPM/CIPM recommendation of 1980, “the combined uncertainty and its components should be expressed in the form of standard deviations” (JCGM, 2008a: 0.7): from a list of standard deviations, one for each identified component, a standard deviation must be computed as the result. There is nothing new in this problem, and in fact the recommendation states that “the combined uncertainty should be characterized by the numerical value obtained by applying the usual method for the combination of variances” (JCGM, 2008a: 0.7). This reinterprets, in the context of the Uncertainty Approach, what is traditionally called the “law of error propagation” (see, e.g., Bevington, 1969: p. 58; see also Box 3.1), which is based on a partial sum of the Taylor series expansion, about the measured values, of the function by which a value of the measurand is computed, usually truncated at the first-order terms under the hypothesis that the function is sufficiently linear around those values.

The conclusion reached in Chap. 2 about how to report the information obtained by a measurement may be revised accordingly and then written

$$\begin{aligned} measurand &= \left( {measured\,value\,of\,a\,property,\, } \right.\\ & \quad \left. {combined\,measurement\,uncertainty} \right) \end{aligned}$$

where if more analytical information were required, the whole uncertainty budget could also be reported.

This relation (or at least its right-hand side term) is to be considered the measurement result, contrary to the tradition still witnessed in the definition of <measurement result> given in the second edition of the VIM: “value attributed to a measurand, obtained by measurement” (BIPM, 1993: 3.1). In other words, from this perspective the measurement uncertainty is assumed to be a constitutive component of the measurement result, and not just an additional, complementary entity. Indeed, “when a measurement result of a quantity is reported, the estimated value of the measurand […] and the uncertainty associated with that value, are necessary” (BIPM, 2019: p. 127). In the clear words of Lira (2002: p. 43),

we will […] refrain from referring to the “uncertainty of a measurement result”. There are two reasons for this. First, a measurement result should be understood as comprising both the estimated value of the measurand and its associated uncertainty. Second, once the estimated value is obtained, there is nothing uncertain about it. […] Hence, expressions such as the uncertainty in knowing the measurand or the uncertainty associated with an estimated value are more appropriate, even if longer sentences result.

This paves the way for extending the very concept of measurement uncertainty to the evaluation of quantities for which the expected value and the standard deviation of the underlying distribution are not sufficiently representative. An example would be where the distribution is strongly asymmetric. More generally, this could encompass ordinal or nominal properties (Mari et al., 2020), for which standard deviations are not meaningful. One solution is to acknowledge that entire probability distributions of values could be reported to convey the available information on each uncertainty component and then on the measurand, as in (Footnote 17)

$$measurand = distribution\,of\,values\,of\,a\,property$$

possibly according to the modeling mentioned in Box 3.2. Attributing to the measurand a single value or a distribution of values may in fact be considered the two extreme options, where other strategies are possible for reporting the information acquired by the measurement, so as to convey more information than a single valueFootnote 18 but less information than an entire distribution. In particular, a measurement result could be reported as a subset of values (usually an interval of values, in the case that the measurand is a quantity), where for discrete cases the greater the number of the values in the subset, the greater the measurement uncertainty, or as a subset of values and a confidence level, i.e., the probability attributed to the subset. This multiplicity of strategies also reflects the variety of tasks involving measurement results: while single values are the usual choice for uncertainty propagation and computing functions in indirect measurement, and of course for daily, non-scientific uses, intervals of values may be more suitable in decision-making applications, for example, conformity assessment or when investigating the compatibility of two measurement results.

Box 3.2 Another perspective on (un)certainty

Given that, particularly in the context of Bayesian interpretations, probability is considered to be the logic of uncertainty,Footnote 19 one could wonder why the default quantitative model of measurement uncertainty is standard deviation, instead of probability itself. The difference is not trivial: while a probability is a pure number, a standard deviation has the same dimension as the measurand.

There is, first, a plausible historical reason for this: measurement uncertainty is the offspring of measurement error, which is indeed the difference from the true value, with which it then shares its dimension. But another reason seems no less important. Reporting a measurement result in terms of a single measured value, together with a standard uncertainty whenever appropriate, is a convenient choice given that most mathematical models (e.g., physical laws) are designed to operate on values, not on subsets/intervals of values or more complex entities like probability distributions over values (the numerical propagation of distributions through analytical functions has only recently become feasible thanks to the availability of efficient computational tools, as presented in JCGM, 2008b). But if the information on the value of the measurand is reported as a single value, what remains for providing information about the quality of the result is some index of the expected dispersion of the measured value, thus under the questionable assumption—already discussed in Footnote 13—that dispersion and uncertainty are basically the same concept.

However, by relaxing the condition of a single measured value, a more general and expressive modeling framework can be adopted, as follows. Let us suppose that the information empirically acquired on the measurand is summarized by means of a probability mass or density function. From such a function, several coverage subsets/intervals can be obtained, each of them with an associated coverage probability (as a well-known example, for a Gaussian function a coverage interval centered on the expected value and whose half-width is two standard deviations corresponds to a coverage probability of about 0.95). The quality of a measurement result then has two dimensions: the width of the coverage interval is inversely related to the specificity of the reported information, while the coverage probability expresses the certainty attributed to the interval. As is reasonable, once the information on the measurand has been empirically acquired, and therefore once the underlying probability distribution is chosen, reporting more specific information makes it less certain, and vice versa. This provides us with some added flexibility in reporting measurement results, by balancing their specificity and certainty: for example, if the length of a rod is reported in centimetres, the result will be more certain than if it is reported in millimetres.
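The Gaussian case mentioned above can be checked with a few lines of code; this is an illustrative sketch, using the error function to compute the coverage probability of an interval of half-width k standard deviations.

```python
import math

def gaussian_coverage(k):
    """Coverage probability of the interval mean +/- k standard deviations
    under a Gaussian distribution: P(|Z| <= k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

# Specificity vs. certainty: a narrower interval (smaller k) is more specific
# but less certain; a wider interval is less specific but more certain.
for k in (1, 2, 3):
    print(k, round(gaussian_coverage(k), 4))
```

At k = 2 this reproduces the coverage probability of about 0.95 cited in the text.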

Of course, there is a sharp difference between this concept of certainty and what is today commonly considered measurement uncertainty: in this framework, certainty is a probability, thus ranging from 0 (deemed to be impossible) to 1 (deemed to be certain).

3.3 The Operational Context

Measurement is a process designed and performed in a context that is in fact structurally more complex than the one introduced in Chap. 2 and depicted in Fig. 2.8, for at least the following reasons:

  • the quantity unit is defined independently of the specific measurement problem and is made available through a metrological system;

  • the comparison between the measurand and the unit, and therefore the obtained measured value, is generally affected by other properties, which reveal the presence of a measurement environment.

Through the consideration of these contextual elements, as depicted in Fig. 3.4, let us switch from an abstract and conceptual interpretation of measurement to one that is more concrete and operational. We discuss here the case of quantities and defer the treatment of non-quantitative properties and their values to Chap. 6.

Fig. 3.4

Broad context of measurement (in the case of quantities)

3.3.1 The Metrological System

Measurement is a process that enables the quantity-related comparison of objects through delegation: for any two objects a and b both having a general quantity Q (say, length or reading comprehension ability), the information that a and b are empirically indistinguishable with respect to Q, Q[a] ≈ Q[b], can be obtained not only through their direct comparison (e.g., by the comparison of the extreme points of two rods, possibly mediated by a third rod, to evaluate their lengths, or by the synchronous comparison of two individual readers by a judge, to evaluate their reading comprehension abilities) but also by means of the independent measurement of the two quantities and the comparison of the obtained values. If the measured values of the lengths of two rods are the same, then the two rods are inferred to have the same length; if the measured values of the reading comprehension abilities of two individuals are the same, then the two individuals are inferred to have the same reading comprehension ability.

Hence, through measurement, values operate as mediators for the comparison of quantities of objects. The meaning of these equalities is that the chosen unit qref is a quantity of the same kind as Q[a] and Q[b], and Q[a] and Q[b] have the same relation with qref, in the sense that if Q[a] = n1 qref and Q[b] = n2 qref then n1 = n2. In principle, this requires the unit qref to be accessible for its comparison with the measurands, Q[a], Q[b], …, even if the measurements are performed in different places and times: the widespread availability of the unit then needs to be somehow guaranteed.

In some cases, the only practical solution is to produce and disseminate multiple objects that realize the definition of the unit. In the tradition of physical measurement, the organizational and technical structure aimed at guaranteeing this is called a metrological system. Whenever it is possible to infer the information on the comparison to the unit from the comparison with the quantity realized by a replicated object, a measurement result is said to be metrologically traceable (JCGM, 2012: 2.41) to the unit. Hence, in order for one to be able to make the inference that Q[a] ≈ Q[b] from Q[a] = n qref and Q[b] = n qref even if the two measurements were performed in different places and times, the metrological traceability of the two results to the same unit must be guaranteed by an effective metrological system.

The quality of metrological systems is traditionally maximized through a structural strategy of hierarchical delegation: the definition of the unit is first realized in a primary measurement standard (JCGM, 2012: 5.4), which is then replicated in some secondary standards that are disseminated, which in turn are replicated and disseminated, and so on, thus generating traceability chains (JCGM, 2012: 2.42) of standards, under the responsibility of National Metrology Institutes and accredited calibration laboratories. Mari and Sartori (2007) show that under given conditions, this strategy is both efficient and effective. Indeed, a metrological system, as a network of measurement standards and measuring instruments connected through calibrations,

  • is connected by a relatively small number of calibrations, each corresponding to an edge of the network, and therefore the system is efficient because its global costs are relatively small,

  • and the average length of the traceability chains (corresponding to the average shortest path length between nodes) is also small, and therefore, the system is effective because each calibration reduces the quality of the provided information, so that minimizing the length of a traceability chain maximizes the quality of the measurement results.

Hence, metrological systems reproduce the structure of small-world networks, which at the same time are connected but guarantee a small number of degrees of separation (Watts & Strogatz, 1998).
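As an illustrative sketch (the branching factor and depth are invented for the example), the efficiency/effectiveness trade-off can be made concrete: a calibration hierarchy with N nodes is connected by only N − 1 calibrations, while the traceability chain from any instrument back to the primary standard is only as long as the depth of the tree.

```python
def hierarchy_stats(branching, depth):
    """A b-ary calibration tree: the primary standard at the root, secondary
    standards in the intermediate levels, measuring instruments at the leaves."""
    nodes = sum(branching**d for d in range(depth + 1))
    edges = nodes - 1        # one calibration per non-root node
    leaves = branching**depth
    chain = depth            # traceability-chain length from a leaf to the root
    return nodes, edges, leaves, chain

nodes, edges, leaves, chain = hierarchy_stats(branching=10, depth=3)
print(nodes, edges, leaves, chain)
```

With branching 10 and depth 3, a thousand instruments are served by just over a thousand calibrations, and every traceability chain has length three: few edges (efficiency) and short chains (effectiveness), as in a small-world network.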

In the case of reading comprehension ability (like most other psychosocial properties), there are no direct reference objects, and therefore, such a system of reference properties and calibrated standards does not directly apply. However, responses to test questions are informational, and hence, if they are seen as stand-in reference objects, then a similar system can be constructed, using comparisons between the scored responses to the questions in the original sample and a second sample of responses (Wilson, 2013). Just as in the discussion above, the length of the traceability chain can be kept relatively short by always using the original sample to calibrate new sample sets. A logic built in this way can be seen as functionally analogous to a metrological system (Maul et al., 2019).

3.3.2 The Measurement Environment

As presented in Chap. 2, a measurement result reported as the Basic Evaluation Equation

$$Q\left[ a \right] = x\,{\text{q}}_{{{\text{ref}}}}$$

(again neglecting measurement uncertainty) assumes that the (direct or indirect) comparison between the measurand Q[a] and the unit qref depends only on these two quantities. However, in real-world contexts the comparison may be affected by influence properties, of the object under measurement or the environment, including the measuring instrument itself: if an influence property changes, the measurement result could also change even if the measurand remains the same.

For example, returning to an example discussed above, the length of a rigid body could be measured by a caliper whose structure is sensitive to the environmental temperature in such a way that different indications are obtained for the same measurand when the temperature changes: in this case, then, environmental temperature is an influence property. In the case of the measurement of reading comprehension ability, an influence property might be the specific content of the passages a person reads: a person with strong prior knowledge of the content may be more likely to successfully answer related questions than someone less familiar with the content, even though they may have the same reading comprehension ability. This shows that the information reported by a Basic Evaluation Equation is in fact affected by the context in which the measurement is performed.
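An influence property can be thought of as a term in the instrument's transduction function. The following is an illustrative sketch only (the caliper model and all numbers are invented): the indication changes with the environmental temperature even though the measurand does not.

```python
# Hypothetical influence-property model: a caliper whose frame expands with
# environmental temperature T, so its indication changes even though the
# measurand (the rod's length) remains the same.
ALPHA = 1.2e-5   # assumed thermal expansion coefficient of the frame, 1/degC
T_REF = 20.0     # reference temperature at which the caliper reads correctly

def indication(true_length_mm, temperature_c):
    # The expanded scale under-reads when the temperature exceeds T_REF.
    return true_length_mm / (1 + ALPHA * (temperature_c - T_REF))

same_rod = 100.0  # mm: the measurand, unchanged in both readings
r20 = indication(same_rod, 20.0)
r35 = indication(same_rod, 35.0)
print(r20, r35)
```

Correcting the indication for the measured environmental temperature, or controlling that temperature, corresponds to the two strategies discussed next in the text.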

There are two complementary structural strategies for taking into account the presence of the influence properties.

One strategy aims at making the measurement less and less sensitive to influence properties, by improving the measuring instrument and the overall measurement procedure. In the case of length measurement, the process could be performed in a carefully temperature-controlled environment. In the case of a test of reading comprehension ability, an example of an attempt to reduce the effects of an influence property would be to deliberately write passages about a topic known to be equally unfamiliar to all examinees. The experimental abilities of measurers have to do with designing and implementing strategies for this purpose.

The other strategy exploits the fact that the measurand is the property intended to be measured, and that it must therefore be defined, explicitly or implicitly. As exemplified in Box 2.3, the measurand could then be defined by including the specification of some environmental conditions in the definition itself, thus changing the role of the corresponding properties from influence properties to components of the measurand, or vice versa by broadening the definition so as to make it explicitly indifferent to some influence properties. Chapter 7 includes a discussion of the complex subject of measurand definition. For example, it might be specified that the length of a rigid body is considered at a given temperature, or that reading comprehension ability simply pertains to one’s ability to comprehend a given set of texts regardless of whether this comprehension is based solely on semantic processing of the texts or is also aided by prior knowledge of the content of the texts, in which case prior content familiarity is seen as a component of reading comprehension ability rather than an influence property.

3.4 The Conceptual Context

For performing a measurement, in addition to the operational conditions discussed in the previous section, some conceptual conditions also need to be fulfilled: an existing property of an object has to be identified as the property intended to be measured, leading to a model of the measurand, and a suitable process of property evaluation has to be designed, leading to a model of the measurement. This last section of the chapter is devoted to a preliminary discussion of these two aspects of the measurement problem, thus setting the strategic context for the more careful and extensive analysis developed in the chapters that follow.

3.4.1 Measurement and Property Identification

Performing a measurement presupposes that something is there to be measured: as previously mentioned, this is a property of an object. In most cases of mature measurement practice, both the general property and the object are well known and clearly identified before and independently of the measurement. Given the well-developed status of physics, this is the usual case for measurement of physical properties: it is physics itself that guarantees that general properties such as length, mass, and time duration are sufficiently well defined, and in fact interdefined, in a network of general properties (and more specifically quantities) connected by equations globally known as the International System of Quantities (ISQ) (see ISO, 2009: 3.6 and JCGM, 2012: 1.6). In this context, a measurement problem starts from a previously defined general property and only requires that one identifies the individual property intended to be measured as an instance of that general property.[Footnote 20] Of course, as physicists discover new properties and seek to measure them, it may be that at least initially these assumptions cannot be met.

But things are not always so simple. In the case of physical quantities, interesting examples have been studied of situations in which measurements were instrumental in the very identification/definition of the general property, a well-known case being temperature, as analyzed in particular by Hasok Chang (2007). In these cases, the neat sequential procedure—from the assumption of an already-defined general property and a pre-existing measuring instrument, to the identification of an instance of that property as the measurand and then the design and operation of a measuring instrument—becomes a complex loop in which the distinctions between the activities of defining the property, constructing the measurement system, and performing the measurement are blurred, and one might operate by measuring without a clear idea of what one is measuring. It may happen, as in the words of Thomas S. Kuhn, that “many of the early experiments involving [a new instrument] read like investigations of that new instrument rather than like investigations with it” (1961: p. 188).[Footnote 21]

In the context of the human sciences, which currently lack anything like an ISQ, this situation of general property definition intertwined with measurement is not unusual. New variables may be readily obtained via computation, and without a system such as the ISQ to establish that properties are well defined, such variables are not necessarily the formal counterpart of empirical properties. It is indeed not hard to provide examples of variables, such as the “hage” of a person obtained as the product of her height and age (Ellis, 1968: p. 31), which can be computed very accurately, fulfilling all expected requirements about uncertainty propagation and so on, and which nevertheless do not seem to correspond to any property of individual humans. Less trivially, this problem becomes critical in the context of complex concepts such as the social status of an individual, the quality of the research performed by a team, the performance of a company, or the wealth of a nation, all of which are sometimes claimed to be measurable by computing given mathematical functions from empirically collected data. In these cases, one can interpret measurement as a tool not only for the acquisition of information on the measurand, but also, and even before that, for gaining knowledge about the general property under consideration.
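The point about “hage” can be made concrete with a short sketch: a computed variable can satisfy all the formal requirements of uncertainty propagation (here, the standard formula for a product of independent inputs) while still corresponding to no empirical property. The function names and the numerical values below are our own, chosen purely for illustration.

```python
import math

def hage(height_m: float, age_yr: float) -> float:
    """Ellis's 'hage': the product of a person's height and age."""
    return height_m * age_yr

def hage_uncertainty(h: float, u_h: float, a: float, u_a: float) -> float:
    """Standard uncertainty of a product of independent inputs:
    u(hage) = hage * sqrt((u_h/h)^2 + (u_a/a)^2)."""
    return hage(h, a) * math.sqrt((u_h / h) ** 2 + (u_a / a) ** 2)

# Invented illustrative values: height 1.75 m +/- 0.01 m, age 30 yr +/- 0.5 yr.
value = hage(1.75, 30.0)                      # 52.5 (in the odd unit m*yr)
u = hage_uncertainty(1.75, 0.01, 30.0, 0.5)   # about 0.93 m*yr
```

The computation is impeccable as arithmetic; the question the text raises is whether its output is information about any property at all.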

The case of reading comprehension ability is interesting in this respect, given how it has changed historically. In the nineteenth century, one procedure for checking whether students could read was that they were asked to read a text aloud and often also asked to recite parts of the text from memory (Matthews, 1996; Smith, 1986). Thus, at that point in time, the (implicit) property definition was something like “ability to accurately read a text out loud”. With time, it was realized that students could succeed at such tasks without understanding the content of the text passage. This led to the advent of silent reading tests, where reading comprehension ability was assessed through the interaction of the reader with comprehension questions (Pearson, 2000), generically called “items”. Thus, at this later point, the property definition (still implicitly) changed to something like “ability to demonstrate understanding of the content of the text (and the question)”. An early test of this sort was developed by Frederick Kelly (1916); an example question from it is shown in Fig. 1.1. The questions were chosen to be likely to generate incontrovertibly either correct or incorrect responses. The indication of a student’s reading comprehension was then the number, or percentage, of the test items the student answered correctly.
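The scoring rule described here can be sketched in a few lines: the indication is simply the percentage of items answered correctly. The function name and the answer keys are hypothetical, not taken from Kelly’s test.

```python
def percent_correct(responses, key):
    """Score a test as the percentage of items answered correctly,
    the indication used in early silent reading tests."""
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    n_correct = sum(r == k for r, k in zip(responses, key))
    return 100.0 * n_correct / len(key)

# Hypothetical 10-item test on which a student answers 7 items correctly.
key       = ["a", "c", "b", "d", "a", "b", "c", "d", "a", "b"]
responses = ["a", "c", "b", "d", "a", "b", "c", "a", "c", "d"]
score = percent_correct(responses, key)   # 70.0
```

Note that this arithmetic presupposes the dichotomous coding the text mentions: each item is incontrovertibly either correct or incorrect.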

3.4.2 Measurement and Measure

In the framework of necessary conditions for measurement introduced in Sects. 2.2 and 2.3—measurement as an empirical and informational process, designed on purpose, whose input is an empirical property of an object and that produces information in the form of values of that property—a further specification is appropriate here, about the very distinction between measurement and measure.

The ancient Greek verb for <to measure> has the root “metr-” (μετρ-), from which the term “metrology” derives, highlighting the relation of the concepts <to measure> and <measurement>. However, the relation between the concepts designated by the nouns “measure” and “measurement” is more delicate.

The Euclidean tradition has been described as “the earliest contribution to the philosophy of measurement available in the historical record” (Michell, 2005: p. 288),[Footnote 22] as witnessed by the oft-quoted first definition of Book 5 of Euclid’s Elements: “A magnitude is a part of a[nother] magnitude, the lesser of the greater, when it measures the greater” (Euclid, 2008: V.1, emphasis added). Hence, the hypothesis that the noun “measurement” and the verb “to measure” refer to the same entity appears plausible, so that a theory of measurement and a theory of measure would be more or less the same thing, or at least they would be inherently related. This position is further evidenced, for example, in the claim that “to understand measurement theory it is necessary to revisit the theory of integration and, in particular, Lebesgue measure theory” (Sawyer et al., 2013: p. 90).

However, suspicions about the equivalence of <measurement> and <measure> might arise from a sufficiently careful reading of Euclid’s work itself, which is not really about measurement as we intend it. For example, in the introduction to an English translation of the Elements one can read that “in the geometrical constructions employed in the Elements […] empirical proofs by means of measurement are strictly forbidden” (Euclid, 2008: introductory notes, emphasis added). Let us indeed compare the above-mentioned definition by Euclid with the following one, again from the Elements but now taken from Book 7: “a number is part of a(nother) number, the lesser of the greater, when it measures the greater” (VII.3, emphasis added). While the two quoted sentences refer to different entities (magnitudes, μεγέθη, and numbers, ἀριθμοὶ), in both one entity is said to measure (καταμετρῇ) the other. Hence, as derived from the Euclidean tradition, “to measure” does not necessarily have an empirical meaning, and in fact the Euclidean <to measure> is coextensive with <to be (an integer) part of> (Mari et al., 2017). Consistently with this position, then, a “measure of a number is any number that divides it, without leaving a remainder. So, 2 is a measure of 4, of 8, or of any even number; and 3 is a measure of 6, or of 9, or of 12, etc.” (Hutton, 1795). The conclusion is that “measure” and “to measure” have (at least) two distinct meanings: one is empirical, and is indeed related to measurement, and the other is mathematical; this twofoldness, already recognized by Bunge (1973), has often been confused.[Footnote 23]

Perhaps unsurprisingly, on this conceptual basis a “measure theory” has developed, where “a measure is an extended real valued, non-negative, and countably additive set function μ, defined on a ring R, and such that μ(0) = 0” (Halmos, 1950: p. 30): that is, it is a mathematical entity. Whether and how measure theory is related to a theory of measurement, and more generally to measurement science, is an issue that we discuss in Chap. 6, in the section about the measurability of non-quantitative, and specifically non-additive, properties. But it should be clear now that “measure” is not just a synonym of “measurement” and, most importantly, that “quantification” is not just a synonym of “measurement”: not every quantification is a measurement, and it could even be accepted that non-quantitative properties may also be measured. Thus, one can see the wisdom in the VIM’s avoidance of any use of the noun “measure” (except in the idiomatic term “material measure”) to reduce ambiguity[Footnote 24] and its adoption of “measurement result” to designate the outcome of the process. This is the lexical choice that we make here too.

The operational and conceptual issues discussed in this chapter provide a basis for our analysis of philosophical perspectives on measurement, to which the next chapter is devoted.