7.1 Introduction

Despite—or perhaps, to at least some extent, because of—the ubiquity of measurement-related concepts and discourse, there remains a remarkable lack of shared understanding of these concepts across (and often within) different fields, perhaps most visibly reflected in the vast array of distinct proposed definitions of <measurement> itself, as discussed in Sect. 4.2. It would seem, then, that the clarification of foundational measurement concepts should (continue to) be a high priority, not only in terms of the definition of <measurement>, but also of the identification of those features of measurement that justify its epistemic authority, i.e., its commonly-afforded degree of public trust and social prestige: the claim that “measurement is often considered a hallmark of the scientific enterprise and a privileged source of knowledge relative to qualitative modes of inquiry”, in Eran Tal’s words (2020). Justification of the epistemic authority of measurement and its results, in turn, depends on identifying those features of the measurement process that ensure (or, at least, confer high likelihood upon) the quality of its results. We argue that these features are independent of the domain of application, and thus in principle apply equally to the measurement of physical and psychosocial properties; as such, this topic is a key component of our endeavor toward a conceptual framework of measurement across the sciences.

As described in Chap. 4, since the second half of the twentieth century, scholarly treatment of the foundational aspects of measurement has largely focused on mathematical criteria rather than the concrete realization of the process, as exemplified by claims such as that “we are not interested in a measuring apparatus and in the interaction between the apparatus and the objects being measured. Rather, we attempt to describe how to put measurement on a firm, well-defined foundation” (Roberts, 1979: p. 3) and “the theory of measurement is difficult enough without bringing in the theory of making measurements” (Kyburg, 1984: p. 7). This emphasis on formal characterizations of measurement may be in part explained by the expansion of measurement into new domains and the related need to abandon characterizations and requirements that were needlessly tied to specific areas. In particular, it is clear that the measurement of non-physical properties cannot conform to expectations based on the traditional use of measuring instruments operating on the basis of transduction effects implemented by physical sensors. As a consequence, interpretations of measurement have become so abstract that they may be unable to provide a convincing and useful demarcation of measurement from formally similar processes that are generally thought to lack epistemic authority, such as most statements of subjective judgments and opinions.

Our position on this matter is pragmatic: there is a social interest in sharing a scientific and technical concept system across disciplines,Footnote 1 particularly in the case of an infrastructural activity like measurement, and there is a social acknowledgment of the epistemic authority of measurement, which has critical consequences in particular in terms of public trust attributed to the outcomes of putative measurement processes and the resources devoted to such processes. If the claim that a given evaluation process is a measurement could be invoked at will, without understanding or concern for what has historically made it a valued practice, measurement itself would become simply a rhetorical device, risking the discredit of its practice in general.

In Sect. 4.4.2 we presented our claim that measurement is most appropriately characterized by empirical rather than mathematical conditions, as grounded on a model-dependent realism, introduced in Sect. 4.5 and developed in Chaps. 5 and 6, about the objects of measurement, i.e., properties. We develop that claim here by proposing that measurement is a process characterized by its structure, not only by the specification of the relationship connecting its inputs to its outputs: what is required for justifying the dependability of measurement results is the specification of how the process does what it does. Whereas an input–output relationship relies solely on a black box model, a structural model involves identification of the invariant aspects of the empirical process, and therefore looks “inside the black box”. And this, we argue, is what provides justification for the claim that measurement results are publicly trustworthy. As a corollary, any purely black box model cannot adequately account for all the relevant features of measurement, and thus is not sufficient for the purpose of understanding the quality of measurement results.

As already highlighted, the conditions presented in Chap. 2—i.e., that measurement (i) is both an empirical and an informational process (ii) designed on purpose, (iii) whose input is an empirical property of an object, and that (iv) produces information in the form of values of that property—are necessary but not sufficient for a process to be considered a measurement. We propose here that the missing sufficient conditions are provided by a structural model of the process of measuring. As a corollary, such an integrated set of necessary and sufficient conditions provides a characterization of measurability: a property is measurable if and only if there exists a property-related process fulfilling these conditions.

From this perspective, the analysis of the structure of a measurement process plays a crucial role: in the metrological tradition, the general description of the structure of such a process is provided by a so-called measurement method, defined as a “generic description of a logical organization of operations used in a measurement” by the International Vocabulary of Metrology (VIM) (JCGM, 2012: 2.5). A key related distinction is between direct and indirect (methods of) measurement, first introduced in Sect. 7.2.3. To this we first devote our attention here, using the model proposed by Giordani and Mari (2019) as a starting point, which we develop to encompass the scenarios that arise when the model is applied across the sciences.

7.2 Direct and Indirect Measurement

Though it contradicts what is currently specified by the VIM, which defines <measurement> to be an experimental process (JCGM, 2012: 2.1), and also our own presentation of this as a necessary condition (see Sect. 2.2.1), the idea that measurement is not necessarily empirical is not new. In his seminal 1920 book, Norman Campbell defined <measurement> as “the process of assigning numbers to represent qualities” (1920: p. 267). A linguistic detail is again revealing: Campbell wrote “the process”, not “a process”, thus supposedly implying that any process of assigning numbers to properties—with only a slight paraphrase—is a measurement. This was conceived in the context of a foundationalist endeavor aimed at framing measurement as a core enabler of science (1920: p. 267):

Physics could be distinguished from other sciences by the part played in it by measurement. Other sciences measure some of the properties they investigate but it is generally recognized that when they make such measurements they are always depending, directly or indirectly, on the results of physics. All fundamental measurements belong to physics, which might almost be described as the science of measurement.

Here the term “fundamental measurement” is used with a specific meaning, introduced by Campbell himself: being fundamental is what characterizes properties whose instances can be directly compared with each other by equivalence and order and can be additively combined (Campbell used the term “physical addition”, p. 279, for what has later been called “concatenation”, e.g., by Krantz et al., 1971: p. 2). This fundamentality is due to the fact that, for any three objects a, b, and c having the property P, the construction introduced in Sect. 6.3 applies, i.e., if P[a] ≈ P[b] and P[a] + P[b] ≈ P[c] then P[c] = 2 P[a], or P[c] / P[a] = 2: with no other conditions—hence with no previously defined units, measurement standards, metrological traceability chains, instrument calibrations, etc.—numbers have thus been assigned to ratios of properties. Moreover, were P[a] conventionally set as the unit u, the previous equality would become P[c] = 2 u, which is an example of a Basic Evaluation Equation and the simplest case of a measurement result, under the condition that measurement uncertainty need not be reported explicitly. If this is all that is required for (fundamental) measurement, then only the adjective “physical” in the term “physical addition” ties measurement to the empirical world: indeed, exactly the same structural conditions may apply to properties of mathematical objects, through purely computational processes (and this could indeed be the underlying reason for the NIST definition of <measurement> as “an experimental or computational process”, as quoted in Footnote 11 of Chap. 5).
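To make this construction concrete, here is a minimal sketch (in Python) of how numbers can be assigned to ratios of properties using only equivalence comparison and concatenation, with no previously defined unit; the objects, their hidden magnitudes (used only to simulate the empirical comparisons), and the helper functions are hypothetical illustrations, not taken from the text.

```python
# Minimal sketch: numbers assigned to ratios of properties using only equivalence
# and concatenation ("physical addition"), with no unit defined in advance.
# Hidden magnitudes stand in for the empirical objects; the procedure only uses
# the comparison and combination operations below.

from dataclasses import dataclass

@dataclass
class Rod:
    _magnitude: float  # hidden; accessed only through the empirical operations

def equivalent(p, q, tol=1e-9):
    """Empirical comparison by equivalence, P[a] ~ P[b]."""
    return abs(p._magnitude - q._magnitude) <= tol

def concatenate(p, q):
    """Physical addition ("concatenation"), P[a] + P[b]."""
    return Rod(p._magnitude + q._magnitude)

def ratio_to(unit, target, max_copies=1000):
    """Count how many copies of `unit` must be concatenated to match `target`."""
    total, n = unit, 1
    while not equivalent(total, target) and n < max_copies:
        total = concatenate(total, unit)
        n += 1
    return n

a, b, c = Rod(1.0), Rod(1.0), Rod(2.0)
assert equivalent(a, b)                  # P[a] ~ P[b]
assert equivalent(concatenate(a, b), c)  # P[a] + P[b] ~ P[c]
print(ratio_to(a, c))                    # 2: P[c] = 2 P[a], with no prior unit, standard, or calibration
```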

In fact, given his interest in establishing measurement as a foundation for science, Campbell included an almost incidental second condition: “in order that a property should be measured as a fundamental magnitude, involving the measurement of no other property, it is necessary that a physical process of addition should be found for it” (p. 267). Together with additivity, he considered “involving the measurement of no other property” as the basis for the distinction between fundamental and derived properties, and then between fundamental and derived measurement. This distinction was later refined and became a sort of default reference, in particular in the version presented by Brian Ellis (1968), where measurements—and more properly measurement methods—are classified as either fundamental (for which Ellis also used the term “direct”) or indirect, so that every measurement method that is not fundamental/direct is considered to be indirect.

As the scope of measurement broadened, and (especially) psychophysicists and psychologists argued against the necessity of physical addition operations for measurement, Campbell’s second condition—that no other properties are involved in the process—became the characterizing feature of those methods which were then called “direct methods” of measurement, as exemplified by the first edition of the VIM, which defines <direct method of measurement> as a “method of measurement in which the value of a measurand is obtained directly, rather than by measurement of other quantities functionally related to the measurand” and <indirect method of measurement> as “method of measurement in which the value of a measurand is obtained by measurement of other quantities functionally related to the measurand” (ISO et al., 1984: 2.13 and 2.14). Similar definitions or characterizations can be found elsewhere in the literature, for example (Lira, 2002: p. 39)

Many times the information about the measurand is acquired from the readings of a single measuring instrument. We will refer to such a quantity as subject to a direct measurement. However, the metrological meaning of the word ‘measurement’ is more ample: it also refers to quantities whose values are indirectly estimated on the basis of the values of other quantities, which in turn may or may not have been directly measured.

or (Gertsbakh, 2003: p. viii)

There are two principal types of measurements: direct and indirect. For example, measuring the voltage by a digital voltmeter can be viewed as a direct measurement: the scale reading of the instrument gives the desired result. On the other hand, when we want to measure the specific weight of some material, there is no such device whose reading would give the desired result. Instead, we have to measure the weight W and the volume V, and express the specific weight p as their ratio: p = W/V. This is an example of an indirect measurement.

The idea that a measurement performed according to a direct method (a direct measurement, for shortFootnote 2) is one that does not involve properties other than the measurand seems to be accepted also by the Guide to the expression of uncertainty in measurement (GUM), which notes that “in most cases, a measurand Y is not measured directly, but is determined from N other quantities X1, X2, …, XN through a functional relationship f, Y = f(X1, X2, …, XN)” (JCGM, 2008: 4.1.1; emphasis added). This was discussed by Walter Bich as follows (p. 272; emphasis added):

even the simplest, seemingly direct measurements [...] fall into this categorization. For example, the indication of a bathroom balance, which is expressed in divisions of the scale, is not the measurand Y (which is the mass of the person in kilograms), but simply one of the input quantities, say, X1. The measurand is obtained from the indication X1, perhaps repeated two or three times, and a series of corrections X2, X3, ..., XN (the zero and the span of the scale, and perhaps its linearity, or the deviation of the local acceleration due to gravity from that of the place in which the balance was manufactured and adjusted).

The consequence is then straightforward: from this perspective, since “even the simplest model will be incomplete if corrections to the indications of the instruments used in direct measurements are not taken into account, […] no measurement can strictly be considered to be ‘direct’.” (Lira, 2002: p. 50; emphasis added).
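Bich’s bathroom-balance example can be rendered schematically as follows; the linear form of the model and all numeric values are hypothetical, chosen only to show how an indication and a series of corrections enter a functional relationship Y = f(X1, X2, …, XN).

```python
# Illustrative sketch of Bich's point: even a "direct" reading enters a functional
# relationship Y = f(X1, ..., XN). The linear form of f and every numeric value are
# hypothetical, used only to show the structure indication + corrections -> measurand.

def mass_from_indication(x1_scale_divisions,
                         zero_offset_div=0.3,       # X2: zero correction, in scale divisions
                         span_kg_per_div=0.998,     # X3: span (scale factor) correction
                         gravity_correction=1.0005  # X4: local-gravity correction factor
                         ):
    """Y = f(X1, X2, X3, X4): measurand (mass, in kg) from a bathroom-balance indication."""
    return (x1_scale_divisions - zero_offset_div) * span_kg_per_div * gravity_correction

indications = [72.4, 72.6, 72.5]           # repeated readings, in scale divisions
x1 = sum(indications) / len(indications)   # X1: averaged indication
print(round(mass_from_indication(x1), 2))  # measured value of Y, in kg
```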

Something peculiar can be observed in this sequence of apparently coherent steps: it started by emphasizing that the foundational role of measurement is guaranteed by fundamental, or direct, measurement, and ended with the admission that, in practice, no measurement can be, in this sense, direct!

This shift was plausibly driven by the recognition that even in the simplest measurements some computational activity is required, as part of the modeling of the empirical process that takes place in the interaction between the object under measurement and the measuring instrument in the given experimental context. And in fact, significantly, the functional relationship f mentioned above is described by the GUM as a model of measurement (JCGM, 2008: 4.1.2), thus with the understanding that “observations are never interpreted independently of some abstract model of the […] system” (Cook, 1994: p. 4), because “observation of x is shaped by prior knowledge of x” (Hanson, 1958: p. 19), i.e., all observations are unavoidably “theory-laden”.Footnote 3

Let us take for granted that measurement cannot produce “pure empirical data”—whatever this could mean—both because some background model is always and unavoidably, though sometimes only implicitly, present also in measurement, and because measurement produces information, not empirical entities, and therefore it must include an informational stage (see Sect. 7.2.3 for a preliminary justification of this). However, this only affects the claim that there can be methods of measurement that exclusively require empirical operations, a position that can then easily be seen to be wrong. What is at stake here is much more than that: it is the very possibility of providing a justification for the epistemic authority of measurement, as based on an understanding of the relationship that measurement establishes between the empirical and the information world.

Indeed, the acknowledged foundational role of measurement for empirical sciences needs to be explained by showing where, in principle, measurement ends and other information production processes (such as computation, opinion making, and guessing) start. The thesis proposed and developed here is that this justification must be found in the distinction between direct and indirect (methods of) measurement, thus providing a way to overcome this possible crisis of foundationalism in measurement (Mari, 2005).

Consider again the sentence quoted above from the GUM: the functional relationship f, Y = f(X1, X2, …, XN), through which a value is attributed to the measurand Y is what the VIM calls a measurement model, i.e., a “mathematical relation among all quantities known to be involved in a measurement” (JCGM, 2012: 2.48). This assumes the quantities Xi to be somehow evaluated, but does not set any constraint on this matter: in particular, could their values come from simulations, or hypotheses, or guesses? Without an independent characterization of the concept <measurement>, the idea that a mathematical entity f is a model of a measurement, and therefore an interpretation of what a measurement is, blurs everything. And the issue is not settled simply by acknowledging the unavoidable presence of models in measurement. Indeed, as we discuss below, together with what the VIM calls “measurement models” there are at least two other kinds of models that are relevant in this context:

  • models of the measuring instrument behavior, with respect to the environmental properties that influence the relation between the property being measured and the instrument indicationFootnote 4;

  • models of the measurand, with respect to the way the measurand is affected by, and more generally related to, other properties.Footnote 5

An exploration of the distinction between these (kinds of) models and their relations will lead us to a better understanding of the role of models both of and in measurement, and in consequence of measurement as such. The position that we are going to propose here is that what the VIM calls “measurement model” is not a model of a measurement, but a component of an evaluation process, which is an indirect measurement only if it also includes at least one direct measurement: hence, the characterization of direct measurements is pivotal for our endeavor.

7.2.1 Recovering a Meaningful Distinction Between Direct and Indirect Measurement

The term “direct measurement” does not have a single inherent meaning and it is not trademarked, of course: to our knowledge Campbell did not use it; Albert de Forest Palmer characterized it as “the determination of [the value] by direct observation of the measured quantity, with the aid of a divided scale or other indicating device graduated in terms of the chosen unit” (1912: p. 11), and Ellis might have been the first to include it in a structured analysis of measurement, though with the not-so-clear characterization that “direct measurement is any form of measurement which does not depend upon prior measurement” (1968: p. 56). Were the idea of direct measurement intended as designating a model-free form of measurement, it should simply be dismissed, given the acknowledgment that purely empirical and model-free measurement is not possible.

Rather, we propose to recover a meaningful distinction between direct and indirect measurement by restarting from the definition in IEC 60050 / Electropedia of <direct (method of) measurement>: “method of measurement in which the value of a measurand is obtained directly, without the necessity for supplementary calculations based on a functional relationship between the measurand and other quantities actually measured” (IEC: 311–02-01). Admittedly, the definition is circular, since <direct measurement> is defined in terms of directly obtaining something. A possible amendment of the IEC definition that eliminates this circularity could be as follows:

(preliminary, provisional characterization of <direct method of measurement>) a measurement method is direct if it is based on the use of a measuring instrument that is designed to empirically interact with properties of the same kind as the measurand

where the adjective “direct” refers to such a condition of direct interaction.Footnote 6 It is also worth noting that Ellis acknowledged the importance of this typology and called it “associative measurement”, though, in contrast to the proposal we advance here, he considered it a form of non-fundamental and therefore indirect measurement (1968: p. 90).

Many measurements are direct in this sense: basically, whenever the core empirical process is a transduction from a property of the same kind as the measurand to another property (which in current physical applications is typically an electric quantity), as implemented in a sensor. Examples of direct measurement would then include both the measurement of temperature by means of an alcohol thermometer—where the thermometer interacts with the object under measurement and transduces its temperature into a position of the upper surface of the alcohol in the glass tube (temperature has well-known characteristics today, yet has a long and complex measurement history; see for example Chang, 2007)—and the measurement of reading comprehension ability by means of a test—where the items in the test interact with the reader and transduce her reading ability into a pattern of item responses.

However, we consider the characterization above provisional because it provides at most a necessary but not sufficient condition for a method to be direct. In fact, it turns out that a measurement could be indirect—according to the usual way the adjective is used as attributed to measurement methods—even if the property with which the measuring instrument interacts and the measurand are of the same kind.

The philosophical tradition gives us a nice example of this. When visiting some Egyptian priests, Thales asked them about the height of the Great Pyramid and, since they were unable to answer the question, he devised a procedure to get the information by himself. Thales measured the length of the shadow cast by the Pyramid and compared it with the length of the shadow cast by a vertical rod of known length, so that he could use his theorem on the proportionality of the sides of similar triangles to infer the Pyramid’s height. If asked to assess this case, we would say that Thales performed an indirect measurement, through the direct measurement of the length of the shadow cast by the Pyramid and the length of the shadow cast by the rod, though these quantities are of the same kind as the measurand, being in fact all lengths. Therefore, the previous characterization may be refined as follows:

(refined, provisional characterization of <direct method of measurement>) a measurement method is direct if it is based on the use of a measuring instrument that is designed to empirically interact with properties of the same kind as the measurand and is actually coupled with the object carrying the measurand

In Thales’ case, he first needed to measure the lengths of two shadows: while the instrument he used for this purpose was designed to empirically interact with lengths, it was not coupled with the object carrying the measurand, i.e., the Pyramid, but with other objects, i.e., the shadows of the Pyramid and of the rod.
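In symbols, Thales’ reasoning rests on the proportionality of the sides of similar triangles: with h denoting heights and s shadow lengths,

$$\frac{h_{\text{pyramid}}}{s_{\text{pyramid}}} = \frac{h_{\text{rod}}}{s_{\text{rod}}}, \quad \text{hence} \quad h_{\text{pyramid}} = h_{\text{rod}} \, \frac{s_{\text{pyramid}}}{s_{\text{rod}}}$$

so that the sought height is computed from the known height of the rod and the two directly measured shadow lengths.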

While again still provisional, this characterization allows us to describe the basic structure of a direct measurement process, as it is presented in Sect. 7.2.3, as follows.

  • Transduction. The measuring instrument is put in interaction with the object under measurement with respect to a property of the object; as a result, the instrument changes its state, by transducing the property under measurement to another property, i.e., the instrument indication.

  • Instrument scale application. The instrument indication, which is still an empirical property, is associated with an indication value through the application of the instrument-related scale; this is the crucial step in which an empirical entity (e.g., the position of the upper surface of the alcohol in the tube of a thermometer; a pattern of responses to a set of test items) is associated with an information entity (e.g., a value of position; a number of correct answers).

  • Calibration function computation. The indication value is mapped to a measurand value by computing the instrument calibration function, consistently with the VIM definition, according to which the second step of a calibration “establish[es] a relation for obtaining a measurement result from an indication” (JCGM, 2012: 2.39), and where corrections to the indications of the instruments are part of the calibration function.

Hence, the sequence

$${\text{transduction}} \to {\text{instrument}}\,{\text{scale}}\,{\text{application}} \to {\text{calibration}}\,{\text{function}}\,{\text{computation}}$$

may be interpreted as a process mapping a property of an object to a measured value, as depicted in Fig. 7.1 (measurement uncertainty is not considered here).

Fig. 7.1 The basic structure of a direct measurement (adapted from Fig. 2.10). The block diagram presents the measurement object, transduction, instrument indication, scale application, value, calibration, and measured value in series.
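As a minimal illustration of this sequence, the following sketch composes the three stages for an alcohol thermometer; the scale resolution, the linear form of the calibration function, and its coefficients are hypothetical assumptions, used only to show the structure of Fig. 7.1.

```python
# Illustrative composition of the three stages of Fig. 7.1 for an alcohol thermometer.
# The scale resolution, the linear calibration function, and its coefficients are
# hypothetical; transduction itself is empirical and is only represented by its outcome.

def instrument_scale_application(alcohol_position_mm, resolution_mm=0.5):
    """Instrument indication (an empirical position) -> indication value, on a 0.5 mm scale."""
    return round(alcohol_position_mm / resolution_mm) * resolution_mm

def calibration_function(indication_value_mm, c0=-10.0, c1=2.5):
    """Indication value (mm) -> measurand value (degrees Celsius), assuming a linear calibration."""
    return c0 + c1 * indication_value_mm

# Transduction: the object-thermometer interaction yields a position, say 12.6 mm.
indication = instrument_scale_application(12.6)    # -> 12.5 mm (indication value)
measured_value = calibration_function(indication)  # -> 21.25 degrees Celsius (measured value)
print(indication, measured_value)
```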

It is crucial that both empirical and non-empirical stages—i.e., transduction and calibration function computation respectively—are required to complete this process. This provides a key to distinguishing measurement from computation and to highlighting a crucial asymmetry with respect to their sources of justification: a necessary, though insufficient, condition for computation to produce justified results is that its input data are justified. Hence, if such input data comes from and is about empirical properties, only procedures based on direct measurement provide empirically justified results. Even in the case of a measurement which is non-direct, one or more measuring instruments that produce the input for the computations are required to empirically interact with properties of the relevant objects. This allows us to propose the following provisional characterization:

(provisional characterization of <indirect method of measurement>) a measurement method is indirect if it is not direct, i.e., if the measuring instruments are not designed to empirically interact with properties of the same kind as the measurand or if they are not coupled with the object carrying the measurand

As a consequence, any measurement requires that at least one direct method be applied, for the measurement of one or more “intermediate” measurands, as in the case of the (indirect) measurement of density via the (direct) measurements of mass and volume, which then operate as intermediate measurands. This is consistent with the definition in IEC 60050 / Electropedia of <indirect (method of) measurement> as a “method of measurement in which the value of a quantity is obtained from measurements made by direct methods of measurement of other quantities linked to the measurand by a known relationship” (IEC: 311–02-02). If such direct measurements are considered as black boxes, which produce values of the n–1 intermediate measurands, the functional relationship f presented by the GUM is the combination function of an indirect measurement, which models the relation among some properties of the object under measurement (density, mass, and volume in the example), but not the behavior of an instrument. Figure 7.2 summarizes this conceptual structure.

Fig. 7.2 The basic structure of an indirect measurement, including some unavoidable direct measurements (adapted from Fig. 2.11). The block diagram presents two direct measurements whose outputs enter a combination function, producing the result of the indirect measurement.
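This structure can be sketched as follows, with the two direct measurements treated as black boxes that produce values of the intermediate measurands, and the combination function ρ = m/V as a model of the object rather than of an instrument; the numerical values and the first-order propagation of uncertainties for uncorrelated inputs are illustrative assumptions.

```python
# Illustrative sketch of an indirect measurement: two direct measurements, treated as
# black boxes producing values of the intermediate measurands (mass and volume), are
# combined by the function rho = m / V. Numbers are hypothetical; the uncertainty
# propagation is the standard first-order formula for a ratio of uncorrelated inputs.

from math import sqrt

def direct_measurement_of_mass():    # black box: value and standard uncertainty, in kg
    return 0.7860, 0.0005

def direct_measurement_of_volume():  # black box: value and standard uncertainty, in m^3
    return 1.000e-4, 0.002e-4

def density(m, v):                   # combination function: a model of the object, not of an instrument
    return m / v

m, u_m = direct_measurement_of_mass()
v, u_v = direct_measurement_of_volume()
rho = density(m, v)
u_rho = rho * sqrt((u_m / m) ** 2 + (u_v / v) ** 2)
print(f"rho = {rho:.0f} kg/m^3, u = {u_rho:.0f} kg/m^3")  # about 7860 kg/m^3, with u of about 16 kg/m^3
```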

Even this simple analysis highlights the fundamental differences between direct and indirect methods of measurement, where a computation component is present in both, but with clearly different roles. Let us compare these methods in more detail in Table 7.1.

Table 7.1 A comparison of direct and indirect methods of measurement, with respect to the role of the computation component

Of course, our point is not merely lexical: we emphasize the importance of maintaining a distinction between these two methods,Footnote 7 be they called “direct” and “indirect” or anything else.

7.2.2 Refining the Distinction Between Direct and Indirect Measurement: First Step

We have implicitly assumed so far that in direct measurements the property intended to be measured, i.e., the measurand (JCGM, 2012: 2.3), is the same as the property with which the measuring instrument interacts, which the VIM calls the “quantity being measured”. However, as the VIM itself acknowledges (see Note 3 to def. 2.3), the measurand is not necessarily the same as the quantity being measured. On this matter we have introduced in Sect. 7.2.3 the distinction between

  • the intended property: the measurand, i.e., the property that is intended to be measured and to which the measurement result is attributed, and

  • the effective property: the property of the object under measurement with which the instrument actually interacts and that causes an effect through the transduction performed by the instrument.

This accounts for the ambiguity of the terms “measured property” and “property under measurement”, which may refer to either the intended property or the effective property. This distinction is not maintained in daily measurements, in which the measurand is often not explicitly defined, because the interest is to measure “here and now” and nothing else, with the goal of obtaining information about the property that induces the transduction in the instrument.Footnote 8 But whenever measurement results are aimed at providing transferable information, such an implicit, indexical definition is no longer appropriate, and the model of the measurand needs to be improved, possibly by including in it the specification of values of properties by which it is affected. In Sect. 7.3.2 we have called such properties “affecting properties”, because they are the properties that affect the measurand (it is unfortunate that the VIM does not have an entry, nor a term, for this concept). For example, the length of an iron rod is affected by temperature, because of thermal expansion, and therefore a measurand could be defined as the length of the rod at a given temperature, say 293.15 K.

It should be noted that, generally, in a given measurement context the affecting properties and the influence properties are not the same:

  • the affecting properties are causally related to the measurand, and therefore are in principle independent of the measuring instrument (in the example above the length of the rod is modeled in such a way that the temperature of the environment is an affecting property);

  • the influence properties are causally related to the transduction implemented in the measuring instrument, and alter its behavior in producing an indication in response to a given effective property

where, of course, a property may be both an affecting property and an influence property: this does not remove the principled distinction.

Since affecting properties enter the model of the measurand and influence properties enter the model of the instrument behavior, and since our provisional characterizations of direct measurement and indirect measurement are grounded on the distinction between these two models, as Table 7.1 highlights, our introduction of the distinction between affecting properties and influence properties requires us to refine the distinction between direct and indirect (methods of) measurement introduced above.

Let us then suppose that the measurand is defined as the volume of an iron body at the temperature of 293.15 K and that it has been established—by, say, an independent measurement—that the current temperature is 287.65 K, possibly with some measurement uncertainty, which is not relevant here. There are at least three possible strategies for coping with this situation:

S1: prior to measurement an empirical action is performed for changing the temperature of the body to the specified value, thus operating what could be called an empirical correctionFootnote 9;

S2: the difference between the specified and actual temperatures is taken into account by suitably increasing definitional uncertainty, for example by admitting that the measurand is defined within a given range of temperatures, and thus still independently of the way the measurement is performed; at the end measurement uncertainty is compared with definitional uncertainty to check that the former is not lower than the latter (if it were, it would mean that some resources had been wasted in performing unnecessarily accurate measurements)Footnote 10;

S3: whether an empirical action is performed for changing the temperature of the body or not, both the current volume of the body and its current (thus possibly modified) temperature are measured, and then their values are properly combined, by a function that could be called a computational correction, to obtain a value for the measurand.

In all three of these strategies the measuring instrument is designed to empirically interact with properties of the same kind as the measurand, i.e., the volume of the body: according to the provisional characterization proposed above, these cases would then be all classified as direct measurements. However, their analysis in the light of the comparison proposed in Table 7.1 reveals some significant differences. Let us consider them with respect to one simple question: what if the model of the measurand (and more generally of the object under measurement) is wrong?

  • S1 does not actually rely on the model of the measurand, as all that is required for justifying the choice of performing the empirical correction is the qualitative hypothesis that the measurand somehow depends on temperature; were the model discovered to be wrong—i.e., if the volume of the body did not depend on its temperature as stated—this strategy would remain effective. When using this strategy, measurement results do not depend on the validity of the model of the measurand: even if volume did not depend on another property, such as temperature, as expected, the measurement would directly lead to a result, and the measurement uncertainty would not be affected by this;

  • S2 might rely on the model of the measurand, but only for evaluating the contribution of the difference of temperatures to definitional uncertainty, what traditionally is called a “bias”, i.e., the estimate of a systematic error (JCGM, 2012: 2.18) induced by such a difference; were the model discovered to be wrong, the only consequence would be that definitional uncertainty might be under- or over-evaluated. Under the assumption that definitional uncertainty is compared with measurement uncertainty, not propagated as a component of the uncertainty budget (see Sect. 3.2.4), when using this strategy as well, measurement results do not depend on the validity of the model of the measurand;

  • S3 entirely relies on the model of the measurand: measurement results are obtained by computing a combination function that models the relationship among the relevant properties of the object: were the relation between temperature and volume discovered to be significantly different from the one used to compute the correction, measurement results should be changed accordingly; moreover, the combination function is exploited for propagating the uncertainties of (effective) volume and temperature, to obtain the uncertainty of (modeled, i.e., intended) volume. When using this strategy, measurement results do depend on the validity of the model of the measurand.

Both S1 and S2 fulfill the conditions in the left column of Table 7.1 and therefore can be considered uncontroversial cases of direct measurement, thus showing that a measurement may be direct even if some affecting properties are acknowledged to be part of the model of the measurand. S3 may be described as the two (direct) measurements of (uncorrected) volume and temperature, followed by a computational correction for obtaining a value of (intended) volume by computing a combination function which is (part of) the model of the measurand. Hence this structure fits with the conditions in the right column of Table 7.1. However, S3 is not a typical case of indirect measurement, such as when the density of a body is measured by computing its value as a function of the values of the mass and the volume of the body. Hence this difference needs to be further analyzed.
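To make the computational correction of S3 concrete, here is a hedged sketch; the linear volumetric-expansion model, the value of the expansion coefficient, and the measured values are illustrative assumptions, not taken from the text.

```python
# Illustrative sketch of strategy S3: the measurand is the volume of an iron body at
# T_ref = 293.15 K, while the instrument interacts with the body at its current temperature.
# The linear volumetric-expansion model and the coefficient value are assumptions, used
# only to show how the model of the measurand enters the computation of the result.

T_REF = 293.15        # K, temperature specified in the definition of the measurand
GAMMA_IRON = 3.5e-5   # 1/K, illustrative volumetric expansion coefficient for iron

def computational_correction(volume_measured_m3, temperature_measured_K,
                             t_ref=T_REF, gamma=GAMMA_IRON):
    """Combine the directly measured (effective) volume and temperature into a value of the
    (intended) volume at t_ref, assuming V(T) = V(t_ref) * (1 + gamma * (T - t_ref))."""
    return volume_measured_m3 / (1.0 + gamma * (temperature_measured_K - t_ref))

v_measured = 1.000000e-3  # m^3, directly measured volume at the current temperature, 287.65 K
print(computational_correction(v_measured, 287.65))  # slightly larger: the body would expand at 293.15 K
```

If the assumed relation between volume and temperature were wrong, the value returned by this function (and any uncertainty propagated through it) would change accordingly, which is exactly the sense in which S3 depends on the validity of the model of the measurand.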

7.2.3 Refining the Distinction Between Direct and Indirect Measurement: Second Step

At the core of a model of a measurand is the hypothesis that the general property of which the measurand is an instance is an element of a network of properties with which the general property is in a lawlike relation (Sect. 6.6 presents a short analysis of this important subject). There are two basic reasons for exploiting such a network in calculating a value of the measurand, as exemplified by the two cases mentioned above:

  • (example 1) the volume of an object is related to its temperature, i.e., the network includes volume and temperature; the definition of the measurand specifies a reference temperature, and the network allows us to correct the directly measured value of volume by taking into account the difference between the measured temperature and the specified temperature;

  • (example 2) the density of an object is related to its mass and volume, i.e., the network includes density, mass, and volume, and allows us to compute a value of density as a function of the measured values of mass and volume.

Hence, both examples refer to the model of the measurand, but only the first has to do with the possible distinction between the effective property and the intended property, as generated by some affecting properties, given that we would surely not think of density as affected by mass and volume, nor that mass and volume operate as corrections for density. This calls for a model-related refinement of the characterization of the distinction between direct and indirect (methods of) measurement proposed above. Let us then revise Table 7.1 accordingly by adding a middle column, where the left column is the same as in Table 7.1 and examples 1 and 2 are cases for the middle and the right columns respectively.

It should be noted that all three methods include a mathematical model in the form of a function by means of which a value for the measurand can be calculated and its uncertainty evaluated, with the consequence that measurement cannot be a purely empirical process: none of these methods can provide “pure data”. Only from an abstract perspective, however, may methods A, B, and C be treated in an undifferentiated way, under the consideration that all of them require calculating a function “among all properties known to be involved in a measurement”, paraphrasing from the previously-quoted VIM definition of ‘measurement model’ (JCGM, 2012: 2.48). Although the same formal rules for, say, uncertainty propagation apply to the three cases, maintaining the distinctions presented in Table 7.2 seems to be helpful for a better understanding of the structure of the measurement process and the role of mathematical models in it.

Table 7.2 A comparison of three general methods of measurement, provisionally called “method A”, “method B”, and “method C”

This analysis shows that measurement methods can be classified according to two general criteria, related

  • (first criterion) to the way in which the measuring instrument is designed and coupled with the object that carries the measurand, and

  • (second criterion) to the way in which the measurand is modeled and this model is exploited to compute the measurement result.

In light of this distinction and in reference to the content of Table 7.2 again, method A (left column) and method C (right column) may be acknowledged to be direct and indirect respectively, according to these characterizations, where parts (i) and (ii) in the two definitions below are based on the first and the second criterion respectively:

a measurement is based on a direct method (as in the left column of Table 7.2) when

(i) an instrument is used that is coupled with the object carrying the measurand and is designed to interact with instances of the general property of the measurand, and

(ii) the model of the measurand is only used in measurement for identifying the measurand.

and:

a measurement is based on an indirect method (as in the right column of Table 7.2) when

(i) instruments are used that are not necessarily coupled with objects that carry the measurand or designed to interact with instances of the general property of the measurand, and

(ii) the model of the measurand is used in measurement for identifying the measurand and its dependence on the properties from which the measurement result can be computed.

However, method B (middle column of Table 7.2) is not uniquely characterized, being analogous to a direct method with respect to the first criterion and to an indirect method with respect to the second criterion. Since we are not interested here in lexical issues, we simply call this method “direct/indirect”:

a measurement is based on a direct/indirect method (as in the middle column of Table 7.2) when

(i) an instrument is used that is coupled with the object carrying the measurand and is designed to interact with instances of the general property of the measurand, and

(ii) the model of the measurand is used in measurement for identifying the measurand and its dependence on affecting properties, from which the measurement result can be computed.

This mixed case highlights the complexity of our subject, and perhaps explains some of the confusion around the distinction between direct and indirect methods of measurement. Furthermore, the claims that any measurement is based on either a direct or an indirect method, in the sense proposed here, and that any indirect measurement requires at least one direct measurement, contribute to the clarification of the strategic issue of whether measurement science is becoming a branch of data science: the answer is negative. Even though the computational components are becoming more and more important, measurement maintains its distinction from computation by virtue of involving empirical components: any direct measurement includes at least one empirical component, and any measurement includes at least one direct measurement.

Hence, the framework we are going to propose needs to embed a structural model of direct measurement and to be based on it.

7.3 A Structural Model of Direct Measurement

We are grounding our account on the model proposed by Giordani and Mari (2019), which shares important features with the Berkeley Assessment System (BAS) model (Wilson, 2005; Wilson & Sloane, 2000), and Evidence Centered Design (ECD: Mislevy et al., 2003). We initially use the same example as in Giordani and Mari to help make the description of the structural model clearer and more concrete. The aim of this section is to supply a structural model for the fulfillment of the Basic Evaluation Equation for a generic property, in the case that we characterized in the previous section as direct measurement.

In reference to what we have presented in Sect. 7.2.3, our starting point is the consideration that measurement is a process that maps empirical entities to informational entities, i.e., empirical properties of objects to values, and this highlights the fundamental role that scales have in measurement (and more generally evaluation) processes, as also introduced in Box 6.2. What could be called a Basic Evaluation ScaleFootnote 11 is then a mapping

$$\text{property}\,\text{of}\,\text{a}\,\text{given}\,\text{object} \to \text{value}\,\text{identifier}\,\text{of}\,\text{a}\,\text{property}$$

from a set of comparable and distinguishable properties of objects, all of them being instances of the same general property, to a set of distinct identifiers each corresponding to a value of that general property, where the mapping is constrained by the conditions of scale transformation (see Sects. 6.3.2 and 6.5.1). For example, for ordinal evaluations scales are defined up to monotonic transformations among identifiers, and for ratio evaluations scales are defined up to changes of the units. While syntactically equivalent to a Basic Evaluation Equation, a Basic Evaluation Scale differs in that it is not, in principle, true or false depending on which identifier, and then which value, is chosen for each given property. Rather, it is a specification, which is instead required to be consistent: that is, if the property P[a] is greater than the property P[b] then the value specified for P[a] must be greater than the value specified for P[b], and so on. This can be summarized as in Table 7.3.

Table 7.3 A comparison of Basic Evaluation Equations and Basic Evaluation Scales
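The consistency requirement can be illustrated with a small sketch, in which the empirical ordering of three hypothetical properties and the candidate scales are invented for the purpose: an assignment of identifiers is admissible only if it preserves the empirical order, and, for an ordinal evaluation, any monotonic re-labelling is equally admissible.

```python
# Illustrative check of the consistency required of a Basic Evaluation Scale: if P[a] > P[b]
# empirically, then the value assigned to P[a] must exceed the value assigned to P[b].
# The objects, their empirical ordering, and the candidate scales are all hypothetical.

empirical_order = [("a", "b"), ("b", "c")]   # pairs (x, y) meaning P[x] > P[y]

scale_1 = {"a": 3, "b": 2, "c": 1}           # a consistent assignment
scale_2 = {"a": 30, "b": 20, "c": 5}         # a monotonic re-labelling: equally admissible (ordinal case)
scale_3 = {"a": 1, "b": 2, "c": 3}           # inconsistent: it reverses the empirical order

def is_consistent(scale, order):
    return all(scale[x] > scale[y] for (x, y) in order)

for s in (scale_1, scale_2, scale_3):
    print(is_consistent(s, empirical_order))  # True, True, False
```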

Two scales are involved at the core of any direct measurement. In the example of the measurement of the temperature Θ of an object a, Θ[a], by means of an alcohol thermometer, we can consider the following.

  • One is the scale that maps the temperatures of some already-established measurement standards to their values. For temperature, there would be a set of standards of temperature {sj*}, each having a temperature Θ[sj*], and a Basic Evaluation Scale of temperature would be built upon itFootnote 12:

for each sj* in the given set, Θ[sj*] → θj

where θj is the j-th value in the scale.Footnote 13 For example, bodies at the boiling and freezing points of water, at sea-level, could be two such temperature standards, and the numbers 100 and 0 in the given scale could be the chosen values. Since measurement standards are expected to be socially available for supporting metrological traceability (see Sect. 3.3.1), a scale about measurement standards may be called a public scale.

  • The other is the scale that maps the instrument indications for a specific instrument (or type of instrument) to indication values. For an alcohol thermometer, this would map the positions of the upper surface of the alcohol in the tube of the thermometer to position values: thus, a set of positions Xi* of the upper surface of the alcohol in the tube of the thermometer is chosen, and a Basic Evaluation Scale of position is built upon it:

for all Xi* in the given set, Xi* → xi

where xi is the i-th value in the scale (in millimetres, say). This scale is specific to the given measuring instrument, and therefore may be called a local scale.Footnote 14

The fundamental structure of a direct measurement is in the relation between a public scale and a local scale:

  • the measurand is a property of the same kind as the properties in the public scale, and measurement needs to produce measured values in the public scale;

  • the measuring instrument performs a transduction from the property under measurement to a property of the same kind as the properties in the local scale (e.g., for an alcohol thermometer, from temperature to positions), so that indication values are values in the local scale (say, millimetres).

As we will see, the calibration of a measuring instrument may be interpreted as the process of connecting a public scale with the local scale of the instrument. It is this connection that makes it possible for a calibrated instrument to obtain a measurand value from an instrument indication value, i.e., a value in a public scale from a value in a local scale, as depicted in Fig. 7.3.

Fig. 7.3 Calibration as a relation of a public scale and a local scale. The diagram presents the local scale (instrument indication values) at the top and the public scale (measurement standard values) at the bottom, with calibration marked between them.

It is upon this idea that our structural model of direct measurement is grounded. We next describe its components and how they are connected with each other, first in the simpler case in which uncertainties are not included in the description and starting from the preliminary process of the design and construction of the measuring instrument.

7.3.1 The Design and Construction of a Measuring Instrument: Background

Direct measurement is enabled by the use of a measuring instrument, a device able to interact with the property under measurementFootnote 15 and to map it to a value in the local scale embedded in the instrument. The usual structure of a measuring instrument may be described as constituted of three functional components: (i) a transducer, (ii) a scale, and (iii) something that matches the transducer outputs with the properties in the scale (an exception is discussed in Sect. 7.3.3).

The design and construction of a measuring instrument are grounded on the consideration of empirical properties of objects as associated with modes of empirical interaction of the objects with their environment, as introduced in Sect. 2.2.3 and then commented on in Sect. 5.1, and therefore on the formulation of the hypothesis—corroborated by appropriate observations—of a causal relationship between the general property of interest and another general property—the transduced property—whose instances are in some sense more readily empirically distinguishable as state transitions of the instrument. As described by Robert Rosen, “every recognition, measurement, discrimination or classification ultimately rests on the capacity of a given system S to induce a dynamics (i.e., a change of state) in another system M; this latter system will be variously called a metre, discriminator, recognizer, classifier, etc. It is this dynamical behavior in M, and particularly its asymptotic behavior, that provides the basis for learning about and describing the original system S.” (1978: p. x).

In the case of temperature this happened with the discovery of thermal expansion, i.e., the transduction effect according to which changes of the temperature of a body cause changes in its volume. In the case of a competence, like reading comprehension ability (RCA), this is typically based on the construction of a test whose items are specifically designed for checking that competence, where the transduced property is then the pattern of responses produced by a reader who responds to those items. We develop the case of temperature here, and a psychosocial case in Sect. 7.3.5.

Let us consider the example of an alcohol thermometer: it is a transducer from temperatures Θ of objects a, Θ[a], to positions Xm of the upper surface of the alcohol column housed in the glass tube of the thermometer, where the index m refers to the measuring instrument. Under a hypothesis of causality, the transduction is modeled as an empirical map Θ[a] → Xm. Prior to its usage in a measurement, the instrument must then be configured by etching a set of marks along the tube, corresponding to the distinguishable positions Xi* of the upper surface of the alcohol in the tube, in such a way that these positions—which could be called local reference properties to emphasize their dependence on the instrument—are mapped to values, so as to establish a local scale, modeled as an empirical-to-informational map Xi* → xi from reference positions Xi* to values of position xi. This map, which is constructed under controlled conditions and then is assumed to be invertible, takes into account the specific features of the instrument, and as such differs from instrument to instrument.

A transducer and a local scale are the main functional components of a measuring instrument. For coupling them, the output of the former needs to be matched with the input of the latter. In an alcohol thermometer this requires being able to establish the reference position Xi* that best matches the transduced position Xm, an operation modeled by a map Xm → Xi*.

On this basis the stages of the process of direct measurement are designed and performed, including the construction and the application of the instrument calibration function, an informational map xi → θj from instrument-related, and therefore local, values of position xi to the sought values of temperature θj.
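The functional components just described can be summarized in a schematic sketch that mirrors the maps Θ[a] → Xm, Xm → Xi*, and Xi* → xi; the transduction law, the mark positions, and the resulting values are hypothetical, used only to show how the components fit together.

```python
# Schematic sketch of the functional components of a measuring instrument (here an
# alcohol thermometer), mirroring the maps introduced above: transduction T[a] -> Xm,
# matching Xm -> Xi*, and local scale application Xi* -> xi. The linear transduction
# law and the mark positions are hypothetical.

class AlcoholThermometer:
    def __init__(self):
        # local reference properties Xi*: positions (mm) of the marks etched along the tube,
        # and the local scale Xi* -> xi (here each value is simply the position in mm)
        self.marks_mm = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
        self.local_scale = {mark: mark for mark in self.marks_mm}

    def transduce(self, temperature_K):
        """Empirical map T[a] -> Xm: thermal expansion moves the alcohol surface
        (a hypothetical law standing in for the physical interaction)."""
        return 0.25 * (temperature_K - 273.15)  # position, in mm

    def match(self, transduced_position_mm):
        """Map Xm -> Xi*: the etched mark that best matches the transduced position."""
        return min(self.marks_mm, key=lambda mark: abs(mark - transduced_position_mm))

    def indicate(self, temperature_K):
        """Transduction, matching, and local scale application in sequence."""
        x_m = self.transduce(temperature_K)
        x_star = self.match(x_m)
        return self.local_scale[x_star]          # xi: a local value (of position, in mm)

print(AlcoholThermometer().indicate(293.15))     # 5.0 mm: a value in the local scale, not yet a temperature
```

Note that the final step returns a value in the local scale, i.e., a value of position: turning it into a value of temperature requires the calibration discussed below.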

7.3.2 The Stages of Direct Measurement

Transduction. The first empirical operation of a measuring instrument is a transduction, in our example from temperatures to positions of the upper surface of the alcohol in the thermometer tube, where this mapping is based on an assumption of causality: that is, via the transducer, the transduced property is the effect of the property under measurement.Footnote 16 This behavior is the outcome of the two basic features of a measuring instrument: it is sensitive to a given property, and it can change its state as the result of its interaction with an object carrying that property. From this perspective, a measuring instrument is a device designed so as to empirically realize a cause-effect relationship as an input–output function, the input being a property of external objects with which the instrument is able to interact and the output being a property of the instrument itself. Thus, suppose we bring the thermometer m into contact with the object a; then, the position of the upper surface of the alcohol is Xm, which is the transduced property obtained by the transduction of Θ[a] via the map Θ[a] → Xm, as depicted in Fig. 7.4.

Fig. 7.4 The first stage of direct measurement: transduction (in the example Θ[a] is the temperature of an object and Xm is a position of the upper surface of alcohol in the tube of a thermometer). The diagram presents the property under measurement, transduction, and the transduced property.

Matching. The output of a transducer is an empirical property. Transduced properties are not always unambiguously distinguishable from each other. However, measuring instruments are designed so as to make the identification of transduced properties possible, and usually effective, through their matching with a predefined set of reference properties Xi*. This is traditionally performed by the measurer, who visually compares the transduced property and some reference properties somehow identified on the instrument—for example the position of the upper surface of alcohol in the tube as matched to the marks etched along the tube—and to the best of her ability tries to minimize indication errors, such as those due to insufficient lighting, parallax, etc. In electronic instrumentation this task is performed by a quantizer (usually a component of an analog-to-digital converter, together with a sampler), which operates as a classifier of the transduced property. Whether manual or automatic, the matching can then be modeled as a map Xm → Xi*, as depicted in Fig. 7.5. In the case of transducers designed to produce already discrete properties, the matching function might be, trivially, the identity.

Fig. 7.5 The first two stages of direct measurement: after transduction, matching (in the example Xm is a position of the upper surface of alcohol in the thermometer tube and Xi* is the position of the i-th mark etched on the scale of the thermometer). The diagram presents the property under measurement, transduction, transduced property, matching, and local reference property in sequence.

Local scale application. Though still empirical, local reference properties can be effectively dealt with thanks to the appropriate design of the measuring instrument, as in the case of the positions identified by etches along the tube of a thermometer. In particular, this allows the instrument manufacturer or its users to establish an instrument-specific, and hence local, scale—i.e., an empirical-informational map from local reference properties to identifiers, and then values, of the transduced property (see the analysis on evaluation scales in Box 6.2). Hence, once the reference property Xi* is identified that best matches Xm, the map Xi* → xi is applied to find the local value that, via the instrument, corresponds to the property under measurement Θ[a], as depicted in Fig. 7.6.

Fig. 7.6 The first three stages of direct measurement: after transduction and matching, local scale application (in the example Xi* is the position of the i-th mark etched on the scale of the thermometer and xi is a value of position). The diagram presents the property under measurement, transduction, transduced property, matching, local reference property, scale application, and value in series.

As Fig. 7.7 shows, a sequence of actions

$${\text{transduction}} \to {\text{matching}} \to {\text{local}}\,{\text{scale}}\,{\text{application}}$$

allows us to attribute local, i.e., instrument-specific, values to properties under measurement.

Fig. 7.7 Pre-measurement, as the composition of transduction, matching, and local scale application. The diagram presents transduction, matching, and local scale application in series, with pre-measurement marked directly between the property and the local value.

While this is already a map from empirical to informational entities, as expected from a measurement, there are two main drawbacks in this:

  • the values are in the instrument scale, not the measured property scale: in our example, this would imply reporting information about temperature in terms of values of length;

  • the values depend on the specific instrument, not only the property under measurement: in our example, this would imply producing information that is valid only for a given thermometer.

Such a “local measurement” is called pre-measurement by Frigerio et al. (2010).Footnote 17
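
To make the composition concrete, the following is a minimal sketch in Python of pre-measurement for an idealized alcohol thermometer; the transduction characteristic, the positions of the marks, and the numerical values are all hypothetical, chosen only to illustrate the chain transduction → matching → local scale application.

```python
# A minimal, hypothetical sketch of pre-measurement for an idealized alcohol
# thermometer: transduction -> matching -> local scale application.

def transduce(theta_a):
    """Map the temperature of the object, Theta[a], to a column position X_m (in mm).
    A linear response is assumed purely for illustration."""
    return 20.0 + 0.8 * theta_a

MARKS_MM = [20.0 + 2.0 * i for i in range(51)]  # positions X_i* of the etched marks

def match(x_m):
    """Identify the index i of the local reference property X_i* that best matches X_m."""
    return min(range(len(MARKS_MM)), key=lambda i: abs(MARKS_MM[i] - x_m))

def local_scale(i):
    """Local scale application: map the i-th mark to a local value x_i (a value of position)."""
    return MARKS_MM[i]

def pre_measure(theta_a):
    """Pre-measurement: Theta[a] -> x_i, an instrument-specific value of position, not of temperature."""
    return local_scale(match(transduce(theta_a)))

print(pre_measure(25.0))  # a local value in mm; it still has the two drawbacks noted above
```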

The drawbacks of pre-measurement can be overcome by introducing a public scale for the property under measurement and calibrating the instrument against it.

Public scale construction. As discussed above, a set of reference properties of the same kind as the property under measurement is made available through a set of measurement standards, and a public scale is thus built as a map from such reference properties to identifiers or values (see the analysis on evaluation scales in Box 6.2). In the case of temperature, Θ[sj*] → θj is a public scale that maps each reference temperature Θ[sj*] of the standard sj* to the value θj, as depicted in Fig. 7.8. A traditional way to accomplish this is by means of two standards, such that Θ[s1*] is the temperature of the freezing point of water and Θ[s2*] is the temperature of the boiling point of water, under given conditions of pressure, to which the values 0 and 100 are conventionally assigned in the Celsius (Centigrade) scale; the values to be associated with any other temperature can then be obtained by means of some form of interpolation, for example as discussed in Sect. 6.3.6.

Fig. 7.8
A model diagram presents public reference properties on the left and public values on the right. Public scale construction in middle links them.

A preliminary stage of direct measurement: public scale construction (in the example Θ[sj*] is the temperature of the measurement standard sj* and θj is a value of temperature)

Calibration. The sequence

$${\text{transduction}} \to {\text{matching}} \to {\text{scale}}\,{\text{application}}$$

can be applied not only to a property under measurement, but also to the reference properties of a public scale. What is obtained are still local values, but in this case they are known to correspond to the public values that were attributed to the reference properties. A pointwise mapping from local values to public values can then be established by repeating the process for each reference property: this is the instrument calibration map. In our temperature example, the chain Θ[sj*] → Xm → Xi* → xi together with the scale Θ[sj*] → θj leads to the calibration map xi ↔ θj, as depicted in Fig. 7.9.Footnote 18 Such a map may be presented by means of a table, listing the relevant pairs (xi, θj), or a (local values, public values) chart, through which any obtained indication value can be associated with a measured value. In some cases, the preferred option is instead to embed the information produced by the calibration directly in the instrument, for example by writing the public values on the instrument scale in place of the indication values, as is common for many thermometers. Of course, this substitution does not imply any structural change in the model we have just presented.

Fig. 7.9
A model diagram presents transduction, matching, and local and public scale applications to produce corresponding values. These values are calibration mapped.

A preliminary stage of direct measurement: calibration
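
As an illustration, the following sketch continues the hypothetical thermometer of the pre-measurement sketch above: the pre-measurement chain is applied to two standards realizing the freezing and boiling points of water, the resulting local values are paired with the conventionally assigned public values 0 and 100, and the calibration map is then obtained by linear interpolation; direct measurement is the composition of pre-measurement with this map. All numbers are illustrative.

```python
# A hypothetical sketch of calibration and its use, continuing the thermometer
# example above (it assumes the pre_measure function defined in the earlier sketch).

PUBLIC_SCALE = {"s1": 0.0, "s2": 100.0}            # Theta[s_j*] -> theta_j (degrees Celsius)
STANDARD_TEMPERATURES = {"s1": 0.0, "s2": 100.0}   # temperatures realized by the standards

# Apply the pre-measurement chain to each standard, obtaining the pairs (x_i, theta_j).
calibration_pairs = sorted(
    (pre_measure(STANDARD_TEMPERATURES[s]), PUBLIC_SCALE[s]) for s in PUBLIC_SCALE
)

def calibration_map(x_i):
    """Map a local value (position, mm) to a public value (temperature, degrees Celsius)
    by linear interpolation between the two calibration pairs."""
    (x_lo, t_lo), (x_hi, t_hi) = calibration_pairs
    return t_lo + (x_i - x_lo) * (t_hi - t_lo) / (x_hi - x_lo)

def measure(theta_a):
    """Direct measurement as the composition of pre-measurement and the calibration map."""
    return calibration_map(pre_measure(theta_a))

print(measure(25.0))  # now a public value of temperature, in degrees Celsius
```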

Thanks to the calibration map, pre-measurement finally upgrades to (direct) measurement: the instrument-specific map of pre-measurement Θ[a] → xi can be composed with the calibration map xi ↔ θj, together producing a map Θ[a] → θj, which corresponds to a Basic Evaluation Equation and therefore to the expected outcome of a measurement (again, in the simple case in which uncertainties are not taken into account). This structure is depicted in Fig. 7.10.

Fig. 7.10
A model diagram presents transduction, matching, local scale, and calibration map in series. Direct measurement is marked directly between property and public value.

The operational structure of direct measurement, as a composition of pre-measurement and calibration

Figure 7.10 is a summary version of the more explicit structure in Fig. 7.11, in which the role of the public scale for the creation of the calibration map is also shown.

Fig. 7.11
A model diagram presents transduction, matching, local and public scale application, direct measurement, and calibration mapping to produce public values.

The operational structure of direct measurement, highlighting the role of the public scale for the creation of the calibration map; note that the two transductions are performed at different times

7.3.3 An Alternative Implementation

As the diagram in Fig. 7.11 shows, in measurements like the ones performed by means of an alcohol thermometer the properties under measurement and the reference properties in the public scale are not compared directly: as a consequence, for the instrument map and the calibration map to be correctly composed, the instrument must not significantly change its behavior over time or across environments. These two conditions were characterized, in Sect. 3.2.1, in terms of instrument stability and selectivity, respectively, such that if these conditions do not hold the instrument may be in need of recalibration or repair. Under the condition that the instrument behaves in a sufficiently stable and selective way, the sequence shown in Fig. 7.10 provides an operational implementation of a direct measurement:

transduction → matching → local scale application → calibration mapping

An alternative implementation of direct measurement is possible if the instrument allows the direct comparison of the property under measurement and the public reference properties, as in the case of the measurement of weight by means of a two-pan balance,Footnote 19 in which the property under measurement, i.e., the weight of the object under measurement, is directly compared with some standard masses.Footnote 20 This process may be modeled as matching the property under measurement to the reference properties in the public scale. The operational structure of direct measurement in this case is much simpler, as depicted in Fig. 7.12.

Fig. 7.12
A model diagram from measurement property to public value has two pathways. They are via matching, public scale application, and direct measurement.

The operational structure of direct measurement, as based on the direct comparison of the property under measurement and the public reference properties
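
A minimal sketch of this alternative implementation follows, under the simplifying (and hypothetical) assumption that the object is compared with one standard mass at a time rather than with combinations of standards: matching is the direct empirical comparison, and the public scale is then applied to the matched standard.

```python
# A hypothetical sketch of direct measurement by direct comparison with public
# reference properties, as with a two-pan balance. For simplicity the object is
# compared with single standards only; real balances use combinations of standards.

PUBLIC_STANDARDS_KG = [0.1, 0.2, 0.5, 1.0, 2.0]      # reference masses s_j*
PUBLIC_SCALE = {m: m for m in PUBLIC_STANDARDS_KG}   # Theta[s_j*] -> theta_j (in kg)

def balances(object_mass_kg, standard_kg):
    """Idealized comparison: True if the two pans are (nearly) level."""
    return abs(object_mass_kg - standard_kg) < 0.05  # hypothetical resolution of the balance

def measure_mass(object_mass_kg):
    """Matching against the public references, then public scale application."""
    for standard in PUBLIC_STANDARDS_KG:
        if balances(object_mass_kg, standard):
            return PUBLIC_SCALE[standard]
    return None  # no reference matches within the resolution of the comparison

print(measure_mass(0.52))  # matched to the 0.5 kg standard, hence the public value 0.5
```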

7.3.4 The Hexagon Framework

The two operational implementations presented in the previous sections can be merged into one conceptual structure, which then presents a more implementation-neutral and therefore general interpretation of direct measurement, as depicted in Fig. 7.13, referred to henceforth in this book as the “Hexagon Framework”. This structure shows that in a direct measurement the map

property under measurement → public value

Fig. 7.13
A model diagram presents transduction, matching, local and public scale application, direct measurement, and calibration mapping as the sides of a hexagon.

The conceptual structure of direct measurement: the Hexagon Framework

Fig. 7.14
A model diagram presents measurement, transduced property, local and public reference, and values as the corners of a hexagon. Local and public scales are marked.

First symmetry in the Hexagon Framework: local scale and public scale

may be implemented, and therefore a Basic Evaluation Equation may be obtained, in two alternative ways,

transduction → matching → local scale application → calibration mapping

and

matching → public scale application

This diagram reveals some symmetries in the conceptual structure of direct measurement according to the model we are proposing:

  • there are a local scale and a public scale, connected by calibration (Fig. 7.14) Footnote 21;

  • the transduction map has an informational counterpart in the calibration map (Fig. 7.15):

    Fig. 7.15
    A model diagram presents measurement, transduced property, reference properties, and values as the corners of a hexagon. Transduction and calibration are marked.

    Second symmetry in the hexagon framework: transduction and calibration

  • measurement includes an empirical component and an informational component (Fig. 7.16):

    Fig. 7.16
    A model diagram presents measurement, transduced property, reference properties, and values as a hexagon. Empirical and informational components are marked.

    Third symmetry in the Hexagon Framework: empirical component and informational component

This structural characterization of the process of measurement is independent of the algebraic structure of the scale. As discussed in Sect. 7.4.4, the model characterizes direct measurement through empirical but not mathematical conditions, and as such this structural account applies also to non-quantitative properties (see also Mari et al., 2017).Footnote 22 Furthermore, the fact that the characterization is purely structural makes the model applicable to understanding the measurement of both physical and non-physical properties. Let us see an example.

7.3.5 An Example Application of the Model in the Human Sciences

In this section, we give an account of two aspects of the application of the Hexagon Framework in the human sciences. First, we describe how the model can be applied to a recent example from educational measurement. Second, we discuss how the perspective of the Hexagon Framework can be exploited to inform an instrument development process used in psychosocial measurement, using the same example.

The common background context for both of these is an assessment system built for a middle school statistics curriculum that leans heavily on the application of learning sciences ideas in the STEM (science, technology, engineering, mathematics) domain, called the Data Modeling curriculum (Lehrer et al., 2014). The Data Modeling projectFootnote 23 created a series of curriculum units based on real-world contexts that would be familiar and interesting to students, with the goal of making data modeling and statistical reasoning accessible to students who usually do not do well in this subject. Within specific topic areas, the curriculum and instructional practices use the idea of a construct map, as described below, to help materials developers design and develop the educational materials, to help teachers guide the design of their instructional plans in the topic to be taught, and, where the students are sufficiently mature, to help learners appreciate their own growth in knowledge and skills. The construct map presented in this example describes transitions in reasoning about data modeling and statistical reasoning when middle school students are inducted into practices of visualizing (i.e., they draw illustrations of what they think is going on), carrying out scientific practices (e.g., they pay attention to the process, being careful to record what they observe), and modeling the variations they have observed and recorded in the context they are studying (e.g., they observe how different students’ measurements of certain hard-to-measure properties of certain objects will vary, such as the heights of tall trees). In the Data Modeling curriculum, teaching and learning are closely coordinated with assessment, which is the focus here.

To make the idea of a construct map concrete for the reader, let us consider an instrument aimed at measuring student knowledge and skills—which we will call a “competence”—in a part of the Data Modeling curriculum called Modeling of Variability, to be described in detail in the next paragraph. We assume that this property of a student can vary at least ordinally, from high competence to low competence, and that the measurement developer can postulate a (finite) number of consecutive educationally distinguishable points between the extremes. This corresponds to a typical step in the development of the measurement of a property in the human sciences, where, before one has achieved the ability to carry out a full measurement process, as in the Hexagon Framework, the measurement developer has to go through a process of deriving an ordinal understanding of the property—in this case a competence of individuals—using the research literature and professional knowledge in the topic area. Quite often the property will be conceptualized as describing successive points in a typical process of change, or development, over time within an individual student, and the construct map can then be thought of as being analogous to a qualitative “roadmap” of change in the competence (see for example Black et al., 2011). In recognition of this analogy, these qualitatively different locations along the construct are called “waypoints”: it is from these qualitative descriptions that the ordering of the waypoints is derived, and this ordering is very important to the instrument development process and also to the interpretation of the measurement results. Each waypoint has a qualitative description in its own right, but, in addition, it derives meaning by reference to the waypoints below it and above it. Finally, we model this competence as a dense property, in the sense that, in principle, for any two students with distinguishable competences there can be a third student whose competence is situated between theirs.

The property under measurement. The property that we focus on for this example is the competence to deal with Models of Variability, “MoV” for short. This is a student competence at a relatively young age, between about kindergarten and Grade 5. The project’s final construct map for MoV is illustrated in Fig. 7.17, where the waypoints are symbolized by blue dots. At the low end of student development (i.e., at the bottom of the figure), the focus is on the identification of sources of variability, then advancing to the incorporation of devices to represent the mechanisms of those sources of variability; at the highest waypoint, students are also able to develop models of variability and to judge how well they work by examining how repeated model simulations relate to an empirical sample. Note that one waypoint has two labels attached to it (MoV2 and MoV3): this means that these are two different categories of student thinking that occur at the same point, or at least very similar points, in development (more about this later).

Fig. 7.17
A model diagram presents M o V 1 to 5 as points on a line from bottom to top. M o V 2 and 3 overlap. Details for each M o V are presented.

The MoV construct map (the distances shown between waypoints are arbitrary, and they are not represented as equally far apart, to emphasize that there is no assumption of “equal differences” in this diagram)

Transduction. A task like the one shown in Fig. 7.18, based on the MoV construct map, is designed to operate as a transducer that is sensitive to the student’s MoV competenceFootnote 24 and produces a complex transduced property in the form of the responses given to the questions by each student, for example an affirmative or negative selection for Question 1(a) and a written text for Question 1(b). This can then be modeled as a map from the property under measurement Θ[a] to a transduced property Xm, where Θ[a] is the MoV competence Θ of student a, as in Fig. 7.4. This task capitalizes on Data Modeling students’ experiences in learning about measuring properties of awkward objects as a process that generates variability. In particular, Question 1(a) is intended to prompt the student to take one of two positions. Some students may note that, for the given bins, the mode is the same for both displays and consider that sufficient to say “No”: these students are not able to specify the source of the variation, as they are not perceiving the spread as being relevant. Following this set-up, the later Questions, 1(b) and 2(b), explore students’ understanding of how measurement techniques could affect variation, providing the opportunity to gather evidence regarding whether a student’s MoV competence is at least at the waypoint MoV2 in the construct map. This task does not provide opportunities for evidence above MoV2, as no models of chance or chance devices are involved in the question: hence, the sensitivity of this measuring instrument to changes of MoV competence above MoV2 is zero.

Fig. 7.18
A text image presents the piano width measurement task and its results. In this task, musicians are asked to measure the width using a 15 centimeter ruler and a meter stick. Two graphs depict the group measurement counts. Explanations for differences in results are required as answers.

The Piano Width task

Matching. Particularly if the task is presented to students on paper, so that they write their responses by hand, the response produced by each student must be matched to the local (i.e., instrument-related) reference properties Xi* for the item, and this can be modeled as a map Xm → Xi*, as in Fig. 7.5. In the simplest cases, as for Question 1(a), the transducer is designed to discriminate between two properties, X1* and X2*, corresponding to two alternative states of selection, as shown in Fig. 7.19. The matching may be performed by a human rater or by a mark sense reader. In practice, it may be that some responses are not uniquely classifiable by the local references, such as when a student makes a mark that is not clearly in one box rather than another, which might lead to the addition of a third reference property, corresponding to such an “unclear” situation. Furthermore, for some open-ended questions the very possibility of defining a set of reference properties to classify responses could be problematic: in these cases the process must include a well-documented and monitored system for judging the open-ended responses (see, for example, Wilson, in press: Chaps. 4 and 7) in order to fulfill the minimal conditions this Framework sets for considering it a measurement.

Fig. 7.19
A text image presents two sets of Yes and No. Yes and No is selected in the left and right side sets respectively.

The local references for the example: X1* (left) and X2* (right)

Local scale construction and application. The local reference properties are still empirical entities, being for example patterns which may be read on a sheet of paper. However, the fact that they are preliminarily identified and guaranteed to be distinguishable in the instrument operation allows us to associate them with informational entities. A map Xi* → xi from local reference properties Xi* to local values xi is then defined, as in Fig. 7.6. Typically, for each dichotomous item the patterns corresponding to the correct response are scored as 1 and the patterns corresponding to the incorrect response are scored as 0. Once such a local scale has been constructed, whenever one of its reference properties Xi* is indicated by a student’s response, the corresponding value xi is recognized by applying the scale. As is commonly said, the student’s response is “scored”, and this is the conclusion of the pre-measurement, as depicted in Fig. 7.7. For more complex open-ended responses, the matching process involves the development of a scoring guide for each item, including explanations of what a typical response at each waypoint must contain, as well as well-chosen exemplars, and a training scheme for judges (see Wilson, 2005, or Wilson, in press, for illustrations of these). For example, the following response to Question 1(b)

When we used the ruler, there were more mistakes (more gaps and laps) when we switched to the meter stick, there were fewer mistakes. So the measurements with the ruler are more spread out than the measurements with the meter stick.

would be mapped to MoV2.

Interlude: reality check. The theory and practice of MoV competence is not sufficiently advanced that a one-item instrument such as Item 1(a) used above could dependably provide information of sufficiently high quality for most intended purposes. Hence, good instrument design demands a strategy where a property such as the MoV competence is sampled across important facets, and this can only be accomplished by using multiple items (Wilson, in press). A more realistic situation is then that instrument developers would create a set of K items Im = {Imk}, where k = 1, …, K, each of them designed to interact with a facet of the MoV competence of a student. Then, once the student has responded to the items in the test, the transduced property is the vector Xm = (Xm1, …, XmK) of the responses to the set of items. These items may be designed so that some responses are indicative of more sophisticated understanding of variation, i.e., higher Θ, and other responses are indicative of less. The local values for the MoV competence of student a can then be gathered into a vector xi = (xi1, …, xiK) of 1s and 0s of length K, which might be summarized, as before, by taking the sum of the K values,Footnote 25 referred to as “sum-scores” or “total scores”, as indeed has been done in psychology and education for over a century within the framework of classical test theory (e.g., Nunnally & Bernstein, 1994: pp. 215–247, 308–310).
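
As a small illustration of this summary step, the following sketch gathers hypothetical scored responses into vectors of 0s and 1s and reports their sum-scores; the student identifiers and scores are invented.

```python
# A hypothetical sketch of summarizing local values as sum-scores, in the spirit
# of classical test theory: each response has already been matched and scored 0 or 1.

scored_responses = {
    "student_a": [1, 0, 1, 1, 0, 1],  # x_i = (x_i1, ..., x_iK), here with K = 6
    "student_b": [0, 0, 1, 0, 0, 0],
}

def sum_score(local_values):
    """Summarize the vector of local values by its sum (the 'total score')."""
    return sum(local_values)

for student, values in scored_responses.items():
    print(student, sum_score(values))
```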

Public scale construction and application, and calibration. The local values obtained by the test are not broadly useful other than when using that specific set of items Im, as the relationship between the local values for this set and the local values for a different set of items is not generally known. As long as the comparability of results can remain local to the context in which the same instrument is used (e.g., within a single teacher’s classroom), this may not be a problem, but a further step would be needed to equip these local values for public usage, through the definition of a public scale of MoV against which different MoV tests could be calibrated. As Fig. 7.8 shows, in principle this requires identification of an appropriate set of public reference properties, and therefore of publicly available MoV waypoints.

Thus, for each student represented by their local values in the data set, an estimate of the location of their MoV competence on the public scale is obtained, corresponding to a Basic Evaluation Equation, at least for this initial stage of the instrument development. The interpretation of this estimate is aided by using a chart sometimes referred to as a “Wright map”. This capitalizes on the most important feature of the output from an analysis using the Rasch model: the estimated locations of the respondents on the property described by the construct map are on the same scale as the estimated locations of the categories of item responses. This allows one to relate the empirical findings about the items to the educational hypotheses embodied in the construct map. Such a feature is crucial for both the measurement theory and measurement practice in a given context:

  (a)

    in terms of theory, it provides a way to empirically examine the waypoints, and adds this as a powerful element in studying the validity of use of an instrument;

  (b)

    in terms of practice, it allows the measurers to “go beyond the numbers” in reporting measurement results to practitioners and consumers, and equips them to use the construct map as an important interpretative device.

As reported in Wilson and Lehrer (2021), data were collected from a sample of about one thousand middle school students from multiple school districts and analyzed to calibrate the MoV items, by fitting a partial credit model (a one-dimensional Rasch-family item response model; Masters, 1982) to the item responses. The resulting Wright map for this analysis is shown in Fig. 7.20: looking from left to right, the first column shows the logit scale, the second shows the distribution of the students as a histogram (in horizontal orientation), running from lowest ability at the bottom to highest at the top, and the raw sum-score approximately corresponding to the histogram bars is shown next; the next five columns show the thresholdsFootnote 26 for each of the MoV items, separated out into those that relate to each waypoint, with the waypoints indicated along the bottom of each column.Footnote 27

Fig. 7.20
A wright map for M o V 1 to 5 depicts piano and buildings 2 and 4, rock, soil, and models 1 to 3. The highest and lowest total scores are 178 and 1.

A Wright map for MoV competence (the location of the threshold for each item is represented as i.k, where i is the item number and k is the item score; for example, 9.2 is the second threshold location for item 9, i.e., the threshold between the scores 0 and 1 compared to scores 2 to 4, the maximum for item 4)
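
The reported analysis used a partial credit model; as a simplified illustration of the shared logit scale for persons and items, the following sketch uses the dichotomous Rasch model, with hypothetical item difficulties and responses, and estimates a student location by maximizing the likelihood over a grid of candidate locations.

```python
# A simplified, hypothetical sketch: the dichotomous Rasch model (the partial
# credit model used in the reported analysis generalizes it to polytomous items).
# Person locations and item difficulties lie on the same logit scale.

import math

def p_correct(theta, delta):
    """Rasch model: probability of a correct response, with person location theta
    and item difficulty delta, both in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def log_likelihood(theta, difficulties, responses):
    """Log-likelihood of a response vector of 0s and 1s, given the item difficulties."""
    ll = 0.0
    for delta, x in zip(difficulties, responses):
        p = p_correct(theta, delta)
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

difficulties = [-1.2, -0.4, 0.3, 0.9, 1.5]  # hypothetical calibrated item difficulties (logits)
responses = [1, 1, 1, 0, 0]                 # one student's hypothetical scored responses

grid = [i / 100.0 for i in range(-600, 601)]  # candidate locations from -6 to +6 logits
theta_hat = max(grid, key=lambda t: log_likelihood(t, difficulties, responses))
print(round(theta_hat, 2))  # estimated student location, on the same scale as the items
```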

The last column on the right-hand side shows the banding of the Wright map. Note that, for each item, a waypoint is represented by a single threshold location on the logit scale, and hence the set of thresholds for each waypoint, gathered across items, has multiple locations: this raises the issue of how best to represent the waypoint. The solution is to use the range of that set of locations to represent each waypoint, and we call that range the band for that waypoint. This marks the transition from an ordered qualitative set of waypoints on the construct map to an empirical “band” on the logit scale. Thus, in Fig. 7.20, parts of the logit scale (the bands) are delineated that correspond to the sets of thresholds for the waypoints. Due to the uncertainties and many influencing properties inherent in the item design process, it is not always possible to do so in a completely non-overlapping way, and it can be seen here that, although the bands are mainly inclusive of the relevant thresholds, some thresholds are not contained in their respective bands (e.g., thresholds 7.1 and 1.3).
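
A sketch of the banding step, with hypothetical threshold locations: each waypoint's band is taken as the range of its threshold locations, and overlaps between consecutive bands are flagged, illustrating why a completely clean banding is not always possible.

```python
# A hypothetical sketch of banding a Wright map: each waypoint's band is the range
# of the threshold locations (in logits) associated with it; overlaps between
# consecutive bands are flagged. All threshold locations are invented.

thresholds_by_waypoint = {
    "MoV1":      [-1.8, -1.5, -1.2],
    "MoV2/MoV3": [-0.9, -0.3, 0.1, 0.4],
    "MoV4":      [0.2, 0.8, 1.1],   # 0.2 sits below the top of the MoV2/MoV3 band
    "MoV5":      [1.4, 1.9],
}

bands = {wp: (min(locs), max(locs)) for wp, locs in thresholds_by_waypoint.items()}

waypoints = list(bands)
for lower, upper in zip(waypoints, waypoints[1:]):
    if bands[lower][1] > bands[upper][0]:
        print(f"bands for {lower} and {upper} overlap: {bands[lower]} vs {bands[upper]}")
```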

Thus, the MoV competence Θ[a] under measurement can now be associated with public values θj via the Basic Evaluation Equation using the scale based on, and interpreted through, the public references Θ[sj*], corresponding to the levels in the construct map in Fig. 7.17.

Conceptualizing the BEAR Assessment System using the Hexagon Framework. In the remainder of this section we discuss how the perspective of the Hexagon Framework can be exploited to inform an instrument development process used in psychosocial measurements. Our aim is to show how the Hexagon Framework can be seen as consistent with, and hence underlying the logic of, a specific instrument design approach that is used in the psychosocial sciences, the BEAR Assessment System (BAS: Wilson, 2005; in press). Thus, the BAS is described here as an overlay on the Hexagon Framework: we see this as an illustration of how the Framework can be used not only as a philosophical foundation for measurement, but also as a basis for other useful scientific conceptualizations and models.

The Data Modeling project used the BAS to develop an instrument to measure MoV competence. The BAS uses four “building blocks” to develop an instrument: (a) the construct map, (b) the items design, (c) the outcome space, and (d) the Wright map. These building blocks are used in an iterative sequence during instrument development. For a more detailed account of an instrument development process that works through these building blocks, see Wilson (2005; in press). Below, each of these building blocks is considered in turn with respect to MoV competence and related to the Hexagon Framework.

The initial construct map for the MoV competence is similar to the final one, shown in Fig. 7.17, but for one important difference, which is elaborated on below.

During the Data Modeling project, the typical work-pattern was to (a) develop instructional materials and practices based on the current idea about MoV (e.g., starting with the initial ideas about the construct map based on the literature), (b) try them out in classrooms, and with input from the teachers involved, and (c) revise them, including updating the construct map. This was repeated over several different iterations in different classrooms with different teachers. At the initial stage, the waypoints were ordinally related to one another. The next few steps lay out how the project team worked to find a quantitative relationship between them. The waypoints in the construct map are merely labeled here: the interested reader can find detailed descriptions in Wilson and Lehrer (2021).

As should be clear from the discussion above, the place of the construct map in the Framework is as a description of the property under measurement (see the construct map ellipse in Fig. 7.21), and as such it is analogous, in the example of the development of thermometers, to a sequence of fixed points of temperature, like the freezing point and the boiling point of water.

Fig. 7.21
A hexagon framework presents properties, public and local references, and values. The construct map, Wright map, items design, and outcome space are encircled.

The BEAR Assessment System (BAS) building blocks overlaid on the Hexagon Framework

The next step in instrument development under the BAS is the items design, which involves developing ways in which the property described by the construct map could be manifested via an empirical situation: this is the transduction in the Hexagon Framework (see the items design ellipse in Fig. 7.21). For example, the Data Modeling items often began as everyday classroom experiences and events that teachers have found to have a special significance in learning of variability concepts. Example items are shown in Fig. 7.18, in the Piano Width task.

The next step under the BAS is the development of the outcome space, which involves the step from the classification of the population of potential student responses to the assignment of scores for use in the statistical analysis to follow. To facilitate this, the project collected a sample of student responses during instrument development to support this step. Thus, this is the equivalent in the Hexagon Framework of the steps from the transduced properties to the local values (see the outcome space ellipse in Fig. 7.21).

The next step in this developmental process is to relate these scores back to the construct map. This is initiated through the fourth BAS building block, the Wright map (see the Wright map ellipse in Fig. 7.21). As mentioned above, a statistical calibration model is used to analyze the resulting data.

The most striking feature of the banded Wright map in Fig. 7.20 is that the waypoints for MoV2 and MoV3 were found to occupy the same segment of the logit scale. In the initial investigations of this Wright map, it was found that the thresholds for the waypoints MoV2 and MoV3 were thoroughly mixed. A large amount of time was spent exploring this, both quantitatively, using the data, and qualitatively, examining item contents and talking to curriculum developers and teachers about the apparent anomaly. The conclusion was that for these two waypoints, although there is certainly a necessary hierarchy to their lower ends (i.e., there is little hope for a student to successfully use a chance-based device such as a spinner to represent a source of variability (MoV3) if they cannot informally describe such a source (MoV2)), these two waypoints can and do overlap quite a bit in the classroom context. Students are still improving on MoV2 when they are initially starting on MoV3, and they continue to improve on both at about the same time. Hence, at least formally, while it was decided to uphold the distinction between MoV2 and MoV3 in terms of content, it also seemed best to ignore the difference in difficulty of these waypoints, and to label the segment of the scale (i.e., the relevant band) as both “MoV2” and “MoV3”, as in the final construct map shown above in Fig. 7.17. This is an example of how the empirical findings during the measurement development process can lead to modifications of the understanding of the property under measurement, and it is also an illustration of the iterations of the BAS: in effect, the researchers involved in the project changed the definition of the property under measurement as they worked around the Framework!

Thus, the building blocks of the BAS can be seen as an overlay on the Hexagon Framework, and hence as incorporating the logic of the Hexagon Framework. The logit scale is a representation of the property under measurement in this situation, and the banded Wright map is the feature that affords the possibility of a criterion-referenced interpretation of the students’ estimated locations on the logit scale in terms of the MoV waypoints, which serve as the public reference properties for MoV competence.

The account above has traced the progress of the instrument development through just one iteration of the BAS, but in fact this iteration was embedded in several others. There were several initial partial circuits around the BAS in which the measurement developers worked closely with the curriculum developers and the co-operating teachers to develop not only the items and the outcome space, but even the waypoints themselves. In addition, a final run-through of the data collection and a second analysis were needed to provide more items for the instrument and a better, more representative sample of the student population.

The idealized model of direct measurement described in the preceding sections needs to be generalized to the more realistic case in which uncertainties are taken into account.

7.4 Measurement Quality According to the Model

As operationalized by the Hexagon Framework, the Basic Evaluation Equation needs to be augmented with an assessment of the quality of the information conveyed by measurement. As discussed in Sect. 7.3.2, this role is played by measurement uncertainty, which is inversely related to information quality: the better the quality, the lower the uncertainty. By taking uncertainty into account, the model of direct measurement introduced above is improved and generalized. In the human sciences, the quality of measurement is usually assessed in terms of validity (see Sect. 7.4.3) and reliability (see Sect. 3.2.1): while reliability is usually assessed via a quantifier, this is seldom the case for validity, and when it is, only some components of what is generally termed “validity” are quantified.

The acknowledgment of the structural, and not only operational, importance of measurement uncertainty is relatively new in physical metrology: “the need to find an agreed way of expressing measurement uncertainty in metrology” was stated in the Recommendations issued by the International Committee of Weights and Measures (CIPM) in 1980–81 (quoted in JCGM, 2008). In the human sciences, the need to investigate the validity of the measurement has been a basic element of measurement practice since the early twentieth century (see Sect. 7.4.3). This can be interpreted as a revision of the basic black box model: given an input property, a measurement is expected to produce not only one or more values but also some information on the quality of the information that such values provide on the measured property.

As mentioned, until recently, measured values were reported together with estimates of measurement errors. The relation between error and uncertainty in measurement is complex: uncertainty has sources that are not what would traditionally be described as errors, as in the case of definitional uncertainty, and some errors could be known only with some uncertainty; hence an error not only may generate uncertainty but may also have its own uncertainty.Footnote 28 More importantly, the emphasis on uncertainty is a result of a conceptual shift in the recent metrological literature from a purely empirical to a model-based approach, incorporating both empirical and informational interpretations of measurement. As a consequence, the central concept of measurement science is arguably no longer the “true value” that exists independently of measurement and that would be obtained by an error-free empirical process. While measurement is still sometimes characterized as a process aimed at estimating the true value of a property (a prominent example is in Possolo, 2015: p. 12), the very idea of a measured property having an inherent true value requires clarification as soon as the unavoidable role of models in measurement is accepted. For example: does the true value of a property change if the model of the property or the model of the measurement changes? Is it therefore a true-in-a-model value? Does the true value remain unique also in the presence of non-null definitional uncertainty? Or are there in this case multiple true values? See the related discussion in Box 6.1.

Hence, in what follows we do not deny in principle the hypothesis that properties have a true value, but neither do we rely on it: instead, we attempt to provide an encompassing standpoint which should be understandable and acceptable independently of this hypothesis. Like the rest of this chapter, and like most of this book, what follows may be read as starting from the VIM definition of <measurement> as a process of reasonable attribution of values to properties of objects (JCGM, 2012: 2.1) and as aimed at establishing sufficiently well-defined criteria of such reasonableness. An appropriate characterization of measurement uncertainty plays a key role in service of this goal. What follows may be interpreted as a reconsideration of the basic components of measurement uncertainty, as introduced in Sect. 3.2.4, in light of the model presented above. But, first of all, we need to reconsider the Hexagon Framework and expand it in order to take into account the possibility of feedback in the interaction with the measuring instrument.

7.4.1 Measurement that Involves Feedback

We have assumed so far that the interaction between the object under measurement and the measuring instrument, as realized in the transduction stage (see Sect. 7.3.2), is unidirectional: the interaction changes the state of the instrument, and therefore the transduced property Xm, as modeled by the map Θ[a] → Xm. In a more general case, however, the interaction produces a change also in the state of the object under measurement. In particular, when the object under measurement is a human being (or a set thereof), as is usually the case in the human sciences, the object under measurement may be aware of being an object under measurement. In this circumstance, not only might interaction uncertainty become a critical component of the uncertainty budget, but the structure of the process itself becomes more complex due to the presence of one or more feedback loops.

Three structural cases may be identified, as follows.

First case (no feedback): these are measurements in which the interaction with the measuring instrument does not induce a change in the measurand (an obvious example is the measurement of the spectral density of the radiation emitted by a star: of course, the state of the star is not affected by the operation of the spectrometer). In this case there is no feedback in the process, and therefore there are no problems in objectivity resulting from the one-way interaction.

Second case (non-oriented, random feedback): these are measurements in which the interaction with the measuring instrument induces a random change in the measurand, for example due to an uncontrolled transfer of energy between the instrument and the object under measurement. In this case a non-oriented feedback is present in the process: in usual conditions, the measurement trueness is not affected by this loop (in other words, no systematic errors arise from the interaction), but some problems in objectivity may arise due to insufficient measurement precision, revealed by large random errors.

Third case (oriented, non-random feedback): these are measurements in which the interaction with the measuring instrument induces a non-random change in the measurand (which is called a “loading effect” in the context of electrotechnology, for example). Whenever such a change is identified and modeled, typically as a systematic error (or as a bias, i.e., the estimate of systematic error; JCGM, 2012: 2.18), its effects may be experimentally minimized or mathematically corrected. As in the previous case, this situation of oriented feedback generally results in problems in objectivity. In the human sciences, this is the context in which the so-called “Hawthorne effect” (Landsberger, 1958) arises, in which individuals alter their behavior due to their being aware of being observed. An undesired consequence of this effect is summarized in what is sometimes called Goodhart’s law: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes” (Goodhart, 1981).Footnote 29 A related phenomenon in educational measurement is sometimes called “teaching to the test”, when the awareness of educators and their students of being tested (and the consequences thereof) alters the nature of teaching and learning. A peculiar consequence is that, at least in the short term, measurement-like activities may be exploited as managerial tools, leading individuals to change their behaviors only because they are informed that some measurements will be performed on them, even if the measurements are ultimately not utilized or even performed, in which case the feedback loop does not actually include the measuring instrument.

7.4.2 Uncertainties in the Stages of Direct Measurement

Our updated account of measurement uncertainty begins with the consideration of the preliminary stages of any actual measurement: the definition of the measurand and the calibration of the instrument.

Regarding the definition of the measurand. Measurement is aimed at acquiring the information required to assert a Basic Evaluation Equation Θ[a] = θ, i.e., the equality of the property Θ of the object a and the public value θ (see the analysis in particular in Sects. 5.1.1 and 6.4). In the simplest condition, the measurand is defined as the property which interacts with the instrument and triggers the transduction, so that, in a (tautological) sense, we measure what the instrument measures (this corresponds to what in Box 2.3 is called an operational strategy of measurand definition). This might be acceptable when the information acquired by measurement is required only in the specific time and place in which the measurement is performed, and thus the model of the measurand can remain implicit: the intended property is the same as the effective property, and nothing else needs to be specified. In the example of the measurement of the temperature Θ of an object a, defining the measurand simply as the temperature that is transduced by the thermometer implies that one can assume that neither the thermal inhomogeneities of the object (different points of the object might have different temperatures) nor the environmental conditions (e.g., the temperature of the object might be affected by the pressure or humidity of the environment) are relevant to the information to be reported.

However, in scientific measurement, a Basic Evaluation Equation is generally expected to convey widely transferable information. Values, on the right-hand side of the equation, can in principle be interpreted in the same way everywhere and always thanks to their metrological traceability (see Sect. 3.3.1). Analogously, measurands, on the left-hand side of the equation, should be interpretable beyond the here-and-now situation. This is accomplished by explicitly defining the measurand, by means of a model which identifies the measurand by description instead of by purely indexical means, and therefore by taking into account the possible differences between the measurand and the property that produces the transduction: the information is empirically acquired about the effective property but is reported about the intended property, and a purpose of the model is to establish a connection between these two. Sometimes these differences can be considered explicitly, if the model has a mathematical form in which a value of the intended property is calculated as a function of both the effective property and the appropriate corrections, as when the measurand is the temperature of an object in given environmental conditions but the temperature is measured in different conditions, and there is a known law connecting such environmental conditions to the property under measurement.

But, as usual, there is a price to be paid for improving the transferability of the measurement information: the greater the specificity of the information, the greater its uncertainty, which in this case is uncertainty about the definition of the measurand, what the VIM calls the definitional uncertainty (JCGM, 2012: 2.27) (see Sect. 3.2.4). Thus, ignoring the distinction between the intended and the effective property amounts to the elimination of definitional uncertainty from the model. For example, considering the temperature of water in a container, the effective property is the temperature of that part of the water with which the thermometer interacts, in the context of the, possibly unknown, conditions of the water in the container at the time of the interaction. However, the measurand could be defined by a specification of the conditions of the object (i.e., the water) and the environment (i.e., the container and the surrounding space), for example by assuming that the water is thermally uniform and the measurement takes place at a given environmental pressure. Assuming this makes the information more transferable, but at the cost of non-null definitional uncertainty: we must take into account the differences between the specified conditions (i.e., “the water is thermally uniform and the measurement takes place at a given environmental pressure”) and the actual conditions of the interaction of the object and the measuring instrument.Footnote 30 The place of definitional uncertainty in the Framework is depicted in Fig. 7.22. As noted in Sect. 3.2.4, there are many types of definitional uncertainty in the context of measurement in the human sciences (sometimes referred to as “threats to validity”). One particular threat is construct underrepresentation: this is where the effective property is less complex or rich than the intended property. An example in the case of RCA would be where the property is considered to pertain to comprehension across paragraphs or texts, but the transducers (i.e., the reading comprehension items) are instead always focused on specific words or phrases within a sentence.

Fig. 7.22
A hexagon framework presents properties, public, and local references, and values. Property and public value are linked via the intended property in middle.

An extension of the Hexagon Framework including the distinction between intended property and effective property, and the place (highlighted by the gray ellipse) of definitional uncertainty in it; the mapping labeled “direct measurement” is the operationalization of a Basic Evaluation Equation

Regarding the definition and dissemination of the public scale and calibration. We assume that at a given time the instrument has been calibrated against some reference objects, i.e., measurement standards, with reference properties Θ[sj*] and corresponding values θj. This requires, first, that the public scale Θ[sj*] → θj has been effectively disseminated, from the primary realization of the definition of the reference properties (and therefore of the unit, in the case of quantities), via a metrological traceability chain. Thus, along the chain and across contexts, inaccuracies and instabilities may affect the reproduction of the reference properties in such a way that the properties of the primary standards may differ from the properties Θ[sj*] of the working standards. This leads to uncertainties in the public scale Θ[sj*] → θj used for the instrument calibration. Moreover, calibration requires some empirical processes to be performed, i.e., transduction and matching, in which influence properties may affect the instrument indication and therefore the construction of the local scale Xi* → xi. These uncertainties in both the public scale and the local scale combine to form a calibration uncertainty affecting the calibration map xi ↔ θj, as depicted in Fig. 7.23. As noted above, the creation of public reference objects in the human sciences may be based on the means of sum-scores of specified groups (e.g., Grade 6 students in a given school system taking an RCA test). This can lead to calibration uncertainty if the means change over time and these changes are not accounted for in the public scale, as with the Flynn effect (Flynn, 1987).

Fig. 7.23
A hexagon framework presents the links between effective, transduced properties, public, primary and local references, and values. The bottom right corner is encircled.

The Hexagon Framework including the definition of the primary scale and its dissemination, and the place (highlighted by the gray ellipse) of calibration uncertainty in it

Regarding transduction and matching. As with any empirical process, the specifications in the measurement procedure will not be entirely realized when the measuring instrument is operated. This is due, first, to the unavoidably limited stability and selectivity of the transducer. Moreover, there could be errors in matching the transduced property Xm with respect to the reference properties Xi* in the local scale (e.g., the so-called “reading errors” in the case of instruments with analog scales). In addition, the local scale could also be affected by instabilities, resulting in a time-dependent mapping Xi* → xi, which is only uncertainly known. These issues lead to a non-null instrumental uncertainty. One example for RCA would be where the RCA test is simply mis-scored, whether by human or machine.

Finally, the fact that the object under measurement must somehow interact with the transducer may induce an unwanted change in the state of the object, which in turn produces an interaction uncertainty, as depicted in Fig. 7.24. An example of this in RCA testing would be where the text passage that was used contained elements that affected some individual readers (in either positive or negative ways), altering their responses to items.

Fig. 7.24
A hexagon framework presents the links between effective, transduced properties, public, primary and local references, and values. The top left corner is encircled.

The place (highlighted by the gray ellipse) of instrumental uncertainty and interaction uncertainty in the Hexagon Framework

In summary, the different components of the measurement model invoke different aspects of measurement uncertainty, as depicted in Fig. 7.25. As discussed in Sect. 3.2.5, where they can be quantified, these components can be gathered into an uncertainty budget.

Fig. 7.25
A model diagram presents property and public scale on the left, measurement in the middle, and results on the right. Uncertainties at all stages are marked.

An updated black box model, now including the components of measurement uncertainty (an updated version of Fig. 3.3)
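
Where the components can be quantified as standard uncertainties and can reasonably be treated as uncorrelated, they may be combined in quadrature, in the spirit of the GUM; the following sketch uses hypothetical values.

```python
# A hypothetical uncertainty budget: quantified components, assumed to be
# uncorrelated standard uncertainties in the unit of the measured value,
# are combined in quadrature (root sum of squares).

import math

budget = {
    "definitional": 0.05,
    "interaction":  0.02,
    "instrumental": 0.10,
    "calibration":  0.04,
}

combined = math.sqrt(sum(u ** 2 for u in budget.values()))
print(f"combined standard uncertainty: {combined:.3f}")
```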

7.4.3 Quality of Measurement as Objectivity and Intersubjectivity

The overall understanding of the quality of measurement and its results may be characterized in terms of the two basic features, which we have called object-relatedness and subject-independence (Mari et al., 2012; Maul et al., 2018). We introduce them here and show how, in the context of the Hexagon Framework, they are related to the different kinds of measurement uncertainty discussed in the previous section.

Object-relatedness (“objectivity” for short) is the extent to which the conveyed information is about the measurand, i.e., the intended property, and nothing else. According to the model we have presented, the problem of objectivity of measurement and its results is threefold.

First, considering definitional uncertainty, empirical properties are interrelated because they are mutually dependent, so that the measurand depends on other properties, i.e., the affecting properties, as discussed in Sect. 7.2.2. Since the information produced by measurement is supposed to be transferable, i.e., usable not only in the time and place in which it was obtained, and not all the relevant properties might be known at that moment, the issue arises of defining the measurand in a sufficiently specific way for making the information transferable without losing the reference to the measurand. Definitional uncertainty is the related component of uncertainty, and where it is quantifiable, it can be incorporated into the uncertainty budget.

Second, considering interaction uncertainty, while the object under measurement needs to change the state of the measuring instrument—thus resulting in a transduced property – for a measurement to take place, the opposite effect also sometimes happens, with the consequence that the object under measurement also changes its state, and then possibly the property under measurement changes in turn, due to its interaction with the instrument, as discussed in Sect. 7.4.1. The result is a loss of objectivity, which may be quantified by interaction uncertainty.

Third, considering instrumental uncertainty, the measuring instrument is generally sensitive not only to the measurand but also to other properties (the influence properties, as discussed in Sect. 7.2.2), with the consequence that its output depends also on such properties: since the information produced by measurement is supposed to be usable independently of the instrument by which it was obtained, the issue arises of characterizing the instrument behavior in a sufficiently specific way for making it possible to extract information on the measurand by filtering out the spurious information generated by influence properties. In Sect. 3.2.1 the metrological behavior of an instrument is characterized in terms of its accuracy, and then more specific features such as trueness and precision. When reported in terms of measurement results, this component of objectivity may be quantified by means of instrumental uncertainty.

Subject-independence (“intersubjectivity” for short) takes into account the goal that the conveyed information be interpretable in the same way by different persons in different places and times. This requires that the information produced by measurement is reported in a way that is independent of the specific context and only refers to universally accessible entities, so that in principle its meaning can be unambiguously reconstructed by anyone. Metrological systems, including quantity units realized by measurement standards disseminated through traceability chains, are developed and maintained to fulfill this requirement. The appropriate calibration of the measuring instrument guarantees the metrological traceability of the information it produces, and therefore the condition of intersubjectivity. Calibration uncertainty, which includes all uncertainties related to the definition of the public scales and their realizations in the measurement standards in the traceability chain, is then what may quantify intersubjectivity.

The characterization of measurement in terms of objectivity and intersubjectivityFootnote 31 is relevant for both users and designers:

  • users are generally interested in the information, not the way it is produced; from the user’s point of view, objectivity and intersubjectivity are features of the products of the process, i.e., of measurement results;

  • designers are interested in the way the information may be produced; from the designer’s point of view, objectivity and intersubjectivity are features first of all of the process, i.e., of measurement, and are then inherited by its products.

Hence, we can see objectivity and intersubjectivity as features of both the process (i.e., measurement) and the products (i.e., measurement results).Footnote 32

As they have been characterized, objectivity and intersubjectivity are embedded in measuring systems: in other words, measuring systems are designed, set up (including via their calibration), and operated so as to be able to produce information with the expected degree of objectivity and intersubjectivity, i.e., so as to be able to produce measurement results with a measurement uncertainty which is less than the target uncertainty, the “upper limit [of uncertainty] decided on the basis of the intended use of measurement results” (JCGM, 2012: 2.34). This highlights the pragmatic nature of measurement: what counts as high or low quality is relative to the purpose of the measurement; if a comparatively lower quality instrument provides results of sufficient accuracy, using it could lead to still acceptable (and cheaper) measurements.

7.4.4 Can Measurement Be “Bad”?

According to the characterization we have just proposed, objectivity and intersubjectivity are independent features: something can be objective but not intersubjective (as might happen in the case of the usage of an uncalibrated measuring system), or vice versa (as when the result of an evaluation is expressed in the customary format for the values of quantities, i.e., number times unit, but was obtained through a badly flawed measurement). Together they identify the two dimensions of the quality of measurement: the possibility of obtaining information about empirical properties, and the possibility of socially reporting such information.Footnote 33 It is thus through their objectivity and intersubjectivity that measurement results are considered to be of good quality. Since objectivity and intersubjectivity are not Boolean (i.e., yes–no) conditions, in a given operational situation one could set a threshold of minimum acceptable objectivity and intersubjectivity, aimed at guaranteeing that the results of measurement will be useful for their intended use. This highlights the pragmatic nature of measurement: the same measurement results might be considered good for some purposes and bad for some others. Hence, objectivity and intersubjectivity are features of good measurements, not of measurement as such. This allows <bad measurement> to be an acceptable concept—i.e., not all measurements are good—where “bad” is meant as <not sufficiently objective and intersubjective according to the given purposes of the measurement>.Footnote 34

The objectivity and intersubjectivity of measurement results may be interpreted as their overall “degree of quality”, which is (inversely) specified and quantified by measurement uncertainty: a good measurement produces measurement results whose uncertainty is (see the sketch after this list):

  • not less than the definitional uncertainty of the measurand (again, a measurement uncertainty less than the definitional uncertainty corresponds to a waste of the resources devoted to designing and performing the measurement), but

  • less than the specified target uncertainty (a measurement uncertainty greater than target uncertainty corresponds to a useless measurement).
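
A minimal sketch of these two conditions, with hypothetical values:

```python
# A hypothetical sketch of the two conditions above: a measurement result is
# fit for a given purpose when its uncertainty is not less than the definitional
# uncertainty and not greater than the target uncertainty.

def acceptable(measurement_u, definitional_u, target_u):
    return definitional_u <= measurement_u <= target_u

print(acceptable(measurement_u=0.12, definitional_u=0.05, target_u=0.20))  # True
print(acceptable(measurement_u=0.30, definitional_u=0.05, target_u=0.20))  # False: useless for the purpose
print(acceptable(measurement_u=0.02, definitional_u=0.05, target_u=0.20))  # False: resources wasted
```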

An emphasis on sufficient objectivity and intersubjectivity for a given purpose is then operationally useful, for the general guidelines it provides regarding the design and performance of measurements (e.g., in Petri et al., 2015), but it is still too specific at least in one respect: it would assume that measurement is always good measurement. While pragmatically this is sound—if we know that what we are doing is a bad measurement we (hopefully) avoid doing it—the concept ‘bad measurement’ as such is not contradictory, and bad measurements do not fulfill the condition of sufficient objectivity and intersubjectivity. In other words, in order to maintain the VIM’s characterization of “reasonableness”, objectivity and intersubjectivity are useful but still not sufficient: some other conditions have to be identified. This, among other things, is discussed in the next chapter.