1 Introduction

Why is mathematics so remarkably effective in the natural sciences, and what might the social sciences have to learn from the way it is used in those sciences? Is the effectiveness of mathematics in the natural sciences truly “unreasonable,” as Wigner (1960) put it? Previous research on Maxwell’s foundational contributions shows the effectiveness of mathematical model-based reasoning to be rooted in everyday thinking (Nersessian, 2002), and not in any special capacities associated with scientists or their objects of study. Rasch’s (1960) adoption of Maxwell’s method of analogy (Nersessian, 2002) set the stage for extending the effectiveness of mathematics into the social sciences (Fisher, 2010; Fisher & Stenner, 2013).

Journal of Physics: Conference Series 1044. 2018

In our ongoing explorations of the ways the natural sciences and social sciences invoke, define, and engage in measurement, we have identified a number of differences that are not as epistemologically necessary or predetermined as is popularly imagined (Stenner & Smith, 1982; Stenner et al., 1983, 2006, 2013; Williamson et al., 2013; Burdick et al., 2006; Stenner & Stone, 2010; Fisher, 2009; Fisher & Stenner, 2016). We have, to some benefit, contrasted human temperature thermometry (NexTemp thermometers; Medical Indicators Inc, 2006; see Appendix A) with the testing of mathematical ability and the measurement of English language reading ability. Although cataloging these differences has been useful, we now believe they are all traceable to a common cause. Physical science measurement, virtually without exception, takes place in the context of well-developed substantive theory, experimental evidence, and instruments calibrated to uniform unit standards. In the natural sciences, theories are not just compelling stories about the relationships between measurement outcomes (such as the count of cavities turning black on a NexTemp thermometer), unit standards (degrees Celsius), and measurement mechanisms (a chemical specification equation). They are instead sufficiently elaborated and precise in their specifications that they can be used to calibrate instrumentation.

In contrast, throughout the behavioral and social sciences, instrument calibration depends on data, is typically devoid of theory, and is not traceable to a unit standard. We hypothesize that most of the observed differences between behavioral and physical science measurement are traceable to these foundational differences. The absence of theory is the primary determinant of the need for data-based calibration and the lack of efficient methods for defining units and traceability to them. Further, we offer an example of a theory-referenced reading and text measurement system in the educational sciences that exhibits key theoretical, experimental, and instrumentation features analogous to those of human thermometry. Finally, we review the affordances shared by human thermometry and reading measurement (Cano et al., 2016).

2 A Reading Example

A consensus unit and systems for ensuring traceability to it are typical of most natural science measurement. Sometimes, as in temperature measurement, the unification process is not fully completed, but for the vast majority of natural science attributes (referred to as constructs in psychometrics), a unification process has resulted in diverse instrument makers sharing a unit of measure even when the measurement mechanisms vary from manufacturer to manufacturer. Mercury-in-glass thermometers for human temperature measurement differ substantively from NexTemp technology but produce comparable results. Though the measurement mechanisms are drastically different, both report in either Fahrenheit or Celsius units. In the case of NexTemp thermometry, a chemical specification equation calibrates the instrument in °C or °F. The chemical specification equation, derived from experimental evidence, enforces the unit, which is embodied in the instrument to a known degree of uncertainty. Similarly, in reading measurement, a text complexity specification equation enforces the unit and ensures that 100L of difference between two readers, two texts, or a reader/text encounter is invariant over any of the 100+ English reading tests that, at present, employ the unit (Fisher & Stenner, 2016).

Strictly parallel instruments are typical in the natural sciences. Such instruments share a common correspondence table that links a measurement outcome (count of cavities turning black on a NexTemp thermometer) to a temperature in °C or °F. The ability to manufacture essentially identical instruments in large quantities is a hallmark of natural science measurement. The specification equation is the recipe for manufacturing and calibrating clones of an instrument. The social sciences borrow the concept and talk about ‘parallel’ instruments or ‘alternate forms’ and advertise that, say, forms A and B produce exchangeable measures. But without a specification equation it is impossible to manufacture copies or clones that share the same correspondence table. The reading measurement specification equation can be used to build strictly parallel clones of any reading test (see Appendix B). No such capability exists, for example, for mathematics, and this is so precisely because, at present, there exists no specification equation for mathematical ability that can calibrate mathematics test items (see Appendix C). Different mathematics tests are empirically linked to a common scale through large-scale, expensive field studies typically involving thousands of students.

Typical Rasch model applications in the social sciences are singly prescriptive. The major prescription that data must meet is non-intersecting item characteristic curves (ICCs), which relate the probability of a correct response to the difference between person ability and item difficulty. The data are used to estimate person and item parameters with no a priori constraints on the item parameters. Mathematics ability measurement is achieved in this way, as is typical of much social science measurement. Because there is no strong substantive theory for ‘mathematical ability,’ there is no specification equation and, thus, no potential for theoretically calibrating items/instruments. Instrument calibrations depend on sample data and a property of the Rasch model: when data fit the model, differences between persons are independent of the items used, and differences between items are independent of the persons sampled.
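The independence property described above can be illustrated with a minimal sketch of the dichotomous Rasch model, in which the log-odds of a correct response equal ability minus difficulty. The function names and the numerical values below are illustrative assumptions, not taken from any cited software or data set; abilities and difficulties are expressed in logits.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(probability):
    """Convert a probability to log-odds (logits)."""
    return math.log(probability / (1.0 - probability))

def person_contrast(ability_a, ability_b, difficulty):
    """Log-odds difference between two persons responding to the same item."""
    return (log_odds(rasch_probability(ability_a, difficulty))
            - log_odds(rasch_probability(ability_b, difficulty)))

# Hypothetical item difficulties and person abilities, for illustration only.
item_difficulties = [-1.5, 0.0, 2.0]
contrasts = [person_contrast(1.0, 0.25, d) for d in item_difficulties]
```

Under these assumptions every entry of `contrasts` equals the ability difference of 0.75 logits up to floating-point error, whichever item is used. The item difficulty cancels out of the comparison, which is why the ICCs do not intersect.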

Contrast this singly prescriptive measurement framework with the doubly prescriptive models underlying NexTemp human thermometry and the theoretical framework for reading. In both these cases, strong substantive theory coupled with either a Guttman model or a causal Rasch model requires not just data fit to the model but also data fit to the theory-specified item/instrument calibrations. For NexTemp, a chemical specification equation is used as a recipe for the chemical compound that fills each cavity. By precisely varying the amount of additive, the difference between any two adjacent cavities in sensitivity to the green component of light is set at precisely 0.2 degrees Fahrenheit. The chemical specification equation enforces this common unit difference for each of the 44 adjacent cavity differences across the 9 °F operating range for the instrument.
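A correspondence table of this kind can be sketched in a few lines. The sketch assumes that cavities darken cumulatively from the lowest threshold upward, so that a count maps to a single reading, and that 45 cavities produce the 44 adjacent differences of 0.2 °F mentioned in the text. The lowest reading of 96.0 °F is a hypothetical placeholder, not a manufacturer specification, and the function name is illustrative.

```python
def build_correspondence_table(lowest_reading, step, n_cavities):
    """Map a count of darkened cavities (1..n_cavities) to a reading.

    Assumes each additional darkened cavity raises the reading by one
    fixed step, as enforced by the specification equation.
    """
    return {count: round(lowest_reading + step * (count - 1), 1)
            for count in range(1, n_cavities + 1)}

# LOWEST_READING_F is a hypothetical value chosen only for illustration.
LOWEST_READING_F = 96.0
STEP_F = 0.2       # unit difference between adjacent cavities (from the text)
N_CAVITIES = 45    # 45 cavities give 44 adjacent differences

table = build_correspondence_table(LOWEST_READING_F, STEP_F, N_CAVITIES)
span = table[N_CAVITIES] - table[1]
```

Every instrument built to the same specification shares this one table, which is what makes the instruments strictly parallel. With these values the span between the lowest and highest cavity is 44 × 0.2 = 8.8 °F, in line with the roughly 9 °F operating range.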

When data fit a doubly prescriptive Rasch model, absolute person measures (not merely differences) are independent of items and instruments and are independent of person sample, precisely because no person data figure in the instrument calibration process. Theory-calibrated Rasch models are, thus, doubly prescriptive: prescriptive as to Rasch model requirements and prescriptive as to the substantive theory, i.e., item/instrument calibrations. Person misfit to a doubly prescriptive model signals that the measurement mechanism that transmits variation in the attribute to the measurement outcome (often a count) is not working as intended for that individual. Frequent failures of theoretical invariance force reexamination of the substantive theory, the measurement mechanism, and instrument calibration procedures. Theoretical invariance can be tested within person over time (e.g., reading ability growth trajectories), and when intra-individual theoretical invariance holds across persons, inter-individual theoretical invariance necessarily holds, i.e., the attribute is homologous (Borsboom & Dolan, 2007; Borsboom et al., 2009b; Hamaker et al., 2007; Molenaar, 2004; Molenaar & Newell, 2010).
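The role of theory-fixed calibrations can be made concrete with a short sketch. Item difficulties are treated as given in advance (here, hypothetical numbers standing in for the output of a specification equation), and only the person measure is estimated from the observed count correct. The estimator is a standard Newton-Raphson maximum-likelihood routine for the dichotomous Rasch model, offered as one plausible way to do this rather than as the procedure used by the authors.

```python
import math

def expected_score(ability, difficulties):
    """Model-expected count correct for a person on a set of items."""
    return sum(1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties)

def estimate_person_measure(raw_score, difficulties, tol=1e-8, max_iter=100):
    """Maximum-likelihood person measure with item difficulties held fixed."""
    n_items = len(difficulties)
    if not 0 < raw_score < n_items:
        raise ValueError("Extreme scores have no finite maximum-likelihood estimate.")
    # Starting value: log-odds of the raw score, shifted by mean difficulty.
    ability = math.log(raw_score / (n_items - raw_score)) + sum(difficulties) / n_items
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties]
        residual = raw_score - sum(probs)                # observed minus expected
        information = sum(p * (1.0 - p) for p in probs)  # test information
        step = max(-1.0, min(1.0, residual / information))  # damped Newton step
        ability += step
        if abs(step) < tol:
            break
    return ability

# Hypothetical theory-based calibrations (logits); no person data were used.
theory_difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
measure_for_3 = estimate_person_measure(3, theory_difficulties)
measure_for_4 = estimate_person_measure(4, theory_difficulties)
```

Because the difficulties are never re-estimated from the response data, the resulting person measure does not depend on who else took the test, which is the sense in which absolute measures become sample independent.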

Molenaar (Hamaker et al., 2007; Molenaar, 2004; Molenaar & Newell, 2010) shows that inferences moving in the reverse direction, inferring from inter-individual factor structures something about intra-individual factor structures, are fraught with complications. The fact that so much of social and psychological measurement is based upon factor analysis of inter-individual variation prompted Molenaar (Molenaar, 2004; Molenaar & Newell, 2010) to call for a Kuhnian revolution, a paradigm shift in the concepts and methods of measurement in psychology. This paper is intended as another in a series of contributions to this revolution (Fisher, 2009, 2010; Stenner & Smith, 1982; Stenner et al., 1983).

3 Conclusion

Unification of measurement refers to a 200-year-old process whereby dozens if not hundreds of distinct scales for measuring a common attribute are, sometimes quickly and more often slowly, reduced to one, two, or three exchangeable units of measure. The history of temperature measurement is a paradigmatic case (Chang, 2004; Sherry, 2011) that parallels many contemporary measurement movements in the social and behavioral sciences. Typically, an attribute (construct) captures the imagination of a community of scholars and engineers, and different tests, instruments, mechanisms, and scales are proposed for measuring the attribute, each uniquely named. Once there is consensus that the selfsame attribute is being measured across these various devices, small-scale linking studies are undertaken to build conversion tables to re-express one unit in one or more other units. More advanced linking studies reduce the link to an equation, °F = °C × 9/5 + 32, making for quick and easy conversions. Since at this stage there is often not much to elevate one scale above the competition, the marketplace takes over and ‘unification’, with all its time and cost savings, eventually prevails. Sometimes unification is swift and decisive, but more often, particularly in the social sciences, metrology is poorly understood and unification plods along.
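Once a link has been reduced to an equation, converting between units requires no table lookup and no new data collection. A minimal sketch of the temperature link and its inverse follows; the function names are illustrative.

```python
def celsius_to_fahrenheit(celsius):
    """Re-express a Celsius reading in Fahrenheit: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32

def fahrenheit_to_celsius(fahrenheit):
    """Inverse link: C = (F - 32) * 5/9."""
    return (fahrenheit - 32) * 5 / 9

boiling_f = celsius_to_fahrenheit(100)
round_trip_c = fahrenheit_to_celsius(celsius_to_fahrenheit(37.0))
```

The two functions are exact inverses apart from floating-point rounding, which is what makes the two units exchangeable.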

A useful case study of unification in the social sciences is the longstanding network of reading measures that has linked 100+ English language reading tests across the world, 250,000 book measures, and 200 million article measures. The unification process is 27 years old and is accelerating but is far from complete. This effort drew inspiration and strategies from the history of the unification of temperature (Chang, 2004; Molenaar & Newell, 2010; Stenner et al., 2013).

Rather than using factor analysis of inter-individual data to define an attribute structure and then asking if this structure obtains when examining intra-individual data, we suggest the use of substantive theory (in the form of specification/calibration equations) to establish the universality of attribute structure and measurement mechanism at the individual level. Once this is accomplished, there is no puzzle about whether between-person differences have the same structure as within-person differences: of course they do. So, what this analysis reveals is that it is problematic to study between-person variation at one point in time to glimpse truths about within-person structures over time (Hamaker et al., 2007; Williamson et al., 2013). But the surprise is that if we start with within-person theory-referenced measurement, where in the extreme no two persons have any items in common over 5 years of measurement, then we would not stop for a moment to puzzle about the validity of the claim that at the end of year 1 Jane was higher than Bob but at the end of year 5 Bob was higher than Jane (i.e., a claim about inter-individual variation). This is yet another benefit of theory-based instrument calibration.

Several key features distinguishing physical science and behavioral science measurement systems can be traced to the absence of substantive theory sufficiently developed that it can be used to calibrate measurement instruments. Once such a calibration/specification equation is available, most of these distinguishing features can quickly and easily be imported into the behavioral sciences.