In principle, any characteristic of the instrument, the objects of measurement, or the measurement context that influences the change in probability of a modeled response (i.e., the slope of the item response function) is a threat to unit invariance. Such influences produce instability in the unit. Progress in science requires stable unit specifications, because it is only through such conventions that a unit is reproduced and shared. A large fraction of the metrology budget in the more mature sciences is devoted to identifying and engineering around threats to unit stability.

A big step in realizing the metrology program outlined by Humphry is the abandonment of descriptive Rasch models in favor of an explicitly causal interpretation of the regression of the response probability on the exponentiated difference between a person parameter and an instrument/item parameter. All that is meant by this causal claim is that an intervention on the person parameter can be traded off against an offsetting intervention on the item parameter to hold the probability of a correct response constant. If this trade-off property is experimentally verified throughout the range of the attribute and is invariant across task types, person characteristics, and measurement contexts, then a stable, reproducible unit for measuring persons and items has been specified and actualized. If invariance is lacking, say for a new task type, the first thing to check is whether the new task type, which purportedly measures the same construct as the task types evidencing invariance, carries unexpected added easiness/hardness or is measuring in a differently sized unit. In writing research we have found that human and machine scoring of student writing need to be adjusted for differences in unit size; once this adjustment is made, the concordance between machine scores and human ratings is striking.
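The trade-off property can be made concrete with a minimal numerical sketch of the dichotomous Rasch model. The function name and the particular parameter values below are illustrative, not from the source; the point is only that shifting the person parameter and the item parameter by the same amount leaves the response probability unchanged.

```python
import math

def rasch_prob(theta, delta):
    """Dichotomous Rasch model: P(correct) = exp(theta - delta) / (1 + exp(theta - delta)),
    where theta is the person parameter and delta the item parameter."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# Trade-off property: an intervention of +c on the person parameter is
# exactly offset by an intervention of +c on the item parameter.
theta, delta, c = 1.2, 0.5, 0.8
p_before = rasch_prob(theta, delta)
p_after = rasch_prob(theta + c, delta + c)
assert abs(p_before - p_after) < 1e-12  # probability held constant
```

Because only the difference theta - delta enters the model, the property holds for any shift c, which is what licenses treating the two interventions as interchangeable.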

A measurement instrument is built to detect variation of a kind. The specification equation answers the question: What causes the variation the instrument detects? It is also clear that non-test behaviors (e.g., reading a Harry Potter novel) can be brought into a reading measurement frame of reference by imagining that the novel contains an ensemble of test items whose distributional properties may be treated as known. The text complexity of the novel is then the reader ability required to correctly answer, say, 75% of the virtual items making up the novel. The specification equation is the tool used to calibrate these non-test behaviors. A third use of the specification equation is to calibrate actual test items, making it possible to convert counts correct into quantities for, for example, computer-generated reading items. All of these uses, however, pale in relation to the role the specification equation can and should play in maintaining the unit of scale for an attribute. Once a specification equation for an attribute is “locked down,” the unit origin and unit size are fixed. New task types and measurement contexts (e.g., machine vs. human scoring) can be linked back to the fixed unit. Tampering with the specification equation by changing the intercept or adjusting the regression weights alters the origin or the unit size, respectively. Thus, the specification equation defines the unit, independent of any particular test form or linking study, and maintains that unit over widely varying instrumentation and measurement contexts.
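The virtual-item construction can be sketched as a small root-finding exercise: given difficulties for the imagined ensemble of items, the text complexity is the ability theta at which the expected proportion correct equals the chosen success rate (75% here). The difficulty values below are invented for illustration; the bisection search is just one simple way to solve for theta.

```python
import math

def rasch_prob(theta, delta):
    """Rasch probability of a correct response for ability theta, difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def text_complexity(item_difficulties, target=0.75):
    """Ability theta at which the expected proportion correct over the
    virtual-item ensemble equals `target`, found by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        expected = sum(rasch_prob(mid, d) for d in item_difficulties) / len(item_difficulties)
        if expected < target:
            lo = mid  # reader not yet able enough; search higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical difficulties for the virtual items of a novel (illustrative only)
difficulties = [-0.4, 0.1, 0.3, 0.7, 1.2]
theta_75 = text_complexity(difficulties)
```

The expected proportion correct is monotone increasing in theta, so the bisection is guaranteed to converge to the unique solution.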

It often happens that new task types, improved theory, or improved technology necessitate changes to the specification equation that, if not taken into account, will result in a “new” unit. Adjustments to the specification equation can be made to ensure that the “old” unit and “new” unit are comparable as to origin and unit size. Typically some standard artifact (the boiling point of normal water, the platinum meter bar, or a collection of empirically calibrated texts) is used to ensure unit stability over time.
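One simple way to realize this adjustment, assuming the old and new units differ only in origin and size, is to calibrate the same standard artifacts under both equations and fit the linear map that carries new-unit values back onto the old unit. The artifact values below are invented; a real collection of calibrated texts would play their role.

```python
def linear_link(old_values, new_values):
    """Least-squares slope and intercept mapping new-unit calibrations of a
    shared set of standard artifacts back onto the old unit."""
    n = len(old_values)
    mx = sum(new_values) / n
    my = sum(old_values) / n
    sxx = sum((x - mx) ** 2 for x in new_values)
    sxy = sum((x - mx) * (y - my) for x, y in zip(new_values, old_values))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical calibrations of the same reference texts under the old and
# new specification equations (values invented for illustration)
old = [200.0, 500.0, 800.0, 1100.0]
new = [1.0, 2.5, 4.0, 5.5]
a, b = linear_link(old, new)
# old-unit value of any new-unit calibration x is then approximately a * x + b
```

Holding the artifact collection fixed over time is what anchors the unit: any future revision of the specification equation is re-expressed in the old unit through this link.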

Scale unification is a well-understood theme in the history of science. Its obverse, scale proliferation, is a prominent feature of measurement theory and practice in the human and social sciences. Today there are dozens of scales for measuring every important attribute (anxiety, depression, reading ability, spatial reasoning). There is often debate about whether variation among task types in added easiness/hardness or in unit size signals that something different is being measured. Infrequently, attempts are made to document that the same attribute is being measured by the various instruments, and linking studies are then launched that result in correspondence tables or equations connecting the respective scales (similar to the equation that links the Fahrenheit and Celsius scales).
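The temperature analogy can be stated exactly, and a linking study for two reading scales would yield an equation of the same linear form. The crosswalk coefficients for the reading scales below are invented placeholders; only the Fahrenheit–Celsius relation is a known fact.

```python
def c_to_f(celsius):
    """Fahrenheit = 1.8 * Celsius + 32: the same attribute (temperature)
    expressed in scales with different origins and unit sizes."""
    return 1.8 * celsius + 32.0

def link_scale_a_to_b(score_a, slope=1.4, intercept=250.0):
    """Hypothetical crosswalk between two reading scales, of the same
    linear form a linking study would produce (coefficients invented)."""
    return slope * score_a + intercept
```

As with temperature, such an equation asserts comparability of origin and unit size; it does not by itself establish that the two instruments measure the same attribute, which is why the documentation step must come first.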

Humphry has sketched a metrology program for the human and social sciences to follow as we begin the arduous task of building a system of units. Although the far-term goal is a system of units for the human and social sciences, the near-term goal should be an invariant unit shared by a relevant community for a single attribute. We will learn much as these first attempts play out.