In this paper we look closely at two latent variables (temperature and reading ability) and two instruments used in their measurement. At first glance the conceptual foundations for these constructs seem to represent quite different entities. A thermometer measures human temperature, widely assumed to be a physical attribute, whereas a test of reading ability clearly addresses a mental attribute. Common sense tends to assert fundamentally dissimilar ontologies. We intend to show that beneath this surface dissimilarity there are striking parallels that can be exploited to illuminate the “oughts” of human science measurement. Moreover, we expect this comparison to offer important insights to health outcome researchers by showing the efficiency of linear measurement for establishing closed knowledge systems. Closed knowledge systems offer important benefits to effective rehabilitation by isolating key constructs affecting patient functional status. Parameterizing these constructs within a closed system diminishes treatment uncertainty and increases measurement efficiency. Both thermodynamics and Lexile Theory emphasize latent constructs that are operationally defined not only by units but also by comprehensive “deep structure” substantive theories, which guide general system formulation. Both thermodynamics and Lexile Theory are systems that manipulate only a few key constructs, which are applicable across broad classes of physical and mental activity. This approach to latent traits suggests that health outcome measurement is likewise defined by several deep structure constructs that permeate outcome measurement. Closed systems conforming to substantive theories modeled on thermodynamics and the Lexile Framework can likely be developed to govern these constructs.

Before comparing and contrasting temperature and reader ability we assert that the physical science construct temperature and the human science construct reader ability share common philosophical foundations. Both latent constructs signify real entities whose causal action can be manipulated and whose effects can be observed on interval scales. Neither construct should be conceived of as “just a useful fiction.” Thus, we reject a constructivist interpretation for either construct. Both constructs are attributes of human beings, and we are entity realists. The relationship between the latent variable and the measurement outcome is causal at both the intra-individual and inter-individual levels; stated more formally, the conditional probability distribution of the measurement outcome given the latent variable is to be given a stochastic subject interpretation (Holland, 1990). We reject a repeated sampling interpretation of the conditional probability distribution. We assert that for both temperature and reading ability the measurement model takes the same form within and between persons, what Ellis and Van den Wollenberg (1993) called the local homogeneity assumption. We assert this assumption is testable in the theory-referenced measurement context (Stenner et al., 1983).

Figure 1 presents an aspect chart for two latent variables: temperature and reader ability. For our purposes we assume that the object of measurement in both cases is a person. In each case the instrument (thermometer or reading test) is brought into contact with the person. A measurement outcome (the number of cavities (0–45) that fail to reflect green light, or the count correct on 45 theoretically calibrated reading items) is recorded by a professional (nurse or teacher). Note that both instruments have been targeted on the appropriate range for each person. The thermometer measures from 96.0 to 104.8 °F, and let’s assume the reading test has been targeted for a typical fourth grader (500–900 L). The substantive theory provides the link between the measurement outcome and the measure denominated in a conventional unit (degrees Fahrenheit (°F) or Lexiles (L)). Note that substantive theory is used both to build the respective instrument and to convert cavity counts and counts correct on the reading test into measures. Cavity counts and counts correct are sufficient statistics for their respective parameters, i.e., measures. Sufficient statistics exhaust the information in the data that is relevant to estimating the parameter/measure. In each case a point estimate of the measure is produced for the person without recourse to information about other persons’ measures. Instrument calibrations come from theory, and individual instruments may be disposable: never used before and never used again. We turn now to a more detailed look at the NexTemp® thermometer and MyReadingWeb to further draw out the parallel structures underlying the measurement of these two latent variables.

Fig. 1 Temperature versus reader ability measurements

1 Human Body Temperature

Temperature is a physical property of systems, including the human system, that imperfectly corresponds to the human sensation of hot and cold. Temperature as a latent variable is a principal parameter in thermodynamic theory. Like other latent variables, temperature cannot be directly observed, but its effects can be, and thermometers are instruments that detect these effects. A thermometer has two key components: a sensor (e.g., cavities of fluid that differentially reflect light), which detects changes in temperature, and a correspondence table that converts the measurement outcome (cavity count) into a scale value (degrees Celsius) via thermodynamic theory. A wide range of so-called primary thermometers have been built, each relying on radically different physical effects (electrical resistance, expansion coefficients of two metals, velocity of sound in a monatomic gas, and gamma ray emission in a magnetic field). For primary thermometers the relationship between a measurement outcome and its measure is so well understood that temperature readings can be computed directly from the measurement outcomes. So-called “secondary” thermometers (e.g., mercury thermometers) produce measurement outcomes whose relationship to temperature is not yet so well understood that temperature readings can be directly computed. In these far more common cases, secondary thermometer readings are calibrated against a primary thermometer and a correspondence table is generated that links the measurement outcome to temperature expressed in, say, degrees Celsius. Celsius measures can then be converted directly into Kelvin, Fahrenheit, Rankine, Delisle, Newton or other metrics that find use in, for example, special engineering applications or high energy physics.
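Because each of these metrics is a fixed linear transformation of the Celsius scale, the final conversion step is trivial. A minimal sketch of the standard conversion formulas (the function names are ours, for illustration):

```python
def celsius_to_kelvin(c: float) -> float:
    return c + 273.15

def celsius_to_fahrenheit(c: float) -> float:
    return c * 9 / 5 + 32

def celsius_to_rankine(c: float) -> float:
    return (c + 273.15) * 9 / 5

def celsius_to_delisle(c: float) -> float:
    return (100 - c) * 3 / 2

def celsius_to_newton(c: float) -> float:
    return c * 33 / 100

# e.g., 37.0 °C converts exactly to 98.6 °F; 36.8 °C converts to 98.24 °F.
print(celsius_to_fahrenheit(37.0))  # 98.6
print(celsius_to_fahrenheit(36.8))  # 98.24
```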

In 1861 Carl Wunderlich reported, in a study of one million persons, that the average human body temperature was 37.0 °C. This value converts precisely to 98.6 °F and is believed to be the source of the putative “normal” temperature. Mackowiak et al. (1992) measured 148 healthy men and women multiple times each day for three consecutive days with electronic oral thermometers. The authors found that the average temperature was 36.8 °C, which converts to 98.2 °F. They speculate that Wunderlich rounded his “average” up to 37.0 °C and that the rounded measure was then converted to 98.6 °F and subsequently popularized. Mackowiak, Wasserman and Levine recommend that the popular value of 98.6 °F for oral temperature be abandoned in favor of the new value of 98.2 °F and that “normal” ranges be similarly revised.

Figure 2 provides a black line sketch of a NexTemp® disposable thermometer for measuring human temperature. The NexTemp thermometer is a thin, flexible, paddle-shaped plastic strip containing multiple cavities. In the Fahrenheit version, the 45 cavities are arranged in a double matrix at the functioning end of the unit. The columns are spaced at 0.2 °F intervals covering the range of 96.0–104.8 °F. Each cavity contains a chemical composition comprising three cholesteric liquid crystal compounds and a varying concentration of a soluble additive. These chemical compositions have discrete and repeatable change-of-state properties, the temperatures of which are determined by the concentrations of the additive. Additive concentrations are varied in accordance with an empirically established formula to produce a series of change-of-state temperatures consistent with the indicated temperature points on the device. The chemicals are fully encapsulated by a clear polymeric film, which allows observation of the physical change but prevents any user contact with the chemicals. When the thermometer is placed in an environment within its measurement range, say at 98.2 °F (36.8 °C), the chemicals in all of the cavities up to and including 98.2 °F change from a liquid crystal to an isotropic clear liquid state. This change of state is accompanied by an optical change that is easily viewed by a user. The green component of white light is reflected from the liquid crystal state but is transmitted through the isotropic liquid state and absorbed by the black background. As a result, those cavities containing compositions with threshold temperatures up to and including 98.2 °F appear black, whereas those with higher transition temperatures continue to appear green (Medical Indicators, pp. 1–2).
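The correspondence between cavity count and temperature is therefore a simple affine map over the 45 thresholds (96.0, 96.2, …, 104.8 °F). A minimal sketch of the lookup, assuming the measurement outcome is read as the count of black (changed-state) cavities:

```python
def nextemp_reading_f(black_cavities: int) -> float:
    """Convert a NexTemp cavity count to a Fahrenheit reading.

    The 45 cavities have change-of-state thresholds at 96.0, 96.2, ...,
    104.8 °F; every cavity at or below body temperature turns black,
    so the highest black cavity indicates the temperature.
    """
    if not 1 <= black_cavities <= 45:
        raise ValueError("temperature outside the instrument's range")
    return round(96.0 + 0.2 * (black_cavities - 1), 1)

# Example: at 98.2 °F the first 12 cavities (96.0 ... 98.2) turn black.
assert nextemp_reading_f(12) == 98.2
```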

Fig. 2 NexTemp thermometer

In-vitro accuracy of the NexTemp liquid crystal thermometer equals or exceeds that of glass-mercury and electronic thermometers. More than one hundred million production units have shown agreement with calibrated water baths to within 0.2 °F in the range of 98.0–102.0 °F and within 0.4 °F elsewhere (0.1 °C in the range of 37.0–39.0 °C and within 0.2 °C elsewhere).

In-vivo tests in the U.S., Japan and Italy resulted in excellent agreement with measurements using specially calibrated glass-mercury thermometers. The mean difference between the NexTemp thermometers and the calibrated glass-mercury equilibrium device was only 0.12 °F (0.07 °C). The NexTemp thermometer also achieves equilibrium very rapidly, due to its small “drawdown” (cooling effect on tissue upon introduction of a room-temperature device) and the small amount of energy required to make the physical phase transition.

A competitive marketing analysis reported the following advantages of this new technology over the market-dominant, older mercury-in-a-tube technology. These “technical advantages” will prove useful in our comparison and contrast with reading test technology:

  • Cost—The NexTemp temperature measurement technology provides lower cost than competing temperature measurement devices.

  • Safety—The safety advantages of NexTemp technology are substantial. There is no danger, as with a conventional thermometer, of glass ingestion or mercury poisoning if a child bites the active part of the unit. NexTemp® and its packaging are latex-free.

  • Speed and ease-of-use—The NexTemp thermometer is quick, portable, non-breakable and easy to use (e.g., no shakedown or resetting).

  • Reduced chance of cross-contamination of patients—The NexTemp disposable product comes individually wrapped and is intended to be used and then discarded. The reusable version of the product is for single patient use over time with cleaning between uses (Medical Indicators, p. 4).

2 Reader Ability

Of approximately 6000 spoken languages in the world only about 200 are written, and many fewer than that have an extensive text base. Reader ability is the capacity of the individual to make meaning from text. Reader ability, like other latent variables, cannot be directly observed. Rather, its existence must be inferred from its effects on measurement outcomes (count correct). A reading test is an instrument designed to detect variation in reading ability. A reading test has three key components:

(1) text (e.g., a newspaper article on global warming), (2) a response requirement (e.g., answering multiple choice questions embedded in the passage), and (3) a correspondence table that converts the measurement outcome (count correct) to a scale value (Lexiles) via a theory (the Lexile Framework for Reading). Although the ubiquitous multiple choice item type dominates as the response requirement in the measurement of reading ability, other task types have found use for specific ranges of the reading ability scale, including retelling, written summaries, short answers, oral reading rate and cloze. Because the notion of scale unification is still foreign in the human sciences, common practice associates a unique measurement scale with every published instrument. Many dissertations involve the development of a new instrument and a new scale purportedly measuring the intended construct in an equal interval metric unique to that instrument. At this writing the authors are unaware of any dissertation that reports on the unification of multiple measures of the same latent variable (e.g., depression, anxiety, spatial reasoning, mathematical ability) denominated in a common unit of measure. Failure to separate the instrument from the scale has had pernicious effects on human science. Because different reading instruments often employ different task types, different test names (test of reading ability, test of reading comprehension, test of reading achievement), and the aforementioned different scales, it should not surprise us that many test users and reading researchers perceive these various reading tests as measuring different latent variables. The unwary are fooled: looks can be deceiving.

Web-based reading technology (MyReadingWeb) has been developed to accommodate Lexile measurement. Accompanying the text and its associated machine-generated cloze items are a “key” for scoring (marking correct or incorrect) reader responses and a correspondence table relating count correct to Lexiles. Any continuous prose can be loaded into MyReadingWeb and the software will instantly turn the text into a reading measurement instrument. The first step involves measuring the text, be it an article, chapter, or book. The Lexile Analyzer computes various statistics on word frequency and sentence length. These statistics are then combined in an equation that returns a Lexile measure for the text. MyReadingWeb uses this measure to “cloze” vocabulary words that are in a comfortable range for a reader whose reading ability equals the text readability of the article. A part-of-speech parser then chooses three foils (incorrect answers) that are the same part of speech and have a similar Lexile level as the clozed word but that are not synonyms or antonyms of the clozed word. Finally, the four choices are randomized before presentation. On average a response requirement is imposed on the reader every 50–70 words of running text.
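The analyzer’s published equation is beyond our scope here, but its general shape is a linear combination of a syntactic feature (log mean sentence length) and a semantic feature (mean log word frequency). A schematic sketch, in which the coefficients a, b, c are placeholders and not the actual Lexile calibration:

```python
import math

def text_measure(sentences: list[list[str]],
                 log_freq: dict[str, float],
                 a: float, b: float, c: float) -> float:
    """Schematic Lexile-style text measure.

    Combines log mean sentence length (a proxy for syntactic demand)
    with mean log word frequency (a proxy for semantic demand).
    The coefficients a, b, c stand in for the proprietary calibration.
    """
    words = [w.lower() for sentence in sentences for w in sentence]
    lmsl = math.log(len(words) / len(sentences))
    mlwf = sum(log_freq.get(w, 0.0) for w in words) / len(words)
    return a * lmsl - b * mlwf + c
```

Under this form, longer sentences raise the measure while more frequent (easier) vocabulary lowers it.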

Counts correct on a text are evaluated by a dichotomous Rasch model in which the location and dispersion parameters of the item difficulty distribution are treated as known. The Lexile measure that maximizes the likelihood of the data is the reader measure reported for each particular encounter between reader and text. A Bayesian growth model is used to combine individual article measures within and across days over the complete MyReadingWeb history for a reader.
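A minimal sketch of that maximum likelihood step, assuming item difficulties are supplied in logits by the theory. Because count correct is sufficient, no other feature of the response pattern enters the estimate; the resulting logit is then rescaled to Lexiles by the theory’s (linear) calibration:

```python
import math

def rasch_mle(count_correct: int, difficulties: list[float],
              tol: float = 1e-6) -> float:
    """Newton-Raphson MLE of reader ability (in logits) under a
    dichotomous Rasch model with known item difficulties.

    The count correct is the sufficient statistic: the estimate solves
        count_correct = sum_i P(correct | ability, difficulty_i).
    """
    if not 0 < count_correct < len(difficulties):
        raise ValueError("zero and perfect scores have no finite MLE")
    ability = 0.0
    while True:
        p = [1 / (1 + math.exp(d - ability)) for d in difficulties]
        step = (count_correct - sum(p)) / sum(q * (1 - q) for q in p)
        ability += step
        if abs(step) < tol:
            return ability
```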

Table 1 presents results from a large study designed to test how well Lexile theory could predict percent correct on machine-generated items like those described above. First grade through twelfth grade students (N = 1,743) read a total of 289,345 articles comprising 194,968,617 words.

Table 1 Lexile theory predictions for machine-generated items

Collectively, they spent 2 years 157 days 23 h 16 min in the program and averaged 150 words read per minute (WPM). Of the 3,051,341 unique cloze items generated by the computer, the participants answered 2,245,741, or 73.90% of the items, correctly. The model forecasted 2,291,787 correct, or 75.11%. Figure 3 presents a histogram of differences between theory and observation. In a subsample of 1,325 students, 1,005 had observed counts correct within ±3% of the model expectation.

Fig. 3 Differences between Lexile theory and observations

The best explanation for the close agreement between theoretical comprehension rate and observed comprehension rate is that the Lexile theory and the Rasch model are cooperating in providing (1) good text measures for the articles, (2) good reader measures for the students, and (3) well-modeled comprehension rates. The cooperation between substantive theory and the Rasch model evidences cross-sectional developmental consistency, i.e., the theory works throughout the reading range reflected in Table 1 (100–1500 L). Invoking the “no miracles” argument currently fashionable among philosophers of science of the realist persuasion: the congruence between theory and observation is explained by the fact that the theory is at least approximately right.

In summary, the measurement of human temperature and reading ability, if conceptualized in a particular way, can be seen to share a common deep structure. Both constructs are latent variables that assign a causal role to an unobservable attribute of persons. Conditioning on the latent variable renders the measurement outcomes (cavity count and count correct) statistically independent. Temperature and reading ability are real entities that can be manipulated, and the effects of these manipulations can be detected. Persons possess a true value on each construct that is approached by repeated measurement but is never precisely determined. The two attributes apply equally well to between-person variation and within-person variation. For human temperature measurement with the NexTemp® technology we can trade off a change in the amount of the soluble additive for a change in temperature to hold constant the number of cavities that “turn black.” Similarly, for reading we can trade off a difference in reader ability for an equivalent difference in text readability to hold constant the count correct. In both cases enough is known about the measurement procedure and the relevant active processes that two persons with equal temperatures or reading abilities can be made to produce different measurement outcomes (cavity count or count correct) by systematically manipulating the respective instruments. In short, the two latent variables are under precise experimental control, and that is why an indefinitely large number of parallel instruments can be manufactured for each construct.
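For the reading case, the Rasch model makes this trade-off exact: adding the same constant to reader ability and to every item difficulty leaves the expected count correct unchanged, while shifting only the instrument changes the outcome for readers of equal ability. A minimal sketch, under assumed logit values:

```python
import math

def expected_count(ability: float, difficulties: list[float]) -> float:
    """Expected count correct for a reader under the Rasch model."""
    return sum(1 / (1 + math.exp(d - ability)) for d in difficulties)

items = [-0.5, 0.0, 0.4, 0.9]         # assumed item difficulties (logits)
harder = [d + 0.7 for d in items]     # a deliberately harder parallel form

# Equal shifts to reader and text cancel: expected counts agree.
print(expected_count(1.0, items), expected_count(1.7, harder))
# Two readers of equal ability, different instruments: outcomes differ.
print(expected_count(1.0, items), expected_count(1.0, harder))
```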

Both thermodynamic theory and Lexile reading theory force a distinction not made before the theories were put forth. The sensation of hot and cold was formalized as temperature, which was later distinguished from the common parlance synonym “heat.” Reading ability was likewise distinguished from the common parlance synonym “reading comprehension.” The former is a text-independent characterization of reader performance, whereas the latter is a text-dependent characterization. Both theories have made extensive use of the ensemble interpretation first proposed by Einstein (1902) and Gibbs (1902).

The two constructs, temperature and reader ability, figure in laws that are strikingly parallel in conception and structure. The combined gas law specifies the relationship between volume and temperature conditioning on pressure: log pressure plus log volume minus log temperature equals a constant, given a frame of reference specified by the number of molecules. Similarly, the reading law specifies the relationship between reader and text conditioning on comprehension rate: logit-transformed comprehension rate plus text measure minus reader measure equals the constant 1.1, given a frame of reference that specifies 75% comprehension whenever text measure equals reader measure. Thus a + b − c = constant holds for both the combined gas law and the Lexile reading law (Burdick et al. 2006).
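Both laws instantiate the same additive form, as a small numerical check shows. For the reading law we take reader and text measures in logits (the Lexile scale being a linear rescaling), and note that logit(0.75) = ln 3 ≈ 1.1:

```python
import math

# Combined gas law: log P + log V - log T is invariant for a fixed
# quantity of gas (since P*V/T = n*R).
def gas_invariant(pressure: float, volume: float, temperature: float) -> float:
    return math.log(pressure) + math.log(volume) - math.log(temperature)

# Lexile reading law: logit(comprehension) + text - reader = 1.1,
# so the comprehension rate is recovered from the reader-text difference.
def comprehension_rate(reader: float, text: float) -> float:
    return 1 / (1 + math.exp(-(1.1 + reader - text)))

# Same P*V, same T: the invariant agrees.
print(gas_invariant(2.0, 5.0, 300.0) == gas_invariant(4.0, 2.5, 300.0))  # True
print(round(comprehension_rate(0.0, 0.0), 2))  # 0.75 when reader == text
print(round(comprehension_rate(1.0, 0.0), 2))  # higher ability, higher rate
```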

Finally, the NexTemp® and MyReadingWeb technologies share several additional features: (1) they are both inexpensive (NexTemp’s cost is 9 cents; MyReadingWeb’s cost per item is fractions of a penny), (2) they function within an intended range and are useless outside that range, (3) the respective technologies produce instruments that are one-off and disposable, (4) both instruments are theoretically calibrated, and this produces generally objective measures, (5) both are readable technologies: the user does not need to understand even the rudiments of thermodynamic theory or Lexile theory to produce valid and useful measures, and (6) both can measure growth or change within and between persons.

In conclusion, although closed knowledge systems have not been developed for health outcomes, the parallels between temperature and reading ability presented in this paper should offer valuable insights for their formulation. For example, functional assessment in rehabilitation, currently defined by separate CAT measures, could be consolidated into a closed system defined by the deep structure common across functional measures. Separate functional items could be reduced to a single structure that includes not only outcome but also intervention constructs. In this system, variation in intervention treatment values would be systematically related to outcome measures. Consequently, treatment effectiveness could be specified in advance of delivery, and treatment could be terminated at the point of maximum effectiveness.