This section scrutinises practices currently established in psychological and social-science research for implementing the different methodological functions of ‘scales’ and ‘units’, and explores the consequences that these entail for the establishment of numerical traceability. The focus is on widely used language-based quantification methods (e.g., tests, questionnaires).
Standardised item and answer ‘scales’ fail to establish both data generation traceability and numerical traceability
Identical wording and formatting of item and answer ‘scales’ are often assumed to constitute sufficient standardisation for enabling quantitative investigations. To make these ‘scales’ applicable to a broad range of individuals, phenomena and contexts, their wordings are often abstract and even vague (e.g., ‘often’, ‘sometimes’). As a consequence, items and answer categories reflect not the specific target qualities and particular quantities thereof, as required for measurement, but concepts, which mostly refer to conglomerates of qualitatively heterogeneous properties and study phenomena (Uher 2018b, 2021c). Insufficient specification of the concrete study phenomena, of the target property and of its divisible (quantitative) properties to be studied in them, however, promotes phenomenon–quality–quantity conflation. Given this, it is unsurprising that individuals construct heterogeneous meanings for the same standardised items and, for multi-stage answer categories, construct meanings that are not mutually exclusive and quantitative but overlapping and often even qualitatively different (Lundmann and Villadsen 2016; Rosenbaum and Valsiner 2011; Uher and Visalberghi 2016). This precludes the establishment of standardised and traceable processes of data generation (Uher 2018a).
Particular problems for numerical traceability arise from the narrow range of values in ‘answer scales’. Bounded value ranges are also used for some measurement units, but these are either repeating and thus unlimited (e.g., clock time) or inherent to the target property (e.g., degrees in a circle). Other quantitative categories with bounded value ranges refer to specified samples of unlimited size (e.g., percentages, counted fractions), thus indicating quantity values that are traceable. By contrast, the numerical values created from standardised ‘answer scales’ are conceived as bounded from the outset, regardless of the diverse phenomena, qualities and quantities to which they may be applied. As a consequence, data-generating persons must assign a broad range of quantitative information flexibly to a predetermined, narrow range of values. But how do they do this? For example, how often is ‘often’ for a phenomenon to occur, given that general occurrence rates vary for different phenomena (e.g., talking versus sneezing) and also across contexts (Uher 2015a)? To fit their ideas into narrow ‘answer scales’, respondents sometimes seem to intuitively weigh the study phenomena’s observed occurrences against their presumed typical occurrence rates in given contexts (e.g., in sex/gender or age groups; Uher 2015c; Uher and Visalberghi 2016; Uher et al. 2013b)—just like Procrustes, the stretcher and subduer of Greek mythology, who forced travellers to fit the size of an iron bed by stretching them or cutting off their legs.
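A minimal sketch may illustrate this intuitive weighing; the function, the category thresholds and the baseline rates below are all invented for illustration:

```python
# Hypothetical sketch: the same observed frequency maps to different
# 'answer scale' categories depending on the phenomenon's presumed
# typical rate (all numbers invented for illustration).

def rate_category(observed_per_day: float, presumed_typical_per_day: float) -> str:
    """Assign a verbal frequency category relative to a presumed baseline."""
    ratio = observed_per_day / presumed_typical_per_day
    if ratio < 0.5:
        return "rarely"
    elif ratio < 1.5:
        return "sometimes"
    else:
        return "often"

# Ten occurrences per day is 'rarely' for talking but 'often' for sneezing,
# given (invented) presumed typical rates of 50 and 3 per day respectively.
print(rate_category(10, presumed_typical_per_day=50))  # -> rarely
print(rate_category(10, presumed_typical_per_day=3))   # -> often
```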
However, without knowing the specific quantitative relations by which this intuitive fitting is done (unlike, for instance, logarithmic scales, whose compression follows a known function), the quantitative data thus generated can reflect quantitative information neither of the actual (inaccessible) phenomena and properties of interest nor of those used as their (accessible) indicators. Indeed, the requirement to assign diverse ranges of quantities flexibly to a bounded range of values can distort and even invert quantitative relations, thereby introducing shifts in the quantitative meaning of the data produced. A simple hypothetical example illustrates this. Assume we judged the size of different bicycles on a verbal ‘answer scale’ (e.g., ‘small’, ‘medium’, ‘large’) and did the same for different cars and different trains using that same ‘scale’. Although any ‘large’ bicycle is smaller than any ‘small’ train, the assigned answer values would suggest otherwise. Hence, the same values do not always represent the same quantitative information, thereby precluding the establishment of numerical traceability.
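Recoding such judgements numerically makes the inversion explicit; in the following sketch, the objects’ lengths and the numerical recoding are invented for illustration:

```python
# Minimal sketch of the size-judgement example: the same verbal 'answer
# scale' is applied separately to bicycles and trains, then recoded.

RECODING = {"small": 1, "medium": 2, "large": 3}

judgements = {
    ("bicycle", 1.9): "large",   # a large bicycle, ~1.9 m long
    ("train", 150.0): "small",   # a small train, ~150 m long
}

for (obj, length_m), answer in judgements.items():
    print(obj, length_m, "m ->", answer, "=", RECODING[answer])
# bicycle 1.9 m -> large = 3
# train 150.0 m -> small = 1
# The recoded values invert the true quantitative order:
# 3 > 1 although 1.9 m < 150.0 m.
```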
Numerical recoding of answer categories fails to establish numerical traceability
Assume these problems in data generation could be ignored and that a target property is specified, such as agreement in ‘Likert scales’. Could researchers establish numerical traceability by systematically assigning numerical values to respondents’ chosen answer categories (e.g., ‘1’ to ‘strongly disagree’, ‘2’ to ‘disagree’, ‘3’ to ‘neither disagree nor agree’, ‘4’ to ‘agree’, ‘5’ to ‘strongly agree’)? For these numerals to indicate quantitative information, the answer categories must reflect divisible properties of the target property.
Referring to Stevens’ ‘scale’ types, what would this mean? Regarding ‘interval scales’, often assumed for data analysis, can we assume that the difference between ‘strongly disagree’ and ‘disagree’ equals that between ‘disagree’ and ‘neither disagree nor agree’? Regarding ‘ordinal scales’, one could certainly say that ‘strongly agree’ indicates more agreement than ‘agree’. But could ‘agree’ reflect more agreement than ‘neither agree nor disagree’—which respondents often choose to indicate having no opinion or finding the item inapplicable (Uher 2018a)? And does ‘disagree’ really reflect a lower level of agreement than ‘agree’—or is disagreeing with something not rather an entirely different idea than agreeing with it? Likewise, what tells us that, in ‘happiness scales’, feeling ‘pretty happy’ versus ‘not too happy’ reflects only differences in intensity—thus, divisible properties of the same kind of emotion rather than emotions of qualitatively different kinds, like joy and sadness? Semantically, two different qualities can easily be merged into one conceptual dimension (as done in semantic differentials; Snider and Osgood 1969). But what divisible (quantitative) properties could be identified in such conglomerates of heterogeneous qualities?
Further inconsistencies occur. Recall that measurement scales involve identical units of defined magnitude. The conventional meaning of the magnitude of a 4-cm long measurand is established through the equality of its length to that of four concatenated centimetre units—it covers a ruler’s first, second, third and fourth centimetre unit. But this does not apply analogously to ‘agreement scales’—to indicate ‘agree’, one cannot also tick ‘strongly disagree’, ‘disagree’ and ‘neither disagree nor agree’ without introducing fundamental contradictions in meaning. That is, the (hypothesised) quantities that verbal ‘answer scales’ are intended to reflect do not match the quantitative relations ascribed to their numerically recoded values—not even when just ordinal properties are assumed—and thus fail to meet even basic axioms of quantity.
Moreover, the numerical assignment procedure differs fundamentally from that used in measurement. In ‘rating-scale’ based investigations, researchers do not assign numerals to measurands compared with a unit. Instead, they recode the ‘answer units’ themselves into numerals. Unit-free values, however, can provide information about neither the particular target property studied nor any specific quantity of it. Indeed, the same numerical value necessarily has different quantitative meanings—with regard to both different units indicating the same target quality (e.g., ‘4’ grams, ‘4’ ounces, ‘4’ tons) and different target properties (e.g., ‘4’ kilogrammes, ‘4’ metres, ‘4’ minutes). Furthermore, in measurement, numerical values are assigned with reference to the conventionally agreed and numerically traceable standard quantity indicated by the measurement unit. In ‘answer scales’, by contrast, the assigned numerals depend on study-specific decisions about the structural data format. For example, depending on the value range chosen for an ‘answer scale’ in a given study, its middle category can be recoded into very different numerical values (e.g., ‘0’, ‘3’, ‘4’ or ‘50’)—even when referring to the same item and meant to indicate the same quantitative information.
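This dependence on the chosen data format is easy to make concrete; the scale formats below are common but arbitrary examples:

```python
# Sketch: the same 'neither disagree nor agree' answer is recoded into
# different numerals depending on the study-specific scale format chosen.

scale_formats = {
    "-2..+2": [-2, -1, 0, 1, 2],
    "1..5":   [1, 2, 3, 4, 5],
    "1..7":   [1, 2, 3, 4, 5, 6, 7],
    "0..100": [0, 25, 50, 75, 100],
}

for name, values in scale_formats.items():
    middle = values[len(values) // 2]   # the scale's middle category
    print(f"format {name}: middle category recoded as {middle}")
# format -2..+2: middle category recoded as 0
# format 1..5: middle category recoded as 3
# format 1..7: middle category recoded as 4
# format 0..100: middle category recoded as 50
```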
Recoding the heterogeneous meanings of the verbal values of ‘answer scales’ into numerical values entails further problems because it implies that they reflect homogeneous meanings. Numeral–number conflation promotes the frequent ascription of mathematical properties to numerals (e.g., that ‘4’ is more than ‘3’ and ‘3’ more than ‘2’), ignoring that these quantity relations do not apply to the meanings of the verbal values thus recoded. This creates the illusion that the numerical recoding of ‘answer units’ could establish a universal meaning for the variable values, thereby enabling mathematical exploration of the phenomena and properties described. Following this erroneous belief, unit-free scores are often treated as if they represented ontological quantities that can be ordered, added, averaged and quantitatively modelled—ignoring that this applies neither to the answer categories used for data generation nor to the conglomerates of heterogeneous study phenomena and properties commonly described in rating ‘scales’.
Differential analyses cannot establish numerical traceability
Without unbroken documented connections both to the measurand of the target property studied (data generation traceability) and to a known quantity reference of that target property (numerical traceability), unit-free values are meaningless in themselves. The only option for creating meaning for such scores is to compare different cases with one another. For this purpose, psychologists and social scientists apply a differential perspective when analysing and interpreting scores, considering not the absolute scores ascribed to cases in themselves but the relative between-case differences that these reflect.
This shift in perspective justifies merging values obtained for different properties and study phenomena, as is commonly done to compute overall indices for constructs from the values obtained for their various construct indicators. For example, the Human Development Index is a summary score computed from various normalised values obtained for ‘life expectancy at birth’, ‘years of schooling’ and ‘gross national income per capita’ (Conceição 2019). Clearly, values in units of years and of monetary currencies cannot be meaningfully merged or compared in themselves because they refer to different qualities. This is possible only with regard to the differential information that they reflect (Uher 2011; Uher et al. 2013a). Differential analyses can enable meaningful comparisons and may circumvent problems arising from arbitrary algorithms for merging heterogeneous scores. But although statistically derived, differentially standardised scores do not establish systematic proportional relations to the primary quantity values from which they were derived—not even when these are measurement results (e.g., response times in milliseconds)—because differential standardisation is based on the score distributions in specific samples. Consequently, differential summary scores of heterogeneous qualities are derived through processes of artificial quantification but not through ‘construct measurement’ as widely believed (Uher 2020).
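A simplified sketch shows how such unit-free summary scores arise from normalisation; this is not the official HDI formula, and the goalpost values and raw values below are invented for illustration:

```python
from math import prod

# Simplified sketch of merging heterogeneous quantities into one index
# via min-max normalisation (not the official HDI formula; all numbers
# below are invented for illustration).

def normalise(value: float, low: float, high: float) -> float:
    """Map a raw value onto the unit interval relative to fixed goalposts."""
    return (value - low) / (high - low)

indicators = {
    # raw value, goalpost low, goalpost high
    "life expectancy (years)": (72.0, 20.0, 85.0),
    "years of schooling":      (12.0, 0.0, 18.0),
    "income per capita ($)":   (15000.0, 100.0, 75000.0),
}

indices = [normalise(v, lo, hi) for v, lo, hi in indicators.values()]
summary = prod(indices) ** (1 / len(indices))  # geometric mean of the indices
print(round(summary, 3))
# The unit-free summary score no longer stands in any traceable
# proportional relation to years or to monetary units.
```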
Assume we could ignore the problem that quantity values referring to different qualities cannot be meaningfully merged: could we at least summarise numerical values derived from answer categories meant to indicate the same quality? As the above analyses (e.g., of ‘agreement scales’) already showed, upon closer reflection, different answer categories do not actually reflect the quantitative properties implied by their numerically recoded values and thus cannot be quantitatively merged. Or would it be reasonable to assume that answering twice ‘neither disagree nor agree’ (3)—often used to indicate ‘no opinion’ or ‘not applicable’—corresponds to (roughly) the same quantity of agreement as answering once ‘strongly disagree’ (1) and once ‘strongly agree’ (5)—that is, having a split opinion or different item interpretations? In both cases, the arithmetic average of the numerically recoded answer categories would amount to ‘3’. Ignoring the meanings that verbal answer categories can actually have, and that the data-generating persons may actually have in mind, entails shifts in the quantitative meaning ascribed to numerical data derived from recoding answer categories.
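In code, the arithmetic of this example is trivial but makes the conflation plain:

```python
from statistics import mean

# Sketch: numerically recoded 'agreement' answers using the 1-5 recoding
# from above. Two qualitatively different response patterns yield the
# same arithmetic average.

no_opinion    = [3, 3]  # twice 'neither disagree nor agree'
split_opinion = [1, 5]  # once 'strongly disagree', once 'strongly agree'

print(mean(no_opinion), mean(split_opinion))  # -> 3 3
# Both average to '3', although the underlying answers may reflect
# 'no opinion' in one case and a split opinion in the other.
```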
Regardless of these problems, score distributions obtained from statistical modelling are often interpreted as reflecting the distributions of the (hypothesised) target property’s magnitudes in a sample. But differential scores cannot indicate quantities of a particular target property as in measurement because the quantitative meaning attributed to differential values depends on the distribution of all values in the sample considered, leading to reference group effects. For example, persons 1.70 m tall will obtain higher differential scores when compared to a sample of mostly shorter persons and lower differential scores when compared to mostly taller persons. That is, meaning for the quantity values that are ascribed to the still unknown magnitudes of the measurands of individual cases is created by comparing these ascribed values with one another—thus, by comparing many unknowns. This differs fundamentally from measurement where the measurand’s unknown quantity is compared with that of a known and specified reference quantity.
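Such reference-group effects can be made concrete with z-standardised scores; the sample heights below are invented for illustration:

```python
from statistics import mean, stdev

# Sketch: the same body height of 1.70 m obtains different differential
# (z-standardised) scores depending on the reference sample.

def z_score(value: float, sample: list[float]) -> float:
    return (value - mean(sample)) / stdev(sample)

shorter_sample = [1.55, 1.60, 1.62, 1.65, 1.68]
taller_sample  = [1.75, 1.80, 1.82, 1.85, 1.88]

print(round(z_score(1.70, shorter_sample), 2))  # positive: above this sample's mean
print(round(z_score(1.70, taller_sample), 2))   # negative: below this sample's mean
# The measurand (1.70 m) is unchanged; only its differential score shifts.
```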
Differential analyses may enable pragmatic quantification that is useful in many fields of research. But they have paved the way for the widespread fallacy of interpreting between-case differences as reflecting real quantities that are attributable to the single cases being compared. This problem is inherent also to psychometric ‘scaling’.
Modelling ‘psychometric scales’ cannot establish numerical traceability
Psychometric modelling likewise involves the transformation of numerical values based on their distribution patterns in given samples, and thus the normalisation of variable values. Many ‘IQ scales’, for example, are standardised such that the sample’s average is set to 100 and one standard deviation to 15. In a normal distribution, the scores of 68% of a sample’s individuals fall within one standard deviation of the sample’s average in both directions (IQ range 85–115) and the scores of 95% of the individuals fall within two standard deviations (IQ range 70–130). To maintain their ascribed differential meaning, IQ scores are normalised in various ways, such as for different age groups and different educational levels, but in particular for different cohorts, given the substantial increases in test performance during the twentieth century and recent decreases (Flynn 2012; Teasdale and Owen 2005). That is, persons are ascribed particular IQ scores on the basis of the norm variations established for their particular reference group.
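A minimal sketch of this standardisation, with invented raw summary scores:

```python
from statistics import mean, stdev

# Sketch: normalising raw summary scores onto an 'IQ scale' with the
# sample mean set to 100 and one standard deviation set to 15
# (raw scores invented for illustration).

raw_scores = [18, 22, 25, 27, 30, 33, 35, 38, 41, 46]

m, s = mean(raw_scores), stdev(raw_scores)
iq_scores = [100 + 15 * (x - m) / s for x in raw_scores]
print([round(iq) for iq in iq_scores])
# The same raw score would receive a different 'IQ' in any sample with a
# different mean or spread: the 'unit' is a sample parameter, not a
# defined quantity of the target property.
```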
Given this, the ‘units’ of ‘IQ scales’ do not indicate specific quantities of a hypothesised target property. Instead, they refer to the proportion of cases in the norming sample that obtained particular numerical summary scores (indicating, e.g., correctness on multiple test items). That is, ‘IQ scale units’ are ‘interval scaled’ with regard to the ranges of numerical summary scores—the meaning of which, however, varies with their distribution patterns in the samples studied. Hence, ‘IQ scale units’ refer to a population (sample) parameter. The common practice of normalising ‘IQ scales’ would correspond to defining metre scales on the basis of people’s average body heights. Given that average human height varies, such as by gender, country and socio-economic factors (e.g., in industrialised countries during the twentieth century), the specific reference quantity thereby defined as the length of a 1-metre unit would vary over space and time—and so would the measurement results obtained for one and the same measurand. This fundamentally contradicts the meaning and merits of measurement. Indeed, if all cases had the same quantity of a (hypothesised) target property, the differential approach (and thus also psychometric modelling) would be unable to determine their magnitudes, for lack of a specified quantity reference.
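The height-based ‘metre’ analogy can be spelled out directly; all heights below are invented for illustration:

```python
from statistics import mean

# Sketch: if a 'metre' were defined as a population's average body
# height, the same measurand would obtain different 'measurement
# results' in different populations.

population_a = [1.62, 1.65, 1.70, 1.72]  # average: 1.6725 m
population_b = [1.75, 1.80, 1.83, 1.86]  # average: 1.81 m

measurand = 1.70  # one and the same object length in (actual) metres

for name, pop in [("A", population_a), ("B", population_b)]:
    unit = mean(pop)  # the population-dependent 'unit'
    print(f"population {name}: {measurand / unit:.3f} 'units'")
# The measurand is unchanged, yet its numerical value varies with the
# reference population, contradicting the very point of measurement units.
```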
Normalising allows meaning to be created for scores through differential comparisons, but this entails shifts in the quantitative meaning that can be ascribed to these scores because this meaning is bound to the sample studied. Although based on scores ascribed in some way to individual cases (e.g., correctness, response speed), differential scores cannot be interpreted as reflecting properties of these cases in themselves. Normalising may be useful for pragmatic purposes but is entirely different from measurement.
Statistical results are commonly not interpreted in terms of the information actually encoded in the data
Further shifts in meaning occur during result interpretation. Let us consider again the example of ‘agreement scales’, which are clearly intended to reflect levels of agreement (ignoring the problems shown above). Surprisingly, the statistically analysed results are typically interpreted not as reflecting the respondents’ levels of agreement as enquired during data collection but as hypothetical magnitudes of the actual phenomena of interest (e.g., those described in constructs of ‘extraversion’, ‘neuroticism’, ‘happiness’ or ‘honesty’). Can agreement reasonably be assumed to be a property inherent to these diverse phenomena, or does agreement not rather form part of the judgement process itself? After all, people can agree on the length of different lines (as in Solomon Asch’s classic social conformity experiments; Asch 1955); still, this agreement is not a property of these lines but of the persons judging them. While this is an obvious example, in psychological and social-science investigations it is difficult to disentangle the psychical phenomena involved in the judgement processes from the phenomena to be judged—thus, to distinguish the means of investigation from the phenomena under study.