Scaling preferences using probabilistic choice models: is there a ratio-scale representation of subjective liking?

In two online experiments, we tested whether preference judgments can be used to derive a valid ratio-scale representation of subjective liking across different stimulus sets. Therefore, participants were asked to indicate their preferences for all possible pairwise comparisons of 20 paintings (Experiment 1) and 20 faces (Experiment 2). Probabilistic choice models were fit to the resulting preference probabilities (requiring different degrees of stochastic transitivity), demonstrating that a ratio-scale representation of the liking of both paintings and faces can be derived consistently from the preference judgments. While the preference judgments of paintings were consistent with the highly restrictive Bradley–Terry–Luce model (Bradley and Terry, Biometrika 39:324–345, 1952; Luce, 1959), the liking of faces could be represented on a ratio scale only when accounting for face gender as an additional aspect in an elimination-by-aspects model. These ratio-scaled liking scores were then related to direct evaluative ratings of the same stimuli on a 21-point Likert scale, given both before and after the pairwise comparisons. It was found in both studies that evaluative ratings can be described accurately as a logarithmic function of the indirectly derived liking scores for both types of stimuli. The results indicate that participants are able (a) to consistently judge preferences across two heterogeneous stimulus sets, and (b) to validly report their liking in direct evaluative ratings, although the numeric labels derived from direct evaluative ratings cannot be interpreted at face value for ratio-scaled liking scores.


Introduction
The use of direct evaluative ratings to measure subjective liking or disliking 1 is ubiquitous in many areas of fundamental and applied research such as educational psychology (e.g., Greenwald & Gillmore, 1997), aesthetic judgments (e.g., Hekkert & van Wieringen, 1990) medicine (e.g., pain assessment scales; Gordon, 2015), and marketing research (e.g., Raines, 2003;Wildt & Mazis, 1978). It is particularly prevalent in social cognition research, for instance for measuring the changes in liking resulting from mere exposure (Zajonc, 1968) or evaluative conditioning (De Houwer et al., 2001). In social psychology, evaluative ratings are often used to study attitudes (Olson & Zanna, 1993), a central concept in social psychology (Allport, 1935).
In a typical 'evaluative rating' procedure, the participant is asked to report the degree of subjective liking or disliking of a given stimulus on a predetermined scale (e.g., a numerical rating scale, a visual analog scale, a feeling thermometer, or a scale with verbal labels such as 'not at all' and 'very much'). Another approach to assess the liking of a stimulus is to ask the participant to make a comparative judgment with regard to two stimuli to indicate which of the two stimuli is liked more (evaluated more positively), and we refer to this measure as 'preference judgments'.
Both evaluative ratings and preference judgments are forms of evaluative responses. Although there are debates about the relationship between an evaluative response and the underlying mental construct (e.g., Breckler, 1984;De Houwer et al., 2013, 2021Krosnick et al., 2005;Kruglanski & Stroebe, 2005), most researchers assume that the evaluative response can be distinguished from a mental representation, which is sometimes referred to as the attitude (Eagly & Chaiken, 1993;Krosnick et al., 2005;Zanna & Rempel, 2008). Others see attitudes as complex entities, consisting of behavioral, cognitive, and affective components (Breckler, 1984;Rosenberg et al., 1960). In any case, the question arises how a presumably multidimensional construct such as an attitude or a subjective liking is translated into an evaluative rating (e.g., choosing a point on a one-dimensional scale) or a preference judgment.
From a psychophysical perspective, evaluative ratings are an example of direct scaling (Stevens, 1971), which requires participants to introspectively judge the intensity of a subjective liking of a stimulus and to be able to verbally report this intensity on a particular scale (e.g., '7' on a scale from '1' to '9', or the 73% point on a visual analogue scale). Hence, it is implicitly assumed that (a) the underlying subjective liking is represented validly at a certain scale level and (b) the participants are able to verbally report their liking on a numerical rating scale (i.e., the chosen value on the scale can be interpreted as the intensity of liking). Unfortunately, it is not well understood whether participants' direct evaluative ratings can be taken as a valid scale of their liking, and what level of measurement applies to this scale for any given set of stimuli (e.g., an ordinal or a ratio scale; Stevens, 1946). In other words, the validity of evaluative ratings as a method of direct scaling is questionable.
It is possible to address these questions empirically using scaling models that are based on axiomatic measurement theory, which specify the exact conditions that are necessary for direct measurement on a particular scale level (e.g., Iverson & Luce, 1998;Narens & Luce, 1986). The main advantages of such models are that (a) the mathematical pre-conditions of scaling on a particular level (e.g., on a ratio scale) can be tested empirically, and (b) the data collection is separated from the derivation of a scale (in contrast to direct scaling, which requires participants to respond with a numerical value on the scale itself, as in evaluative ratings). One such model is the Bradley-Terry-Luce (BTL) model (Bradley & Terry, 1952;Luce, 1959) allowing the derivation of a ratio scale of subjective liking from the preference judgments in a full set of (consistent) pairwise comparisons. The BTL model is a highly restrictive probabilistic choice model based on Luce's choice axiom (Luce, 1959), assuming that the choice between multiple alternatives (i.e., preference judgments) is probabilistic and can be predicted as a function of the respective weights (u) of the alternatives (e.g., the intensity of liking of a stimulus). The empirically observed preference probabilities ( p ab , indicating the probability of preferring stimulus a over stimulus b in a full pairwise comparison) can be related to the values on a ratio scale, representing the weights (or liking) u of the two stimuli (see Eq. 1).
However, the BTL model makes very strong assumptions on the structure of the data, and a violation of these assumptions precludes fitting the model to the data. In particular, the model requires that the choice between two stimuli must be independent of the context provided by the entire stimulus set (i.e., context independence or independence of irrelevant alternatives). For example, the preference for clementine over grapefruit must be independent of the entire set of stimuli presented to the participant (e.g., whether the set consists of only citrus fruits or includes other food products such as cheesecake as well). Obviously, this assumption can fail especially in case of a multidimensional stimulus space or when there are similarities within certain subgroups of the stimuli (e.g., Carroll & Soete, 1991;Choisel & Wickelmaier, 2007;Debreu, 1960;Rumelhart & Greeno, 1971). The context independence property requires the observed preference judgments to be highly consistent, which can be tested empirically for any given set of data (i.e., full pairwise comparisons) in terms of violations of the transitivity axiom. Transitivity of preference judgments requires that whenever stimulus a is preferred over stimulus b, and this stimulus b is preferred over a third stimulus c, then a should be preferred over c as well. Different levels of stochastic transitivity can be distinguished with regard to the preference probabilities (see Eq. 2), and the BTL model requires strong stochastic transitivity (SST) to hold. According to the SST axiom, the probability to prefer stimulus a over stimulus c ( p ac >0.5) must be larger than both the two probabilities to prefer a over b ( p ab >0.5) and b over c ( p bc >0.5). Importantly, if the restrictive BTL model cannot be fitted to the data due to systematic violations of SST (i.e., when the number of transitivity violation exceeds the number that would be expected by chance alone), then it may be possible to still derive a less restrictive probabilistic choice model, such as the generalized elimination-by-aspects (EBA) model (Tversky, 1972;Tversky & Sattath, 1979). The EBA model only requires moderate stochastic transitivity (MST), so that it is sufficient when the probability p ac is larger than any of the two probabilities p ab or p bc . Finally, weak stochastic transitivity (WST) requires that the probability to prefer a over c must be greater than 0.5 whenever the probabilities p ab and p bc are greater than 0.5. A systematic violation of WST indicates that participants may not be able to integrate multiple relevant stimulus dimensions (e.g., symmetry, colorfulness, familiarity) on a single scale (e.g., liking), which makes it impossible to even derive a meaningful rank order of the stimulus likings (Choisel & Wickelmaier, 2007).
According to the EBA model (compare Eq. 3), participants are supposed to judge several features of stimuli separately (i.e., the aspects , , … ), and aspects are chosen by the participant based on their individual weights (u). A stimulus a will be preferred over stimulus b, whenever a crucial aspect ( ; e.g., symmetry) is present in a, but not in b ( ∈ a � �b � )-or, in other words, stimuli not containing the crucial aspect will be eliminated successively. Technically, the BTL model is a special case of EBA with only one aspect per stimulus (i.e., a unidimensional stimulus set). However, in contrast to the BTL model, the EBA model does not require context independence.
The first purpose of this study was to test whether simple preference judgments can be described with a probabilistic choice model, allowing us to derive a ratio-scale representation of the underlying 'liking' weights of the stimuli. The preference judgments from a full pairwise comparison of all stimuli were then tested for weak, moderate, and strong stochastic transitivity. Based on the outcome of these consistency checks, a representation of the liking scores on a ratio scale was derived using either the BTL model or the generalized EBA model (note that both models can fail). The second objective of the present study was to test whether the commonly used direct evaluative ratings on a 21-point Likert scale are valid on a ratio scale, which would allow the interpretation of both ratios and differences between ratings on the scale at face value. Therefore, the direct evaluative ratings were described as a mathematical function of the ratio-scaled liking scores of the stimuli (the u-values) derived from the preference judgments using a probabilistic choice model. To demonstrate the generalizability, preference judgments and evaluative ratings were investigated for two qualitatively different sets of stimuli: abstract paintings and pictures of human faces (including males and females).

Participants
One hundred and forty-five participants (72 women, 73 men) were recruited via Prolific Academic (https:// www. proli fic. co/). Ages ranged between 18 and 73 years (M = 35.3; Mdn = 33; SD = 12.6). Participants were allowed to take part in the study only if they had normal or corrected-to-normal vision, English as their first language, and an approval rate of at least 95%. Five additional participants had been recruited, but they did not complete all tasks and their data was not included. All participants confirmed to participate voluntarily and agreed to an informed consent sheet by clicking on a checkbox before starting the experiment. The entire task took about 12 min on average and participants were compensated with £1.10 on Prolific.

Apparatus and stimuli
The evaluative rating and pairwise preference judgment tasks were programmed in JavaScript and conducted entirely online (using jsPsych;de Leeuw, 2015). A set of 20 pictures of paintings from four different artists (five pictures each from Jean Dubuffet, André Masson, Robert Motherwell, and John Wells) were chosen from the TATE Modern online art collections (https:// www. tate. org. uk/ art/ artis ts/a-z) and used as the to-be-evaluated stimuli in Experiment 1. The particular pictures were selected to be of similar complexity, rather non-representational (i.e., abstract art) and from artists not very well known by the general population.

Procedure
The study started with the evaluative rating phase, asking participants to judge their liking of each of the 20 pictures on a 21-point Likert scale (which is typically used in studies on evaluative conditioning; e.g., Baeyens et al., 1990;Gast & Kattner, 2016;Hammerl & Grabitz, 2000) ranging from "totally dislike" (− 10) via "neutral" (0) to "totally like" (10). There was no time limit for clicking on the scale. The clicked position on the scale was highlighted for 250 ms, and the next rating started after a 500-ms blank screen interval. The order of the pictures was randomized for each participant.
After the first evaluative rating phase, participants were presented with all possible n(n−1) 2 = 190 combinations of two paintings in the preference judgments task. The paintings of each pair were shown simultaneously in the left and right half of the screen (the order was chosen randomly), and participants were asked to indicate which painting they liked better by clicking on the image. There was no time limit to the choice. The clicked picture was highlighted for 250 ms, and the next pair was presented after a 500-ms intertrial interval with a blank screen. The order of the pairs was randomized for each participant.
After the preference judgments, participants were asked to provide another set of evaluative ratings of all stimuli using the same procedure as described above. At the end of the study, participants were asked to indicate whether they recognized any of the painters of the pictures by typing their names in a text box. Finally, participants should indicate how much they were interested in art on a 5-point scale from "not at all" to "very much".

Results
Only one participant recognized one of the painters correctly ("Motherwell"). Most participants reported to have recognized none of the pictures or artists, and there were only a few incorrect responses (e.g., "Klimt", "Picasso", "Klee", or "Cezanne"), thus confirming that the artists were not well known in our sample. The average rating for the interest in art was 2.90 (SD = 1.16).
The pairwise preference judgments were first checked for consistency by testing the number of observed circular triads (when a > b, b > c, but a < c) against the number of circular triads that would be expected by chance alone (285 triads) using the{eba} package for R (Wickelmaier, 2020;Wickelmaier & Schmid, 2004). There was an average number of 39 circular triads out of a maximum of 330. In one participant, the number of circular triads did not differ significantly from chance performance (T = 279 circular triads; Kendall's zeta = 0.15; p = 0.65), indicating highly inconsistent or random choice behavior, and the data of this participant were not included in the subsequent analyses. For the remaining participants, a cumulative preference matrix is shown in Table 1. This matrix contains the absolute frequencies of preferences of a painting in the row of the table over the painting in the respective column of the table. As a measure of consistency of the choice behavior across participants, stochastic transitivity was checked for all possible triads of stimuli (a, b, c) in the cumulative preference matrix. There were only two violations of WST (in 1140 tests), and an approximate likelihood ratio test (Iverson & Falmagne, 1985;Tversky, 1969) revealed that these violations were not significantly different from what would be expected by chance if transitivity held, D = 0.028; p = 0.87, thus indicating that an ordinal scale of the paintings can be derived from the preference choices. Further, there were also relatively few violations of the MST (12 triads; 1%) and SST criterion (206 triads; 18%). A BTL probabilistic choice model was then fitted to the preference probabilities to estimate the u-scale weights of liking of the 20 paintings using maximum-likelihood estimation (using the "OptiPt" function; Wickelmaier & Schmid, 2004). A likelihood ratio test was conducted to test the validity of the model, contrasting the likelihood of the BTL model with the likelihood of a restricted model assuming the preference probabilities to result from independent binomial distributions: −2ln L BTL ∕L restricted . The G-test of goodness of fit revealed that the data did not deviate significantly from the BTL model, G 2 (171) = 164.4; p = 0.63, indicating that the BTL model provides an accurate description of the preferences. Therefore, it was possible to estimate the u-scaled values of the liking of all 20 paintings with a restrictive probabilistic choice model. An arbitrary value of 1 was assigned to the painting with the lowest u-scale value ("Dubuffet_3"), and all other values were scaled relative to this reference. The resulting ratio scale of the liking of painting is illustrated in Fig. 1. It can be seen that the least and most liked paintings are separated by a factor of 16.58, which suggests that (to our non-expert sample) the liking of painting "Wells_1" was more than 16 times stronger than the liking the painting "Dubuffet_3".
The relationship between the direct evaluative rating and the BTL-scaled u-values of the liking of the paintings (as derived from the pairwise preference judgments) is illustrated in Fig. 3. A non-linear regression using a least-squares method revealed that a two-parameter logarithmic function of liking scores u provided a good fit of the relationship between the evaluative ratings (ER) and the BTL-scaled liking weights (see Eq. 4; with = 5.39; = 5.71 and = 3.96; = 7.49 for the best fits of the evaluative ratings that were given before and after the pairwise comparisons, respectively).

Discussion
Experiment 1 demonstrated that the subjective liking of abstract paintings can be expressed consistently in a full set  Correlation between the first evaluative ratings (before the pairwise preference judgments) and the second evaluative ratings (after the pairwise preference judgments) on a scale from − 10 to + 10 for the 20 paintings of four different artists presented in Experiment 1. Error bars indicate standard errors of the means. The solid line represents a linear regression between the two ratings, and the dashed line represents the diagonal Fig. 3 Evaluative ratings before and after the pairwise comparisons as a function of the BTL-scaled liking weights of the 20 paintings. The solid line is the best-fitting logarithmic regression predicting the first and second ratings as a function of the liking scores based on the BTL model of preference judgments, enabling the estimation of ratioscaled liking scores using a probabilistic choice model. Specifically, the fact that the data can be described with the highly restrictive BTL model proves that the subjective preference probabilities of the paintings from four different artists exhibit "context independence", meaning that the preference judgments for any pair of paintings are based on the same evaluative aspects, both between and within different artists (i.e., the set of aspects that are relevant for the evaluation of paintings remains the same across contexts). In contrast, if different aspects had been considered depending on the context (e.g., different evaluative aspects in different artists), then the model would have failed statistically.
Since the subjective liking of paintings, as expressed in preference judgments, could be represented on a BTL ratio scale, we were able to relate this mathematically grounded ratio scale of liking scores to the direct evaluative ratings given by participants. In contrast to the indirect derivation of a BTL scale of liking scores (which requires only preference judgments), direct evaluative ratings are based on the untested assumption that participants are able to assign numerical values to the stimuli that provide a valid representation of the degree of subjective liking. In the present experiment, a typical evaluative rating procedure was used, asking participants to rate the liking of the paintings on a 21-point scale ranging from − 10 to + 10. Interestingly, these evaluative ratings-given both before and after the pairwise preference judgments with the same stimulus set-were found to be (a) highly consistent (i.e., high re-test reliability) and (b) strongly related to the BTL-scaled liking scores, thus demonstrating the general validity of direct evaluative ratings. Nevertheless, there was no simple linear relationship between the derived liking scores on the BTL scale and the directly reported evaluative ratings. The observation that evaluative ratings can be described accurately as a logarithmic function of the BTL-scaled liking scores suggests that directly reported values on the rating scale cannot be taken at face value and may not be valid on a ratio-scale level. Hence, differences and ratios between any two numerical values on the rating scale cannot be interpreted in terms of mathematical numbers (e.g., the difference between an evaluative rating of '3' and '5' may not be the same as the difference between a rating of '5' and '7', and an evaluative rating of '4' does not mean that the stimulus is twice as pleasant as a stimulus with a rating of '2'). The strong relationship between direct evaluative ratings and a ratio-scale measure of subjective liking, however, suggests that an exponential transformation of the direct ratings may represent a valid ratio scale of the subjective likings.

Experiment 2
To test the reliability and generalizability of the findings of Experiment 1, a second experiment was conducted using an entirely different set of stimuli (pictures of male and female human faces). Again, participants were asked to indicate the liking of the faces both in a full pairwise preference judgment task and through direct evaluative ratings. The procedures and data analysis strategies of Experiment 2 were pre-registered on OSF in January 2021, prior to the data collection: https:// osf. io/ h9d5k.

Method Participants
A sample of 197 participants (155 women, 80 men, 1 queer, 1 nondisclosure) were recruited via Prolific Academic. The data of ten additional participants were incomplete and could not be included in the analyses. Participants were allowed to take part in the study only if the pre-screening confirmed normal or corrected-to-normal vision, English as their first language, an approval rate of at least 95%, and that they did not take part in Experiment 1. Ages ranged between 18 and 75 years (M = 33.7; Mdn = 33; SD = 12.3). The entire experiment took about 13 min, and all participants were compensated with £1.70 via Prolific (note that the duration was similar to Experiment 1, but the payment was increased because the actual duration of Experiment 1 was longer than expected). All participants provided informed consent by clicking on a checkbox before starting the task.

Apparatus and stimuli
The stimuli were presented in an online task, using essentially the same experimental routines as in Experiment 1 (programmed in JavaScript using jsPsych).
A set of 20 young (age group 18-29 years), Caucasian faces was selected from a data base of adult facial stimuli (Minear & Park, 2004). Ten male and ten female faces were selected (according to the categorization provided by Minear & Park, 2004), all with neutral facial expressions. We opted for only Caucasian and only young faces to have a data set only characterized by one type of (widely inferred) group membership (gender). With these constraints, a fully random sample of facial images was drawn from the full set of 137 images in these categories (using the sample() function in R), thus avoiding any biases due to subjective selection by the experimenters.

Procedure
The procedure was identical to Experiment 1, except that only about half of the participants (n = 101) rated the 20 faces on a 21-point Likert scale before the pairwise comparisons (thus allowing us to compare unbiased pairwise comparisons with pairwise comparisons that followed direct evaluative ratings). All participants completed a full pairwise comparison of all 190 pairs of faces using the same procedures as in Experiment 1, and subsequently rated the faces on the Likert scale.

Results
As in Experiment 1, the pairwise comparisons of faces were checked for (in)consistency in terms of circular triads (when a > b, b > c, but a < c) using the{eba} package for R (Wickelmaier, 2020;Wickelmaier & Schmid, 2004). On average, participants produced 53.2 circular triads out of a maximum of 330, but the number of circular triads was significantly below chance in all but one participant, who made T = 279 circular triads (Kendall's zeta = 0.18; p = 0.34). As this indicated random choice behavior, the data of this participant were not included in the subsequent analyses. The cumulative preference matrix for the remaining participants is shown in Table 2, depicting the absolute numbers of preferences for each face in the rows compared to the respective faces in the columns of the table. Stochastic transitivity checks for all possible triads of stimuli in the cumulative preference matrix revealed only two violations of WST (in 1140 tests), which was not significantly different from the number of transitivity violations expected by chance alone, D(3) = 2.23; p = 0.53. Therefore, the pairwise preference judgments fulfill the assumptions to derive an ordinal scale representing the rank order of the liking of the faces. There were also very few violations of MST (6 triads; 0.5%) and SST (in 200 triads; 17.5%)-slightly less than in Experiment 1.
The BTL model was again fitted to the preference probabilities to derive u-scale values of the 20 face likings. However, in contrast to Experiment 1, the maximum-likelihood test revealed that the face preference judgments deviated significantly from the restrictive BTL model, G 2 (171) = 225.3; p = 0.003, indicating that the BTL model does not provide an accurate description of the data. Therefore, a less restrictive EBA model was fitted to the data, including the gender of the faces as an additional aspect in the model structure.
The maximum-likelihood test revealed that the data did not deviate significantly from this model, G 2 (170) = 196.9; p = 0.08, and a model comparison confirmed that this EBA model (AIC = 1253) provided a more accurate description of the data than the BTL model (AIC = 1279), D(170) = 28.46; p < 0.001.
Based on the EBA model, u-scale values were estimated for the subjective liking of all 20 faces and normalized with regard to the face with the lowest liking, which was assigned a value of 1 (the reference). The resulting u-values of the liking of the 20 faces are illustrated in Fig. 4, revealing a considerable range of likings. Specifically, the most liked face (#F35) was found to be liked 25.6 times more than the reference face (#M93).
The average evaluative ratings in the first rating phase ranged between M = − 2.29 (SD = 3.52) and M = 3.60 (SD = 3.09). In the 100 participants who rated the faces only after the pairwise comparisons, the average evaluative ratings ranged between M = − 2.05 (SD = 3.95) and M = 5.56 (SD = 2.88). As in Experiment 1, there was a strong and significant correlation between the first and second evaluative ratings (in participants who rated the faces both before and after the pairwise comparisons), r = 0.97; t(18) = 18.14; p < 0.001, indicating high re-test reliability for the evaluative ratings of faces. In addition, a 2 (time of measurement) × 2 (face gender) repeated-measures ANOVA on evaluative ratings revealed a significant increase from the first (M = − 0.38; SD = 1.72) to the second ratings (M = 1.47; SD = 1.75), F(1,95) = 52.22; p < 0.001; η 2 G = 0.06, (compare Fig. 5). The evaluative ratings also differed significantly between male (M = -0.29; SD = 1.99) and female faces (M = 2.14; SD = 1.86), F(1,95) = 115.97; p < 0.001; η 2 G = 0.25. However, there was no interaction between time of measurement and the gender of the face, F(1,95) = 0.16; p = 0.69; η 2 G < 0.01, indicating that the face gender difference did not change with repeated evaluative ratings (in contrast to the differences between artists observed in Experiment 1). Figure 6 shows the direct evaluative rating before and after the pairwise preference judgments (including only the 'after' ratings of participants, who did not rate the faces before) as a function of the u-scale values estimated by the EBA model. As in Experiment 1, the evaluative ratings could be predicted quite accurately as a logarithmic function of the u-scale values of liking, both for the ratings before (a = 4.08; b = 3.93; p < 0.001) and after the pairwise comparisons (a = 5.70; b = 2.48; p < 0.001). A linear regression of the evaluative ratings as a function of the logarithm of the u-values also provided a very good fit of the data, accounting for about 93% of the variance for the ratings that were given before (R 2 = 0.93) and 98% of the variance for the ratings after the pairwise comparisons (R 2 = 0.98).
Based on such a strong relationship between the evaluative ratings on the Likert scale and the underlying liking weights of the stimuli (as derived with the EBA model), it might be possible also to derive a generalizable function to estimate the liking on a ratio scale based on the evaluative ratings alone, without having to complete a full pairwise comparison. In the present example, the u-scale values of liking can be predicted reliably as an exponential function of the first and second ratings (see Eqs. 7 and 8), accounting for 89% and 92% of the variance in direct evaluative ratings.

Discussion
Experiment 2 showed that the highly restrictive BTL model failed to derive a valid ratio-scale representation of the subjective liking of facial stimuli from the preferences judgments in male and female participants together, suggesting that facial photographs of young Caucasian women and men (in contrast to the paintings from different artists used in Experiment 1) are evaluated based on multiple discriminating aspects, which are difficult to integrate on a unidimensional scale. For the present set of male and female faces, it is plausible that such a discriminating aspect might be the gender of the face, and we found that an EBA model accounting for gender as an additional aspect provided a more accurate description of the data (note that this cannot be done in the more restrictive BTL model). Hence, it was possible to derive a valid ratio-scale representation of the liking of facial pictures, and these liking scores could be related to the corresponding direct evaluative ratings. In addition, two separate BTL models fitted separately to male and female participants' preference probabilities revealed an even better description of the preference judgment data.
Consistent with Experiment 1, it was found that direct evaluative ratings can be described quite accurately as a logarithmic function of the indirectly obtained liking scores on a ratio scale (accounting for more than 90% of the variance in evaluative ratings given either before or after the pairwise comparisons). This suggests again that differences and ratios of numerical values on the evaluative rating scale must not be interpreted at face value, but an exponential transformation can be used to derive liking scores that are valid a ratio scale (as the u-values derived through probabilistic choice models).

General discussion
Two experiments demonstrated that a ratio-scale representation of the subjective liking of stimuli can be derived from the preference judgments in full pairwise comparisons based on a probabilistic choice model. In Experiment 1, the liking of paintings from four different artists could be scaled with the highly restrictive BTL model, revealing a factor 16 between the liking of most disliked and the most liked painting. In Experiment 2, we found that the liking of male and female faces could be scaled on a ratio scale either with an EBA model accounting for the gender of the face as an additional aspect, or with two different BTL models that were fitted separately to the judgments of male and female participants.
In general, these findings indicate that affective preferences can be judged consistently for a given set of stimuli, both across participants and across multiple binary comparisons, allowing the derivation of a ratio-scale representation of the subjective liking weights of the stimuli. In contrast to a scale obtained from direct evaluative ratings, this scale of liking scores is based on the statistical assumptions of a probabilistic choice model, which had been tested for two independent data sets-and there is always a chance for the model to fail. In fact, while the restrictive BTL model provided a good fit to the preference judgments for abstract paintings in Experiment 1, it failed for the preference judgments of facial stimuli in Experiment 2. This suggests that pictures of human faces may be a stimulus set that is too heterogeneous and characterized by multiple attributes (aspects), which may lead to the formation of subgroups of similar stimuli (e.g., male and female faces) and possibly prevents the integration of different features or aspects on a unidimenstional liking scale. In the resulting multidimensional stimulus space, violations of the assumption of context independence become more likely with an increasing number and variety of to-be-compared stimuli (Tversky & Sattath, 1979). The finding that even a relatively homogenous stimulus set (young Caucasian faces) with only one clear categorical division (gender) can lead to inconsistent preference judgments has implications for research areas where stimulus sets that are structured into subgroups are to be evaluated on the same scale (e.g., investigating prejudice against groups or temporally unstable preferences for healthy or unhealthy food). However, even a slightly more complex EBA model with a single additional aspect for the gender of the face could be fitted successfully to the preference data, indicating that the judgments were highly consistent despite the existence of a multidimensional stimulus space.
In addition to the mathematical foundation of scaling, the degree of liking of the stimuli on the resulting scale can be interpreted in terms of mathematical ratios, allowing statements such as "painting A is liked eight times better than painting B". In contrast, such an interpretation is not valid for the numerical liking scores obtained through direct evaluative ratings. However, in the present study, the evaluative ratings were related to the liking scale based on the probabilistic choice models, indicating that the evaluative ratings can be described as a logarithmic function of the underlying ratio-scaled liking score. Hence, the multiplication of a liking with a particular factor (e.g., to like a stimulus twice as much) will result in a constant increment on the evaluative rating scale. Such a non-linear relationship between an underlying sensation and direct reports on a numerical scale (e.g., magnitude estimation) has been observed in many psychophysical studies (Stevens & Galanter, 1957), and more recently also for affective dimensions (Zimmer et al., 2004).

Open practices statement
The data of the present two experiments (pairwise comparisons and direct evaluative ratings of paintings and faces, contained in csv-files) and the analysis code (R) are openly available as a working example in an Open Science Framework (OSF) repository at this link: https:// osf. io/ vt6gb/? view_ only= 846ec ab021 5e453 dad08 dfe96 1d848 a5 The analysis code can be used to fit two types of probabilistic choice model to preference judgments (using the {eba} package): the BTL model and an EBA model with one additional aspect. In addition, the code involves frequentist statistics (e.g., using the {ez} package), and allows to reproduce the main figures from this article. Additional information such as the experimental routines (JsPsych) can be made available upon request.
Funding Open Access funding enabled and organized by Projekt DEAL. There is no specific funding to this project.

Conflict of interest None.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.