A fundamental challenge in the study of human behavior is identifying the shared and idiosyncratic contributions to judgments. Every judgment a person makes agrees to some degree with the judgments of other people. In many psychology studies, this agreement or consensus is often equated with meaningful variance in the data. However, for any judgment, stable idiosyncratic differences might explain a larger proportion of the variance than is explained by consensus. Hence, it is essential to estimate the shared (consensus) and idiosyncratic contributions to judgments. Different methodological approaches to estimating these contributions can lead to radically different theoretical implications and can change how we think about the nature of human preferences.

As such, estimating the shared and idiosyncratic contributions to judgments is particularly relevant to the domains of basic science and replicability, and it has important practical policy implications in a variety of domains, ranging from legal sentencing decisions to medical diagnosis. The relative influences of people (i.e., idiosyncrasy) or stimuli (i.e., sharedness) on judgments are key to theoretical debates on finding general principles of, for example, morality (Heiphetz & Young, 2017), intergroup relations (Carter & Murphy, 2017; Xie, Flake, & Hehman, 2019), face preferences (Cunningham, Roberts, Barbee, & Druen, 1995; Grammer & Thornhill, 1994; Hehman, Sutherland, Flake, & Slepian, 2017; Hönekopp, 2006; Langlois et al., 2000; Rhodes, 2006), and object or art preferences (Kurosu & Todorov, 2017; Leder, Goller, Rigotti, & Forster, 2016; Schepman, Rodway, & Pullen, 2015; Vessel, 2010; Vessel, Maurer, Denker, & Starr, 2018). These estimates can also contribute practical knowledge to current discussions of replicability. Replications often focus on recapturing the effect sizes of different manipulations. Another criterion by which to judge replications is the extent to which similar influences of participants or stimuli (Gantman et al., 2018), or even unexplained variance (Doherty, Shemberg, Anderson, & Tweney, 2013), are recaptured across studies. How a replicated (or failed-to-replicate) effect size is interpreted depends on whether the relative influences of people, stimuli, and unexplained variance underlying the effect change or stay the same. Finally, practical concerns in a variety of domains rely on these estimates. Quantifying how judgments vary by judges or cases can shape our understanding of whether the judicial system is unjustly punitive through inconsistent sentencing (Austin & Williams III, 1977; Forst & Wellford, 1981; Hofer, Blackwell, & Ruback, 1999) and of how reliably medical procedures identify diagnoses (Shoukri, 2011).

One reason for varied conclusions about the uniqueness or sharedness of judgments is that there is currently no agreed-upon optimal estimation method. This lack of a standard has led to the use of many methods and to divergent conclusions about the nature of judgmental agreement in many fields. For example, the decision to use a difference-in-means approach or a variance decomposition approach can lead to opposite theoretical interpretations of whether judgments of racism are shared within and across racial groups, which has implications for how antiprejudice endeavors should be structured (Martinez & Paluck, 2019). Consequently, the aim of the present study is to identify a general method for disentangling sources of agreement in judgments across a variety of stimuli with different psychometric properties.

All estimation methods require repeated judgments so as to extract the two critical measurements on which idiosyncratic and shared judgments are based: intrarater reliability and interrater agreement. Unfortunately, very few psychology studies include repeated measures, which are essential for measuring not only intrarater reliability across time, but also the general data quality and potential replicability of findings. In the extreme case in which intrarater reliability is indistinguishable from zero, computing interrater agreement is meaningless, and any results will be spurious.

Standard practices for estimating rater agreement

Judgment studies often use several techniques to estimate rater agreement. The most common measure of this agreement is Cronbach’s alpha. However, one can obtain very high estimates of alpha with a sufficiently large sample of raters even when the agreement between individual raters is extremely low (Hönekopp, 2006; Kramer et al., 2018). A better measure of rater agreement is the interrater correlation, where higher correlations indicate greater consensus. This correlation can be computed between individuals or groups (Kurosu & Todorov, 2017; Ma, Xu, & Luo, 2016; Zebrowitz, Montepare, & Lee, 1993) or between individuals and the group’s averaged ratings (Engell, Haxby, & Todorov, 2007; Germine et al., 2015; Zebrowitz, Franklin, Hillman, & Boc, 2013). How these correlations are computed can lead to very different interpretations of interrater agreement. To the extent that the average agreement between individual raters is greater than zero, correlations between the aggregated ratings across raters are bound to inflate interrater agreement. Such correlations do not necessarily mean high consensus among raters, since the average correlation between any two individual raters could be much lower.

The same issues arise for correlations between participant ratings and measured or computed stimulus attributes. For example, the correlation between face ratings and the physical size of different facial features might be used to estimate the size of the relationship between physiognomic characteristics and social judgments. Whereas some studies correlate stimulus attributes with the group’s average ratings (Cunningham et al., 1995), it is also possible to examine how these attributes correlate with individual ratings (Hönekopp, 2006; Jacobsen, Schubotz, Höfel, & Cramon, 2006; Kurosu & Todorov, 2017). Again, the latter correlations are bound to be lower than the former.

The fact that individual-level correlations are lower than aggregated-level correlations suggests much larger idiosyncratic contributions to judgment. However, it is also possible that these low correlations result from measurement error or unreliability of the individual raters. The only way to rule out the latter explanation is to compute intrarater correlations: a measure of how consistent a rater is across two or more time points. Unfortunately, even when repeated ratings are collected, intra- and interrater correlations are typically examined separately. But we cannot assume that these correlations are independent. When the intrarater correlations are positive, we would expect that averaging within raters across repeated measurements would increase their reliability and, consequently, result in higher estimates of the idiosyncratic contributions to judgments. But it is also possible that more reliable individual ratings would result in higher interrater agreement and, consequently, in higher estimates of the shared contributions to judgments. If so, one might derive very different estimates of idiosyncratic and shared contributions, depending on the number of repeated measurements.

Estimating shared versus idiosyncratic contributions

A general method for estimating shared versus idiosyncratic contributions to judgments is to compare the variance components from different clusters in the data (Hehman et al., 2017; Hönekopp, 2006; Kenny, 1996; Leder et al., 2016). Clusters are components of a study that are similar across measurements, such as raters, stimuli, or occasions, and are treated as if sampled from a random population (Judd, Westfall, & Kenny, 2012; Westfall, Kenny, & Judd, 2014). The goal of a variance component analysis (VCA) is to quantify and attribute systematic variance to specific clusters by estimating their variance components. Comparing the size of the variance components can provide information about the importance of different clusters to the ratings. The more variance in a cluster, as measured by variance partitioning coefficients (Goldstein, Browne, & Rasbash, 2002), the more systematic the differences between instances of that cluster (Shavelson, Webb, & Rowley, 1989).

Within the VCA framework, estimating the shared versus idiosyncratic contributions to judgments is straightforward. Shared contributions are estimated by the stimulus cluster variance. The stimulus cluster variance will be greater if everyone agrees on the judgment of each stimulus and if the stimuli are distinct on the judged dimension (e.g., if everyone rates Stimulus 1 as beautiful and Stimulus 2 as not). Idiosyncratic contributions are estimated by the variance attributed to the Rater × Stimulus cluster. The latter cluster represents idiosyncratic taste, as it measures differences in ranking preferences. For example, this variance will be large if Participant A prefers Stimulus 1 to Stimulus 2, whereas Participant B has stronger but opposite preferences.

The role of the rater cluster is controversial. Although it represents individual differences (e.g., personality, mood, or even subjective construal of the experimental materials; Paluck & Shafir, 2017), it is unclear whether it should count as a source of idiosyncratic contributions to judgment (Hönekopp, 2006). For example, if Participants A and B give Stimuli 1, 2, and 3 the same preference order, but Participant A’s ratings are higher than Participant B’s, this mean difference does not necessarily count as idiosyncratic taste, because the participants would still share preference rankings. On the other hand, the participant with higher mean ratings may genuinely like the stimuli more, and this greater liking can lead to different behaviors.

Once the variance components are computed, one can create variance ratios that represent the shared and idiosyncratic contributions to judgments. Hönekopp (2006) called these ratios beholder indices. Following Hönekopp, we calculated two beholder indices, to take into account the ambiguity in the interpretation of the rater cluster. The first index, b1, ignores the role of the rater cluster and is simply the ratio of the Rater × Stimulus variance to the sum of that variance and the stimulus variance (see Formula 2 below). The second index, b2, includes the rater variance (see Formula 3 below). Ratios higher than .50 indicate stronger idiosyncratic than shared contributions to judgments.

Recently, Germine et al. (2015) introduced an intuitive measure of shared versus idiosyncratic contributions to judgments. The measure is a correlation index that directly combines inter- and intrarater correlations and is similar to the beholder index. The correlation index partitions the amount of interrater agreement out of the variance accounted for by repeated judgments (the intrarater correlation), thereby treating the intrarater correlation as the ceiling of meaningful variance and any further variance as noise. This index suggests that there are more shared contributions if both the interrater and the intrarater correlations are high, and more idiosyncratic contributions if the interrater correlation is low but the intrarater correlation is high.

Here we tested whether the VCA provides sensible estimates across varied stimulus contexts and how sensitive it is to measurement error. We also compared estimates of shared and idiosyncratic contributions, using beholder and correlation indices. In sum, we sought to find the kind of analysis that best quantifies shared and idiosyncratic contributions to judgments, in order to develop guidelines on best practices.

Overview of studies

In Study 1, we used three sets of stimuli rated on their beauty, a domain in which the psychometric properties of different stimuli are well characterized. The stimuli provided extreme test cases for measuring idiosyncratic and shared judgments: randomly generated complex color patterns (close to zero intrarater reliability and zero interrater agreement), a set of novel objects (high intrarater reliability and close to zero interrater agreement; see Kurosu & Todorov, 2017), and faces (high intrarater reliability and high interrater agreement). The first two cases are particularly important, because they allowed us to distinguish between proper and improper estimation (e.g., estimating meaningful shared components of judgments in cases in which the interrater agreement is zero). We found that the VCA is able to capture the psychometric properties of each stimulus set.

In Study 2, we moved beyond aesthetic ratings to report a simulation study designed to (1) show the generality of these methods and (2) examine the relative advantages, for estimation, of increasing either the sample size or the number of stimuli, or both. Much recent attention has been directed at larger sample sizes as a route to better estimate precision, power, and replicability (Maxwell, Kelley, & Rausch, 2008). The number of stimuli has also received recent consideration as another important design choice (Judd et al., 2012; Westfall, Judd, & Kenny, 2015; Westfall et al., 2014), particularly because using too few stimuli can obscure whether results are due to the specific stimuli chosen or to the sample (Wells & Windschitl, 1999). Here we found that the number of stimuli can often (but not always) have a larger impact than the number of raters on the precision of correlations and VCA estimates.

In Study 3, we further explored how multiple repeated measures of aesthetic judgments of objects and faces affect estimates of idiosyncratic and shared judgments. With more repeated measures, the intra- and interrater correlations both increased nonlinearly at different rates and were not independent from each other. Any index created from these correlations would therefore change with more repeated measures. The VCA estimates, however, stabilized with more measures.

Study 1

The aim of Study 1 was to test whether the VCA provides sensible estimates about the shared and idiosyncratic contributions to beauty judgments of stimuli with varying psychometric properties: faces (high interrater agreement and intrarater reliability), objects (low interrater agreement, high intrarater reliability), and patterns (low interrater agreement and intrarater reliability). First, we validated that the intra- and interrater correlations reflected the hypothesized stimulus properties. Then we checked whether the VCA arrived at appropriate conclusions. Finally, we qualitatively compared the beholder index estimates with the correlation index estimates.

Method

Participants

One hundred twenty-four participants were recruited using Amazon Mechanical Turk, in accordance with the Princeton University Institutional Review Board (Protocol: 0000007301), to evaluate either faces (N = 40), objects (N = 40), or patterns (N = 44) (Fig. 1). For sample size, we used past research on the reliability of face (Oosterhof & Todorov, 2008) and object (Kurosu & Todorov, 2017) judgments, in which the minimum sample needed was about 18–25 participants to obtain high reliability (r = .9 for faces, .5 for objects). For the patterns, for which little previous research had focused on reliability, we decided a priori on 40 participants, which was larger than the number needed for faces or objects.

Fig. 1 Examples of the stimuli: Faces, objects, and patterns

Stimuli

We randomly selected 50 faces (half female) from the Karolinska Directed Emotional Faces database (KDEF; Lundqvist, Flykt, & Öhman, 1998). The pictures depicted white men and women with neutral facial expressions. We randomly selected 50 3-D objects, created with Grasshopper algorithms (http://www.grasshopper3d.com), from the set developed by Kurosu and Todorov (2017). We generated 50 color matrices in Python. Specifically, each matrix consisted of 100 × 100 square blocks, in which the color of each block was randomly generated using the RGB color system (see Fig. 1).
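For illustration, a pattern of this kind can be generated in a few lines. The original matrices were created in Python, so the following R sketch (with a hypothetical file name) is only an analogue of the procedure, not the authors' code.

```r
## Minimal R sketch of one random color pattern (the original stimuli were
## generated in Python); the file name is hypothetical.
set.seed(1)
n <- 100
# One random RGB color per block, arranged as an n x n matrix of color strings
pattern <- matrix(rgb(runif(n * n), runif(n * n), runif(n * n)), n, n)
png("pattern_01.png", width = 500, height = 500)
grid::grid.raster(as.raster(pattern), interpolate = FALSE)  # draw blocks without smoothing
dev.off()
```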

Procedure

Participants completed a self-timed rating task in which they judged “how beautiful is this {face, object, image},” using a scale from 1 (not at all) to 7 (extremely). Participants rated each stimulus twice, once in each of two successive blocks. The stimulus order was randomized for each participant and within each block.

Descriptive analysis

The intrarater correlation, or test–retest reliability, examines the consistency of raters’ ratings with themselves across repeated measurements (Gwet, 2014). We computed Pearson correlations between the two time points for the same individual. The interrater correlation measures the consensus between raters. For descriptive purposes, we computed this value with pairwise Pearson correlations between raters, as well as with correlations between a rater and the group’s average ratings (RtG).
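As an illustration, the following minimal sketch computes these descriptive correlations for one stimulus set, assuming a long-format data frame `ratings` with columns `rater`, `stimulus`, `block` (1 or 2), and `rating` (all hypothetical names); the rater-to-group correlation is computed here against the average of the remaining raters, one common variant.

```r
library(dplyr)
library(tidyr)

# One row per rater x stimulus, with one column per block
wide <- ratings %>%
  pivot_wider(names_from = block, values_from = rating, names_prefix = "b")

# Intrarater (test-retest) reliability: block 1 vs. block 2 within each rater
intra <- wide %>%
  group_by(rater) %>%
  summarise(r = cor(b1, b2), .groups = "drop")
mean(intra$r)

# Interrater agreement: average pairwise correlation between raters,
# based on each rater's ratings averaged over the two blocks
by_rater <- wide %>%
  mutate(m = (b1 + b2) / 2) %>%
  select(rater, stimulus, m) %>%
  pivot_wider(names_from = rater, values_from = m) %>%
  select(-stimulus)
pairwise <- cor(by_rater)
mean(pairwise[lower.tri(pairwise)])

# Rater-to-group (RtG): each rater against the average of the remaining raters
rtg <- sapply(seq_along(by_rater), function(i)
  cor(by_rater[[i]], rowMeans(by_rater[-i])))
mean(rtg)
```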

Variance component analysis

Implementation

A recent approach to estimating variance components is maximum likelihood (ML) and restricted maximum likelihood (REML) estimation in a mixed regression model (Nakagawa & Schielzeth, 2010; Searle, Casella, & McCulloch, 2006). For this analysis, we used the lme4 package, version 1.1.11, in R (Bates, Mächler, Bolker, & Walker, 2015) to estimate the VCA. Since the stimuli were rated twice by multiple participants, our models were cross-classified (Schielzeth & Nakagawa, 2013). The random effects included the rater, stimulus, and Rater × Stimulus clusters, as well as the block, Block × Rater, and Block × Stimulus clusters, and were estimated using REML, which is comparable to a random-effects analysis of variance (ANOVA; Corbeil & Searle, 1976; Searle et al., 2006). To minimize convergence issues, the models were optimized using the “bobyqa” option in lme4 and were run for a maximum of 500,000 iterations.
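For concreteness, a minimal sketch of such a cross-classified model is given below, assuming a long-format data frame `ratings` with a `rating` column and factor columns `rater`, `stimulus`, and `block` (hypothetical names); this illustrates the model structure described above rather than the authors' exact code.

```r
library(lme4)

# Cross-classified random-intercept model with rater, stimulus, Rater x Stimulus,
# block, Block x Rater, and Block x Stimulus clusters, fit by REML
m <- lmer(
  rating ~ 1 +
    (1 | rater) + (1 | stimulus) + (1 | rater:stimulus) +
    (1 | block) + (1 | block:rater) + (1 | block:stimulus),
  data = ratings, REML = TRUE,
  control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 5e5))
)

# Estimated variance components for each cluster plus the residual
as.data.frame(VarCorr(m))
```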

To test an additional analytic option, we also estimated the VCA using random-effects ANOVA from the VCA package, version 1.3.2, in R (Schuetzenmeister, 2016), which provides sum-of-squares (SS)-derived variance components. The VCA estimates were similar across the (restricted) maximum likelihood and ANOVA implementations (Supplementary Fig. 1), suggesting that the choice of VCA implementation can be guided by the need for further functionality, such as bootstrapping, in the relevant software packages; by the size of the data, since variance estimates are partially pooled in mixed models (Gelman & Hill, 2007); or by issues with convergence (see Appendix C).

Variance partitioning coefficient (VPC)

The equation for the VPC is

$$ VPC=\frac{\upsigma_{\mathrm{cluster}}^2}{\upsigma_{\mathrm{cluster}}^2+{\upsigma}_{\mathrm{residual}}^2} $$
(1)

which is the ratio of a cluster’s variance to the total variance (Goldstein et al., 2002). The cluster variance, σ²cluster, represents the variance between instances of a cluster, whereas the residual variance, σ²residual, represents variance within clusters. When the VPC is closer to one, the between-cluster variance explains most of the variance and the within-cluster variance is low. When the VPC is closer to zero, there is little between-cluster variance, and most of the variance is within-cluster. The analyses in this study contained random intercepts only, so the VPCs in these analyses can be interpreted as intraclass correlations (ICCs; Goldstein et al., 2002; Nakagawa & Schielzeth, 2010).
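As a concrete sketch, the variance components from the model fitted above can be converted into VPCs as follows. Here "total variance" is taken to be the sum of all estimated components, including the residual, which matches the prose description; under a strict two-term reading of Formula 1, the denominator for each cluster would instead be that cluster's variance plus the residual variance only.

```r
# Variance partitioning coefficients: each cluster's variance relative to the
# total of all estimated components (including the residual)
vc  <- as.data.frame(VarCorr(m))
vpc <- setNames(vc$vcov / sum(vc$vcov), vc$grp)
round(vpc, 3)
```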

Beholder indices

Following Hönekopp (2006) and using the relevant VPCs, we computed two beholder indices. The first index, b1, ignores the role of the rater cluster variance (Formula 2), whereas the second, b2, includes the rater variance as idiosyncratic taste (Formula 3).

$$ {b}_1=\frac{\upsigma_{\mathrm{Rater}\times \mathrm{Stimulus}}^2}{\upsigma_{\mathrm{Rater}\times \mathrm{Stimulus}}^2+{\upsigma}_{\mathrm{stimulus}}^2} $$
(2)
$$ {b}_2=\frac{\upsigma_{\mathrm{rater}}^2+{\upsigma}_{\mathrm{Rater}\times \mathrm{Stimulus}}^2}{\upsigma_{\mathrm{rater}}^2+{\upsigma}_{\mathrm{Rater}\times \mathrm{Stimulus}}^2+{\upsigma}_{\mathrm{stimulus}}^2} $$
(3)

Typically, beholder indices are reported as is; however, we subtracted them from 1 to standardize their interpretation and allow comparison with the correlation index (see below). After this transformation, a beholder index greater than .5 indicates that shared taste contributes more to judgments than does idiosyncratic taste, whereas a beholder index smaller than .5 indicates a greater role for idiosyncratic taste.
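A minimal sketch of Formulas 2 and 3, computed from the variance components of the model fitted above, is shown below; the cluster names follow the lme4 formula used in the earlier sketch, and the final line applies the 1-minus transformation just described.

```r
vc <- as.data.frame(VarCorr(m))
v  <- setNames(vc$vcov, vc$grp)

# Formula 2: Rater x Stimulus variance relative to itself plus the stimulus variance
b1 <- unname(v["rater:stimulus"] / (v["rater:stimulus"] + v["stimulus"]))
# Formula 3: as above, but adding the rater variance to numerator and denominator
b2 <- unname((v["rater"] + v["rater:stimulus"]) /
             (v["rater"] + v["rater:stimulus"] + v["stimulus"]))

# As reported here, the indices are subtracted from 1, so values above .5
# indicate more shared than idiosyncratic contributions
c(b1 = 1 - b1, b2 = 1 - b2)
```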

Correlation index

The correlation index (Germine et al., 2015) assumes a zero-sum relationship between the intra- and interrater correlations. Germine and colleagues calculated pairwise interrater correlations, squared them, and then averaged them to obtain “the proportion of ratings across faces that overlap between two typical individuals.” They also calculated the mean of the squared pairwise intrarater correlations. The formula used to obtain an estimate of shared preferences was

$$ \mathrm{Shared}=\frac{E\left[{r}_b^2\right]}{E\left[{r}_w^2\right]} $$
(4)

where rb represents the interrater correlations, rw the intrarater correlations, and E is the expected value or average. The formula for idiosyncratic preferences is simply 1 – Shared. These formulas as written can accommodate only data sets in which pairwise interrater correlations are uniformly positive—that is, data in which everyone agrees to some extent (which was the case for the preselected stimuli Germine and colleagues used). For data in which disagreement between raters exists, this formula would estimate shared contributions when there were none (Supplementary Fig. 1). This is due to the squaring procedure: squaring negative correlations to transform them into variance-explained metrics (r2) would impute agreement from disagreement between individuals. We therefore made one slight modification in order to account for disagreement, so that the modified shared preferences are now

$$ \mathrm{Share}{\mathrm{d}}_{\mathrm{modified}}=\frac{E\left[{r}_b^2\ast \frac{r_b}{\left|{r}_b\right|}\ \right]}{E\left[{r}_w^2\ast \frac{r_w}{\left|{r}_w\right|}\ \right]} $$
(5)

In simpler terms, multiplying the variances by the correlation divided by its absolute value means that if the original correlations were negative, the resulting variance is also negative. Although this modification helps account for disagreement, it may not be theoretically optimal, as the meaning of negative variances is inherently ambiguous (Nakagawa & Schielzeth, 2010). For simplicity, we only report estimates of shared preferences; a value greater than .5 would suggest greater contributions from shared taste.
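In code, the modification amounts to preserving the sign of each correlation when it is squared. The sketch below assumes vectors `r_b` (pairwise interrater correlations) and `r_w` (intrarater correlations), both hypothetical names; `sign(r)` plays the role of r/|r| and simply returns 0 when r = 0.

```r
# Signed squared correlation: negative correlations keep a negative sign
signed_r2 <- function(r) sign(r) * r^2

shared_modified <- mean(signed_r2(r_b)) / mean(signed_r2(r_w))
idiosyncratic   <- 1 - shared_modified
```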

Results

Validating psychometrics

The correlational patterns matched the expected stimulus psychometrics. Face (rintra = .62, 95% CI[.52, .71]) and object (rintra = .49, 95% CI[.39, .59]) judgments both showed high intrarater reliability, whereas the pattern judgments had almost none (rintra = .067, 95% CI[.02, .11]) (Fig. 2A). Faces also showed interrater agreement (rinter = .34, 95% CI[.32, .35]), whereas objects (rinter = .003, 95% CI[– .02, .03]) and patterns (rinter = – .002, 95% CI[– .01, .006]) did not. When examining the rater-to-group (RtG) correlations, faces showed a boost in agreement (rRtG = .57, 95% CI[.49, .65]) that was not present in the objects (rRtG = .039, 95% CI[– .06, .14]) or patterns (rRtG = .004, 95% CI[– .03, .05]), congruent with the signal-boosting effects of between-rater aggregation.

Fig. 2 Correlation and estimation metrics across stimulus types. The columns in all four figures represent data from beauty judgments of faces, objects, and patterns. (A) Descriptive correlations. The x-axis represents the intrarater, interrater, and rater-to-group (RtG) correlations. The y-axis represents the estimates (error bars represent 95% confidence intervals). Values above .5 (black horizontal lines) represent more shared contributions, and values less than .5 represent more idiosyncratic contributions. (B) Variance partitioning coefficients (VPCs). The x-axis represents clusters important for the shared versus idiosyncratic contributions (rater, stimulus, and Rater × Stimulus). For block-related clusters, see Supplementary Fig. 1. The y-axis represents how much variance each cluster explains. (C) Beholder indices: Estimates of the b1 index (black) and the b2 index (dark grey), whose values are interpreted similarly to the correlation indices; values above .5 (black horizontal line) mean more of a shared contribution, and values less than .5 mean more of an idiosyncratic contribution. (D) Correlation indices: Proportions of shared contribution estimates from the correlation indices

Estimation

The VPC metrics suggest a pattern of results consistent with the psychometrics. Face judgments were largely explained by the rater VPC (i.e., potentially idiosyncratic variance) and Rater × Stimulus VPC (i.e., idiosyncratic variance), yet they also contained a sizable stimulus VPC (i.e., shared variance) (Fig. 2B). The stimulus VPC was correctly close to zero for the object and pattern judgments. The object and pattern data were largely explained by rater and Rater × Stimulus VPCs. However, the pattern data only contained rater variance, suggesting that unreliable judgments can contribute to this particular variance component. The findings also show that observing large idiosyncratic contributions through the Rater × Stimulus VPC is dependent on collecting reliable judgments, which was the case for the face and object data (Figs. 2A and 2B). Because the beholder indices are derived from the variance components, they reflect the VPC results but provide a simpler summary metric (Fig. 2C).

The correlation index appropriately estimated that faces contained more idiosyncratic judgments (i.e., 64% of the reliable variance) and that there were close to no shared judgments in the object and pattern data (Fig. 2D). Although this index provided reasonable estimates, it is important to examine how consistent they are with the VCA estimates. If one were to consider the correlation index and beholder index values as being comparable, then it would seem the correlation index estimates more idiosyncratic contributions by about 0.1 units in the face data, and less in the object or pattern data. This might be due to how each method deals with repeated measures. Mixed models naturally account for repeated measures, but for the correlation index one has to average repeated measures in the process of calculating pairwise interrater correlations. In our case, we averaged them within raters first, then calculated agreement, likely increasing the idiosyncratic contributions by improving rater signal.

Discussion

After validating the hypothesized psychometric properties of beauty judgments for different kinds of stimuli, the analyses showed that the VCA metrics provided meaningful estimates. The correlation index, which only combines intra- and interrater correlations, provided estimates similar to those of the beholder indices, which are derived from the VCA. Yet the averaging procedures required to address repeated measures in the correlation index may have led to an overestimation of idiosyncratic contributions (see Supplementary Fig. 17). We return to this issue in Study 3. The VCA estimates relate to the psychometrics (i.e., the stimulus VPC to interrater agreement, and the Rater × Stimulus VPC to intrarater reliability), although the rater VPC remains ambiguous, since even unreliable pattern judgments contributed to this variance.

Study 2

The previous study showed that the VCA can provide reasonable estimates for stimuli that represent edge cases of psychometric properties. In Study 2, we conducted a simulation analysis to directly manipulate the magnitudes of reliability and agreement, for greater control of the investigated psychometric space, and examined the additional impacts of stimulus and sample size on estimations.

Method

Simulation procedure

The simulation procedure was as follows: First, data sets with differing numbers of stimuli and raters were simulated from a prespecified variance–covariance matrix using a multivariate normal distribution (Genz et al., 2018). From these data sets, we estimated VPCs. To take advantage of the high level of control over psychometric properties, the variance–covariance matrix was created with varying levels of interrater agreement and intrarater reliability. The matrix structure was compound symmetric, such that the diagonal contained all 1s and the off-diagonals contained the correlations between raters (interrater agreement) and between each rater’s first and second blocks (intrarater reliability) (Supplementary Table 1). To input realistic off-diagonal values that could mimic those in a real data set (i.e., not everyone is correlated in the exact same way), an algorithm was implemented that added noise around the specified agreement and reliability correlations. Importantly, this algorithm ensured that the matrix stayed positive-definite (Hardin, Garcia, & Golan, 2013).
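A stripped-down sketch of this data-generating idea follows, omitting the noise-adding step and the conversion to integer ratings; the function name is hypothetical, and MASS::mvrnorm is used here in place of the routines cited above. Each column is one rater-by-block combination, and each row is a stimulus.

```r
library(MASS)

simulate_ratings <- function(n_raters = 40, n_stimuli = 30,
                             agreement = .3, reliability = .6) {
  p <- 2 * n_raters                   # two blocks per rater
  Sigma <- matrix(agreement, p, p)    # between-rater correlations (agreement)
  for (i in seq_len(n_raters)) {      # within-rater, across-block correlations
    Sigma[i, i + n_raters] <- Sigma[i + n_raters, i] <- reliability
  }
  diag(Sigma) <- 1
  # Positive-definite as long as 0 <= agreement <= reliability < 1
  mvrnorm(n = n_stimuli, mu = rep(0, p), Sigma = Sigma)
}

x <- simulate_ratings()
dim(x)   # n_stimuli rows, 2 * n_raters columns
```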

For researchers interested in using the simulation, we have created a web application to which the variables listed in the next section (and more) can be submitted in order to create and download a simulated data set: https://joelem.shinyapps.io/SimulateRatingData/.

Simulation variables

The simulation manipulated the following variables: number of raters (nine levels: 20 to 100, in increments of 10), number of stimuli (six levels: 10 to 60, in increments of 10), average interrater agreement (ten levels: 0 to .9, in increments of .1), and average intrarater reliability (ten levels: 0 to .9, in increments of .1). Variables that stayed constant were the number of repeated measures (2), the standard deviation of participant averages (a value of 1), and the number of simulations run per combination of variables (n = 120). The ratings took integer values from 1 to 9. We ran 120 simulations per combination of variables in order to obtain better estimates of their effects, given the stochastic nature of the simulation procedure. Importantly, nonmeaningful combinations of variables were ignored (e.g., combinations in which the interrater agreement was greater than the intrarater reliability). Altogether, the combinations of variables led to 356,400 simulated data sets.
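For concreteness, the design can be written as a grid (a sketch with hypothetical variable names); dropping the cells in which agreement exceeds reliability leaves 2,970 combinations, which, at 120 simulations per cell, gives the 356,400 data sets reported above.

```r
design <- expand.grid(
  n_raters    = seq(20, 100, by = 10),   # 9 levels
  n_stimuli   = seq(10, 60, by = 10),    # 6 levels
  agreement   = seq(0, .9, by = .1),     # 10 levels
  reliability = seq(0, .9, by = .1)      # 10 levels
)
# Drop nonmeaningful cells in which agreement exceeds reliability
design <- subset(design, agreement <= reliability)
nrow(design)        # 2,970 combinations
nrow(design) * 120  # 356,400 simulated data sets
```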

Analysis

For each data set, we calculated the simulated reliability and agreement correlations to validate that our procedure’s output data matched the specified psychometrics, and we calculated the VPCs in order to estimate the shared and idiosyncratic contributions, as specified in the previous studies. To simplify reporting, we report psychometric correlations and VPC estimates from the combination with the largest numbers of stimuli and raters, which therefore represent the most precise estimates. To show the relative and combined impacts of the numbers of stimuli and raters on estimate precision, we first calculated the estimation variance for every combination of reliability and agreement. Then we visualized average trends using a smoothed loess curve. The trend lines were pooled across combinations of levels of reliability and agreement (N = 55); therefore, each combination of number of raters and number of stimuli contained 55 variance estimates. In this case, larger variance estimates indicate less precision. Collapsing across agreement and reliability can hide more nuanced patterns; therefore, interested readers can find the full results across all combinations of stimulus set size, sample size, reliability, and agreement in the supplementary material (Appendix B).
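The precision summary can be sketched as follows, assuming a data frame `sims` with one row per simulated data set and columns for the design variables and an estimate of interest (here `stimulus_vpc`; all names hypothetical).

```r
library(dplyr)
library(ggplot2)

# Estimation variance per design cell (55 reliability/agreement combinations
# for every rater-by-stimulus sample size combination)
precision <- sims %>%
  group_by(n_raters, n_stimuli, reliability, agreement) %>%
  summarise(est_var = var(stimulus_vpc), .groups = "drop")

# Average precision trends, smoothed with loess; more variance = less precision
ggplot(precision, aes(n_raters, est_var, colour = factor(n_stimuli))) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(x = "Number of raters", y = "Estimation variance",
       colour = "Number of stimuli")
```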

For completeness, we also computed beholder and correlation indices from the simulations (Appendix B). Both indices suffered from instability at low levels of reliability and interrater agreement, and the correlation index produced implausible estimates (Supplementary Figs. 7 and 9). Creating a ratio of variance components (i.e., the beholder indices) led to more stable estimates than did a ratio of intra- and interrater correlations (i.e., the correlation index) (Supplementary Figs. 8 and 10).

Results

We initially checked that our simulations captured the specified psychometric properties. The average simulated reliability increased linearly with the specified reliability, regardless of the level of agreement (Fig. 3A). The same occurred for agreement, regardless of the level of reliability (Fig. 3B). These findings validate that our procedure created data sets that matched the desired psychometrics. However, it should be noted that the ability to faithfully capture these properties depended on either the sample or the stimulus set size. For reliability, increasing the number of stimuli led to a greater improvement in precision than did increasing the sample size, although both made an appreciable impact (Fig. 3C). The precision improvement seemed to level off after 60 raters or 30 stimuli. Notably, having only 10 stimuli was associated with the worst precision, such that even increasing the sample size to 100 raters would only make the precision as good as having 20 stimuli and 20 raters. For agreement, the number of stimuli had the most impact on precision, such that having over 30 stimuli led to the best estimates (Fig. 3D).

Fig. 3 Simulated interrater agreement, intrarater reliability, and their precision. (A) The y-axis represents the simulated reliability estimates from 100 raters and 60 stimuli, and the x-axis represents the specified reliability (0 to .9). The interrater agreement is represented by a rainbow-colored gradient from the least agreement (0 = red) to the most (.9 = pink). The columns represent 20 or 100 participants, whereas the rows represent 10 or 60 stimuli. (B) The same plot as for reliability, except that the rainbow gradient now represents different levels of reliability. For both plots, dots represent the estimates from one simulated data set, and lines represent the averages. (C) The y-axis represents the variance in estimations of the reliability across numbers of raters (x-axis) and numbers of stimuli (least [orange] to most [pink]). (D) Plot of variance for the agreement estimates. Lines represent smoothed loess averages, and more variance means less precision

After validation, we examined the impacts of different levels of reliability and agreement, alongside increasing numbers of raters or stimuli, on the VPC estimates. The rater VPC was not impacted by either reliability or agreement (Fig. 4A), but its precision was improved more by increasing the sample size than by increasing the number of stimuli, as shown by the greater drop in variance, as long as the design contained more than ten stimuli (Fig. 4D). The stimulus VPC increased with levels of agreement, with no impact from levels of reliability, suggesting that the stimulus VPC is directly linked to interrater agreement (Fig. 4B). Increasing the number of stimuli improved these estimates more than did increasing the sample size, and more than ten stimuli were required for reasonable precision (Fig. 4E). Finally, the Rater × Stimulus VPC was dependent on both agreement and reliability, providing a more nuanced understanding of these links than was reached in Study 1 (Fig. 4C). Greater reliability increased this VPC, while greater agreement simultaneously decreased it. A larger sample size improved the precision of this VPC estimate more than did a larger number of stimuli.

Fig. 4 Simulated variance partitioning coefficients (VPCs) and their precision, for (A,D) rater VPC, (B,E) stimulus VPC, and (C,F) Rater × Stimulus (R × S) VPC. (A–C) The x-axes represent the specified reliability, whereas the y-axes represent the respective VPC estimates. Columns within each plot represent numbers of participants, and rows represent numbers of stimuli. Agreement is represented by a rainbow gradient, such that no agreement is orange and the most agreement is pink. Dots represent the estimates from simulated data sets, and lines represent averages. (D–F) The y-axes represent the estimation variance of each specific VPC across numbers of raters (x-axis) and numbers of stimuli (least [orange] to most [pink]). Lines represent smoothed loess averages

Discussion

The simulations in this study clarified the relationship between the psychometric correlations and the variance components, as well as the impact of sample and stimulus set size on the precision of those estimates. Whereas the rater VPC was not related to agreement or reliability, the stimulus VPC was associated with agreement, and the Rater × Stimulus VPC with both agreement and reliability. For many of the agreement- or stimulus-related estimates, an increase in the number of stimuli led to larger decreases in estimation variance (i.e., better precision) than did an increase in sample size. The rater-related estimates depended on both the sample size and the number of stimuli. However, once designs contained more than ten stimuli, the precision advantage of the number of stimuli waned, and the sample size had more impact on precision. These asymmetries in the contributions to estimation point to the possibility that increasing the stimulus set size might be a better experimental design trade-off than simply striving for a greater sample size.

The simulations made three simplifying assumptions that should be noted when interpreting these results. First, each rater’s mean was assumed to be similar across repeated measures, simply because there are many ways a mean can change across time, and that change will depend on the topic of study. This assumption is consistent with the data from Study 1: The average difference between each rater’s mean on Block 1 and the mean on Block 2 was small for the faces (difference = .122), objects (difference = .04), and patterns (difference = – .08). Second, the specified interrater correlations were the same across repetitions—for example, the average interrater correlation was the same for Block 1 as for Block 2. This assumption was also consistent with the data from Study 1: The difference between the average interrater correlation on Blocks 1 and 2 was small for the faces (difference = .003), objects (difference = .002), and patterns (difference = .006). Third, because the psychometric specifications were at the level of raters, the simulations do not directly manipulate stimulus variability (i.e., specifying different kinds of stimuli), and therefore represent a restricted stimulus range.

Study 3

The objective of this study was to test whether the estimates of idiosyncratic and shared variance would change with more than two repeated measures and, further, how aggregating or simply including those repeated measures impacts estimation. For example, when it comes to estimating inter- or intrarater correlations from repeated measures, the common practice is to average within raters in order to increase the signal-to-noise ratio and fulfill the requirement of independent observations (Bakdash & Marusich, 2017). However, averaged data reduce the power of regressions by condensing the amount of input data. A benefit of mixed-effect models is that they can handle multiple measurements when the dependencies are properly specified. Given these possibilities in data preprocessing and collection, it is important to assess how keeping more than two repeated measurements intact, or aggregating them, affects the stability of shared and idiosyncratic estimates.

We focused on judgments of faces and objects, because these were the reliably rated stimuli in Study 1. With two types of stimuli that differed in their levels of rater agreement, the analysis tracked the changes in rater reliability, agreement, and VCA estimates as the number of repeated ratings was incrementally increased, up to ten ratings per stimulus for each individual. Additionally, we manipulated the way these repeated measures were preprocessed: by aggregating or by simply including the additional measures. For evidence of the generalizability of these results to judgments other than beauty, we applied the same analysis pipeline to three more judgments (approachable, dangerous, and likable) of the same faces and objects. Although the absolute values of the estimates differed across judgments and stimuli, the effects of additional repeated measures were consistent across dimensions (Supplementary Figs. 11, 12, 13, 14, and 15).

Method

Forty-four participants were recruited from the Princeton undergraduate pool for course credit. The data were originally collected for Study 1 of Kurosu and Todorov (2017) and were reanalyzed here. Participants judged either 66 faces from the KDEF (Lundqvist et al., 1998) (n = 19) or 66 objects (n = 25) on their beauty.

The experiments were run on in-lab CRT monitors and were developed using PsychoPy (Peirce, 2007). The task was a simple rating task, in which each trial contained a 500-ms blank screen followed by 500 ms of a crosshair, and ended with a self-timed question probe. The probe displayed the stimulus (20° visual angle) above a question (“How beautiful is this {object, face}?”), and a 1 (not at all) through 9 (extremely) rating scale below the question. The 66 stimuli were presented ten times each across ten blocks. For each participant, the presentation order of the stimuli was randomized for each block, and participants were given the chance to take a quick break between blocks.

Descriptive correlational analyses

We estimated the correlation between two blocks within each rater and averaged across raters to get an overall measure of the sample’s reliability as increasingly more data were used. To test the impact of more than two measures (ten, in this case) on the intrarater correlations, we took a sequential-averaging approach. This approach avoided violating the independence requirement of correlations and ensured that the results reflected many possible block combinations, for generalizability; see Appendix C for details.

We calculated pairwise correlations between raters on the same block(s) and took the average of all pairwise correlations in order to obtain an overall measure of the sample’s consensus as we aggregated more data. Here, we also took a sequential-averaging approach (see Appendix C).
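As a minimal sketch of this aggregation idea for the interrater correlation, assume a long data frame `ratings10` with columns `rater`, `stimulus`, `block` (1–10), and `rating` (hypothetical names). A single set of blocks is aggregated here; the full procedure in Appendix C averages over many block combinations.

```r
library(dplyr)
library(tidyr)

inter_at <- function(dat, blocks) {
  wide <- dat %>%
    filter(block %in% blocks) %>%
    group_by(rater, stimulus) %>%
    summarise(m = mean(rating), .groups = "drop") %>%  # aggregate the chosen blocks
    pivot_wider(names_from = rater, values_from = m) %>%
    select(-stimulus)
  r <- cor(wide)
  mean(r[lower.tri(r)])   # average pairwise agreement
}

# Agreement after aggregating two versus all ten blocks
inter_at(ratings10, 1:2)
inter_at(ratings10, 1:10)
```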

To calculate the correlation between an individual rater’s data and the average of the rest of the group, we employed a leave-one-out approach, in which each rater’s data were isolated and correlated with the group average of the remaining raters. This process was repeated for each rater, using the same stepwise averaging approach as above, as we added more data from the ten blocks. The point estimates and confidence intervals at each step were derived from the mean of all the raters’ correlations with the group.

Although the analyses above describe how the intra- and interrater correlations independently shift with aggregated data, it is also important to examine whether the relationship between them shifts, too. At each step of the sequential averaging, we correlated individuals’ intrarater correlations with their rater-to-group correlations.

Results

First we report the impact of increased data averaging on the intrarater, interrater, and rater-to-group correlations for faces and objects (Fig. 5). The correlation values (rN) represent the average correlation across N blocks; the 95% confidence intervals were calculated from individuals (intrarater and rater to group) and from block combinations (interrater).

Fig. 5 Average correlations and relationships between the intrarater and rater-to-group (RtG) correlations across repeated measures. Panels A and C represent the face data, and panels B and D the object data. (A) The x-axis represents the number of blocks used in the sequential-averaging approach. Columns display changes in the intrarater, interrater, and RtG correlations, respectively. Each point in the Intra and RtG columns represents an individual participant; each point in the Inter column represents the average pairwise correlation for every instance of N blocks combined. The red dashed lines are the averages; the solid lines are the 95% confidence intervals. (B) The same plot for the object data. (C) The x-axes represent the intrarater correlations, the y-axis represents the RtG correlation, and the columns represent the number of blocks used. Each point represents the average intrarater and RtG correlations across all N-block combinations per participant. The correlation between the intrarater and RtG correlations is given in the bottom part of each plot. (D) The same plot for the object data

Consistent with the signal-increasing goal of aggregating the data, more repeated measures increased the intrarater correlations, for both faces (r2 = .59 [.49, .69], r10 = .85 [.77, .92]) and objects (r2 = .57 [.48, .66], r10 = .83 [.76, .90]) (Figs. 5A and 5B). The increase was nonlinear and was still increasing at ten blocks, indicating that the intrarater correlation may require more than ten repeated measures to stabilize.

The interrater correlations for faces exhibited an increase with repeated measures (r2 = .38 [.38, .39], r10 = .49), even showing an increasing trend at ten blocks, suggesting that interrater correlations, too, may require more than ten repeated measures to stabilize. The interrater correlations for objects showed a stable pattern (r2 = .02 [.01, .02], r10 = .02), indicating that when the average agreement is around zero, it will remain around zero, irrespective of any type of aggregation or averaging. It is important to note that the zero correlation is largely due to similar amounts of positive and negative interrater correlations, as we will see in the next analysis. That is, the average pairwise correlation obscures variations in agreement and disagreement between pairs of raters.

The rater-to-group correlations were greater than the pairwise correlations and exhibited similar changes across block aggregation for faces (r2 = .60 [.52, .67], r10 = .68 [.62, .74]), suggesting once again that it may take more than ten repeated measures to stabilize this statistic. The rater-to-group correlations remained relatively stable for objects (r2 = .07 [– .03, .17], r10 = .10 [– .03, .22]). Since these point estimates were at the level of the individual rather than block combinations, one can see why the pairwise interrater correlations were low for object beauty (Fig. 5B). Many of the individuals were negatively correlated with each other, indicating that average estimates of pairwise interrater correlations can hide the extent of disagreement between individual raters. An average pairwise interrater correlation of zero could occur because every rater is orthogonal to the rest or because positive and negative correlations cancel out.

To examine how aggregation affects the relationship between the intra- and interrater correlations, we used the rater-to-group correlation instead of the pairwise interrater correlations, in order to have one estimate of agreement at the level of the rater. For faces, there were high correlations between the intrarater and rater-to-group correlations using just two blocks (r2 = .82; Fig. 5C). Raters who had high intrarater correlations agreed more with the group: If agreement exists, intrarater reliability improves with an increasing number of blocks and is coupled with increasing interrater agreement. The correlations decreased as more blocks were aggregated (r10 = .59), owing to a restriction of range, in which the data progressively clustered toward higher reliability and agreement with more blocks (Goodwin & Leech, 2006). For objects, the correlation was weak and negative (Fig. 5D): If there is little interrater agreement to begin with, intrarater reliability improves with repeated measures but is not coupled with increased interrater agreement.

Variance component analyses

We developed two data-preprocessing streams: sequential addition and sequential averaging. The former merely included, whereas the latter aggregated, an increasing number of repeated measures as data for the VCAs. These analyses were implemented using both ANOVA and linear mixed models. Because the results were similar, we report only the linear mixed models. See Appendix C for additional details on the analysis streams, the implementations, and issues with convergence in mixed models.

Variance partitioning coefficients

Using the sequential-addition approach, the average VPC per cluster stayed consistent (Figs. 6A and 6C) yet showed less variance with increasing numbers of blocks, resulting in more stable estimates. The ranges of some VPC estimates overlapped, especially when only two blocks were used. For example, in the face ratings, the stimulus and Rater × Stimulus VPCs highly overlapped, such that either could be estimated as greater than the other, depending on which particular two blocks were sampled. The block, Block × Stimulus, and Block × Rater VPCs did not explain much of the variance in these data.

Fig. 6 Variance partitioning coefficients (VPCs) across repeated measures. The left plots (A,B) are for face judgments, and the right plots (C,D) for object judgments. The panels represent estimates from either the sequential-addition or sequential-averaging approach. The x-axes represent the numbers of blocks used in the analysis. For the VPCs, there are six clusters: rater, stimulus, Rater × Stimulus, block, Block × Rater, and Block × Stimulus

Using the sequential-averaging approach, repeated measures had the same variance-reducing effect on the estimates (Figs. 6B and 6D). They also exhibited the same relative VPC patterns as in the sequential-addition analyses. However, the average VPC estimates increased nonlinearly as more blocks were averaged. This increase specifically occurred in the clusters that already explained much of the variance: rater, stimulus, and Rater × Stimulus. The increase can be explained by directly examining the variance component estimates (Supplementary Fig. 14). When one averages repeated measures before running this analysis, the affected variances are the residual and the block-related clusters: Both decrease with more averaged data.

Beholder indices

Since residuals are not used in these estimations, the sequential-averaging and sequential-addition analyses provided similar results. The mean beholder index estimates stayed consistent with increasing blocks (Fig. 7). Overall, beholder indices became more stable with an increase in repeated measures.

Fig. 7 Beholder indices across repeated measures. The left plots (A,B) are for face judgments, and the right plots (C,D) for object judgments. The panels represent estimates from either the sequential-addition (A,C) or sequential-averaging (B,D) approach. The x-axes represent the numbers of blocks used in the analysis. Black is b1, which excludes the rater variance, and gray is b2, which includes the rater variance. Values above .5 represent more shared contributions, and those below .5 represent more idiosyncratic contributions

For faces, beauty judgments had a mean b1 of about .5, replicating previous estimates of equal shared and idiosyncratic contributions (Hönekopp, 2006). However, the individual b1 estimates showed substantial variance when fewer repeated measures were used. For example, with only two measures, the estimates ranged from .39 to .62, depending on which two blocks were used, suggesting that b1 could be under- or overestimated depending on how many measures were taken. The b2 indices were well below .5 and below the b1 estimates, suggesting a high amount of between-rater variance. The objects showed a very different pattern, with tightly linked b1 and b2 estimates well below the halfway point, indicating highly idiosyncratic ratings.

Discussion

The impact of collecting and analyzing more than two repeated measures was threefold: intra- and interrater correlations increased in a nonlinear manner; their dynamic was interdependent, such that agreement was magnified when reliability also increased; and the VCA estimates stabilized. The dynamic shifts of the correlations suggest that any index derived from them is subject to change with more repeated measures. In our case, the estimates showed a continuing trend at even ten measurements. The VCA instead showed a pattern of stabilization with more measurements, around six in our study, which makes this method ideal for handling multiple repeated measures and arriving at reliable estimations. Estimates derived using the maximum number of repeated measures typically collected in psychology studies (i.e., two) should therefore be interpreted with caution.

The manner in which the repeated measures are preprocessed also matters. Because the VPCs take into account the residuals, averaging beforehand can increase these estimates as compared to simply adding all the repetitions into the model. The beholder indices are not affected by preprocessing, since their calculation ignores the residuals. The use of either preprocessing step will therefore depend on how the researcher theorizes the residuals. The reporting of VPCs or beholder indices will depend on the specificity of the hypotheses.

General discussion

Quantifying sharedness or idiosyncrasy in judgments is a crucial enterprise for any area of research in which the relative influence of people or stimuli on judgments is in question. For example, theories of social or aesthetic perception (Hehman et al., 2017; Kurosu & Todorov, 2017; Vessel et al., 2018) and practical concerns over criminal justice (Austin & Williams III, 1977; Forst & Wellford, 1981; Hofer et al., 1999) or public health (Shoukri, 2011) depend on these estimates. In fact, so do the replicability and feasibility of research in these areas. Domains in which the stimuli explain more variance may experience greater success in recapturing the average effect of a manipulation. Domains in which the sample explains more variance might instead have a difficult time consistently capturing the impact of a manipulation, owing to effect heterogeneity. With such widespread consequences, it is important to identify best practices for estimating these contributions.

Our goal was to test whether a general method such as the VCA would provide meaningful estimates of shared and idiosyncratic contributions to judgments, while probing data preprocessing procedures and measurement error through the use of repeated measures and the psychometrics of stimuli. First, we showed that the VCA handled stimuli that represented edge cases of rater reliability and agreement. Second, a simulation study revealed that the relative advantage of increasing either the sample size or the stimulus set size for precision depends on the specific estimate. Generally, experimental designs with more than 30 stimuli and 60 raters can lead to reasonable precision. Third, we showed that more than two repeated measures are needed to arrive at stable estimates of shared and idiosyncratic contributions to judgments. Finally, whether repeated measures are averaged or simply added into the model as a preprocessing step will depend on assumptions about the residuals, as the former can increase VPC estimates. We elaborate on these findings before providing more concrete analytic recommendations for future research.

The role of stimuli

The VCAs handled all the edge cases and showed that stimuli can lead to extreme forms of shared and idiosyncratic judgments, depending on their psychometric properties within a specific evaluative dimension. Face beauty judgments showed equal amounts of shared and idiosyncratic contributions, congruent with previous findings (Hehman et al., 2017; Hönekopp, 2006), whereas object beauty judgments were highly idiosyncratic. However, it should be noted that this result was specific to the set of objects used in this study, which were selected because of their psychometric property of low interrater agreement for judgments of beauty. The same objects showed extremely high interrater agreement for judgments of dangerousness (see Supplementary Figs. 11, 13, and 15). Furthermore, other sets of novel objects have shown moderate agreement for judgments of beauty (Kurosu & Todorov, 2017).

Likewise, sets with different compositions of the same stimulus type (e.g., faces or novel objects) can also lead to varied estimates for the same judgment (e.g., beauty; Hönekopp, 2006; Kurosu & Todorov, 2017). If the estimates are highly idiosyncratic, that may suggest that the stimulus set was homogeneous in a manner relevant to the judgment. Conversely, a heterogeneous stimulus set may produce highly shared ratings. For example, deciding whether a supermodel or a goblin (a heterogeneous set) is more attractive would lead to high agreement, whereas choosing between two supermodels (a homogeneous set) would lead to more idiosyncratic preferences. Inferences about the eye of the beholder depend on what glasses (i.e., which dimension) the eye is observing through and also on what it is specifically observing. Thus, estimates of idiosyncrasy and/or sharedness are affected by the kind of stimulus set that researchers decide to use for their study.

Consequently, estimates from single studies are unlikely to capture general truths about the topic at hand. To this end, the range of possible estimates for the same research topic should be compared across a variety of contexts, stimulus sets, and sets of raters. Our results have implications for the potential to capture generalizable knowledge. Specifically, they are consistent with the idea that sample size should not be the only design consideration in studies with multiple raters and multiple stimuli: The number of stimuli matters, too (Westfall et al., 2015; Westfall et al., 2014), as does their heterogeneity with respect to the judgment. The finding that a larger stimulus set can yield greater precision than a larger sample suggests that a large number of stimuli is essential for accurately estimating and replicating shared and idiosyncratic contributions to judgments.

On the basis of our simulations, there were notable patterns in how precision operated for the VPCs and the intra- and interrater correlations. First, the VPCs showed less overall variability in magnitude, so these estimates were generally more precise. Second, the VPCs allowed more design leniency than the correlations, such that even 20 stimuli provided reasonable VPC precision. Third, the number of stimuli affected the precision of both intra- and interrater correlations, whereas the number of raters affected mostly intrarater correlations. As long as a design contained more than ten stimuli, these effects mapped onto the VPCs in a more distinct manner, in which raters were more important for the rater and Rater × Stimulus VPCs, and stimuli for the stimulus VPC. These effects are, of course, contingent on the simulation's assumptions: restricted stimulus variability and relatively similar means and correlations across repeated measures.

The link between VCA and psychometrics

In Study 2, the simulations also showed that the stimulus variance is associated with rater agreement, and that the Rater × Stimulus variance is associated with rater reliability (Fig. 4). Additional analyses with real rating data in Study 3 showed that an increase in intrarater reliability was typically associated with a decrease in the residual variance and an increase in the Rater × Stimulus variances (Supplementary Fig. 16), thereby increasing idiosyncratic contributions. Likewise, an increase in interrater agreement was generally associated with a decrease in the Rater × Stimulus variances, and sometimes an increase in the stimulus variance, increasing shared contributions. The role of the rater variance remains ambiguous, as our simulation results showed no relationship with rater agreement or reliability. In the additional analyses, the rater variance increased slightly with an increase in reliability and decreased with an increase in agreement (Supplementary Fig. 16). Yet our pattern judgments, which contained no rater reliability, showed a relatively large rater variance (Fig. 2B). These conflicting results suggest that the interpretation of the rater VPC will depend on whether the ratings are meaningful in the first place (i.e., reliable).

The importance of repeated measurements

We assessed how estimates of idiosyncrasy change as the number of repeated measures increases, on the assumption that more measures reduce measurement error. For the psychometric correlations, extra measures did improve estimates, as shown by the increasing intrarater reliability. Interestingly, this was coupled with a slight increase in interrater agreement for faces, indicating that, to the extent that there is initial rater agreement, increasing reliability will also increase agreement. The increases in the correlations seemed to continue even at ten measures, suggesting that these estimates could be improved further. This finding may be related to the number of stimuli used in this study. If participants had rated, say, only ten stimuli, the correlations would likely have stabilized with fewer repetitions.

In Study 3, the VPCs and beholder indices stabilized toward the ten-measure estimate, yet they were very unstable at the number of repeated measures that is currently the norm in psychology studies: two. Although this norm may reflect a design compromise driven by worries about task engagement, time on task, and participant fatigue, our results suggest that this level of measurement is insufficient. One important caveat is that unless the estimates are close to important boundaries (.5 for the beholder indices, or an overlapping range between cluster VPCs), the overall conclusion will be the same. However, if the magnitude of the effect is of importance, then more measures will yield a better estimate.

The role of data preprocessing

For correlations, repeated measures can only be averaged, whereas for modeling analyses, data preprocessing can consist of averaging the repeated measures or entering all of them into the model. Our findings showed that averaging versus simply adding repeated measures did not change the estimates that do not rely on the residual variance (beholder indices, variance components). However, for estimates whose equations include the residuals (the VPCs), averaging tends to increase the estimates, because averaging reduces the residual variance. The approach researchers should take will depend on their assumptions about the importance of keeping the residuals (i.e., whether they are systematic or random) intact for their research question.

Analysis recommendations

When it comes to computing intrarater, interrater, and rater-to-group correlations, one should aggregate the data within raters to satisfy the independence requirement of correlation. Whenever intrarater consistency/reliability is less than perfect, however, correlation-based estimates of agreement necessarily change as a function of how many repeated measures are collected. Therefore, VCAs provide better estimates for projects with multiple repeated measures (e.g., judgment studies).
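
To make this concrete, the following R sketch illustrates one common way such correlations can be computed. It is not the code used in our studies, and it assumes a long-format data frame d with hypothetical columns rater, stimulus, rep (the repeated-measure block), and rating.

# Minimal sketch; d and its column names (rater, stimulus, rep, rating) are hypothetical

# Intrarater reliability: correlate each rater's first two blocks across stimuli
intra <- sapply(split(d, d$rater), function(x) {
  m <- tapply(x$rating, list(x$stimulus, x$rep), mean)  # stimulus x block matrix
  cor(m[, 1], m[, 2])                                   # first two blocks
})
mean(intra)

# Aggregate within raters (over repeated measures) before computing agreement
agg <- aggregate(rating ~ rater + stimulus, data = d, FUN = mean)
profiles <- tapply(agg$rating, list(agg$stimulus, agg$rater), mean)  # stimulus x rater

# Interrater agreement: average pairwise correlation between raters' profiles
r_mat <- cor(profiles)
mean(r_mat[lower.tri(r_mat)])

# Rater-to-group: each rater's profile vs. the mean profile of the remaining raters
sapply(seq_len(ncol(profiles)), function(j)
  cor(profiles[, j], rowMeans(profiles[, -j, drop = FALSE])))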

When it comes to the use of the VPCs or the beholder indices, the goal of the analysis will determine the number of repeated measures and whether they are averaged or entered directly into the model. If one has hypotheses about the relative contributions of specific clusters (e.g., rater, stimulus) rather than a simple binary estimate, then the VPCs are more appropriate, provided that enough repeated measures are collected. As shown in the VPC analyses (Fig. 4), with two repeated measures the ranges of some of the variance components overlap. The same inferential issues may occur with the beholder indices. For example, the average b1 is around .5 for facial beauty (Fig. 5), but with only two measures the range of this estimate can go from below to above .5, shifting the inference from more idiosyncratic to more shared contributions.

Since the VPC depends on the residuals, averaging beforehand can increase the estimates as the number of repeated measures increases. If the residuals are assumed to be meaningful and critical to report (Doherty et al., 2013), or if variation across measurement repetitions is considered important to quantify for the research question (e.g., tasks with a temporal memory or learning component), the best approach is to use the disaggregated data: Mixed models are flexible enough to handle such data and will keep the VPC estimates stable with enough repeated measures. If one assumes that the residuals are simply noise and that block effects (i.e., changes across repetitions) are unimportant for the research question at hand, one can average to obtain relatively residual-less VPC estimates. The former approach maps the sources of total variance in the data, whereas the latter isolates and maps the variance of interest; both may serve to compare estimates across studies or to evaluate replications (Gantman et al., 2018). Whichever approach is taken, it is important to report which one was used, because the residual-less estimates will be greater than those that include intact residuals and may be harder to compare across studies (for related calls for standardization of analyses, see Judd et al., 2012).
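
As an illustration of the disaggregated route, the sketch below shows how a crossed variance component model could be fit with lme4 and how VPCs can be obtained from its output. This is a minimal sketch rather than our exact analysis code, and the data frame d and its columns (rater, stimulus, rating, with repeated measures kept as separate rows) are hypothetical.

library(lme4)

# Crossed random intercepts for raters, stimuli, and their interaction;
# repeated measures remain as separate rows, so the residual term
# reflects within-cell (repetition) variability.
fit <- lmer(rating ~ 1 + (1 | rater) + (1 | stimulus) + (1 | rater:stimulus),
            data = d)

# Variance components and the corresponding VPCs
# (each component divided by the total variance, residual included)
vc  <- as.data.frame(VarCorr(fit))
vpc <- setNames(vc$vcov / sum(vc$vcov), vc$grp)
round(vpc, 3)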

As we mentioned above, most of the analyses improved with more measures. In our data (Study 3), the variance estimates seemed to stabilize around six measures. However, collecting that amount of data might be difficult for many reasons: participant fatigue or lack of interest, time constraints, and issues of power such as having enough stimuli. In these cases, we recommend collecting as many measures as is feasible for the study design and explicitly quantifying measurement error by bootstrapping confidence intervals for each estimate (e.g., using the bootMer function from the lme4 R package), to obtain a measure of uncertainty and a range of likely estimates (Goldstein et al., 2002).
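
For instance, continuing the hypothetical model from the previous sketch, percentile confidence intervals for the VPCs could be obtained with bootMer along the following lines (nsim is set to 500 only for illustration).

# Function returning the VPCs from a fitted model
vpc_fun <- function(m) {
  vc <- as.data.frame(VarCorr(m))
  setNames(vc$vcov / sum(vc$vcov), vc$grp)
}

set.seed(1)
boots <- bootMer(fit, FUN = vpc_fun, nsim = 500)  # parametric bootstrap

# Percentile 95% intervals, one column per variance component
apply(boots$t, 2, quantile, probs = c(.025, .975))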

In addition to the number of repeated measures, the sample size and stimulus set size should both be given careful consideration, because of their effects on the precision of the estimates. Our simulations (Study 2) investigated only a maximum of 100 raters and 60 stimuli; therefore, there may be room for improvement beyond this range. Even within this range, however, we found that, in general, more than 30 stimuli and 60 raters led to reasonable precision across all estimates. Readers can consult Figs. 3 and 4 and Appendix B in the supplementary materials, which contain all estimates across the full range, to gain some visual intuition about how to balance the experimental costs of additional stimuli or raters.

Finally, it should be made explicit that these analyses were run separately on stimuli that were considered a single set or condition. If one were to combine different categories of stimuli, or stimuli from different conditions, into one analysis, doing so would likely increase the between-stimulus variance and result in largely shared estimates. That is, if there is a main effect of stimulus type or condition in one’s stimulus set, the analysis will overestimate the amount of shared judgment. It is best to run separate analyses for each condition, with each condition cell having an adequate number of raters and stimuli. To compare across conditions, one could use the bootstrapped confidence intervals and examine their overlap.
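
A minimal sketch of such a condition-wise comparison, again assuming the hypothetical data frame d with an added condition column, might look as follows; only the stimulus VPC is bootstrapped here, but the same logic applies to any estimate.

# Stimulus VPC from a fitted model
stim_vpc <- function(m) {
  vc <- as.data.frame(VarCorr(m))
  vc$vcov[vc$grp == "stimulus"] / sum(vc$vcov)
}

# Fit the crossed model within each condition and bootstrap the estimate
cis <- lapply(split(d, d$condition), function(dc) {
  fit_c <- lmer(rating ~ 1 + (1 | rater) + (1 | stimulus) + (1 | rater:stimulus),
                data = dc)
  quantile(bootMer(fit_c, FUN = stim_vpc, nsim = 500)$t, probs = c(.025, .975))
})
cis  # little or no overlap suggests the conditions differ in shared variance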

Conclusion

The amount of sharedness or idiosyncrasy in judgments is critical for the science of human behavior (Hehman et al., 2017; Hönekopp, 2006; Leder et al., 2016), for the potential replicability of projects (Gantman et al., 2018), and even for understanding the (un)fairness of our justice system (Austin & Williams III, 1977; Forst & Wellford, 1981; Hofer et al., 1999). Our goal here was to find a general method that would best quantify these contributions to judgments, while probing data preprocessing procedures and measurement error through the use of repeated measures and the psychometrics of stimuli. The correlation analyses showed that estimates monotonically increased in a nonlinear manner with repeated measures. Variance component analyses provided a more rigorous examination of idiosyncrasy, by marking where the variance lies in the data (e.g., in the stimuli, the raters, or their interaction), and provided estimates that stabilized with repeated measures. The focus on VPCs or beholder indices as the output of analysis depends on the specificity of the hypotheses and on assumptions about the residuals. Finally, consideration of the number of stimuli is just as important as consideration of the sample size in experiments designed to target these estimates. These methods are general enough to be useful in any judgment domain in which consensus and disagreement are important to quantify and in which multiple raters independently rate multiple stimuli.

Open Practices Statement

We reported all manipulations and measures. None of the studies reported were preregistered, as they are exploratory in nature. All data, analyses, and supplemental material can be found at the following Open Science Framework archive: https://osf.io/q28g6/.