Same same, but different: A psychometric examination of three frequently used experimental tasks for cognitive bias assessment in a sample of healthy young adults

Cognitive bias research draws upon the notion that altered information processing is key for understanding psychological functioning and well-being. However, little attention has been paid to the question of whether the frequently used experimental paradigms hold adequate psychometric properties. The present study examined the psychometric properties of three widely used cognitive bias tasks: the Approach-Avoidance Task (AAT), the visual dot-probe task, and the Implicit Association Test (IAT). Approach, attentional, and association biases towards valenced stimuli were repeatedly measured at five different time points in a sample of 79 healthy young adults. Two different devices were used for assessment: a personal computer (PC) and a touchscreen-based tablet. Reliability estimates included internal consistency and temporal stability. Validity was inferred from convergence across different behavioral tasks and correlations between bias scores and self-reported psychological traits. Reliability ranged widely amongst tasks, assessment devices, and measurement time points. While the dot-probe task appeared to be completely unreliable, bias scores obtained from the PC-based version of the AAT and both (PC and touchscreen) versions of the IAT showed moderate reliability. Almost no associations were found across information processing tasks or between implicit and explicit measures. Cognitive bias research should adopt a standard practice to routinely estimate and report psychometric properties of experimental paradigms, investigate feasible ways to develop more reliable tools, and use tasks that are suitable to answer the precise research question asked.
Supplementary Information: The online version contains supplementary material available at 10.3758/s13428-022-01804-9.


Internal consistency: Split-half correlations
Next to investigating internal consistency for bias scores derived from the AAT, the dot-probe task, and the IAT (main manuscript), here we analyze reliability for mean reaction times (RTs) per condition (i.e., mean RTs for the compatible or incompatible condition). Internal consistency of the measurements was quantified using the split-half method. More precisely, reliability estimates for mean RTs derived from the AAT and the dot-probe task were determined by means of correlations between odd- and even-numbered trials, respectively. Internal consistency of the IAT was calculated by correlating the first (practice) and the second (test) block, as recommended by Greenwald and colleagues (2003). Detailed results are displayed in Table A4.
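The odd/even split with the Spearman-Brown correction can be sketched as follows. This is a minimal illustration only; the function name and the participants-by-trials input layout are our assumptions, not the authors' analysis code.

```python
import numpy as np

def split_half_reliability(rts, apply_spearman_brown=True):
    """Split-half reliability of mean RTs: correlate each participant's
    mean RT over odd-numbered trials with that over even-numbered trials.
    Input layout (assumed): rows = participants, columns = trials."""
    rts = np.asarray(rts, dtype=float)
    odd_mean = rts[:, 0::2].mean(axis=1)   # trials 1, 3, 5, ...
    even_mean = rts[:, 1::2].mean(axis=1)  # trials 2, 4, 6, ...
    r = np.corrcoef(odd_mean, even_mean)[0, 1]
    if apply_spearman_brown:
        # correct for the halved test length
        return 2 * r / (1 + r)
    return r
```

Note that, as stated above for the IAT, this correction is only appropriate when the two halves are parallel sets of items; for block-based splits the uncorrected correlation is reported instead.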
AAT. Overall, split-half reliability coefficients demonstrated good to excellent internal consistency for mean condition RTs (both PC and touchscreen versions), ranging between r = .64 (rSB = .78) and r = .94 (rSB = .97).
Dot-probe task. Split-half correlations for mean RTs were high for both versions of the task and ranged between r = .71 (rSB = .83) and r = .90 (rSB = .95).
IAT. Split-half correlations were acceptable for mean condition RTs and ranged between r = .30 and r = .86. In general, reliability was somewhat greater for the touchscreen version of the task than for the PC-version.

Temporal stability: Test-retest reliability
Stability across time of the respective mean RTs was inferred from their bivariate autocorrelations. Detailed results are displayed in Table A5.
AAT. Mean RTs exhibited good test-retest correlations when assessed via PC. Estimates were lower, but still substantial for touchscreen-based assessment.

Dot-probe task. Estimates of test-retest reliability were high for condition RTs (all but two coefficients > .50).
IAT. Autocorrelations of mean RTs were acceptable, with most coefficients > .50.
Overall, all three tasks showed moderate to good internal consistency and temporal stability when mean RTs were analyzed (Koo & Li, 2016; Schmukle et al., 2005). Compared to the findings in the main manuscript, split-half and test-retest correlations for mean RTs yielded high consistencies and were in general larger than reliability estimates obtained from bias scores. It should be noted in this context, however, that reliability estimates for mean RTs will always turn out somewhat higher than those for difference scores. This is because measurement error from the two trial types/blocks (i.e., compatible vs. incompatible trials) is compounded when they are combined into a single index, resulting in an attenuation of correlation coefficients (Overall & Woodward, 1975; see also Brown et al., 2014; Enkavi et al., 2019). Furthermore, mean condition RTs are difficult to interpret, as they are limited in capturing individual differences in information processing preferences: high correlations within and across sessions might simply reflect general response speed, independent of emotional condition. As bias scores are a better index of information processing preferences than mere mean RTs, the results presented here should be interpreted with caution, and interested researchers should abstain from relying on mean RTs alone when interpreting psychometric properties of behavioral RT-based tasks.
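The attenuation argument can be illustrated with a small simulation (all parameter values are assumed, for illustration only): when a large, stable general-speed component is shared by both conditions, split-half reliability of mean RTs comes out high, whereas the difference score, which cancels that shared component and keeps only the small bias plus compounded noise, is far less reliable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                              # simulated participants
speed = rng.normal(600, 80, n)       # stable general response speed (ms)
bias = rng.normal(20, 10, n)         # small true bias (incompatible - compatible)

def two_halves(true_scores, noise_sd=40):
    """Simulate odd/even half scores as true score + independent noise."""
    return (true_scores + rng.normal(0, noise_sd, n),
            true_scores + rng.normal(0, noise_sd, n))

comp_a, comp_b = two_halves(speed)             # compatible condition halves
incomp_a, incomp_b = two_halves(speed + bias)  # incompatible condition halves

# split-half reliability of mean condition RTs vs. of the difference score
r_mean = np.corrcoef((comp_a + incomp_a) / 2, (comp_b + incomp_b) / 2)[0, 1]
r_diff = np.corrcoef(incomp_a - comp_a, incomp_b - comp_b)[0, 1]
# r_mean is high (driven by stable general speed); r_diff is much lower,
# because subtraction removes the speed component and compounds the noise
```

With these (hypothetical) parameters, the mean-RT reliability lands near .9 while the difference-score reliability is close to zero, mirroring the pattern reported above.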

Table A4
Internal consistency (split-half correlation) and descriptives for cognitive bias assessment tasks. Note. SD: standard deviation; n: number of observed cases for each task; r denotes the correlation between odd and even trials (AAT and dot-probe task) or between the first (practice) and second (test) block (IAT), respectively, and is based on multiple imputation. Because internal consistency for the IAT was based on blocks rather than a set of items, a Spearman-Brown correction was not applicable in this case.

III. The use of different scoring algorithms for bias score calculation

Criterion validity: Convergence between cognitive bias measures
Here, we report additional results on the comparison between different experimental paradigms for cognitive bias assessment (criterion validity). In the main manuscript, the calculation of bias scores was based on the most conventional approaches from the literature (that is, the difference between median reaction times for the AAT and the dot-probe task, and the d-score algorithm for the IAT). To aid comparability across tasks, two additional scoring formulas were applied here, and the results were compared with approach-avoidance (AAT) and attentional (dot-probe) biases based on the same calculation method.
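For reference, the conventional scoring approaches from the main manuscript can be sketched as follows. This is a simplified illustration with hypothetical function names; the full improved d-score algorithm of Greenwald et al. (2003) additionally excludes extreme trials and penalizes errors, steps omitted here.

```python
import numpy as np

def median_bias(compatible_rts, incompatible_rts):
    """Conventional AAT / dot-probe bias score: difference between median
    condition RTs (positive = slower responding in the incompatible condition)."""
    return np.median(incompatible_rts) - np.median(compatible_rts)

def d_score(compatible_rts, incompatible_rts):
    """Simplified IAT d-score: mean condition difference scaled by the
    pooled standard deviation of all trials from both conditions."""
    all_rts = np.concatenate([compatible_rts, incompatible_rts])
    pooled_sd = np.std(all_rts, ddof=1)
    return (np.mean(incompatible_rts) - np.mean(compatible_rts)) / pooled_sd
```

The key difference is that the d-score standardizes each participant's condition difference by their own RT variability, which makes scores comparable across participants with different overall response speeds.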

As can be seen in Figure A1, results are comparable to those reported in the main manuscript (conventional bias score calculation). In the PC assessment, approach biases for positive and negative cues were highly correlated at each measurement time point, with correlations ranging between .58 and .75. Bias scores across different assessment tasks were mostly uncorrelated (the only exception was a significant correlation between approach biases for positive cues and the IAT score at t1; r = .27). In the touchscreen-based assessment, approach biases for positive and negative pictures did not correlate, except at t1 (r = .34). The only between-task correlation appeared between the AAT (positive cues) and the dot-probe task at t3 (r = -.25).
As can be seen in Figure A2, results did not change substantially when applying formula (2) (difference between median condition RTs). In the PC assessment, approach biases for positive and negative cues were highly correlated at each measurement time point, with correlations ranging between .58 and .84. Bias scores across different assessment tasks were mostly uncorrelated (the only exception was a significant negative correlation between approach biases for negative cues and the IAT score at t5; r = -.32). In the touchscreen-based assessment, approach biases for positive and negative pictures did not correlate at the first two measurement points, but correlations were high for the remaining time points (range: .62-.71). In contrast to the main manuscript, where no significant interrelations between approach, attentional, and association biases appeared, there was a correlation between approach biases for negative cues and both attentional biases (r = .22) and IAT biases (r = .26) at t4.
Note. AATN and AATP denote bias scores towards negative and positive cues, respectively; ATT denotes attentional bias scores; the last row displays IAT bias scores. Numbers 1-5 indicate the respective measurement time point. All coefficients are standardized and were obtained by full information maximum likelihood estimation.

Criterion validity: Convergence between cognitive bias measures using the d-score algorithm
Note. All regression coefficients are shown; only the significant correlations are marked. * p < .05; ** p < .01; *** p < .001.

IV. Use of repeated measurements: Aggregating bias scores over measurement time points
Previous work has pointed to combining data from multiple measurements in order to obtain more precise estimates. For instance, Toffalini and colleagues (2021) demonstrated that assessing an outcome multiple times (i.e., three times at pre-treatment and three times at post-treatment) constitutes a feasible way to increase power. Here, we report results for aggregated bias scores that were combined over sessions (t1-t4). Please note that we used only the first four measurement time points, as the fifth took place too far apart in time (t5: +4 weeks vs. t1-t4: weekly sessions).
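The aggregation step can be sketched as follows, assuming a sessions-by-participants array of bias scores. Averaging (rather than summing) is used here so that the score scale stays comparable for participants with missing sessions; the helper name is ours, not the authors'.

```python
import numpy as np

def aggregate_sessions(bias_by_session):
    """Aggregate bias scores over sessions (rows = sessions t1..t4,
    columns = participants), ignoring sessions a participant missed."""
    bias = np.asarray(bias_by_session, dtype=float)
    return np.nanmean(bias, axis=0)  # per-participant mean over available sessions
```

Averaging k equally reliable sessions raises reliability in the manner of the Spearman-Brown formula, which is why the aggregated estimates below tend to exceed the per-session ones.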

Internal consistency: Split-half correlations
Detailed results are depicted in Table A6. As can be seen, the PC-version of the AAT and both versions of the IAT showed good internal consistencies. Reliability for the touchscreen-version of the AAT was somewhat lower, but still substantial. Reliability estimates for the dot probe task failed to reach statistical significance.
Results are comparable to those reported in the main manuscript, but internal consistency appears to be somewhat higher when aggregated scores are used.

Note. AAT: Approach-Avoidance Task; IAT: Implicit Association Test; r: Pearson's correlation coefficient; rSB: Spearman-Brown correction; * p < .05; ** p < .01; *** p < .001; M: mean bias score; SD: standard deviation; r denotes the correlation between odd and even trials (AAT and dot probe) or between the first (practice) and second (test) block (IAT), respectively, and is based on multiple imputation. Because internal consistency for the IAT was based on blocks rather than a set of items, a Spearman-Brown correction was not applicable in this case. Bias scores were aggregated over the first four time points.

Table A7 displays results for correlations between behavioral tasks. As can be seen, approach bias scores for positive and negative cues were strongly correlated, but no correlations appeared between the two different versions of the AAT (PC vs. touchscreen). For the touchscreen-based assessment, approach biases for positive cues were negatively correlated with attentional biases and positively correlated with IAT d-scores. Finally, a strong correlation appeared between the PC- and touchscreen-based assessments of IAT scores. Results are broadly comparable with those in the main manuscript, but correlations between the two versions of the IAT were stronger when scores were aggregated across measurement time points.

Construct validity: Association with self-report measures
Correlations between cognitive biases and self-report measures are presented in Table A8. As can be seen, there were only a few significant associations between cognitive biases and personality traits or anxiety. Of interest were the positive correlations between IAT scores and conscientiousness (PC: r = .32; touchscreen: r = .40), the correlations between negative personality traits and approach biases for negative cues as assessed via touchscreen (rNEUROTICISM = .29; rFEAR = .29; rANGER = .32; rSADNESS = .29), and the correlations between positive personality traits and approach and attentional biases as assessed via PC (rAATN;PLAY = -.30; rATT;CARE = .25). While these relationships are in the expected direction, there were also correlations contrary to expectation (a negative correlation between approach biases for positive cues and CARE: r = -.32). Overall, results resemble those reported in the main manuscript. Interestingly, however, somewhat more correlations turned out to be significant when aggregated scores were used compared to the single (first) measurement (see main manuscript). Still, correlations should be interpreted with caution, given the large number of comparisons.
Taken together, when using combined bias scores (i.e., aggregated over sessions), results largely resemble those reported in the main manuscript (i.e., the per-session analysis). However, in some cases somewhat higher correlations could be reached. In particular, most differences between these exploratory analyses and the findings reported in the manuscript emerged for underpowered analyses. Hence, our findings are in line with Toffalini et al. (2021) and suggest that the use of repeated measurement might increase power, especially in circumstances where power is low. A caveat is warranted here, however: in these exploratory analyses, we aggregated scores across sessions by calculating average scores. Toffalini et al. (2021), on the other hand, suggest using mixed-effects models with participants as random effects rather than aggregated scores, since the latter option loses information on intra-individual variability. This, however, was not possible for the current set of data. In addition, when using repeated measurements as proposed recently, measurement time points should ideally lie in close temporal proximity (i.e., be performed on the same day) and data should be collected using different versions of the same task.

Note. ATT Bias = Attentional Bias derived from the dot-probe task; IAT Bias = Association Bias (D-Score) derived from the Implicit Association Test; * p < .05; ** p < .01; *** p < .001. Bias scores were aggregated over the first four time points.