We conducted a statistical analysis on the results of inconsistent condition trials by comparing the trials in which natural/artificial consistency was maintained with trials in which the consistency was violated (Fig. 1). The results showed that natural/artificial inconsistent trials had significantly better accuracy than natural/artificial consistent trials in Experiment 1a (consistent mean .64 vs. inconsistent mean .81) (F(1,19) = 43.23, p < .001, η² = .70) and Experiment 1b (consistent mean .69 vs. inconsistent mean .80) (F(1,19) = 13.55, p = .002, η² = .42). On the other hand, natural/artificial consistent trials had significantly better accuracy than natural/artificial inconsistent trials in Experiment 2 (consistent mean .53 versus inconsistent mean .44) (F(1,19) = 4.90, p = .039, η² = .21).
Additionally, we tested these findings for possible viewpoint influence. The viewpoint effect was significant in Experiment 1a (F(1,19) = 24.74, p < .001, η² = .57) and Experiment 1b (F(1,19) = 20.13, p < .001, η² = .51), but not significant in Experiment 2 (F(1,19) < 1).
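As a quick plausibility check on the effect sizes above, the following minimal sketch recomputes them from the reported F ratios. It assumes that the reported η² is partial eta squared, which the text does not state explicitly; the function name and the dictionary layout are illustrative only. Under that assumption, the recomputed values agree with the reported ones to within about .01.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom.

    Assumption: the eta squared values in the text are partial eta squared.
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F value, reported eta squared) pairs taken from the text; all tests have df = (1, 19).
reported = {
    "Exp 1a consistency": (43.23, 0.70),
    "Exp 1b consistency": (13.55, 0.42),
    "Exp 2 consistency": (4.90, 0.21),
    "Exp 1a viewpoint": (24.74, 0.57),
    "Exp 1b viewpoint": (20.13, 0.51),
}

for label, (f_value, eta_reported) in reported.items():
    eta_computed = partial_eta_squared(f_value, df_effect=1, df_error=19)
    print(f"{label}: computed {eta_computed:.3f}, reported {eta_reported:.2f}")
```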
These results can be explained by the same logic used throughout this paper. In Experiments 1a and 1b (object recognition task), natural/artificial inconsistent trials were more accurate than consistent trials because of the salient visual and semantic contrast between the object and its background, which allowed more accurate segmentation and recognition/naming of the target object. In Experiment 2 (background recognition task), consistency in the natural/artificial aspect presumably helped the integration of the object with the background, yielding higher performance than in natural/artificial inconsistent trials. The viewpoint effect, which was present in this analysis in Experiments 1a and 1b, was absent in Experiment 2 for the same reasons that it appeared or disappeared when all conditions were taken into account (see the individual Discussion sections for each experiment above and the General discussion below).
An inter-rater reliability score (e.g., 75% if 3 of the 4 scorers made the same judgment) was also computed for all experiments, yielding the following results: the mean reliability was 97% for Experiment 1a, 95% for Experiment 1b, and 85% for Experiment 2. Because these values were well above 75% (i.e., the minimum agreement a trial needs to be considered correct), we concluded that the scorers' decisions were sufficiently consistent.
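The agreement measure described above can be sketched as follows. This is a minimal illustration, assuming that the score for a trial is the share of scorers who gave the most common judgment and that the reported reliability is the mean of these shares; the function names and the example judgments are hypothetical and are not data from the experiments.

```python
from collections import Counter


def trial_agreement(judgments):
    """Share of scorers who gave the most common judgment for one trial."""
    counts = Counter(judgments)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(judgments)


def mean_reliability(trials):
    """Mean agreement across trials (assumed to correspond to the reported mean reliability)."""
    return sum(trial_agreement(t) for t in trials) / len(trials)


# Hypothetical judgments from four scorers on two trials.
example_trials = [
    ["correct", "correct", "correct", "correct"],    # 4 of 4 agree
    ["correct", "correct", "correct", "incorrect"],  # 3 of 4 agree
]

for judgments in example_trials:
    print(trial_agreement(judgments))
print(mean_reliability(example_trials))
```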
Additionally, in every experiment the reliability scores were significantly negatively correlated with the variety of labels used by the participants (see below) (Experiment 1a: r(54) = −.48, p < .001; Experiment 1b: r(54) = −.49, p < .001; Experiment 2: r(54) = −.45, p < .001). The variety of labels also predicted the reliability scores and explained a significant proportion of their variance (Experiment 1a: (unstandardized) b = 1.01, t(54) = 83.23, p < .001; R² = .23, F(54) = 16.52, p < .001; Experiment 1b: (unstandardized) b = 1.01, t(54) = 75.60, p < .001; R² = .24, F(54) = 17.35, p < .001; Experiment 2: (unstandardized) b = 0.95, t(54) = 32.28, p < .001; R² = .20, F(54) = 13.81, p < .001). Thus, we can assume that the lower reliability score in Experiment 2 was mainly the result of high variability in the descriptive labels used by the participants.
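The correlation and regression statistics above can be related to each other with a short sketch. It assumes a simple regression with label variety as the single predictor, so that R² equals r² and the F ratio has (1, 54) degrees of freedom; the text reports only F(54), so the numerator degrees of freedom are an assumption. Because the reported r values are rounded to two decimals, the recomputed F ratios are only approximate.

```python
def simple_regression_summary(r, df_error):
    """R squared and F ratio implied by a Pearson r in a one-predictor regression.

    Assumption: a single predictor, so R^2 = r^2 and F has (1, df_error) degrees of freedom.
    """
    r_squared = r ** 2
    f_value = r_squared * df_error / (1.0 - r_squared)
    return r_squared, f_value


# Reported r values (rounded to two decimals in the text), each with 54 error df.
reported_r = {"Experiment 1a": -0.48, "Experiment 1b": -0.49, "Experiment 2": -0.45}

for experiment, r in reported_r.items():
    r_squared, f_value = simple_regression_summary(r, df_error=54)
    print(f"{experiment}: R^2 = {r_squared:.2f}, F(1, 54) = {f_value:.2f}")
```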
We counted the number of different labels or descriptions used by the participants to name each object and background, and we found that backgrounds had significantly more variety (507 in total) than objects in Experiments 1a and 1b (229 and 323, respectively). These counts were significantly negatively correlated not only with scorer reliability but also with the accuracy data for objects and backgrounds (Experiment 1a: r(54) = −.66, p < .001; Experiment 1b: r(54) = −.76, p < .001; Experiment 2: r(54) = −.72, p < .001).
Effect of task difficulty in background recognition
The overall accuracy of background recognition (Experiment 2) was lower than that of object recognition (Experiments 1a and 1b). One might assume that the absence of a viewpoint effect in Experiment 2 (in contrast to its presence in Experiments 1a and 1b) is attributable to the higher task difficulty. To address this issue, we examined the effect of task difficulty in Experiment 2.
We sorted the background stimuli into two categories: those that yielded better performance and those that yielded worse performance. The overall accuracy for the better performing half was comparable with that of the object recognition task (Experiments 1a and 1b) (consistent/canonical: 83%, consistent/accidental: 80%, inconsistent/canonical: 70%, inconsistent/accidental: 73%). We tested these results with a two-factor repeated measures ANOVA (consistency × viewpoint) and once again found a significant consistency effect, with consistent scenes (M = .81) recognized more accurately than inconsistent scenes (M = .72) (F(1,19) = 7.89, p = .011, η² = .29), along with a nonsignificant viewpoint effect (F(1,19) < 1) and a nonsignificant interaction between these two effects (F(1,19) < 1). This pattern of results was identical to that found in the analysis of the whole data of Experiment 2. We further confirmed that the pattern was replicated for the worse-performing half.
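The split into better and worse performing stimuli can be sketched as below. This is only an illustration that assumes the split was made at the median of per-stimulus accuracy; the exact criterion is not described in the text, and the stimulus names and accuracy values are hypothetical.

```python
def split_by_accuracy(accuracy_by_stimulus):
    """Split stimuli into a better and a worse performing half by mean accuracy.

    Assumption: a median split on per-stimulus accuracy; ties keep their input order.
    """
    ranked = sorted(accuracy_by_stimulus, key=accuracy_by_stimulus.get, reverse=True)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]


# Hypothetical per-background accuracies (not data from Experiment 2).
example_accuracy = {"beach": 0.9, "kitchen": 0.4, "forest": 0.7, "office": 0.5}

better_half, worse_half = split_by_accuracy(example_accuracy)
print("better:", better_half)
print("worse:", worse_half)
```

Each half can then be analyzed separately with the same consistency × viewpoint design used for the full dataset.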
In addition, a three-factor ANOVA (the better performing half versus whole original dataset × consistency × viewpoint) showed a significant overall difference between the two sets of data (as expected, because we only used the better performing half of the original data here), but there was no interaction, which means that the original pattern did not change significantly. Therefore, we can assume that the conclusions of Experiment 2 were independent of overall task difficulty.
However, these ad-hoc analyses are not completely conclusive. Future examination is warranted to identify the reason why background naming is often more difficult than object naming.