Metamemory for pictures of naturalistic scenes: Assessment of accuracy and cue utilization

Undorf, Monika; Bröder, Arndt

doi:10.3758/s13421-021-01170-5

Metamemory for pictures of naturalistic scenes: Assessment of accuracy and cue utilization

Open access
Published: 02 April 2021

Volume 49, pages 1405–1422, (2021)
Cite this article

Download PDF

You have full access to this open access article

Memory & Cognition Aims and scope Submit manuscript

Metamemory for pictures of naturalistic scenes: Assessment of accuracy and cue utilization

Download PDF

Monika Undorf¹ &
Arndt Bröder¹

2081 Accesses
14 Citations
Explore all metrics

Abstract

Memory for naturalistic pictures is exceptionally good. However, little is known about people’s ability to monitor the memorability of naturalistic pictures. We report the first systematic investigation into the accuracy and basis of metamemory in this domain. People studied pictures of naturalistic scenes, predicted their chances of recognizing each picture at a later test (judgment of learning, JOL), and completed a recognition memory test. Across three experiments, JOLs revealed substantial accuracy. This was due to people basing their JOLs on multiple cues, most of which predicted recognition memory. Identified cues include intrinsic picture attributes (e.g., peacefulness of scenes; scenes with or without persons) and extrinsic aspects of the study situation (e.g., presentation frequency; semantic distinctiveness of scenes with respect to the context). This work provides a better understanding of metamemory for pictures and it demonstrates close parallels between metamemory for naturalistic scenes and verbal materials.

Modulating human memory for complex scenes with artificially generated images

Article Open access 28 January 2022

Conceptual relatedness promotes memory generalization at the cost of detailed recollection

Article Open access 20 September 2023

The effect of intrinsic image memorability on recollection and familiarity

Article Open access 23 November 2020

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Much of memory’s power is due to the skillful ways in which people strategically regulate their memories based on metamemory—assessments of their own learning and remembering (Benjamin, 2007). There is much evidence that metamemory plays a crucial role in successful learning and remembering of verbal materials (see, e.g., Dunlosky & Metcalfe, 2009; Koriat, 2007; Undorf et al., 2020). In contrast, relatively little is known about metamemory for other materials and, in particular, pictures of naturalistic scenes (e.g., Rhodes, 2016). This lack of research is surprising because memory for naturalistic pictures is exceptionally good (e.g., Nickerson, 1965; Shepard, 1967; Standing, 1973). Investigating whether or not it is complemented by accurate metamemory is therefore an intriguing research question. The current research aims to expand our knowledge about metamemory by investigating the accuracy and basis of people’s predictions of remembering recently studied pictures of naturalistic scenes at a later test (judgments of learning [JOLs]). Thus, this research addresses the generalizability of established principles and findings in the literature on metamemory to stimuli researchers have rarely studied.

For verbal materials, immediate JOLs are moderately accurate (e.g., Bröder & Undorf, 2019; Dunlosky & Metcalfe, 2009; Koriat, 1997).^{Footnote 1} This accuracy is due to people basing their JOLs on probabilistic cues, many of which are predictive of actual memory (Koriat, 1997; Rhodes, 2016; Undorf et al., 2018; Undorf & Bröder, 2020). The cues that underlie JOLs are often classified into different groups (Koriat, 1997). Intrinsic cues are inherent to and inseparable from the study items; examples include word concreteness or word frequency. Extrinsic cues are bound to specific study conditions and can be randomly assigned to study items; examples include presentation time or the frequency of study presentations. Evidence is accumulating that people base their JOLs for verbal materials not only on single cues but on multiple intrinsic and extrinsic cues simultaneously (e.g., Bröder & Undorf, 2019; Tatz & Peynircioğlu, 2019; Undorf et al., 2018). For instance, Undorf et al.’s (2018) participants studied single words and made JOLs for recalling each word at a later test. When varying the two extrinsic cues number of study presentations (1 vs. 2) and font size (18 point vs. 48 point), individual-level analyses revealed that the majority of participants based their JOLs on both cues. Similarly, people integrated the two intrinsic cues concreteness (abstract vs. concrete) and emotionality (neutral vs. emotional) in their JOLs (also see Bröder & Undorf, 2019). Further experiments showed that all four cues affected JOLs when manipulated simultaneously and confirmed that people integrated three cues that varied on a continuum in their JOLs. In summary, it is established that JOLs for verbal materials are moderately accurate and a growing number of studies indicate that JOLs for verbal materials rely on multiple intrinsic and extrinsic cues.

For pictures of naturalistic scenes, however, the accuracy and basis of JOLs has remained largely unexplored. In a study by Kao et al. (2005), people studied scenes and indicated whether they would remember or forget each scene on a later recognition test. Results revealed that JOLs accurately predicted differences in the relative memorability of scenes. Using pictures from the IAPS database (Lang et al., 2008), some of which presumably depicted scenes, Hourihan and Bursey (2017) and Tauber et al. (2017) found above-chance JOL accuracy and higher JOLs for positively valenced pictures than for neutral pictures. Schmoeger et al. (2020) complemented this finding by reporting higher JOLs for negatively valenced than for neutral IAPS pictures. Together, these findings indicate that people at least partly based their JOLs for pictures on emotional valence.

Recall memory (Schmoeger et al., 2020; Tauber et al., 2017) but not recognition memory (Hourihan & Bursey, 2017) was better for positive and negative pictures than for neutral pictures, meaning that people accurately predicted valence effects on recall but not on recognition memory (for similar findings with verbal materials, see Bröder & Undorf, 2019; Zimmerman & Kelley, 2010).

The few studies on JOLs for naturalistic pictures thus provide preliminary evidence that these judgments may be accurate and rely on probabilistic cues. Of course, the specific cues that underlie JOLs for naturalistic pictures may differ from those identified with verbal materials. Some of the cues found to underlie JOLs for words are by definition specific to verbal materials (e.g., word concreteness, word frequency, font size of study words). In contrast, other cues that form the basis of JOLs for verbal materials could, in principle, also underlie JOLs for naturalistic pictures (e.g., valence, arousal, or presentation frequency). Also, there are probably cues that are specific to JOLs for naturalistic pictures. One potential cue is whether scenes are presented in color or in grayscale (for effects of this cue on recognition memory, see Spence et al., 2006; Wichmann et al., 2002).

The research reported here aims to contribute to a better understanding of the accuracy and basis of metamemory for naturalistic pictures. Across three experiments, we simultaneously manipulated two or more cues that were expected to affect recognition memory. These expectations were based on a massive research effort on the memorability of more than 2,000 scene photographs (Bylinskii et al., 2015; Isola et al., 2011, 2014). In this research, participants completed an online continuous recognition task in which they viewed long streams of scene photographs and pressed a key whenever they recognized a scene they had seen before. Results from more than 600 participants showed that scenes systematically differed in memorability. Also, results showed that numerous picture attributes predicted memorability. For instance, scenes that independent raters judged to tell a story were more memorable than scenes that did not tell a story (Isola et al., 2011). Also, judged peacefulness of scenes affected memorability, with scenes judged as peaceful being less memorable than scenes judged as not peaceful (Isola et al., 2011). Semantic distinctiveness of scenes with respect to the context boosted memorability: Scenes were more memorable when presented in the context of scenes from other categories than in the context of scenes from the same category (e.g., memorability of a highway scene was higher when presented among scenes from 21 different categories than among highway scenes only; Bylinskii et al., 2015). Also relevant for present purposes, Isola et al. (2011, 2014) reported that it was not trivial for people to accurately assess scene memorability. Isola et al. (2014) had participants from an online pool indicate for a set of scenes whether each was or was not memorable or whether they would or would not, when seeing each scene again later, realize having seen it before. People’s binary ratings did not correlate with actual memorability (Isola et al., 2014). Also, ratings revealed that people thought aesthetic beauty and interestingness to increase memorability, whereas both picture attributes rather impaired memorability. Similar misconceptions prevailed among experts on human and computer vision (Isola et al., 2011). Notably, these general memorability ratings are so different from standard JOLs—made in a learning phase and on scales that allow learners to differentiate finely graded levels of confidence—that the accuracy of JOLs for naturalistic pictures is still an open question. Based on the literature on JOLs for verbal materials, we expected that JOLs for naturalistic pictures would show at least some accuracy. We will return to the issue of differences between Isola et al.’s (2014) memorability judgments and standard JOLs in the General Discussion.

We manipulated two extrinsic cues in Experiment 1, two intrinsic cues in Experiment 2, and a total of five cues (two extrinsic, three intrinsic) in Experiment 3. In each experiment, participants studied around 100 scene pictures, made an immediate JOL for each scene, and completed a recognition test where studied scenes were randomly intermixed with an equal number of new scenes. We examined whether and how each manipulated cue affected JOLs and investigated the relative accuracy of JOLs (e.g., Dunlosky & Metcalfe, 2009; Koriat, 2007; Rhodes, 2016). Relative accuracy (also referred to as resolution) assesses the extent to which JOLs distinguish between items people will later recognize and those they will not recognize.

Experiment 1

In Experiment 1, we examined the accuracy and basis of JOLs for pictures of naturalistic scenes that differed in contextual semantic distinctiveness and color. To manipulate contextual distinctiveness, we varied the number of scenes that came from specific semantic categories: Distinct scenes came from categories that contributed four scenes to the study list, and indistinct scenes came from categories that contributed 28 scenes to the study list. This manipulation affected primary distinctiveness according to Schmidt (1991), because it made distinctive items stand out from the other study items (e.g., a kitchen scene stands out more than an airplane scene when four scenes depict kitchens and 28 scenes depict airplanes). High contextual distinctiveness was previously found to improve memory performance for scenes (e.g., Bylinskii et al., 2015) and to increase memory and metamemory judgments for words and face–name associations (Dunlosky et al., 2000; Watier & Collin, 2012). To manipulate color, participants studied the colored originals of half the scenes and grayscale versions of the remaining scenes.^{Footnote 2} Studies show that colored scenes are remembered better than grayscale scenes (e.g., Spence et al., 2006; Wichmann et al., 2002), and a single study found that young children expect to remember sets of colored pictures of common objects better than sets of black-and-white pictures (Kreutzer et al., 1975). We therefore predicted that contextual distinctiveness and color would improve recognition memory and probably also increase JOLs.

Method

Participants

We aimed at a sample size of 54 participants to obtain a statistical power of (1 − β) = .95 to detect medium-sized main effects (f = .25, equivalent to η_p² = .06) with α = .05 when assuming a correlation of .50 between repeated measures (all power analyses conducted via G*Power 3; Faul et al., 2007). We aimed at optimizing power for main effects because we had no hypotheses about interactions. Participants were 56 University of Mannheim undergraduates (43 females, 13 males) with a mean age of 23.05 years (SD = 4.41).

Design and materials

The design was a 2 (distinctiveness: low vs. high) × 2 (color: grayscale vs. color) within-participants design. Stimuli were 336 pictures from the SUN database (Xiao et al., 2010), with 56 pictures from each of six semantic scene categories (airplane, bedroom, beer garden, highway, kitchen, stadium; see Fig. 1). Categories varied in whether they included indoor or outdoor scenes (indoor: kitchen and bedroom; outdoor: airplane, beer garden, highway, stadium) and whether scenes did or did not contain persons (with persons: beer garden, stadium; without persons: airplane, bedroom, highway, kitchen).

Appendix Table 3 illustrates how stimuli were distributed across within-subjects conditions. In indistinct categories, all 56 pictures served as stimuli. These pictures were divided into two parallel sets, with 28 items each. In distinct categories, a fixed set of eight randomly selected pictures served as stimuli, divided into two parallel sets with four items. For each participant, one set of scenes per category served as targets (items presented at study and test) and the second set of scenes served as distractors (new items presented at test but not at study). Scene categories and picture sets were counterbalanced across participants such that each scene category was equally often distinct and indistinct and that each picture served equally often as target and distractor.

We created a grayscale version of each picture that was used in the grayscale condition, whereas the colored original was used in the color condition.

Procedure

Participants were tested individually or in small groups in the laboratory. The experiment was fully computerized and took about 30 min to complete. It consisted of a study phase and a recognition memory test. Participants studied 96 scene photographs, with four scenes from each of the three distinct categories and 28 scenes from each of the three indistinct categories. They were tested on 192 scenes that included four studied and four new scenes from each distinct category and 28 studied and 28 new scenes from each indistinct category. At study and test, participants saw one randomly selected half of the scenes per category in color and the other half in grayscale. For each participant, scenes were presented in a new random order at study and test. At study, each scene was preceded by a fixation cross (500 ms) and displayed for 1 s. Immediately afterwards, participants made a JOL by indicating the probability of recognizing the picture at test, using one of 11 keys labeled 0, 10 … 90, 100. Following the study phase, participants performed a numerical filler task for 3 min to avoid recency effects. On filler trials, participants added up 10 digits (e.g., 4234657984) and typed in the sum’s last digit (e.g., 2). This filler task was chosen, because it does not interfere with episodic memory or pictorial encoding. Participants then completed a self-paced recognition test, in which one scene after the other was presented, and participants indicated whether it was old or new by pressing labeled keys.

Results

Contextual distinctiveness and color affected both JOLs and recognition memory (see Table 1). A 2 (distinctiveness: low vs. high) × 2 (color: grayscale vs. color) repeated-measures ANOVA on JOLs revealed a main effect of distinctiveness, F(1, 55) = 22.10, p < .001, η_p² = 0.29, indicating higher JOLs for scenes from distinct than indistinct categories as well as a main effect of color, F(1, 55) = 43.78, p < .001, η_p² = 0.44, indicating higher JOLs for color than grayscale scenes, but no interaction, F(1, 55) = 1.95, p = .168, η_p² = 0.03. A similar ANOVA on corrected hit rates P_r measuring actual recognition performance revealed main effects of distinctiveness, F(1, 55) = 60.00, p < .001, η_p² = 0.52, and color, F(1, 55) = 29.64, p < .001, η_p² = 0.35, but no interaction, F < 1.

Table 1 Descriptive statistics for Experiments 1 and 2

Full size table

Even though contextual distinctiveness and color affected JOLs at the aggregate level, it was still possible that no participant integrated the two cues in his or her JOLs (cf. Undorf et al., 2018). The reason is that main effects of the two cues at the aggregate level could result if each participant’s JOLs were based on only one cue, but different participants based their JOLs on different cues. Two supplementary analyses therefore evaluated cue integration at the individual level. The first analysis focused on simple mean differences. A participant was scored to base JOLs on distinctiveness if his or her mean JOLs were higher for scenes from distinct than indistinct categories and was scored to base JOLs on color if his or her mean JOLs were higher for color than grayscale scenes. According to this criterion, 66% of participants integrated distinctiveness and color in their JOLs (see Table 2).

Table 2 Number of experimentally manipulated cues used by percentages (and absolute numbers) of individual participants in Experiments 1 to 3

Full size table

The second individual-level analysis was based on effect sizes for the very same mean differences. Figure 2 shows each participant’s Cohen’s d for distinctiveness (x-axis) and color (y-axis). The majority of participants are located in the upper right quadrant, indicating that they predicted better memory for distinct than indistinct scenes and better memory for color than grayscale scenes. When using Cohen’s (1977) small effects convention of |d| ≥ .2 as evidence for reliable cue effects, 43% of participants integrated the two cues in their JOLs.

Finally, we examined the relative accuracy of JOLs. The most commonly used measure of relative JOL accuracy is the within-subject gamma correlation between JOLs and memory performance (Nelson, 1984). Recently, however, using the gamma coefficient as a measure of relative accuracy has been criticized for conflating people’s metacognitive competency with random noise (Bröder & Undorf, 2019) and for inflated Type 1 errors (Murayama et al., 2014), among others (cf. Benjamin & Diaz, 2008; Masson & Rotello, 2009; Spellman et al., 2008). We therefore analyzed relative accuracy not only using the gamma correlation between JOLs and recognition hits but also using the matching index G from Brunswik’s (1952) lens model (Bröder & Undorf, 2019) and using a mixed-effects model analysis (Murayama et al., 2014).

A significantly positive gamma coefficient, M = .28 (SD = .21), t(55) = 9.83, p < .001, d = 1.33, and a significantly positive matching index G, M = .40 (SD = .09), t(55) = 4.63, p < .001, d = 0.62, indicated that JOLs captured differences in the relative memorability of scenes. The matching index G was numerically but not significantly higher than the gamma coefficient, t(55) = 1.50, p = .139, d = 0.20. The mixed-effects model analysis corroborated the conclusion that JOLs accurately predicted scene memorability. We used a logistic regression model with random slopes and intercepts for participants that predicted recognition hits from group-mean centered JOLs. This model showed that JOLs significantly predicted memory performance, b = 0.02, z = 9.03, p < .001, and it provided a significantly better fit to the data than an otherwise identical model without JOLs as a fixed-effects predictor, χ²(1) = 53.28, p < .001.

We also examined relative accuracy separately for each within-subject condition. As can be seen in Appendix Table 5, Gamma coefficients ranged between .19 and .35, were all significantly positive, and did not differ across within-subject conditions. In a mixed-effects model analysis, accuracy differences across conditions should produce significant interactions between the fixed-effects predictors for JOLs and for the manipulated cues. In a logistic regression model with random participant intercepts and random participant slopes for JOLs, we predicted recognition hits from group-mean centered JOLs, distinctiveness (0 = low, 1 = high), color (0 = grayscale, 1 = color), and their interactions. Results showed that neither distinctiveness nor color interacted with JOLs, Distinctiveness × JOLs: b = 0.001, z = 0.26, p = .798; Color × JOLs: b = 0.003, z = 1.14, p = .253; Distinctiveness × Color × JOLs: b = −0.010, z = 1.36, p = .175. Also, this model did not provide a better fit to the data than an otherwise identical model without interactions between JOLs and one or both cues, χ²(3) = 3.28, p = .350. Thus, the mixed-effects model analysis also demonstrated similar relative accuracy of JOLs across conditions.

Discussion

As in previous work, contextual distinctiveness and color helped recognition memory. A new finding was that people based their JOLs for pictures of naturalistic scenes on contextual distinctiveness and color, with most but not all participants predicting better memory performance for distinct scenes and color scenes than for indistinct scenes and grayscale scenes. Individual-level analyses showed that many participants integrated both cues in their JOLs. Three different measures of relative accuracy converged on the conclusion that JOLs captured differences in the relative memorability of scenes and that relative accuracy was in the range reported in studies with verbal materials (e.g., Koriat, 2007; Rhodes, 2016; Bröder & Undorf, 2019). Overall, Experiment 1 revealed that the basis and accuracy of JOLs for naturalistic scenes was similar to that of JOLs for verbal materials.

Experiment 2

Experiment 2 aimed to replicate and extend Experiment 1 by manipulating two intrinsic cues. As we mentioned in the Introduction, Isola et al. (2011) obtained independent ratings for various scene attributes from a norming sample and then assessed whether these ratings were predictive of picture memorability in a continuous recognition task completed by a new sample. Two of the picture attributes that Isola et al. (2011) found to predict memorability were the extent to which scenes tell a story and to which scenes are peaceful. Scenes that were judged to tell a story were remembered better than scenes that did not tell a story. Judged peacefulness, in contrast, was inversely related with scene memorability. In Experiment 2, participants studied, judged, and recognized scenes that were either low or high in story and peacefulness. We expected to replicate Isola et al.’s (2011) findings of better recognition memory for scenes high in story and worse recognition memory for scenes high in peacefulness. We expected that, as in Experiment 1, JOLs would be moderately predictive of recognition memory. It was unclear, however, whether story and peacefulness would affect JOLs and hence whether JOLs would accurately predict cue effects on recognition memory.