The effect of intrinsic image memorability on recollection and familiarity

Broers, N.; Busch, N.A.

doi:10.3758/s13421-020-01105-6

Open access
Published: 23 November 2020

Volume 49, pages 998–1018, (2021)

Memory & Cognition

N. Broers^1,2 &
N.A. Busch^1,2

A Correction to this article was published on 25 May 2021

This article has been updated

Abstract

Many photographs of real-life scenes are very consistently remembered or forgotten by most people, making these images intrinsically memorable or forgettable. Although machine vision algorithms can predict a given image’s memorability very well, nothing is known about the subjective quality of these memories: are memorable images recognized based on strong feelings of familiarity or on recollection of episodic details? We tested people’s recognition memory for memorable and forgettable scenes selected from image memorability databases, which contain memorability scores for each image, based on large-scale recognition memory experiments. Specifically, we tested the effect of intrinsic memorability on recollection and familiarity using cognitive computational models based on receiver operating characteristics (ROCs; Experiments 1 and 2) and on remember/know (R/K) judgments (Experiment 2). The ROC data of Experiment 1 indicated that image memorability boosted memory strength, but did not reveal a specific effect on recollection or familiarity. By contrast, ROC data from Experiment 2, which was designed to facilitate encoding and, in turn, recollection, provided evidence for a specific effect of image memorability on recollection. Moreover, R/K judgments showed that, on average, memorability boosts recollection rather than familiarity. However, we also found a large degree of variability in these judgments across individual images: some images actually achieved high recognition rates by exclusively boosting familiarity rather than recollection. Together, these results show that current machine vision algorithms that can predict an image’s intrinsic memorability in terms of hit rates fall short of describing the subjective quality of human memories.

Our visual memory capacity for real-life scenes and objects is one of the most impressive feats of human cognition (Brady, Konkle, Alvarez, & Oliva, 2008; Standing, 1973). While memories of specific images are in part influenced by individual factors such as interest (Hidi, 1990) or expertise (Curby, Glazek, & Gauthier, 2009), it has been shown that many images are in fact consistently remembered or forgotten across many observers (Isola, Xiao, Parikh, Torralba, & Oliva, 2014; Bylinskii, Isola, Bainbridge, Torralba, & Oliva, 2015; Bainbridge, Isola, & Oliva, 2013; Bainbridge, 2020). This consistency of an image’s memorability spans a wide array of different picture presentation times (Mancas & Le Meur, 2013; Broers, Potter, & Nieuwenstein, 2018; Goetschalckx, Moors, Vanmarcke, & Wagemans, 2019b; Mohsenzadeh, Mullin, Oliva, & Pantazis, 2019), study and test intervals (Goetschalckx, Moors, & Wagemans, 2018; Isola et al., 2014) and experimental paradigms (Bylinskii et al., 2015; Bainbridge, 2020; 2017; Jaegle et al., 2019), implying that memorability is largely independent of personal or situational factors (Bainbridge, 2019). While some images contain information one would expect to be highly memorable (e.g., close-ups of humans/animals, distinctive objects that appear out of context), many memorable images are not particularly conspicuous and observers cannot accurately judge whether an image is memorable or not (Isola et al., 2014) (see Fig. 1 for example images). Most previous studies have focused on the application of machine vision algorithms to predict memorability as accurately as possible and to identify the image information that makes an image memorable (Isola et al., 2014; Bylinskii et al., 2015; Khosla, Raju, Torralba, & Oliva, 2015; Goetschalckx, Andonian, Oliva, & Isola, 2019a). Convolutional neural networks (CNNs) have been particularly successful at predicting image memorability (Khosla et al., 2015). 
These networks are composed of multiple processing layers that learn representations of input data with increasing levels of abstraction, setting new benchmark performances in scene and object recognition (LeCun, Bengio, & Hinton, 2015; Simonyan & Zisserman, 2015). Importantly, these studies have quantified memorability by assessing hit rates in image recognition tasks. However, the cognitive processes underlying these recognition decisions are largely unknown (but see Akagunduz et al., 2019).

It has long been acknowledged that old items can be recognized based on a feeling of familiarity or recollection of specific contextual details about the study event (Mandler, 1980; Yonelinas, 2001). The famous “butcher-on-the-bus” anecdote by Mandler (1980) perfectly exemplifies these two phenomenologies during recognition. The anecdote concerns an encounter with a man on a bus whose familiar face prompts a query in memory. The observer might not be able to retrieve additional information about the man, despite being confident of knowing him. Thus the man only feels familiar. If a query in memory yields additional information about the man, the observer would then recollect that he is in fact the butcher from the local supermarket. Two of the most prominent methods for assessing recollection and familiarity are Remember/Know (R/K) statements (Tulving, 1985) and receiver operating characteristics (ROCs; Yonelinas and Parks, 2007). In R/K tasks, participants indicate directly, after an old/new statement, whether they remember specific episodic details about the item (recollection) or whether they only know that the item is old (familiarity) (Tulving, 1985; Gardiner, Ramponi, & Richardson-Klavehn, 2002).

ROCs on the other hand are an indirect tool to index recollection and familiarity (Yonelinas & Parks, 2007). An ROC is a function that relates the hit rate to the false-alarm rate across different levels of an increasingly relaxed response criterion, such as decision confidence (see Fig. 2 for illustrations). ROCs have been explored with different computational models that make different assumptions on the cognitive mechanisms underlying recognition. According to dual-process signal detection (DPSD) models, the shapes of ROC curves can reflect two distinct memory processes (Yonelinas & Parks, 2007). First, recollection is treated as an all-or-none process, where information about an item is only recollected if its memory strength exceeds a certain threshold. Recollection-associated responses are assumed to be more confident on average for hits than for false alarms, resulting in a “hockey-stick”-shaped ROC. Thus, the intercept is an index of recollection and bent upwards for most conservative responses in z-transformed ROC shapes (see Fig. 2a). Secondly, familiarity is treated as a signal-detection process, where an item is accepted as old if its memory strength exceeds a decision criterion. Familiarity-associated responses produce curvilinear ROCs, where the area between the curve and the chance diagonal is an index of familiarity, and linear z-transformed ROCs, where the intercept is an index of recognition accuracy (see Fig. 2b). Importantly, according to DPSD models, the difference between recollection and familiarity is conceptually distinct from differences in decision confidence, although they may be correlated empirically. Successful recognition always depends on both processes, but if recollection fails, recognition is assumed to rely on familiarity (Yonelinas, Aly, Wang, & Koen, 2010). Thus, the two processes are assumed to be parallel, but functionally and neuroanatomically distinct (Eichenbaum, Yonelinas, & Ranganath, 2007). 
By contrast, single-process signal detection models assume that recollection and familiarity are both simply a measure of memory strength, with recollection reflecting higher memory strength than mere familiarity (Donaldson, 1996; Wixted & Stretch, 2004). A particularly successful variant of single-process models is the Unequal Variance Signal Detection (UVSD) Model, which assumes that the distribution of old items has greater variance than the distribution of new items. It is important to emphasize that neither model denies that recollection and familiarity are phenomenologically distinct ways of remembering, whether or not they may reflect distinct cognitive processes.
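To make the definition of an ROC concrete, the following minimal sketch builds empirical ROC points from confidence-rating counts. It assumes a six-bin response scale (old/new crossed with three confidence levels) ordered from the most confident "old" to the most confident "new" response; the function name and the counts are hypothetical and serve only as an illustration.

```python
from itertools import accumulate


def roc_points(old_counts, new_counts):
    """Return (false-alarm rate, hit rate) pairs for an empirical ROC.

    Both lists hold response counts per rating bin, ordered from the most
    confident "old" response to the most confident "new" response.
    Accumulating across bins corresponds to relaxing the response criterion.
    """
    n_old, n_new = sum(old_counts), sum(new_counts)
    hit_rates = [c / n_old for c in accumulate(old_counts)]
    fa_rates = [c / n_new for c in accumulate(new_counts)]
    # The final cumulative point is always (1, 1), so it is dropped.
    return list(zip(fa_rates[:-1], hit_rates[:-1]))


# Hypothetical counts for six rating bins (sure old ... sure new).
old = [40, 20, 10, 10, 10, 10]
new = [5, 10, 15, 20, 25, 25]
points = roc_points(old, new)
```

Each successive point adds the responses of the next, less conservative rating bin, which is why six rating bins yield five informative ROC points.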

Interestingly, experimental manipulations do not affect recollection and familiarity uniformly (see Yonelinas, 2002 for a comprehensive review). For example, deep encoding compared to shallow encoding improves recollection more than it improves familiarity (Gardiner, 1988). In a similar vein, full attention compared to diverted attention is associated more with recollection than with familiarity (Yonelinas, 2001). However, other factors such as item repetition affect recollection and familiarity to a similar extent (Gardiner, Kaminska, Dixon, & Java, 1996). Processing fluency (i.e., how easily an item is processed; Rajaram, 1993) and rote rehearsal (Dobbins, Kroll, & Yonelinas, 2004) even influence familiarity more than recollection. Consequently, the degree to which scenes across the memorability spectrum produce different kinds of memories remains an open question.

In the present study, we investigated whether intrinsic image memorability is associated with recollection and familiarity to a similar or different extent, using ROC curves (Experiments 1 and 2) and R/K judgments (Experiment 2). Moreover, we investigated how the nature of memorability can be accounted for by cognitive computational models. While neural networks can predict how well people will recognize a scene based on a statistical analysis of image content (e.g., Khosla et al., 2015), it is unclear which kinds of memory representations support these recognition decisions. Importantly, different types of memory representations associated with different memory experiences activate different neural structures in the medial temporal lobe (e.g., Eichenbaum et al., 2007; Kafkas & Montaldi, 2012) and are associated with distinct event-related potentials in the EEG (Tsivilis, Otten, & Rugg, 2001; Rugg & Curran, 2007). Thus, any theory of memorability has to take the phenomenology of remembering into account. To this end, we compared how well recognition ROC curves are fitted by DPSD and UVSD models, and how their model parameters differ between high- and low-memorability images.

Experiment 1

Methods

Participants

Fifty participants (31 female, mean age = 29.06) were recruited from the University of Muenster, Essen University Hospital, Open University Hagen and the University of Duisburg/Essen. All participants provided written informed consent. Participation was compensated with course credit (for students) or was voluntary. Four participants were excluded from analysis due to incomplete data sets. Another participant was excluded due to an unusual shape of the ROC curve, which could not be fit with any model. The study was approved by the ethics committee of the faculty of psychology and sports science, University of Muenster.

Apparatus and materials

Stimulus presentation and response logging were controlled with PsychoPy v1.83.04 experimental software (Peirce, 2007), running on a Toshiba Satellite with a 2.53 GHz Intel Core processor, 8 GB RAM, and a Windows 7 64-bit operating system. Stimuli were presented on a 19-inch CRT monitor with a 1280x768 resolution and a 60-Hz refresh rate.

Our stimulus set comprised 660 images. We extracted 355 pictures from the memorability image database FIGRIM (Bylinskii et al., 2015) and 305 images from the database established by Isola, Xiao, Parikh, Torralba, and Oliva (2014). A total of 241 different semantic categories were depicted in the images (see Table 4 in the Appendix for a distribution of unique semantic categories per condition). Each memorability category comprised an equal number of images, evenly split between indoor and outdoor scenes.

The images from the FIGRIM database were downscaled to a resolution of 250x250 px, the same size as that of the pictures from Isola et al. (2014). Previous research has shown that memorability remains robust against overall decreases in picture size (Goetschalckx et al., 2019b). In order to avoid a confound of memorability and specific image content, this selection included only images without added elements such as text objects, and no close-up shots of human or animal faces. Since faces contribute to an image’s memorability (Isola et al., 2014; Khosla et al., 2015), we thereby excluded a number of images that were found to be highly memorable in previous studies. Images were categorized according to the memorability scores provided by Isola et al. (2014) and Bylinskii et al. (2015), which represent hit rates in online recognition memory experiments obtained from large samples of participants. Memorability scores > 75% were categorized as high memorability (hi-mem), scores < 75% and > 55% were categorized as intermediate memorability (mid-mem), and scores < 55% were categorized as low memorability (low-mem). Each category comprised 220 images with equal numbers of indoor and outdoor scenes. Each image was a target picture for one half of all participants and a foil picture for the other half. Memorability category and indoor/outdoor category were counterbalanced between the two sets of images. Mean scores per memorability category and indoor/outdoor scene gist can be seen in Table 1.
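The categorization rule can be written as a short function. This is only a sketch: scores are assumed to be proportions between 0 and 1, the function name is hypothetical, and because the text does not state how scores of exactly 75% or 55% were handled, the treatment of these boundary values below is an assumption.

```python
def memorability_category(score):
    """Map a memorability score (hit rate as a proportion) to a category.

    Assumption: scores exactly at a threshold fall into the lower category,
    since the source text leaves these boundary cases unspecified.
    """
    if score > 0.75:
        return "hi-mem"
    if score > 0.55:
        return "mid-mem"
    return "low-mem"


# Hypothetical scores for three images.
labels = [memorability_category(s) for s in (0.82, 0.63, 0.41)]
```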

Table 1 Mean memorability scores and mean hit-rates and false-alarm rates per memorability category and indoor/outdoor scene gist in Experiment 1

Procedure

Image memory was tested in a recognition task with separate encoding and test blocks, separated by a 10-min break (Fig. 3).

In the encoding block, participants were instructed to memorize all images (330 in total, 110 per memorability category) while simultaneously categorizing each image as indoor or outdoor as fast as possible by pressing one of two response buttons. Trials started with a fixation cross (200 to 400-ms duration), followed by a scene image (500-ms duration), followed by a response prompt (indoor vs. outdoor). To keep participants engaged with the task, accuracy feedback was provided after each response by briefly turning the fixation cross red (error) or green (correct).

In the test block, participants were instructed to categorize each image as old or new, and to rate their confidence in their decision on a three-point scale, with no emphasis on response speed. All images from the encoding block were presented intermixed with 330 new foil images. Trials started with a fixation cross, followed by a scene image (1000-ms duration), followed by response prompts for the old/new and confidence reports. After the two reports were given, feedback about the old/new decision was provided. ^{Footnote 1} Note that this paradigm with separate phases for encoding and test diverges from most previous studies of image memorability, which used a continuous recognition task (e.g., Isola et al., 2014; Bylinskii et al., 2015) where encoding and testing happen simultaneously.

Analysis

Performance was quantified separately for each individual image by calculating hit rates, false-alarm rates, and d’ (Green and Swets, 1966). These performance indices were obtained by collapsing data across all participants. Hit rates and false-alarm rates were adjusted to avoid extreme values of 1 and 0, respectively, by adding 0.5 to both the number of hits and the number of false alarms, and adding 1 to both the number of old and new items, before calculating the hit and false-alarm rates (Snodgrass & Corwin, 1988; Hautus, 1995).
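The adjustment described above can be sketched as follows. The function names are hypothetical; the arithmetic follows the description in the text (add 0.5 to the count and 1 to the number of items), and d’ is computed in the standard way as the difference between the z-transformed hit rate and false-alarm rate.

```python
from statistics import NormalDist


def corrected_rate(count, n_items):
    """Adjusted rate: add 0.5 to the count and 1 to the number of items."""
    return (count + 0.5) / (n_items + 1)


def d_prime(hits, n_old, false_alarms, n_new):
    """Sensitivity d' = z(hit rate) - z(false-alarm rate), using adjusted rates."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(corrected_rate(hits, n_old)) - z(corrected_rate(false_alarms, n_new))


# Hypothetical image: recognized by all 10 participants who studied it,
# and never falsely recognized by the 10 participants who saw it as a foil.
# Without the adjustment, both z-scores would be infinite.
example = d_prime(hits=10, n_old=10, false_alarms=0, n_new=10)
```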

Moreover, hit rates, false-alarm rates, and d’ were quantified separately for each participant and the three memorability categories by collapsing data across all images within a category. In addition, we analyzed each participant’s ROC curve by computing the area under each curve (AUC) using the trapezoidal rule for numerical integration (Wickens, 2002), which does not require a theoretical model of the ROCs. Performance measures were compared between memorability categories using paired, two-tailed t tests. Effect sizes of these analyses are reported as Cohen’s d (Cohen, 1988), computed according to Lakens (2013).
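As an illustration of the model-free AUC measure, the sketch below applies the trapezoidal rule to a set of empirical ROC points. The function name is hypothetical, and anchoring the curve at (0, 0) and (1, 1) is an assumption of this sketch rather than a detail given in the text.

```python
def auc_trapezoid(points):
    """Area under an empirical ROC via the trapezoidal rule.

    `points` is a list of (false-alarm rate, hit rate) pairs. The curve is
    anchored at (0, 0) and (1, 1), which is an assumption of this sketch.
    """
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


# A single ROC point on the chance diagonal gives an area of 0.5.
chance = auc_trapezoid([(0.5, 0.5)])
```

Because the area is computed directly from the observed points, no assumption about the underlying memory-strength distributions is needed.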

Finally, ROC curves were fitted with a DPSD model (Yonelinas, 1994) and a UVSD model (Mickes, Wixted, & Wais, 2007) using ROC Toolbox for MATLAB by Koen, Barrett, Harlow, and Yonelinas (2017). The UVSD model assumes that the distributions of memory strength of old items and new items overlap to a certain extent. The model parameter d’, or sensitivity, is an index of this overlap with larger values indicating less overlap, and thus better recognition performance. The second parameter (Vo) is an index of the variability of the old item distribution, with the assumption that memory strength of old items may be more variable than the strength of new items. In the DPSD model, the recollection parameter (Ro) represents the probability that participants recollect at least some aspect of the study event, whereas familiarity is represented by d’, with larger sensitivity indicating greater familiarity.
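The following sketch shows how the two models map their parameters onto a single predicted ROC point for a given response criterion. It is not the ROC Toolbox implementation: the parameterization (new-item distribution with mean 0 and standard deviation 1, no recollection of new items, old-item variability expressed as a standard deviation `sigma_old`) and all function names are assumptions chosen for illustration.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def uvsd_point(d_prime, sigma_old, criterion):
    """Predicted (false-alarm rate, hit rate) under an unequal-variance model.

    New items: strength ~ N(0, 1). Old items: strength ~ N(d_prime, sigma_old).
    An item is called "old" if its strength exceeds the criterion.
    """
    fa = PHI(-criterion)
    hit = PHI((d_prime - criterion) / sigma_old)
    return fa, hit


def dpsd_point(recollection, d_prime, criterion):
    """Predicted (false-alarm rate, hit rate) under a dual-process model.

    Old items are recollected with probability `recollection`; otherwise
    the decision relies on equal-variance familiarity with sensitivity d_prime.
    """
    fa = PHI(-criterion)
    hit = recollection + (1.0 - recollection) * PHI(d_prime - criterion)
    return fa, hit


# With a very conservative criterion, familiarity contributes almost nothing,
# so the hit rate approaches the recollection probability (the ROC intercept).
fa, hit = dpsd_point(recollection=0.3, d_prime=1.0, criterion=10.0)
```

Sweeping the criterion from conservative to liberal traces out each model's full ROC. With `recollection = 0` and `sigma_old = 1`, both functions reduce to the same equal-variance signal detection model.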

We first considered whether the models generally provide a statistically acceptable account of the individual participant data based on the G-test of goodness-of-fit (Koen, Aly, Wang, & Yonelinas, 2013). The test estimates the discrepancy between the values expected under the model and the values actually observed. If the test yields a p value smaller than the 5% significance level, it is concluded that the given model deviates significantly from the data and is thus rejected (McDonald, 2009). We then compared performance between models on the basis of the Bayesian Information Criterion (BIC). The BIC is designed to approximate the posterior probability of a model given the data. The smaller the BIC for one model versus the other, the larger that model’s posterior probability given the data (Schwarz, 1978; Lewandowsky & Farrell, 2010). Both indices were applied to the aggregate as well as individual participant data. The model with lower BICs in 80% of participants was declared the winning model, on the condition that it provided a statistically acceptable account of the data in more than 80% of participants, based on the G-statistic. Given that the parameters of the UVSD model allow for greater flexibility, the UVSD model has an a priori advantage at fitting a wider range of ROC data (Klauer & Kellen, 2011). Therefore, we complemented the comparison of fit statistics by testing which parameters of which model were most strongly associated with memorability. Importantly, a model with a superior model fit due to overfitting could potentially turn out to show only a weak association with memorability.
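The two fit indices can be sketched with their standard textbook definitions: G = 2 Σ O ln(O/E) over response bins, and BIC = −2 ln L + k ln n. The function names and numbers below are hypothetical, and the code is not taken from the ROC Toolbox; converting G to a p value would additionally require a chi-square distribution function, which is omitted here.

```python
import math


def g_statistic(observed, expected):
    """G = 2 * sum(O * ln(O / E)); bins with zero observations contribute 0."""
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: -2 ln L + k ln n (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def share_preferring_first(bics_a, bics_b):
    """Proportion of participants for whom model A has the lower BIC."""
    wins = sum(a < b for a, b in zip(bics_a, bics_b))
    return wins / len(bics_a)


# Hypothetical BICs for five participants under two models.
share = share_preferring_first([1.0, 2.0, 3.0, 4.0, 5.0],
                               [2.0, 3.0, 4.0, 5.0, 4.0])
```

Note that for equal log-likelihoods the BIC favors the model with fewer free parameters, which is how it penalizes model flexibility.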

Results experiment 1

Replication of memorability

Across images, memorability scores obtained in previous studies (Bylinskii et al., 2015) were positively correlated with the hit rates (r = 0.34, p < 0.001, d = 0.73) and negatively correlated with false-alarm rates (r = −0.17, p < 0.001, d = −0.34) obtained for the same images in the present study. This resulted in a strong correlation between recognition sensitivity d’ and memorability scores (Spearman’s ρ = 0.41, p < 0.001, d = 0.91). In spite of this consistency with previous studies, hit rates in the present study were overall consistently lower than hit rates/memorability scores obtained for the same images by Bylinskii et al. (2015; t(659.00) = 12.77, p < 0.001, d = 0.50).

Across subjects, recognition performance was better for images in the high-mem category than for the mid-mem category, as indicated by higher hit rates (t(44.00) = 7.61, p < 0.001, d = 1.13), lower false-alarm rates (t(44.00) = − 4.81, p < 0.001, d = − 0.72) (see Table 1), and higher d’ (t(44.00) = 10.28, p < 0.001, d = 1.53). Likewise, hit rates (t(44.00) = 4.34, p < 0.001, d = 0.65) and d’ (t(44.00) = 3.07, p = 0.004, d = 0.46) were higher for images in the mid-mem category than for the low-mem category, but false-alarm rates did not differ between these categories (t(44.00) = 0.35, p = 0.725, d = 0.05). Moreover, area under the ROC curves (AUC) was strongly positively associated with memorability (Spearman’s ρ = 0.41, p < 0.001) across images (see Fig. 4). Across subjects, AUC was larger for the high-mem category than for the mid-mem category (t(44.00) = 10.50, p < 0.001, d = 1.57). Likewise, AUC was larger for images in the mid-mem category than for the low-mem category (t(44.00) = 3.44, p = 0.001, d = 0.51).

ROC and model results

ROCs had a curvilinear shape whereas zROCs were linear, which are shapes better predicted by the UVSD model (Fig. 5). Accordingly, the G statistic confirmed that single subject data were successfully fitted by the UVSD model for 85% of participants, while the DPSD successfully fitted the data of only 70% of participants. The aggregate and individual participant data were better fitted by the UVSD model than by the DPSD model, indicated by lower BICs for the UVSD model across all participants. The sensitivity parameter d’ of the UVSD model was significantly larger for the high-mem compared to the low-mem category (t(44.00) = 10.50, p < 0.001, d = 1.57). In contrast, the parameter modeling the variance of the old item distribution Vo was not significantly different between the two categories (t(44.00) = 1.76, p = 0.089, d = 0.26). Both the recollection (t(44.00) = 4.94, p < 0.001, d = 0.74) and the familiarity parameter (t(44.00) = 7.61, p < 0.001, d = 1.13) of the DPSD model were larger for high-mem compared to low-mem images.

Discussion experiment 1

Overall, the results replicate previous studies showing that intrinsic image memorability is a robust feature of an image, which affects people’s memory performance independently of personal factors. The ROC analysis confirmed and extended previous studies of memorability, which had focused on hit rates, by showing that memorable images also yield larger AUC.

The ROC curves were better fitted by the UVSD model, which assumes that recognition is based on a single, continuous memory strength dimension. The superiority of the UVSD appears plausible given the symmetrical, curvilinear shapes of the ROCs. Greater memorability was associated with larger sensitivity (d’), but not with greater variability of the old item distribution (Vo). While this model does not deny that some conditions, e.g., recognition of highly memorable images, tend to coincide with recollection of specific details associated with the studied item, it treats recollection simply as reflecting higher memory strength. Hence, in this experiment recognition was not based on a specific recollection process independent of memory strength, as predicted by the DPSD model. This finding could imply that recognition of scene images is generally based only on memory strength and that the superior recognition performance for highly memorable images is not associated with a separate recollection process.

However, the specific shape of the ROC curves in Experiment 1 might also be due to the overall low performance in the recognition task. Indeed, hit rates were consistently lower in our study than hit rates obtained for the same images in previous studies. Performance in our study might have been affected by the specific memory task: most previous memorability studies (Isola et al., 2014; Bylinskii et al., 2015) used continuous recognition tasks where the delay between encoding and test is shorter and the number of intervening items is significantly smaller compared to a design with separate encoding and testing blocks. In addition to this inevitable difference, other more amendable factors might have been responsible for the poor performance as well. First, presentation durations (500 ms) were shorter than in previous studies (1000 ms and 2000 ms in Isola et al., 2011 and Bylinskii et al., 2015). Second, participants had to perform an additional indoor/outdoor discrimination task. Together, these factors may have contributed to shallow rather than deep encoding of image aspects, thus obstructing the potential for recollection. Furthermore, the recognition task in the test phase, which required only a simple old/new decision instead of a report of the recollective experience, may have encouraged participants to base their recognition decisions and confidence judgments more on memory strength than on recollection.

In order to substantiate the association of recollection and intrinsic image memorability (or the lack thereof) we conducted a second experiment, in which encoding was facilitated and recognition required an additional judgment of recollective experience.

Experiment 2

Experiment 2 was similar to Experiment 1 with a few modifications. Most importantly, participants were to report their recollective experience with R/K judgments (Tulving, 1985). The R/K judgments were introduced to acquire an additional index of recollection independent of model parameters derived from ROC curves.

A hallmark finding regarding R/K judgments was obtained in a study in which words were learned under deep versus shallow encoding or full versus diverted attention conditions (Yonelinas, 2001). Results showed a perfect crossover: the proportion of deeply encoded and fully attended words was greater among remember statements, whereas words presented in the shallow and diverted attention conditions were more associated with know statements. Moreover, Tsivilis et al. (2001) studied R/K statements with picture stimuli and found that the proportion of R statements increased if to-be-remembered objects were presented in their original scene contexts, whereas the proportion of K statements was unaffected by object context.

However, using R/K judgments as an accurate index of recollection or familiarity is anything but trivial due to procedural (Migo, Mayes, & Montaldi, 2012) and statistical (Yonelinas, 2001; Haaf et al., 2020) challenges. First, if not instructed carefully, participants might confuse the “remember” category simply with high confidence, neglecting that a feeling of familiarity can occasionally go along with high confidence, too. Therefore, we followed recommendations for R/K procedures put forward by Migo et al. (2012; see Methods/Procedure). Second, the statistical analyses must account for the fact that the proportions of R and K statements are interdependent. Specifically, the probability of a know response is mathematically constrained by the proportion of remember responses and vice versa, making inferences assuming their independence (as in Gardiner & Java, 1990) statistically inappropriate (see Yonelinas, 2001). Therefore, we applied an analysis framework proposed by Haaf et al. (2020) (see Methods/Analysis).
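The interdependence of R and K proportions can be illustrated with a small sketch. This is not the framework of Haaf et al. (2020) applied in the present study, whose details are described in the Methods; it only shows the arithmetic constraint, together with the commonly used independence correction, in which familiarity is estimated as K / (1 − R). Function names and values are hypothetical.

```python
def max_know_rate(remember_rate):
    """Upper bound on the proportion of know responses.

    Each old item receives at most one response, so R + K cannot exceed 1.
    """
    return 1.0 - remember_rate


def independence_corrected_familiarity(remember_rate, know_rate):
    """Familiarity estimate under the independence assumption: K / (1 - R).

    Shown for illustration only; it is not the analysis used in the study.
    """
    if remember_rate >= 1.0:
        raise ValueError("familiarity is undefined when every item is remembered")
    return know_rate / (1.0 - remember_rate)


# Hypothetical condition: 60% remember and 20% know responses.
# Know responses could reach at most 40%, so the raw 20% understates
# how often familiarity was available when recollection failed.
bound = max_know_rate(0.6)
familiarity = independence_corrected_familiarity(0.6, 0.2)
```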

Moreover, we extended the conventional remember/know framework by additionally asking for analogous judgments for new items, thereby exploring the mnemonic experience associated with the rejection of new information. Thus, whenever participants decided that an item was new, we asked whether they considered specific image details (D judgment) to be relevant for their decision or whether the item simply felt unfamiliar (U judgment). The D/U judgments for new items are thus equivalent to R/K judgments for old items and were analyzed with the same framework.