The visual and semantic features that predict object memory: Concept property norms for 1,000 object images

Hovhannisyan, Mariam; Clarke, Alex; Geib, Benjamin R.; Cicchinelli, Rosalie; Monge, Zachary; Worth, Tory; Szymanski, Amanda; Cabeza, Roberto; Davis, Simon W.

doi:10.3758/s13421-020-01130-5

Open access
Published: 19 January 2021

Volume 49, pages 712–731, (2021)

Memory & Cognition

Mariam Hovhannisyan¹˒²,
Alex Clarke³,
Benjamin R. Geib¹,
Rosalie Cicchinelli¹,
Zachary Monge¹,
Tory Worth¹˒⁴,
Amanda Szymanski²,
Roberto Cabeza¹˒⁴ &
Simon W. Davis¹˒²

A Correction to this article was published on 09 February 2021

This article has been updated

Abstract

Humans have a remarkable fidelity for visual long-term memory, and yet the composition of these memories is a longstanding debate in cognitive psychology. While much of the work on long-term memory has focused on processes associated with successful encoding and retrieval, more recent work on visual object recognition has developed a focus on the memorability of specific visual stimuli. Such work is engendering a view of object representation as a hierarchical movement from low-level visual representations to higher level categorical organization of conceptual representations. However, studies on object recognition often fail to account for how these high- and low-level features interact to promote distinct forms of memory. Here, we use both visual and semantic factors to investigate their relative contributions to two different forms of memory of everyday objects. We first collected normative visual and semantic feature information on 1,000 object images. We then conducted a memory study where we presented these same images during encoding (picture target) on Day 1, and then either a Lexical (lexical cue) or Visual (picture cue) memory test on Day 2. Our findings indicate that: (1) higher level visual factors (via DNNs) and semantic factors (via feature-based statistics) make independent contributions to object memory, (2) semantic information contributes to both true and false memory performance, and (3) factors that predict object memory depend on the type of memory being tested. These findings help to provide a more complete picture of what factors influence object memorability. These data are available online upon publication as a public resource.

Introduction

One of the most important issues in memory research is why we remember some things but forget others. To address this issue, it is critical to answer not only which processes lead to successful encoding (e.g., depth of encoding effects; see Craik & Tulving, 1975) and/or retrieval (e.g., transfer-appropriate processing; see Morris et al., 1977), but also which contents of these events are more memorable than others. Although these two questions are closely related, their focus is different: the former concentrates on the actions of the person remembering, and the latter on properties of the stimuli. While the processes question has been a continuous focus in memory research since its inception, the stimulus question has received much less attention until recently, when interest in which stimulus properties are easier to remember has grown rapidly. Studies that focused on the concept of intrinsic memorability have typically used a combination of visual factors to predict subsequent memory of scenes (Bainbridge et al., 2017; Isola et al., 2014), objects (Jaegle et al., 2019), and unfamiliar faces (Bainbridge et al., 2013). Very few such memorability studies have examined verbal stimuli, and the semantic factors examined in such studies are usually limited to category membership (Bainbridge & Rissman, 2018) or labels generated by automated computer vision algorithms (Borkin et al., 2016; Isola et al., 2014). More importantly, no study, to our knowledge, has simultaneously examined both visual and semantic factors, which is essential to understand memory for everyday scenes and objects. This was our overarching aim.

The current study relates two different but interconnected literatures. First, we relate findings in the neuroscience (Grill-Spector & Malach, 2004) and behavioral (Pylyshyn, 1999; Rosch et al., 1976) visual perception literatures that are relevant to explaining the memorability of visual stimuli. For example, memorability studies have shown that, despite a high human capacity to remember visual details (Brady et al., 2008), simple image measures, such as pixel statistics, image complexity, or the number of objects in an image, do not predict how well individual objects are remembered (Isola et al., 2014). One possible explanation for null findings is that simple image measures do not match the way the visual cortex processes visual information. To investigate this idea, the current study examined how memory for objects is predicted by measures of visual processing provided by a deep neural network (DNN). DNNs model visual processing in primate visual cortex using convolutional layers (Kriegeskorte, 2015). Although DNNs were originally designed for image classification in computer vision, they have been shown to be excellent neuroscience models for visual processing (Rajalingham et al., 2018; Yamins et al., 2014), often surpassing traditional theoretical models (e.g., HMAX, object-based models; Cadieu et al., 2014; Groen et al., 2018). In the current study, a DNN yielded measures of visual object processing that were used to predict subsequent object memory.

Second, the current study relates to findings in the semantic cognition literature that are relevant to explaining object memorability. Memory studies examining the influence of semantic factors rarely incorporate visual features as predictors of memory strength, and most have centered on verbal stimuli and simple lexical factors like word frequency or concreteness (e.g., words that reflect more concrete concepts tend to be remembered better; see Fliessbach et al., 2006). Research on semantic factors in memory for objects is very scarce, including a few studies with neuropsychological patients (Kraut et al., 2002; Patterson, 2007), some studies on how labeling enhances or distorts memory for objects (Koutstaal et al., 2003; Richler et al., 2011; Richler et al., 2013), and a limited number of studies on basic conceptual properties, such as nameability (Richler et al., 2013) and typicality (Qin et al., 2014). However, there is virtually no evidence on how complex conceptual statistics determine object memorability. This question can now be examined by taking advantage of the conceptual structure account (CSA), which provides a comprehensive framework to quantify and formalize the relationship between semantic and visual features of objects in terms of their distinctiveness and interrelatedness (Devereux et al., 2018; Moss et al., 2005; Taylor et al., 2012). In the current study, the CSA provides measures of semantic properties of objects and is used to predict subsequent memory for object images.

In sum, the current object memory study investigated how well visual and semantic properties predict the visual and lexical memory of object concepts. Before this experimental study, it was necessary to conduct a normative study of the visual and semantic features of a large set of everyday object images. Published norms of semantic properties for object concepts are available only for words (Devereux et al., 2014; Ken McRae et al., 2005); currently there are no normative concept feature data on object images. In the current semantic norming study, each participant provided semantic features for a small set of different object images. Creating these norms was a prerequisite step for the study, but it was also a goal in itself, because norms of the visual and semantic features of a large set of objects are critical for functional magnetic resonance imaging (fMRI) and memorability studies with objects. Our norms are freely available online (http://mariamh.shinyapps.io/dinolabobjects), along with regularized copies of the images.

In the memory study, which consisted of two experiments, visual and semantic variables were used to predict subsequent memory for objects. Complex visual measures were obtained by analyzing the object pictures using the layer-specific activation information from a popular DNN, AlexNet (Krizhevsky et al., 2012), and feature-based semantic metrics (mean distinctiveness and correlational strength) were obtained by an analysis of concept feature norms. We also examined more basic visual (e.g., basic pixel statistics) and semantic (e.g., word frequency) metrics. In each of the two memory experiments, every item in the corpus was tested in a visual memory test (Day 1: picture target; Day 2: picture cue) and a lexical memory test (Day 1: picture target; Day 2: lexical cue). We use these two different memory tasks to evaluate the conceptual and perceptual properties of the object images. We conducted the visual memory test to investigate the resiliency of memory across exemplars and the lexical memory test to investigate the contribution of semantic information to the memorability of object images. Both tests allow us to examine the contribution of visual and semantic properties in perceptual and conceptual memory. These tests are important for understanding (1) why some properties might contribute more to memory in one test than another, (2) if complex visual and/or semantic information is important regardless of the memory test, and (3) whether semantic or visual information, either simple or complex, contributes to memory when tested in the same domain (e.g., semantic information contributing more to memory in the lexico-semantic task than in the visual task).

Visuo-semantic object-norming study

Methods

Participants

Five hundred and sixty-six Amazon Mechanical Turk workers all with a 95% approval rating or above (347 females, 19–75 years of age, mean age = 34.6 years, all self-reported native speakers of American English) participated in this study. Participants had an average of 14.68 years of education, and the racial demographic was balanced with national averages. Participants could take part in repeat sessions and completed between one and five sessions. Sessions lasted about an hour with 40 concepts presented per session. Participants were paid $3.00 for their participation in the property norming study. Informed consent was obtained from all participants under a protocol approved by the Duke Medical School Institutional Review Board (IRB). All procedures and analyses were performed in accordance with IRB guidelines and regulations for experimental testing.

Materials

The two primary aims of this study were to (1) collect normative feature data on a large set of object concepts that can be expanded and manipulated by a wide range of research domains and (2) assess whether feature statistics can explain memorability of objects. To date, the most extensive and widely used sets of property norms are the McRae norms (Ken McRae et al., 2005) and the Centre for Speech, Language and the Brain (CSLB) norms (Devereux et al., 2014), both of which are considered to be the standard for semantic feature representations of concepts. Both norms provide information on type of features and feature production frequency for a large number of concrete objects. However, both databases are based on responses to verbally presented stimuli, and therefore may not directly inform the memory for visual object stimuli. Nonetheless, throughout this paper, we use both the CSLB norms and the McRae norms as a guide to characterize our dataset.

Stimuli

A total of 995 object concepts were used for the online object norming via Amazon Mechanical Turk (AMT). Image concepts were selected from a wide range of standard object categories (e.g., birds, buildings, mammals, tools, vehicles), as well as object categories present in everyday life, but not well represented in typical object databases (food, holiday items, street items). 237 of the objects were living and 758 were non-living. The relative size of each of the 29 categories in our database is depicted in Fig. 1. In selecting these concepts, we included concepts from CSLB norms that were familiar to English speakers, as well as additional object concepts to help balance the number of items across categories. We aimed to avoid ambiguous concepts and only included concepts in our analyses that had three or more features. We organized our 995 object concepts under a range of standard categories (Fig. 1); as is typical of such naturalistic object datasets, non-living categories outnumber living categories by about 2 to 1, and tools constitute the largest category of items.

Suitable images for each concept were selected from the image search engines Google Images, Bing Images, and Flickr. Images were selected based on the following criteria: (1) minimum size of 300 x 300 pixels; (2) either whitespace background, a background easily removable with image-editing software, or a background not otherwise integrated into the foreground or target object; (3) standard framing/positioning of the object, i.e., we avoided image orientations that obscured the identity of the object; (4) all images were in color, with no obvious chromatic or morphological filter; (5) no visible watermarks; and (6) no text printed on the object concept identifying it as such (e.g., “Fire Station No. 9”). After assembling two image exemplars for each concept, backgrounds were removed with photo-editing software and images were cropped to square dimensions and resized to 300 x 300 pixels.

Image attributes

Image attributes in the current analysis are characterized both as intrinsic properties of object identity and as potential predictors of object memorability. Attribute definitions, characteristics, and distributions within the current dataset are summarized in Fig. 2 (Visual features) and Fig. 4 (Semantic features). Visual measures comprised basic pixel and image statistics, as well as more complex statistics defined by the entropy of individual layers of a popular convolutional deep neural network (i.e., AlexNet). Semantic measures comprised frequency, as defined by the Corpus of Contemporary American English (COCA), name agreement, and the number of constituent features, as well as more complex statistics defined by the relation between features of items including mean distinctiveness (MD), correlational strength (CS), and a correlation x distinctiveness measure (CSxD). We describe these measures, including descriptive statistics of the underlying distribution of these measures, in more detail below.

Visual features

The first question addressed by the current article is whether simple image features are predictive of memory in the visual and lexical memory task. Visual measures are summarized in Fig. 2, including descriptive statistics on the underlying distribution and, when necessary, correction of the distribution to improve normality. We first calculated a number of low-level image features that describe item-wise values for a given image. Many of these properties have been shown to not be predictive of image memorability (Dubey et al., 2015; Isola et al., 2014), but underlying questions remain about the capacity for these basic visual features to predict conceptual memory. Basic pixel statistics such as Hue, Saturation, and color Value (commonly referred to as “HSV”) were calculated on each image in our database, as well as the proportion of non-white space in the normalized image. Image energy, a measure of the localized change of the image, and JPEG size, an indirect measure for image complexity based on image compression (Torralba & Oliva, 2003), were also included in the analysis.
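As a rough illustration of the low-level measures described above, the following Python sketch computes mean HSV values, the proportion of non-white pixels, and a simple gradient-based image energy. It is only a sketch under stated assumptions: the function names, the white threshold of 250, and the definition of energy as the mean absolute difference between neighboring gray values are illustrative choices, not the authors' implementation. JPEG size is omitted because it requires an image encoder.

```python
import colorsys

def mean_hsv(pixels):
    """Average hue, saturation, and value over a list of (r, g, b) pixels in 0-255.

    Note: hue is circular, so a plain arithmetic mean is a simplification.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    n = len(hsv)
    return tuple(sum(channel) / n for channel in zip(*hsv))

def non_white_proportion(pixels, white_threshold=250):
    """Fraction of pixels with at least one channel below the (assumed) white threshold."""
    non_white = sum(1 for p in pixels if min(p) < white_threshold)
    return non_white / len(pixels)

def image_energy(gray_rows):
    """Mean absolute difference between horizontally and vertically adjacent gray values.

    This is one possible proxy for 'localized change'; the original measure may differ.
    """
    total, count = 0.0, 0
    for y, row in enumerate(gray_rows):
        for x, value in enumerate(row):
            if x + 1 < len(row):
                total += abs(row[x + 1] - value)
                count += 1
            if y + 1 < len(gray_rows):
                total += abs(gray_rows[y + 1][x] - value)
                count += 1
    return total / count if count else 0.0
```

In practice the pixel lists would be read from the 300 x 300 images with an imaging library; plain lists are used here so the sketch has no external dependencies.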

Deep convolutional neural network similarity

Next, we assessed complex visual properties of object images by assessing the similarity of visual features derived from a DNN, which carries inherently relational information given that (1) a DNN optimizes based on all images within a training set, and (2) individual layers represent distinct but still dependent information between layers as image vectors change through the progression across layers (DNNs; Krizhevsky et al., 2012; LeCun et al., 2015). DNNs consist of layers of convolutional filters and can be trained to classify images into categories with a high level of accuracy. During training, DNNs “learn” convolutional filters in the service of classification, where filters from early layers predominately detect lower-level visual features and from late layers, higher-level visual features (Zeiler & Fergus, 2014). Therefore, a DNN is an ideal model to investigate multi-level visual feature distinction. Here, we used AlexNet, which was successfully trained to classify 1.2 million high-resolution images into 1,000 different categories (Krizhevsky et al., 2012). AlexNet consists of eight layers including five convolutional and three fully connected layers. We extracted the activation values for three representative layers (Layers 3, 6, and 8 for early, middle, and late DNN layers, respectively) for each image and converted them into one activation value per object. Multidimensional scaling provides a qualitative illustration of the possible image dimensions. For example, the early visual MDS plot (Layer 3, Fig. 3A) organizes concepts largely by shape (thin, vertically oriented objects are distributed on the left side, with horizontally shaped objects towards the left, and circular objects on the right side). The middle visual MDS plot (Layer 6, Fig. 3B), in contrast, suggests more complex frequency and orientation information, with high visual frequency (items with thin parts or changes in color or luminance) towards the top of the image, and items with low visual frequency information (unitary color and luminance across the item) towards the bottom. The late visual MDS plot (Layer 8, Fig. 3C) retains some of this complex configural information, but also begins to group items of similar categories together in loose categorical clusters (e.g., fruit towards the bottom left, animals towards the top right). Lastly, the MDS plot for semantic feature information groups items in a configuration roughly consistent with their categorical organization. In Fig. 3D, living things are mostly organized on the right side, for example, animals are distributed in the bottom right corner and foods in the top right corner, while non-living things are largely organized on the left side with clothing items distributed towards the top and tools and furniture distributed towards the middle bottom side.
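The text states that layer activations were converted into one value per object, and the image-attribute summary refers to the entropy of individual layers. The exact computation is not given here, so the sketch below shows one plausible reading: the Shannon entropy of a histogram over a single layer's activation values. The function name, the equal-width binning, and the default of 10 bins are assumptions for illustration only.

```python
import math

def activation_entropy(activations, n_bins=10):
    """Shannon entropy (in bits) of an equal-width histogram over one layer's activations.

    `activations` is a flat list of floats for a single image and layer
    (hypothetical input; extracting it from a DNN is not shown here).
    """
    lo, hi = min(activations), max(activations)
    if hi == lo:
        return 0.0  # all values identical: no uncertainty
    counts = [0] * n_bins
    for a in activations:
        # Map each value to a bin; the maximum value falls into the last bin.
        idx = min(int((a - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    n = len(activations)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)
```

Applying such a function to the activations of Layers 3, 6, and 8 would yield three scalar predictors per image, one for each level of the visual hierarchy.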

Semantic attributes

Next, we describe the semantic attributes in the current analysis, as summarized in Fig. 4. First, a number of basic concept features were assessed. Concept (or lemma) frequency was assessed in our sample using the word frequency provided in the Corpus of Contemporary American English (Davies, 2008), which contains 425 million entries sampled from a broad range of written sources.

Name agreement, which reflects the agreement for a verbal label to an object photograph, was assessed with a standard picture-name agreement task (Snodgrass & Vanderwart, 1980). For every image, 25 Duke University undergraduates from introductory psychology courses identified each picture as briefly and unambiguously as possible by writing only one name for each image. Participants were instructed to respond “don’t know” if the picture was an object unknown to them, or if they didn’t know the name. Name agreement was then calculated as the proportion of participants identifying the modal name for a given object photograph. The number of features (NoF) for each object was also calculated and included in our analysis. This metric is a general semantic property that ignores the semantic content of those features; only non-taxonomic features were included when calculating the total number of features for an item. A taxonomic feature indicates superordinate category information to which a concept belongs (Ken McRae et al., 2005).
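The name agreement calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming that responses are lowercased before comparison, that "don't know" responses cannot be the modal name, and that the denominator is the full set of participants; none of these details are specified in the text, and the function name is hypothetical.

```python
from collections import Counter

def name_agreement(responses):
    """Return (modal_name, proportion of all participants who gave that name)."""
    cleaned = [r.strip().lower() for r in responses]
    # Assumption: "don't know" is excluded from the candidate names
    # but still counts towards the denominator.
    named = Counter(r for r in cleaned if r != "don't know")
    if not named:
        return None, 0.0
    modal, count = named.most_common(1)[0]
    return modal, count / len(cleaned)
```

For example, if 20 of 25 raters wrote "dog", 3 wrote "puppy", and 2 responded "don't know", the modal name is "dog" with a name agreement of 0.8.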

Semantic property norms

The principal approach of the current analysis is the application of a new and large set of property norms designed to characterize the semantic features associated with a broad range of object concepts. While many of the concepts overlap with the McRae (Ken McRae et al., 2005) and CSLB (Devereux et al., 2014) norms, the aim of the current corpus is to offer semantic norms associated not with verbal descriptors, but instead specific object photographs associated with those object concepts. We make these data available on our GitHub site (https://github.com/ElectricDinoLab), allowing researchers to estimate their own cutoff points for production frequencies associated with each object feature or concept.

Feature-based statistics

The feature statistics used in this study are based on the conceptual structure account (CSA), a neurocognitively motivated theory of conceptual knowledge that captures information about conceptual representations (Taylor et al., 2011; Tyler & Moss, 2001). Feature statistics quantify the smaller elements of a concept, its semantic features, and thereby provide a useful metric to assess behavior. Here, we use feature statistics to (1) assess the relational structure between features of items in our dataset and (2) use this structure to assess the dimensionality of memory scores from the memory task.

In addition to the simpler feature statistics described above, the current analysis sought to test the utility of object features in predicting memorability of items in either the lexical or visual memory test. We used three key measures that are capable of differentiating between similar objects. First, mean distinctiveness describes whether concepts have more distinct versus shared features. For each feature, a distinctiveness value is calculated as 1 divided by the number of concepts in which the feature occurred, with mean distinctiveness being the average value across the features in the concept. Non-living things tend to have more distinctive features than living things, owing generally to the fact that living things have more shared features (e.g., eyes, nose, legs) than do non-living things (e.g., tools and vehicles). Concepts that have more distinct features will have fewer semantic neighbors, thus activating a unique conceptual representation (Clarke & Tyler, 2015). Second, correlational strength describes how features of a concept co-occur; in other words, correlational strength for a concept is greater for objects composed of highly co-occurring features (e.g., has legs and has feet are often found in the same concept) that will mutually coactivate, facilitating feature integration and activation of the concept (Clarke & Tyler, 2015; K. McRae et al., 1997). While both mean distinctiveness and correlational strength are measures common to a number of feature-based accounts (e.g., Cree & McRae, 2003; K. McRae et al., 1997; Mirman & Magnuson, 2009; Rogers et al., 2004; Devereux et al., 2016), our third measure, the interaction of correlational strength and distinctiveness (Correlational Strength x Distinctiveness, or CSxD), is specific to the CSA. CSxD describes the relationship between the correlational strength and distinctiveness of the features for each concept (Taylor et al., 2012). This measure is defined as the unstandardized slope of the regression line that describes the interaction of these two statistics (Taylor et al., 2012).
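The distinctiveness definition above (1 divided by the number of concepts containing a feature, averaged over a concept's features) translates directly into code. The sketch below also includes an ordinary least-squares slope helper of the kind that could be used for CSxD. The function names and toy data are hypothetical, and treating distinctiveness as the predictor and correlational strength as the outcome is an assumption, since the direction of the regression is not stated here.

```python
def feature_distinctiveness(concept_features):
    """Map each feature to 1 / (number of concepts in which it occurs)."""
    counts = {}
    for features in concept_features.values():
        for f in set(features):
            counts[f] = counts.get(f, 0) + 1
    return {f: 1 / n for f, n in counts.items()}

def mean_distinctiveness(concept_features):
    """Average feature distinctiveness per concept."""
    d = feature_distinctiveness(concept_features)
    return {
        concept: sum(d[f] for f in set(features)) / len(set(features))
        for concept, features in concept_features.items()
        if features
    }

def ols_slope(x, y):
    """Unstandardized slope of the least-squares regression of y on x.

    For CSxD, x would hold per-feature distinctiveness and y per-feature
    correlational strength for one concept (assumed direction).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has zero variance; slope is undefined")
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
```

With a toy corpus in which "dog" and "cat" share "has legs" and "has fur" but each has one unique feature, both concepts receive a mean distinctiveness of (0.5 + 0.5 + 1) / 3 = 2/3, whereas a concept whose features all occur nowhere else receives 1.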

Procedure

The current feature-norming dataset comprises 995 objects from 29 different categories and includes 5,520 features, each of which was present at least three times in the data. Taxonomic features (e.g., is a dog or is a mammal, for the example object dalmatian) are not typically considered true semantic features and thus were not used in the analysis of this dataset (Devereux et al., 2014; Ken McRae et al., 2005).

Participants were shown an object (e.g., a porcupine) and were given a space to add five unique features; similar to previous feature-norming paradigms (Devereux et al., 2014; Ken McRae et al., 2005), participants were limited to five features, and were prevented from proceeding through the task if all features were not completed. The participants were asked to select a relation word from a drop-down menu, with presets for <is>, <has>, <does>, <is made of>, and “…” (participants were instructed to use the blank space as they wished to specify some other relationship). The default pull-down option for the five feature-response cues was set to one of each of the above five options, as presented in Fig. 5. Participants could use any of the pull-down verb options; they were not required to use all of them but were required to provide five features for each concept. Concepts were randomized so that two concepts from the same category did not appear consecutively and so that each concept was presented to at least 20 participants. Each participant was presented with a series of 40 objects, presented pseudorandomly across participants, with an even distribution of objects per category. Participants were allowed to complete between one and five Human Intelligence Tasks (HITs) of the feature-norming task. AMT workers were prevented from performing the same HIT twice, and each set of 40 items comprised a unique subset of objects; thus, no objects were repeated for participants who completed more than one feature-norming HIT. In order to receive full credit for their participation, workers needed to complete all five spaces. Data from 566 total participants (with an average of 30.5 participants contributing to each concept) were eventually used to create a feature x concept production frequency matrix.
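The constraint that two concepts from the same category never appear consecutively can be met in several ways; the text does not say which was used. One simple option, sketched below, is rejection sampling: reshuffle until no adjacent pair shares a category. The function name, the `category_of` mapping, and the retry limit are all hypothetical.

```python
import random

def shuffle_no_adjacent_category(items, category_of, rng=None, max_tries=10_000):
    """Return a random ordering of `items` with no two neighbors from the same category.

    `category_of` maps each item to its category label. Raises RuntimeError if
    no valid ordering is found within `max_tries` shuffles.
    """
    rng = rng or random.Random()
    order = list(items)
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(category_of[a] != category_of[b] for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid ordering found; categories may be too unbalanced")
```

Rejection sampling works well when items are spread over many categories, as with 40 objects drawn from 29 categories, but it can become slow if one category dominates a session.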

Before the construction of the production frequency matrix, feature responses underwent various stages of processing, following the procedures used by McRae et al. (2005) and Devereux et al. (2014). These steps, done by hand, included: (1) removal of adverbs, such as really and very; (2) feature-splitting, for example a feature such as “has a round face” was rewritten as “has a round face” and “has a face”; (3) synonym mapping, which involves identifying synonyms both within and across each concept, for example “does travel in groups,” “does travel in packs,” and “does travel in a flock” were collapsed to “does travel in groups”; (4) correction of spelling mistakes and of incorrect relation words when the intended meaning was clear (e.g., “has a luxury item” was changed to “is a luxury item”); (5) morphological mapping, for example “is used in cooking” and “is used by cooks” were collapsed together as “is used in cooking”; (6) removal of plural forms; and (7) removal of features not present in at least two concepts. Relation words were otherwise not changed. At all stages of processing, results were checked manually and corrected if necessary, both to avoid excessively modifying the features and to maintain inter-rater reliability. After preprocessing, a feature x concept production frequency matrix was created to describe the normalized frequency of a given feature for a given concept. The resulting preprocessed features were then collected into various feature-label groups, and summary statistics were calculated on all features. An example subset of features for the object bee, with their relation words and production frequencies, is given in Table 1, where production frequency is the number of times a feature was mentioned across all raters. We used a production frequency cut-off of three, such that only features that occurred at least three times were used in the analysis of the dataset.
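A feature x concept production frequency table with a cut-off can be sketched as follows. This is an illustration rather than the authors' pipeline: the function name and input format are hypothetical, and applying the cut-off of three to each concept-feature pair (as opposed to a feature's total count across the whole dataset) is an assumption.

```python
from collections import Counter, defaultdict

def production_frequencies(responses, cutoff=3):
    """Count how many participants listed each feature for each concept.

    `responses` is an iterable of (participant_id, concept, feature) tuples that
    have already been preprocessed (spelling, synonyms, and so on). Pairs listed
    by fewer than `cutoff` participants are dropped.
    """
    # Count each participant at most once per concept-feature pair.
    unique = set(responses)
    counts = defaultdict(Counter)
    for _participant, concept, feature in unique:
        counts[concept][feature] += 1
    return {
        concept: {f: n for f, n in feature_counts.items() if n >= cutoff}
        for concept, feature_counts in counts.items()
    }
```

The nested dictionary is a sparse form of the matrix; dividing each count by the number of participants who saw the concept would give a normalized frequency.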

Table 1 Example concept properties
