Introduction

Visual perception allows humans to understand their immediate environment, navigate the world safely and build conceptual knowledge. Arguably the most complex and important feat of visual perception is the recognition of objects and scenes. Whether attempting to find something in a crowded closet or trying to orient oneself in an unfamiliar forest, the ability to perceive and interpret a wide variety of real-world objects and scenes is crucial for completing daily tasks and for survival. In keeping with the central role of object and scene perception in human existence, there is an abundance of psychological research that uses object and scene stimuli to investigate the perceptual and cognitive processes underlying these abilities (Ashby, Prinzmetal, Ivry, & Maddox, 1996; Duncan & Humphreys, 1989; Jiang, Lee, Asaad, & Remington, 2015; Tresch, Sinnamon, & Seamon, 1993).

For lower-level visual perception, studies frequently involve precise and quantitative manipulation of the degree of similarity between different visual stimuli. Much has been learned about how individual neurons or even whole cortical regions represent simple stimulus properties, by presenting stimuli that vary systematically in edge orientation (e.g., Hubel & Wiesel, 1962, 1968), color (e.g., Brouwer & Heeger, 2009; Dow & Gouras, 1973; Solomon & Lennie, 2007), or motion direction (e.g., Geisler, Albrecht, Crane, & Stern, 2001; Rodman & Albright, 1989). Phenomena such as attention and perceptual learning have also been fruitfully investigated by exploiting well-defined stimulus-similarity continua (e.g., Treisman, 1991; Yang & Maunsell, 2004). Such research has been possible because low-level visual image properties like edge orientation, color and motion are easy to define and quantify.

Analogously to lower-level vision research, our understanding of high-level vision can be greatly advanced by using stimuli that vary systematically along quantifiable dimensions relating to high-level properties of objects and scenes. However, this is difficult to do in practice, because it is much harder to define and measure these high-level properties. We cannot simply decompose objects and scenes into their low-level image features and assume that these features capture the critical essence of the representations driving object and scene recognition, much less that they reliably capture the conceptual content of the stimuli (Martin, Douglas, Newsome, Man, & Barense, 2018; McClelland & Rogers, 2003; but see Greene & Oliva, 2009; Joubert, Rousselet, Fize, & Fabre-Thorpe, 2007; Oliva & Torralba, 2001 for evidence that scene recognition may in part depend on summary statistics). Moreover, when a research question requires knowledge of the perceptual similarity relationships between stimuli, mathematical measures of stimulus properties have been found not to predict human perceptions of similarity in a straightforward way, for both low-level and high-level stimulus properties (Busey, 1998; Cheung, 2016; Hopper, Finklea, Winkielman, & Huber, 2014; Larkey & Markman, 2005; Li, Liang, Lee, & Barense, 2020; Robertson, 1977; Schurgin, Wixted, & Brady, 2020).

A solution often adopted by researchers wishing to understand the perceptual similarity evoked by low-level stimulus properties is to gather empirical ratings of subjective similarity. In such studies, participants typically use a numerical scale to report how similar they perceive two items to be, repeatedly for many pairs of items (e.g., Busey, 1998; Caramazza, Hersh, & Torgerson, 1976; Hopper et al., 2014; Li et al., 2020). Some studies employ alternative methods such as requiring participants to “drag and drop” stimuli into screen locations that reflect perceived similarity relationships (Kriegeskorte & Mur, 2012). The resulting data typically allow researchers to use some variant of multidimensional scaling (MDS) to construct a perceptual similarity map of the evaluated stimulus set, in which items are laid out in a two-dimensional (2D) array with locations that best reflect the perceived “distances” between them.

Many existing stimulus databases provide color images of objects and scenes, for example, the Massive Memory Picture Database (Brady, Konkle, Alvarez, & Oliva, 2008; Konkle, Brady, Alvarez, & Oliva, 2010), the Amsterdam Library of Object Images (Geusebroek, Burghouts, & Smeulders, 2005), the revised Snodgrass and Vanderwart object pictorial set (Rossion & Pourtois, 2004; Snodgrass & Vanderwart, 1980), the Bank of Standardized Stimuli (Brodeur, Dionne-Dostie, Montreuil, & Lepage, 2010), the SUN database of scene images (Xiao, Ehinger, Hays, Torralba, & Oliva, 2014; Xiao, Hays, Ehinger, Oliva, & Torralba, 2010), and the Nature Scene Collection (Geisler & Perry, 2011). However, we are aware of no publicly available scene image database that is associated with published empirical similarity ratings, and only a few such object image sets (Hout, Goldinger, & Brady, 2014; Migo, Montaldi, & Mayes, 2013). Importantly, the object similarity studies we found report perceptual similarity for object images within the same basic-level category (e.g., the similarity relationships between a set of butterfly images) but not between objects with different basic-level identities (e.g., perceived similarity between a tiger and a lion, or a tiger and a whale). Yet, the perceived similarity between images from different basic-level categories of objects or scenes may be an important research design consideration. As one example, memory researchers may wish to systematically investigate the effects of stimulus-relatedness on memory for visual images. More generally, many research questions within visual cognition may require experimental designs that group items into sets that minimize visual or semantic interference either within or between sets; for this, stimulus-relatedness must be known.

The goal of the present study was to provide object and scene image databases that are accompanied by naming data, empirical similarity ratings, both visual and semantic, and similarity maps constructed from those ratings. We collated four stimulus sets: inanimate objects, animals, indoor scenes and outdoor scenes, each containing 60 images (240 images in total). All stimulus images were color photographs. We collected ratings of visual and semantic similarity within each image category and performed MDS to analyze the similarity ratings data. MDS is an exploratory data analysis technique that allows spatial visualization of the similarity relationships between the items in a set (Mugavin, 2008); it has been used extensively in studies of perception (Jaworska & Chupetlovska-Anastasova, 2009). The similarity ratings are used to construct a matrix from which a spatial map of the relationships between all items is derived, usually in two or three dimensions. A short distance between two items in the map means that the two items are similar, whereas a long distance means that they are dissimilar. We used MDS to visualize the relationships between the images in each stimulus set, for both visual and semantic perceived similarity. We make the stimuli (along with source attributions where available), naming data, ratings data and similarity maps available at https://osf.io/smk25/, so that researchers may use these stimuli in any experiments for which stimulus-relatedness information is useful.

Methods

Participants

A total of 765 participants were recruited via Amazon’s Mechanical Turk (MTurk). Previous research has shown that MTurk workers are more representative of adults in the United States than in-person recruited participants (Berinsky, Huber, & Lenz, 2012; Buhrmester, Kwang, & Gosling, 2011), and in visual cognition tasks, the data collected from MTurk have been shown to closely match data collected in a lab environment (Brady & Alvarez, 2011; Brady, Shafer-Skelton, & Alvarez, 2017; Brady & Tenenbaum, 2013). All participants were located in the United States and were over 18 years old. For each of the four image categories (i.e., animals, inanimate objects, indoor scenes, outdoor scenes), we created all possible image pairs (1770 pairs) and randomized those pairs into seven groups (see Procedures). Each group of image pairs was bundled into a single Human Intelligence Task (HIT) on MTurk. For each HIT, ratings were collected from at least 10 participants (see Table 1). A participant was allowed to complete ratings for more than one HIT, with the constraint that no two HITs on the same stimulus set could be performed by the same participant. All participants provided informed consent before starting the experiment and were compensated $2 for their participation. A total of 199 participants were excluded from the data analysis because they failed to pass the attention-check trials that were embedded as part of the experimental design (see Procedures) or because they entered the same response for all questions (a response that “passed” the attention checks, but was inappropriate for most other questions). Previous work indicates that an exclusion rate of 26% is not unusual for MTurk studies employing similar screening methods (Thomas & Clifford, 2017).

Table 1 Summary of similarity rating participant sample size

To collect naming data, an additional 14 participants were recruited from the University of Massachusetts-Amherst community and the community of authors D.M.W.S. and R.A.C. and were compensated $15 per hour of participation.

Stimulus materials

The stimulus image set contained 240 unique images in four categories: 60 inanimate objects, 60 animals, 60 indoor scenes and 60 outdoor scenes. Inanimate objects (e.g., printer) and animals (e.g., zebra) were color photographs taken mostly from the online databases of Konkle and Oliva (https://konklab.fas.harvard.edu/#; Konkle & Caramazza, 2013; Konkle & Oliva, 2012). We replaced any images that were deemed too low-resolution with higher-resolution photographs of the same object or animal taken from an internet search. All objects and animals were placed on white backgrounds and resized to a standard size, slightly smaller than the size of the background (600×600 pixels), to eliminate any large differences in image size. We aimed for relatively heterogeneous sets of animals and inanimate objects (e.g., including both large and small items, with a range of animal taxonomic orders and a range of object functions), to span a range of stimuli that are visually and semantically distinguishable for the purposes of memory studies.

Scene images, both indoor (e.g., living room) and outdoor (e.g., volcano), were also color photographs. Most of the scene pictures were taken from Ross, Sadil, Wilson and Cowell (2018), but several images from this original set were removed and replaced with other items in cases where an image was deemed too semantically related to another image in the set (e.g., tennis court and volleyball court). The final sets of scene images (both indoor and outdoor) contained no image pairs with overlapping context (e.g., there were no two beaches or two tennis courts). All scene images were then resized to a standard size (600×800 pixels) to eliminate any large differences in image size. Again, for each category of scenes we aimed to find a relatively heterogeneous set of exemplars that would be reasonably distinguishable from each other and elicit distinct, unique names.
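The text does not specify the software used to standardize image sizes; as one possible way to reproduce this preprocessing step, the sketch below uses the magick package in R. File names are placeholders, and the exact resizing pipeline used for the published stimuli may have differed.

```r
# One possible image-standardization step (illustrative only; the original
# preprocessing tool is not specified in the text, and file names are placeholders).
library(magick)

img <- image_read("zebra_original.jpg")
img <- image_scale(img, "600x600")                    # fit within 600 x 600
img <- image_extent(img, "600x600", color = "white")  # pad to exact size on white
image_write(img, "zebra_600x600.png")
```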

Procedures

For similarity ratings, for each category of 60 images (animals, inanimate objects, indoor scenes and outdoor scenes), there were 1770 possible pairs of images (“60 choose 2” combinations, in mathematical terms). The 1770 pairs were divided into seven groups: six groups of 253 trials, plus one group of 252 trials. The 1770 image pairs were assigned randomly to the seven groups, to avoid any systematic ordering (for example, all image pairs containing a printer being assigned to Group 1 within inanimate objects). An additional 10 attention-check trials were embedded within each image-pair group to track whether participants were paying attention across the trials. Attention-check trials presented two identical images, such that only a “highly similar” response (“5”) would be appropriate. The 10 attention-check trials were randomly interleaved among the regular trials in each image-pair group, yielding a total of 263 (or 262) trials within each image-pair group. Each set of 263 (or 262) trials, comprising a group of image pairs and 10 attention-check trials, constituted a single “task” or “HIT” on Amazon MTurk, with a 20-minute time limit. A very small number of HITs were collected after Amazon MTurk instituted a change to the maximum allowed length of a HIT, such that these HITs were constrained to contain fewer trials than in our original design. To collect this last batch of data, we conducted more HITs with fewer trials per HIT, and this is reflected in the indoor-semantic and outdoor-semantic rows of Table 1.
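For concreteness, the pair construction and group assignment can be sketched in R as follows. This is an illustration rather than the scripts used for the study: the image labels, random seed and helper function are hypothetical.

```r
# Illustrative sketch of pair construction and HIT assembly for one category.
set.seed(1)

images <- sprintf("object_%02d", 1:60)        # placeholder image labels
pairs  <- as.data.frame(t(combn(images, 2)))  # "60 choose 2" = 1770 pairs
names(pairs) <- c("item1", "item2")

# Randomly assign the 1770 pairs to seven groups: six of 253 and one of 252
group_sizes <- c(rep(253, 6), 252)
pairs$group <- sample(rep(1:7, times = group_sizes))

# Build one HIT: one group of pairs plus 10 attention-check trials in which
# the same image appears twice (only a rating of 5 is appropriate)
make_hit <- function(group_pairs) {
  checks <- data.frame(item1 = sample(images, 10))
  checks$item2 <- checks$item1
  hit <- rbind(group_pairs[, c("item1", "item2")], checks)
  hit[sample(nrow(hit)), ]                    # shuffle trial order
}
hit1 <- make_hit(pairs[pairs$group == 1, ])   # 263 (or 262) trials
```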

Within each HIT, participants saw a series of trials in which they were asked to rate the similarity between two pictures presented on the screen, based on either semantic or visual properties of the images (Fig. 1). For visual similarity ratings, participants were asked “How visually similar are the following two pictures? Please only make judgments based on how much the pictures look alike.” For semantic similarity ratings, participants were asked “How semantically similar are the following two pictures? Please only make judgments based on how much the pictures have to do with each other.” Ratings were made using a numerical scale from 1 (not similar at all) to 5 (highly similar). Participants responded by checking the box of the corresponding value. After a box was checked, the next trial was presented. After finishing all trials in the HIT, participants submitted their ratings by clicking the submit button.

Fig. 1 MTurk similarity rating trial examples. Participants were asked to rate either the visual (a) or semantic (b, c) similarity between two images presented on the screen. Instructions for rating the visual similarity were identical for all four image categories; however, instructions for rating the semantic similarity differed by the examples given in the second sentence for animals/objects (b) versus indoor/outdoor scenes (c).

Each worker was assigned a unique ID number by Amazon MTurk, which enabled restrictions on how participants could complete multiple HITs. There were four stimulus sets (animals, inanimate objects, indoor scenes and outdoor scenes) and two rating types (visual, semantic). No participant was allowed to rate a given pair of images twice, meaning that participants (1) could not complete a particular HIT more than once (say, the HIT containing image pairs #1-253 for “inanimate object-visual” ratings) and (2) could not complete both kinds of ratings for a particular set of image pairs (say, the HIT containing image pairs #1-253 for “inanimate object-visual” ratings and the HIT containing the same pairs for “inanimate object-semantic” ratings). An additional constraint, imposed incidentally for convenience, was that participants who had completed ratings of one type (either visual or semantic) for a set of image pairs in a given category could not complete ratings of the other type for any set of image pairs in that category. Outside of these constraints, participants were allowed to complete multiple HITs, always providing ratings for stimulus pairings to which they had not previously been exposed.

We collected at least 10 ratings per image pair (Table 1). To assess whether 10 ratings per image pair are adequate to provide a stable MDS solution, we conducted simulation analyses. In brief, we sampled subsets of the full dataset that steadily increased in size and asked how much the addition of one extra rating changed the MDS solution at each dataset size (see Supplementary Information). The simulation results for the 2D case are presented in Supplementary Figure 1, and they imply that the MDS solutions would not be greatly modified by continuing to collect more ratings beyond 10 per image pair (other dimensionalities are also presented in the Open Science Framework (OSF) project at https://osf.io/smk25/).

Following the online collection of similarity ratings, we obtained naming data from 14 in-person participants. The 240 images were presented across eight blocks of 30 trials each. Blocks alternated between two broad image categories (animals/inanimate objects and indoor/outdoor scenes), and the category of the initial block was counterbalanced across participants. We informed participants that they would be naming animate and inanimate objects, and indoor and outdoor scenes, and we requested that they provide specific and identifiable labels, avoiding more generic identifiers such as “wild animal” or “store.” Image presentation was self-paced, such that the next image did not appear until participants submitted their response for the current image. Formal or quantitative analyses of these data were not of primary interest, but the raw data and “counts” of supplied names for each image have been made available at the OSF repository. Not only do these data provide normative names for all of the stimuli, but the “spread” of names for each image offers an additional qualitative assessment of whether it was consistently identified.

Statistical analysis

Multidimensional scaling (MDS)

Data analyses were performed using R (R Core Team, 2016) and the smacof package for multidimensional scaling (de Leeuw & Mair, 2009). For all stimulus sets (inanimate objects, animals, indoor scenes, outdoor scenes), we used the similarity ratings to create distance matrices for visual and semantic ratings separately, yielding eight matrices (four stimulus sets by two rating types). The similarity ratings were transformed into distance ratings by subtracting the original pairwise similarity ratings from 5, which yielded perceptual distance values ranging from 0 to 4 (e.g., a similarity level of 1 becomes a “perceptual distance” of 4; a similarity level of 5 becomes a “perceptual distance” of 0). Next, we averaged these perceptual distance values across participants and used the resulting average distance matrix for each category-rating combination as the input for non-metric MDS analysis (Kruskal & Wish, 1978).
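The sketch below illustrates this distance transformation and the non-metric MDS step in R with the smacof package. The data layout and column names (a long-format data frame with columns item1, item2, participant and similarity) are assumptions for illustration, not the study's actual analysis script.

```r
# Sketch of the similarity-to-distance transformation and non-metric MDS,
# assuming a long-format data frame `ratings` with columns item1, item2,
# participant, and similarity (1-5); names are illustrative.
library(smacof)

ratings$distance <- 5 - ratings$similarity    # similarity 1-5 -> distance 4-0

# Average distances across participants for each image pair
avg <- aggregate(distance ~ item1 + item2, data = ratings, FUN = mean)

# Assemble a symmetric 60 x 60 distance matrix
items <- sort(unique(c(avg$item1, avg$item2)))
D <- matrix(0, length(items), length(items), dimnames = list(items, items))
D[cbind(as.character(avg$item1), as.character(avg$item2))] <- avg$distance
D[cbind(as.character(avg$item2), as.character(avg$item1))] <- avg$distance

# Non-metric (ordinal) MDS in two dimensions
fit <- smacofSym(D, ndim = 2, type = "ordinal")
fit$stress                                    # stress of the 2D solution
head(fit$conf)                                # 2D coordinates for each image
```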

Dimensionality of the MDS space and scree plots of stress level

Because MDS involves a reduction in dimensionality of the original data, it can be helpful to find the optimal dimensionality for representing the data. The optimal dimensionality of an MDS solution is usually determined by a scree plot, which displays the stress level of an MDS solution at each level of dimensionality assayed. Stress is the value that is optimized in finding the MDS solution, and it provides a measure of the mismatch between the Euclidean distances between each pair of stimuli in the MDS solution and the corresponding distances in the empirical data (Kruskal & Wish, 1978). A higher stress value indicates a poorer fit between the distances in the empirical dataset and the distances in the MDS solution, and stress is invariably higher for MDS solutions of lower dimensionality. However, in choosing the dimensionality there is a trade-off: although stress decreases as the number of dimensions increases, it is harder to interpret the data visualization offered by MDS in higher dimensions.

For each of the eight category-rating combinations (e.g., inanimate objects-visual), we calculated six different MDS solutions with dimensionalities from 2 to 7. We created scree plots showing the normalized stress level (Kruskal & Wish, 1978; Takane, Young, & de Leeuw, 1977) at each dimensionality of MDS solution, for each category-rating combination. To conveniently visualize stimulus-relatedness, we constructed maps based upon the 2D MDS solutions for each category-rating combination.
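A scree plot over dimensionalities 2 through 7 can be produced along the following lines. This is a sketch that reuses the averaged distance matrix D from the previous example and is not the exact plotting code used for the published figures.

```r
# Sketch of the scree-plot computation for one category-rating combination,
# reusing the averaged distance matrix D from the previous example.
library(smacof)

dims   <- 2:7
stress <- sapply(dims, function(k) smacofSym(D, ndim = k, type = "ordinal")$stress)

plot(dims, stress, type = "b",
     xlab = "Dimensionality of MDS solution", ylab = "Stress",
     main = "Scree plot")
```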

Monte Carlo simulations to determine the validity of MDS solutions

Although stress level provides a measure of how well an MDS solution fits the raw distance data, low stress level per se is not a guarantee that the MDS solution has found a meaningful interpretation of the similarity relationships in the empirical data. Therefore, we performed a Monte Carlo simulation, sampling random similarity ratings, to test how the stress levels of MDS solutions generated from random ratings compare to the stress levels we observe in the empirical MDS solutions (Hout et al., 2014). If the MDS solution for the empirical data has identified meaningful similarity relationships, then the empirical stress levels will fall in the extreme left-hand tail of the null distribution generated by the Monte Carlo simulations. To generate a Monte Carlo null distribution with characteristics that resembled the empirical distribution, but in which meaningful relationships were obliterated, ratings were randomly sampled with replacement from the pool of 141,600 empirical ratings (this total reflects 10 ratings of 1770 image pairs in each of the eight category-rating combinations). For each simulation, we sampled 10 ratings for each image pair (to simulate 10 participants) and assigned them randomly to the image pair labels. The 10 ratings per image pair were then averaged together before being entered into the MDS analysis, and the stress level of the resulting MDS solution was computed. This whole process was carried out 1000 times to construct a density plot of stress levels from random data that contain no meaningful similarity relationships.
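The Monte Carlo procedure can be sketched as follows. Here `all_ratings` stands in for the pooled vector of 141,600 empirical ratings and is an assumption of this illustration; the study's own simulation code may differ in its details.

```r
# Sketch of the Monte Carlo null distribution of stress levels; `all_ratings`
# is a hypothetical pooled vector of the 141,600 empirical similarity ratings.
library(smacof)
set.seed(1)

n_pairs <- 1770                               # image pairs per category
null_stress <- replicate(1000, {
  # Draw 10 random "participant" ratings per pair, with replacement
  rnd <- matrix(sample(all_ratings, n_pairs * 10, replace = TRUE),
                nrow = n_pairs, ncol = 10)
  avg_dist <- 5 - rowMeans(rnd)               # average, then convert to distance
  Dnull <- matrix(0, 60, 60)
  Dnull[lower.tri(Dnull)] <- avg_dist         # 60*59/2 = 1770 cells
  Dnull <- Dnull + t(Dnull)                   # symmetrize
  smacofSym(Dnull, ndim = 2, type = "ordinal")$stress
})

plot(density(null_stress), main = "Null distribution of 2D stress")
abline(v = 0.110, col = "red")                # highest empirical 2D stress
```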

Results

Similarity rating descriptive statistics

The mean and standard deviation of participant similarity ratings within each condition are provided in Table 2. Participants tended to use the lower end of the rating range, corresponding to a judgment of "dissimilar." This is perhaps unsurprising since all image pairs (except on attention check trials) comprised two items from different basic-level categories (e.g., a hippo and a crab, rather than two crabs). However, violin plots of all similarity ratings shown in Supplementary Figure 3 demonstrate that participants did use the full range of ratings across all image pairs.

Table 2 Similarity rating descriptive statistics

Additionally, we assessed inter-rater reliability via Kendall’s coefficient of concordance (W) for each combination of image category, rating type and HIT (Table 3). Kendall’s W averaged over all 56 groups was .34 (SD = .09) and ranged between .16 and .56. Chi-square tests revealed that inter-rater reliability was statistically significant in all groups (p < .001), indicating that participants tended to agree in their ratings of the same image pairs.
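As one way to compute this statistic, the irr package provides Kendall's W directly. The sketch below assumes a hypothetical (image pairs x raters) matrix of similarity ratings for a single HIT; it is not the study's own analysis code.

```r
# Sketch of the inter-rater reliability check for one HIT; `hit_ratings`
# is a hypothetical (image pairs x raters) matrix of similarity ratings.
library(irr)

res <- kendall(hit_ratings, correct = TRUE)   # correct for tied ratings
res$value                                     # Kendall's W
res$p.value                                   # chi-square test of W
```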

Table 3 Inter-rater reliability as measured by Kendall’s coefficient of concordance (W)

Multidimensional scaling solutions

We performed MDS with dimensionalities 2 through 7 for all eight category-rating combinations (e.g., animals-visual, indoor-semantic, etc.). The full MDS solutions at all six dimensionalities for all eight category-rating combinations are provided at https://osf.io/smk25/. To visualize the results of MDS, we plotted the 2D solutions for all eight category-rating combinations, in Figs. 2, 3, 4, 5, 6, 7, 8 and 9. Each plot shows all 60 stimuli in the set (e.g., all animals in Fig. 2) located in a 2D map such that the physical distances between images on the plot reflect the distances between images in the 2D MDS solution derived from the similarity ratings data. Inspection of two 2D solutions derived from the same set of images via different ratings (for example, animals rated visually and animals rated semantically in Figs. 2 and 3, respectively) reveals that participants were following instructions, and that the MDS solution identified meaningful relationships. For instance, the seahorse was rated as being close to the snail, frog and lizard in the visual ratings MDS solution, whereas it was grouped with other sea animals, such as the starfish and crab, in the semantic ratings MDS solution.
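A minimal R sketch of such a 2D map is shown below, using text labels in place of the image thumbnails shown in the published figures; `fit` is assumed to be the smacofSym() result from the Methods sketch.

```r
# Minimal sketch of a 2D similarity map; text labels stand in for the image
# thumbnails used in the published figures.
coords <- fit$conf
plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2",
     main = "2D MDS similarity map")
text(coords, labels = rownames(coords), cex = 0.7)
```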

Fig. 2 Visualization of the 2D MDS solution for animal-visual similarity ratings

Fig. 3 Visualization of the 2D MDS solution for animal-semantic similarity ratings

Fig. 4 Visualization of the 2D MDS solution for inanimate object-visual similarity ratings

Fig. 5 Visualization of the 2D MDS solution for inanimate object-semantic similarity ratings

Fig. 6 Visualization of the 2D MDS solution for indoor scene-visual similarity ratings

Fig. 7 Visualization of the 2D MDS solution for indoor scene-semantic similarity ratings

Fig. 8 Visualization of the 2D MDS solution for outdoor scene-visual similarity ratings

Scree plots of stress levels

Figure 10 displays scree plots showing the normalized stress level (Kruskal & Wish, 1978; Takane et al., 1977) at each dimensionality of MDS solution for each category-rating combination. The scree plots indicate, as expected, that greater dimensionality leads to lower stress. However, for all category-rating combinations, even the 2D MDS solution has a stress level well below 0.376 (Table 4), indicating a good fit between the proximities between images in the MDS solution and the rating distances in the empirical data (Sturrock & Rocha, 2000). This suggests that the 2D visualizations for all stimulus image categories may be sufficient to represent the similarity relationships between the images. However, we nonetheless make the MDS solutions for all dimensionalities available at https://osf.io/smk25/.

Table 4 Summary of stress levels for MDS solutions in two dimensions

Monte Carlo simulations to assess MDS solutions

We created MDS plots from randomly generated similarity ratings to produce a null distribution of stress levels for the scenario in which there is no meaningful similarity information in the data. Figure 11 shows, for the case of 2D MDS solutions, the distribution of stress levels associated with MDS solutions derived from randomly sampled similarity ratings, generated via bootstrapping. In this figure, the highest stress level from among all eight category-rating conditions (namely, indoor scenes-semantic ratings, with a stress level of 0.110 for the 2D MDS solution) is displayed as a vertical red line. As seen in the figure, the stress level in our empirical MDS solution is far below the lower tail of the null distribution that was generated by sampling similarity ratings randomly. This indicates that the empirical MDS solutions fit the similarity relationships in the ratings data far better than MDS solutions applied to randomly sampled image-pair ratings, and thus that they capture meaningful similarity structure absent from the random data. The analogous simulations for MDS solutions in 3D to 7D are shown in Supplementary Figure 2.

Discussion

The goal of this study was to provide visual image stimulus sets with accompanying naming data, and quantitative similarity metrics for both visual and semantic judgments. We provide these stimulus sets and metrics for use by psychologists and cognitive scientists running empirical studies involving objects and scenes.

The similarity data were collected through MTurk via pairwise ratings, in which participants judged the similarity of two simultaneously presented stimuli. Alternative means of collecting similarity data include the spatial arrangement method (SpAM; Goldstone, 1994), which has single- and multi-arrangement variants, inverse MDS (Kriegeskorte & Mur, 2012), free-sorting (Coxon, 1999) and others. One disadvantage of the pairwise method is that the number of pairings increases rapidly (quadratically) with the number of stimuli in the set. In a lab environment, it is difficult to obtain enough participants, or enough ratings per participant, to collect sufficient data for all pairwise ratings. Alternative methods often require fewer trials than pairwise ratings, but each has its own disadvantages, such as precluding the discovery of similarity structures in more than two dimensions for the single-arrangement method, or limiting the data to binary similarity measures (same versus different category) for free-sorting (Kriegeskorte & Mur, 2012). Further, because each of our stimulus categories contained 60 stimuli, the single-arrangement SpAM method would have required spatially arranging an unreasonable number of images, with each image compared against 59 others simultaneously. However, 60 items yield only 1770 pairs, which is a manageable number for pairwise ratings when many participants can be recruited via MTurk.

We validated the MDS solutions derived from the empirical similarity ratings in two quantitative ways. First, we verified that stress levels for all MDS solutions fell into a range typically taken to indicate good agreement between similarity relationships in the data and relationships in the MDS solution. Second, Monte Carlo simulations revealed that the distances between items in our MDS solutions corresponded to the actual distances between items in our empirical ratings data to a greater degree than would be expected from a random sampling of similarity ratings. Finally, we validated the MDS solutions qualitatively, by visual inspection. Inspection of the 2D MDS solutions in Figs. 2 through 9 reveals that the perceptual and conceptual similarity spaces are intuitive and sensible. Further, the differences between the perceptual (visual) and conceptual (semantic) similarity maps for a given stimulus category were entirely as expected. For example, the inanimate objects were clustered in the visual condition according to color or global form, but in the semantic condition according to their function or the environment in which they are typically found (Figs. 4 and 5).

Fig. 9 Visualization of the 2D MDS solution for outdoor scene-semantic similarity ratings

Fig. 10 Scree plots for each category-rating combination. For all eight combinations of image category (animal, inanimate object, indoor scene, outdoor scene) and similarity rating type (visual, semantic), the level of stress (y-axis) decreased as the dimensionality of the MDS solution (x-axis) increased. However, even at two dimensions, all MDS solutions have stress levels below 0.12, indicating that they fit the empirical data well.

Fig. 11 Comparison between the distribution generated by Monte Carlo simulation of 2D MDS solutions and the highest stress level from any 2D MDS solution for the empirical data. The red vertical line denotes the highest empirical stress level from the 2D MDS solutions for all of our category-rating combinations: the indoor scenes-semantic rating (0.110). The line falls far below the lower tail of the bootstrapped distribution, which was generated by repeatedly sampling similarity ratings randomly from the rating data from all participants, deriving the 2D MDS solution and calculating the stress. The placement of the red vertical line indicates that the empirical MDS solution was successful in identifying meaningful similarity relationships from the data. Equivalent figures for three to seven dimensions yield the same conclusion and are shown in the Supplementary Information.

We also examined whether the dataset size of 10 ratings per image pair was sufficient to attain a stable MDS solution that would not change substantially with the collection of further ratings (see Supplementary Information). For this, we conducted “subsampling” analyses in which we sampled subsets of the full dataset that systematically increased in size. Our approach was to measure the distance between pairs of MDS solutions created from datasets that differed in size by one rating (i.e., n ratings versus n+1 ratings), and to identify the dataset size at which incorporating an additional rating produced a negligible change in the MDS solution, that is, “diminishing returns” in eliminating any residual distance between successive solutions. As seen in Supplementary Figure 1 (the results for 2D MDS solutions) and on the OSF repository (results for 3D through 7D solutions), the extent to which the MDS solution is altered by adding an additional rating stabilizes at a low asymptote by the time the dataset reaches a size of 10 ratings, for all category-rating combinations in the 2D through 5D MDS solutions. We suggest that researchers wishing to use MDS solutions with dimensionalities 6D or 7D consult the Supplementary Information and the “Subsampling Analysis” figures at https://osf.io/smk25/ to determine the stability of the MDS solution for the dataset of interest at the desired dimensionality. We reiterate that inter-rater reliability was statistically significant for all category-rating combinations and that, for all MDS dimensionalities, stress levels were significantly below those of MDS solutions for randomly sampled data. These additional “subsampling” analyses simply provide a guide as to which dimensionality of MDS solution may be optimally stable for a given dataset.
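The subsampling logic can be illustrated with the sketch below, which is not the study's analysis code. It assumes a hypothetical `rating_array` holding the 1770 x 10 ratings for one category-rating combination, and it summarizes the change between successive MDS configurations with a symmetric Procrustes statistic from the vegan package; the distance measure used in the Supplementary Information may differ.

```r
# Illustration of the subsampling stability check (assumed data layout and
# change measure): compare 2D MDS configurations fit to n-1 versus n ratings.
library(smacof)
library(vegan)

fit_conf <- function(ratings_subset) {
  avg_dist <- 5 - rowMeans(ratings_subset)
  D <- matrix(0, 60, 60)
  D[lower.tri(D)] <- avg_dist
  smacofSym(D + t(D), ndim = 2, type = "ordinal")$conf
}

change <- sapply(2:10, function(n) {
  conf_small <- fit_conf(rating_array[, 1:(n - 1), drop = FALSE])
  conf_large <- fit_conf(rating_array[, 1:n])
  procrustes(conf_large, conf_small, symmetric = TRUE)$ss  # residual after alignment
})

plot(2:10, change, type = "b",
     xlab = "Ratings per image pair", ylab = "Change in MDS solution")
```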

We make all stimuli, naming data, ratings data and the full MDS solutions for both visual and semantic ratings available to other researchers, at the following OSF website: https://osf.io/smk25/. We hope that this provides a valuable set of visual stimuli for experiments involving objects and scenes in which the similarity between different instances is critical to the experimental design.