Normative measures of verbal material are of special interest in psycholinguistics and cognitive research, in which they are used to control for confounding variables and so create balanced item sets in experimental procedures, as well as to achieve better comprehension of the organization of our conceptual system. Traditionally, normative studies have included classical psycholinguistic variables, such as word frequency, affective properties, orthographic/phonological metrics, concreteness, or imageability ratings. Concreteness and imageability (and their associated ratings) have often been used interchangeably in the literature (e.g., Connell & Lynott, 2012; Kousta, Vigliocco, Vinson, Andrews, & Del Campo, 2011), due to their high correlation and theoretical relationship. However, the two concepts reflect, at least partially, different aspects of semantic representations, with concreteness representing the degree to which a word refers to a perceptible entity, and imageability scores strongly correlating with a concept’s visual properties (Brysbaert, Warriner, & Kuperman, 2014; Connell & Lynott, 2012, 2015).

Despite the importance assigned to these variables in facilitating word processing (the well-known concreteness effect; e.g., Paivio, 1991), imageability and concreteness have failed to explain and predict human behavior in a conclusive way, with evidence pointing to the opposite facilitation (i.e., abstractness effect; see, e.g., Kousta et al., 2011) or to no effects (e.g., Barca, Burani, & Arduino, 2002). The inconsistency of the empirical data has led to the idea that both concreteness and imageability could be considered noisy measures (Connell & Lynott, 2012) that do not offer an accurate approximation of the perceptual basis of concepts.

At the same time, the last decades have seen a flourishing of research within the embodied-cognition framework, suggesting a strong involvement of the sensorimotor system in language comprehension [Footnote 1] (see Meteyard, Cuadrado, Bahrami, & Vigliocco, 2012; Pulvermüller, 2018, for a recent review). This evidence has led to a proliferation of questionnaires investigating the perceptual and motor features of a word’s referent (e.g., Juhasz, Yap, Dicke, Taylor, & Gullick, 2011; Lynott & Connell, 2009, 2013; see Lynott, Connell, Brysbaert, Brand, & Carney, 2019, for the largest norm dataset). For example, sensory experience ratings (SER; Juhasz et al., 2011) are aimed at capturing the extent to which a certain word evokes a sensory and/or perceptual experience in the reader’s mind. To validate the obtained resources and provide evidence in favor of their relevance for psychological studies, such variables are typically tested against human performance, in particular response latencies obtained in chronometric studies with word stimuli. Juhasz et al. (2011), for example, collected SER for over 2,850 words and tested them against lexical-decision data for monosyllabic words from two English megastudies (Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Keuleers, Lacey, Rastle, & Brysbaert, 2012). The authors found that words with higher SER elicited faster and more accurate responses than did words with lower SER. Several studies have addressed and replicated this point (Juhasz & Yap, 2013; see Bonin, Méot, Ferrand, & Bugaïska, 2015, for similar results in French), extending the results also to noun–noun compounds (Kuperman, 2013) and semantic tasks (Zdrazilova & Pexman, 2013).

It is crucial to note that in the aforementioned studies, participants were instructed to evaluate the degree to which a certain word evoked a general sensory experience, without distinguishing among the five senses. This choice leaves it to participants to consider all the different modalities through which an object can be experienced, with the potential limitation of leading to an underspecified characterization of the variable of interest or to an overestimate of one perceptual modality as compared to the others (Connell & Lynott, 2016; Lynott et al., 2019). A stronger measure, in this respect, can be obtained by asking participants to rate the perceptual strength of a given word separately for each of the five senses (Lynott & Connell, 2009, 2013).

In the last few years, perceptual modality norms of this kind have spread widely, becoming available in many different languages, such as Dutch (Speed & Majid, 2017), Russian (Miklashevsky, 2018), and Mandarin (Chen, Zhao, Long, Lu, & Huang, 2019), and their validity has been tested with several experimental paradigms. For example, Speed and Majid (2017) used perceptual strength norms in a similarity judgment task, finding that words from the same dominant modality were rated as more similar than words from different dominant modalities, and that this effect was enhanced for word pairs with higher ratings. Moreover, they investigated whether perceptual modalities were experienced differently in spatial terms, by running a lexical decision experiment in which words were presented in proximal or distal space. Interestingly, they found that words dominant in olfaction were processed faster in proximal than in distal space, as compared to other modalities, suggesting that olfactory information is mentally simulated as being close to the body. Perceptual norms have also been validated in modality-switch-cost tasks, in which participants are typically asked to verify a series of properties of a concept (e.g., TIGER–striped, visual). The behavioral pattern shows that participants are slower when the following target concerns a different modality (e.g., WHISTLE–shrill, auditory), as compared to the same perceptual modality (e.g., CANDLE–flickering) (e.g., Pecher, Zeelenberg, & Barsalou, 2003; van Dantzig, Pecher, Zeelenberg, & Barsalou, 2008; Vermeulen, Niedenthal, & Luminet, 2007).

But to what extent do perceptual strength ratings reflect concreteness and imageability, and can they explain human performance?

To the best of our knowledge, only Connell and Lynott (2012) have investigated this issue, by comparing perceptual modality ratings with concreteness and imageability scores and testing the three measures as competing predictors of participants’ performance in word recognition tasks. Their findings suggested that the maximum perceptual strength, namely the rating value of the dominant perceptual modality, predicted accuracy and reaction times better than concreteness and imageability. However, at present, these results have not been replicated in languages other than English. This sits uncomfortably with the evidence that ratings concerning the properties of word-denoted objects also reflect lexical statistics (as captured by models trained on text corpora; Hollis & Westbury, 2016). In fact, if participants are influenced by the distributional properties of the associated words when producing intuitions about referents, it is conceivable that different linguistic experiences (such as being exposed to one language or another) might result in slightly different distributions in semantic norms. Given these considerations, it becomes crucial to search for cross-linguistic evidence concerning the impact of rating norms on language-processing data.

In the present work, we first describe a new resource providing perceptual-modality norms for Italian (following Lynott & Connell, 2013). These new data ideally complement the largest norming work currently available in Italian, namely the Affective Norms for English Words (ANEW; Bradley & Lang, 1999), in its Italian adaptation by Montefinese, Ambrosini, Fairfield, and Mammarella (2014). The dataset is composed of 1,121 words, of which 1,034 are the Italian translations of the ANEW stimuli, and 87 are based on a previously published database (Montefinese, Ambrosini, Fairfield, & Mammarella, 2013). The Italian ANEW includes rating-based norms for three affective variables (namely, valence, arousal, and dominance), as well as for familiarity, imageability, and concreteness. However, it lacks more specific estimates concerning the perceptual properties of the included words: no information concerning the perceptual experience associated with the five senses is provided, hence the perceptual strength of the stimuli cannot be estimated. The norms presented here include perceptual-strength estimates for the 1,121 words of the ANEW database. In the second section, we specifically investigate whether Connell and Lynott’s (2012) results can be extended to other languages by comparing the effect of perceptual strength to that of concreteness and imageability in two novel experiments (lexical decision and word naming tasks) on Italian. Having found that this is not the case, in the third section of this article, we test whether the dissociation that emerged between English and Italian can be considered a genuine cross-linguistic effect or is more trivially due to differences in item selection, by comparing Italian and English datasets including the same (translated) words.

Part 1: Perceptual modality norms for 1,121 Italian words

In this section we present the perceptual-modality ratings collected for 1,121 words from Italian native speakers. We also compare these ratings to concreteness and imageability scores (as measured by Montefinese et al., 2014). If concreteness and imageability are a pure reflection of the degree of perceptual information in a concept, their scores should be positively related to perceptual strength ratings in all five modalities. On the other hand, following the findings of Lynott and Connell (2013), it is also conceivable that concreteness and imageability reflect some perceptual modalities more than others.

Method

Participants

A total of 57 students (males = 28; age = 23.6 ± 5.2 years) of the University of Milano-Bicocca took part in the experiment in exchange for course credit. Participants were Italian native speakers. The study was approved by the local ethical committee and participants’ ethical treatment was in accordance with the principles stated in the Declaration of Helsinki.

Materials

The item set contained the 1,121 items from the Montefinese dataset (Montefinese et al., 2014). It comprises 20% adjectives, 69% nouns, 5% verbs, and 6% words that could be considered either an adjective or a noun. Trial-by-trial data are released as supplementary materials (https://osf.io/zdg59/).

Procedure

Items were randomly presented to participants for perceptual strength ratings in a norming procedure based on Lynott and Connell (2009). Each word was presented on a separate screen, in a sentence that read “To what extent do you experience WORD” (with the WORD slot being filled by a noun or verb target) or “to what extent do you experience something being WORD” (with the WORD slot being filled by an adjective target or a target that could be considered both an adjective and a noun) [Footnote 2]. The sentence was completed underneath by five endings, corresponding to the five perceptual modalities: “by hearing,” “by tasting,” “by feeling through touch,” “by smelling,” and “by seeing.” Each of these endings was paired with a rating scale. An example of a trial is reported in Fig. 1.

Fig. 1 Screen capture of an experimental trial. In the example, the participant was asked “To what extent do you experience MANICHINO (dummy) by hearing, by tasting, by feeling through touch, by smelling, and by seeing.” Participants replied on a Likert scale from 0 (not at all) to 5 (greatly). At the top of the screen, participants could check the questionnaire progression. At the bottom of the screen, the left and right arrows allowed participants to go back to the previous item or move on once they had completed the trial.

Participants were thus asked to rate the extent to which they would perceive the referent of each word through each of the five senses, on a scale ranging from 0 (not at all) to 5 (greatly). The numerical rating scale was displayed with no default value selected, and participants clicked on a number to indicate their rating. Once each word had been rated on all five modalities, participants clicked an arrow placed at the bottom of the screen, in order to move to the following item. Participants were told to evaluate each item using their own judgment because there was no predetermined right or wrong answer. They were also instructed to skip items with which they were unfamiliar, moving directly to the following item. The experiment was self-paced, and participants were able to take a break whenever they wished.

Unlike previous studies, which had divided the item set across different subquestionnaires (e.g., Bonin et al., 2015) or asked participants to rate only one dimension (e.g., color or smell; Díez-Álamo, Díez, Wojcik, Alonso, & Fernandez, 2018), in the present experiment all participants rated the full item set. The order of words was randomized across participants. The order of the five perceptual modalities was fixed across items for each participant but was counterbalanced across participants, in a Latin-square design. The experiment was administered online using the Qualtrics software (Provo, UT). Experiment links were sent to participants via e-mail so that they could complete the questionnaire using their personal laptop, tablet, or smartphone.
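The counterbalancing of modality order can be illustrated with a minimal sketch. The text states only that a Latin-square design was used; the cyclic square below, the rotation rule for assigning participants to rows, and all function names are illustrative assumptions rather than the procedure actually implemented in Qualtrics.

```python
# Response endings for the five modalities, as listed in the Procedure.
MODALITIES = ["hearing", "tasting", "touch", "smelling", "seeing"]

def cyclic_latin_square(items):
    """Build an n x n square in which each item occurs once per row and column.

    Assumption: a simple cyclic square (each row is the previous row
    shifted by one position); the source does not specify the square used.
    """
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

def modality_order(participant_index, square):
    """Assign a participant (0-based index) to a row of the square in rotation."""
    return square[participant_index % len(square)]

square = cyclic_latin_square(MODALITIES)
# One fixed order per participant, for the 57 participants of the norming study.
orders = [modality_order(p, square) for p in range(57)]
```

With 57 participants and five rows, the rows cannot be used equally often: under this rotation rule, two rows are used 12 times and three rows 11 times.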

As a sanity check, we selected 81 items that were unambiguously experienced through one sense more than another (e.g., a tavolo, “table,” is more likely to be experienced through sight than through taste). On these items we evaluated participants’ accuracy to check that they paid attention to the task and did not answer randomly. Response accuracy for these sanity-check stimuli was higher than 80% for all participants (mean = 96.8%, SE = 0.54).

Results

Perceptual modality norms

Participants’ ratings were collapsed, excluding missing trials (0.6% of the data), and for each word, average values were calculated separately for each modality, resulting in a dataset comprising 5,605 unique data points. In Table 1, we report rating means, standard deviations, and standard errors for each of the five modalities.

Table 1 Mean ratings, standard deviations, and standard errors of perceptual strength (on a 0–5 scale) across the five modalities

Each item was assigned as its dominant modality (visual, haptic, auditory, olfactory, or gustatory) the modality that received the highest mean rating (Lynott & Connell, 2009). A few items had ties for the strongest modality (11 items out of 1,121; see Table 2); as in Lynott and Connell (2009), one of the tied modalities was randomly chosen as the dominant one for these items.

Table 2 Words in which ratings of perceptual strength revealed nonunique dominant modality

Table 3 represents the distribution of modality dominance across items, showing their strength with respect to the other perceptual modalities and their exclusivity scores. Modality exclusivity indicates the extent to which a certain item is perceived through a single perceptual modality. Where each item has a vector containing mean ratings for the five modalities, modality exclusivity is calculated as the range of values divided by their sum, according to the formula

$$ \frac{\max \left(\mathrm{x}\right)-\min \left(\mathrm{x}\right)}{\sum \left(\mathrm{x}\right)}\times 100 $$

where x is a vector of mean ratings for each of the five perceptual modalities. In such a way, modality exclusivity scores in principle can range from 0% to 100%, where an entirely multimodal property (scoring equally strongly on all perceptual modalities) would have the lowest modality exclusivity score of 0%, and an entirely unimodal property (scoring zero on all but one perceptual modality) would have the highest modality exclusivity score of 100%. The actual resulting scores ranged from 2.8% (for the item piacere, “pleasure”) to 96.1% (for the item arcobaleno, “rainbow”), with an overall mean of 40.6% (SD = 12.8%; see Table 3).
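The exclusivity formula and the dominant-modality assignment can be sketched as follows. The function names and the example ratings are hypothetical and are not taken from the released norms; random tie-breaking follows the description in the text.

```python
import random

MODALITIES = ("auditory", "gustatory", "haptic", "olfactory", "visual")

def modality_exclusivity(ratings):
    """Range of the five mean ratings divided by their sum, as a percentage."""
    values = [ratings[m] for m in MODALITIES]
    total = sum(values)
    if total == 0:
        # The formula is undefined when every modality is rated zero.
        raise ValueError("exclusivity is undefined when all ratings are zero")
    return (max(values) - min(values)) / total * 100

def dominant_modality(ratings, rng=random):
    """Modality with the highest mean rating; ties are broken at random."""
    top = max(ratings[m] for m in MODALITIES)
    tied = [m for m in MODALITIES if ratings[m] == top]
    return rng.choice(tied)

# Hypothetical mean ratings on the 0-5 scale (illustrative only).
unimodal = {"auditory": 0.0, "gustatory": 0.0, "haptic": 0.0,
            "olfactory": 0.0, "visual": 4.5}
multimodal = {m: 3.0 for m in MODALITIES}
```

An item rated only in the visual modality yields 100%, and an item rated equally in all modalities yields 0%, matching the two extremes described above.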

Table 3 Number of words and exclusivity scores (as percentages) per dominant modality, along with the mean ratings of perceptual strength (0–5) in each modality

Table 4 reports the distribution of the obtained ratings separately for each grammatical class of the items. A chi-squared test of the modality distribution across grammatical classes was not significant [χ²(8) = 12.8, p = .120], suggesting that the item distribution did not differ among the three grammatical classes.

Table 4 Mean ratings of perceptual strength (0–5) across the five modalities in each grammatical class, with standard deviations, standard errors, and numbers of dominant items for the different modalities

Concerning the relationship among the different modalities, not all perceptual modalities were equally distinct, as is shown in the correlation matrix (Bonferroni-corrected) reported in Table 5, as well as in the scatterplot of dominant-modality clusters reported in Fig. 2. Significant correlations were found for most of the modality pairs, although most of the correlations were weak to moderate.

Table 5 Correlations between perceptual-strength scores in different modalities (Bonferroni-corrected)

Fig. 2 Clustering of words dominant in the auditory, haptic, gustatory, olfactory, and visual modalities on two factors extracted from the factor analysis by principal components (A). To better appreciate the relationships between the other modalities, the same plot is reported in panel B after excluding the visually dominant items.

In Fig. 2, ratings on the five modalities have been reduced to two dimensions using principal component analysis (singular value decomposition, explaining 68% of the original variance).

A comparison between Fig. 2A and B clearly highlights that most items were rated by participants as being most experienced through the visual modality. This modality was so preponderant that it encapsulated most of the other modalities, especially the haptic and auditory ones, indicating that many items that are experienced through touch and hearing can also be experienced through the visual modality. The gustatory and olfactory modalities, instead, were relatively separated from the visual elements, in line with the results indicating a nonsignificant correlation between gustatory and visual scores and a weak correlation between olfactory and visual ones. When we do not consider items classified as being visually dominant (panel B), the four remaining modalities show patterns that are relatively segregated from each other.

Correlation-wise (Table 5), the strongest positive relationships were observed between the olfactory and gustatory modalities, which is not surprising, given their chemical-sense status, and between the haptic and visual modalities, showing that objects that can be touched can also be seen. The auditory modality correlated negatively with all the other modalities, which, together with its distinct cluster in Fig. 2, suggests that the more a given word is experienced through hearing, the less the same word is experienced through the other sensory modalities.

Exclusivity scores differed across dominant modalities according to an analysis of variance [F(4, 1116) = 10.5, p < .001]: post-hoc analysis with Bonferroni correction showed that properties with gustatory dominance scored lower in modality exclusivity than did those for all other perceptual modalities (all ps > .528).

Relationship between perceptual strength ratings and concreteness/imageability scores

As a second step, we investigated whether concreteness and imageability reflect the perceptual properties of a word’s referent, or whether they instead represent different information. More specifically, if concreteness and imageability summarize the perceptual features of a word, we should find concreteness and imageability to be positively correlated with ratings for all the perceptual modalities.

In line with this reasoning, we found that imageability and concreteness were highly correlated (see Table 6), suggesting that in our database they capture the same latent variable. The haptic and visual perceptual modalities had strong correlations with both concreteness and imageability. Interestingly, visual perceptual ratings correlated in the same way with concreteness and imageability, whereas the haptic modality correlated more with concreteness than with imageability. The relationships between the olfactory modality and both concreteness and imageability were significant but weak, whereas the auditory modality was negatively correlated with both concreteness and imageability, suggesting that word-denoted objects that can be experienced through hearing are considered more abstract and less imageable. The gustatory modality did not show a significant correlation with either concreteness or imageability.

Table 6 Correlation between concreteness, imageability, and mean perceptual-strength ratings for each modality predictor in the Italian perceptual modality norms (N = 1,121)

Next, we investigated whether our ratings were good predictors of concreteness and imageability. We ran stepwise regression analysis using a backward procedure with either concreteness or imageability ratings as the dependent variable and ratings of auditory, gustatory, haptic, olfactory, and visual perceptual strength as predictors. In both cases, the model comprising the five perceptual modalities was found to be the best one.

Although all five perceptual modalities contributed to the regression model, the direction of this relationship varied across modalities (see Table 7). Auditory and gustatory ratings were negatively related to concreteness: the more strongly a word’s referent was related to taste and sound experience, the less concrete it was. At the opposite extreme, the haptic and visual modalities showed the strongest positive relation with concreteness, followed by olfactory ratings.

Table 7 The t values for each modality of perceptual strength as predictor of concreteness and imageability

The predictors of imageability did not fully overlap with those of concreteness. Whereas the visual and haptic modalities were also the best predictors for imageability scores in this regression, followed by olfaction, the auditory modality did not predict imageability ratings. The gustatory modality was the only one to have a negative effect on imageability, suggesting that words that are experienced through taste are less easy to imagine.

Discussion

In the present work, we collected perceptual modality ratings for Italian, with the aim of releasing a new resource as a complement to the Italian adaptation of the ANEW database.

Following the original works by Lynott and Connell (2009, 2013) on English, perceptual strength norms have been also collected in different languages, including Russian (Miklashevsky, 2018), Dutch (Speed & Majid, 2017), and Mandarin (Chen et al., 2019). To allow for a more straightforward comparison among studies, we report in Table 8 a summary of the results from the available studies conducted with perceptual strength across languages.

Table 8 Modality exclusivity (M.E.) norms and number of items (N) for the five perceptual modalities

First, these results demonstrate that our sample of words was experienced in a multimodal way. The multimodal composition of words has been supported by perceptual ratings of English adjectives, nouns, and verbs (Lynott & Connell, 2009, 2013; van Dantzig, Cowell, Zeelenberg, & Pecher, 2011; Winter, 2016) and of Dutch and Russian nouns (Miklashevsky, 2018; Speed & Majid, 2017). In line with previous norming ratings, we replicated a visual dominance effect (Chen et al., 2019; Lynott & Connell, 2009, 2013; van Dantzig et al., 2011; Winter, 2016) with Italian speakers. Moreover, similar to the previous studies by Lynott and Connell (2009), Lynott et al. (2019), and Speed and Majid (2017), we found gustation to be the most multimodal sense. Different findings were reported for Mandarin, in which the haptic modality was the most multimodal one, and for English nouns, in which the olfactory modality was the least exclusive modality. Russian nouns, on the other hand, received generally higher multimodal scores than did items in other norming datasets. Only in our norming dataset did olfaction receive the highest exclusivity rating, indicating that concepts that can be experienced by smelling are experienced less with the other four perceptual modalities. The auditory perceptual modality received high exclusivity scores as well, in line with previous studies (Connell & Lynott, 2012; Lynott & Connell, 2013; Miklashevsky, 2018). Visual and haptic modalities shared the third position after olfactory and auditory ones. Although haptic average-level multimodality was in line with previous results (except Mandarin items), visual modality was rated as the most unimodal in Dutch and Mandarin words. Such heterogeneous patterns between norming studies may be due to differences in the selection criteria used for the item list composition. Indeed, in Speed and Majid (2017) and Lynott and Connell (2009), items were selected in order to cover all five perceptual modalities equally, whereas Miklashevsky (2018) selected items representative of specific categories (e.g., animals, tools, emotions).

Concerning the relationship among the perceptual variables, the strongest positive correlation was observed between olfactory and gustatory modalities, which is consistent across languages. Similarly, haptic and visual perceptual ratings were positively related across the different samples, reflecting that concepts that can be touched can also be seen (e.g., Lynott et al., 2019). Auditory modality correlated negatively with all the other perceptual experiences. This negative relationship between the auditory modality and the other ones seems to be a robust pattern across different languages and datasets (English: Connell & Lynott, 2012; Lynott & Connell, 2009; Lynott et al., 2019; Russian: Miklashevsky, 2018; Dutch: Speed & Majid, 2017).

As a second step, we compared perceptual ratings with two traditional psycholinguistic variables, namely concreteness and imageability. Our aim was to investigate whether concreteness and imageability reflected the degree of concepts’ perceptual information or were predicted by some sensory modalities more than others. In line with previous studies on English and Russian, we found haptic and visual modalities to be the strongest predictors of both concreteness and imageability.

Taken together, our results highlighted similarities in perceptual ratings across different languages, which may reflect the way in which we experience and interact with our environment.

Part 2: Behavioral evaluation of the collected norms

After norms were collected, we tested their validity in predicting chronometric data. To do so, we ran two novel word-processing studies, namely a lexical decision and a naming task, and tested which measures, among concreteness, imageability, and the different operationalizations of perceptual strength, are better at explaining human performance.

Lexical decision task

Method

Participants

A total of 33 psychology students from the University of Milano-Bicocca (males = 6; age = 23 ± 4.98 years; education = 14.8 ± 1.49 years) took part in the experiment in exchange for course credit. The participants were Italian native speakers and were naïve to the purpose of the experiment. The study was approved by the departmental ethics committee, and participants’ ethical treatment was in accordance with the principles stated in the Declaration of Helsinki.

Materials

The word sample comprised the same 1,121 words used for the normative ratings (Montefinese et al., 2014), plus 1,121 pseudowords matched with the lexical stimuli for orthographic length. Pseudowords were created using the WUGGY software (http://crr.ugent.be/programs-data/wuggy; Keuleers & Brysbaert, 2010), a multilingual pseudoword generator that is able to create orthographic strings that respect the orthotactic rules of a given language.

For each item, we extracted imageability (M = 6.98, SD = 1.16, SE = .03) and concreteness (M = 6.21, SD = 1.66, SE = .05) scores from the ANEW dataset (Montefinese et al., 2014). Lexical frequencies (M = 4706.7, SD = 13957.3, SE = 416.9), on the other hand, were obtained from SUBTLEX-IT (http://crr.ugent.be/subtlex-it/) [Footnote 3]. Trial-by-trial data for words and pseudowords are released in our supplementary materials.

Procedure

Participants took part in a two-session experiment, with each session lasting about an hour. The two sessions took place at the same time of day, at most two weeks apart. After receiving information about the experimental procedure, participants were asked to sign the written informed consent form. Participants were then seated in front of a 17-in. computer screen. They were informed that a string of letters would be presented at the center of the screen, that it could be either a word or a pseudoword, and that they should press the “N” key of the keyboard if the stimulus was a word and the “C” key if it was a pseudoword. Participants were asked to keep their index fingers over the two keys and to respond as fast as possible after stimulus presentation.

A practice sequence took place at the beginning of each session, including ten words and ten pseudowords in a randomized order. Only in this phase did participants receive visual feedback after each trial informing them about their accuracy and response time. The two experimental sessions were composed of 1,120 and 1,122 trials (for a total of 2,242 trials for each participant), and each of them included a break after the first 560 trials. Each trial started with a fixation cross presented for 500 ms at the center of the screen. Subsequently, a written letter string (a word or a pseudoword) was presented for a maximum duration of 2,000 ms (the string disappeared as soon as the software recorded the participant’s response), followed by a blank screen with a fixed duration of 1,500 ms. The order of the stimuli was randomized across participants. The experimental procedure was implemented in E-Prime 3 (Psychology Software Tools Inc., Pittsburgh, PA). Accuracy and reaction times were recorded.

Statistical design and analysis

One participant systematically inverted the response keys in the first session of the experiment, and the data of two participants were partially lost because of a power outage during data collection. Data from these three participants were thus removed from the subsequent analysis. Before aggregating reaction times (RTs), we removed nonword items, incorrect responses (1,874 data points), and RTs shorter than 100 ms (six data points). The raw RTs were then logarithmically transformed and converted into z-scores (over participant and session), following standard procedures in the literature on word recognition (Baayen, 2008; Balota et al., 2007), and finally, by-item average latencies were computed. This procedure ensures a more reliable measure of latency, accounting for individual differences in overall speed and variability (Balota et al., 2007). The dataset used for these analyses is available as supplementary material.
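The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the authors’ script (their analyses were run in R): the trial field names, the function name, and the use of the sample standard deviation for z-scoring are assumptions.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def item_latencies(trials, min_rt=100):
    """Return {item: mean z-scored log RT} from trial-level records.

    Each trial is a dict with the (hypothetical) keys: participant,
    session, item, is_word, correct, rt (in ms).
    """
    # Drop pseudowords, incorrect responses, and RTs shorter than min_rt ms.
    kept = [t for t in trials
            if t["is_word"] and t["correct"] and t["rt"] >= min_rt]

    # Collect log-transformed RTs for each participant-by-session cell.
    cells = defaultdict(list)
    for t in kept:
        cells[(t["participant"], t["session"])].append(math.log(t["rt"]))

    # Mean and sample SD per cell (assumes at least two distinct RTs per cell).
    stats = {cell: (mean(v), stdev(v)) for cell, v in cells.items()}

    # z-score each log RT within its cell, then average by item.
    by_item = defaultdict(list)
    for t in kept:
        m, s = stats[(t["participant"], t["session"])]
        by_item[t["item"]].append((math.log(t["rt"]) - m) / s)
    return {item: mean(z) for item, z in by_item.items()}
```

Standardizing within participant and session before averaging means that a slow participant, or a slow session, does not inflate the latency of the items that happen to have valid responses in that cell.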

Statistical analyses were performed in the statistical environment R (R Core Team, 2008; https://www.R-project.org). We ran a series of linear regressions with RTs [Footnote 4] as the dependent variable. First, we fitted a baseline model with log-transformed item frequency and orthographic length as predictors (for a similar procedure, see Brysbaert & New, 2009). Then, we separately investigated the impact on RTs of each predictor of interest, namely concreteness, imageability, and different operationalizations of perceptual strength, derived from the scores in the five perceptual modalities.

We considered different measures, following the procedures of Connell and Lynott (2012) and Lynott et al. (2019), in order to compute composite variables that reduce the five-dimension profile to a single value.

a) Five perceptual modalities: The five perceptual modalities were added separately to the regression as predictors.

b) Maximum perceptual strength: This corresponds to the highest score across the five perceptual modalities. It has been suggested to be the best composite variable of perceptual strength (Connell & Lynott, 2012, 2016; Connell, Lynott, & Banks, 2018).

c) Minimum perceptual strength: The opposite of the previous variable, this returns the minimum score across the five perceptual modalities.

d) Mean perceptual strength: This represents the mean value of the ratings in the five perceptual modalities. It is equivalent to the summed strength previously used by Connell and Lynott (2012) and Lynott et al. (2019) and considers all dimensions as equally important.

e) Magnitude of perceptual strength or Euclidean vector length: This corresponds to the length of the vector comprising the scores for the five perceptual modalities (for details, see Lynott et al., 2019).

f) Minkowski 3 distance: This reflects the perceptual strength in all the five dimensions, but the influence of weaker dimensions is attenuated. It has been suggested to be the best composite measure to account for multisensory integration in perception (To, Baddeley, Troscianko, & Tolhurst, 2010), and it has been shown to be the best candidate to predict reaction times and accuracy in lexical decision tasks (Lynott et al., 2019).
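As an illustration of how the composite variables in (b) through (f) relate to one another, the following Python sketch computes them for a single word. The function name and the example ratings are hypothetical, and the Minkowski 3 measure is taken here to be the cube root of the summed cubed ratings (the distance of the rating vector from the origin), which is our reading of the description above rather than a formula quoted from the cited work.

```python
import math


def composite_strength(ratings):
    """Composite perceptual-strength measures for one word.

    `ratings` holds the five modality scores (e.g., auditory, gustatory,
    haptic, olfactory, visual); all are assumed to be non-negative.
    """
    return {
        "maximum": max(ratings),
        "minimum": min(ratings),
        "mean": sum(ratings) / len(ratings),
        # Euclidean vector length (Minkowski distance with exponent 2).
        "euclidean": math.sqrt(sum(r ** 2 for r in ratings)),
        # Minkowski distance with exponent 3: weaker dimensions count less.
        "minkowski3": sum(abs(r) ** 3 for r in ratings) ** (1 / 3),
    }


# Hypothetical ratings for one word, invented for illustration.
example = composite_strength([3, 4, 0, 0, 0])
```

For non-negative ratings, the Minkowski 3 value always falls between the maximum score and the Euclidean length, which is one way to see why it attenuates, without discarding, the contribution of the weaker modalities.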

The six measures were added separately to the baseline model, and separate regressions were computed. We compared the resulting regression models in terms of goodness of fit—that is, their ability to explain variance in behavioral performance as compared to the baseline model. For each of the regression models, we calculated the r-squared value, the Akaike information criterion (AIC; Akaike, 1973; Bozdogan, 1987), the Akaike weights (see Wagenmakers & Farrell, 2004), the Bayesian information criterion (BIC; Schwarz, 1978), and a BIC-derived Bayes factor (Wagenmakers, 2007).

R-squared indicates the proportion of variance of the dependent variable that is explained by the predictor (or predictors) in the model. AIC and BIC are popular methods used to compare the adequacy of multiple statistical models by estimating which model fits the data better; both measures penalize model complexity, thus, ceteris paribus, favoring models with fewer parameters. Lower values of AIC and BIC indicate better models. Akaike weights are a simple transformation of the raw AIC values (see Wagenmakers & Farrell, 2004, for procedural details) and capture the probability that a given model is the best one in fitting the data, thus providing greater insight into the model selection procedure. In the same vein, the Bayes factor expresses the magnitude of the difference between the BIC values of different nested and non-nested models, providing a more interpretable measure of how likely the data are to arise from one model than from another (Wagenmakers et al., 2018).
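The two transformations can be sketched in a few lines of Python. This is an illustrative sketch rather than the R code used for the reported analyses; the function names and the numerical inputs are hypothetical, and the formulas are the standard ones (Akaike weights as normalized relative likelihoods, and the BIC approximation to the Bayes factor, exp(ΔBIC / 2)).

```python
import math


def akaike_weights(aics):
    """Normalized relative likelihoods of a set of models given their AICs."""
    best = min(aics)
    # Relative likelihood of each model with respect to the best one.
    rel = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]


def bic_bayes_factor(bic_a, bic_b):
    """BIC-based approximation to the Bayes factor favoring model A over B."""
    return math.exp((bic_b - bic_a) / 2)


# Hypothetical AIC values for two models, invented for illustration:
# the second model is worse by 2 * ln(30.9), i.e., about 6.86 AIC units.
weights = akaike_weights([100.0, 100.0 + 2 * math.log(30.9)])

# Hypothetical BIC values differing by 2 * ln(9.2), i.e., about 4.44 units.
bf = bic_bayes_factor(700.0, 700.0 + 2 * math.log(9.2))
```

With only two models in the comparison, an AIC difference of about 6.86 gives the better model a weight of about .969 and an evidence ratio of 30.9, and a BIC difference of about 4.44 corresponds to a Bayes factor of 9.2; small differences in the raw indexes therefore translate into sizable differences in relative support.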

Results

The regression results are summarized in Table 9.

Table 9 AIC, BIC, r-squared, Akaike weights, and Bayes factors of the regressions run over the log-transformed reaction times in the lexical decision task

The results showed imageability to be the best predictor of RTs in lexical decision. This model is 30.9 times more likely to be the best model in terms of Kullback–Leibler discrepancy than the next-best model with the five perceptual modalities as predictors. In other words, the model with imageability is to be preferred over its nearest competitor with a normalized probability of .969.

However, when considering r-squared (hence, not accounting for model complexity), the model with five separate perceptual modalities appears to be on a par with, if not slightly better than, the model with imageability. Following this consideration, we computed an additional model (optimized perceptual modalities) by including as predictors only those perceptual modalities actually contributing to the model fit. In fact, including all five modalities, irrespective of their contribution, increases model complexity; hence, this model is penalized in terms of the AIC and BIC for its larger number of parameters. Following a backward procedure, we excluded the haptic modality from the model and ran the same analysis reported in Table 9.

As compared to the five perceptual modalities model, the optimized one had worse values in terms of AIC (684.299) and r-squared (.5425) but, as expected, better values in terms of BIC (724.4748) and the Bayes factor, with the data being 9.2 times more likely to arise from the optimized model than the one comprising the five perceptual modalities.

However, the direct comparison between imageability (i.e., previous best predictor) and the optimized perceptual modalities model did not lead to significant changes in the model comparison: imageability remained the best predictor in terms of the AIC and BIC indexes and explained variance.

Word-naming task

Method

Participants

A total of 28 psychology students from the University of Milano-Bicocca (males = 4; age = 22.8 ± 2 years; education = 14.18 ± 1.49 years) took part in the experiment in exchange for course credits. Participants were native speakers of Italian and were naive to the purpose of the experiment. The study was approved by the local ethics committee, and participants’ ethical treatment was in accordance with the principles stated in the Declaration of Helsinki.

Materials

The item set comprised the same set of 1,121 words included in the previous experiment (see the Materials section of the lexical decision task for more details). The trial-by-trial database is included in our supplementary materials.

Procedure

Participants took part in a two-session experiment, which lasted about half an hour for each session. The two sessions took place at the same time of day, at most two weeks apart. After receiving information about the experimental procedure, participants were asked to sign the written informed consent. They were then seated in front of a 17-in. computer screen. They were informed that a word would be presented at the center of the screen, and they were instructed to read it aloud as fast as possible.

A practice phase, including ten words that were not part of the dataset, took place at the beginning of each experimental session. The two experimental sessions were composed of 560 and 561 trials (for a total of 1,121 trials for each participant), and each session included a break after the first 280 trials. Each trial started with a fixation cross of 500 ms presented at the center of the screen. Subsequently, an uppercase word was presented, for a maximum duration of 2,000 ms (the word disappeared as soon as the software recorded the participant’s response), followed by a blank screen with a fixed duration of 1,500 ms. The order of the stimuli was randomized across participants. The experimental procedure was implemented in E-Prime 3 (Psychology Software Tools Inc., Pittsburgh, PA). RTs consisted of voice onset times automatically recorded by a microphone connected to the response box. Accuracy was manually recorded by the experimenter.

Statistical analysis

To compute the by-item aggregated RTs, we eliminated incorrect responses (43 data points), RTs shorter than 100 ms (48 data points) or longer than 1,700 ms (seven data points), and cases with technical failures in recording the response (2,072 data points). We followed the same steps as in the statistical analysis of the lexical decision task, keeping word length and frequency as fixed predictors and systematically changing the predictor of interest.

Results

The regression results are summarized in Table 10.

Table 10 AIC, BIC, r-squared, Akaike weights, and Bayes factors of the regressions run over the log-transformed reaction times in the naming task

The results show imageability to be the best predictor of naming RTs, with this model being 1.34 times more likely to be the best model in terms of Kullback–Leibler discrepancy than the next-best model, with the five perceptual modalities as predictors. In other words, the model with imageability is to be preferred over its nearest competitor with a probability of .572. Nevertheless, the model including the five perceptual modalities is a close second, despite its complexity, with an Akaike weight of 0.4128, and indeed it is associated with a higher r-squared score than the model with just imageability.

Comparing results on lexical decision vis-à-vis naming, it is evident that the latter has consistently lower scores in terms of explained variance. This is in line with previous results, showing that variance in naming latencies is typically more difficult to model than variance in lexical decision latencies (Brysbaert & New, 2009; Herdağdelen & Marelli, 2017).

In line with the follow-up analysis for the lexical decision task, we ran an optimized perceptual modalities model, eliminating predictors that did not improve model fit in naming RTs. With this procedure, we removed the haptic and gustatory modalities from the model and ran the same analysis described in Table 10. As compared to the five perceptual predictors, the optimized model was better in terms of AIC (739.4651) and BIC (774.619), with the data being 211.6 times more likely to arise from the optimized model than the model with the five perceptual predictors. The five perceptual modalities model, however, was better in terms of r-squared (.3323 vs. .3303, respectively).

Concerning the direct comparison between imageability and the optimized modality model, imageability was a better predictor in terms of BIC values, and consequently of the Bayes factor, with the data being 145.6 times more likely to arise from the model with imageability as predictor than from the optimized model. Even if the optimized perceptual modalities regressor was slightly better in terms of explained variance (.3303 vs. .3279), the two models were essentially equivalent, AIC-wise (739.47 vs. 739.55).

Discussion

In these further analyses, we evaluated the performance of perceptual strength measures based on our ratings in predicting RTs in lexical decision and word naming. To the best of our knowledge, only Connell and Lynott (2012) addressed this issue, finding maximum perceptual strength (i.e., a composite variable of the five perceptual modalities) to be stronger than imageability and concreteness in predicting chronometric data in word recognition.

Although several authors have claimed that maximum perceptual strength is the best composite variable to represent the multidimensional perceptual profile (e.g., Connell & Lynott, 2012, 2016; Connell et al., 2018; Winter, Perlman, Perry, & Lupyan, 2017), not all researchers agree on this point. Đurđević, Popović Stijačić, and Karapandžić (2016), for example, found summed perceptual strength and vector length to be the best composite variables with which to reduce perceptual strength ratings. Lynott et al. (2019), instead, found Minkowski 3 distance to be the best composite variable to account for the multimodal profile, in line with To et al. (2010), who suggested that it was the best index to represent multisensory dimensions.

In our analyses, in the lexical decision task, imageability was the best predictor for all model fit indexes except the explained variance, where a larger portion was explained when the five perceptual ratings were separately added as predictors. In the naming task, the picture was more nuanced, with imageability being the best predictor according to BIC (and consequently according to the Bayes factor), the five perceptual modalities explaining the larger portion of variance, and the optimized model (comprising, in this case, all but the haptic and gustatory perceptual modalities) being on a par with imageability in terms of AIC.

The discrepancy between the two sets of indexes (AIC and BIC vs. r-squared) is easily explained: AIC and BIC favor the more parsimonious model, that is, the one with fewer parameters. This result is particularly interesting: on the one hand, it highlights that the five-perceptual-rating option, despite being more penalized by the AIC and BIC indexes, is a close second according to the same measures, suggesting that perceptual ratings play a role in predicting human performance in word recognition tasks. On the other hand, we did not replicate previous findings on English: indeed, maximum perceptual strength was not the best predictor of participants’ behavior and did not hold a dominant position among the computed regression indexes.

What could have led to the difference between Connell and Lynott’s (2012) findings and ours? A first possibility is that our results are different because of the item list composition; in fact, in our item set, unlike Connell and Lynott (2012), we included 58 verbs, which are known to be less imageable than nouns (e.g., Bird, Howard, & Franklin, 2003). However, even when we removed the verbs from our item set, the analyses revealed a different pattern of results from the English norms (see the supplementary materials, section A).

A second possible explanation may lie in the instructions administered for rating collection. Although our perceptual norms were collected using the same instructions used by Lynott and Connell (2009, 2013), our concreteness and imageability ratings came from the Italian version of ANEW (Montefinese et al., 2014), rather than from the MRC dataset (Coltheart, 1981), used for the English experiments. The concreteness instructions, indeed, were quite different in the two datasets, with Montefinese et al. suggesting the idea of concreteness being experienced through the five senses, asking participants “to assess the extent to which a word denotes something that can be directly perceived by the senses,” whereas the MRC authors linked concreteness to “objects, materials or persons”: “Any word that refers to objects, materials or persons should receive a high concreteness rating; any word that refers to an abstract concept that cannot be experienced by the senses should receive a high abstractness rating.” Imageability, however, was very similarly defined in the two datasets, and so cannot account for the inconsistency between the two studies. In both cases, indeed, participants were asked to evaluate imageability on the basis of the ease of accessing a “mental image”: “For the imageability scale, we ask you to evaluate how easily you can bring to mind a mental image (e.g., a mental picture, a sound, or other sensorial experience) of a given word when it is presented” (Montefinese et al., 2014) versus “any word which, in your estimation, arouses a mental image (i.e., a mental picture, or sound, or other sensory experience) very quickly and easily” (Coltheart, 1981; Paivio, Yuille, & Madigan, 1968).

A third possible explanation concerns the statistical approach adopted. Connell and Lynott (2012) determined the best-fitting model using an index that measures the explained variance of each alternative, namely r-squared. We opted to compute different indexes to estimate the best-fitting model. Interestingly, Akaike weights suggested the best model to be the one with imageability as predictor, which in the lexical decision task has an r-squared similar to that of the model with the five perceptual modalities. In the naming task, Akaike weights indicated imageability as the best predictor, whereas according to r-squared, the five perceptual modalities led to a higher explained variance. However, notwithstanding these differences, in our analyses on Italian data the maximum perceptual strength (i.e., the best predictor for English) did not hold a dominant position among the computed regression indexes, suggesting that the cross-linguistic difference we observed does not depend on the adopted evaluation methods.

The last hypothesis is that, indeed, the empirical difference we observe depends on the different languages examined: there might be features that differentiate English from Italian, leading to perceptual strength better capturing word processing in the former than in the latter language, in which imageability seems to provide the best predictions. However, before delving more into this last hypothesis, we need to check that the observed dissociation cannot be trivially explained by the different datasets considered in the analysis of Italian vis-à-vis English. To address this possibility, we exploited the ANEW translations in order to extract a dataset of RTs for the English words corresponding to our items from available resources (English Lexicon Project; Balota et al., 2007). If the English stimuli led to the same pattern observed for Italian (i.e., better performance of imageability vis-à-vis perceptual strength), the present results would depend on the item set considered. If English data showed a pattern consistent with Connell and Lynott (2012; i.e., better performance of perceptual strength vis-à-vis imageability), then we would have support for a genuine cross-linguistic difference.

Part 3: Cross-linguistic comparisons

Materials

Relying on existing resources, we retrieved for the English language the same (dependent and independent) variables considered in our analyses of Italian data.

Behavioral data for English stimuli were taken from the English Lexicon Project (ELP; https://elexicon.wustl.edu/), which includes accuracy and reaction times for both lexical decision and naming. ELP lexical decision RTs were collected from 815 American participants, native speakers of English, each one presented with 1,700 words (and 1,700 nonwords), with a total of 34 participants per item. ELP naming RTs, instead, were obtained from 444 American speakers in total, each one presented with 2,500 items (25 participants per item). From ELP we also collected word length and log-transformed frequency (Brysbaert & New, 2009; Keuleers, Diependaele, & Brysbaert, 2010).

English perceptual norms were obtained from the Lancaster scale (Lynott et al., 2019), which comprises 39,707 items rated over six sensorimotor dimensions (visual, haptic, auditory, olfactory, gustatory, and interoceptive) and five action effectors (mouth/throat, hand/arm, foot/leg, head excluding mouth/throat, and torso). Sensorimotor dimensions were rated by 2,625 participants, each one completing a mean of 5.99 lists comprising 58 items.

Imageability and concreteness ratings were taken from the MRC machine-usable dictionary (available online at http://websites.psychology.uwa.edu.au/school/MRCDatabase/uwa_mrc.htm), which comprises 150,837 words and 26 linguistic and psycholinguistic variables (Wilson, 1988).

The three English resources (ELP, Lancaster scale, MRC) were combined to create a parallel dataset, comprising translations of our Italian items along with the corresponding normative and behavioral data from English. To create the dataset, we proceeded as follows: (i) we selected from the Lancaster scale the items overlapping with the Italian version of ANEW (Montefinese et al., 2014), exploiting the translations provided in the resource; (ii) we compared the remaining 1,090 items with the MRC database; and (iii) we compared the remaining 658 overlapping items with the words contained in the ELP; this final step did not further reduce the item list.
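The three-step selection can be pictured as a chain of intersections. The Python sketch below is illustrative only: the function name, the lowercase matching of word forms, and the toy word lists are assumptions, and each resource is reduced here to the set of English words for which it provides the variables of interest.

```python
def build_parallel_items(translations, lancaster, mrc, elp):
    """Filter Italian-English pairs through the three English resources.

    `translations` maps each Italian item to its English translation;
    `lancaster`, `mrc`, and `elp` are collections of English words covered
    by each resource. Returns the surviving pairs after each step.
    """
    def norm(word):
        # Assumption: word forms are matched case-insensitively.
        return word.strip().lower()

    lancaster = {norm(w) for w in lancaster}
    mrc = {norm(w) for w in mrc}
    elp = {norm(w) for w in elp}

    # (i) Items with perceptual norms, (ii) also with MRC ratings,
    # (iii) also with ELP behavioral data.
    step1 = {it: en for it, en in translations.items() if norm(en) in lancaster}
    step2 = {it: en for it, en in step1.items() if norm(en) in mrc}
    step3 = {it: en for it, en in step2.items() if norm(en) in elp}
    return step1, step2, step3


# Toy resources, invented for illustration only.
pairs = {"cane": "dog", "mare": "sea", "sole": "sun", "gioia": "joy"}
s1, s2, s3 = build_parallel_items(
    pairs,
    lancaster=["dog", "sea", "sun"],
    mrc=["DOG", "SEA"],
    elp=["dog", "sea", "sun", "joy"],
)
```

In this toy example the final step leaves the item list unchanged, mirroring the situation described above, in which all items that survived the MRC comparison were also present in the ELP.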

Statistical approach

We ran the same analyses described for the lexical decision and naming tasks, running models with different predictors and considering several indexes of model fit. The first analyses were computed on our lexical decision and naming RTs on the reduced 658-word item set. In this way, we aimed at testing whether the reported results are robust across the two item sets (the original 1,121 item set vs. the reduced, 658-word item set). Then, in order to test the alleged cross-linguistic dissociation, we ran the same analysis on the parallel dataset we obtained for English.

Results

In Tables 11 and 12, we summarize the results obtained for the Italian dataset over the lexical decision and naming tasks, respectively. Analyses were performed to disentangle whether the discrepancies that emerged between Italian and English were due to differences in item list composition or arose from a genuine cross-linguistic effect.

Table 11 AIC, BIC, r-squared, Akaike weights, and Bayes factors of the regressions run over the log-transformed reaction times in the reduced Italian lexical decision dataset
Table 12 AIC, BIC, r-squared, Akaike weights, and Bayes factors of the regressions run over the log-transformed reaction times in the reduced Italian naming dataset

The results for the Italian reduced dataset are in line with the results observed in the full item set: imageability proved to be the best-performing predictor according to the AIC- and BIC-related measures, although the five perceptual modalities, separately introduced in the model, obtained slightly higher explained variance.

Tables 13 and 14 report the results of the same analyses applied to the parallel English dataset.

Table 13 AIC, BIC, r-squared, Akaike weights, and Bayes factors of the regressions run over the log-transformed reaction times in the parallel English lexical decision dataset
Table 14 AIC, BIC, r-squared, Akaike weights, and Bayes factors of the regressions run over the log-transformed reaction times in the parallel English naming dataset

Considering the lexical decision RTs, the best-performing model was the one including the minimum perceptual strength as predictor, according to the AIC, BIC, and Akaike weights. The five perceptual modalities, however, obtained an overall larger explained variance.

Considering the naming RTs, the best-performing model was the one including the Minkowski distance, according to all but the r-squared model fit index. The model with the five perceptual modalities included as separate predictors was associated with the highest explained variance.

Discussion

The present section aimed to disentangle whether differences between our results and previous findings in English could be due to differences in item selection, rather than to a genuine cross-linguistic effect. To investigate this possibility, we compared Italian and English datasets comprising the same items (following the Montefinese et al., 2014, translation) and ran the same statistical analyses on both sets. The rationale behind this procedure was that if differences between Italian and English were due to item selection, comparing datasets that include the same (translated) words would cause the dissociation to disappear. Conversely, if we were observing a genuine cross-linguistic effect, then such a difference should be confirmed through this procedure. Given these premises, we found that results on the reduced Italian dataset were consistent with findings over the complete database. More importantly, we found that differences between Italian and English were robust when considering the same items, thus suggesting the presence of a genuine cross-linguistic effect. However, it must be noted that our analysis on the English items only partially replicated the pattern found by Lynott et al. (2019). Indeed, in line with this study, we found the Minkowski distance composite variable to be the strongest predictor of naming RTs, followed by the maximum perceptual strength, which was found to be the best predictor in Connell and Lynott (2012) and was confirmed as a good predictor in Lynott et al. (2019). Considering lexical decision RTs, the pattern observed in our English dataset suggested that minimum perceptual strength (namely, the minimum score across the five perceptual modalities) was the best composite variable among the perceptual modalities for predicting behavioral performance. This result is partially in line with previous studies, showing that a measure of perceptual strength was stronger in predicting behavioral performance with English items than was either concreteness or imageability. However, the specific characterization of perceptual strength (i.e., the minimum score in the modality norms) has never been reported as the best predictor in previous norms. With the present data, we cannot evaluate whether this difference is theoretically relevant or simply depends on small variations between different characterizations of the same latent variable. We leave this question to future studies.

How can the observed cross-linguistic dissociation be explained? At present, we can only speculate as to the reasons leading to such a difference. The source of this effect might be found in the norms themselves, insofar as it is known that in semantic rating studies with lexical materials, participants do not simply evaluate the object denoted by the word; rather, they are also influenced in their judgments by statistical distributions in the language. For example, semantic transparency ratings are impacted by the frequency of constituent morphemes of the presented complex word (Bell & Schäfer, 2016). This kind of influence was also described for variables under investigation in the present article, such as concreteness (Hollis & Westbury, 2016) or perceptual modalities (Louwerse & Connell, 2011). In these works, this piece of evidence was interpreted in terms of language being able to encode grounded information. However, these results also indicate that, when producing semantic ratings, participants might be influenced by aspects of the presented word purely associated with nuanced patterns in its lexical distribution. If that is the case, one can argue that different languages, being associated with different linguistic distributions, will be associated with slightly different semantic norms, as produced by their speakers. In other words, it is conceivable that speakers of different languages will provide slightly different rating scores to the same (translated) items because of their different language experiences. These differences could explain the cross-linguistic dissociation described here: intuitions of Italian speakers, during the rating task, might have produced imageability scores that are better suited to capturing chronometric data than their English counterparts.

An alternative explanation refers to nonarbitrariness in natural language, and more specifically to iconicity, which reflects the resemblance between aspects of a word’s form and its meaning (for a review, see Dingemanse, Blasi, Lupyan, Christiansen, & Monaghan, 2015). Recently, iconicity has received special attention from both a psycholinguistic and a cognitive perspective. Across languages, iconic words are commonly used to drive perceptuo-motor analogies between word form and word meaning, such as referent color and shape, size, and temperature (Dingemanse, 2012). Despite the fact that Indo-European vocabularies have been considered to be highly arbitrary (e.g., Perniss, Thompson, & Vigliocco, 2010; Vigliocco, Perniss, & Vinson, 2014) as compared to some African, Asian, and American lexicons, converging evidence has suggested the emergence of interesting patterns even in these languages. For example, iconicity ratings have been shown to correlate with sensory experience and semantic neighborhood, imageability, frequency, and age of acquisition in English (Juhasz & Yap, 2013; Perry, Perlman, & Lupyan, 2015; Sidhu & Pexman, 2018; Winter et al., 2017). Focusing on cross-linguistic differences, this variable has been shown to vary across different languages (English vs. Spanish: Perry et al., 2015) and language modalities (Perlman, Little, Thompson, & Thompson, 2018), shaped by cultural evolutionary processes to favor learning, discriminability across categories, and communication (Dingemanse et al., 2015; Imai & Kita, 2014; Lupyan & Casasanto, 2015).

Adopting a similar perspective, it is possible that differences in iconicity between English and Italian could explain the cross-linguistic dissociation reported in the present study. That is, Italian words may be overall more iconic, and hence Italian imageability scores may carry some unique information that only partially overlaps with that of the perceptual scores. This idea is partially supported by our data: indeed, in line with Connell and Lynott (2012), adding perceptual modality measures to a model already containing imageability increased the model fit, but in Italian the opposite was also true, with the imageability scores significantly improving the regression already containing the five perceptual modalities, thus suggesting that both imageability and perceptual ratings added some unique components to the models (see supplementary materials, section B, for the analysis and a more detailed discussion); moreover, as is shown in Table 7, imageability scores were predicted not only by visual and olfactory perceptual ratings (as was the case in English), but also by haptic perceptual values. Further investigation will be needed in order to shed light on the observed cross-linguistic dissociation. However, irrespective of how the effect should be explained, the present results have important methodological implications, stressing the importance of having modality norms and, more generally, semantic norms in different languages: indeed, we cannot take for granted that results on English are generalizable to all other languages. In this sense, a standardized dataset like ANEW, which has been translated into different languages, can be a useful and powerful instrument to make cross-linguistic comparisons in a more controlled way.

Conclusions

In the present article, we have presented and released perceptual ratings for 1,121 Italian words. The richness and usefulness of the present resource lie in the fact that it ideally complements the largest norming work currently available in Italian (Montefinese et al., 2014). Moreover, given that the dataset comprises items from the English ANEW, which has adaptations in different languages (e.g., Spanish: Redondo, Fraga, Padrón, & Comesaña, 2007; Portuguese: Soares, Comesaña, Pinheiro, Simões, & Frade, 2012), our resource is particularly important from a cross-linguistic comparison perspective, as clearly exemplified by the third experiment described in the present work.

The strength of the present resource, however, is not limited to the perceptual-rating validation. Indeed, we released a set of trial-by-trial chronometric data collected with a word naming and a lexical decision task, which does not have any comparable example in Italian. The two novel experiments were performed to investigate the extent to which perceptual strength compares with concreteness and imageability in predicting human behavior, and specifically reaction times in word processing tasks. Our results provide evidence for cross-linguistic differences in the impact of these rating-based measures, suggesting caution in generalizing results obtained in English studies to other languages.

Beyond the ability they provide to predict RTs, the present rating norms are crucial to studying how the conceptual system is organized and how different words can imply different semantic representations due to the “channels” available for the acquisition of the information they refer to.