Introduction

Word association tasks comprise a set of implicit conceptual memory tasks in which subjects respond to a written or spoken cue word with the first word that comes to mind. Since the late 1970s, large-scale word association studies have contributed to behavioral research in at least two distinct ways: first, as a normative tool to measure lexical properties of interest that can be used when controlling variables in psycholinguistic experimental designs (such as association strength, set size, etc.) (Nelson et al., 2004; Wilson et al., 1988), and second, as a tool to explore and model the structure of semantic representations stored in memory (De Deyne et al., 2013, 2019; Dubossarsky et al., 2017; Elias Costa et al., 2009; Nelson et al., 2000; Steyvers et al., 2005; Steyvers & Tenenbaum, 2005).

The Small World of Words (SWOW) project aims to collect large-scale word association norms for many languages; norms are currently available in Dutch (SWOW-NL, De Deyne et al., 2013) and English (SWOW-EN, De Deyne et al., 2019). Through a dedicated website (www.smallworldofwords.org), responses from over 500,000 participants in 17 languages have already been collected. In the online word association task, participants generate up to three responses for each of 18 cue words. This procedure, where a participant provides multiple responses to a cue word, is referred to as a continued word association task, as opposed to a single word association task (e.g., Nelson et al., 2004) or a continuous association task (where participants associate to their previous responses). The main advantage of this procedure is that it provides both strong and weak associates in cases where there is a dominant first associate (e.g., hammer-nail). This is important both when experimental control is required (i.e., making sure two words are not associated) and when deriving secondary measures such as centrality or similarity measures to predict lexical processing (e.g., lexical decision response latencies) or semantic processing (e.g., similarity ratings). In this work, we introduce the SWOW norms for the Spanish spoken in Uruguay and Argentina. In contrast to previously published norms for Dutch and English, these are the first large-scale association norms for a Romance language with an extensive grammatical gender system.

Spanish is used by approximately 480 million speakers in more than 20 countries. It ranks second in the list of languages with the most native speakers and fourth in the number of total speakers (Eberhard et al., 2022). Many dialectal variants of Spanish can be distinguished across its large geographical distribution, which mainly encompasses Spain and the Americas. Although there is a growing corpus of psycholinguistic research in Spanish developed since the late 1980s and early 1990s (see, for instance, Carreiras et al., 1993; Cuetos & Mitchell, 1988), most normative studies for frequency, word association, affective and other lexical properties refer to Iberian Spanish, the standard variant spoken in Spain (see, for instance, Fernández et al., 2012; Hinojosa et al., 2021; Stadthagen-Gonzalez et al., 2017), with the notable exception of Aguasvivas et al. (2018).

Rioplatense Spanish, the variant spoken in most parts of Uruguay and some areas of Argentina (mostly centered around Buenos Aires), has many distinctive phonological, grammatical, and lexical features (Di Tullio & Kailuweit, 2011). These include substantial differences in the preferred words for everyday objects such as "car" (auto instead of coche), "cellphone" (celular instead of móvil), and "money" (plata instead of dinero), and in the relative meaning frequencies for homonyms (chucho means "dog" in Iberian Spanish but "cold shiver" in Rioplatense Spanish) (Armstrong et al., 2016). These ubiquitous lexical differences justify the construction of specific lexical norms for different Spanish dialectal variants (Fumagalli et al., 2017; Manoiloff et al., 2010; Sarli & Justel, 2021; Vivas et al., 2017).

This work introduces the "Small World of Words" free association norms for Rioplatense Spanish (SWOW-RP). To facilitate cross-linguistic comparisons, we use methods that match those employed in previous reports (De Deyne et al., 2013, 2019) and present the largest word association dataset currently available for Spanish and the first large-scale free association norms for Rioplatense Spanish. In the first part of this article, we compare our data to currently available association data for Spanish and explore the role of grammatical gender in cue-response pairings. In the second part, we assess the validity of response frequencies to explain word processing advantages in psycholinguistic reaction-time experiments and determine whether semantic similarity measures from a word association graph are consistent with direct human judgements of semantic similarity. We compare these results with similar analyses in Dutch (De Deyne et al., 2013) and English (De Deyne et al., 2019).

Methods

Participants

Participants were recruited online through web forums and social media ad campaigns targeting individuals located in Uruguay and the Argentinian provinces of Buenos Aires, Corrientes, La Pampa, and Santa Fe. Some participants were also recruited in classrooms of the School of Psychology of the Universidad de la República at Montevideo, Uruguay. Participants were directed to the Small World of Words webpage dedicated to Rioplatense Spanish (Footnote 1). They were then asked to report their age, gender, degree of education, and native language before starting the task.

There were 67,525 (82% identified as female, 17% as male, 1% as other) participants, with an average age of 38.3 years (SD = 15.2, min = 5, max = 99). Figure 1 illustrates the geographical location of most participants. Their maximum attained education level and age by gender distributions are shown in Fig. 2. Native language options included Argentinian Rioplatense (49% of participants), Nor-oriental—Guaraní (0.8%), Cordobés (1.4%) and Uruguayan Rioplatense (43%) variants, as well as an “Other” option (5%). Hence, over 90% of participants declared to be native Rioplatense Spanish speakers (see "Participant selection and data balancing" below). Most participants had completed either secondary (41%) or higher education (48%).

Fig. 1 Geographical location of participants

Fig. 2 Age, education level, and gender distribution of participants

Procedure

The study was conducted online and consisted of an introduction with a short explanation, followed by demographic questions and instructions. A continued word association task was used in which three different responses were asked for each cue word. Participants were instructed to respond only to the cue word of a given trial and not to their responses or previous cue words. The original instructions were translated and adapted from English (cf. De Deyne et al., 2019) and are provided in Appendix 1.

Each trial consisted of a cue word placed on a gray background at the top of the screen. Below the cue were three blank text fields in which participants were asked to write down the first (R1), second (R2), and third (R3) words that came to mind after reading the cue. Once a response field was completed (by pressing "enter" or clicking a button), the text color would change to gray, and the response could no longer be edited. Participants were asked to press buttons labelled "No conozco la palabra" (I don't know the word) and "No más respuestas" (No more responses), located below the response fields, if they did not know the cue word or could not come up with more responses, respectively.

Each participant responded to a unique set of 18 stimuli. The stimuli were mostly selected semi-randomly by picking a set of at least 1000 cues with the lowest number of responses aggregated over all participants. In some cases, cue words that were present in other studies were also included to maximize the utility of the norms across different studies (see below). Reaction time data were also collected for each response, although not analyzed in this work.

Stimuli

Cue words were selected using the snowballing procedure described in De Deyne et al. (2019). First, an initial set of 1000 high-frequency words (mostly nouns, verbs, adverbs, and adjectives) was selected using the SUBTLEX-ESP frequency norms (Cuetos et al., 2012). Subsequent cues were added in batches of 1000 words, selected from the most frequent words collected so far as responses. Words that appeared in other normative or experimental studies were also included to maximize the utility of the norms for future studies. The final set consisted of 13,457 cues, which included 419 words (77%) from a study that reports Spanish similarity norms (Moldovan et al., 2015), 1017 words (98%) from the Spanish adaptation of the ANEW affective norms (Redondo et al., 2007), and 396 words (67%) from the affective norms of Guasch et al. (2016).

Data preprocessing

The raw data comprised 3,646,335 responses, which were subjected to a series of preprocessing steps to remove inputs from non-compliant participants and standardize the spelling of words. Both the complete raw data and the preprocessed data are made publicly available. First, we removed punctuation characters, quotes, tags, hyphens, and extra whitespace characters. In some cases, participants wrote "No sé" (I don't know), "?", "No más respuestas" (No more responses) or variants of these instead of pressing the matching buttons. These responses were recoded as if the participants had used the buttons instead. Repeated responses to the same cue were recoded by keeping the first occurrence and treating the later repetitions (in R2 or R3) as missing responses. This affected 1625 responses (< 0.05% of the total). Some participants wrote three words in the first (R1) text field, separated by commas or blank spaces, leaving the other two text fields (R2 and R3) empty. For participants who repeatedly showed this pattern (for more than 30% of the cues), the three-word responses in R1 were split and sequentially assigned to R1, R2, and R3, with any remaining responses in the second and third text fields discarded. This was done for 4359 responses (0.12% of the responses).

Written forms were normalized by checking the spelling, including the capitalization of words. Each participant's use of capitalization was checked: if more than 15% of a participant's responses were in all caps or initial caps, all of that participant's responses were converted to lowercase. This step affected 7619 responses. Responses (and a few cues; Footnote 2) that unambiguously referred to proper nouns or acronyms were capitalized appropriately (e.g., juan becomes Juan, onu becomes ONU). Next, we set up normalization rules to convert responses with alternate or non-canonical spellings. For example, under these rules baso becomes vaso (glass), logico becomes lógico (logical), and harmónica becomes armónica (harmonica). In most cases, we followed the recommended spellings according to the Diccionario de la Lengua Española (Real Academia Española, n.d.). Where multiple spellings are acceptable, or the word is not listed in the dictionary, the most frequent response spelling was adopted as the canonical form. For instance, yogurt and yoghurt became yogur, and sutien and sutién became soutien. To the best of our ability, we manually crafted these rules upon inspection of the responses, avoiding correcting ambiguous cases. Responses for cues that were initially included with alternative spellings were combined with responses to cues with the standard form: anana with ananá, obscuro with oscuro, inecesario with innecesario, and pobresa with pobreza. This step affected a total of 146,224 responses (4%). Finally, punctuation marks and symbols such as hyphens, parentheses, and exclamation marks, as well as extra whitespace, were removed in 3522 responses.

Participant selection and data balancing

As in previous studies in Dutch and English (De Deyne et al., 2013, 2019), several criteria were used to screen participants with irregular responses. We removed participants who responded using short sentences, frequently gave the same response to a cue, usually gave non-Spanish words as responses, or indicated they did not know most cues. Participants were excluded if more than 30% of their responses consisted of multi-word strings (15 participants matched this condition), if they had more than 20% of repeated responses to cues (307 participants), if they marked as unknown or did not respond to more than 60% of cues (986 participants), or if more than 60% of their responses were not present in a list of Spanish words (722 participants). This list was compiled by combining word forms occurring at least twice in the SUBTLEX-ESP norms, a custom list of common proper nouns, and the list of spelling corrections used in the study (see "Data preprocessing"). Finally, we also excluded participants under 16 years of age (897 participants). In total, 2030 participants (3%) that matched at least one of these conditions were removed, and the resulting norms were based on responses from a set of 64,598 valid participants.

Due to data collection logistics and the participant selection procedure described above, the data at this stage contain a different number of responses for each cue. The final dataset includes 70 responses for almost all cues, while 308 cues have at least 63 responses. In cases where more than 70 responses were available (12,808 cues), we selected 70 responses as follows. First, we selected responses from native Rioplatense non-female participants. Then, we randomly selected responses from native Rioplatense female participants. If needed, data from non-female and female participants who selected other Argentinian native language options were included. Data from participants in the “Other” native language category were not included in further analyses. This procedure resulted in 89% of cues having at least 95%, 3% of cues having less than 90%, and just one cue having 80% of responses from native Rioplatense participants.

Participant gender imbalance

Regarding gender, the median proportion of responses from female participants was 84% (mean 80%), with 25% of cues having less than 74%. This proportion is greater than in the English (68%) and Dutch (66%) norms. We explored whether this imbalance could significantly impact the association data by selecting 2739 cues that had a proportion of responses from female participants between 0.3 and 0.7. Then, we calculated the cosine similarity between the response vectors from female and male participants, after discarding responses of frequency 1 for each cue. The mean cosine was 0.83, with quartiles 1 through 3 being 0.7, 0.83, and 0.92, respectively. We then implemented a randomization test, in which we randomly permuted the gender tag of each participant and then calculated the cosine similarity scores between the two partitions. After 10,000 random permutations, we found evidence to reject the null hypothesis for 2.4% of cues at the 0.01 significance level, and for 0.5% of cues at the 0.001 significance level. Hence, although there is a greater gender imbalance than in previous SWOW datasets, we estimate that this leads to greatly altered association patterns for only a small set of cues (Footnote 3).
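The randomization test described above can be sketched for a single cue as follows. This is a minimal illustration, not the code used for the norms: the function names, the "F"/"M" labels, and the simplification of one response per participant are assumptions, and the step that discards responses of frequency 1 is omitted.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_cosine(responses, labels):
    """Cosine between the response counts of the two gender groups."""
    female = Counter(r for r, g in zip(responses, labels) if g == "F")
    male = Counter(r for r, g in zip(responses, labels) if g == "M")
    return cosine(female, male)


def permutation_p_value(responses, labels, n_perm=1000, seed=0):
    """One-sided test: is the observed cosine unusually LOW under shuffling?

    Assumes one response per participant, so shuffling the label list is
    equivalent to permuting the gender tag of each participant.
    """
    rng = random.Random(seed)
    observed = group_cosine(responses, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if group_cosine(responses, shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical cue where the two groups never give the same response.
toy_responses = ["a"] * 10 + ["b"] * 10
toy_labels = ["F"] * 10 + ["M"] * 10
observed, p_value = permutation_p_value(toy_responses, toy_labels)
```

A low p-value indicates that the two groups' response distributions are less similar than expected if gender tags were exchangeable.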

After participant selection and data balancing, on average, cues were marked as unknown about 3.3% (min = 0%, max = 59%) of the time. Missing R2 (and hence also R3) responses occurred on average in 17% of cases, whereas missing R3 responses account for 8% of the data. These results are somewhat higher than previous findings for English where 2.5% of responses were unknown, but only 4.3% of R2 were missing, and 9.2% of R2 + R3 combined responses were missing (De Deyne et al., 2019). They were also higher than findings in Dutch (De Deyne et al., 2013) where the average percentage unknown was 1.5%, 1.3% of the data were missing R3 responses, and 4.4% of the data were missing R2 + R3 responses. Implications will be discussed in the General discussion.

Results

The following sections report the distributional properties of cues and response types and tokens and the prevalence of unknown cues and missing responses. Next, we explore the validity of the collected data by using centrality measures and derived semantic measures to account for variability in response times and similarity ratings in the psycholinguistic studies available for Spanish.

Types and tokens

There were 132,584 distinct word forms (types) when aggregating across all three responses (R123), 61% of which appeared only once. In the dataset containing only the first responses (R1), there were 73,145 types, 59% of which appeared only once. Table 1 shows the most frequent responses in the dataset in terms of type and token frequencies, that is, the number of cues for which a word form was given as a response, and the overall count of response occurrences, respectively. Results are shown both for the first response only (R1) and for all three responses given for each cue (R123). As can be seen from Table 1, the most frequent words tend to refer to basic human needs, which agrees with previous findings for SWOW-NL and SWOW-EN.

Table 1 Top 10 response tokens and types for the SWOW-RP dataset and English SWOW-EN dataset (De Deyne et al., 2019) based on the first (R1) response and all responses (R123)

Graph construction and response coverage

Word associations lend themselves naturally to be encoded as a semantic network where nodes represent words, and links indicate an unlabeled semantic relation between a cue and response word. While word association data can be encoded as low-dimensional embeddings (Mikolov et al., 2013; Richie & Bhatia, 2021) as well, the following section will primarily focus on constructing semantic graphs as these provide a single framework to study topological properties of the lexicon at multiple scales (De Deyne et al., 2019; De Deyne & Storms, 2008; Steyvers et al., 2005; Steyvers & Tenenbaum, 2005), investigate how important a word is in terms of centrality, and determine how both direct and indirect edges connecting pairs of words contribute to semantic processing.

In a semantic graph G based on word associations, nodes represent words and are connected to other nodes through weighted edges with values corresponding to the probability p(r|c) of obtaining a response r for a given cue c. Like previous work in Dutch and English, we include in the graph only words that appeared both as cues and responses and retained only the largest strongly connected component. In a directed graph, a set of nodes comprise a strongly connected component if every node can be reached from every other node in the set. In this way, we only include words with both incoming and outgoing links, and there is at least one path connecting every word pair in the network. This will allow us to calculate spreading activation as a random walk in the next section.
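The construction described above can be illustrated with a short sketch. It assumes that edge weights are renormalized over the responses retained for each cue, which is one possible reading of the procedure; the function names and the toy cue-response pairs are hypothetical.

```python
from collections import Counter, defaultdict


def association_graph(pairs, cues):
    """Edge weights p(r|c), keeping only responses that are also cues."""
    counts = defaultdict(Counter)
    for cue, response in pairs:
        if cue in cues and response in cues:
            counts[cue][response] += 1
    graph = {}
    for cue, ctr in counts.items():
        total = sum(ctr.values())
        graph[cue] = {r: n / total for r, n in ctr.items()}
    return graph


def largest_scc(graph):
    """Largest strongly connected component (iterative Kosaraju)."""
    nodes = set(graph) | {r for nbrs in graph.values() for r in nbrs}
    reverse = defaultdict(set)
    for cue, nbrs in graph.items():
        for response in nbrs:
            reverse[response].add(cue)

    # Pass 1: record nodes in order of depth-first completion.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, neighbours = stack[-1]
            nxt = next((n for n in neighbours if n not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(graph.get(nxt, ()))))

    # Pass 2: explore the reversed graph in reverse completion order.
    best, assigned = set(), set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for prev in reverse.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        if len(component) > len(best):
            best = component
    return best


# Hypothetical cue-response pairs; "d" and "e" fall outside the cycle a-b-c.
toy_pairs = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
toy_graph = association_graph(toy_pairs, cues={"a", "b", "c", "d", "e"})
core = largest_scc(toy_graph)
```

In this toy example only the three words on the cycle remain, because the other two cannot be reached from, and return to, every other word.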

We derived two graphs from the data: one that includes only each participant's first response (GR1) and one that includes all three responses (GR123). These graphs included a total of 13,315 and 13,458 words, respectively. Hence, in the GR1 graph 184 words originally included as cues were removed because they were never given as a first response. Of these, 40 were not given as a second or third response either and were therefore also removed from the GR123 graph.

Given the large number of word types produced (over 130,000), many types are excluded when only those that were presented as cue words are considered. However, since most word types occur only once, the graphs still represent the words that are commonly used in language. To demonstrate this, we calculated how censoring responses that were not part of the set of cues affected the response coverage. This was done by computing, for each cue, the percentage of responses that were part of the set of more than 13,000 cue words. Most cues have coverage values above 80%, with a median of 90% and a mean of 88% for R1, and a median of 88% and a mean of 87% for R123. Hence, a high fraction of the responses generated by participants were themselves cue words and were therefore included in both graphs.
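As a rough illustration of the coverage measure, the following sketch computes, for each cue, the percentage of response tokens that belong to the cue set. The function name and the toy data are hypothetical, and missing or unknown responses are assumed to have been removed beforehand.

```python
from collections import Counter
from statistics import mean, median


def coverage_by_cue(pairs, cues):
    """Percentage of each cue's response tokens that are themselves cues."""
    total, kept = Counter(), Counter()
    for cue, response in pairs:
        total[cue] += 1
        if response in cues:
            kept[cue] += 1
    return {cue: 100.0 * kept[cue] / total[cue] for cue in total}


# Hypothetical data: "ladrido" was never presented as a cue.
toy_pairs = [
    ("perro", "gato"),
    ("perro", "ladrido"),
    ("gato", "perro"),
    ("gato", "perro"),
]
coverage = coverage_by_cue(toy_pairs, cues={"perro", "gato"})
summary = (median(coverage.values()), mean(coverage.values()))
```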

Grammatical gender and assortativity

Previous work suggests that word association networks exhibit assortative mixing: cues and responses tend to be congruent in part-of-speech, valence, concreteness, and potentially several other factors (Stella et al., 2019; Van Rensbergen et al., 2015). English has a natural gender system: that is, only some nouns are gendered according to the biological sex of their referents. The situation for Dutch is somewhat complex, as gender distinctions between feminine, masculine, and neuter words can vary between Flanders and the Netherlands. Grammatical gender in Dutch is marked primarily by the use of a specific article or demonstrative pronoun (e.g., de hond (the dog, masc.) vs. het boek (the book, neuter)). However, articles are typically omitted in Dutch word associations.

In contrast, Spanish is a Romance language where grammatical gender features more prominently: all nouns have a grammatical gender assigned (masculine or feminine), and there is gender agreement between nouns, pronouns, determiners, and adjectives. For instance, compare the noun phrase el lago frío [the_masc cold_masc lake_masc] with la sopa fría [the_fem cold_fem soup_fem]. Hence, pronouns and some adjectives have alternative gendered forms with the same grammatical role and semantic meaning but different grammatical gender, as in the frío and fría example. Since gender agreement is required for all nouns and their accompanying pronouns and adjectives, Spanish provides a unique opportunity to determine whether association patterns reflect a tendency towards agreement. If so, we expect congruency (i.e., assortative mixing) between cues and responses for grammatical gender. The goal of the following section is to provide an initial exploration of the role of grammatical gender agreement between cues and responses. In order to fully explore this phenomenon, we consider here all responses, not only those appearing as cues.

Gender balance in cues and responses

For each word, the grammatical gender and number were annotated using the UDPipe parser (Straka & Straková, 2017) and the ANCORA Spanish language model (Taulé et al., 2008). This allowed us to obtain gender tags for 94,567 word types, which cover 55% of the R123 response tokens and 69% of cues.

The gender distribution of responses is shown in Fig. 3. Overall, in both R1 and R123 data, about half the responses are feminine, with a somewhat stronger bias towards masculine responses in R1. However, this distribution is contingent on cue gender: feminine cues elicit about 61% of feminine responses (59% in R123). Conversely, masculine cues elicit about 63% (60% in R123) of masculine response tokens. Although the association between gendered cues and responses can be categorized as weak (the phi coefficient is φ = .23), it is nonetheless statistically significant (χ²(1) = 29029, p < .001). Considering response types instead of tokens yields essentially the same results (not shown).
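For reference, the chi-square statistic and the phi coefficient for a 2 × 2 table of cue gender by response gender are related by φ = √(χ²/N). The sketch below uses the standard formula without continuity correction; whether a correction was applied in the reported analysis is not stated, and the counts are hypothetical, not the counts from the norms.

```python
import math


def chi_square_and_phi(table):
    """Pearson chi-square (no continuity correction) and phi for a 2x2 table.

    table = [[a, b], [c, d]], rows = cue gender, columns = response gender.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    phi = math.sqrt(chi2 / n)
    return chi2, phi


# Hypothetical counts: rows are feminine/masculine cues,
# columns are feminine/masculine responses.
chi2, phi = chi_square_and_phi([[61, 39], [37, 63]])
```

Because χ² grows with N while φ does not, a weak association can still be highly significant in a dataset with hundreds of thousands of tokens.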

Fig. 3 Percentage of gendered responses labeled as masculine and feminine by cue gender

The following example illustrates this assortativity in grammatical gender. The adjective hermoso [handsome_masc] elicits lindo [pretty_masc] and bello [beautiful_masc] as the two strongest associates, whereas hermosa [handsome_fem] elicits linda [pretty_fem] and bella [beautiful_fem]. In this case, although hermoso and hermosa would end up with rather different sets of associates, this does not reflect a critical semantic distinction between the cues, but rather their difference in grammatical gender. Importantly, however, the third strongest associate for hermosa (occurring ten times) is mujer [woman], while the complementary word hombre [man] was not elicited even once in response to hermoso. These kinds of asymmetries may reflect subtle semantic connotations embedded in grammatical gender, as has been found using experimental approaches (Phillips & Boroditsky, 2003). They also raise the question of when it is appropriate to combine gendered forms, as this might affect both how central a word is (i.e., centrality might be split over two gendered forms) and how similar two words are (i.e., two words might share many lemmas where gender suffixes are removed, but few gendered word forms).

Validation

The previous sections described different aspects of our dataset. In what follows, we set out to test the validity of the SWOW-RP data in three stages. First, we compare our data with previously available free association data for Spanish. Then, we determine whether network centrality measures explain variability in psycholinguistic experiments, particularly reaction times in lexical decision tasks. Finally, we test whether network similarity measures can provide a general index of semantic similarity for all cue words by correlating these measures with human similarity judgments for abstract and concrete words.

Comparison to other Spanish norms

So far, the largest set of published norms in Spanish is the University of Salamanca Free Association Norms for Iberian Spanish (Normas de Asociación Libre del Castellano: NALC; 4819 cues; Fernández et al., 2004, 2012). To the best of our knowledge, the only published norms for Rioplatense Spanish are the sets obtained by Manoiloff et al. (2010) using the names of the 400 images in the expanded Snodgrass and Vanderwart (1980) picture naming study, and the norms by Luna et al. (2016) for 407 words. These norms include primarily people from Córdoba, Argentina, where a distinct yet closely related variety of Spanish is spoken (Silva-Corvalán, 2001).

We compared our data to the NALC, Manoiloff et al., and Luna et al. datasets in terms of shared cues, response distribution, and response similarities. Shared cues reflect the percentage of cues in each dataset that are included in our norms. Response distribution was measured in several ways. First, we considered whether the strongest associate for each cue in these datasets matched the strongest associate (first associate overlap) or was included in the top 3 associates (first associate in top 3) in our dataset. Second, we compared the entire distribution for two measures: overall response frequency and forward strength entropy.

Forward strength entropy was calculated as:

$$H(x)=-\frac{\sum_{i=1}^{N}p\left({y}_i\right){\log}_2p\left({y}_i\right)}{\log_2(N)}$$

where p(y_i) is the proportion of participants who responded y_i to the cue x, and N is the total number of response types for that cue. Hence, cues for which all responses were the same would have a forward strength entropy value of 0, and cues for which all response types occurred with the same frequency (including the case where all responses were different) would have an entropy equal to 1. Finally, we compared the overall similarity of association strengths as follows. For each cue, a vector of association strengths for each response was constructed. Then, we calculated the cosine between the vectors for the same cue from both datasets and computed the median value across cues.
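A minimal sketch of both measures is given below. It uses the standard sign convention for Shannon entropy, so that values fall between 0 and 1 as described in the text, and it defines the entropy as 0 when a cue has a single response type (where the normalizing term would otherwise be zero). Function names and toy data are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def forward_strength_entropy(responses):
    """Shannon entropy of a cue's responses, normalized by log2(N types)."""
    counts = Counter(responses)
    n_types = len(counts)
    if n_types < 2:
        return 0.0  # convention: a single response type carries no uncertainty
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(n_types)


def strength_cosine(strengths_a, strengths_b):
    """Cosine between two {response: association strength} vectors."""
    shared = strengths_a.keys() & strengths_b.keys()
    dot = sum(strengths_a[r] * strengths_b[r] for r in shared)
    norm_a = math.sqrt(sum(v * v for v in strengths_a.values()))
    norm_b = math.sqrt(sum(v * v for v in strengths_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def median_cosine(norms_a, norms_b):
    """Median cosine over the cues present in both sets of norms."""
    shared_cues = norms_a.keys() & norms_b.keys()
    return median(strength_cosine(norms_a[c], norms_b[c]) for c in shared_cues)


# Hypothetical strengths for one cue in two sets of norms.
toy_a = {"auto": {"rueda": 0.6, "ruta": 0.4}}
toy_b = {"auto": {"rueda": 0.6, "ruta": 0.4}}
example = median_cosine(toy_a, toy_b)
```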

Table 2 shows the results of comparing these measures between our norms and previously published norms in terms of overlap or correlations between variables. Although cue overlap and forward strength entropy are high, the overlap in the first associate for each cue is not. Nonetheless, the first associate in these datasets is usually present in the top 3 associates in GR1, and for half the cues, the cosine similarity is higher than 0.6. In most cases, these measures indicate less similar response patterns for the NALC dataset, which could be because these norms correspond to Iberian Spanish rather than to Rioplatense Spanish or the similar Cordobés variant.

Table 2 Comparison between previous Spanish association norms with SWOW-RP-1 and SWOW-RP-123

To further explore the source of discrepancies between NALC and SWOW-RP, we calculated the difference in overall log response frequencies for words present in both datasets, normalized per 10,000 responses. Figure 4 shows the distribution of these differences, along with the ten words with the largest differences in each direction. Although the difference in frequency is less than one order of magnitude for most words, many words appear over ten times more often in one dataset than in the other. The extreme cases consist of words used mostly in one Spanish variant but very rarely in the other. Some of these words refer to places (Salamanca, Canarias), some to objects and instruments (jersey, lapicera [pen], auto [car], autos [cars], valija [suitcase]), while others are colloquial expressions (guapa [beautiful], guagua [bus]), or even adverbs and adjectives (adentro [inside], enojado [angry]). This supports the notion that differences between datasets reflect dialectal differences in word usage between Iberian and Rioplatense Spanish.
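The frequency comparison can be sketched as follows. The use of base-10 logarithms is an assumption (it matches the "order of magnitude" reading, but the base is not stated), and the function name and counts are hypothetical.

```python
import math


def log_frequency_difference(counts_a, counts_b, per=10_000):
    """Difference in log10 response frequency (per `per` responses)
    for words present in both datasets; positive means more frequent in A."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    shared = counts_a.keys() & counts_b.keys()
    return {
        word: math.log10(counts_a[word] / total_a * per)
        - math.log10(counts_b[word] / total_b * per)
        for word in shared
    }


# Hypothetical response counts in two sets of norms.
toy_a = {"auto": 100, "casa": 900}
toy_b = {"auto": 10, "casa": 990, "coche": 50}
differences = log_frequency_difference(toy_a, toy_b)
# A value near 1 means the word is about ten times more frequent in toy_a.
```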

Fig. 4 Distribution of differences in log frequency of responses present in both NALC and SWOW-RP-R1 datasets. Insets show responses with the largest differences in each direction

Behavioral predictions in lexical and semantic tasks

Although Spanish is a widely spoken language, far fewer psycholinguistic datasets are publicly available for it than for English. To determine the criterion validity of our data, we focused on two tasks for which data are available and which have been used to evaluate measures derived from associations in previous work (De Deyne et al., 2013, 2019). The first task uses response latencies from a lexical decision task (LDT) to determine the validity of network centrality indices. The second task uses pairwise semantic similarity judgments to determine the role of weighting response strength distributions and the contribution of indirect paths in measuring network-based cosine similarity.

Lexical decision task

Previous work in English and Dutch indicates that network centrality measures derived from word associations explain unique variability in lexical decision and categorization tasks, even after partialing out standard corpus-based frequency estimates (De Deyne et al., 2013, 2019). This suggests that network centrality, estimated here from the number of times a word is given as a response, captures how important a word is. This can be contrasted with corpus-based measures of centrality, such as word frequency or contextual diversity, which capture how often words are used in natural language. The two types of measures are likely to capture complementary information, as some words are infrequently mentioned in natural language even though they feature prominently in daily life (e.g., toothbrush).

Here we set out to reproduce this result for Spanish data. However, few large-scale lexical decision studies are available in Spanish, and only one study included Rioplatense Spanish speakers. Two datasets with response latencies in a lexical decision task (LDT) were used. The first dataset (González-Nosti et al., 2014) included responses to 2765 words from Iberian Spanish speakers. The second dataset, SPALEX (Aguasvivas et al., 2018), consists of responses from a large-scale lexical decision experiment comprising 44,853 words and more than 169,000 participants from Spain and Latin America, including around 7% (12,096) of participants from Argentina and Uruguay.

Figure 5 shows the correlation coefficients between word frequencies and mean decision latencies (both log-transformed) for these experiments. First, both R1 and R123 datasets fared similarly, with only a slight advantage for the latter in SPALEX data. Correlations for both SPALEX and the González-Nosti et al. data are close to .60 for SWOW-RP, NALC, and SUBTLEX-ESP, which provides word frequency norms based on subtitles (Cuetos et al., 2012). Interestingly, if the SPALEX data are split according to whether the participant is from Spain or not (Iberian vs. non-Iberian split), the correlation is marginally higher for SWOW-RP with R123 than for NALC in the non-Iberian set. Conversely, NALC estimates outperform SWOW-RP for the Iberian set. Importantly, response frequencies explained residual variance in decision latencies after partialing out word frequency estimates from SUBTLEX-ESP (and conversely, word frequency estimates explained residual variance after partialing out response frequency data). This suggests that response prevalence in association tasks and corpus-based word frequency estimates account for different aspects of psychologically relevant variability in word recognition, as previously reported for SWOW-EN and SWOW-NL (De Deyne et al., 2013, 2019).
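One common way to check whether a predictor explains residual variance is a partial correlation: both variables are regressed on the control variable, and the residuals are correlated. The sketch below illustrates this idea in plain Python; it is not necessarily the procedure used for the analyses reported here, and all names and values are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = _mean(x), _mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """Residuals of the least-squares regression of y on x."""
    mx, my = _mean(x), _mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


def partial_correlation(x, y, control):
    """Correlation between x and y after partialing out `control`."""
    return pearson(residuals(x, control), residuals(y, control))


# Hypothetical log-transformed values for six words.
log_corpus_freq = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
log_response_freq = [2.5, 3.5, 6.5, 7.5, 10.5, 11.5]
log_latency = [6.9, 6.8, 6.5, 6.5, 6.2, 6.1]
r_partial = partial_correlation(log_response_freq, log_latency, log_corpus_freq)
```

A partial correlation that stays clearly different from zero suggests that response frequency carries information about latencies beyond what corpus frequency provides.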

Fig. 5

Pearson correlation coefficients between mean reaction times in lexical decision and word frequency in free association data and the SUBTLEX-ESP norms. Only words present in all frequency datasets were included (SPALEX: n = 11452; González-Nosti: n = 2616).
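The partialing-out analysis described above can be illustrated with a minimal Python sketch. It assumes a residual-based partial correlation (regress both variables on the control variable, then correlate the residuals); the exact procedure used for the reported analysis is not specified here, and the function names and synthetic data are hypothetical, not SWOW-RP values.

```python
import numpy as np

def residualize(y, x):
    """Residuals of y after an ordinary least-squares fit on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def partial_corr(y, x, z):
    """Pearson correlation between y and x after partialing z out of both."""
    ry, rx = residualize(y, z), residualize(x, z)
    return float(np.corrcoef(ry, rx)[0, 1])

# Synthetic illustration only: latencies depend on both frequency measures.
rng = np.random.default_rng(0)
n = 2000
log_word_freq = rng.normal(size=n)                                  # stand-in for corpus frequency
log_resp_freq = 0.6 * log_word_freq + 0.8 * rng.normal(size=n)      # stand-in for response frequency
log_rt = -0.5 * log_word_freq - 0.3 * log_resp_freq + rng.normal(size=n)

# Correlation of response frequency with latency once word frequency is removed.
r_partial = partial_corr(log_rt, log_resp_freq, log_word_freq)
```

A non-zero `r_partial` in this setup indicates that the second predictor carries information about latencies that the first does not, which is the sense in which the two frequency measures are described as complementary.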

Semantic similarity

Early models of lexical-semantic knowledge characterized the lexicon as a semantic network or graph consisting of nodes linked by relational properties that denoted different types of relatedness between concepts (Collins & Loftus, 1975). This network-based representation of word meaning contrasts with the notion of semantic spaces, in which the coordinates in a low-dimensional space represent the meaning of a word, and semantic similarity is measured in terms of cosine similarity between the coordinates of two words. Since the 1990s, techniques such as latent semantic analysis have been used to obtain semantic spaces from large text corpora (Dumais, 2003; Landauer & Dumais, 1997). More recently, word embedding algorithms such as word2vec have further improved prediction across a range of semantic tasks (Bojanowski et al., 2017; Mikolov et al., 2013; Pennington et al., 2014). Regardless of the specific model or source of data (natural language or elicited in a psychological experiment), low-dimensional representations often provide better predictions of semantic similarity as they address the sparsity issue of language data, such as the fact that words can co-occur with many more words than those observed in a text corpus.

Word association data have also been used to derive semantic word representations, using either graph or vector representations (Elias Costa et al., 2009; Griffiths et al., 2007; Hills & Kenett, 2022; Richie & Bhatia, 2021; Steyvers et al., 2005; Wulff et al., 2019, 2022). Recently, De Deyne et al. (2019) described a measure of semantic similarity based on a random walk on word association graphs. Similar to word embeddings, a random walk addresses the sparsity issue of a word being associated with a small number of other words. Instead of comparing only the direct neighbors, the random walk also considers indirect neighbors. Similarly, low-dimensional spaces can also be directly derived from word associations using spectral methods applied to graphs (Steyvers et al., 2005). To provide a straightforward comparison with text-based models, we also obtain a word vector representation by applying embedding algorithms to the association graphs and compare the performance of these methods to some state-of-the-art word embeddings used to predict human semantic similarity judgments.

Graph-based similarity

To derive semantic similarity measures directly from the association network we take the same approach as in previous work (De Deyne et al., 2019). First, we construct a graph G from the association strength between cue words, as described earlier. We then replace each weight p(r|c) with the positive pointwise mutual information between response and cue:

$$\textrm{PPMI}\left(r|c\right)=\max \left(0,{\log}_2\left(\frac{p\left(r|c\right)}{p(r)}\right)\right)=\max \left(0,{\log}_2\left(\frac{N\,p\left(r|c\right)}{\sum_ip\left(r|{c}_i\right)}\right)\right)$$

where N is the number of cue words and the sum runs over all cues c_i. These values are then normalized so that, for each cue word, the sum of its association strengths is 1. Hence, the rows of the resulting adjacency matrix, which we identify as P, sum up to 1. Since it contains all direct associations between words, P could be used as the basis of a similarity metric. One could simply define the similarity between words a and b as their direct association strength, using for instance the geometric mean of the strengths in both directions, a→b and b→a. This implies that if one or both strengths are zero (i.e., they are not reciprocal associates), their similarity would also be zero. Since there are many word pairs that are not typically associated but have non-zero similarities (cheese and mayonnaise, for example), this simplistic measure should be expanded to include shared associates: that is, indirect paths from a to b (or vice versa) mediated by other words. In this case, two words would be more similar if they showed similar association patterns. This criterion can be easily implemented using the cosine between the corresponding rows of P.
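As a rough illustration of the PPMI weighting, row normalization, and cosine comparison described above, the following Python sketch operates on a toy matrix. It assumes that p(r) is estimated as the average of p(r|c) over cues; the function names and the four-word matrix are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def ppmi_rows(P):
    """PPMI transform of a cue-by-response matrix of p(r|c), then row normalization."""
    n_cues = P.shape[0]
    p_r = P.sum(axis=0) / n_cues                 # p(r) = (1/N) * sum_i p(r|c_i)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(P / p_r)                   # -inf where p(r|c) = 0
        ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)
    row_sums = ppmi.sum(axis=1, keepdims=True)
    # Rows with no positive PMI are left as zeros instead of dividing by zero.
    return np.divide(ppmi, row_sums, out=np.zeros_like(ppmi), where=row_sums > 0)

def cosine_rows(M, a, b):
    """Cosine between rows a and b of M (0.0 if either row is all zeros)."""
    denom = np.linalg.norm(M[a]) * np.linalg.norm(M[b])
    return float(M[a] @ M[b] / denom) if denom > 0 else 0.0

# Toy association strengths p(r|c) over four words (illustrative values only).
P_raw = np.array([
    [0.00, 0.50, 0.50, 0.00],
    [0.50, 0.00, 0.50, 0.00],
    [0.25, 0.25, 0.00, 0.50],
    [0.00, 0.00, 1.00, 0.00],
])

P = ppmi_rows(P_raw)
similarity_0_2 = cosine_rows(P, 0, 2)
```

In this toy case the response given by almost every cue (the third word) receives no PPMI weight from the first two cues, which shows how the transform discounts responses that are frequent regardless of the cue.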

The generalization of this notion of mediated association to paths of arbitrary length is akin to spreading activation. One way to implement this is to consider spreading activation as decaying random walks in the association graph (Abbott et al., 2015). Here we adopt the strategy described in De Deyne et al. (2016), where the similarity between two words is the degree of overlap between the distribution of direct and indirect associations arising from a random walk starting at each word. The corresponding matrix of indirect associations can be calculated as:

$${G}_{\textrm{rw}}={\left(I-\upalpha \textrm{P}\right)}^{-1}$$

where I is the identity matrix, and α is a decay parameter that we set to 0.75. The resulting graph is dense, frequency-biased and contains spurious edges. Consistent with De Deyne et al. (2016), a PPMI transform is applied to Grw to mitigate this issue. The resulting matrix can be used to compute semantic similarity between words a and b as, for instance, the cosine between the corresponding rows of the Grw matrix.
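The random-walk measure can be sketched in a few lines of Python. For a row-stochastic P and α < 1, the inverse (I − αP)^(-1) equals the sum of (αP)^k over all path lengths k ≥ 0, which is why longer paths contribute with geometrically decaying weight. The sketch below is illustrative only: the toy matrix is hypothetical, and renormalizing the rows of Grw before the PPMI step is an assumption rather than a documented detail of the procedure.

```python
import numpy as np

def random_walk_matrix(P, alpha=0.75):
    """G_rw = (I - alpha * P)^(-1), i.e. the sum over k >= 0 of (alpha * P)^k."""
    return np.linalg.inv(np.eye(P.shape[0]) - alpha * P)

def ppmi(M):
    """Row-normalize M (assumption), then apply PPMI with p(r) as the column mean."""
    M = M / M.sum(axis=1, keepdims=True)
    p_r = M.mean(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(M / p_r)
        return np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)

def cosine(M, a, b):
    """Cosine between rows a and b of M (0.0 if either row is all zeros)."""
    denom = np.linalg.norm(M[a]) * np.linalg.norm(M[b])
    return float(M[a] @ M[b] / denom) if denom > 0 else 0.0

# Toy row-stochastic association matrix over four words (illustrative values only).
P = np.array([
    [0.00, 0.50, 0.50, 0.00],
    [0.50, 0.00, 0.50, 0.00],
    [0.25, 0.25, 0.00, 0.50],
    [0.00, 0.00, 1.00, 0.00],
])

G_rw = random_walk_matrix(P, alpha=0.75)
G_ppmi = ppmi(G_rw)
sim_0_3 = cosine(G_ppmi, 0, 3)   # words 0 and 3 are not directly associated in P
```

Explicit inversion is only practical for small matrices; for a graph with thousands of cues, solving the linear system or truncating the series after a few steps would be a more economical choice.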

Graph embeddings

Another way to implement a similarity measure that is sensitive to indirect paths between words relies on graph embeddings. Graph embedding algorithms derive a low-dimensional approximation of the original graph. In the simplest case, these low-dimensional representations are derived by calculating a truncated low-dimensional solution through singular value decomposition, similar to LSA (Landauer & Dumais, 1997). More recent graph embedding approaches are different in that they aim to capture the topological structure of the graph as a whole, for example, by embedding random walks on graphs (Grover & Leskovec, 2016) or deriving a low-dimensional solution through singular value decomposition on an augmented graph that includes both direct and indirect edges (i.e., neighbors of neighbors). The latter closely matches the approach taken in the Word Association Space (WAS) measure by Steyvers et al. (2005). This approach carries the same advantages as other word embedding techniques in that the final representation is compact. That is, the meaning of a word is captured by a low-dimensional vector, typically consisting of a few hundred scalars. It often leads to improved predictions compared to the original full-size representation. The WAS measure is derived by applying a simple spectral embedding technique to obtain a set of embedding vectors for each cue. First, the matrices S1 and S2 were calculated:

$${\displaystyle \begin{array}{c}{\textbf{S}}_{\textbf{1}}=\textbf{P}+{\textbf{P}}^{\textbf{T}}\\ {}{\textbf{S}}_{\textbf{2}}={\textbf{S}}_{\textbf{1}}+{\textbf{S}}_{\textbf{1}}{\textbf{S}}_{\textbf{1}}^{\textbf{T}}\end{array}}$$

Note that S1 corresponds to a transformation to an undirected graph and that S2 is similar to a random walk of length two (i.e., the random walker only explores the neighbors of neighbors) without using a decay parameter. Then, we calculated the singular value decomposition \(\mathbf{S}_2=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\mathbf{T}}\). Finally, we obtained the forward embedding vectors for each word as:

$${S}_k={\textrm{U}}_k{\varSigma}_k^{1/2}$$

where Uk and Σk retain only the first k = 400 dimensions of the singular value decomposition.
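The three steps above (symmetrization, adding two-step paths, truncated SVD) can be sketched as follows. This is a minimal illustration on a hypothetical four-word matrix with k = 2, not the k = 400 solution reported here, and the function name is an assumption.

```python
import numpy as np

def was_embedding(P, k):
    """Spectral embedding: S1 = P + P^T, S2 = S1 + S1 S1^T, vectors = U_k Sigma_k^(1/2)."""
    S1 = P + P.T                       # undirected version of the graph
    S2 = S1 + S1 @ S1.T                # add neighbors of neighbors
    U, sigma, _Vt = np.linalg.svd(S2)  # singular values are returned in descending order
    return U[:, :k] * np.sqrt(sigma[:k])

# Toy row-normalized association matrix over four words (illustrative values only).
P = np.array([
    [0.00, 0.50, 0.50, 0.00],
    [0.50, 0.00, 0.50, 0.00],
    [0.25, 0.25, 0.00, 0.50],
    [0.00, 0.00, 1.00, 0.00],
])

E = was_embedding(P, k=2)              # one k-dimensional vector per word
cos_0_1 = float(E[0] @ E[1] / (np.linalg.norm(E[0]) * np.linalg.norm(E[1])))
```

For a full-size association matrix, a sparse truncated solver such as `scipy.sparse.linalg.svds` would avoid computing all singular vectors; the dense call is used here only to keep the sketch short.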

Correlation with word similarity data

To test our graph-derived similarity measures, we obtained data from the Moldovan et al. (2015) study on semantic similarity, the Spanish section of the Multi-SimLex corpus (Simlex-ES; Vulić et al., 2020), and recent experimental data for Rioplatense Spanish speakers on a relatedness task collected in our lab (De Deyne et al., 2020). We compared these ratings to the cosine similarity measures described above. We also included results from pretrained word embedding sets readily available for Spanish: fastText (Bojanowski et al., 2017; footnote 4), GloVe (Pennington et al., 2014; footnote 5), and word2vec (Mikolov et al., 2013), to provide a natural language baseline. The results are shown in Fig. 6. Across all datasets, we found that similarity measures that consider indirect paths (either through random walks or embeddings) outperform measures that only consider shared associates (SWOW-PPMI and SWOW-Raw), which replicates previous results in other languages (De Deyne et al., 2019). Moreover, the SWOW-derived embedding vectors perform at the same level as the random walk-derived measure Grw. In all cases, the SWOW-PPMI solutions outperform the SWOW-Raw similarity indices, suggesting that similarity judgments and the association generation task require different processes. Finally, the SWOW-derived similarity measures significantly outperform corpus-based word embeddings, replicating a pattern observed across a range of studies (De Deyne et al., 2016, 2019).

Fig. 6

Correlation between human similarity judgments and similarity scores obtained from word embedding algorithms and SWOW-RP data. SWOW-derived metrics outperform state-of-the-art corpus methods in the tested datasets

Interestingly, the similarity derived from the raw association frequency (SWOW-raw) performs better than corpus-based word embeddings in three out of four datasets. This in itself is significant as it implies that the meaning of a word can be represented explicitly through a small number of distinct associates, in contrast to the 300-dimensional representations derived from corpus-based word embeddings. In practice, choosing one of these measures depends on several considerations. For example, embeddings can be encoded in an economical format. On the other hand, representations in a random-walk augmented graph remain explicit, facilitating interpretation.

Altogether, the consistent results across all four studies suggest that word associations can be used to derive high-quality measures of semantic similarity when transparency, prediction, or experimental control of the relations between words is of concern.

Discussion

The norms introduced in this study are part of a multi-language international collaboration project, "The Small World of Words", that aims to collect online word association data for many languages from around the world. In previous works, we presented association norms for Dutch (SWOW-NL) and English (SWOW-EN) (De Deyne et al., 2019; De Deyne & Storms, 2008) and showed that including up to three responses to each cue word instead of just the first response used previously (e.g., Nelson et al., 2004; Wilson et al., 1988) improved the performance of the norm to explain variance in psycholinguistic experiments, such as lexical decision tasks, as well as semantic similarity judgment ratings. Here we introduce the first large-scale word association norm for Rioplatense Spanish (SWOW-RP), a variant spoken by more than 15 million people in Argentina and Uruguay. To our knowledge, this is the third free association dataset of this scale (more than 10,000 cues) and the first for a non-Germanic language. Hence, this work represents an important step toward developing a multilingual collection of word association data that enables the comparison of structural properties of the lexicon within and between languages (e.g., De Deyne et al., 2020).

While Spanish is an Indo-European language like Dutch and English, it differs from these languages in fundamental ways. For instance, in Spanish, grammatical gender is explicit for all nouns, and many adjectives have distinct forms for agreeing with masculine and feminine nouns. Here, we briefly explore the assortativity and balance of grammatical gender between cues and responses and find that although the grammatical gender of the different responses is balanced (about half of response types that have grammatical gender are masculine), response tokens tend to agree with cue gender (about 60% of responses have the same gender as the cue). This assortative mixing pattern is consistent with what is observed for other grammatical properties such as part of speech and number, and reflects the importance of grammatical gender in other areas of cognition such as those related to meaning (Boroditsky & Schmidt, 2000; Vigliocco et al., 2005). Several studies have used similarity judgements, feature production, or categorization to demonstrate how grammatical gender affects semantic processing (Cubelli et al., 2011). For instance, Phillips and Boroditsky (2003) asked native German and Spanish speakers to produce three adjectives that describe each word from a list of inanimate English nouns whose translations differed in grammatical gender in the two languages, for example sun, bridge (masculine in Spanish but feminine in German), or key (feminine in Spanish but masculine in German). For any given word, the adjectives produced tended to be perceived (by a different set of subjects) as more masculine (e.g., heavy, hard, big, dangerous) when the native word was masculine, and conversely, perceived as more feminine (e.g., elegant, fragile, peaceful, pretty) when the native word was feminine.
Recently, multilingual studies used corpus-derived word embeddings to study the relationship between grammatical gender and semantics, finding that grammatical gender for inanimate nouns is semantically correlated across languages (Williams et al., 2019), and that grammatical gender is correlated with the kinds of adjectives and verbs paired with inanimate nouns in the same sentences (Williams et al., 2021). In this context, our dataset could also be used to investigate how grammatical gender affects semantic connotations within and across languages. For instance, one could compute semantic similarities between alternative-gendered versions of adjectives (such as hermoso/a [beautiful] or rojo/a [red]), or between gendered nouns and words describing stereotypically masculine or feminine traits (such as strong or nurturing).

Despite language differences, the SWOW-RP norms are similar to the SWOW-EN and SWOW-NL in many respects. First, given that the procedure for obtaining the norms is the same, it is not surprising that all three datasets are similar in terms of number of words, number of responses, and participants. One fundamental difference is that SWOW-RP contains fewer responses per cue (70 instead of 100). Still, the cue coverage (the percentage of response tokens that are part of the cue set) is similarly high compared to Dutch and English, and the resulting networks consist of one large strongly connected component despite the lower number of responses. Another important difference is that SWOW-RP is heavily balanced towards female participants. Evidence showing the effects of participant’s gender on association patterns is scarce. Here we estimate this factor may not be important for most cues, since only a very small proportion of cues show significant differences in responses between genders. We recognize that this may be an underestimation, and there may be some cue–response pairs that show significant differences between genders. Nonetheless, when evaluating the word association norms using data from lexical decisions or semantic similarity judgments, the results not only followed the same qualitative pattern but also performed comparably in terms of prediction. This suggests that a smaller set of responses can suffice for a range of psychological applications, which renders data collection in new languages more feasible. The norms did differ from SWOW-EN and SWOW-NL when it came to missing R2 and R3 responses, which were higher for SWOW-RP. A variety of factors could contribute to this result, such as recruitment strategy (social media featured more prominently in SWOW-RP), a shift towards using mobile devices, or demographic differences. 
For now, the contribution of each of these factors remains entirely speculative, and more work is needed to identify which of them contributed to this difference. At a practical level, one implication is that the actual number of observations per stimulus presentation is lower in SWOW-RP, which will affect the magnitude of specific quantities such as network density and in-degree. As such, balancing the amount of data (for example, through subsampling SWOW-NL and SWOW-RP) would be advisable in future studies that aim to compare the networks in different languages.

Comparison to other Spanish norms

The norms presented here are also the most extensive dataset available for Spanish, covering more than twice as many cues as the NALC norms of Fernández et al. (2012). A direct comparison of the two datasets shows very similar included cues (~ 85% of cues included in NALC are also in SWOW-RP). We also found that response frequencies and cue entropies show moderate correlations. Although first associates matched only for 40% of cues, in about 65% of them, the first associate in NALC is present in the top 3 associates in SWOW-RP. Overall, the pattern of responses between the two datasets was quite similar. For half the cues, the cosine similarity between the vectors of response frequencies was equal to or higher than 0.76, and the correlation between total response frequencies (how many times each response appeared across the dataset) was also high (.86). When exploring the instances in which words differed significantly in their response frequency across datasets (see Fig. 4), it is evident that these reflect different usage frequencies between Iberian and Rioplatense speakers (e.g., dentro vs. adentro, baloncesto vs. básquetbol, espaguetis vs. tallarines, coche vs. auto) or proper nouns referring to local regions (salamanca, canarias). These differences in response frequencies are also observable when exploring the correlation with response latencies in lexical decision tasks. In the case of the González-Nosti et al. (2014) and SPALEX datasets, NALC has a narrow advantage over SWOW-RP, yielding slightly higher correlation values. However, when splitting SPALEX into Iberian (participants from Spain) and non-Iberian (participants from Latin America) datasets, the advantage increases in the former but disappears in the latter. Since the González-Nosti et al. study includes only Iberian Spanish speakers, these results point to a differential explanatory power of NALC vs. SWOW-RP due to differences in word usage, captured in the response frequencies in association norms, and latencies in psycholinguistic tasks.
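The per-cue comparison between two sets of norms can be sketched as a cosine between sparse response-frequency vectors. The Python snippet below is a minimal illustration; the function name is hypothetical and the counts are invented toy values, not figures from NALC or SWOW-RP.

```python
import math

def cosine_counts(a, b):
    """Cosine between two response-frequency vectors stored as {response: count} dicts."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented response counts for one cue in two hypothetical norm sets.
norms_a = {"auto": 30, "rueda": 10}
norms_b = {"auto": 20, "rueda": 5, "carretera": 5}

similarity = cosine_counts(norms_a, norms_b)
```

Responses that appear in only one of the two norm sets contribute to the vector lengths but not to the dot product, so regional vocabulary differences lower the cosine for the affected cues.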

Two small word-association datasets are already available for Spanish speakers in Argentina (Luna et al., 2016; Manoiloff et al., 2010). As expected, SWOW-RP shows greater similarity with these datasets in terms of cue and first associate overlap. However, the global measures of response similarity appeared to be lower than that of NALC, perhaps due to their small size (only ~400 cues).

Overall, these results show that our norms relate to previously obtained Spanish norms, indicating reliability across different language variants. At the same time, these findings emphasize the need to construct psycholinguistic norms specific to Rioplatense Spanish, given the systematic differences in lexical usage that have developed since at least the late 18th century (Armstrong et al., 2016; Bertolotti & Coll, 2006; Moreno de Alba, 1992).

Response frequency and centrality

One of the key findings in studies on word processing is that high-frequency words have a processing advantage over low-frequency words. Words that readily come to mind in word association across a range of different cues might have a similar advantage. In network terms, a direct measure of centrality is the number of incoming links for each word, that is, the number of times a word was given as a response to any cue in the dataset. Although other network measures have also been shown to capture relevant aspects of lexical processing, such as PageRank (De Deyne et al., 2013; Griffiths et al., 2007) or closeness centrality (Stella et al., 2019), in-strength is a simple measure that captures how important or salient a given word is in the mental lexicon (De Deyne et al., 2013, 2019). This is likely to reflect the frequency of use in natural language only partially. Instead, it might also depend on other non-linguistic factors (e.g., mental imagery) and emphasize different aspects of lexical meaning (e.g., core meanings instead of pragmatic factors, figurative readings, or metaphors, which feature more prominently in natural language).
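In-strength as described above amounts to counting response tokens across all cues. A minimal Python sketch, using a hypothetical function name and invented cue-response pairs, could look like this:

```python
from collections import Counter

def in_strength(pairs):
    """Number of times each word is given as a response, summed over all cues."""
    return Counter(response for _cue, response in pairs)

# Invented (cue, response) tokens for illustration only.
toy_pairs = [
    ("perro", "animal"),
    ("gato", "animal"),
    ("gato", "perro"),
    ("vaca", "animal"),
    ("vaca", "leche"),
]

strength = in_strength(toy_pairs)
ranking = strength.most_common()   # responses ordered from most to least central
```

Words that are never produced as a response (here, gato and vaca) have an in-strength of zero even though they appear as cues, which is why the measure reflects salience as a response and not mere presence in the cue list.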

Some of the mental characteristics of central words are illustrated through the top 10 most frequent responses, which tend to refer to basic human needs (food, water, sex, love, life), important aspects of daily life (work, car, house, money), and basic categories and concepts (animal, bad, good, man, old, person, me, time). While there is considerable overlap among the most common words with languages such as Dutch and English, differences between languages are an interesting avenue for future cross-linguistic work as the relative frequencies of these words between datasets could reflect cultural differences. For example, the fact that car ranks fourth in SWOW-EN-123, but 18th in SWOW-RP-123, could reflect the central role of cars in everyday life in the United States vs. Argentina and Uruguay (footnote 6). Also, the fact that money comes first in SWOW-EN-123, while dinero ranks 6th in SWOW-RP-123, could reflect putative differences in the attitudes towards money between language communities (see Medina et al., 1996). However, note that plata (ranking 24th) is also frequently used in Rioplatense Spanish to refer to money, and, if merged with dinero, the result would become the most frequent response in the dataset.

Further evidence that supports the interpretation of response frequency as a measure of psychological saliency or lexical centrality is that it can explain unique variance in lexical tasks such as lexical decision and, to a lesser extent, naming and categorization (De Deyne et al., 2013, 2019). Here we consolidate this finding by showing it also holds for Rioplatense Spanish. Given that there are fewer studies of this kind in Spanish, we focused on the lexical decision task (LDT), in particular the large-scale SPALEX dataset (Aguasvivas et al., 2018), and a smaller study by González-Nosti et al. (2014). Correlations between response frequencies and latencies were negative, with magnitudes in the range of 0.5–0.6, similar to SWOW-EN. Also, as found for English, word frequency estimates derived from text corpora (such as those based on subtitles; Brysbaert & New, 2009; Cuetos et al., 2012) perform slightly better than response frequency. More important, however, is the fact that, as reported for English and Dutch, a significant correlation was obtained even after partialing out the overlap with the SUBTLEX-ESP word frequencies, which suggests that response frequencies represent a dimension of psychological saliency that is distinct from traditional corpus-based frequency estimates.

Meaning encoded in language and in word associations

In recent years, large-scale distributional models of meaning and simple neural network models like word2vec that learn embeddings from text have been able to explain a wide range of semantic behavior. Several approaches that model semantic cognition as processes operating on networks have been employed with similar success (Hills & Kenett, 2022; Wulff et al., 2019, 2022). These models are attractive not only because of their performance, but also because they provide a cognitive mechanism through which meaning could be learned. While word associations do not provide a cognitive explanation about how meaning is acquired, they can complement linguistic approaches by highlighting common sense knowledge and conceptual properties in a concise and transparent format.

However, an outstanding question is why word associations capture meaning in the first place. In previous work, Jones et al. (2015) suggested that the substantial agreement between similarity ratings and word associations might reflect process overlap between generating associates and judging the similarity of word pairs. This idea stems from the assumption that word associations, semantic fluency and, potentially, similarity judgments share similar processes (Jones et al., 2015). This might explain why word associations have superior performance in predicting similarities compared to natural language text models, even though association datasets are much smaller than the gigantic text corpora on which the latter are trained (often including encyclopedia data such as Wikipedia). In a similar vein as the results presented here, Richie and Bhatia (2021) thoroughly compared different text-based word embeddings and SWOW-derived representations using different similarity functions (cosine, Pearson, Euclidean distance, etc.). They evaluated the performance of both kinds of models in predicting similarity between co-hyponyms (i.e., categorical coordinates such as lettuce and tomato in the category “vegetables”), which is a more difficult task than predicting similarity between pairs that are either related (arm – leg) or largely unrelated (hammer – teeth), as is the case, for instance, in SimLex-999. Using text models such as fastText and GloVe trained on very large corpora (up to 840B words) and applying random walk and SVD-based transformations to SWOW-EN data, they found that word associations systematically outperformed text models on most categories. Moreover, when category-specific dimension weighting methods were applied, association measures outperformed text-based approaches in all categories, even approximating split-half reliability for one category (vehicles).

So far, empirical evidence for a direct and robust link between association frequency and similarity in terms of shared processes is meager. First, if process overlap is causing an advantage, then optimizing the match between the processes should lead to the best results. Yet both direct association strength and similarities based on untransformed association strength are consistently poorer predictors of meaning. This reflects the fact that participants are biased to produce high-frequency responses, which negatively affects the prediction of other behavioral tasks. Consequently, both direct association frequency and overlap measures based on direct association frequency provide a worse fit, and a series of additional steps such as response weighting and adding indirect links are required, resulting in measures that correlate only moderately with raw associative strength.

Second, in similarity judgment tasks where participants are explicitly instructed to ignore association between word pairs (e.g., SimLex, Hill et al., 2015), similarity estimates based on word associations systematically provide state-of-the-art results, despite a task manipulation that reduces task overlap. Furthermore, subjective measures such as centrality derived from word associations do not consistently predict LDT response times better than word frequency counts from subtitles; instead, both resources capture unique variance.

Rather than process overlap, word associations might have an advantage for reasons that have more to do with the nature of the content of the semantic information than the nature of the process. Word associations tap into conceptual representations that highlight experiential content. This is evident in the top associations to words like whale (big) or lemon (sour, yellow), which highlight perceptual features.

Neuro-imaging evidence supports this idea by showing the involvement of brain regions related to imagery in generating word associations (Simmons et al., 2008). Recent experimental evidence from De Deyne et al. (2021) further supports this idea. In this study, we investigated the role of multimodal information in semantic models based on word associations or word embeddings. Our results showed that word embedding models require perceptual information to account for similarity among concrete objects belonging to the same category (e.g., fruits), whereas adding this perceptual information did not improve word association models to the same degree. The same study also demonstrated that encoding affective information is needed to account for the similarity between abstract concepts. Adding this information did not improve the results for word associations as much. Both results are consistent with the idea that word associations tap into embodied information (visual, affective, etc.), whereas language partially encodes this information.

Beyond access to modal-specific representations, word associations generate different types of information as a property of the task, where words are presented without context. This lack of context might induce access to literal senses or core common-sense aspects of meaning. This contrasts with models trained on natural language. In natural language, the common ground (i.e., lemons are yellow) is often presumed to be shared among participants in a conversation and therefore omitted. Instead, departures from the norm (green lemons) would be relevant in communication. Furthermore, many words have metaphorical or figurative meanings or senses that are largely inaccessible without additional context in an elicitation task such as word associations but contribute to corpus-based embeddings. Altogether, this suggests fundamental differences between what aspects of meaning are encoded in natural language and language elicitation tasks such as feature generation or word associations. This is largely consistent with the weak correlation found across a range of studies that have aimed to predict word associations from language (e.g., Nematzadeh et al., 2017), suggesting that both natural language and word association-based representation might have unique and complementary strengths.

Future directions

The association norms presented here contribute to the diversity of psycholinguistic data available, which may be necessary for further refinement of psychological and linguistic theories. For instance, together with norms of other variants of Spanish, they can inform new studies of linguistic change. We previously explored the differences in homonym distribution between Rioplatense and Iberian Spanish variants (Armstrong et al., 2016). The addition of these association norms can provide a fuller picture of lexical and morphological change. For example, during the COVID-19 pandemic, some words may have acquired new meanings (such as bubble, corona, zoom), shifted the relative importance of different associations (beverages vs. cleaning in the case of alcohol), or acquired new emotional colorings or connotations (mask) (Laurino et al., submitted).

Another exciting possibility opened by these norms is to use them as a standard against which individual associations can be compared. Individual comparisons have applications in probing cognitive deficits and decline (Gollan et al., 2006; Wulff et al., 2022), assessing students' vocabulary learning (Fitzpatrick & Thwaites, 2020; Meara, 1980), and in studies of development across the lifespan more broadly. This will require the creation of appropriate sampling methods to collect individual data and expand the norms to deal with different domains, populations, and dialects. The opportunity sample of the current study can make such comparisons (e.g., in terms of gender) complicated as the number of female volunteers tends to outstrip male volunteers. This suggests that targeting specific demographics might be needed to tap into the rich individual differences that can be discerned in the responses. Moreover, more work needs to be done to establish the amount of data needed to compare different groups of speakers, especially if group-specific associations are not among the strongest ones but are represented in the heavy tail of the response distribution. In those cases, larger samples might be required. Another exciting direction is the study of words in context. So far, few association studies have investigated this possibility. For example, one approach would be to provide a gloss (e.g., bat [animal]) when asking for associations to capture associations to a less dominant sense.

As in many previous works, here we treat word association data as a graph. Many network properties have been described in relation to psycholinguistic properties (De Deyne & Storms, 2008; De Deyne et al., 2013; Steyvers & Tenenbaum, 2005; Wulff et al., 2019, 2022). Although some interesting issues are still to be explored, such as assortativity regarding lexical properties such as POS (Van Rensbergen et al., 2015), a more promising approach is to build multiplex network representations that incorporate multiple sources of lexical information besides word association, such as phonological similarity, feature sharing and co-occurrence (Stella et al., 2019), or relationship types and association explanations (Liu et al., 2022).

Finally, the Rioplatense data presents the first dataset for which reaction times to all responses were recorded. We expect that the reaction times will provide a valuable measure to assess cognitive models that aim to capture the time-course of semantic access. The description, use and implications of such a measure will be the topic of an upcoming paper and will be made publicly available together with the data of this work.