Introduction

Emotionally charged concepts are processed differently from emotionally neutral concepts. This intuitive idea is supported by research in multiple domains, including brain imaging (Lane, Chua, & Dolan, 1999; Lang et al., 1998; Maddock, Garrett, & Buonocore, 2003; Mourão-Miranda et al., 2003), semantic categorization (Moffat, Siakaluk, Sidhu, & Pexman, 2015; Newcombe, Campbell, Siakaluk, & Pexman, 2012; Niedenthal, Halberstadt, & Innes-Ker, 1999), affective priming (Fazio, 2001; Klauer, 1997), word associations (Cramer, 1968; Isen, Johnson, Mertz, & Robinson, 1985; Johnson & Lim, 1964; Matlin & Stang, 1978; Pollio, 1964), and word recognition reaction times (De Houwer, Crombez, Baeyens, & Hermans, 2001; Kuperman, Estes, Brysbaert, & Warriner, 2014).

Research on the emotional aspect of words traditionally makes use of three dimensions: (1) valence or evaluative attitude, generally rated on a good/bad or happy/unhappy scale, (2) arousal or activity, often represented on an active/passive scale, and (3) dominance or potency, usually expressed on a strong/weak or dominant/submissive scale. The importance of these dimensions was first described by Osgood, Suci, and Tannenbaum (1957). In an undertaking to quantify connotative meaning, they performed a factor analysis on a large number of verbal judgments of a wide variety of concepts and found that most of the variance in emotional assessments was accounted for by these three affective dimensions. Subsequent research has replicated these findings across dozens of cultures (see Heise, 2010, or Osgood, 1975, for an overview), indicating that the importance of these factors may be near universal.

Word values on these dimensions are commonly used both for investigating the influence of affective meaning on some other aspect, and to control for a possible confounding effect of the emotional charge of stimuli. As such, it is not surprising that there is a high demand for databases with affective norming data.

Traditionally, these norms are obtained by asking participants to rate a large number of words on each dimension. This procedure can be very expensive and time-consuming, as multiple persons have to rate each word in order to arrive at reliable measures (by means of average ratings). As a result, most norming databases are rather limited in the number of different words they contain, making generalization to the entire lexicon difficult. For example, the original Affective Norms for English Words (ANEW) dataset, likely the most frequently used norms, contains “just” 1,034 unique words (Bradley & Lang, 1999). Despite the cumbersome nature of gathering ratings word by word, some researchers have recently managed to construct a much more comprehensive English database, containing norms for 13,915 words (Warriner, Kuperman, & Brysbaert, 2013). Affective rating datasets in other languages are not nearly as extensive: Dutch (4,300 words: Moors et al., 2013), Finnish (420 words: Söderholm, Häyry, Laine, & Karrasch, 2013), French (1,031 words: Monnier & Syssau, 2014), German (2,900 words: Võ et al., 2009), Italian (1,034 words: Montefinese, Ambrosini, Fairfield, & Mammarella, 2014), Spanish (1,034 words: Redondo, Fraga, Padrón, & Comesaña, 2007), Polish (1,586 words: Imbir, 2015), and Portuguese (1,034 words: Soares, Comesaña, Pinheiro, Simões, & Frade, 2012).

Estimating affective ratings using word co-occurrence data

As the procedure of having participants rate words manually is both expensive and time-consuming, there has been some interest in deriving affective norms from other sources of information. One approach that has been suggested starts by deriving similarity measures for large numbers of words using their position in text corpora. For any given word in the corpus, norm ratings are then estimated using that word’s similarity to a number of words for which affective values are already known. This approach could lead to norming datasets significantly larger than those gathered using manual ratings, as large text corpora are available in many languages.

Two implementations of this technique have been put forward. A first approach makes use of latent semantic analysis (LSA; Landauer & Dumais, 1997), which quantifies the degree to which words are associated based on the assumption that similar words occur in similar pieces of text. LSA starts from a word-by-context matrix, where each cell contains how frequently a given word occurs in a given chunk of text (e.g., sentence, paragraph, or document). To diminish the influence of highly frequent words, a weighting function is applied to this matrix. Subsequently, the most important dimensions (usually 300) are extracted from this matrix using singular value decomposition, yielding a relatively low-dimensional approximation of the original matrix. The similarity between any two words is then defined as the cosine of the angle between their corresponding row vectors in this new matrix. As a result, LSA can estimate the similarity between two words that never occur together, but do co-occur in similar contexts.
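
To make these steps concrete, the following Python sketch illustrates the LSA pipeline just described, under simplifying assumptions: a toy list of documents `corpus` (more than one document), log-entropy weighting as the weighting function, and illustrative identifiers of our own. It is a minimal sketch, not the implementation used in the studies cited above.

```python
import numpy as np
from collections import Counter

def lsa_vectors(corpus, n_dims=300):
    """Word-by-document counts -> log-entropy weighting -> truncated SVD."""
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w, c in Counter(doc).items():
            counts[index[w], j] = c
    # Log-entropy weighting dampens the influence of highly frequent words.
    p = counts / counts.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = np.where(p > 0, p * np.log(p), 0.0).sum(axis=1)
    weights = 1 + entropy / np.log(len(docs))
    weighted = np.log(counts + 1) * weights[:, None]
    # Keep only the most important dimensions (typically around 300).
    u, s, _ = np.linalg.svd(weighted, full_matrices=False)
    k = min(n_dims, len(s))
    return u[:, :k] * s[:k], index

def cosine(vectors, index, w1, w2):
    """Similarity of two words: cosine of the angle between their row vectors."""
    a, b = vectors[index[w1]], vectors[index[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```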

A second approach to predict similarity from text corpora makes use of pointwise mutual information (PMI: Church & Hanks, 1990; see also Bullinaria & Levy, 2007; Manning & Schütze, 1999), which derives relatedness from direct word co-occurrence rather than co-occurrence in contexts. Specifically, the PMI of two words x and y is defined as

$$ \mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\, P(y)}, $$

where P(x,y) refers to the number of times x and y co-occur in some context divided by the total number of tokens in the corpus, and P(x) and P(y) refer to the frequencies of x and y, respectively, each divided by the total number of tokens. Compared with LSA, the most prominent advantage of PMI is scalability, as it can be applied to corpora far larger than LSA can handle. Additionally, it has been suggested that PMI may be more plausible as a model of semantic organization (Recchia & Jones, 2009).
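
As a concrete illustration, the sketch below computes PMI for a pair of words over a tokenized corpus, under the assumption that “co-occurring in some context” means appearing within a fixed window of each other; the window size and all identifiers are our illustrative choices, not details taken from the cited studies.

```python
import math
from collections import Counter

def pmi(tokens, x, y, window=5):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    n = len(tokens)
    unigrams = Counter(tokens)
    # Count occurrences of x that have a y within `window` positions.
    joint = sum(1 for i, w in enumerate(tokens)
                if w == x and y in tokens[max(0, i - window): i + window + 1])
    if joint == 0:
        return float("-inf")  # the pair never co-occurs
    p_xy = joint / n
    return math.log2(p_xy / ((unigrams[x] / n) * (unigrams[y] / n)))
```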

Once pairwise similarity estimates have been derived by applying either LSA or PMI to text corpora, one can estimate words’ values on various dimensions using their similarity towards words for which the values on those dimensions are already known.

Turney and Littman (2003) predicted the valence of words using their similarity to a small number of paradigm words, words commonly used to describe very low or very high levels of valence (e.g., good, bad). They compared the predictions of this approach with binary manual ratings (words rated positive or negative) for 3,596 English words, and reported a correlation of .65 when using similarity derived from LSA (on a corpus comprising 10 million tokens), and between .61 (corpus comprising 10 million tokens) and .83 (corpus comprising 100 billion tokens) when using similarity derived from PMI.

Bestgen and Vincze (2012) employed a somewhat different approach. Rather than examine a word’s relation to a small number of seed words, they looked at its similarity to all words for which norming data exist: they defined the estimated rating of a word as the average rating of its k nearest neighbors included in the norming data, with k ranging from 1 to 50. Nearest neighbors were obtained from similarity indices between 17,350 English words, which were calculated by applying LSA to a corpus comprising 12 million tokens. The valence, arousal, and dominance of each of these words were then estimated as the average rating of its k nearest neighbors that were included in the ANEW norms. Note that a given word was never considered as one of its own nearest neighbors; that is, predictions were based on a leave-one-out approach. Comparing the obtained estimates with the ANEW norms, they found the highest accuracy at k = 30, with a correlation of .71 for valence, .56 for arousal, and .60 for dominance.

Recchia and Louwerse (2015) used a comparable approach, with a number of differences. They obtained nearest neighbors through similarity measures derived with PMI rather than LSA, which allowed them to make use of a much larger corpus containing 1.6 billion English words. They also tested a wider array of values for neighborhood parameter k, with k ranging from 2 to 500. Additionally, instead of following a leave-one-out approach, predictions were based on the ratings of one dataset while accuracy was assessed through correspondence to ratings of a second dataset. This revealed correlations of up to .74 for valence (at k = 15), up to .57 for arousal (at k = 40), and up to .62 for dominance (at k = 60).

Finally, Mandera, Keuleers, and Brysbaert (2015) evaluated how the performance of these computational approaches is influenced by the size of the available norming data. To that end, the 13,915 words in the Warriner et al. (2013) norms were split into a training set and a test set, using different splits (e.g., 90 %/10 % or 50 %/50 %). Similarity indices between all words were obtained by applying LSA or PMI to a corpus comprising 385 million tokens. These were then used to predict the valence, arousal, and dominance of words, with neighborhood parameter k set to 30 (the optimal value described by Bestgen & Vincze, 2012). They found that accuracy is somewhat reliant on the size of the available norms. For example, when working with PMI-based similarity, increasing the training sample (i.e., the ratings that can contribute to the estimates) from 10 % of the Warriner norms to 90 % raises the correlation between the test sample and the norm ratings from .61 to .72 for valence, from .37 to .51 for arousal, and from .51 to .61 for dominance. (They also investigated a number of other extrapolation methods, all of which showed similar or lower accuracy.)

Taken together, these studies indicate that ratings extrapolated from word co-occurrence data show medium to high correlations with human judgments, highlighting the usefulness of this computational approach. Moreover, the size of norming databases constructed using this method is likely to keep expanding in the coming years, as even more word corpora become available. This is especially useful for languages other than English, where existing norming datasets are often quite limited in size.

Word associations as a source of similarity

As we have seen, existing research on computationally estimating norms generally makes use of similarity values derived from word co-occurrences in text corpora. An alternative approach to obtaining similarity estimates is to use word association data. In a word association task, participants respond with the first word(s) that come to mind after reading a certain cue word. A key assumption in using word associations to investigate meaning is that the probability of producing a certain response to a cue is a measure of the associative strength between cue and response in the mental lexicon (Cramer, 1968; De Deyne, Navarro, & Storms, 2013; Deese, 1966; Nelson, McEvoy, & Schreiber, 2004). This idea is supported by research on facilitation of word processing in associative priming (Hutchison, 2003), response times in lexical decision tasks (De Deyne et al., 2013), word recognition reaction times (De Deyne et al., 2013; Gallagher & Palermo, 1967; Nelson, McKinney, Gee, & Janczura, 1998), fluency task generation frequencies (Griffiths, Steyvers, & Tenenbaum, 2007), clustering in recall (Wicklund, Palermo, & Jenkins, 1965), and predicting cued recall (Nelson et al., 1998).

To obtain information about relatedness from word association data, one can make use of a cosine measure of similarity (Landauer & Dumais, 1997). While this measure is traditionally applied to spatial models such as LSA, it can also be used in the context of word association data (e.g., De Deyne et al., 2013; De Deyne, Verheyen, & Storms, 2015; Gravino, Servedio, Barrat, & Loreto, 2012). Here, the cosine similarity between two words reflects their overlap in associative links; two words that share no associations have a similarity of 0, while two words with the exact same associative responses have a similarity of 1. Similarity estimates obtained using this approach show a strong correspondence with relatedness judgments (De Deyne et al., 2013; De Deyne et al., 2015).

Research indicates that, compared with approaches based on text corpora, word association data can lead to a more valid measure of semantic relatedness. For example, (human) similarity judgments correlate more strongly with similarity estimates derived from association data than with predictions based on word co-occurrences (De Deyne, Peirsman, & Storms, 2009; De Deyne et al., 2015). Additionally, associative strength predicted priming effects at the word level in both lexical decision and naming tasks, whereas similarity derived from applying LSA to text corpora did not (Hutchison, Balota, Cortese, & Watson, 2008).

In the current study, we propose using word association data to obtain similarity estimates for a large number of words, and subsequently predict words’ values on affective dimensions (e.g., valence) using their similarity towards words for which the values on those dimensions are already known (e.g., pleasant). Using this approach, we will estimate valence, arousal, and dominance ratings for a large number of words. To verify the validity of these estimates, we will compare them with existing norm ratings.

Method

Materials

To obtain the associative strength for a large set of words, we made use of the Dutch Small World of Words project, which contains 3.7 million word associations collected in response to 14,000 cue words. Each cue was presented to roughly 100 participants, who gave up to three responses per cue (see De Deyne et al., 2013, for full details).

Valence, arousal, and dominance ratings for 4,300 Dutch words were taken from Moors et al. (2013). In this study, words were rated on a Likert scale ranging from 1 (very negative/unpleasant, very passive/calm, and very weak/submissive, respectively) to 7 (very positive/pleasant, very active/aroused, and very strong/dominant). Ratings showed very high split-half reliabilities: .99 for valence, .97 for arousal, and .96 for dominance.

Procedure

We began by computing the cosine similarity (e.g., Landauer & Dumais, 1997) between each pair of the 14,000 cue words in the Dutch Small World of Words dataset. In this context, a cosine measure reflects the extent to which two words overlap in associative responses: two words that share no associations would have a value of 0, while two words with the exact same associative responses would have a value of 1. To obtain this measure, we first constructed a cue-by-cue count matrix, where cells reflected how often each cue was given as an association in response to each other cue. Rows of this matrix were normalized to sum to 1 and log-transformed. Finally, to obtain the cosine of the angle between each pair of row vectors, the matrix was multiplied by its transpose. At this point, each cell of the matrix contained the cosine similarity between the cues corresponding to its row and column.
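
The following sketch shows one way this computation could be carried out in Python, assuming `assoc` is the cue-by-cue count matrix described above. Note that for the matrix product to yield cosines, the log-transformed rows must also be scaled to unit length, a step we make explicit here; other preprocessing details may differ from the original computation.

```python
import numpy as np

def cosine_matrix(assoc):
    """Cue-by-cue association counts -> pairwise cosine similarities."""
    # Normalize each row to sum to 1 (assumes every cue elicited at least
    # one response that is itself a cue), then log-transform.
    probs = assoc / assoc.sum(axis=1, keepdims=True)
    logged = np.log1p(probs)            # log(x + 1) keeps zero cells at zero
    # Scaling rows to unit length is what makes the matrix product below
    # equal the cosine of the angle between row vectors.
    unit = logged / np.linalg.norm(logged, axis=1, keepdims=True)
    return unit @ unit.T                # cell (i, j) = cosine(cue i, cue j)
```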

Subsequently, we used these similarity ratings to predict affective word covariates by applying two extrapolation methods, each of which estimates a word’s values on the affective dimensions using that word’s similarity to certain words for which affective ratings are already known.

The first extrapolation method we employed, Orientation towards Paradigm Words, predicted a word’s valence, arousal, and dominance using that word’s similarity towards certain paradigm words, words commonly used to describe extreme values on these dimensions (Kamps, Marx, Mokken, & de Rijke, 2004; Turney & Littman, 2003). Paradigm words were obtained from the instructions in the rating task described by Moors et al. (2013), which yielded two positive and two negative paradigm words for each dimension (Table 1).

Table 1 English translation of the paradigm words corresponding to valence, arousal, and dominance (Dutch source that was actually used)

At first, Orientation towards Paradigm Words predictions simply reflected the sum of a word’s similarity towards both positive paradigm words minus the sum of its similarity towards both negative paradigm words. These estimates were subsequently refined by including the target word’s similarities towards the k nearest neighbors of each of the paradigm words, that is, out of the 14,000 words, the k words with the highest similarity towards that paradigm word, where k ranged from 0 to 500. A target word’s final score was computed as the sum of its similarity towards both positive paradigm words and the k nearest neighbors of each positive paradigm word, minus the sum of its similarity towards both negative paradigm words and the k nearest neighbors of each negative paradigm word.
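
A sketch of this scoring scheme is given below, reusing the hypothetical cosine matrix `sim` and word index `index` from the previous sketch, with two positive and two negative paradigm words per dimension as in Table 1; all identifiers are illustrative.

```python
import numpy as np

def paradigm_score(sim, index, target, positive, negative, k=0):
    """Similarity to the positive paradigm words (and their k nearest
    neighbors) minus similarity to the negative ones (and theirs)."""
    t = index[target]
    score = 0.0
    for words, sign in ((positive, 1.0), (negative, -1.0)):
        for w in words:
            p = index[w]
            score += sign * sim[t, p]
            # The k nearest neighbors of the paradigm word itself
            # (most similar first, excluding the paradigm word).
            neighbors = [n for n in np.argsort(sim[p])[::-1] if n != p][:k]
            for n in neighbors:
                score += sign * sim[t, n]
    return score
```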

The second extrapolation method we applied, k-Nearest Neighbors, was very similar to the approach described by Bestgen and Vincze (2012), with the notable difference that our similarity estimates were derived from word association data rather than from word co-occurrence in text corpora. Under this approach, the score of any target word on some dimension is calculated as the mean score of its k nearest neighbors (as assessed with cosine similarity) for which the value on that dimension is known (that is, the k closest words for which human judgments are included in the dataset of Moors et al., 2013), for k ranging from 1 to 500. Note that a target word is never considered as one of its own nearest neighbors; as such, the human judgment of some word does not contribute to that word’s extrapolated rating.
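
In the same illustrative setting, the k-Nearest Neighbors estimate could be computed as follows, with `ratings` standing in for the human judgments of Moors et al. (2013) on one dimension; identifiers are again our own.

```python
import numpy as np

def knn_estimate(sim, index, ratings, target, k=10):
    """Mean rating of the target's k most similar rated words; the target's
    own rating never contributes (leave-one-out)."""
    t = index[target]
    neighbors = [(sim[t, index[w]], r) for w, r in ratings.items()
                 if w != target and w in index]
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    return float(np.mean([r for _, r in neighbors[:k]]))
```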

It may be important to stress that with the k-Nearest Neighbors approach, k refers to the nearest neighbors of the target word (for which ratings were available), while under the Orientation towards Paradigm Words method, k refers to the nearest neighbors of the various paradigm words.

Results

We estimated the valence, arousal, and dominance of the 14,000 cue words in the Small World of Words dataset with the two extrapolation methods described above. Out of these 14,000 words, 3,872 are included in the norms of Moors et al. (2013) and can be used to assess the accuracy of the two methods. These 3,872 words represent 90 % of the 4,300 words in the norms, and 28 % of the cue words in the word association dataset.

The Orientation towards Paradigm Words method predicted the affective values of words using their similarity towards certain paradigm words (see Table 1), and the k nearest neighbors of each paradigm word. The left panel of Table 2 displays the correlations (Pearson’s r) between these estimates and the human judgments described by Moors and colleagues (2013) for valence, arousal, and dominance, for k values ranging from 0 (only the paradigm words themselves are used) to 500 (the paradigm words and the 500 nearest neighbors of each paradigm word contribute to the final estimate). When estimates are based solely on similarity to the paradigm words themselves, we find correlations of .79, .53, and .59 with human judgments of valence, arousal, and dominance, respectively. As the number of neighbors of each paradigm word that contribute to the predictions increases, these correlations rise to maxima of .86, .65, and .69 for valence, arousal, and dominance, respectively.

Table 2 Correlations between human judgments and estimates derived using the Orientation towards Paradigm Words extrapolation method (left panel) and estimates derived using the k-Nearest Neighbors extrapolation method (right panel)

The k-Nearest Neighbors method estimated the valence, arousal, and dominance of each of the 14,000 words as the mean of the human ratings of its k nearest neighbors included in the Moors et al. (2013) dataset. The right panel of Table 2 displays the Pearson correlations between these estimates and human judgments of valence, arousal, and dominance, for k (the number of neighbors of a target word that contribute to its estimate) ranging from 1 to 500. We find optimal accuracy at k = 10, where the extrapolated ratings show a correlation of .91 for valence, .84 for arousal, and .85 for dominance.

We find that the performance of both extrapolation methods is a curvilinear function of neighborhood parameter k: as k increases, accuracy improves up to a certain point and then starts to decline. This decreased performance at higher values of k is in line both with expectations, as “further” neighbors have a lower similarity to the target word, and with previous research (Recchia & Louwerse, 2015).

A downside of the k-Nearest Neighbors approach is that it relies on an existing set of human judgments. As a result, the number of words for which human ratings are available is certain to have an effect on the accuracy of this method. If only a few norms are available, it is possible that some extrapolated values are based on ratings of words that are in fact not particularly close to the target word (if more similar words are not included in the norming dataset), which would certainly have consequences for the validity of those estimates. In Dutch, we have access to 3,872 words in the relatively large norms of Moors et al. (2013); in many languages, databases of this size are not available. To estimate how accurate this extrapolation method would be when only a limited set of norms is available, we followed an approach similar to that of Mandera et al. (2015) by running the k-Nearest Neighbors method restricted to random subsets of the available norming data (at k = 10, the optimal value in Table 2). We tested 12 different sample sizes, ranging from 100 words to 3,872 words (the entire dataset). To average out sampling variability, this procedure was repeated 100 times for each sample size. Figure 1 indicates that even when only a small norming dataset is available, the k-Nearest Neighbors method manages to attain a high accuracy; for example, when norms for just 1,000 words are available, the extrapolated ratings show correlations with human judgments of up to .89 for valence, and up to .79 for arousal and dominance.
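
This subsampling analysis can be sketched as follows, reusing the hypothetical `knn_estimate` from above; the iteration count and the general setup follow the description in the text, while the identifiers remain illustrative.

```python
import random
import numpy as np

def subsample_accuracy(sim, index, ratings, size, k=10, iterations=100):
    """Correlate kNN estimates with human ratings when only a random
    subset of `size` rated words is available; repeat and average."""
    words = [w for w in ratings if w in index]
    correlations = []
    for _ in range(iterations):
        subset = {w: ratings[w] for w in random.sample(words, size)}
        preds = [knn_estimate(sim, index, subset, w, k=k) for w in words]
        truth = [ratings[w] for w in words]
        correlations.append(np.corrcoef(preds, truth)[0, 1])
    # Mean accuracy and its standard error across the iterations.
    return np.mean(correlations), np.std(correlations) / np.sqrt(iterations)
```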

Fig. 1 Relation between accuracy of the k-Nearest Neighbors extrapolation method and the size of the available norms. Correlations were obtained by averaging across 100 iterations of running the extrapolation method limited to a random subset of human judgments (out of the available 3,872 norm words). Neighborhood parameter k was set to 10, the optimal value reported for running this extrapolation method with all human judgments (Table 2). Error bars (very small, reflecting the low variability across iterations) indicate the standard error of accuracy across the 100 iterations

Finally, we wanted to assess whether access to a norming dataset larger than that of Moors et al. (2013) would lead to a meaningful improvement in accuracy. Although we cannot test this notion directly with the data currently at our disposal, we can estimate it by examining the slopes of the lines in Fig. 1. As all three lines keep increasing up to the largest sample size, it seems reasonable to assume that expanding the norming dataset would result in a further small improvement in accuracy, especially for arousal and dominance.

Discussion

We have outlined two methods to computationally estimate subjective norm ratings. Both methods derive similarity from association data, and predict a word’s ratings using its similarity towards words for which affective values are already known. The two approaches were used to extrapolate the valence, arousal, and dominance of 14,000 Dutch words; these estimates are available at https://osf.io/pmbvc/.

In comparing the extrapolated norms to human judgments, we find high to very high correlations for all three dimensions. Correspondence is highest for valence, suggesting that compared with arousal and dominance, valence is represented more strongly in the semantic similarity space. This finding is in line with the importance often attributed to this aspect, both in research on affective meaning (Osgood et al., 1957) and various other domains.

Of the two extrapolation methods we tested, accuracy is highest for the k-Nearest Neighbors technique, as would be expected because this method is based directly on the human ratings with which accuracy is assessed (although importantly, the human judgment of a given word does not contribute to that word’s extrapolated value). Note, though, that this reliance on human ratings brings with it an important drawback: the k-Nearest Neighbors method can only work when human judgments are already available for some number of words. In contrast, the Orientation towards Paradigm Words approach does not depend on human judgments in any form, and aside from a selection of paradigm words, only requires similarity indices.

Because the k-Nearest Neighbors method relies on human judgments, its accuracy is likely tied to the extent of the available human ratings. As our research was performed in Dutch, we had access to the large norming dataset of Moors and colleagues (2013). In many languages, existing databases are considerably smaller. To assess how accurate our approach is when limited to a smaller set of norms, we ran the k-Nearest Neighbors extrapolation method restricted to subsets of the available norming data. Correlations with human judgments were lower than when the method had access to all norming data, but still very high (between .78 and .88 when using a subset of 1,000 words). This suggests that even when only a small set of norms is available, the k-Nearest Neighbors method can be very effective at predicting affective word covariates.

In existing research on computationally predicting affective norms, similarity or semantic relatedness is generally derived from word co-occurrence data rather than from word associations. Using these similarity estimates, several studies have extrapolated affective ratings with the help of the same k-Nearest Neighbors technique we described. These studies report that their estimates display correlations with human judgments of up to .71, .56, and .60 (Bestgen & Vincze, 2012), up to .74, .57, and .62 (Recchia & Louwerse, 2015), and roughly up to .72, .51, and .61 (Mandera et al., 2015), for valence, arousal, and dominance, respectively.

In comparison, the predictions we present show a much higher accuracy, on all three dimensions. There are several potential explanations behind this improvement. It could be a result of a difference in language: we made use of Dutch associations and judgments, while the described corpus-based studies were performed in English. However, this seems an unlikely explanation, as similar corpus-based research has also been undertaken in French and Spanish, where estimates showed similar or lower correlations with human ratings (Bestgen, 2002, 2008; Vincze & Bestgen, 2011). Furthermore, as the importance of valence, arousal, and dominance is highly generalizable across cultures (Osgood, 1975), there is no a priori reason to expect these aspects to be represented differently in Dutch and English.

A more probable cause for the disparity between our findings and previous attempts at computationally estimating norms is the nature of the information from which the similarity estimates were derived: existing research obtained relatedness from word co-occurrence in text corpora, while we made use of word association data. Previous comparisons between corpus-based and association-based similarity estimates also report a higher accuracy for approaches reliant on word association data, in line with our findings (De Deyne et al., 2009; De Deyne et al., 2015; Hutchison et al., 2008). This is likely because word associations and text corpora represent information of a different nature. Written language is grounded in pragmatics; the goal is to communicate some discourse efficiently, and information that is known to both parties is often left out. Word associations, in contrast, are non-propositional, and generally free from pragmatics or intent (Deese, 1966; Szalay & Deese, 1978). As a result, mentally central concepts or properties (such as color or shape) are usually well represented in word associations, while they are somewhat uncommon in most written text. An additional asset of word association data is its very high signal-to-noise ratio, as almost every association reflects a meaningful relation; in contrast, text corpora are often characterized by a low signal-to-noise ratio, negating part of the advantage of scale that characterizes corpus-based approaches.

Taken together, we can conclude that word association data can be a very powerful source of information on semantic relatedness, and suggest that when computationally generating affective norms, an association-based approach may be a worthwhile addition to or substitute for procedures based on word co-occurrence in text corpora.

Of course, this approach does require access to word association data. While gathering word associations is a simple and straightforward procedure, it remains reliant on human participants. As a result, constructing a large dataset of this nature is far from effortless. Luckily, such databases already exist in many languages; for example, the Small World of Words project from which we obtained the Dutch associations also contains datasets in English, German, French, Spanish, Rioplatense Spanish, Vietnamese, Japanese, and Cantonese. Note that in terms of number of tokens, these databases are all much smaller than most text corpora. However, as we have seen, this quantitative shortcoming does not necessarily translate to deteriorated predictions; indeed, human judgments show a considerably higher correspondence to the estimates reported in the current paper, which are based on a dataset comprising 3.7 million tokens, than to the estimates based on word co-occurrence data described previously, which are based on much larger corpora (e.g., the predictions reported by Recchia & Louwerse, 2015, are based on a dataset containing 1.6 billion tokens).

An important caveat when working with computationally estimated word covariates is that even when they show a moderate to high correspondence with human judgments, they could lead to different conclusions than would be reached when using human ratings (Mandera et al., 2015). The data we present are likely somewhat less vulnerable to this issue, as our estimates show considerably higher correlations to human ratings; nevertheless, this is definitely a topic that should be investigated further in future research.

In the current paper, we estimated valence, arousal, and dominance ratings based on similarity values derived from word association data. The extrapolation methods we describe would conceivably work on other psychologically relevant dimensions as well, as long as these dimensions are captured by the associative technique, that is, as long as the associations people give to a certain word are in some way related to the cue’s or association’s value on that dimension. Existing research suggests that other examples of word covariates that could likely be predicted based on association data may include concreteness (the extent to which words refer to something perceptible; see Mandera et al., 2015, or Van Rensbergen, Storms, & De Deyne, 2015), age of acquisition (the age at which a word was learned; see Mandera et al., 2015), or dimensions relevant to personality profiles (e.g., openness, conscientiousness, extraversion, agreeableness, or neuroticism; see Yarkoni, 2010, or Park et al., 2015).