2.1 Frequency of emotional words
In detail, we have analyzed three lexica of affective word usage, containing 1,034 English words, 2,902 German words, and 1,034 Spanish words, together with their emotional scores obtained from extensive human ratings. These lexica have effectively established the standard for emotion analyses of human texts. Each word in these lexica is assigned a set of values measuring different aspects of word emotionality. The three independent studies that generated the lexica for English, German, and Spanish used the Self-Assessment Manikin (SAM) method to ask participants about the different emotion values associated with each word in the lexicon. One of these values, a scalar variable v called valence, represents the degree of pleasure induced by the emotion associated with the word, and it is known to explain most of the variance in emotional meaning. In this article, we use v to quantify word emotionality.
In each lexicon, words were chosen such that they evenly span the full range of valence. In order to compare the emotional content of the three different languages, we have rescaled all values of v to the interval [−1, 1]. Indeed, as shown in the left panel of Figure 2, both the average and the median valence of all three lexica are very close to zero, i.e. the lexica themselves carry no emotional bias. This analysis, however, neglects the actual frequency of word usage, which is highly skewed [1, 10]. For our frequency estimations we have used Google’s N-gram dataset which, with 10^12 tokens, is one of the largest datasets available about real human text expressions on the Internet. For our analysis, we have studied the frequency of the words which have an affective classification in the respective lexicon in either English, German, or Spanish. Figure 1 shows emotion word clouds for the three languages, where each word appears with a size proportional to its frequency. The color of a word is chosen according to its valence, ranging from red for v = −1 to green for v = +1. It is clear that green dominates over red in all three cases, as positive emotions predominate on the Internet. Some outliers, like “home”, have a notably higher frequency of appearance on websites, but as we show later, our results are consistent with frequencies measured from traditional written texts like books.
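As an illustration, the rescaling is a simple linear map; the sketch below assumes raw SAM ratings on the common 1–9 scale, which may differ between lexica:

```python
def rescale_valence(v, lo=1.0, hi=9.0):
    """Linearly map a raw SAM valence rating from [lo, hi] to [-1, 1].

    The 1-9 scale is an assumption; adjust lo/hi per lexicon.
    """
    return 2.0 * (v - lo) / (hi - lo) - 1.0

# A neutral rating maps to 0; the extremes map to -1 and +1.
print(rescale_valence(5.0))                      # 0.0
print(rescale_valence(1.0), rescale_valence(9.0))  # -1.0 1.0
```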
The different usage of words with the same valence is quite apparent. For example, the words “party” and “sunrise” both have a positive valence of 0.715, yet “party” appears 144.7 times per million words compared to 6.8 for “sunrise”. Similarly, “dead” and “distressed” both have a negative valence of −0.765, but the former appears 48.4 times per million words, the latter only 1.6 times. Taking the frequencies of word usage into account, we find for all three languages that the median valence shifts considerably towards positive values, as shown in the right panel of Figure 2. Wilcoxon tests confirm that the means of these distributions are indeed shifted, with estimated differences significant at the 95% confidence level for English, German, and Spanish. Hence, with respect to usage we find evidence that the language used on the Internet is emotionally charged, i.e. significantly different from neutral. This affects quantitative analyses of emotions in written text, because the “emotional reference point” is not at zero, but at considerably higher valence values (about 0.3).
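The shift induced by weighting with usage can be reproduced on a toy example. The four valence/frequency pairs from the text are kept; the remaining words and all “home”/“war” numbers are invented for illustration:

```python
from statistics import median

# Hypothetical mini-lexicon: (word, rescaled valence, frequency per million tokens).
lexicon = [
    ("party",       0.715, 144.7),
    ("sunrise",     0.715,   6.8),
    ("dead",       -0.765,  48.4),
    ("distressed", -0.765,   1.6),
    ("home",        0.800, 500.0),  # invented frequency and valence
    ("war",        -0.900,  60.0),  # invented frequency and valence
]

# Median over lexicon entries, ignoring how often each word is actually used:
unweighted = median(v for _, v, _ in lexicon)

# Frequency-weighted median: replicate each valence in proportion to its usage.
weighted = median(v for _, v, f in lexicon for _ in range(round(f * 10)))

print(unweighted, weighted)
```

Because the frequent words in this toy lexicon are mostly positive, the weighted median lies well above the unweighted one, mirroring the shift in the right panel of Figure 2.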
2.2 Relation between information and valence
Our analysis suggests that there is a definite relation between word valence and frequency of use. Here we study the role of emotions in the communication process, building on the relation between information measures and valence. While we cannot measure information perfectly, we can approximate it from the frequencies of words and word sequences. First we discuss the relation between word valence and the information content estimated from simple word occurrences, namely self-information. Then we explain how this extends when information is measured taking into account the different contexts in which a word can appear. The self-information of a word w, I(w), is an estimate of its information content based on its probability of appearance, P(w):

I(w) = −log2 P(w). (1)
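Self-information is straightforward to compute from raw counts; a minimal sketch:

```python
import math

def self_information(count, total):
    """I(w) = -log2(P(w)) in bits, with P(w) estimated as count / total."""
    return -math.log2(count / total)

# A word seen once in 1,024 tokens carries 10 bits;
# a word making up half the corpus carries only 1 bit.
print(self_information(1, 1024))    # 10.0
print(self_information(512, 1024))  # 1.0
```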
Frequency-based information content metrics like self-information are commonly used in computational linguistics to systematically analyze communication processes. Information content is a better predictor of word length than word frequency [2, 41], and the relation between information content and meaning, including emotional content, is claimed to be crucial for the way humans communicate [15–17]. We use the self-information of a word as an estimation of information content for a context size of 1, before building up to larger context sizes. This way, we frame our analysis within the larger framework of N-gram information measures, aiming at an extensible approach that can be incorporated in the fields of computational linguistics and sentiment analysis.
For the three lexica, we calculated I(w) of each word and linked it to its valence v. As defined in Equation 1, very common words provide less information than very unusual ones, and this nonlinear mapping between frequency and self-information makes the latter more closely related to word valence than the former. The first two lines of Table 1 show the Pearson’s correlation coefficient between word valence and frequency, r(v, f), followed by the correlation coefficient between word valence and self-information, r(v, I). For all three languages, the absolute value of the correlation coefficient with I is larger than with f, showing that self-information provides more knowledge about word valence than plain frequency of use.
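Why the logarithmic transform helps can be seen on synthetic data in which frequency grows exponentially with valence: self-information is then linear in valence while raw frequency is not. All numbers below are invented for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic lexicon: frequency doubles with every 0.25 step in valence.
v = [-1.0, -0.5, 0.0, 0.5, 1.0]
f = [2, 8, 32, 128, 512]
total = sum(f)
i = [-math.log2(c / total) for c in f]

r_vf = pearson(v, f)  # positive, but weakened by the skewed frequencies
r_vi = pearson(v, i)  # exactly -1 here, since i is linear in v
print(abs(r_vi) > abs(r_vf))  # True
```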
The right column of Figure 3 shows in detail the relation between v and I. From the clear negative correlation found for all three languages (between −0.3 and −0.4), we deduce that words with less information content carry more positive emotions, as the average valence decreases along the self-information range. As mentioned before, the Pearson’s correlation coefficient between word valence and self-information, r(v, I), is significant and negative for all three languages (Table 1). Our correlations are stronger than those of a recent study that, focusing on individual text production, reported a weaker correlation (below 0.3) between the logarithm of word usage frequency and valence. That previous analysis was based on a much smaller dataset from Internet discussions (on the order of 10^8 tokens) and the same English lexicon of affective word usage we used. With a much higher accuracy in estimating word frequencies, and extending the analysis to three different languages, we were able to verify that there is a significant relation between the emotional content of a word and its self-information, impacting the frequency of usage.
Finally, we also performed a control analysis using alternative frequency datasets, to account for possible anomalies in the Google dataset due to its online origin. We used the word frequencies estimated from traditional written corpora, i.e. books, as reported in the original datasets for English, German, and Spanish. Calculating self-information from these frequencies and relating it to the given valences, we obtained similar, though slightly lower, Pearson’s correlation coefficients (see Table 1). We conclude that our results are robust across different types of written communication for the three languages analyzed.
It is not surprising to find a larger self-information for negative words, as their probability of appearance is generally lower. The amount of information carried by a word also depends strongly on its context, which is defined, among other factors, by the word’s neighborhood in the sentence. For example, the word “violent” contains less information in the sentence “dangerous murderers are violent” than in “fluffy bunnies are violent”, as the probability of finding this particular word is larger when talking about murderers than about bunnies. For this reason we evaluate how the context of a word affects its informativeness and valence. The intuition behind measuring information depending on context is that the information content of a word depends primarily on i) the number of contexts in which it can appear and ii) its probability of appearance in each of these contexts. It is not only the most infrequent words, but the most specific and unexpected ones, that carry the most information. Given the contexts c_i in which a word w appears, the information content is defined as
I(w) ≈ −(1/N) Σ_{i=1}^{N} log2 P(w | c_i),

where N is the total frequency of the word w in the corpus used for the estimation and c_i is the context in which the i-th occurrence of w appears. These values were calculated as approximations of the information content given the words surrounding w, for context sizes up to 4.
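On a toy corpus, this context-conditioned estimate can be sketched with unsmoothed maximum-likelihood probabilities (the corpus and target word are invented; the actual study uses Google N-gram counts):

```python
import math
from collections import Counter

def contextual_information(tokens, word, n=2):
    """Average -log2 P(word | preceding n-1 words) over all occurrences of word.

    Unsmoothed maximum-likelihood sketch of the N-gram information estimate.
    """
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    bits, count = 0.0, 0
    for i in range(n - 1, len(tokens)):
        if tokens[i] == word:
            gram = tuple(tokens[i - n + 1:i + 1])
            bits += -math.log2(ngrams[gram] / contexts[gram[:-1]])
            count += 1
    return bits / count if count else float("nan")

corpus = "the cat sat on the mat the cat ran".split()
# "cat" always follows "the", but "the" is followed by "cat" only 2 out of 3 times:
print(contextual_information(corpus, "cat", n=2))  # -log2(2/3) ≈ 0.585
```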
We analyzed how word valence is related to the information content up to context size 4, using the original calculations provided by Piantadosi et al. This estimation is based on the frequency of sequences of N words, called N-grams, from the Google dataset, for N ≤ 4. This dataset contains frequencies for single words and N-grams, calculated from an online corpus of more than a trillion tokens. The source of this dataset is the whole Google crawl, which aimed at spanning a large subset of the web, providing a wide view of how humans write on the Internet. For each context size N, we have a different estimation of the information carried by the studied words, with self-information representing the estimation for a context of size 1.
The left column of Figure 3 shows how valence decreases with the estimated information content for each context size. Each bar represents the same number of words within a language and has an area proportional to the rescaled average information content carried by these words. The color of each bar represents the average valence of the binned words. The decrease of average valence with information content is similar for estimations using 2-grams and 3-grams. For 4-grams it also decreases for English and Spanish, but this trend is not so clear for German. These trends are properly quantified by Pearson’s correlation coefficients between valence and information content for each context size (Table 1). Each correlation coefficient becomes smaller for larger context sizes, as the information content estimation includes a larger context but becomes less accurate.
2.3 Additional analysis of valence, length and self-information
In order to provide additional support for our results, we tested different hypotheses about the relation between word usage and valence. First, we calculated Pearson’s and Spearman’s correlation coefficients between the absolute value of the valence and the self-information of a word, r(|v|, I) (see Table 2). We found both correlation coefficients to be around 0.1 for German and Spanish, while they are not significant for English. The dependence between valence and self-information largely disappears if we ignore the sign of the valence, which indicates that the usage frequency of a word is related not merely to overall emotional intensity, but to the positive or negative emotion expressed by the word.
Subsequently, we found that the correlation coefficient between word length l and self-information, r(l, I), is positive, showing that word length increases with self-information. These values of r(l, I) are consistent with previous results [1, 2]. Pearson’s and Spearman’s correlation coefficients between valence and length are very low or not significant. To test the combined influence of valence and length on self-information, we calculated the partial correlation coefficients r(v, I | l) and r(l, I | v). The results, shown in Table 2, lie within the 95% confidence intervals of the original correlation coefficients r(v, I) and r(l, I). This supports the existence of an additional dimension of the communication process that is closely related to emotional content rather than to communication efficiency. It is consistent with the known result that word lengths adapt to information content, while valence emerges as an independent semantic feature: it is related to information content, but not to the symbolic representation of the word through its length.
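The first-order partial correlation coefficient has a closed form in terms of the three pairwise coefficients; the input values below are hypothetical, not those of Table 2:

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for z: r(x, y | z)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical pairwise coefficients: r(v, I) = -0.35 (valence vs. information),
# r(v, l) = 0.05 (valence vs. length), r(l, I) = 0.6 (length vs. information).
r_vI_given_l = partial_corr(-0.35, 0.05, 0.6)
print(r_vI_given_l)  # still clearly negative after controlling for length

# If z is uncorrelated with both variables, controlling for it changes nothing:
print(partial_corr(0.5, 0.0, 0.0))  # 0.5
```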
Finally, we explore the sole influence of context by controlling for word frequency. In Table 3 we show the partial correlation coefficients of valence with information content for context sizes between 2 and 4, controlling for self-information. We find that most of the correlations remain significant and negative, with the exception of the context size 2 coefficient for English. The weaker correlation for context size 2 is probably related to two-word constructions such as negations, articles before nouns, or epithets. These high-frequency, low-information constructions lead to the conclusion that, for English, the information content at context size 2 does not explain more about valence than self-information does, as short-range word interactions change the valence of the whole particle. This finding supports the assumption of many lexicon-based unsupervised sentiment analysis tools, which consider valence modifiers for two-word constructions [5, 6]. On the other hand, the significant partial correlation coefficients for context sizes 3 and 4 suggest that word information content combines at distances longer than 2, as longer word constructions convey more contextual information than 2-grams. Knowing the possible contexts of a word up to distance 4 provides more information about word valence than self-information alone.