1 Introduction

A distributional semantic model represents a word as a multi-dimensional vector learned from a large amount of usage data (e.g., Mikolov et al., 2013; Pennington et al., 2014). Many such models have been developed, and they have shown impressive performance in various fields, such as natural language processing (Mikolov et al., 2013; Turney & Pantel, 2010), cognitive science (Jones et al., 2015; Mandera et al., 2017; Utsumi, 2011), and neuropsychology (Anderson et al., 2017; Mitchell et al., 2008).

To improve distributional semantic models further, accurate evaluations of these learning models are necessary. Word similarity prediction has frequently been used to assess the performance of distributional semantic models (De Deyne et al., 2009; De Deyne, Perfors, et al., 2016; Lenci, 2018; Levy et al., 2015; Mandera et al., 2017; Rothe & Schütze, 2017). High performance on this task indicates that a model has acquired human-like semantic knowledge about the meaning of words.

In the ensuing subsections, we first give a brief summary of the concept of similarity, focusing on the difference between similarity and association. Next, we review the similarity databases in English and Japanese. Finally, we summarize the differences between the existing datasets and the objectives of this study.

1.1 Importance of distinguishing between similarity and association

It is difficult to define “similarity” precisely; some similarity can be found for almost any pair of entities (De Deyne et al., 2016). In this study, following previous work, we characterize similarity by contrasting it with “association.”

Similarity, especially semantic similarity, has traditionally been defined intuitively: that is, in terms of how close two words are to being synonyms, a judgment that humans can make intuitively (Miller & Charles, 1991). Indeed, studies using conceptual features elicited from humans (such as <lays eggs> and <can fly> for the concept robin) have found that semantic similarity is well captured by shared conceptual features (De Deyne et al., 2009). More objective frameworks for capturing similarity have also been proposed, based on information from WordNet and Roget's Thesaurus (Jarmasz & Szpakowicz, 2003; Resnik, 1995).

The distinction between “similarity” and “association” is important. In most cases, similarity is a special case of association (Budanitsky & Hirst, 2006). To borrow an example from Hill et al. (2015), the difference between similarity and association is exemplified by the concept pairs [car, bike] and [car, petrol]. Car is said to be (semantically) similar to bike and associated with (but not similar to) petrol. Intuitively, car and bike can be understood as similar because of their common physical features (e.g., wheels), their common function (transport), or because they fall within a clearly definable category (modes of transport). In contrast, car and petrol are associated because they frequently occur together in space and language, in this case as a result of a clear functional relationship. As another example, from Scheible et al. (2013), the synonyms hot and scorching and the antonyms hot and cold are both strongly “associated,” in that they share the dimension of “temperature.” Along that dimension, however, hot and scorching are close to each other whereas hot and cold are far apart; hence the former pair is strongly “similar,” whereas the latter is not.

It is also useful to distinguish association from two related notions. First, “association” is often conflated with “word association,” which is measured by a word-association task: words frequently produced as responses in such a task are associated, but they are not necessarily similar. Second, co-occurrence, which is usually computed from a corpus, is another type of association: words that frequently co-occur are associated, but, again, they are not necessarily similar.

1.2 Similarity dataset

1.2.1 English dataset

Quite a few English datasets for word similarity tasks have been released, e.g., SimLex-999 (Hill et al., 2015), WordSim353 (Finkelstein et al., 2002), MC (Miller & Charles, 1991), RG (Rubenstein & Goodenough, 1965), Stanford’s Contextual Word Similarities dataset (Huang et al., 2012), Stanford Rare Word similarity dataset (Luong et al., 2013), CARD-660 (Pilehvar et al., 2018), Verb Similarity dataset (Baker et al., 2014), and SimVerb-3500 (Gerz et al., 2016). Of these, WordSim353 (Finkelstein et al., 2002) and SimLex-999 (Hill et al., 2015) are used with a distinction between similarity and association. The SimVerb-3500 (Gerz et al., 2016) dataset is an extension of the SimLex-999 set and contains 3500 English verbs.

WordSim353 (Finkelstein et al., 2002) does not distinguish between “similarity” and “association” in its instructions to participants, in contrast to Hill et al. (2015), as discussed below. Therefore, although the dataset name contains the word “similar,” it can be regarded as a dataset that measures “association.” For this reason, Agirre et al. (2009) isolated the pairs in WordSim353 that were classified as synonyms, antonyms, identical, or hyponym–hypernyms as similar pairs, named “WordSim353 similarity,” and used the other pairs as association pairs, named “WordSim353 relatedness.” Thus, the separation of “similarity” and “association” here is post hoc, based on the linguistic relations of the word pairs.

Hill et al. (2015) pointed out that simply asking participants to “assign a numerical similarity score” (Finkelstein et al., 2002) allows association to contaminate similarity ratings, because participants do not pay much attention to the subtle differences between similarity and association. Therefore, Hill et al. (2015) conducted a pure similarity rating task, with detailed instructions contrasting association and similarity, and compiled the results into the SimLex-999 dataset. Association, in contrast, was estimated from an existing word-association dataset—specifically, the University of South Florida Free Association Database (USF; Nelson et al., 2004). For example, SimLex-999 contains the pair “new–old.” This pair is associated but dissimilar because the two words lie far apart on the time axis, like hot and cold on the dimension of “temperature.” Indeed, this pair ranked 825th out of 999 word pairs in similarity in the SimLex-999 dataset. In contrast, “association” refers to the degree of some type of involvement, which includes similarity: in SimLex-999, the “new–old” pair had the 9th highest rank out of 999 word pairs on the association index based on the USF (Nelson et al., 2004).

Thus, even in English, there are few datasets that distinguish between similarity and association. Moreover, even in SimLex-999 (Hill et al., 2015), association is inferred from word-association tasks rather than rated directly. If a clear distinction were made between similarity and association in the instructions of otherwise identical rating tasks, the two rating values would be easier to compare. This is one of the motivations for conducting this study.

1.2.2 Japanese dataset

Similarity datasets are extremely rare in Japanese. Recently, Vulic et al. (2020) developed a unified procedure for dataset construction and applied it to 12 languages: Chinese Mandarin, Welsh, English, Estonian, Finnish, French, Hebrew, Polish, Russian, Spanish, Kiswahili, and Yue Chinese; although this work is a major recent development in the field, it does not include Japanese. To the best of our knowledge, the Japanese Word Similarity Dataset (JWSD; Sakaizawa & Komachi, 2018) is the only publicly available similarity dataset.

JWSD (Sakaizawa & Komachi, 2018) is the first similarity dataset in Japanese. It is characterized by the inclusion of low- to high-frequency words and four parts of speech: nouns, verbs, adjectives, and adverbs. The instructions are simple, showing examples and asking participants to assign a degree of similarity (“We asked annotators to assign the degree of similarity for each pair using the same 10-point scale.”), as in WordSim353 (Finkelstein et al., 2002). Thus, it does not make a clear distinction between similarity and association. In addition, as noted by Karpinska et al. (2018), JWSD contains several compound words comprising two or more morphemes, as well as past-tense verbs. This can cause problems when distributional models are applied to the dataset, such as a large number of out-of-vocabulary (OOV) words.

A different but related resource is the Japanese Bigger Analogy Task Set (jBATS; Karpinska et al., 2018). This dataset does not contain similarity ratings, but it specifies the linguistic relationships of word pairs in detail and contains a wealth of examples. Later, we compare the dataset constructed in this study with these two datasets.

1.3 Explaining similarity and association ratings via distributional models

Levy et al. (2015) distinguished similarity and association ratings and tried to explain them via distributional models, using the WordSim similarity and WordSim relatedness sets (Agirre et al., 2009) and SimLex-999 (Hill et al., 2015) described above. The distributional models that they used were positive pointwise mutual information (PPMI), PPMI with singular value decomposition (PPMI + SVD; Bullinaria & Levy, 2007), GloVe (Pennington et al., 2014), and skip-gram with negative sampling (SGNS; Mikolov et al., 2013). These models can be used to predict the human rating values in a dataset, but the correlation coefficients between model predictions and ratings vary: the highest correlation coefficients across these models and several parameter settings are approximately 0.79 for WordSim similarity, 0.69 for WordSim relatedness, and 0.43 for SimLex-999. Because correlation coefficients depend on various properties of the datasets, we need similarity and association datasets that differ only in the instructions given for the similarity and association tasks. In addition, no such study exists in Japanese, although one study—Karpinska et al. (2018)—has examined JWSD (Sakaizawa & Komachi, 2018). This constitutes another of our motivations to construct a new dataset.

1.4 Purposes of this study

In this study, we developed a new Japanese word similarity and association dataset, called the Japanese Word Similarity and Association Norm (JWSAN). To the best of our knowledge, JWSAN is the first datasetFootnote 1 that contains both word similarity and association scores collected with instructions that clearly distinguish between “similarity” and “association.” JWSAN is also characterized by data collected from a sufficient number of age- and gender-controlled assessors. This study was performed by (1) choosing word pairs prior to the survey, (2) collecting similarity and association ratings for the full dataset, JWSAN, using a web-based survey, (3) reporting the characteristics of JWSAN, and (4) analyzing the effects of the gender and age of the raters. Regarding (2), this study adopted a rating task not only for similarity but also for association, to unify the measurement methods between the two. This is in contrast to previous studies (Hill et al., 2015) that used a rating task for similarity but a word-association task (from the USF; Nelson et al., 2004) for association.

2 Methods

2.1 Design for selecting word pairs

We developed a dataset containing three classes of word pairs: noun pairs (noun–noun), verb pairs (verb–verb), and adjective pairs (adjective–adjective). So that the dataset would include word pairs whose degrees of similarity and association are distributed as widely as possible, we selected word pairs according to two assumptions. First, we assumed that two words belonging to the same semantic category would be more similar than two words belonging to different semantic categories. Thus, we created two types of word pairs using a thesaurus (Isahara et al., 2008): pairs of words that belong to the same semantic category (expected to have high similarity) and pairs of words that belong to different semantic categories (expected to have low similarity). Second, we assumed that a pair of words that frequently occur in the same context would have a high association. Thus, we created two types of word pairs using pointwise mutual information (PMI), an index of co-occurrence tendency (defined below): pairs of words that frequently co-occur in the same context (expected to have high association) and pairs of words that are unlikely to co-occur in the same context (expected to have low association). Based on these two assumptions, we created four types of word pairs: semantically similar and frequently co-occurring, semantically similar and infrequently co-occurring, semantically dissimilar and frequently co-occurring, and semantically dissimilar and infrequently co-occurring word pairs.
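For reference, the standard definition of PMI for a word pair (w1, w2) is shown below, with probabilities estimated from corpus counts; the base-2 logarithm is a common convention, and nothing in our procedure depends on a particular variant:

```latex
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}
```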

2.2 Selection of word pairs

Before selecting the words used for our dataset, we had to choose a proper word dictionary. This choice is important because Japanese sentences are written in non-segmented form, with no explicit word boundaries. To process Japanese text, which is necessary for training distributional word vectors, word segmentation must be performed using a morphological analyzer. However, different morphological analyzers use different word dictionaries; worse, these dictionaries contain numerous inconsistencies from a linguistic perspective (Maekawa et al., 2014). Thus, we adopted UniDic (Ver. 2.1.2)Footnote 2 (Den et al., 2007), a word dictionary developed to resolve this problem and to provide a proper basis for Japanese morphological analysis. The unit for identifying a word in UniDic is the short unit word (Maekawa et al., 2014), which does not include compound words.

As a thesaurus for creating semantically similar and dissimilar pairs, we used Japanese WordNet (Ver. 1.1)Footnote 3 (Isahara et al., 2008). WordNet is a type of thesaurus in which words are grouped into semantic categories, called synsets. We extracted the nouns, verbs, and adjectives contained in both WordNet and UniDic.

As pre-processing, we excluded words comprising only one kanji character.Footnote 4 Furthermore, low-frequency words were eliminated in reference to the vocabulary listFootnote 5 of Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al., 2014) as follows. For nouns, the bottom 75% (frequency of less than 152) were eliminated, and 4117 unique words were selected. For verbs, the bottom 50% (frequency of less than 91) were discarded, and 1589 unique words were extracted. For adjectives, because they were small in number, only those with frequencies less than four were eliminated, and 463 unique words were left.

We then constructed word pairs from this word pool using WordNet synsets and PMI values. First, we classified synsets into two types: one-word synsets, which contain only one word from the vocabulary, and multiple-word synsets, which contain two or more. Using the multiple-word synsets, we created all pairs of words within the same synset, except pairs sharing a kanji character, as candidates for semantically similar pairs. As candidates for semantically dissimilar pairs, we created all pairs of words drawn from different one-word synsets. Thus, we obtained a pool of semantically similar pairs comprising 2387 noun pairs, 1577 verb pairs, and 214 adjective pairs, and a pool of semantically dissimilar pairs comprising 493,053 noun pairs, 83,132 verb pairs, and 3975 adjective pairs.

Next, we computed PMIFootnote 6 values for all pairs of words that co-occurred in the BCCWJ corpus (Maekawa et al., 2014), because PMI cannot be computed for two words that never co-occur. We then selected, as the semantically similar and frequently co-occurring word pairs (550 pairs), the 400 noun pairs, 100 verb pairs, and 50 adjective pairs with the highest PMI values from the pool of semantically similar pairs. Similarly, the semantically dissimilar and frequently co-occurring word pairs (550 pairs) were the 400 noun pairs, 100 verb pairs, and 50 adjective pairs with the highest PMI values from the pool of semantically dissimilar pairs. Furthermore, we determined the semantically similar and infrequently co-occurring word pairs (550 pairs) and the semantically dissimilar and infrequently co-occurring word pairs (550 pairs) by randomly choosing 400 noun pairs, 100 verb pairs, and 50 adjective pairs that did not co-occur in the corpus from the pools of semantically similar and semantically dissimilar pairs, respectively. Consequently, each of the four word-pair types comprised 400 noun pairs, 100 verb pairs, and 50 adjective pairs; thus, 2200 pairs were selected in total.
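A minimal sketch of this selection step, assuming co-occurrence counts have already been collected from the corpus (the variable names and data structures are illustrative, not those of our actual pipeline):

```python
import math
from collections import Counter

def pmi_scores(pair_counts: Counter, word_counts: Counter,
               total_pairs: int, total_words: int) -> dict:
    """Compute PMI for every pair that actually co-occurred.

    pair_counts: Counter mapping (w1, w2) -> co-occurrence count
    word_counts: Counter mapping w -> unigram count
    """
    scores = {}
    for (w1, w2), c in pair_counts.items():
        p_joint = c / total_pairs
        p1 = word_counts[w1] / total_words
        p2 = word_counts[w2] / total_words
        scores[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return scores

def top_pairs(scores: dict, candidates: list, n: int) -> list:
    """Pick the n candidate pairs with the highest PMI.

    Pairs absent from `scores` never co-occurred, so PMI is undefined
    for them and they are excluded, as in our procedure.
    """
    co_occurring = [p for p in candidates if p in scores]
    return sorted(co_occurring, key=lambda p: scores[p], reverse=True)[:n]

# e.g., the 400 semantically similar, frequently co-occurring noun pairs:
# top_nouns = top_pairs(scores, similar_noun_candidates, 400)
```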

These 2200 pairs were divided into two sets of 1100 pairs each. Two graduate students judged each set to identify pairs containing words that were difficult to understand. The percentages of agreement for the two sets were 98.3% and 97.5%. Of these pairs, 55 (noun pairs: 22, verb pairs: 21, adjective pairs: 12) were judged by at least one assessor to contain a word that was difficult to understand. We eliminated these 55 pairs and used the remaining 2145 pairs in the web-based survey.

2.3 Assessment set

The 2145 pairs were divided into 21 sets of nearly equal size, each including 102 or 103 pairs. Word classes and word-pair types were balanced across the 21 sets.

2.4 Participants

A total of 9253 native speakers of Japanese, covering a wide age range, participated in a web-based survey via an Internet research company in Japan. Table 1 shows the number of participants in each of the 10 gender (male, female) × age (20s, 30s, 40s, 50s, and 60s) classes.

Table 1 Number of participants before and after data screening

The participants were divided into 42 roughly equal-sized groups such that every group contained at least 10 participants from each gender–age class. Each participant group was assigned one of the 21 assessment sets and one of the two tasks (i.e., similarity rating or association rating). Hence, each participant rated either the similarity or the association of one assessment set.

2.5 Query design and procedure

The participants were given task-specific instructions. For the similarity rating task, we designed the instructions in accordance with Hill et al. (2015): rather than attempting to define the notion of similarity, we provided a number of examples of similar word pairs, because it is difficult to give a formal characterization of similarity and, even if one exists, different concept types would require different instructions. In contrast, the instructions for the association rating task were simple, as in previous research (e.g., Finkelstein et al., 2002). Hence, participants in the similarity rating condition were given detailed instructions with examples to clarify the distinction between similarity and association, whereas participants in the association rating condition were given only brief instructions.

The instructions for the similarity rating task highlighted the importance of drawing a clear distinction between word similarity and word association, and association was explained alongside similarity using examples. To illustrate the difference, the instructions stated, for example, that the word tire is “associated” with the word car in the sense that a tire is a part of a car, but the two do not have similar meanings (they are not synonymous) because a tire is different from a car itself. For participants in the similarity rating task, a validation test was administered immediately after the instructions. The validation test comprised three multiple-choice questions asking participants to identify the most similar pair from a set of three options, all of which were associated but only one of which was clearly similar (e.g., [cutter, paper], [cutter, frying pan], [cutter, knife]). Participants received feedback on whether their answers were correct. In the instructions for the association rating task, the notion of association was explained by the following example: the degree of association between scholar and book is “6” (considerably associated), but the degree of association between ground meat and magnet is “2” (considerably unassociated). To ensure that participants read the instructions, they were not allowed to move to the next page until 30 s had passed for the similarity rating task and 10 s for the association rating task. The full instructions, translated from Japanese, for the similarity and association rating conditions are shown in Appendices A and B, respectively.

After going through the instructions, participants were asked to rate the degree of similarity or association of the word pairs presented on the screen by choosing the most appropriate number on a seven-point scale, ranging from not similar (associated) at all (1) to extremely similar (associated) (7). The rating screen for the similarity rating task is shown in Fig. 1.

Fig. 1 Rating screen for the similarity rating task (originally in Japanese)

2.6 Data screening

We eliminated as unreliable all rating data from participants who met any of the following criteria: all rated values were identical; the task completion time was very long (i.e., within the longest 5% of all participants); or the task completion time was very short (i.e., within the shortest 5% of all participants). After this screening, the number of participants decreased to 8132.
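A minimal sketch of this screening, assuming a table with one row per participant (the file and column names are illustrative):

```python
import pandas as pd

# df: one row per participant; 'completion_sec' holds task completion time,
# and the 'pair_*' columns hold that participant's rated values.
df = pd.read_csv("raw_ratings.csv", index_col="participant_id")  # hypothetical file

rating_cols = [c for c in df.columns if c.startswith("pair_")]   # illustrative naming
lo, hi = df["completion_sec"].quantile([0.05, 0.95])

constant = df[rating_cols].nunique(axis=1) == 1  # all rated values identical
too_slow = df["completion_sec"] >= hi            # longest 5%
too_fast = df["completion_sec"] <= lo            # shortest 5%

screened = df[~(constant | too_slow | too_fast)]
```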

Next, for each combination of the 21 assessment sets and two rating tasks, we eliminated the data of randomly chosen participants such that the 10 gender (male, female) × age (20s, 30s, 40s, 50s, and 60s) groups had equal amounts of rating data (i.e., participants). As a result, data from 6450 participants were retained for the dataset, and each combination of assessment set and rating task had at least 100 participants (10 per group) and at most 190 participants (19 per group).

3 Results and discussion

We transformed the rating scale from the original 1–7 scoring to a range of 0–6 by subtracting one point from all rated scores, so that the minimum value matched that of other representative English similarity datasets, such as WordSim353 and SimLex-999. We then computed the mean similarity and association ratings for each of the 2145 word pairs; these are available in JWSAN.Footnote 7

The mean rating over all 2145 pairs was 1.99 (SD = 1.25) for similarity and 3.08 (SD = 1.16) for association. The mean similarity rating was much lower than the scale midpoint of 3, whereas the mean association rating was roughly equal to the midpoint, suggesting that many word pairs were rated as dissimilar. Table 2 shows examples of word pairs in JWSAN together with their similarity and association ratings. We selected examples of pairs with high–high, medium–medium, low–high, and low–low similarity and association values for each POS category.

Table 2 Examples of high similarity–high association, medium similarity–medium association, low similarity–high association, and low similarity–low association pairs for each POS category

3.1 Distributions of and correlation between rated similarity and association in JWSAN

Pearson’s and Spearman’s correlation coefficients between rated similarity and rated association were 0.91 and 0.94, respectively (ps < 0.01). These results clearly show a very strong correlation between similarity and association ratings.

Figure 2 suggests one possible reason for this strong correlation. As the scatterplot in Fig. 2 shows, two words are rarely judged semantically similar but unassociated, even though the set of 2145 word pairs included infrequently co-occurring, semantically similar pairs. The mean similarity and association ratings of the infrequently co-occurring and semantically similar pairs were 2.72 (SD = 1.05) and 3.68 (SD = 0.80), respectively. This suggests that semantically similar pairs were also judged as highly associated, which is not surprising. The similarity histogram in Fig. 2 also shows a biased distribution of rated similarity, with many dissimilar pairs.

Fig. 2 Scatterplot of similarity and association ratings, with histograms, in JWSAN

Next, we examined the two ratings in terms of the original selection conditions. Figure 2 shows a scatterplot by condition, Table 3 shows the means and SDs, and Fig. 3 shows boxplots that clearly display the characteristics of each condition. With regard to similarity, the procedure based on WordNet synsets appears to have been successful: as can be seen in Table 3 and Figs. 2 and 3, the two semantically similar conditions are generally rated as more similar than the two semantically dissimilar conditions. Many pairs in the semantically dissimilar conditions are distributed near the lower limit, which may explain why the peak of the similarity histogram in Fig. 2 is at the lower end of the scale. It is also clear from Table 3 and Fig. 3 that the PMI-based procedure for association had little effect on rated similarity.

Table 3 Number of pairs and mean scores (SD) of rated similarity and rated association in each condition set before survey
Fig. 3 Boxplots of rated similarity and rated association in each condition set before the survey

Regarding rated association, the procedure based on WordNet synsets also had an effect: overall, the two semantically similar conditions yielded higher association ratings than the two semantically dissimilar conditions. This is to be expected because, as discussed in the Introduction, similarity is a kind of association. On the other hand, the PMI-based procedure had only a minimal effect on the two semantically similar conditions, but a substantial effect on the two semantically dissimilar conditions, as is evident in Fig. 3.

3.2 Correlation between JWSAN and semantic spaces

In this section, we examine the applicability of JWSAN to the evaluation of distributional semantic models by analyzing their performance in predicting the similarity and association ratings. Eight semantic spaces were constructed from the combinations of four representative models and two training corpora. The distributional semantic models were positive pointwise mutual information with singular value decomposition (PPMI + SVD; Bullinaria & Levy, 2007), GloVe (Pennington et al., 2014), skip-gram with negative sampling (SGNS; Mikolov et al., 2013), and fastText (Bojanowski et al., 2017). The two training corpora were the Mainichi newspaper corpus (articles from 2000–2016), which includes approximately one billion word tokens, and the BCCWJ (National Institute for Japanese Language and Linguistics, 2011), which includes 120 million word tokens.

The vocabulary of each semantic space was determined as follows. For the newspaper corpus, we segmented all sentences using the morphological analyzer MeCab (ver. 0.996) with the IPAdic dictionary, yielding a vocabulary of 750,258 unique words. For the BCCWJ corpus, we used the UniDic tags attached to the texts, and all 175,800 unique words formed the vocabulary. All parameters for training the eight semantic spaces were identical: the vector dimension was 300, the context window was ten words on either side of the target word, and the number of negative samples for SGNS and fastText was five.
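As an illustration, training the SGNS and fastText spaces with these hyperparameters could look like the following gensim sketch; gensim is our illustrative choice here, and the paper does not state which implementation was used:

```python
from gensim.models import FastText, Word2Vec

# sentences: an iterable of token lists produced by morphological analysis
# (MeCab + IPAdic for the newspaper corpus, UniDic tags for BCCWJ).
sentences = [["犬", "が", "走る"], ["猫", "が", "眠る"]]  # toy corpus for illustration

sgns = Word2Vec(
    sentences,
    vector_size=300,  # vector dimension
    window=10,        # ten words on either side of the target
    sg=1,             # skip-gram
    negative=5,       # five negative samples
    min_count=1,      # keep all words for this toy example (cutoff not stated in the paper)
)

fasttext = FastText(
    sentences,
    vector_size=300,
    window=10,
    sg=1,
    negative=5,
    min_count=1,
)
```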

Figure 4 shows scatterplots and Pearson’s and Spearman’s correlation coefficients between the cosine values computed from the semantic spaces and the mean similarity or association ratings for all 2145 word pairs. The correlation coefficients between rated similarity and cosines ranged from 0.46 to 0.63 (Pearson) and from 0.54 to 0.67 (Spearman), whereas those between rated association and cosines ranged from 0.59 to 0.74 (Pearson) and from 0.63 to 0.75 (Spearman). For all the semantic spaces, the correlation for association was higher than that for similarity, which is consistent with results repeatedly observed for English word pairs (e.g., Hill et al., 2015; Levy et al., 2015).
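Evaluating a space against JWSAN then amounts to correlating cosine similarities with the mean ratings. A sketch continuing the gensim example above (the file and column names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical file with columns word1, word2, similarity, association.
jwsan = pd.read_csv("jwsan-2145.csv")

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Keep only pairs whose words are both in the vocabulary (OOV pairs are skipped).
vocab = sgns.wv.key_to_index
in_vocab = jwsan[jwsan["word1"].isin(vocab) & jwsan["word2"].isin(vocab)]
cosines = [cosine(sgns.wv[w1], sgns.wv[w2])
           for w1, w2 in zip(in_vocab["word1"], in_vocab["word2"])]

for col in ("similarity", "association"):
    r, _ = pearsonr(cosines, in_vocab[col])
    rho, _ = spearmanr(cosines, in_vocab[col])
    print(f"{col}: Pearson={r:.2f} / Spearman={rho:.2f}  (n={len(in_vocab)})")
```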

Fig. 4 Scatterplots of rated similarity and rated association in JWSAN against cosines of the semantic spaces. Note: Values before and after the solidus (/) are Pearson's and Spearman's correlation coefficients, respectively; n is the number of pairs used to calculate the correlation coefficients. All correlation coefficients were statistically significant (ps < 0.01)

However, in the case of JWSAN, the differences in correlation between similarity and association were not large, because the correlations for similarity were relatively high. In experiments using English datasets, correlation coefficients between cosines computed from semantic spaces and similarity ratings have generally been low. For example, Hill et al. (2015) reported that the correlation for similarity in SimLex-999 was 0.28 for the Mikolov et al. (2013) skip-gram model and 0.23 for the PMI + SVD model, whereas the correlation for association in WordSim353 was 0.44 and 0.38 for the skip-gram and PMI + SVD models, respectively. Similarly, Levy et al. (2015) reported a maximum correlation of 0.44 between cosines and the rated similarity of SimLex-999, but a maximum correlation of 0.79 between cosines and the rated association of WordSim relatedness. All these findings suggest that the similarity ratings in JWSAN are easier to predict than those in the English dataset SimLex-999. One possible reason for the unexpectedly high correlations between the similarity ratings of JWSAN and the model predictions is the biased distribution of similarity shown in Fig. 2. Many word pairs in JWSAN were judged to be dissimilar: 1005 pairs (46.9%) have a similarity score of 1.5 or lower. Such dissimilar pairs are likely to have low cosine values regardless of the performance of the distributional semantic models, and these easy pairs would make the correlation between similarity ratings and cosines appear higher than it actually is.

3.3 Analysis of gender and age differences

We classified participants aged 20–39 (n = 2580) as young and those aged 50–69 (n = 2580) as old. The ratings of participants aged 40–49 (n = 1290) were not used when calculating the values for each age group. The numbers of participants and the means (SDs) of rated similarity and association are listed in Table 4. Scatterplots and Pearson’s and Spearman’s correlation coefficients are shown in Fig. 5.

Table 4 Numbers of participants, means, and SDs of rated similarity and association for each category of gender and age
Fig. 5 Scatterplots and boxplots of rated similarity and rated association for each category of gender and age. Note: Correlation coefficients before and after the solidus (/) are Pearson's and Spearman's correlation coefficients, respectively. All correlation coefficients were statistically significant (ps < 0.01)

For rated similarity, we performed an ANOVA with items as the random variable (n = 2145). The independent variables were gender (male, female; within items) and age (young, old; within items). The ANOVA showed significant main effects of gender and age and a significant interaction (Fs(1, 2144) = 305.99, 31.80, 147.11, ps < 0.01). All simple main effects were significant: the gender effects for the young and old groups and the age effects for males and females (Fs(1, 2144) = 20.90, 31.61, 10.09, 11.81, ps < 0.01). Male participants generally rated the word pairs as more similar than did female participants; however, the effect of age differed between males and females: old males rated the pairs as more similar than young males, whereas old females rated the pairs as less similar than young females.
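A repeated-measures ANOVA of this form could be run as follows, assuming a long-format table with one mean rating per item × gender × age cell; statsmodels is an illustrative choice, as the paper does not state the software used:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# long: columns 'item', 'gender', 'age', 'rating' -- one mean similarity
# rating per word pair and gender x age cell (items are the random variable).
long = pd.read_csv("similarity_by_item_gender_age.csv")  # hypothetical file

res = AnovaRM(long, depvar="rating", subject="item",
              within=["gender", "age"]).fit()
print(res)  # F statistics for the two main effects and the interaction
```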

In the same manner, we conducted an ANOVA for rated association (n = 2145). The main effect of gender and the interaction were significant (Fs(1, 2144) = 255.54, 7.10, ps < 0.01), and the main effect of age was marginally significant (F(1, 2144) = 3.58, p = 0.059). The simple effects of gender for young and old participants and the simple effect of age for females were significant (Fs(1, 2144) = 20.90, 31.61, 10.09, 11.81, ps < 0.01), whereas the simple effect of age for males was not significant (F(1, 2144) = 0.13, n.s.). In contrast with the results for similarity ratings, female participants rated word pairs as more associated than males did. In addition, old females rated the pairs as more associated than young females did.

Tables 5, 6, 7, and 8 list the top 10 word pairs with the largest gender and age differences. In summary, under both the similarity and association rating conditions, males tended to give higher ratings to economic, political, and abstract noun pairs (e.g., [expenditure (支出), fee (費用)], [politics (政治), decay (腐敗)]). Females gave higher ratings to adjective pairs (e.g., [dazzling (眩しい), beautiful (美々しい)]) and verb pairs (e.g., [to lower (引き下げる), to drop (落とす)]). Although the difference between young and old participants was less clear, some pairs showed absolute differences greater than one point (e.g., [strange (可笑しい), curious (物珍しい)] in Table 7, and [rail track (線路), railway (鉄道)] and [ladder (梯子), gradually (段々)] in Table 8).

Table 5 Top 10 pairs in which similarity and association scores were rated higher by males
Table 6 Top 10 pairs in which similarity and association scores were rated higher by females
Table 7 Top 10 pairs in which similarity and association scores were rated higher by younger participants
Table 8 Top 10 pairs in which similarity and association scores were rated higher by older participants

Based on the results of the ANOVAs, relatively large differences were caused by gender rather than age. Because the absolute levels of rated similarity and association differ between males and females, this should be kept in mind when selecting raters. However, as shown in Fig. 5, the correlation coefficients of rated values between male and female participants, and between young and old participants, were very high. This suggests that the relative differences in rated values among word pairs do not depend on gender or age. Furthermore, the correlation coefficients between rated similarity and rated association shown in Fig. 6 suggest that the relative relationship between the two ratings is also largely unaffected by gender and age.

Fig. 6 Scatterplots and boxplots between rated similarities and between rated associations for male and female participants and for younger and older participants. Note: Values before and after the solidus (/) are Pearson's and Spearman's correlation coefficients, respectively. All correlation coefficients were statistically significant (ps < 0.01)

3.4 Comparison with other Japanese datasets

It is useful to compare JWSAN with the few existing Japanese datasets: JWSD (Sakaizawa & Komachi, 2018) and jBATS (Karpinska et al., 2018). Table 9 compares JWSAN and JWSD in terms of the number of pairs, OOV percentages, and representative values. In terms of the number of pairs per part of speech, JWSAN contains more nouns, whereas JWSD contains more verbs and adjectives. Furthermore, JWSD is unique in that it contains adverbs, which are not included in JWSAN. When we examined the pairs common to JWSAN and JWSD, we found very few: seven noun pairs and nine adjective pairs. This probably results from the fact that the procedures for creating word pairs differed between the two datasets.

Table 9 Comparison of JWSAN and JWSD

In terms of OOV words, there are many in JWSD. Whereas we were able to compute cosines for 2073 pairs (96.6%) of JWSAN (2145 pairs) with the newspaper-based spaces and for all 2145 pairs (100%) with the BCCWJ-based spaces, only 1928 pairs (43.5%) and 1102 pairs (24.9%) of JWSD (4429 pairs) could be computed with the newspaper- and BCCWJ-based spaces, respectively. This is probably because JWSD contains numerous compound words consisting of two or more morphemes, as well as past-tense verbs, as noted by Karpinska et al. (2018).
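Coverage of this kind can be checked in a few lines, continuing the illustrative names from the sketches above:

```python
def coverage(pairs, vocab):
    """Count and fraction of word pairs whose members are both in the vocabulary."""
    usable = [(w1, w2) for w1, w2 in pairs if w1 in vocab and w2 in vocab]
    return len(usable), len(usable) / len(pairs)

# e.g., coverage(jwsan_pairs, sgns.wv.key_to_index)
# -> (2073, 0.966) for JWSAN with the newspaper-based space
```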

Comparing the representative values: the neutral value of the JWSAN scale is 3, whereas that of the JWSD scale is 5. For JWSAN, most similarity values are smaller than 3, as can be seen from the median, and the median association value is 3.19, which is close to 3. In contrast, JWSD has a median of 6.8, which is considerably larger than 5. These differences probably result from differences in the word pairs used for rating and differences in the instructions.

Furthermore, we applied the comparison between semantic spaces and datasets performed for JWSAN (Fig. 4) to JWSD (Table 10). The correlation values ranged from 0.20 to 0.25 for the newspaper-based spaces and from 0.23 to 0.31 for the BCCWJ-based spaces. Because the correlation coefficients for JWSAN similarity shown in Fig. 4 range from 0.46 to 0.62 for the newspaper corpus and from 0.46 to 0.67 for BCCWJ, those for JWSD are clearly lower. This might be caused by differences in the distributions of rated similarity or in the part-of-speech ratios between JWSAN and JWSD, as shown in Table 9.

Table 10 Correlation coefficients of rated similarities in JWSD and cosines of semantic spaces

In addition, the difference in word frequency between JWSAN and JWSD might also affect the results. For comparison, we created histograms of word frequencies (Fig. 7) by the following procedure: first, we removed word pairs containing OOV words from each dataset; next, we made lists of the unique words contained in the remaining pairs; finally, we obtained each word's frequency in the newspaper and BCCWJ corpora. Figure 7 shows that JWSD includes more low-frequency words than JWSAN. Given the difficulty of explaining rare-word similarities with semantic spaces (Luong et al., 2013), this could be one reason for the low correlation coefficients in Table 10.
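A sketch of this comparison, assuming per-corpus frequency dictionaries have been precomputed (all names are illustrative):

```python
import math
import matplotlib.pyplot as plt

def unique_words(pairs, vocab):
    """Unique in-vocabulary words occurring in a dataset's word pairs."""
    return {w for pair in pairs for w in pair if w in vocab}

# freq: dict mapping word -> frequency in one corpus (precomputed)
for name, pairs in [("JWSAN", jwsan_pairs), ("JWSD", jwsd_pairs)]:
    logfreq = [math.log10(freq[w]) for w in unique_words(pairs, freq)]
    plt.hist(logfreq, bins=30, alpha=0.5, label=name)
plt.xlabel("log10 word frequency")
plt.ylabel("number of words")
plt.legend()
plt.show()
```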

Fig. 7 Histograms of the log10 word frequencies for JWSAN and JWSD, computed from the newspaper and BCCWJ corpora

jBATS is a dataset of four types of relations: inflectional_morphology, derivative_morphology, encyclopedic_semantics, and lexicographic_semantics. Each type has 10 categories, and each category has 50 examples. For example, in category L07 [synonyms—intensity] of lexicographic_semantics, the word young (若い) is paired with the candidate words immature (幼い), small (小さい), youth (年少), childish (幼稚), naïve (稚い), child (子供), and baby (赤ちゃん). Pairing every example with its candidates across all types and categories (e.g., young–immature, young–small, …) yields 13,897 pairs. Regardless of pair order, only 13 pairs were common with JWSAN (nouns: 6, verbs: 1, adjectives: 6), all belonging to the lexicographic_semantics type: seven pairs in L07 [synonyms—intensity], three in L08 [synonyms—exact], one in L02 [hypernyms—misc], one in L03 [hyponyms—misc], and one in L09 [antonyms—gradable]. As specific examples, JWSAN contains the pair small–young, with rated similarity 1.92 and rated association 3.02, and the pair immature–young, with rated similarity 3.75 and rated association 4.10.

As with JWSD, very few pairs were shared with jBATS. In future work, to increase comparability, it would be worthwhile to assign rated similarity and rated association to the JWSD and jBATS pairs in the same manner as in the present study.

4 Conclusion

In the present study, we created JWSAN, a dataset of similarity and association ratings for Japanese word pairs. JWSAN is the first Japanese dataset that includes both similarity and association ratings. A further characteristic of JWSAN is that a sufficient number of age- and gender-controlled participants rated the word pairs under instructions that clearly distinguish similarity from association. We hope that the dataset will be widely used to assess the performance of Japanese distributional semantic models.