
1 Introduction

In many NLP problems, including information retrieval [16], spelling correction [3] and noun compound interpretation [14] among many others, knowing the semantic similarity or semantic relatedness of words can be very useful. Although distributional semantic models (DSMs) [1] calculating these measures have many possible parameters, most research focuses on only one or two aspects of these models, while using conventional settings for the rest of the parameters (e.g. cosine as vector similarity and (positive) pointwise mutual information as weighting). A truly comprehensive study evaluating the numerous parameters of DSMs is therefore still missing for any language, and would be needed, as also suggested by [21]. Moreover, despite the fact that the best parameter settings can differ across languages, the vast majority of papers consider DSMs for only one language (mostly English), or consider multiple languages but without a real comparison of findings across languages. In this article we address these gaps.

DSMs in general have two distinct phases: the extraction of statistical information from raw text, and the creation and comparison of feature vectors for words based on the extracted information. Within this study we focus on the parameters of the second phase, as the two phases are relatively independent of each other, and the number of possible combinations of parameter settings (CPSs) for the second phase alone is already well in the trillions. We therefore search for the best CPS of the 10 parameters we considered important in the creation and comparison of feature vectors, separately for English, Spanish and Hungarian, and then compare our findings across the languages.

A detailed description of our analysis can be found in [13], where tests were done only for English, without any comparison of findings across languages, and with far fewer settings tested for several parameters.

2 Background

Although there are a vast number of studies dealing with DSMs, most of them consider only one or two parameters of these models, and take the others for granted with some standard setting. Most commonly, these models use cosine as vector similarity [1, 2, 6, 8, 17, 21, 27, 29, 30, 32, 34, 35] and (positive) pointwise mutual information as weighting [2, 17, 18, 21, 29, 30, 35]. Further, they usually do not consider the interaction of these parameters, and experiment with the chosen parameters one by one rather than simultaneously. There are, of course, some studies that experiment with several parameters with multiple possible settings [7, 19, 20], but even these are far from being truly comprehensive.

Moreover, most models have been tested only on English and neglect other languages, despite the fact that DSMs might work differently across languages. There are, of course, several studies in which results were presented for languages other than English, including Spanish [4, 15, 23] and Hungarian [11, 24]. However, even those that include multiple languages usually only present test results for the different languages separately, without any real analysis of the differences in the findings between the languages.

3 Data and Evaluation Methods

For input we used information extracted from the British National Corpus (BNC), the Spanish Wikicorpus (EsWiki) [28], and the 23.01.2012 dump of the Hungarian Wikipedia (HuWiki) for English, Spanish and Hungarian, respectively, with the help of the information extraction methods of [11], [12] and [29].

Tests were done on parts of the MEN [2] dataset for English; the Spanish WordSimilarity-353 [15], the Moldovan [23] and the Spanish Rubenstein-Goodenough [5] datasets for Spanish; and parts of the Hungarian version of the TOEFL [11] and Rubenstein-Goodenough datasets for Hungarian. The last of these was constructed in the same way as the Hungarian TOEFL and Miller-Charles datasets in [11].

Out of these datasets only the MEN dataset is truly reliable: the others are rather small and, except for the Moldovan dataset, were simply translated from English datasets, a process during which they can become distorted. The Hungarian datasets are especially small, and the multiple-choice nature of the TOEFL dataset makes the results on it even less reliable compared to the other datasets. However, due to the lack of truly suitable resources, we had to settle for these.

In case of the TOEFL dataset, the accuracy (A) of the models on the questions was calculated, while in case of all the other datasets, the Pearson's (P) and Spearman's (S) correlations with the gold standard scores, as well as their modified harmonic mean (H), were computed as follows:

$$\begin{aligned} H(P, S) = \frac{2 \times P \times S}{|P|+|S|} \end{aligned}$$
(1)
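
For illustration, Eq. (1) translates directly into code; the following minimal sketch (function names are ours, using the standard SciPy correlation routines) shows how P, S and H can be computed for a dataset:

```python
from scipy.stats import pearsonr, spearmanr

def modified_harmonic_mean(p, s):
    # Modified harmonic mean H of Eq. (1): the absolute values in the
    # denominator keep H well-defined even if one correlation is negative.
    return (2 * p * s) / (abs(p) + abs(s))

def evaluate(model_scores, gold_scores):
    # Pearson (P) and Spearman (S) correlations of the model's similarity
    # scores with the gold-standard scores, plus their modified harmonic mean.
    p, _ = pearsonr(model_scores, gold_scores)
    s, _ = spearmanr(model_scores, gold_scores)
    return p, s, modified_harmonic_mean(p, s)
```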

For more information, please refer to [13].

4 Our Heuristic Analysis

As the number of possible CPSs is in the magnitude of trillions, we had to use a heuristic approach to find the best one for each language. First, each parameter was tested separately on a development dataset, and a candidate list of settings was selected for each parameter. Then all combinations of the selected settings of all parameters were tested on a different dataset to find the best CPS. This was done for the three languages separately.
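
The following sketch illustrates this two-step procedure; it is only a schematic outline under our assumptions (the `evaluate` function, the shortlist size and the use of default settings for the untested parameters are hypothetical placeholders, not the exact procedure used):

```python
from itertools import product

def two_phase_search(parameters, dev_set_1, dev_set_2, evaluate, shortlist=5):
    # Phase 1: test each parameter separately (other parameters at defaults)
    # on the first development set and keep the most promising settings.
    candidates = {}
    for name, settings in parameters.items():
        scored = sorted(((evaluate({name: s}, dev_set_1), s) for s in settings),
                        key=lambda t: t[0], reverse=True)
        candidates[name] = [s for _, s in scored[:shortlist]]

    # Phase 2: exhaustively test all combinations of the shortlisted settings
    # on a different dataset and return the best CPS.
    names = list(candidates)
    best_score, best_cps = float("-inf"), None
    for combo in product(*(candidates[n] for n in names)):
        cps = dict(zip(names, combo))
        score = evaluate(cps, dev_set_2)
        if score > best_score:
            best_score, best_cps = score, cps
    return best_cps, best_score
```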

The 10 parameters tested, together with the number of settings tried for them, are listed in Table 1. For several parameters, a large number of novel settings were tested. These were either brand new settings, modified versions of conventionally used settings, or combinations of multiple settings. A detailed description of the tested parameters and their settings can be found in [13].

We have to note that when using singular value decomposition (SVD) for dimensionality reduction or the various smoothing options, we usually did a smaller number of runs than in other cases due to our limited resources. Further, in case of Spanish we had to set MWFFreq to 3 instead of NoLimit when using SVD, as otherwise the number of features would have made running SVD unmanageable.

First we performed this two-step heuristic analysis for English and evaluated the results extensively in [13]. Then we repeated the same analysis, with a greatly increased number of settings for several parameters, for English, Spanish and Hungarian, and in this article we compare the findings across the different languages.

5 Results

5.1 Results of the First Phase

During the first phase of our analysis multiple runs were done for each setting of every parameter, and the most promising settings for each parameter were selected to be included in the second phase. For English we used half of the development part of the MEN dataset for evaluation, while for Spanish the Spanish WordSimilarity-353 dataset and for Hungarian half of the Hungarian TOEFL dataset were employed. The top 5 performing settings for each parameter and language are listed in Table 2.

Table 1. The tested parameters, with the number of settings tested for each

Although presenting the definition of all settings for every parameter would be impossible within this article due to their large number, below we briefly define a couple of them to help interpret our most important results.

In case of vector similarity measures, we have defined many new variants based on one or more conventional measures. For example, the best one for English is a combination of the Pearson, MarylandBridge [9] and AdjCos [31] measures, with some additional transformations:

$$\begin{aligned} \begin{aligned}&PearsMbAdjCosMod\text {-}3.Lb(u,v)= {\left\{ \begin{array}{ll} 1, &{} d \ge 0.1 \\ \frac{d}{0.1}, &{} d<0.1 \end{array}\right. } \\&d=0.5 \times \Bigg (\frac{\sum _{i=1}^n sgn(u_i-\bar{u}) \times lb(|u_i-\bar{u}|+1) \times sgn(v_i-\bar{v}) \times lb(|v_i-\bar{v}|+1)}{lbinv\left( \sum _{i=1}^n (lb(|u_i-\bar{u}|+1))^2\right) } \\&\qquad + \frac{\sum _{i=1}^n sgn(u_i-\bar{u}) \times lb(|u_i-\bar{u}|+1) \times sgn(v_i-\bar{v}) \times lb(|v_i-\bar{v}|+1)}{lbinv\left( \sum _{i=1}^n (lb(|v_i-\bar{v}|+1))^2\right) }\Bigg ) \\&lbinv(x) = min(max(sgn(x) \times (2^{|x|}-1), -2^{100}), 2^{100}) \end{aligned} \end{aligned}$$
(2)
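
The measure may be easier to follow in code; the sketch below is our reading of Eq. (2) (variable names are ours), with the exponent in lbinv capped only to avoid floating-point overflow, which does not change the clipped result:

```python
import numpy as np

def lbinv(x):
    # Inverse of the signed binary-log transform, clipped to +/- 2**100 (Eq. 2).
    # The exponent is capped at 101 purely to avoid overflow; the clip dominates.
    e = min(abs(float(x)), 101.0)
    return float(np.clip(np.sign(x) * (2.0 ** e - 1.0), -(2.0 ** 100), 2.0 ** 100))

def pears_mb_adjcos_mod_3_lb(u, v):
    # Mean-centre both vectors, then apply a sign-preserving binary-log
    # dampening before taking the Pearson/MarylandBridge/AdjCos-style average.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    du, dv = u - u.mean(), v - v.mean()
    tu = np.sign(du) * np.log2(np.abs(du) + 1.0)
    tv = np.sign(dv) * np.log2(np.abs(dv) + 1.0)
    num = float(np.dot(tu, tv))
    d = 0.5 * (num / lbinv(np.sum(tu ** 2)) + num / lbinv(np.sum(tv ** 2)))
    return 1.0 if d >= 0.1 else d / 0.1
```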

On the other hand, the best vector similarity measure for Spanish is rather different. It is a modified and transformed version of the Hindler measure [22]:

$$\begin{aligned} \begin{aligned}&LinHindleRMod\text {-}7.1.2.Cu(u,v)= \frac{\root 3 \of {\sum _{i=1}^n lhr\text{- }1.Cu(u_i, v_i)}}{\sqrt{\sum _{i=1}^n u_i^2} \times \sqrt{\sum _{i=1}^n v_i^2}} \\&lhr\text {-}1.Cu(x, y) = {\left\{ \begin{array}{ll} min(x^3, y^3), &{} x \ne 0 \wedge y \ne 0 \\ 0, &{} otherwise \end{array}\right. } \end{aligned} \end{aligned}$$
(3)
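
Again, the sketch below gives our reading of Eq. (3) in code (variable names are ours):

```python
import numpy as np

def lhr_1_cu(x, y):
    # lhr-1.Cu of Eq. (3): cubed minimum where both features are non-zero.
    return min(x ** 3, y ** 3) if x != 0 and y != 0 else 0.0

def lin_hindle_r_mod_7_1_2_cu(u, v):
    # Cube root of the summed lhr-1.Cu terms, normalised by the product of
    # the Euclidean norms of the two feature vectors.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    s = sum(lhr_1_cu(x, y) for x, y in zip(u, v))
    return float(np.cbrt(s) / (np.linalg.norm(u) * np.linalg.norm(v)))
```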

Weighting schemes were constructed similarly to the vector similarity measures, and here too there were numerous new variants. For example, the best one for English is a combination of PMIAlpha [21], PMI with Laplace smoothing [33], Unisubtuples [26] and PMI with a discounting factor [25]:

$$\begin{aligned} \begin{aligned}&PmiAl\text {-}Tc3Tw0S2P4(x,y)= \frac{f_{xy}'}{f_{xy}'+1} \times \frac{min(f_x',f_y')}{min(f_x',f_y')+1} \\&\qquad \times \Bigg (lb\left( \frac{n'_{\alpha } \times f_{xy}'}{f_x' \times f_y'^{0.75}}\right) -3.29 \times \sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\Bigg ) \\&a=f_{xy}' \text{, } \quad b=f_x'-f_{xy}' \text{, } \quad c=f_y'-f_{xy}' \text{, } \quad d=n'-f_x'-f_y'+f_{xy}' \\&f_x' = f_x+1 \text{, } \ f_y'=f_y+1 \text{, } \ f_{xy}'=f_{xy}+1 \text{, } \ n'=n+1 \text{, } \ n'_{\alpha } = \left( \sum _{i=1}^{|V|} f_i^{0.75}\right) +1 \\&f_x, f_{y} \text{: } \text{ word } \text{ frequencies}\text{, }\quad f_{x,y} \text{: } \text{ xy } \text{ tuple } \text{ frequency } \\&n \text{: } \text{ total } \text{ number } \text{ of } \text{ words } \text{ in } \text{ the } \text{ corpus}\text{, }\quad |V| \text{: } \text{ size } \text{ of } \text{ the } \text{ vocabulary } \end{aligned} \end{aligned}$$
(4)
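
In code, our reading of Eq. (4) looks as follows (a sketch only; the argument names are ours, and freqs stands for the list of all word frequencies used for the alpha-smoothed normaliser):

```python
import math

def pmi_al_tc3tw0s2p4(f_x, f_y, f_xy, n, freqs):
    # Add-one (Laplace) smoothed counts, as in Eq. (4)
    fx, fy, fxy, n1 = f_x + 1, f_y + 1, f_xy + 1, n + 1
    n_alpha = sum(f ** 0.75 for f in freqs) + 1     # alpha-smoothed normaliser
    # Discounting factors applied to the weighted PMI term
    disc = (fxy / (fxy + 1)) * (min(fx, fy) / (min(fx, fy) + 1))
    # Alpha-smoothed PMI with Laplace smoothing
    pmi = math.log2((n_alpha * fxy) / (fx * fy ** 0.75))
    # Unisubtuples-style correction term (assumes b, c and d are non-zero)
    a, b, c, d = fxy, fx - fxy, fy - fxy, n1 - fx - fy + fxy
    correction = 3.29 * math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return disc * (pmi - correction)
```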

In case of feature transformation, we experimented with transforming either raw frequencies or weights, both before and after normalization, and 7 different transformation functions were tried in all cases. For smoothing, we tried different versions of Kneser-Ney smoothing [10]. In case of dimensionality reduction, we tried a couple of different techniques, including SVD and the method of [18]. For minimum limits on word-feature weights we tried the following two novel variants with multiple limit values:

$$\begin{aligned} limit(w, minValue)= {\left\{ \begin{array}{ll} w &{} if\ w \ge minValue\\ minValue &{} otherwise \end{array}\right. } \end{aligned}$$
(5)
$$\begin{aligned} zero(w, minValue)= {\left\{ \begin{array}{ll} w &{} if\ w \ge minValue\\ 0 &{} otherwise \end{array}\right. } \end{aligned}$$
(6)

The other parameters are much less complex and more commonly used in NLP, thus Table 1 should provide a sufficient understanding of them.

A more detailed description of the different parameters and settings can be found in [13] (although with far fewer settings for several parameters).

Table 2. The top 5 performing settings for each parameter in case of all 3 languages, in descending order of maximum H scores

5.2 Results of the Second Phase

In the second phase all possible combinations of the selected settings of the parameters were tested for all three languages, in order to find the best CPS for each language. The second half of the development part of the MEN dataset was used for testing in case of English, while the Moldovan dataset and the second part of the Hungarian TOEFL dataset were used for Spanish and Hungarian, respectively. The top 5 performing CPSs for each language are presented in Table 3.

Table 3. The top 5 performing CPSs for each language with their achieved scores, in descending order of maximum H scores

5.3 Results on the Test Datasets

The best CPS for English was tested on the test part of the MEN dataset (MT), and the best CPSs for Spanish and Hungarian were tested on the respective versions of the Rubenstein-Goodenough dataset (RG) to obtain the final results. The best CPS of each language was also evaluated on the datasets of the other languages to provide a means of comparison. The results of these tests can be found in Table 4.

Table 4. Results on the test datasets, in descending order of maximum H scores

6 Evaluation and Discussion

In this section we evaluate the results presented in the previous sections. Please note that the scores are not fully comparable across languages, even when considering the same datasets in different languages, as, except for the Moldovan dataset, all of the Spanish and Hungarian datasets used were constructed by translating the English versions, and thus the results on them can be distorted and less reliable than on their English counterparts. Furthermore, the Spanish and Hungarian datasets, especially the latter, are rather small, which also makes them less reliable than the English ones.

As there are many differences in the syntax and morphology of the different languages, we anticipated from the beginning that there would be at least some small differences in our findings for the different languages. However, our intuition was that these differences would be subtle, and that we would be able to find good and rather language-independent CPSs. As English and Spanish belong to the family of Indo-European languages, while Hungarian does not, we expected the results for English and Spanish to be similar. Further, as both Spanish and Hungarian have very rich morphology, we also expected a higher similarity between our results for Spanish and Hungarian. We anticipated the fewest similarities between English and Hungarian, as these languages are the least similar to each other.

In the first phase of our analysis we could observe that some of the parameters worked exactly the same way or very similarly across languages. These parameters were the weighting scheme, feature transformation, vector normalization and minimum limits on word-feature frequencies. These findings are in line with our initial intuitions. Dimensionality reduction seemed to be similar for English and Spanish, while a bit different for Hungarian. Smoothing seemed to perform similarly for Spanish and Hungarian, while differently for English. Minimum limits on word-feature weights seemed to behave a bit differently for all three languages. However, it was interesting to see that the results for vector similarity measures, stop-word filtering and minimum limits on feature frequencies were rather similar for English and Hungarian, but different for Spanish, which is contrary to what we anticipated.

In the second phase, although there were similarities in the best CPSs found across the different languages, one could also observe many differences. Here too, the weighting schemes, feature transformation and minimum limits on word-feature frequencies were mostly similar. In contrast to the first phase, vector similarity, smoothing and minimum limits on feature frequencies were now also alike for all languages. The other parameters showed a different behaviour for at least one language compared to the others.

We have to note here that there were actually two distinct CPSs with the same best score for English, which differed only in their DimRed parameter setting. We chose the one with the "IslamInkpen 0.05" setting as BestEn, as that setting achieved better performance in the first phase than the "NoDimRed" setting of the other CPS. Furthermore, for Hungarian there were even more CPSs with the same best score, and we used a similar approach to select the BestHu version as we did for the BestEn version. However, as these different CPSs with the same best results have different settings for some parameters, one has to be careful when drawing conclusions from the best CPSs of the different languages, and any such conclusions should be taken with some reservations.

The final conclusions for the parameters are the following:

  • VecSim: for all languages, measures based on cosine similarity achieve the best results

  • Weight: measures based on PMI dominate the top of the table by far in case of all languages

  • FeatTransf: no transformation and transforming the word-feature weights after normalization perform best for all languages

  • DimRed: dimensionality reduction seems to help in most situations: while in case of English the IslamInkpen version performed the best alongside no dimensionality reduction, for Spanish and Hungarian SVD is superior to these options

  • Smooth: the no smoothing option clearly outperforms all others for all languages

  • VNorm: for English the L1 option clearly seems to be the best, while for Spanish and Hungarian the best CPSs use either L1 or L2 normalization, and most CPSs achieve the same or very similar results with either

  • StopW: stop-word filtering seems to improve the results to some extent in case of Spanish, while it does not in case of English and Hungarian

  • MWFFreq: no limit is by far superior to the other options for all languages

  • MWFWeight: no limit seems to be the best option in case of Spanish and Hungarian, while the Zero option with different parameters seems to excel in case of English

  • MFFreq: a low limit or no limit seems to be best in case of all languages (as noted before, in case of SVD for Spanish we had to use a limit of 3 instead of no limit for computational reasons)

As we anticipated, there were parameters where the results for Spanish and Hungarian were similar, but different for English. However, it was interesting that we did not find any parameters that were alike for English and Spanish but different for Hungarian. Further, to our surprise, we found a parameter where the results were similar for English and Hungarian but different for Spanish. These latter findings were in contrast to our initial intuition.

Although all Spanish scores in the second phase are much lower than the English and Hungarian ones, this is almost completely due to the dataset used, and does not mean that the Spanish CPSs found are worse than their English and Hungarian counterparts, as noted at the beginning of this section and as can also be seen from our results on the test datasets (see Table 4). It simply suggests that the dataset used for Spanish in this phase is considerably tougher than the ones used for English and Hungarian.

Table 5. Comparison of our best English CPS with a conventional and a state-of-the-art CPS, using the information extracted from the BNC with the method of [29] and the MT dataset for all tests

It was interesting to see that in the cross-language experiments on the test datasets, the performance ranking of the best CPSs of the different languages differs across the datasets of the three languages. The best English CPS was always superior to its Spanish counterpart, but it had no absolute superiority over the best Hungarian CPS. Further, there is also no clear ranking between the best Spanish and Hungarian CPSs. It was also interesting to see that on the Spanish dataset, although the best Spanish CPS achieved rather good results, it actually achieved the lowest score out of the three best CPSs tested. The same was true of the best Hungarian CPS on the Hungarian dataset.

All in all, there seems to be no clear ranking between the best CPSs of the different languages, and all of them achieved good results on the datasets of all languages. So, although we obtained different best CPSs for the different languages, all of them seem to be rather language-independent. These findings strongly suggest that our heuristic approach was sound, and that the best CPSs found for all languages, and their results, are robust and reliable.

To further demonstrate that our heuristic approach was successful, that our results are robust and reliable, and that the best CPSs we found perform much better than conventional CPSs, we compared our best English CPS with the conventional cosine with positive pointwise mutual information setting (CosPPmi) and the state-of-the-art original settings combination (OSC) of [29], using the information extracted from the BNC with the method of [29] and the MT dataset for all tests. The results of these tests, presented in Table 5, clearly show that our best CPS is robust, and is superior not just to conventional settings but also to a current state-of-the-art CPS.
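
For reference, the conventional CosPPmi baseline corresponds to positive pointwise mutual information weighting of the word-feature co-occurrence counts combined with cosine vector similarity; the following is a minimal sketch of our understanding of this baseline (matrix layout and names are ours):

```python
import numpy as np

def ppmi_matrix(counts):
    # Positive PMI weighting of a word-by-feature co-occurrence count matrix.
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_xy = counts / total
    p_x = counts.sum(axis=1, keepdims=True) / total
    p_y = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0      # zero co-occurrence cells get weight 0
    return np.maximum(pmi, 0.0)

def cosine(u, v):
    # Cosine similarity between two weighted feature vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```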

The fact that the best CPSs found in the second phase are not simply made up of the best parameter settings from the first phase confirms our intuition that the parameters of DSMs need to be tested simultaneously, rather than separately.

7 Conclusions

Within this article we have presented a systematic analysis of the parameters involved in the creation and comparison of feature vectors in distributional semantic models for English, Spanish and Hungarian, including novel parameters and novel parameter settings. To the best of our knowledge, we are the first to do such a detailed analysis of these parameters, and also to do such an extensive comparison of them across multiple languages.

With our heuristic approach we searched for the best combination of parameter settings for all three languages. In accordance with our intuition, there were several parameters that worked very similarly for all three languages. We also found parameters that were alike for Spanish and Hungarian but different for English, which we also anticipated. However, it was interesting to see that there was a parameter that worked similarly for English and Hungarian but not for Spanish, and that we did not find any parameters that worked similarly for the two Indo-European languages but differently for Hungarian.

Although we found that the very best results are produced by different settings combinations for the different languages, our cross-language tests showed that all of them work rather well for all languages. Based on this we think that our heuristic approach was successful, and that we were able to find combinations of parameter settings that are rather language-independent and give robust and reliable results. Further, our best English CPS, incorporating multiple novel parameter settings, significantly outperformed both conventional and state-of-the-art parameter combinations.

Although our results seem rather robust and reliable for Spanish and Hungarian too, it would be interesting to redo our analysis on larger and more reliable Spanish and Hungarian datasets, once such datasets become available, to check whether even better CPSs could be found for these languages.