Introduction

Language modeling and word vector representation form an interesting field of study, not only because word vectors have the ability to group similar words together, but also because the number of applications using these models is extensive. These include speech recognition [9], text correction [6], machine translation [10] and many other natural language processing (NLP) tasks [16].

One of the first vector space models (VSMs) was developed by the author of [17] and his colleagues [18]. A good survey of VSMs can be found in [20], and of VSMs based on neural networks in [8]. Not surprisingly, given the success of neural networks, there are many neural language models, for example [2, 3]. Current methods use a local context window and the logarithm of the counts of how many times words appear together within windows of various lengths; [15] and [14] are examples. The authors of [11] claim that learning word vectors via language modeling produces a syntactic rather than semantic focus, and in their work they aim to learn a semantic representation.

Similar works which try to find synonyms automatically, but without the use of word vectors, include [12] and [13].

In this work, we use the distance between words in the training dataset to train word vectors on higher-quality texts (almost 120,000 New York Times articles published between 1990 and 2010) instead of relying on a mix of higher- and lower-quality texts from the internet, as was done in Glove. A study of text quality can be found in [1]. Rather than claiming a semantic representation, which would be a significant step forward, we claim only word replaceability through the word vector representation. Two words are replaceable if they have approximately the same relationship with the other words in the vocabulary, and a relationship with other words is expressed through distances in the sentences they appear in. To test our results, we use 64,000 synonyms from ConceptNet 5.5.0 [19] instead of just the 353 related words in [5] or the 999 related words in [7]; hence, we verify our results in a statistically more robust way.

Outline The remainder of this article is organized as follows. The “Training Algorithm” section describes how we break the text into shorter forms, how we convert the text data into two-word combinations, and how we train the word vectors, in comparison with Glove. In the “Results” section, we compare our results with Glove on distributions of synonyms and non-synonyms from ConceptNet 5.5.0, which has 64,000 synonyms in its English dictionary intersected with the vocabulary of the training set, and we also show the words which end up closest together. Finally, in the “Conclusions” section we draw conclusions.

Training Algorithm

Similarly to the works of many others, we represent each word \(w_i\) in the vocabulary V by a vector \(v_i\) of size N. The main idea behind finding replaceable words is to describe their relationship with the other words in the vocabulary; words that should be found replaceable should end up, after training, in each other's vicinity, i.e., their word vectors should be separated by a low Euclidean distance. One way to do this is to count how many times a word appears with another word in individual context windows over the entire dataset, as was done in Glove [15]. In Glove, the main optimization function uses the dot product of word vectors \(v_i \cdot v_j = v^T_i v_j\):

$$\begin{aligned} J_{glove} = \sum _{i,j} f(X_{i,j}) (v_i^T \tilde{v_j} +b_i + \tilde{b_j} - \log X_{i,j})^2, \end{aligned}$$

where \(X_{i,j}\) are the counts of word combination appearances, f is a weighting function, \(\tilde{v_j}\) is a separate context word vector, and \(b_i\), \(\tilde{b_j}\) are biases. Hence, rather than using the Euclidean distance to measure word similarities, it is better to use cosine similarity on the trained vectors, since both the main optimization function and the cosine similarity are based on the dot product. In the “Results” section, we show both Euclidean distance and cosine similarity histograms for the Glove-trained word vectors and the new word vectors, as this simple test shows a significant shift.
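To make the two measures concrete, here is a minimal Python/NumPy sketch of cosine similarity and Euclidean distance as used throughout this article (the example vectors are made up):

```python
import numpy as np

def cosine_similarity(u, v):
    """Normalized dot product; the measure suited to Glove-style training."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance; the measure our optimization works with."""
    return float(np.linalg.norm(u - v))

# Hypothetical example: two word vectors of size N = 4.
v_cat = np.array([0.8, 0.1, -0.3, 0.5])
v_feline = np.array([0.7, 0.2, -0.2, 0.4])
print(cosine_similarity(v_cat, v_feline))   # close to 1 for similar words
print(euclidean_distance(v_cat, v_feline))  # close to 0 for similar words
```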

In order to achieve improved results over Glove, we go a step further: instead of only using counts of word co-appearances in the same context windows, we also measure the distances between words, \(d_{i_k,j_k}\), where \(d_{i_k,j_k}\) is the distance between words \(w_{i_k}\) and \(w_{j_k}\) in a sub-sentence. As two words can appear together many times in different sub-sentences and those distances can differ, we use the index k to distinguish these occurrences.

As a first step in training, we pre-process the dataset and break all sentences into sub-sentences, splitting them at commas, periods, and semicolons, in order to better capture short-distance relationships. In Glove, the authors used a word window which moved continuously through the texts and did not separate sentences. Next, we take all two-word co-appearances in these sub-sentences and use their distances, as illustrated in Fig. 1, for optimizing the word vectors.

Fig. 1 Using distances between word co-appearances for optimizing word vectors
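As an illustration of this pre-processing step in Fig. 1, a minimal Python sketch follows (tokenization is simplified; splitting only on the three punctuation marks and measuring the distance as the difference of word positions are our assumptions about the details):

```python
import re
from itertools import combinations

def word_pairs_with_distances(text):
    """Yield (word_i, word_j, distance) for every two-word co-appearance
    inside a sub-sentence, where distance is the difference of word positions."""
    for sub_sentence in re.split(r"[.,;]", text.lower()):
        words = sub_sentence.split()
        for (p_i, w_i), (p_j, w_j) in combinations(enumerate(words), 2):
            if w_i != w_j:                 # same-word pairs are skipped
                yield w_i, w_j, p_j - p_i  # distance within the sub-sentence

# Hypothetical example sentence
for pair in word_pairs_with_distances("the cat sat on the mat, the dog slept"):
    print(pair)
```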

The optimization function can be described as follows:

$$\begin{aligned} J_1 = \sum _{k=1}^{O} (\Vert v_{i_k} - v_{j_k} \Vert - d_{i_k,j_k})^2, \end{aligned}$$

where O is the number of all training two-word occurrences (the same tuple of two different words can occur more than once). As this would create too large a training dataset (O can be rather high), we simplify it by grouping the terms of the sum that belong to the same word tuple, and hence we get

$$\begin{aligned} J_1 = \sum ^{M}_{i,j} a_0(i,j) \Vert v_i - v_j \Vert ^2 + a_1(i,j) \Vert v_i - v_j \Vert + a_2(i,j), \end{aligned}$$

where \(M < O\) is the number of all word co-occurrences (every tuple of two different words appears only once), \(a_0(i,j)\) is the number of occurrences of the two-word tuple \((w_i,w_j)\), \(a_1(i,j)=-2\sum _k d_{i_k,j_k}\) is the sum of the distances between the same two words multiplied by \(-2\), and \(a_2(i,j)\) is the sum of the squared distances of the same two words. If the same word appears twice in the same sub-sentence, we do not include such a sample in training, as the Euclidean distance between a word vector and itself is always 0 irrespective of the optimization function, and such a sample would have no influence on the final result. Thus, all we have to store for training the word vectors is \(a_0(i,j)\) and \(a_1(i,j)\) for each word tuple \((w_i,w_j)\); \(a_2(i,j)\) does not have to be stored, since it represents only an offset of the optimization function and has no influence on the partial derivatives used for vector optimization. The optimization function then becomes

$$\begin{aligned} J = \sum ^{M}_{i,j} a_0(i,j) \Vert v_i - v_j \Vert ^2 + a_1(i,j) \Vert v_i - v_j \Vert . \end{aligned}$$

As we already mentioned, while Glove used only the \(a_0(i,j)\) information (how many times word tuples appeared, i.e., \(X_{i,j}\)), in our optimization we also use the extra information \(a_1(i,j)\).
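A minimal sketch of how the per-occurrence distances can be collapsed into \(a_0(i,j)\) and \(a_1(i,j)\) (in Python; folding \((w_i,w_j)\) and \((w_j,w_i)\) into one key is our illustrative choice, justified by the symmetry of the Euclidean distance):

```python
from collections import defaultdict

def aggregate(occurrences):
    """Collapse (w_i, w_j, d) occurrences into a_0 (count) and a_1 (-2 * sum of d)."""
    a0 = defaultdict(int)
    a1 = defaultdict(float)
    for w_i, w_j, d in occurrences:
        key = tuple(sorted((w_i, w_j)))  # treat (w_i, w_j) and (w_j, w_i) as one tuple
        a0[key] += 1
        a1[key] += -2.0 * d
    return a0, a1

# Hypothetical made-up occurrences: (word_i, word_j, distance in sub-sentence)
occurrences = [("january", "february", 2), ("january", "february", 5), ("cat", "dog", 3)]
a0, a1 = aggregate(occurrences)
# a0[("february", "january")] == 2 and a1[("february", "january")] == -14.0
```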

This optimization is equivalent to optimizing by remembering the count of two-word appearances and the average of their distances, since the only difference would be the constant part of the optimization function, which we cut off anyway as it has no influence on the partial derivatives. The optimization function could also be written as

$$\begin{aligned} J_2= & {} \sum ^{M}_{i,j} a_0(i,j) (\Vert v_i - v_j \Vert - {\overline{d}}_{i,j} )^2 \\= & {} \sum ^{M}_{i,j} \Big (a_0(i,j) \Vert v_i - v_j \Vert ^2 - 2 a_0(i,j) {\overline{d}}_{i,j} \Vert v_i - v_j \Vert + a_0(i,j) {\overline{d}}_{i,j}^2 \Big ) \end{aligned}$$

After removing the constant part of it, it becomes

$$\begin{aligned} J_3= & {} \sum ^{M}_{i,j} \Big ( a_0(i,j) \Vert v_i - v_j \Vert ^2 - 2 a_0(i,j) {\overline{d}}_{i,j} \Vert v_i - v_j \Vert \Big ) \\= & {} \sum ^{M}_{i,j} \Big ( a_0(i,j) \Vert v_i - v_j \Vert ^2 - 2 a_0(i,j) \frac{\sum _k d_{i_k,j_k}}{a_0(i,j)} \Vert v_i - v_j \Vert \Big ) \\= & {} \sum ^{M}_{i,j} \Big ( a_0(i,j) \Vert v_i - v_j \Vert ^2 + a_1(i,j) \Vert v_i - v_j \Vert \Big ) = J \end{aligned}$$

Hence, optimizing J, \(J_1\), \(J_2\), and \(J_3\) yields the same partial derivatives with respect to the word vector values, as the only thing which differs is the constant offset of the various versions of the optimization function.
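As a small numeric sanity check of this equivalence, one can verify on random data that \(J_1 - J\) does not depend on the word vectors (a sketch; the occurrence list and vector values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
N, V = 8, 5                                 # vector size and vocabulary size
occurrences = [(0, 1, 2.0), (0, 1, 4.0), (2, 3, 1.0), (1, 4, 3.0)]

def J1(vecs):
    return sum((np.linalg.norm(vecs[i] - vecs[j]) - d) ** 2 for i, j, d in occurrences)

def J(vecs):
    terms = {}
    for i, j, d in occurrences:             # a_0 is the count, a_1 is -2 * sum of d
        a0, a1 = terms.get((i, j), (0, 0.0))
        terms[(i, j)] = (a0 + 1, a1 - 2.0 * d)
    return sum(a0 * np.linalg.norm(vecs[i] - vecs[j]) ** 2 +
               a1 * np.linalg.norm(vecs[i] - vecs[j])
               for (i, j), (a0, a1) in terms.items())

v_a, v_b = rng.normal(size=(V, N)), rng.normal(size=(V, N))
print(np.isclose(J1(v_a) - J(v_a), J1(v_b) - J(v_b)))  # True: only a constant offset
```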

As a side note, in our optimization function \(a_0(i,j)\) can be thought of as a weight: the more samples of the same word tuple we have, the more robust the estimate of the average distance is, and hence we want to give higher importance to tuples which appear more often. It is also important to mention that in our optimization we do not train any biases and, more importantly, we do not train two vectors for each word as was done in Glove (one used as the main vector and one as a context vector).

The training can be summarized in the following steps:

  • Divide the text into sub-sentences by splitting it at commas, periods, and semicolons

  • Find the word tuples which appear in the same sub-sentence, and for each tuple count how many times it appeared and record the average distance (or the sum of the distances)

  • Use an optimization technique (we used gradient descent with a constant step size) to minimize the function J and find optimal values of the vectors \(v_i\)

The time complexity of computing the partial derivatives of the vector values in every epoch of gradient descent is \({\mathcal {O}}(M N)\), where M is the number of word tuples which appeared at least once and N is the chosen vector size. We used an NVIDIA GeForce GTX 1080 Ti GPU and C++ with CUDA to speed up the training process, instead of C++ with multi-threaded computations on the CPU as was done in Glove.
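A minimal Python/NumPy sketch of one possible implementation of these steps (our actual implementation is in C++ with CUDA; the learning rate, number of epochs, and the small epsilon guarding the norm are illustrative choices). For each stored tuple, the gradient of J with respect to \(v_i\) is \(2 a_0(i,j)(v_i - v_j) + a_1(i,j)\,\frac{v_i - v_j}{\Vert v_i - v_j \Vert }\), and the gradient with respect to \(v_j\) is its negative:

```python
import numpy as np

def train(tuples, vocab_size, N=50, epochs=100, lr=0.01, eps=1e-8):
    """Gradient descent on J. `tuples` maps (i, j) word-index pairs to (a0, a1)."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(scale=0.1, size=(vocab_size, N))
    for _ in range(epochs):
        grad = np.zeros_like(vecs)
        for (i, j), (a0, a1) in tuples.items():
            diff = vecs[i] - vecs[j]
            norm = np.linalg.norm(diff) + eps         # avoid division by zero
            g = 2.0 * a0 * diff + a1 * diff / norm    # dJ/dv_i for this tuple
            grad[i] += g
            grad[j] -= g                              # dJ/dv_j is the opposite
        vecs -= lr * grad                             # constant step size
    return vecs

# Hypothetical toy input: word indices 0..3, a0 = count, a1 = -2 * sum of distances
vecs = train({(0, 1): (3, -6.0), (2, 3): (1, -8.0)}, vocab_size=4)
```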

Results

In this section, we compare Glove and the results of our optimization function by using ConceptNet’s synonyms in “Histogram Comparison” and look at the words that appear closest together in “Closest Words”.

Histogram Comparison

We show, for both Glove and our word vectors, that there is a shift in the histogram between words which are defined as synonyms in ConceptNet and words which are not. ConceptNet contains approximately 64,000 synonyms for the English language, and for the non-synonyms we choose 64,000 random word tuples which are not defined as synonyms in ConceptNet. For Glove, this is shown in Fig. 2.

Fig. 2 Results for Glove: [top] histogram of word similarities expressed by their cosine similarity (x-axis); [bottom] histogram of word similarities expressed by their Euclidean distances (x-axis). Blue is the histogram of randomly selected pairs of words which are not defined as synonyms in ConceptNet, while green is the histogram of synonyms in ConceptNet

Not surprisingly, there is a bigger shift for Glove when looking at cosine similarity, since the dot product, which is closer to cosine similarity than to Euclidean distance, was used in the optimization function during training.

Fig. 3 Results for our optimization function for vector size \(N=150\) after 31k iterations: [top] histogram of word similarities expressed by their cosine similarity (x-axis); [bottom] histogram of word similarities expressed by their Euclidean distances (x-axis). Blue is the histogram of randomly selected pairs of words which are not defined as synonyms in ConceptNet, while green is the histogram of synonyms in ConceptNet

The advantage of using our optimization function is that instead of one cluster (the peaks of the two histograms close together), which can be seen for Glove, we can see two clusters of words (the peaks of the histograms far from each other): one representing replaceable words and one representing words that cannot replace each other. The majority of synonyms from ConceptNet end up in the cluster which represents replaceable words. As we already mentioned, since our optimization function optimizes Euclidean distances, it is not surprising that cosine similarity does not show much of a shift in Fig. 3 (top), while the Euclidean distance histogram in Fig. 3 (bottom) does.

To see how well the different algorithms separate synonyms from non-synonyms, we measure the percentage of area intersection between the synonym distance histogram and the non-synonym distance histogram. The lower the number, the better the separation of synonyms from non-synonyms (Table 1). Using our optimization, we achieved a better score of 14.59% compared to 39.17% for Glove. Using only the dot product of the vectors instead of cosine similarity, which is the normalized dot product, would only worsen Glove's results; hence, we omit those results.

Table 1 Percentage of histogram area intersection between synonym scores and non-synonym scores
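A sketch of how the intersection score in Table 1 can be computed (the number of bins is an illustrative choice; the inputs are the arrays of synonym-pair and non-synonym-pair scores, e.g., Euclidean distances):

```python
import numpy as np

def histogram_intersection(syn_scores, non_syn_scores, bins=100):
    """Percentage of overlapping area between two normalized histograms."""
    syn, non = np.asarray(syn_scores), np.asarray(non_syn_scores)
    lo, hi = min(syn.min(), non.min()), max(syn.max(), non.max())
    h_syn, _ = np.histogram(syn, bins=bins, range=(lo, hi))
    h_non, _ = np.histogram(non, bins=bins, range=(lo, hi))
    h_syn = h_syn / h_syn.sum()    # normalize each histogram to area 1
    h_non = h_non / h_non.sum()
    return 100.0 * np.minimum(h_syn, h_non).sum()

# Hypothetical usage with two arrays of pairwise scores:
# print(histogram_intersection(synonym_distances, non_synonym_distances))
```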

Closest Words

In Table 2, we show the closest words produced by training with our optimization function, to illustrate why, rather than claiming a semantic focus of the word vectors, we use the term replaceability instead. The most replaceable words found are different abbreviations of months (top of the table), followed by words expressing different sexes (middle of the table). The bottom of the table shows other examples: names of people are easily replaceable, as are days of the week, different time intervals, and different counts. Some words which are used together, such as Los Angeles, appear near the top as well.

Table 2 Closest word tuple examples from training with our optimization function, for word vector size 50 after 30,000 iterations
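A sketch of how the closest word tuples in Table 2 can be extracted from the trained vectors (a brute-force Python version using SciPy's pairwise Euclidean distances; a vocabulary of realistic size may require a blocked or approximate search instead):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def closest_pairs(vecs, words, top_k=10):
    """Return the top_k word pairs with the smallest Euclidean distance."""
    dist = squareform(pdist(vecs))           # full pairwise Euclidean distance matrix
    iu = np.triu_indices(len(words), k=1)    # upper triangle, excludes self-pairs
    flat = dist[iu]
    order = np.argsort(flat)[:top_k]
    return [(words[iu[0][o]], words[iu[1][o]], flat[o]) for o in order]

# Hypothetical usage: `vecs` is the trained matrix (rows are word vectors),
# `words` is the vocabulary list aligned with the rows.
# for w1, w2, d in closest_pairs(vecs, words):
#     print(w1, w2, d)
```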

We did the same analysis for Glove in Table 3. There, different days of the week and different names of months also come closest (using cosine similarity), but so do many words with little meaning, which probably appear because of the different training dataset (Wikipedia, blogs, etc., versus only New York Times articles) and the different way of processing it (not excluding numbers, etc.).

Table 3 Closest word tuple examples using Glove with word vectors of size 300

Conclusions

We presented a new way to construct a vector space model which finds replaceable words, and showed that, in comparison with Glove, it creates two clusters and distinguishes more clearly between words that can be replaced and words that cannot. To achieve this, we used not only word tuple counts (numbers of appearances), as was done in various previous approaches, but also additional information: the distances between words in sub-sentences. We prefer to claim only a word replaceability focus of the new word vector space, which we demonstrated on the cluster of synonyms in the histogram, rather than claiming a semantic representation based on only a few selected words or smaller tests.

Future work should focus on using even more information from the training dataset to capture word similarities and their replaceability or semantic focus even better, as well as on including results on real tasks, as is done in other publications such as [4].