1 Introduction

In Wittum et al. [20], we presented an approach to compare the structure of natural languages with the language of mathematics. The question arises from empirical studies suggesting such a connection and its relevance for learning [1, 2, 23, 24].

On the one hand, mathematics is a language, with formulae as vocabulary and logic as grammar. On the other hand, mathematics has its own language, which has been developed over the last centuries. Mathematical formalism constitutes the core and is embedded into natural language, which is usually quite simply structured and uses only a minimal, but very specific, vocabulary. To compare it with natural language, we suggested deconstructing texts grammatically and comparing the resulting structure graphs with corresponding graphs for mathematical texts, see Wittum et al. [20]. To that end, we suggested using the constrained tree edit distance, see [7], as the distance between the trees.

The algorithm we set up in Wittum et al. [20] uses the following steps:

  1. Deconstruct two or more sentences into dependency trees.

  2. Compute the constrained tree edit distance of the trees.

  3. Perform a PCA or cluster analysis of the distance matrix.

In what follows, we describe details of the components of this algorithm.

In Wittum et al. [20], we encountered problems comparing simple sentences from natural languages which are very close, such as German and English. As pointed out in the aforementioned work, the reason for these problems was that the natural language processor (NLP) used had been trained with data using inconsistent annotation across the different languages. In the present paper, we repeat these tests with different NLPs using the critical sentences from Wittum et al. [20]. The results show that these problems are serious with all the tools tested.

2 Dependence trees and their distance

For step 1, we use Natural Language Processors (NLPs) which have been trained on a large set of data and are available under an open-source license, as described in Wittum et al. [20]. These NLPs are all based on Probabilistic Context Free Grammars (PCFGs), which have been trained on large data sets using machine learning. Tools with pretrained PCFGs are, e.g., SpaCy [16] and CoreNLP, also known as the Stanford parser [3]. Both APIs come with their own learning methods, in case the user seeks to train PCFGs on their own data. An alternative is the Natural Language Toolkit (NLTK) [14], which works without pretrained models, is slower than the pretrained tools, but offers a larger set of methods; it is well suited for processing smaller samples. Both SpaCy and NLTK are written in Python, while CoreNLP is written in Java. Figure 1 illustrates a dependency tree derived by SpaCy using the sentence:

Fig. 1

A dependency tree derived by SpaCy; IN = Preposition; JJ = Adjective; VB = Verb, base form; VBG = Verb, gerund/present participle; VBN = Verb, past participle; VBP = Verb, non-3rd ps. sg. present; NN = Noun, singular or mass; NNS = Noun, plural; . = Sentence-final punctuation; , = Comma; PRP = Personal pronoun; RB = Adverb; TO = infinitival to

As well as defined over discrete sets of events, we also wish to consider probabilities with respect to continuous variables.
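As a sketch of how such a tree can be obtained in practice, the following Python snippet parses the example sentence with SpaCy and prints each token together with its POS tag, its dependency relation and its head; the model name en_core_web_sm is an assumption and has to be installed beforehand.

```python
import spacy

# Minimal sketch: the pretrained model "en_core_web_sm" is an assumption and
# must be installed beforehand (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

sentence = ("As well as defined over discrete sets of events, we also wish "
            "to consider probabilities with respect to continuous variables.")
doc = nlp(sentence)

# One line per token: the token, its fine-grained POS tag, the dependency
# relation and the head it is attached to.
for token in doc:
    print(f"{token.text:15s} {token.tag_:5s} {token.dep_:10s} -> {token.head.text}")
```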

After constructing dependence trees from texts, we can compare two trees by computing their tree edit distance.

Following Wittum et al. [20] and Heumann and Wittum [7], we describe the constrained tree edit distance as a measure to quantify the similarity of trees. Wagner and Fischer [19] originally proposed a distance function between strings, defined as the minimal cost of a sequence of edit operations which modify a string slightly by deleting, inserting or substituting characters. This is a generalisation of the ideas of Levenshtein [9] and Hamming [5]. The algorithm for computing such a distance is the basis for many problems which can be modelled with strings, such as DNA sequencing. Later on, the edit distance was generalised to trees [15, 17]. There are algorithms for computing the edit distance between ordered labeled trees; however, Kilpeläinen and Mannila [8] and Zhang et al. [21] showed that the computation of this tree edit distance is NP-complete for unordered trees, which makes it infeasible for practical computations. As a consequence, Zhang [22] proposed an algorithm to compute a slightly modified distance, the constrained tree edit distance, which has quadratic complexity. We introduce these distances below.
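To make the string case concrete, the following sketch computes the Wagner–Fischer edit distance by dynamic programming; the unit costs for insertion, deletion and substitution are an illustrative assumption.

```python
def string_edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer dynamic programme with unit costs."""
    m, n = len(a), len(b)
    # d[i][j] = minimal cost of transforming a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a[i-1]
                          d[i][j - 1] + 1,        # insert b[j-1]
                          d[i - 1][j - 1] + sub)  # substitute
    return d[m][n]

print(string_edit_distance("Haus", "Maus"))  # 1
```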

Let T = (V, E) be a rooted tree with a set of labeled vertices V and a set of edges E connecting the vertices, as shown e.g. in Figs. 1 and 2. We now introduce the three basic edit operations shown in Fig. 3.

Fig. 2

How to define the distance of trees T1 and T2?

Fig. 3

Transforming T1 into T2

Definition 1

(Basic edit operation) The following basic operations on a labeled tree T are called basic edit operations:

  (1) Substitution sub(b, g): Replace the label b of a vertex labeled b by the label g.

  (2) Deletion del(g): Delete the vertex labelled g and connect the predecessor of g with the successors of g.

  (3) Insertion ins(f; a, d): Insert a new vertex f between the vertices a and d;

      or ins(f; d, λ): Add a new vertex f after the vertex d. λ stands for an empty vertex.

Figure 3 illustrates the three basic edit operations: substitution of a label (left), deletion of a vertex (center) and insertion of a vertex between two other vertices (right). The substitution step (left) is of course not necessary and can be omitted when merely determining a minimal sequence of elementary edit operations transforming T1 into T2.
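As a small illustration of Definition 1, the following sketch represents a labelled tree as nested nodes and applies the basic operations; the Node class and helper functions are our own illustrative constructions, not part of any of the cited tools.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def sub(node: Node, b: str, g: str) -> None:
    """Substitution sub(b, g): replace every label b by g in the subtree."""
    if node.label == b:
        node.label = g
    for child in node.children:
        sub(child, b, g)

def delete(parent: Node, g: str) -> None:
    """Deletion del(g): remove the child labelled g, attach its children to parent."""
    for i, child in enumerate(parent.children):
        if child.label == g:
            parent.children[i:i + 1] = child.children
            return

def insert(parent: Node, f: str, child_label: str) -> None:
    """Insertion ins(f; parent, child): place a new vertex f between parent and one child."""
    for i, child in enumerate(parent.children):
        if child.label == child_label:
            parent.children[i] = Node(f, [child])
            return

# Example: transform the tree a(b, d) by substituting b with g and
# inserting f between a and d, yielding a(g, f(d)).
t = Node("a", [Node("b"), Node("d")])
sub(t, "b", "g")
insert(t, "f", "d")
print(t)
```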

Using these basic edit operations, we define the distance of two labeled trees T1 and T2. Let a sequence S = (si)1≤i≤n of these atomic edit operations transform the tree T1 into T2. Assigning a weight γ(si) > 0 to each basic edit operation si, the weight γ(S) of such a sequence S is defined as the sum over its elements, \( \gamma (S) \, = \sum\nolimits_{i = 1}^{n} {\gamma (s_{i} )} \). We can now define the tree edit distance of the two labeled trees T1 and T2.

Definition 2

(Tree edit distance) The tree edit distance of two labeled trees T1 and T2 is given by

$$ d(T_{1} , T_{2} ): = \hbox{min} \left\{ {\gamma \left( S \right) = \sum\limits_{i = 1}^{n} {\gamma (s_{i} )} ;S = \left( {s_{i} } \right)_{1 \le i \le n} :T_{1} \mapsto T_{2} } \right\}. $$
(4)

It can easily be shown that this distance is indeed a metric, i.e., it satisfies non-negativity, identity of indiscernibles, symmetry and the triangle inequality, provided the weight γ of the edit operations is a metric on the space of the vertex labels joined with {λ}. Due to the finiteness of trees, it is always possible to find a sequence S = (si)1≤i≤n of basic edit operations si which transforms a tree T1 into another tree T2; this means that 0 ≤ d(T1, T2) < ∞. Computing the tree edit distance between two arbitrary unordered trees is NP-complete, so the tree edit distance is not realistically usable for practical computations. This NP-completeness makes it crucial to find a distance which inherits the nice properties of the tree edit distance but is of moderate complexity. We use the constrained tree edit distance as described in Wittum et al. [20].
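For ordered trees, the classical Zhang–Shasha edit distance is available in off-the-shelf libraries; the sketch below uses the third-party Python package zss, which is an assumption about the environment and computes the ordered tree edit distance rather than the constrained distance of Zhang [22] that we actually use.

```python
# Sketch using the third-party package "zss" (pip install zss), which
# implements the Zhang-Shasha algorithm for ordered labeled trees.
from zss import Node, simple_distance

# T1 = a(b, d) and T2 = a(g, f(d)), as in the small example above.
T1 = Node("a", [Node("b"), Node("d")])
T2 = Node("a", [Node("g"), Node("f", [Node("d")])])

# Default unit costs for insertion, deletion and label substitution.
print(simple_distance(T1, T2))   # 2: substitute b -> g, insert f
```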

3 Cost functions for weighted edit-operations

In Wittum et al. [20], we introduced the tree edit distance for dependency trees with weighted edit operations. The cost functions used to compute those weights have to match the problem domain. For natural languages, the cost function might consider the type or semantic meaning of the words attached to the specified vertices. As we will explain, this question is related to how words should be represented to make them accessible to a meaningful mathematical analysis.

At first, it might seem attractive to associate words with arbitrary ids that are unique in a multilingual context, e.g.

  • Computer = id95106.

  • Zoo = id849484.

While this representation is sufficient for identifying words, it is not satisfactory, because it ignores any relational information that might exist between words or phrases.

Additionally, for a mathematical analysis we need a metric that computes the distance between words, which is not possible with the previously mentioned encoding. Finding good metrics between arbitrary words is an important topic in natural language processing, because it is a necessity for the mathematical analysis of language semantics and structure. To accomplish that, words can be represented as vectors in a high-dimensional vector space. Such an encoding should preserve and map as much structural and semantic information to the vector representation as possible.

One of the most successful encoding techniques is word2vec, see Goldberg and Levy [4] and McCormick [11]. The main hypothesis of the word2vec model, as introduced in Mikolov et al. [12], is that a word is closely related to the words it frequently occurs with; that is, a word can be replaced by other words it is often found together with. The goal is to represent a word wi by a dense feature vector vi. The dimension of the vector space is defined by the number of unique words that shall be encoded. Each entry vi,j of the vector vi represents the probability pi,j of the word wj being found close to the encoded word wi:

$$ v_{i} = \, (p_{i,1} ,p_{i,2} , \, \ldots ,p_{i,n} )^{\text{T}} . $$
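In practice, such encodings are usually obtained with an existing implementation; the following sketch trains a small skip-gram model with the gensim library, where the toy corpus and all parameter values are purely illustrative assumptions.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences (illustrative only).
corpus = [
    ["steven", "likes", "long", "books"],
    ["he", "also", "eats", "carrots"],
    ["he", "does", "not", "watch", "movies"],
]

# Skip-gram model (sg=1) with a sliding window of size 2 and 50-dimensional vectors.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["books"]                     # dense vector for "books"
print(model.wv.similarity("books", "movies"))  # cosine similarity of two words
```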

Let a starting encoding be given for the words {wi, i = 1,…,n}. Let c be a context word and o the current center word; let uo be the vector representation of the current center word in the current encoding, uw those of the other center words in the same representation, and vc that of the context word under its current encoding. Then the probabilities we want to obtain are defined as

$$ p_{o,c} = \frac{{\exp (u_{o}^{T} v_{c} )}}{{\sum\nolimits_{w = 1}^{n} {\exp (u_{w}^{T} v_{c} )} }} $$
(5)

where n is the number of encoded words. This is the softmax function applied to the scalar product of the vectors of a center and a context word in their respective encodings. The learning process is performed by the skip-gram neural network model as introduced in Goldberg and Levy [4] and McCormick [11].
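Equation (5) can be evaluated directly once the two encodings are given; the following numpy sketch does so for randomly initialised vectors, which are of course an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 5, 4                        # 5 encoded words, 4-dimensional vectors
U = rng.normal(size=(n, dim))        # rows u_w: center-word representations
v_c = rng.normal(size=dim)           # v_c: representation of the context word

scores = U @ v_c                     # scalar products u_w^T v_c
p = np.exp(scores) / np.exp(scores).sum()   # softmax over all center words, Eq. (5)

o = 2                                # index of the current center word
print(p[o])                          # p_{o,c}
```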

The fundamental idea is to obtain the probabilities by training the two-layer neural network depicted in Fig. 4 to perform the following task: for every word w contained in the training data, select a random nearby word c. The output layer of the neural network contains the probabilities of each encoded word co-occurring with the word w and is used as the dense encoding vector for the word w mentioned above. In this context, a word wj is defined as being a nearby word of the word wi if |i − j| < m, where m is referred to as the size of the so-called “sliding window”, see Fig. 5.

Fig. 4

Skip gram network architecture, from McCormick [11]

Fig. 5

Skip gram with window size 2, from McCormick [11]

The previously discussed encoding allows us to define a metric between two words wi and wj, e.g., the cosine similarity between the vectors vwi and vwj, see Eq. (6).

$$ {\text{similarity}}\,\,(v_{{w_{i} }} ,v_{{w_{j} }} ) = \cos (\theta ) = \frac{{v_{{w_{i} }} \cdot v_{{w_{j} }} }}{{\left| {v_{{w_{i} }} } \right| \cdot \left| {v_{{w_{j} }} } \right|}} $$
(6)

The cosine similarity is often preferred over the Euclidean distance, since it measures only the orientation of the vectors; that is, it works on non-normalised vectors as well. We can thus define the weight (cost) of the substitution operation sk, replacing a labeled vertex verti by vertj in a given tree T, as γ(sk) = 1 − similarity(vwi, vwj).
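A minimal sketch of this cost function, using numpy and two hypothetical word vectors (the vectors would in practice be taken from a trained word2vec model):

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Eq. (6): cosine of the angle between two word vectors."""
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def substitution_cost(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cost gamma(s_k) = 1 - similarity for substituting one vertex label by another."""
    return 1.0 - cosine_similarity(v1, v2)

# Hypothetical word vectors, purely for illustration.
v_book = np.array([0.8, 0.1, 0.3])
v_movie = np.array([0.7, 0.2, 0.4])
print(substitution_cost(v_book, v_movie))
```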

4 Experiments: comparing natural languages

In Wittum et al. [20], we used SpaCy to construct dependency trees of several sentences in both German and English. These trees are generated by language processors based on differently trained neural networks, one for English and another one for German. Unfortunately, SpaCy uses data sets with different annotation schemes for training each network. Since our previously obtained results heavily depend on the tree structure, we conducted experiments with different NLP tools, namely CoreNLP, SpaCy (with updated data sets) and UDPipe, which claims to achieve very high tagging accuracy. These experiments are very important, since the dependency trees are the base data from which we compute the constrained tree edit distance, see Wittum et al. [20].

4.1 NLP tools

CoreNLP is a Java-based NLP toolkit that supports POS tagging (part-of-speech tagging) as well as the construction of dependency trees. For German language support, it uses the NEGRA corpus [13]; the English language support is based on the Penn Treebank. The overall system architecture of CoreNLP is shown in Fig. 6.

Fig. 6

System architecture of CoreNLP, from CoreNLP [3]

First, the system reads raw text and forwards it to a so-called annotation object. CoreNLP provides different annotators for producing additional information, such as POS tags, which is added to the text. After all annotators have processed the input text, CoreNLP returns the annotated text in either XML or plain text, see CoreNLP [3].
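CoreNLP itself is written in Java, but it can also be driven from Python; the sketch below uses the CoreNLPClient shipped with the stanza package, which is an assumption about the setup and requires a local CoreNLP installation (CORENLP_HOME must point to it).

```python
from stanza.server import CoreNLPClient

text = "Steven likes long books and he also eats carrots, but he does not watch movies."

# Start a local CoreNLP server and request tokenisation, sentence splitting,
# POS tagging and a dependency parse.
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "depparse"],
                   timeout=30000, memory="4G") as client:
    annotation = client.annotate(text)
    # Print each token of the first sentence with its POS tag.
    for token in annotation.sentence[0].token:
        print(token.word, token.pos)
```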

SpaCy is one of the most popular NLP toolkits. It is implemented in Python; for performance-critical parts, it uses Cython (Python that compiles to native C instead of being interpreted). Just like CoreNLP, SpaCy uses different training data for English and German. Unlike CoreNLP, the primary focus of SpaCy is not to provide an NLP platform for teaching and research; this makes SpaCy one of the most efficient NLP libraries, but it lacks some flexibility that might be required for advanced scientific applications. SpaCy’s overall system architecture is shown in Fig. 7.

Fig. 7

SpaCy system architecture, from SpaCy [16]

SpaCy works with two main data structures. The so-called Doc contains the tokenised text as well as the annotations of those tokens. To prevent redundant copies of word vectors and other information, SpaCy provides the Vocab object, which is used as a lookup table and can be accessed by multiple documents, see [16].
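A minimal sketch of this design: two Doc objects created by the same pipeline share a single Vocab instance (again assuming the pretrained model en_core_web_sm is installed).

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc1 = nlp("The lazy chicken crossed the large road.")
doc2 = nlp("The lazy chicken crossed the big road and laid an egg.")

# Both documents reference the same Vocab lookup table instead of copying it.
print(doc1.vocab is doc2.vocab)   # True
print(doc1.vocab is nlp.vocab)    # True
```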

UDPipe has the advantage that the training data for all supported languages is based on the Universal Dependencies collection, see [18]. While the individual training data contains different texts, depending on the language, the UD data uses a much more homogeneous tagging system across languages. This is supposed to improve the comparability of dependency trees among different languages, see [18].
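A sketch of how UDPipe can be used from Python via the official ufal.udpipe bindings; the model file name is an assumption, and pretrained UD models have to be downloaded separately from the UDPipe project page.

```python
from ufal.udpipe import Model, Pipeline

# The model file name is an assumption; pretrained Universal Dependencies
# models are distributed by the UDPipe project.
model = Model.load("german-gsd-ud.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

conllu = pipeline.process("Das faule Huhn überquerte die große Straße.")
print(conllu)   # CoNLL-U output: one token per line with UPOS tag and head index
```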

4.2 The experiments

4.2.1 Experiment 1

For comparing the POS tagging and dependency tree parsing of SpaCy, CoreNLP and UDPipe, we used the following sentence, which proved critical in our experiments in Wittum et al. [20].

  • English: Steven likes long books and he also eats carrots, but he does not watch movies.

  • German: Steven mag lange Bücher und er isst auch Karotten, aber er guckt keine Filme.

This sentence consists of three main clauses glued together by the conjunctions “and” and “, but”. They correspond almost word for word in the two languages. From a reasonable NLP we expect that it first discovers this basic structure and then constructs the dependency trees of the three clauses. Instead, the tools tested produce the following results.

SpaCy

See Fig. 8.

Fig. 8

Comparison of the dependence trees derived by SpaCy for the English and German test sentence in Experiment 1

CoreNLP

As mentioned previously, CoreNLP uses different tagging systems for English and German; there is no automatic mapping to the Google POS tagging system. Additionally, the trees are structured differently, although the English and German versions of the sentence are very similar (Fig. 9).

Fig. 9

Comparison of the dependence trees derived by CoreNLP for the English and German test sentence in Experiment 1

Even though CoreNLP did not produce the expected results, it has some advantages. The tagger can run completely independently, and it is possible to develop, i.e. train, custom UD-based language models which use a consistent POS tagging system.

UDPipe

UDPipe shows symptoms similar to CoreNLP: the German tree parser also used “Bücher”, the object of the first clause, instead of the verb “mag” as the direct child of “<root>”. In contrast to CoreNLP, UDPipe uses consistent POS tags for both German and English, which makes it a good candidate for future research. A further advantage is the number of pretrained language models that are available for UDPipe (Fig. 10).

Fig. 10

Comparison of the dependence trees derived by UDPipe for the English and German test sentence in Experiment 1

None of these language processors achieves the correct structure: it should first separate the three main clauses, which are merely connected by the conjunctions, and then structure each clause in itself.

4.2.2 Experiment 2

For the second experiment, we chose the following sentences:

  • Sentence 1, English: The lazy chicken crossed the large road.

  • Sentence 1, German: Das faule Huhn überquerte die große Straße.

  • Sentence 2, English: The lazy chicken crossed the big road and laid an egg.

  • Sentence 2, German: Das faule Huhn überquerte die große Straße und legte ein Ei.

Again, we are interested in whether the dependency trees, or at least the POS tags, of the German and English versions of the same sentence are similar when the sentences are structured similarly. Sentence 2 is just an extended version of sentence 1. We are particularly interested in whether the coordinating conjunction “and/und” is processed similarly in both languages.

SpaCy

For sentence 1, SpaCy’s results are convincing: the trees are structured similarly for the German and English versions of the sentence (Figs. 11, 12).

Fig. 11

Sentence 1, English

Fig. 12

Sentence 1, German

However, SpaCy returned different results for sentence 2. The conjunction “and/und” is handled differently, even though there is no apparent reason for it (Figs. 13, 14).

Fig. 13

Sentence 2, English

Fig. 14

Sentence 2, German

CoreNLP

Except for the differences in POS tagging, CoreNLP returned identical dependency trees for the German and English versions of both sentences; even the dependency labels are identical. Unlike in experiment 1, CoreNLP seems to recognise the conjunction correctly (Figs. 15, 16, 17, 18).

Fig. 15

Sentence 1, English

Fig. 16

Sentence 1, German

Fig. 17

Sentence 2, English

Fig. 18

Sentence 2, German

UDPipe

The results obtained for sentence 1 are as expected: the tree structure is identical. The results for sentence 2, however, are not identical. UDPipe recognises the role of the verb “crossed/überquerte” correctly. For the second verb “laid/legte” the tagger recognises that it is a verb in both English and German, but in the German version the root of the subtree is not “legte”, as expected, but “Ei”. This behaviour is similar to the results obtained in experiment 1 (Figs. 19, 20, 21, 22).

Fig. 19

Sentence 1, English

Fig. 20

Sentence 1, German

Fig. 21

Sentence 2, English

Fig. 22

Sentence 2, German

5 Conclusion

As illustrated above, the current versions of multi-language Natural Language Processors are not capable of producing results which can be compared across different languages. To successfully use the tree edit distance in a multilingual context, it is necessary to use so-called parallel treebanks for at least two languages. Although there are some projects working on parallel treebanks for different languages, e.g. LinES [10] and Hansen-Schirra et al. [6], there is still a lot of work left to be done for an accurate multilingual analysis of dependency trees.