Our alignment process is based on a set of rules that exploit word embedding similarity to discover correspondences. The process is divided into four successive steps, described in the following subsections. Our system supports two types of input (OWL ontologies and SKOS vocabularies) and two languages (French and English); however, it cannot handle both languages at the same time, since a separate word embedding model is used for each language.
3.1 Extracting Lexical and Structural Information from Ontologies
We start by extracting two types of information from the inputs: (i) lexical information (e.g., labels of concepts) and (ii) structural information (e.g., associating the labels of all child entities with their parent entities). To achieve this, both input types (OWL or SKOS) are parsed with rdflib and queried with SPARQL. Listing 1.1 shows an example query that handles SKOS input in French. The same query is used for OWL ontologies by replacing skos:prefLabel with rdfs:label to extract the labels of classes or properties, and skos:broader with rdfs:subClassOf or rdfs:subPropertyOf to obtain the hierarchical relation between classes or properties.
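As a rough illustration (the exact query is the one given in Listing 1.1), the following sketch shows how such an extraction could be written with rdflib; the file name and query text are illustrative, not necessarily those used in our system.

    from rdflib import Graph

    # Illustrative SPARQL query: French prefLabels and broader relations in a SKOS vocabulary.
    SKOS_QUERY_FR = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label ?parent WHERE {
        ?concept skos:prefLabel ?label .
        OPTIONAL { ?concept skos:broader ?parent . }
        FILTER (lang(?label) = "fr")
    }
    """

    g = Graph()
    g.parse("vocabulary.ttl", format="turtle")  # hypothetical SKOS input file
    for concept, label, parent in g.query(SKOS_QUERY_FR):
        print(concept, label, parent)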
3.2 Computing Word Embedding Representations of Concepts
The second step of our approach is to compute the vector representations of concepts. We used pre-trained word vectors for French and English, learned using fastText. The French model contains 1,152,449 tokens and the English model one million tokens; both map words to 300-dimensional vectors [3].
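For instance, the pre-trained fastText .vec files are distributed in word2vec text format and can be loaded with gensim (the file name below is illustrative):

    from gensim.models import KeyedVectors

    # Load pre-trained fastText word vectors (word2vec text format); file name is illustrative.
    word_vectors = KeyedVectors.load_word2vec_format("wiki.fr.vec", binary=False)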
The vector representation of a concept is constructed by averaging, along each dimension, the word embedding vectors of all the terms contained in its label and occurring in the dictionary: \( \mathit{conceptWordEmbedding}(c) = \frac{1}{n} \sum_{i=1}^{n} w_i \), where n is the number of words of the label of concept c that occur in the dictionary and \(w_i \in \mathbb{R}^{300}\) denotes the word embedding vector of the i-th word. If a term does not appear in the dictionary, it is simply ignored.
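A minimal sketch of this averaging, assuming word_vectors is a dict-like mapping from words to 300-dimensional numpy arrays (e.g., the gensim KeyedVectors loaded above):

    import numpy as np

    def concept_word_embedding(label, word_vectors):
        # Average the vectors of the label's words; words absent from the dictionary are ignored.
        vectors = [word_vectors[w] for w in label.lower().split() if w in word_vectors]
        if not vectors:
            return None  # none of the label's words occurs in the dictionary
        return np.mean(vectors, axis=0)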
In the case of structural information, the vector representation of a cluster is obtained by averaging the word embedding representation of the root concept's label (which is itself an average) with the vector representations of its child concepts: \( \mathit{clusterWordEmbedding}(cl) = \frac{1}{k} \sum_{i=1}^{k} \mathit{conceptWordEmbedding}(c_i) \), where k is the number of concepts in cluster cl.
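The cluster representation can be sketched in the same way (labels and helper names are illustrative):

    def cluster_word_embedding(root_label, child_labels, word_vectors):
        # Average the root-concept embedding (itself an average) with its children's embeddings.
        embeddings = [concept_word_embedding(l, word_vectors) for l in [root_label] + child_labels]
        embeddings = [e for e in embeddings if e is not None]
        return np.mean(embeddings, axis=0) if embeddings else None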
3.3 Searching for Matching Concepts
We match every concept in the source ontology \(O_1\) with similar concepts in the target ontology \(O_2\) using the cosine similarity between their vector representations (concept or cluster). A correspondence is added to the alignment list whenever the similarity exceeds a threshold. Our algorithm aims at collecting all possible correspondences between concepts. We chose the threshold empirically, varying its value and computing recall and precision for each setting. Figure 1 shows that an optimal trade-off is achieved by setting the similarity threshold to 0.8.
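A sketch of this matching step, assuming source_embeddings and target_embeddings map concept URIs to their vector representations:

    def cosine_similarity(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def match_concepts(source_embeddings, target_embeddings, threshold=0.8):
        # Collect every pair of concepts whose cosine similarity exceeds the threshold.
        alignment = []
        for c1, v1 in source_embeddings.items():
            for c2, v2 in target_embeddings.items():
                similarity = cosine_similarity(v1, v2)
                if similarity >= threshold:
                    alignment.append((c1, c2, similarity))
        return alignment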
3.4 Refining the Nature of the Relationship Between Two Matching Concepts
The result of the previous step is a list of matching concepts whose relationship must be made more precise. To link two concepts that are sufficiently similar, we used skos:closeMatch for SKOS and owl:sameAs for OWL. To define a hierarchical mapping link between two concepts, we used skos:broader or skos:narrower for SKOS and rdfs:subClassOf or rdfs:subPropertyOf for OWL.
This relationship between two matching concepts is refined by comparing the radii of their respective embedding vector clusters, formed mainly from structural information. While the radius of a cluster is commonly the maximum distance between the vectors representing its terms and the centroid, we define the radius of a cluster of concepts as the standard deviation of their cosine dissimilarity with respect to the centroid: \(radius = \sqrt{\frac{1}{N} \sum_{i=1}^N \left( 1 - \frac{w_i \cdot \overline{w}}{|w_i| \cdot |\overline{w}|}\right)^2}\), where \(w_i \in \mathbb{R}^{300}\) is the vector representation of the i-th concept in the cluster, N is the size of the cluster, and \(\overline{w} \in \mathbb{R}^{300}\) is the centroid of the cluster, defined as \( \overline{w} = \frac{1}{N} \sum_{i=1}^N w_i \). We assume that the cluster with the lowest average distance between its points and the centroid is in a broader relation with the cluster that has the biggest radius. We decide on the relationship holding between two similar concepts by comparing their radii according to the following rules (a sketch is given after the rules):
$$|radius(C_1) - radius(C_2)| < 0.1 \Rightarrow C_1 \; closeMatch \; C_2 \qquad (1)$$

$$|radius(C_1) - radius(C_2)| > 0.1 \Rightarrow C_1 \; narrowMatch \; C_2 \;\wedge\; C_2 \; broadMatch \; C_1 \qquad (2)$$
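A sketch of the radius computation and of rules (1) and (2), assuming each cluster is given as a list of 300-dimensional vectors; the direction of the hierarchical link is then chosen by comparing the two radii as described above:

    def cluster_radius(vectors):
        # Standard deviation of the cosine dissimilarities between the vectors and their centroid.
        centroid = np.mean(vectors, axis=0)
        dissimilarities = [1.0 - cosine_similarity(w, centroid) for w in vectors]
        return float(np.sqrt(np.mean(np.square(dissimilarities))))

    def refine_relation(cluster1, cluster2, margin=0.1):
        # Rule (1): closeMatch when the radii differ by less than the margin;
        # Rule (2): otherwise a hierarchical narrowMatch/broadMatch pair.
        r1, r2 = cluster_radius(cluster1), cluster_radius(cluster2)
        if abs(r1 - r2) < margin:
            return "closeMatch"
        return "narrowMatch/broadMatch"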