1 Introduction

The Silex company (footnote 1) develops a SaaS sourcing tool that identifies the service providers best suited to meet a given service request. The Silex platform allows companies to provide a textual description of their professional activities, their offers and the services they are looking for. The work presented in this paper has been carried out in the context of a collaboration between Silex and the I3S research laboratory to add a semantic layer to the Silex B2B platform, so that the descriptions of service requests can be processed automatically and the recommendation of relevant providers improved. Ontology engineering work has been conducted to semantically annotate the text descriptions of companies, offers, and service requests with three kinds of knowledge: skills, occupations, and business sectors. We developed the Silex ontology by combining several metadata repositories: ESCO (footnote 2), ROME (footnote 3), Cigref (footnote 4), NAF (footnote 5), UNSPSC (footnote 6), Kompass (footnote 7) and an internal Silex business sectors repository. Currently, the Silex ontology covers only the Computer Science (CS) field [1]. Our aim now is to automatically align the complete vocabularies in order to extend the Silex ontology to all business sectors.

In this paper, we present a new approach to ontology alignment based on word embeddings and inspired by an existing proposal [6]. We use word embeddings to represent concepts and to compute not only equivalence relations between concepts but also hierarchical relations. We report our experiments on several open datasets from the Ontology Alignment Evaluation Initiative (OAEI) benchmark and on the Silex use case.

This paper is organized as follows: related work is discussed in Sect. 2. Section 3 describes our algorithm for ontology alignment. Section 4 reports and discusses the results of our experiments on an OAEI benchmark and on the Silex use case. Section 5 draws some conclusions and outlines future work.

2 Related Work

The main issue when combining several ontologies is dealing with their semantic heterogeneity: each ontology has its own designer, its own knowledge area and its own level of detail. Ontology alignment is thus a crucial yet difficult task for achieving interoperability on the Semantic Web. It aims to discover the correspondences between the entities of different ontologies and to express them as equivalence or hierarchical relations.

There are two main ontology alignment techniques [2]: (i) element-level techniques discover correspondences by calculating the surface similarity between the lexical information of entities (usually labels); (ii) structure-level techniques rely on the analysis of the neighbourhood of two entities in order to determine their similarity. Both techniques are weak at capturing the semantics of the lexical information of entities, and have been extended by exploiting external information sources such as WordNet or Wikipedia. However, these auxiliary resources still suffer from incomplete and non-exhaustive entries. To overcome this problem, the approach presented in [6] uses word embeddings to preserve the semantic and syntactic similarities between words. That work mainly extracts the lexical information of an entity (names, labels and comments) and searches for equivalence relations between these pieces of information based on word embedding similarity. In our work, we have been inspired by [6] to calculate the similarities between entities based only on their labels. We extended this approach by using the radius of a cluster to find both equivalence and hierarchical relations between concepts.

3 Overview of Our Approach to Ontology Alignment

Our alignment process is based on a set of rules that exploit word embedding similarity to discover the alignment. The process is divided into four successive steps, described in the following subsections. Our system supports two types of input (OWL ontologies and SKOS vocabularies) and two languages (French and English). However, it cannot process both languages at the same time, since we use a different word embedding model for each language.

3.1 Extracting Lexical and Structural Information from Ontologies

We start by extracting two types of information from the inputs: (i) lexical information (e.g., labels of concepts) and (ii) structural information (e.g., associating the labels of all child entities with their parent entity). To achieve this, the two inputs (OWL or SKOS) are parsed with rdflib and queried with SPARQL. Listing 1.1 shows an example query that handles SKOS input in French. The same query is used for OWL ontologies by using rdfs:label instead of skos:prefLabel to extract the label of a class or property, and rdfs:subClassOf or rdfs:subPropertyOf instead of skos:broader to get the hierarchical relation between classes or properties.

Listing 1.1. Example SPARQL query for SKOS input in French (listing not reproduced).
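Since the original listing is not available here, the following is only a sketch of what such a query could look like, built from the predicates named above. The function name `build_query`, the variable names `?entity`, `?label` and `?parent`, and the use of a single property path for the OWL case are assumptions made for illustration, not the query actually used in our system.

```python
# Hypothetical reconstruction of an extraction query; not the original Listing 1.1.
PREFIXES = (
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
)

# (label predicate, hierarchy predicate) for each supported input type.
# For OWL, a SPARQL 1.1 property path covers both classes and properties.
PREDICATES = {
    "skos": ("skos:prefLabel", "skos:broader"),
    "owl": ("rdfs:label", "rdfs:subClassOf|rdfs:subPropertyOf"),
}


def build_query(input_type: str, lang: str) -> str:
    """Return a SPARQL query selecting each entity, its label and its parent."""
    label_pred, parent_pred = PREDICATES[input_type]
    return PREFIXES + (
        "SELECT ?entity ?label ?parent WHERE {\n"
        f"  ?entity {label_pred} ?label .\n"
        # The parent is optional so that root entities are returned as well.
        f"  OPTIONAL {{ ?entity {parent_pred} ?parent . }}\n"
        f'  FILTER (lang(?label) = "{lang}")\n'
        "}"
    )


if __name__ == "__main__":
    print(build_query("skos", "fr"))
```

The resulting string can then be passed to the `query` method of an rdflib `Graph` once the vocabulary has been parsed.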

3.2 Computing Word Embedding Representations of Concepts

The second step of our approach is to compute the vector representations of concepts. We used pre-trained word vectors for French and English, learned using fastText (footnote 8). The French model contains 1,152,449 tokens, and the English model contains one million tokens. In both models, tokens are mapped to 300-dimensional vectors [3].

The vector representation of a concept is constructed by averaging, along each dimension, the word embedding vectors of all the terms contained in its label and occurring in the dictionary: \( \mathit{conceptWordEmbedding}(c) = \frac{1}{n} \sum _{i=1}^{n} w_i,\) where n is the number of words in the dictionary occurring in the label of a concept c and \(w_i \in \mathbb {R}^{300}\) denotes the word embedding vector of the i-th word. If a term does not appear in the dictionary, it is simply ignored.

In the case of structural information, the vector representation of a cluster is given by averaging the vector representation of the label of the root concept (which is itself an average) with the vector representations of its child concepts: \( \mathit{clusterWordEmbedding}(cl) = \frac{1}{k} \sum _{i=1}^{k} \mathit{conceptWordEmbedding}(c_i), \) where k is the number of concepts in cluster cl.
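The two averages above can be sketched as follows. This is a minimal illustration under stated assumptions: the embedding model is a plain dictionary from word to vector, labels are lower-cased and split on whitespace, and the three-dimensional `TOY_MODEL` stands in for the 300-dimensional fastText vectors. All function names are hypothetical.

```python
from typing import Dict, List, Optional

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_embedding(label: str, model: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of the label's words that occur in the model.

    Words missing from the model are ignored; None is returned if no word is known.
    Lower-casing and whitespace tokenisation are simplifying assumptions.
    """
    known = [model[word] for word in label.lower().split() if word in model]
    return mean_vector(known) if known else None


def cluster_embedding(
    root_label: str, child_labels: List[str], model: Dict[str, Vector]
) -> Optional[Vector]:
    """Average the concept vectors of a root concept and of its child concepts."""
    candidates = [concept_embedding(lbl, model) for lbl in [root_label, *child_labels]]
    vectors = [v for v in candidates if v is not None]
    return mean_vector(vectors) if vectors else None


# Toy 3-dimensional model used only for illustration.
TOY_MODEL: Dict[str, Vector] = {
    "web": [1.0, 0.0, 0.0],
    "developer": [0.0, 1.0, 0.0],
    "designer": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    print(concept_embedding("Web developer", TOY_MODEL))
    print(cluster_embedding("developer", ["web developer", "web designer"], TOY_MODEL))
```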

3.3 Searching for Matching Concepts

We match every concept in the source ontology \(O_1\) with similar concepts in the target ontology \(O_2\) using the cosine similarity between the vector representations of concepts and clusters. A correspondence is added to the alignment list when its similarity passes the similarity threshold. Our algorithm aims at collecting all the possible correspondences between concepts. We chose the threshold empirically, by varying its value and calculating the recall and precision measures for each value. Figure 1 shows that the best trade-off between precision and recall is achieved by setting the similarity threshold to 0.8.

Fig. 1. Precision and recall as a function of the similarity threshold.
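The matching step can be illustrated with the sketch below. It assumes that each concept has already been mapped to a vector, that every source concept is compared with every target concept, and that a pair is kept when its cosine similarity is greater than or equal to the threshold (whether the comparison is strict is an assumption). The function names and the two-dimensional toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_matches(
    source: Dict[str, Vector], target: Dict[str, Vector], threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Return all (source, target, similarity) triples at or above the threshold."""
    matches = []
    for s_name, s_vec in source.items():
        for t_name, t_vec in target.items():
            score = cosine_similarity(s_vec, t_vec)
            if score >= threshold:
                matches.append((s_name, t_name, score))
    # Most similar pairs first, so that a user sees the best candidates on top.
    return sorted(matches, key=lambda m: m[2], reverse=True)


if __name__ == "__main__":
    src = {"programmer": [1.0, 0.1], "accountant": [0.0, 1.0]}
    tgt = {"developer": [0.9, 0.2], "auditor": [0.1, 0.9]}
    for s, t, score in find_matches(src, tgt):
        print(f"{s} ~ {t}: {score:.3f}")
```

Because all pairs above the threshold are kept, a source concept may be associated with several candidate targets, which is consistent with the goal of collecting all possible correspondences.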

3.4 Refining the Nature of the Relationship Between Two Matching Concepts

The result of the previous step is a list of matching concepts whose relationship must be made more precise. To link two concepts that are sufficiently similar, we use skos:closeMatch for SKOS and owl:sameAs for OWL. To define a hierarchical mapping link between two concepts, we use skos:broader or skos:narrower for SKOS and rdfs:subClassOf or rdfs:subPropertyOf for OWL.

The relationship between two matching concepts is refined by comparing the radii of their respective embedding vector clusters, which are formed mainly from structural information. In general, the radius of a cluster is the maximum distance between the vectors representing its terms and the centroid. Here, we define the radius of a cluster of concepts as the standard deviation of their cosine dissimilarity with respect to the centroid (measured about zero, i.e., a root mean square): \(\mathit{radius} = \sqrt{\frac{1}{N} \sum _{i=1}^N \left( 1 - \frac{w_i \cdot \overline{w}}{|w_i|\cdot |\overline{w}|}\right) ^2}, \) where \(w_i \in \mathbb {R}^{300}\) is the vector representation of the i-th concept in the cluster, N is the size of the cluster, and \(\overline{w} \in \mathbb {R}^{300}\) is the centroid of the cluster, defined as \( \overline{w} = \frac{1}{N} \sum _{i=1}^N w_i \). We assume that the cluster with the smaller radius, i.e., the lower average distance between a point and the centroid, is in a broader relation with the cluster that has the larger radius. We decide which relationship holds between two similar concepts by comparing their radii according to the following rules:

$$\begin{aligned} |\mathit{radius}(C1) - \mathit{radius}(C2)| < 0.1 \Rightarrow C1 \; \mathit{closeMatch} \; C2 \end{aligned} \tag{1}$$
$$\begin{aligned} |\mathit{radius}(C1) - \mathit{radius}(C2)| > 0.1 &\Rightarrow C1 \; \mathit{narrowMatch} \; C2 \\ &\wedge \, C2 \; \mathit{broadMatch} \; C1 \end{aligned} \tag{2}$$
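The radius and the two rules can be sketched as follows. Two points are assumptions made only for this illustration. First, the rules are written with an absolute value and do not say which cluster plays the role of C1, so the sketch assumes that the cluster with the larger radius is the broader one. Second, a difference of exactly 0.1 is covered by neither rule and is treated here as hierarchical. The function names and return labels are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors: List[Vector]) -> Vector:
    """Dimension-wise mean of the concept vectors of a cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def radius(vectors: List[Vector]) -> float:
    """Root mean square of the cosine dissimilarity to the centroid."""
    center = centroid(vectors)
    squared = [(1.0 - cosine_similarity(v, center)) ** 2 for v in vectors]
    return math.sqrt(sum(squared) / len(vectors))


def refine_relation(radius_c1: float, radius_c2: float, tolerance: float = 0.1) -> str:
    """Classify the relation between two matched clusters from their radii."""
    if abs(radius_c1 - radius_c2) < tolerance:
        return "closeMatch"
    # Assumption: the cluster with the larger radius is the broader concept.
    return "c1_is_narrower" if radius_c1 < radius_c2 else "c1_is_broader"


if __name__ == "__main__":
    tight = [[1.0, 0.0], [1.0, 0.0]]  # identical vectors, radius 0
    wide = [[1.0, 0.0], [0.0, 1.0]]   # orthogonal vectors, radius about 0.293
    print(radius(tight), radius(wide))
    print(refine_relation(radius(tight), radius(wide)))
```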

4 Experiments

To evaluate the effectiveness of our approach, we performed experiments on two alignment datasets: (i) Task-oriented complex alignment on conference organisation and (ii) the Silex use case. The performance of our approach is measured by calculating precision, recall and F-measure [4].
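For reference, these measures can be computed from a set of proposed correspondences and a gold standard as sketched below. The sketch assumes the balanced F-measure (harmonic mean of precision and recall) and represents a correspondence as a (source, target) pair; the function name and the toy data are hypothetical.

```python
from typing import Set, Tuple

Correspondence = Tuple[str, str]


def evaluate(predicted: Set[Correspondence], gold: Set[Correspondence]):
    """Return (precision, recall, F-measure) of predicted against gold."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    predicted = {("a", "x"), ("b", "y"), ("c", "z")}
    gold = {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}
    p, r, f = evaluate(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```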

4.1 Experiments on Task-Oriented Complex Alignment on Conference Organisation

To validate the proposed approach, we tested it on a conference complex alignment benchmark (footnotes 9 and 10) for ontology merging, which has been constructed within the framework of the OAEI. This data set contains 57 correspondences between five OWL ontologies. Following the evaluation process presented in [5], we took into account only the alignments that exist in the complex data set and ignored those of the simple data set. If the correct match appears in the list of matches proposed by our system, we consider the entire proposed list to be correct. This decision is justified by the fact that our system was designed to support end-users by presenting a list of possible matches. We compared our matching results with the results of three state-of-the-art systems reported in [5]. Our system clearly outperforms them on this benchmark, with a precision of 0.89 and a recall of 0.69, compared to 0.83 and 0.13, respectively, for the best state-of-the-art system. Several reasons can explain the remaining errors: (i) The cosine similarity between some classes is lower than the threshold, so the match is discarded (e.g., cosine similarity(‘chair main’, ‘demo chair’) = 0). (ii) Our system is not designed to test hierarchical relations between two leaf nodes. This type of relationship requires structural information to calculate the radius and, thus, to infer the relationship. (iii) Based on Eq. 1, our system can assign an equivalence relation instead of a hierarchical relation when the difference between the radii of two classes is smaller than the 0.1 threshold.

4.2 Experiments on the Silex Use Case

The second data set used in this evaluation consists of the vocabularies gathered for the Silex use case in the CS field: we tried to match (i) ESCO (160 concepts representing occupations) to Cigref (42 concepts), (ii) ESCO to ROME (117 concepts), (iii) NAF to Kompass (574 concepts) and (iv) NAF to Silex activity domains (14 concepts). A gold standard for each matching case was provided by an expert from the Silex company. Depending on the vocabularies to be aligned, the precision ranges between (i) 0.71 and 0.8 for the closeMatch relation, (ii) 0.7 and 0.83 for the narrowMatch relation and (iii) 0.73 and 1 for the broadMatch relation. The recall, in turn, ranges between (i) 0.6 and 0.95 for the closeMatch relation, (ii) 0.69 and 1 for the narrowMatch relation and (iii) 0.68 and 1 for the broadMatch relation. For example, the ROME concept “computer developer” is stated to be broader than the ESCO concept “Applications programmers”, which is in a broad relation with the ESCO concepts “Usability designer”, “System programmer” and “System developer”.

5 Conclusion

In this paper, we reported the results of a novel ontology alignment method capable of distinguishing between equivalence and hierarchical relationships. Our first challenge was to address the real-world use case encountered by the Silex company. The results show that the proposed approach to ontology alignment, based on a vector representation of the concepts to be matched, is promising. As future work, we aim at defining a specific set of pre-trained word vectors that best covers the Silex B2B use case. We also aim at performing an empirical study to define the optimal threshold for the radius difference.