Goal: Given two ontologies S and T, in two different natural languages \(L_s\) and \(L_t\) respectively, represented as RDF triples \(\langle s, p, o \rangle \in \mathcal {C} \times \mathcal {R} \times (\mathcal {C} \cup \mathcal {L})\), where \(\mathcal {C}\) is the set of ontology domain entities (i.e. classes), \(\mathcal {R}\) is the set of relations, and \(\mathcal {L}\) is the set of literals, we aim to find the complementary information \(\mathcal {T}_{e} = S - (S \cap T)\) from S in order to enrich T.
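The set difference above can be sketched directly if each ontology is modeled as a set of triples; the class names below are illustrative examples, not taken from any particular ontology:

```python
# Minimal sketch: each ontology as a set of (subject, predicate, object) tuples.
# T_e = S - (S ∩ T): the triples of S that are absent from T
# (for sets this equals S - T).
S = {
    ("Chairman", "rdfs:subClassOf", "Person"),
    ("Review", "rdf:type", "owl:Class"),
}
T = {
    ("Chairman", "rdfs:subClassOf", "Person"),
}
T_e = S - (S & T)
print(sorted(T_e))
```

In practice the two ontologies are not directly comparable like this, because their labels are in different languages; the phases below (translation and matching) are what make this set difference computable.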
The proposed approach comprises six phases (Fig. 1): translation, pre-processing, terminological matching, triple retrieval, enrichment, and validation. The input is the two ontologies in two different natural languages, i.e. the target ontology T and the source ontology S. The output is the multilingual enriched ontology \(T_{enriched}\) in the two natural languages \(L_s\) and \(L_t\). In the following subsections, we describe each of these phases in detail.
3.1 Translation
Let \(\mathcal {C}_S\) and \(\mathcal {C}_T\) be the sets of classes in S and T respectively. Each class is represented by a label or a local name. The aim of this phase is to translate each class in \(\mathcal {C}_S\) into the language of T (i.e. \(L_t\)). Google Translator is used to translate the classes of the source ontology, and all available translations are considered for each class. The output of the translation is therefore \(\mathcal {C}_{S-translated}\), which associates each class in S with the list of all its available translations. For example, the German class Thema has a list of English translations (Subject and Topic), and the Arabic class label “” has a list of English translations such as “Review, Revision, Check”. The best translation will be selected in the terminological matching phase (Subsect. 3.3).
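This phase reduces to building a map from each source class to its candidate translations. A minimal sketch, with a stand-in lookup table in place of the Google Translator service (the table entries are the examples from the text, not real API output):

```python
# Hypothetical stand-in for the Google Translator service used in the paper.
STUB_TRANSLATIONS = {
    "Thema": ["Subject", "Topic"],   # German example from the text
}

def translate_classes(source_classes, lookup=STUB_TRANSLATIONS):
    """Associate each source class with the list of all its available
    translations; classes without a known translation get an empty list."""
    return {c: lookup.get(c, []) for c in source_classes}

C_S_translated = translate_classes(["Thema"])
print(C_S_translated)   # {'Thema': ['Subject', 'Topic']}
```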
3.2 Pre-processing
The aim of this phase is to process the classes of \(\mathcal {C}_T\) and the lists of translations in \(\mathcal {C}_{S-translated}\) by employing a variety of natural language processing (NLP) techniques, such as tokenization, POS-tagging (part-of-speech tagging), and lemmatization, to make them ready for the next phases. In order to enhance the similarity results between \(\mathcal {C}_T\) and \(\mathcal {C}_{S-translated}\), stop words are removed, and normalization methods and regular expressions are used to remove punctuation, symbols, and additional white space, and to normalize the structure of strings. Furthermore, our pre-processing recognizes camel-case class names such as “ReviewArticle” and adds a space between lower-case and upper-case letters, yielding “Review Article” (i.e. true casing). The output of this phase is \(\mathcal {C}'_{T}\), which contains the pre-processed classes of T, and \({\mathcal {C}'_{S-translated}}\), which contains the pre-processed translations of each class in S.
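The normalization steps above can be sketched with regular expressions; tokenization, POS-tagging, and lemmatization would require an NLP toolkit and are omitted here, and the stop-word list is an illustrative subset:

```python
import re

STOP_WORDS = {"the", "a", "an", "of"}   # illustrative subset only

def preprocess(label):
    """Normalize a class label: split camel case, strip punctuation and
    symbols, collapse whitespace, lower-case, and remove stop words."""
    # True casing: insert a space between a lower-case and an upper-case letter,
    # e.g. "ReviewArticle" -> "Review Article".
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    # Remove punctuation and symbols.
    label = re.sub(r"[^\w\s]", " ", label)
    tokens = label.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("ReviewArticle"))   # ['review', 'article']
```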
3.3 Terminological Matching
The aim of this phase is to identify potential matches between the class labels of S and T. We perform pairwise lexical and/or semantic similarity between the list of translations of each class in \(\mathcal {C}'_{S-translated}\) and \(\mathcal {C}'_T\) to select, for each class in S, the best translation matching the corresponding class in T (see Algorithm 1). Jaccard similarity [22] is used first to filter out identical concepts, since computing semantic similarity between two identical classes would be an unnecessary extra computation. We chose Jaccard similarity because, in the ontology alignment experiments on the MultiFarm benchmark in [2], it achieved the best score in terms of precision. For non-identical concepts, we compute semantic similarity using the path length measure based on WordNet, which returns the shortest path between two words in the WordNet hierarchy [3]. If two words are semantically equivalent, i.e. belong to the same WordNet synset, the path distance is 1.00. We use a threshold \(\theta \) to obtain the set of matched terms (matched classes) M; after running the experiments ten times, \(\theta = 0.9\) gave the best matching results. If no match is found, we consider the class a new class that can be added to T, and we treat its list of translations as synonyms for that class. Class labels generally contain more than one word, for example “InvitedSpeaker”; therefore, the semantic similarity between sentences presented in [21] is adapted as described in Algorithm 1, line 9. Given two sentences sentence1 and sentence2, the semantic similarity of each sentence with respect to the other is computed as follows: for each word \(w_i \in sentence1\), the word \(w_j\) in sentence2 with the highest path similarity to \(w_i\) is determined.
The word similarities are then summed and normalized by the number of similar words between the two sentences. The same procedure is then applied starting from the words in sentence2, yielding the semantic similarity of sentence2 with respect to sentence1. Finally, the two similarity scores are combined using a simple average. Based on the similarity results, the best translation is selected and \(\mathcal {C}'_{S-translated}\) is updated. For example, in Fig. 2, the Arabic class “” has a list of English translations such as “President, Head, Chief”. After computing the similarity between \(\mathcal {C}'_{S-translated}\) and \(\mathcal {C}'_T\), “President” has the highest similarityScore of 1.00 with the class “Chairman” in \(\mathcal {C}'_T\), because they are semantically equivalent. Therefore, “President” is selected as the best translation for “”. The output of this phase is the list of matched terms M between \(\mathcal {C}'_T\) and the updated \(\mathcal {C}'_{S-translated}\).
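The adapted sentence similarity can be sketched as follows. A toy word-similarity table stands in for the WordNet path measure (NLTK's `wordnet.path_similarity` would be used in practice); the scores in `TOY_SIM` are illustrative, except that words in the same synset score 1.00 as stated above:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets (used to filter identical concepts)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def sentence_similarity(s1, s2, word_sim):
    """For each word in s1, take its best similarity to any word in s2,
    normalize by the number of matched words; repeat from s2; average."""
    def one_way(src, dst):
        scores = [max((word_sim(w, v) for v in dst), default=0.0) for w in src]
        matched = [s for s in scores if s > 0.0]
        return sum(matched) / len(matched) if matched else 0.0
    return (one_way(s1, s2) + one_way(s2, s1)) / 2.0

# Toy stand-in for the WordNet-based path measure.
TOY_SIM = {("president", "chairman"): 1.0, ("head", "chairman"): 0.4}
def toy_word_sim(w, v):
    return 1.0 if w == v else TOY_SIM.get((w, v), TOY_SIM.get((v, w), 0.0))

# Pick the best translation for the Arabic class against target class "chairman".
best = max(["president", "head", "chief"],
           key=lambda c: sentence_similarity([c], ["chairman"], toy_word_sim))
print(best)   # 'president'
```

With the threshold \(\theta = 0.9\) from the text, only “president” (score 1.00) would enter the matched set M.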
3.4 Triple Retrieval
The aim of this phase is to identify which new information can be added to T, and where. Each class in S is replaced by its best translation found in \(\mathcal {C}'_{S-translated}\) in the previous phase, yielding a translated ontology \(S_{translated}\) (see Algorithm 2). We design an iterative process to obtain \(\mathcal {T}_e\), represented as triples \(\langle s, p, o\rangle \), holding all possible multilingual information from S to be added to T. We initiate the iterative process with all matched terms (\(newClasses = M\)) in order to get all related classes, if any exist. The iterative process has three steps: (1) for each class \(c \in newClasses\), all triples tempTriples are retrieved from \(S_{translated}\) where c is a subject or an object; (2) a new list of new classes is obtained from tempTriples; (3) tempTriples is added to newTriples, which will be added to T. These three steps are repeated until no new classes can be found (newClasses.isEmpty() = true). Next, for each class in newTriples, we retrieve all available information from the other language, such as \(\langle \)president, rdfs:label, “”@ar\(\rangle \). The output of this phase is \(\mathcal {T}_{e}\), which contains all multilingual triples (i.e., in the \(L_s\) and \(L_t\) languages) to be added to T.
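The three-step iteration above amounts to a closure over the triples of \(S_{translated}\), following subjects and objects until no new class appears. A minimal sketch with a toy translated ontology (the triples are illustrative):

```python
def retrieve_triples(S_translated, matched_classes):
    """Iteratively collect every triple of S_translated reachable from the
    matched classes: (1) fetch triples whose subject or object is a new
    class, (2) harvest new classes from them, (3) accumulate the triples;
    repeat until no new classes are found."""
    new_classes = set(matched_classes)
    seen, new_triples = set(), set()
    while new_classes:
        temp = {(s, p, o) for (s, p, o) in S_translated
                if s in new_classes or o in new_classes}
        new_triples |= temp
        seen |= new_classes
        new_classes = {x for (s, p, o) in temp for x in (s, o)} - seen
    return new_triples

S_translated = {
    ("President", "rdfs:subClassOf", "Person"),
    ("Person", "rdfs:subClassOf", "Agent"),
    ("Paper", "rdf:type", "owl:Class"),   # unreachable from "President"
}
triples = retrieve_triples(S_translated, {"President"})
```

Starting from the matched class "President", the closure pulls in "Person" and then "Agent", but never reaches "Paper", so its triple is excluded from \(\mathcal {T}_e\).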
3.5 Enrichment
The aim of this phase is to enrich T using the triples in \(\mathcal {T}_{e}\). Using OECM, the target ontology can be enriched from several ontologies in different natural languages sequentially, i.e. one-to-many enrichment. In this case, \(T_{enriched}\) can have more than two natural languages. For instance, an English T can be enriched from a German ontology, and the enriched ontology can then be enriched again from an Arabic ontology, so that the final \(T_{enriched}\) is presented in English, German, and Arabic. With the completion of this phase, we have enriched T and created a multilingual ontology from monolingual ones.
3.6 Validation
The aim of this phase is to validate the enriched ontology, a crucial step for detecting inconsistencies and syntax errors that might be produced during the enrichment process [8]. There are two types of validation: syntactic and semantic. In syntactic validation, we check that \(T_{enriched}\) conforms to the W3C RDF standards using the online RDF validation service, which detects syntax errors such as missing tags. For semantic validation, we use two reasoners, FaCT++ and HermiT, to detect inconsistencies in \(T_{enriched}\) [8].
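A cheap local stand-in for the syntactic check can be sketched with an XML parser, which catches well-formedness errors such as missing closing tags in an RDF/XML serialization; it does not replace the W3C validation service, and the reasoning done by FaCT++ and HermiT is not reproduced here:

```python
import xml.etree.ElementTree as ET

def is_well_formed(rdf_xml):
    """Rough syntactic pre-check for an RDF/XML document: verifies XML
    well-formedness only (e.g. missing tags); full RDF validation and
    consistency checking require dedicated tools and reasoners."""
    try:
        ET.fromstring(rdf_xml)
        return True
    except ET.ParseError:
        return False

good = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"></rdf:RDF>'
bad = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
print(is_well_formed(good), is_well_formed(bad))   # True False
```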
Table 1. Use case: the sample output of each phase, from translation to triple retrieval.