From Monolingual to Multilingual Ontologies: The Role of Cross-lingual Ontology Enrichment

Abstract. While the multilingual data on the Semantic Web grows rapidly, building multilingual ontologies from monolingual ones is still cumbersome, hampered by the lack of techniques for cross-lingual ontology enrichment. Cross-lingual ontology enrichment greatly facilitates semantic interoperability between ontologies in different natural languages. Achieving such enrichment by human labor is costly and error-prone. In this paper, we therefore propose a fully automated ontology enrichment approach (OECM), which builds a multilingual ontology by enriching a monolingual ontology from another one in a different natural language, using a cross-lingual matching technique. OECM selects the best translation among all available translations of ontology concepts based on their semantic similarity with the target ontology concepts. We present a use case of our approach for enriching English scholarly communication ontologies using German and Arabic ontologies from the MultiFarm benchmark. We have compared our results with the results from the Ontology Alignment Evaluation Initiative (OAEI 2018). Our approach achieves higher precision and recall than five state-of-the-art approaches. Additionally, we recommend some linguistic corrections in the Arabic ontologies in MultiFarm, which have enhanced our cross-lingual matching results.


Introduction
The wide proliferation of multilingual data on the Semantic Web results in many ontologies scattered across the web in various natural languages. According to the Linked Open Vocabularies 5 , the majority of the ontologies in the Semantic Web are in English; however, ontologies in other Indo-European languages also exist. For instance, out of a total of 666 vocabularies, 482 are in English, 53 in French, 39 in Spanish, and 33 in German. Few ontologies exist in non-Indo-European languages, such as 13 in Japanese and seven in Arabic. Monolingual ontologies, whose labels or local names are presented in a single language, are not easily understandable to speakers of other languages. Therefore, in order to enhance semantic interoperability between monolingual ontologies, approaches for building multilingual ontologies from existing monolingual ones should be developed [27]. Multilingual ontologies can be built by applying cross-lingual ontology enrichment techniques, which expand the target ontology with additional concepts and semantic relations extracted from external resources in other natural languages [24]. For example, suppose we have two ontologies: the Scientific Events Ontology in English (SEO en ) and Conference in German (Conference de ). Both SEO en and Conference de have complementary information, i.e., SEO en has some information which does not exist in Conference de and vice versa. Consider a scenario where a user wants to use information from both SEO en and Conference de in an ontology-based application. This may not be possible without a cross-lingual ontology enrichment solution, which enriches the former with the complementary information in the latter. Manual ontology enrichment is a resource-demanding and time-consuming task. Therefore, fully automated cross-lingual ontology enrichment approaches are highly desired [24].
Most of the existing work in ontology enrichment focuses on enriching English ontologies from English sources only (monolingual enrichment) [24]. To the best of our knowledge, only our previous work [1,14] has addressed the cross-lingual ontology enrichment problem, proposing a semi-automated approach to enrich ontologies from multilingual text or from other ontologies in different natural languages.
In this paper, we address the following research question: how can we automatically build multilingual ontologies from monolingual ones? We propose a fully automated ontology enrichment approach to create multilingual ontologies from monolingual ones using cross-lingual matching. We extend our previous work [14] by: 1) using semantic similarity to select the best translation of class labels, 2) enriching the target ontology by adding new classes in addition to all their related subclasses in the hierarchy, 3) using ontologies in non-Indo-European languages (e.g., Arabic) as the source of information, 4) building multilingual ontologies, and 5) developing a fully automated approach. OECM comprises six phases: 1) translation: translate class labels of the source ontology, 2) pre-processing: process class labels of the target and the translated source ontologies, 3) terminological matching: identify potential matches between class labels of the source and the target ontologies, 4) triple retrieval: retrieve the new information to be added to the target ontology, 5) enrichment: enrich the target ontology with new information extracted from the source ontology, and 6) validation: validate the enriched ontology. A noticeable feature of OECM is that we consider multiple translations for a class label. In addition, the use of semantic similarity has significantly improved the quality of the matching process. We present a use case for enriching the Scientific Events Ontology (SEO) 6 , a scholarly communication ontology for describing scientific events, from German and Arabic ontologies. We compare OECM to five state-of-the-art approaches for the cross-lingual ontology matching task. OECM outperforms these approaches in terms of precision, recall, and F-measure. Furthermore, we evaluate the enriched ontology by comparing it against a gold standard created by ontology experts.
The implementation of OECM and the datasets used in the use case are publicly available 7 .
The remainder of this paper is structured as follows: we present an overview of related work in section 2. The proposed approach is described in section 3. In order to illustrate possible applications of OECM, a use case is presented in section 4. Experiments and evaluation results are presented in section 5. Finally, we conclude with an outline of future work in section 6.

Related Work
A recent review of the literature on the multilingual Web of Data found that the potential of the Semantic Web for being multilingual can be realized through techniques that build multilingual ontologies from monolingual ones [12]. Multilingual enrichment approaches are used to build multilingual ontologies from different resources in different natural languages [6,25,5]. Espinoza et al. [6] proposed an approach to generate multilingual ontologies by enriching existing monolingual ontologies with multilingual information in order to translate these ontologies to a particular language and culture (ontology localization). In fact, ontology enrichment depends on matching the target ontology with external resources, in order to provide the target ontology with additional information extracted from those resources.
Much of the literature has focused on cross-lingual ontology matching techniques, which match the linguistic information of ontologies across different natural languages [12,27]. Meilicke et al. [21] created a benchmark dataset (MultiFarm) resulting from manual translations of a set of ontologies from the conference domain into eight natural languages. This dataset is widely used to evaluate cross-lingual matching approaches [29,7,15,16]. Manual translation of ontologies can be infeasible when dealing with large and complex ontologies. Trojahn et al. [28] proposed a generic approach which relies on translating concepts of source ontologies, using machine translation techniques, into the language of the target ontology, and then applies monolingual matching approaches to match concepts between the source ontologies and the translated ones. Fu et al. [10,11] proposed an approach to match English and Chinese ontologies by considering the semantics of the target ontology, the mapping intent, the operating domain, the time and resource constraints, and user feedback. Hertling and Paulheim [13] proposed an approach which utilizes Wikipedia's inter-language links for finding corresponding ontology elements. Lin and Krizhanovsky [18] proposed an approach which uses Wiktionary 8 as a source of background knowledge to match English and French ontologies. Tigrine et al. [26] presented an approach which relies on the multilingual semantic network BabelNet 9 as a source of background knowledge to match several ontologies in different natural languages. In the context of the OAEI 2018 campaign 10 for evaluating ontology matching technologies, AML [7], KEPLER [16], LogMap [15] and XMap [29] provide high-quality alignments. These systems use terminological and structural alignments in addition to external lexicons, such as WordNet 11 and the UMLS lexicon 12 , in order to obtain sets of synonyms for ontology elements.
To deal with multilingualism, AML and KEPLER rely on obtaining a single translation for a concept (one-to-one translation) using machine translation technologies, such as Microsoft Translator, before starting the matching process. LogMap and XMap do not provide any information about the translation methodology they use. Moreover, LogMap is an iterative process that starts from initial mappings (almost exact lexical correspondences) to discover new mappings. As mentioned in [15], the main weakness of LogMap is that it cannot find matches between ontologies which do not provide enough lexical information, as it depends mainly on the initial mappings. A thorough survey of the state-of-the-art approaches in cross-lingual ontology matching is provided in [27].
Most of the literature has focused on enriching monolingual ontologies with multilingual information in order to translate or localize these ontologies. In addition, in the cross-lingual ontology matching task, the lack of exact one-to-one translations between terms across different natural languages negatively affects the matching results. We address these limitations in our proposed approach by building multilingual ontologies, where a class label is presented in several natural languages, from monolingual ones. Such an approach supports the ontology matching process with multiple translations for a class label in order to enhance the matching results.

The Proposed Approach
Goal: Given two ontologies S and T , in two different natural languages L s and L t respectively, as RDF triples ⟨s, p, o⟩ ∈ C × R × (C ∪ L), where C is the set of ontology domain entities (i.e., classes), R is the set of relations, and L is the set of literals, we aim at finding the complementary information T e = S − (S ∩ T ) from S in order to enrich T .
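This goal can be sketched with ontologies modelled as plain sets of triples. The triples below are hypothetical toy data, not taken from the evaluated ontologies, and the sketch ignores RDF parsing entirely (OECM itself operates on RDF):

```python
# Ontologies as sets of (subject, predicate, object) triples; the
# complementary information T_e = S - (S ∩ T) is plain set difference.
S = {("chairman", "rdfs:subClassOf", "committee_member"),
     ("person", "rdf:type", "owl:Class")}
T = {("person", "rdf:type", "owl:Class")}

T_e = S - (S & T)  # triples of S that T is still missing
```

In practice the hard part, addressed by the phases below, is deciding when a triple of S already exists in T despite being written in another natural language.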
The proposed approach comprises six phases (Figure 1): translation, pre-processing, terminological matching, triple retrieval, enrichment, and validation. The input is two ontologies in two different natural languages, i.e., the target ontology T and the source ontology S. The output is the multilingual enriched ontology T enriched in the two natural languages L s and L t . In the following subsections, we describe each of these phases in detail.


Translation
Let C S and C T be the sets of classes in S and T respectively. Each class is represented by a label or a local name. The aim of this phase is to translate each class in C S to the language of T (i.e., L t ). All available translations are considered for each class. Therefore, the output of the translation is C S−translated , in which each class in S is associated with a list of all available translations. For example, the class Thema in German has a list of English translations (Subject and Topic), and the class label " " in Arabic has a list of English translations such as "Review, Revision, Check". The best translation will be selected in the terminological matching phase (subsection 3.3).
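As a minimal sketch of this phase, the translation lookup below is a hypothetical stub standing in for the machine-translation service; only the Thema example comes from the text above:

```python
# Map each source-ontology class label to the list of ALL available
# translations into the target language L_t. TRANSLATIONS_DE_EN is a
# stub dictionary, not a real translator.
TRANSLATIONS_DE_EN = {
    "Thema": ["Subject", "Topic"],   # example from the text
    "Person": ["Person"],
}

def translate_classes(source_classes, lookup):
    """Build C_S-translated: class label -> list of candidate translations."""
    return {c: lookup.get(c, []) for c in source_classes}

c_s_translated = translate_classes(["Thema", "Person"], TRANSLATIONS_DE_EN)
```

Keeping every candidate translation, rather than committing to one, is what later allows the matching phase to pick the translation closest to the target ontology.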

Pre-processing
The aim of this phase is to process the classes of C T and the lists of translations in C S−translated by employing a variety of natural language processing (NLP) techniques, such as tokenization, POS-tagging (part-of-speech tagging), and lemmatization, to make them ready for the next phases. In order to enhance the similarity results between C T and C S−translated , stop words are removed, and normalization methods and regular expressions are used to remove punctuation, symbols, and additional white spaces, and to normalize the structure of strings. Furthermore, our pre-processing is capable of recognizing camel-case class labels such as "ReviewArticle" and adding a space between lower-case and upper-case letters, e.g., "Review Article" (i.e., the true casing technique [19]). The output of this phase is C T , which has the pre-processed class labels of T , and C S−translated , which has pre-processed translations for each class in S.
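A minimal Python sketch of these pre-processing steps; the stop-word list is a tiny illustrative subset, and the full pipeline (POS-tagging, lemmatization) is handled by an NLP toolkit rather than shown here:

```python
import re

# Split camel-case labels such as "ReviewArticle", strip punctuation
# and extra whitespace, lowercase, and drop stop words.
STOP_WORDS = {"the", "of", "a", "an", "and"}  # illustrative subset only

def preprocess(label):
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)  # true casing
    label = re.sub(r"[^\w\s]", " ", label)              # punctuation/symbols
    tokens = [t.lower() for t in label.split()
              if t.lower() not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("ReviewArticle"))  # camel case is split before matching
```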

Terminological Matching
The aim of this phase is to identify potential matches between class labels of S and T . We perform pairwise lexical and/or semantic similarity computations between the list of translations of each class in C S−translated and C T to select, for each class in S, the best translation that matches the corresponding class in T (see algorithm 1). Jaccard similarity [23] is used first to filter identical concepts, since there is no need for the extra computation of semantic similarity between two identical classes. We chose Jaccard similarity because, according to the experiments conducted for the ontology alignment task on the MultiFarm benchmark in [2], it achieved the best score in terms of precision. For non-identical concepts, we compute the semantic similarity using the path length measure, based on WordNet 11 , which returns the shortest path between two words in the WordNet hierarchy [3]. If two words are semantically equivalent, i.e., belonging to the same WordNet synset, the path distance is 1.00. We use a threshold θ in order to obtain the set of matched terms (matched classes) M . After running the experiments ten times, we obtained the best matching results with θ = 0.9. If no match is found, we consider the class as a new class that can be added to T , and we consider its list of translations as synonyms for that class. Generally, class labels have more than one word, for example "InvitedSpeaker"; therefore, the semantic similarity between sentences presented in [22] is adapted, as described in algorithm 1, line 9. Given two sentences sentence1 and sentence2, the semantic similarity of each sentence with respect to the other is defined as follows: for each word w i ∈ sentence1, the word w j in sentence2 that has the highest path similarity with w i is determined.
The word similarities are then summed up and normalized by the number of similar words between the two sentences. Next, the same procedure is applied starting with the words in sentence2 to identify the semantic similarity of sentence2 with respect to sentence1. Finally, the resulting similarity scores are combined using a simple average. Based on the similarity results, the best translation is selected and C S−translated is updated. For example, in Figure 2, the class " " in Arabic has a list of English translations such as "President, Head, Chief". After computing the similarity between C S−translated and C T , "President" has the highest similarity score of 1.00 with the class "Chairman" in C T , because they are semantically equivalent. Therefore, "President" is selected as the best translation for " ". The output of this phase is the list of matched terms M between C T and the updated C S−translated .

Triple Retrieval
The aim of this phase is to identify which new information can be added to T , and where. Each class in S is replaced by its best translation found in C S−translated in the previous phase in order to obtain a translated ontology S translated (see algorithm 2). We design an iterative process in order to obtain T e , represented as ⟨s, p, o⟩ triples, which has all possible multilingual information from S to be added to T . We initiate the iterative process with all matched terms (newClasses = M ) in order to get all related classes, if any exist. The iterative process has three steps: 1) for each class c ∈ newClasses, all triples tempTriples are retrieved from S translated where c is a subject or an object, 2) a new list of new classes is obtained from tempTriples, and 3) tempTriples is added to newTriples, which will be added to T . These three steps are repeated until no new classes can be found (newClasses.isEmpty() = true). Next, we retrieve all available information in the other language for each class in newTriples, such as ⟨president, rdfs:label, " "@ar⟩. The output of this phase is T e , which contains all multilingual triples (i.e., in the L s and L t languages) to be added to T .
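A compact sketch of this iterative process, with hypothetical toy triples standing in for S translated; the final step that re-attaches labels in the other language is omitted:

```python
# Starting from the matched classes M, repeatedly pull every triple of
# the translated source ontology whose subject or object is a newly
# discovered class, until no new classes appear.
S_translated = [  # toy data
    ("conference_contributor", "rdfs:subClassOf", "person"),
    ("committee_member", "rdfs:subClassOf", "person"),
    ("chairman", "rdfs:subClassOf", "committee_member"),
]

def retrieve_triples(triples, matched):
    new_classes, new_triples, seen = set(matched), [], set(matched)
    while new_classes:
        temp = [t for t in triples
                if (t[0] in new_classes or t[2] in new_classes)
                and t not in new_triples]
        new_triples.extend(temp)
        found = {c for t in temp for c in (t[0], t[2])}
        new_classes = found - seen   # only classes not processed before
        seen |= new_classes
    return new_triples

T_e = retrieve_triples(S_translated, {"person"})
```

Starting from the single matched class person, the sketch first discovers conference_contributor and committee_member, then chairman through committee_member, mirroring the iterations described in the use case below.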

Enrichment
The aim of this phase is to enrich T using the triples in T e . Using OECM, the target ontology can be enriched from several ontologies in different natural languages sequentially, i.e., one-to-many enrichment. In this case, T enriched can have more than two natural languages. For instance, an English T can be enriched from a German ontology, and the enriched ontology can then be enriched again from an Arabic ontology, i.e., the final T enriched is presented in English, German, and Arabic. With the completion of this phase, we have successfully enriched T and created a multilingual ontology from monolingual ones.

Validation
The aim of this phase is to validate the enriched ontology, a crucial step to detect inconsistencies and syntax errors that might be produced during the enrichment process [8]. There are two types of validation: syntactic and semantic. In the syntactic validation, we check that T enriched conforms with the W3C RDF standards using the online RDF validation service 13 , which detects syntax errors such as missing tags. In the semantic validation, we use two reasoners, FaCT++ 14 and HermiT 15 , in order to detect inconsistencies in T enriched [8].

Use Case: Enriching the Scientific Events Ontology
In this use case, we use an example scenario to enrich the SEO en [9] ontology (with 49 classes), in English, using the MultiFarm dataset (see section 5). We use the Conference ontology (60 classes) and the ConfOf ontology (38 classes), in German and Arabic respectively, as source ontologies. This use case aims to show the whole process, from submitting the source and target ontologies to producing the enriched multilingual ontology. Here, the source ontology is the German ontology Conference de and the target ontology is the English ontology SEO en . The output is the enriched ontology SEO en−de , which becomes a multilingual ontology in English and German. Table 1 demonstrates the enrichment process for SEO en from Conference de and shows a sample of the output of each phase, from the translation phase to the produced set of triples which are used to enrich SEO en . In the terminological matching task, the relevant matching results (with similarity scores in bold) are identified with θ ≥ 0.9. The iterative process, in the triple retrieval phase, is initiated with the identified matched terms, for example the person class. At the first iteration, six triples (not all shown in the table due to space limitations) are produced, such as ⟨conference contributor, rdfs:subClassOf, person⟩, where the matched term person is located at the object position. New classes are determined from the produced triples, such as conference contributor and committee member (in bold). At the second iteration, all triples that have these new classes as subject or object are retrieved; for example, for the committee member class, the triples ⟨committee member, rdf:type, Class⟩ and ⟨chairman, rdfs:subClassOf, committee member⟩ are retrieved. This process is repeated and new classes are identified from the produced triples, such as chairman.
⟨committee member, rdf:type, Class⟩
⟨committee member, rdfs:label, "committee member"@en⟩
⟨committee member, rdfs:label, "Angehörige des Ausschusses"@de⟩
⟨chairman, rdfs:subClassOf, committee member⟩
The iterative process ends at the fifth iteration, where three triples are produced without any new classes. The output of this phase is T e , which has 40 new triples (with 20 new classes and their German labels) to be added to SEO en , producing SEO en−de . Figure 3 shows a small fragment of the enriched ontology SEO en−de , in Turtle, after completing the enrichment process. The resulting multilingual ontology contains a newly added class CommitteeMember with its English and German labels, a new relation rdfs:subClassOf between the two classes CommitteeMember and Chair, and new German labels such as Herausgeber and Vorsitzender for the classes Publisher and Chair respectively. Similarly, SEO en−de is enriched from the Arabic ontology ConfOf ar , where all classes with English labels in SEO en−de are matched with class labels in ConfOf ar . The produced SEO en−de−ar has 113 new triples with 37 new classes and their Arabic labels. The final output results can be found in the GitHub repository 7 .

Evaluation
The aim of this evaluation is to measure the quality of the cross-lingual matching process as well as the enrichment process. We use ontologies from the MultiFarm benchmark 16 , a benchmark designed for evaluating cross-lingual ontology matching systems. MultiFarm consists of seven ontologies (Cmt, Conference, ConfOf, Edas, Ekaw, Iasted, Sigkdd ) originally coming from the Conference benchmark of OAEI, their translations into nine languages (Chinese, Czech, Dutch, French, German, Portuguese, Russian, Spanish, and Arabic), and the corresponding cross-lingual alignments between them.

Experimental Setup
All phases of OECM have been implemented using Scala and Apache Spark 17 . The SANSA-RDF library 18 [17] with the Apache Jena framework 19 is used to parse and manipulate the input ontologies (as RDF triples). Google Translator 20 is used to translate class labels/local names of the source ontologies. To process the class labels, Stanford CoreNLP 21 [20] is used. All experiments are carried out on the Ubuntu 16.04 LTS operating system with an Intel Core i7-4600U CPU @ 2.10 GHz × 4 and 10 GB of memory. In our experiments, we consider English ontologies as target ontologies to be enriched from German and Arabic ontologies. Our evaluation has three tasks: 1) evaluating the effectiveness of the cross-lingual matching process in OECM compared to the reference alignments provided in the MultiFarm benchmark, 2) comparing OECM matching results with four state-of-the-art approaches, in addition to our previous work (OECM 1.0) [14], and 3) evaluating the quality of the enrichment process.

Effectiveness of OECM
In this experiment, we use the English version of the Cmt ontology as the source ontology, and the German and Arabic versions of the Conference, ConfOf, and Sigkdd ontologies as target ontologies. We match class labels of the Cmt ontology with class labels of the German and Arabic versions of the Conference, ConfOf, and Sigkdd ontologies separately. The resulting alignments are compared with the reference alignments, as a gold standard, provided in the benchmark for each pair of ontologies. Table 2 shows the precision, recall, and F-measure of the matching process for each pair of ontologies. OECM achieves the highest precision of 1.00 for all pairs of ontologies. Meanwhile, OECM achieves the highest recall and F-measure of 0.90 and 0.95 respectively for matching the German Sigkdd with the English Cmt. As two authors of this work are native speakers of Arabic, we found some linguistic mistakes in the Arabic ontologies which negatively affect the translation and the matching results. Therefore, we have corrected these mistakes and made the corrected ontologies available in the GitHub repository 7 . Matching results before and after the corrections are presented in the table; the corrections have greatly improved the matching results in terms of recall and F-measure. For instance, in matching the Arabic Sigkdd with the English Cmt, recall and F-measure are enhanced by 40% and 32% respectively.
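The reported scores follow the standard definitions of precision, recall, and F-measure over sets of correspondences. A small sketch, with hypothetical alignment pairs (the numbers here are illustrative, not those of Table 2):

```python
# Score a set of found correspondences against a reference alignment.
def score(found, reference):
    tp = len(found & reference)                       # true positives
    precision = tp / len(found) if found else 0.0
    recall = tp / len(reference) if reference else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

found = {("Person", "person"), ("Thema", "topic")}            # toy data
reference = {("Person", "person"), ("Thema", "topic"),
             ("Pruefer", "reviewer")}
p, r, f = score(found, reference)
```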

Comparison with the state-of-the-art
We identified four related approaches (AML, KEPLER, LogMap, and XMap) to include in our evaluation, in addition to OECM 1.0. The other related works publish neither their code nor their evaluation datasets [26,11,10]. In order to compare our results with the state-of-the-art, we use the German (Conference de ) and Arabic (Conference ar ) versions of the Conference ontology as source ontologies, and the Ekaw en and Edas en ontologies as the target English ontologies. We choose the Ekaw en and Edas en ontologies for this evaluation because they are used by the state-of-the-art systems for evaluation, as mentioned in the results of OAEI 2018 10 . We generate the gold standard alignments between each pair of ontologies using the Alignment API 4.9 22 in order to compute precision, recall, and F-measure. Table 3 compares our results against the four state-of-the-art approaches and OECM 1.0 (results for matching English and German ontologies only). In addition, we include the updated Arabic ontology (Conference' ar ) with our linguistic corrections in the matching process in order to show the effectiveness of such corrections. The current version of OECM (OECM 1.1) outperforms all other systems in precision, recall, and F-measure. For instance, when matching Conference de × Ekaw en , OECM 1.1 outperforms LogMap, which has the highest precision, recall, and F-measure among the other systems, by 29%, 60%, and 58% in terms of precision, recall, and F-measure respectively. The use of semantic similarity in OECM 1.1 significantly improves the matching results compared to OECM 1.0. For instance, when matching Conference de × Ekaw en , the matching results of OECM 1.0 are enhanced by 25%, 13%, and 18% in terms of precision, recall, and F-measure respectively. When matching Conference ar × Edas en , XMap outperforms OECM by 14% in terms of precision, while OECM outperforms it in both recall and F-measure.
We observed that the precision of OECM slightly decreased because of the linguistic mistakes found in Conference ar . When considering Conference' ar , which has the linguistic corrections, as the source ontology in this matching, the matching results are improved.

Evaluating the Enrichment Process
According to [4], an enriched ontology can be evaluated by comparing it against a predefined reference ontology (gold standard). In this experiment, we evaluate the enriched ontology SEO en−de (from section 4). A gold standard was created by ontology experts by manually enriching the target ontology using Conference de . Comparing SEO en−de with the gold standard, OECM achieves 1.00, 0.80, and 0.89 in terms of precision, recall, and F-measure respectively. This finding confirms the usefulness of our approach for cross-lingual ontology enrichment.

Conclusion
We have presented a fully automated approach, OECM, for building multilingual ontologies. The strength of our contribution lies in building such ontologies from monolingual ones using cross-lingual matching between ontology concepts. Resources in both Indo-European and non-Indo-European languages are used for enrichment in order to illustrate the robustness of our approach. Considering multiple translations of concepts and using semantic similarity measures to select the best translation have significantly improved the quality of the matching process. An iterative triple retrieval process has been developed to determine which information from the source ontology can be added to the target ontology, and where such information should be added. We show the applicability of OECM by presenting a use case for enriching an ontology in the scholarly communication domain. The results of the cross-lingual matching process are promising compared to five state-of-the-art approaches, including the previous version of OECM. Furthermore, the evaluation of the enrichment process confirms the validity of our approach. Finally, we have proposed some linguistic corrections for the Arabic ontologies in the MultiFarm benchmark used in our experiments, which considerably enhanced the matching results. In conclusion, our approach provides a springboard for a new way to build multilingual ontologies from monolingual ones.
In the future, we intend to further consider properties and individuals in the enrichment process. In addition, we aim to apply optimization methods in order to evaluate the efficiency of OECM when enriching very large ontologies.