OWL2Vec*: embedding of OWL ontologies

Semantic embedding of knowledge graphs has been widely studied and used for prediction and statistical analysis tasks across various domains such as Natural Language Processing and the Semantic Web. However, less attention has been paid to developing robust methods for embedding OWL (Web Ontology Language) ontologies, which contain richer semantic information than plain knowledge graphs, and have been widely adopted in domains such as bioinformatics. In this paper, we propose a random walk and word embedding based ontology embedding method named OWL2Vec*, which encodes the semantics of an OWL ontology by taking into account its graph structure, lexical information and logical constructors. Our empirical evaluation with three real world datasets suggests that OWL2Vec* benefits from these three different aspects of an ontology in class membership prediction and class subsumption prediction tasks. Furthermore, OWL2Vec* often significantly outperforms the state-of-the-art methods in our experiments.


Introduction
In recent years, the semantic embedding of knowledge graphs (KGs) has been widely investigated [38]. The objective of such embeddings is to represent KG components such as entities and relations in a vector space in a way that captures the structure of the graph. Various kinds of KG embedding algorithms have been proposed and successfully applied to KG refinement (e.g., link prediction [32] and entity alignment [36]), recommendation systems [31], zero-shot learning [8,39], interaction prediction in bioinformatics [33,27], and so on. However, most of these algorithms focus on creating embeddings for multi-relational graphs composed of triples in RDF (Resource Description Framework) form such as ⟨England, isPartOf, UK⟩ and ⟨UK, hasCapital, London⟩. They do not deal with OWL ontologies (or ontological schemas in OWL), which include not only graph structures but also logic constructors such as class disjointness and existential and universal quantification (e.g., a country must have at least one city as its capital), as well as metadata such as the synonyms, definitions and comments of a class. OWL ontologies have been widely used in many domains such as bioinformatics and the Semantic Web [27,20]. They are capable of expressing complex domain knowledge and managing large scale domain vocabularies, and can often improve the quality and usability of the KG [28].
Inspired by the success of KG embeddings, more recently there has been a growing interest in embedding simple ontological schemas consisting, e.g., of hierarchical classes, and property domain and range [17,26,1,16]; however, these methods rely on having a large number of facts (i.e., an ABox), and do not support more expressive OWL ontologies which contain some widely used logic constructors such as the class disjointness and the existential quantification mentioned above. Embeddings for OWL ontologies have started to receive some attention as well. Kulmanov et al. [21] and Garg et al. [13] proposed to model the semantics of the logic constructors by geometric learning, but their models only support some of the logic constructors from the description logics (DLs) EL++ (which is closely related to OWL EL, a fragment of OWL) and ALC, respectively. Moreover, both methods consider only the logical and graph structure of an ontology, and ignore the lexical information that widely exists in its metadata (e.g., rdfs:label and rdfs:comment triples). OPA2Vec [34] considers the ontology's lexical information by learning a language model which encodes statistical correlations between items in a corpus. However, it treats each axiom as a sentence and fails to explore and utilize the semantic relationships between axioms. OWL2Vec [19], which is our very preliminary work before OWL2Vec*, captures the semantics of OWL ontologies by exploring the neighborhoods of classes, and learning embeddings using a language model. This was shown to be quite effective, but it does not fully exploit the lexical and (onto)logical semantics available in OWL ontologies.
In this work we have extended OWL2Vec in order to provide a more general and robust OWL ontology embedding framework which we call OWL2Vec*. OWL2Vec* exploits an OWL (or OWL 2) ontology by walking over its graph forms and generating a corpus of three documents that capture different aspects of the semantics of the ontology: (i) the graph structure and the logic constructors, (ii) the lexical information (e.g., entity names, comments and definitions), and (iii) a combination of the lexical information, graph structure and logical constructors. Finally, OWL2Vec* uses a neural language model to create embeddings of both entities and words from the generated corpus. Note that the OWL2Vec* framework is compatible with different neural language models, although the current implementation adopts the skip-gram model which is used in Word2Vec [25].
We have evaluated OWL2Vec* in two case studies (class membership prediction and class subsumption prediction), using three large scale real world ontologies: a healthy lifestyle ontology named HeLis [12], a food ontology named FoodOn [11] and the Gene Ontology (GO) [9]. In the case studies we empirically analyze the impact of (i) different document and embedding settings which correspond to combinations of the semantics of the graph structure, lexical information and logic constructors, (ii) different graph structure exploration settings (e.g., the transformation methods from OWL ontology to graph, and the graph walking strategies), (iii) ontology entailment reasoning, and (iv) language model pre-training. The results suggest that OWL2Vec* can achieve significantly better performance than the baselines including the state-of-the-art ontology embeddings [21,13,34,19] and some classic KG embeddings such as RDF2Vec [30], TransE [6] and DistMult [41]. We also calculated the Euclidean distance between entities and visualized the embeddings of some example entities to analyze different embedding methods.
The remainder of the paper is organized as follows. The next section introduces the preliminaries including both background and related work. Section 3 introduces the technical details of OWL2Vec* as well as the case studies. Section 4 presents the experiments and the evaluation results. The last section concludes and discusses future work.

OWL Ontologies
Our OWL2Vec* embedding targets OWL ontologies [4], which are based on the SROIQ description logic (DL) [3]. Consider a signature Σ = (N_C, N_R, N_I), where N_C, N_R and N_I are pairwise disjoint sets of, respectively, atomic concepts, atomic roles and individuals. Complex concepts and roles can be composed using DL constructors such as conjunction (e.g., C ⊓ D), disjunction (e.g., C ⊔ D), existential restriction (e.g., ∃r.C) and universal restriction (e.g., ∀r.C), where C and D are concepts, and r is a role. An OWL ontology comprises a TBox T and an ABox A. The TBox is a set of axioms such as General Concept Inclusion (GCI) axioms (e.g., C ⊑ D), Role Inclusion (RI) axioms (e.g., r ⊑ s) and Inverse Role axioms (e.g., s ≡ r⁻), where C and D are concepts, r and s are roles, and r⁻ denotes the inverse of r. The ABox is a set of assertions such as concept assertions (e.g., C(a)), role assertions (e.g., r(a, b)) and individual equality and inequality assertions (e.g., a ≡ b and a ≢ b), where C is a concept, r is a role, and a and b are individuals.
In OWL, atomic concepts, roles and individuals are referred to as entities; concepts, roles and individuals are referred to as classes, object properties and instances, respectively. A GCI axiom C ⊑ D corresponds to a subsumption relation between the class C and the class D, while a concept assertion C(a) corresponds to a membership relation between the instance a and the class C. Each entity in an OWL ontology is uniquely represented by a Uniform Resource Identifier (URI). These URIs may be lexically 'meaningful' (e.g., vc:AlcoholicBeverages in Figure 1a) or consist of internal IDs that do not carry useful lexical information (e.g., obo:FOODON_00002809 in Figure 1b); in either case the intended meaning may also be indicated via annotations (see below).
In OWL, complex classes, complex properties, axioms and assertions can be serialised as (sets of) RDF triples. These triples use a combination of bespoke object properties (e.g., vc:hasNutrient) and RDF, RDFS and OWL built-in terms (e.g., rdfs:subClassOf, rdf:type and owl:someValuesFrom). In Figure 1, for example, the relationship between the instances vc:FOOD-4001 and vc:VitaminC_100 is represented by a triple using the property vc:hasNutrient, while the existential restriction involving the class obo:FOODON_00002809 and the object property obo:RO_0001000 is represented by triples using the OWL built-in terms owl:Restriction, owl:onProperty and owl:someValuesFrom. As in RDF, the object of an OWL role assertion triple can also be a literal value; for example, the calories amount vc:Beer
In addition to axioms and assertions with formal logic-based semantics, an ontology often contains metadata in the form of annotation axioms. These annotations can also be represented in triple form using annotation properties as predicates; e.g., the class obo:FOODON_00002809 is annotated using rdfs:label to specify a name string, using rdfs:comment to specify a description, and using obo:IAO_0000115 (a bespoke annotation property) to specify a natural language "definition".
Knowledge graphs (KGs) are structured knowledge resources which are often expressed as sets of RDF triples [18]. Many KGs only contain instances and facts, which are equivalent to the ABox of an OWL ontology. Some other KGs such as DBpedia [2] are also enhanced with a schema which is equivalent to the TBox of an OWL ontology. Thus, a KG can often be understood as an ontology.

Semantic Embedding
Semantic embedding refers to a series of representation learning (or feature learning) techniques that encode the semantics of data such as sequences and graphs into vectors, such that they can be utilized by downstream machine learning prediction and statistical analysis tasks [5]. Neural language models such as Feed-Forward Neural Networks, Recurrent Neural Networks and Transformers are widely used for semantic embedding, and they have shown good performance in embedding the context (e.g., item co-occurrence) in sequences [24,29,10]. Two classic auto-encoding architectures for learning representations of sequential items are continuous skip-gram and continuous Bag-of-Words (CBOW) [25,24]. The former aims at predicting the surroundings of an item, while the latter aims at predicting an item based on its surroundings. Word2Vec is a well known group of neural language models for learning word embeddings from a large corpus, and was initially developed by a team at Google; it can be configured to use either skip-gram or CBOW architectures [25,24].
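The difference between the two architectures can be illustrated by the training examples each one derives from a sentence. The following is a minimal sketch, not the Word2Vec implementation itself; the function names and the symmetric `window` parameter are assumptions made for illustration.

```python
from typing import List, Tuple


def skipgram_pairs(sentence: List[str], window: int) -> List[Tuple[str, str]]:
    """Skip-gram view: each item (center) is used to predict each neighbor."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


def cbow_examples(sentence: List[str], window: int) -> List[Tuple[List[str], str]]:
    """CBOW view: the neighbors jointly predict the item in the middle."""
    examples = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


if __name__ == "__main__":
    walk = ["beer", "has", "nutrient"]
    print(skipgram_pairs(walk, window=1))
    print(cbow_examples(walk, window=1))
```

Both functions consume the same sentence; only the direction of prediction differs.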
Semantic embedding has also been extended to KGs composed of role assertions [38]. The entities and relations (object properties) are represented in a vector space while retaining their relative relationships (semantics), and the resulting vectors are then applied to downstream tasks including link prediction [32], entity alignment [36], and erroneous fact detection and correction [7]. One paradigm for learning KG representations is computing the embeddings in an end-to-end manner, iteratively adjusting the vectors using an optimization algorithm to minimize the overall loss across all the triples, where the loss is usually calculated by scoring the truth/falsity of each triple (positive and negative samples). Algorithms based on this technique include translation based models such as TransE [6] and TransR [22] and latent factor models such as DistMult [41].
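As a rough illustration of how such end-to-end models score a triple, the sketch below shows the commonly cited TransE distance and DistMult trilinear product, together with a margin-based loss over one positive and one negative sample. The formulas come from the general KG embedding literature rather than from this paper, and the choice of the L2 norm and the default margin of 1.0 are assumptions.

```python
import math
from typing import Sequence


def transe_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """TransE: distance ||h + r - t||_2; a lower value means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """DistMult: sum_i h_i * r_i * t_i; a higher value means a more plausible triple."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def margin_loss(pos_dist: float, neg_dist: float, margin: float = 1.0) -> float:
    """Hinge loss pushing the positive distance below the negative one by a margin."""
    return max(0.0, margin + pos_dist - neg_dist)


if __name__ == "__main__":
    h, r, t = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
    print(transe_score(h, r, t))    # h + r equals t exactly
    print(distmult_score(h, r, t))
```

Training then amounts to adjusting the vectors so that the summed loss over all positive and corrupted triples decreases.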
Another paradigm is to first explicitly explore the neighborhoods of entities and relations in the graph, and then learn the embeddings using a language model. Two representative algorithms based on this paradigm are node2vec [15] and Deep Graph Kernels [40]. The former extracts random graph walks as the corpus for training skip-gram or CBOW models, while the latter uses graph kernels such as Weisfeiler-Lehman (WL) subtree kernels as the corpus. However, both embedding algorithms were originally developed for undirected graphs, and thus may have limited performance when directly applied to KGs. RDF2Vec addresses this issue by extending the idea of the above two algorithms to directed labeled RDF graphs, and has been shown to learn effective embeddings for large scale KGs such as DBpedia [30,31]. Recent studies have explored the use of new neural language models for learning embeddings; one example is RW-LMLM, which combines a random walk algorithm with a Transformer model [37].
Our OWL2Vec* technique belongs to the language model paradigm, but we focus on OWL ontologies instead of typical KGs, with the goal of preserving the semantics not only of the graph structure, but also of the lexical information and the logical constructors. Note that the graph of an ontology, which includes a hierarchical categorization structure, differs from the multi-relational graph composed of role assertions of a typical KG; furthermore, the ontology's lexical information and logical constructors cannot be successfully exploited by the aforementioned KG embedding methods.

Ontology Embedding
The use of machine learning prediction and statistical analysis with ontologies is receiving wider attention, and some approaches to embedding the semantics of OWL ontologies can already be found in the literature. Unlike typical KGs, OWL ontologies include not only graph structure but also logical constructors, and entities are often augmented with richer lexical information specified using rdfs:label, rdfs:comment and many other bespoke or built-in annotation properties. The objective of OWL ontology embedding in this study is to represent each OWL named entity (class, instance or property) by a vector, such that the inter-entity relationships indicated by the above information are kept in the vector space, and the performance of the downstream tasks, where the input vectors can be understood as learned features, is maximized.
EL Embedding [21] and Quantum Embedding [13] are two OWL ontology embedding algorithms of the end-to-end paradigm. They construct specific score functions and loss functions for logical axioms from EL++ and ALC, respectively, by transforming logical relations into geometric relations. This encodes the semantics of the logical constructors, but ignores the additional semantics provided by the lexical information of the ontology. Moreover, although the graph structure is explored by considering class subsumption and class membership axioms, the exploration is incomplete as it uses only rdfs:subClassOf and rdf:type edges, and ignores edges involving other relations.
Onto2Vec [33] and OPA2Vec [34] are two ontology embedding algorithms of the language model paradigm using a neural language model of either the skip-gram architecture or the CBOW architecture. Onto2Vec uses the axioms of an ontology as the corpus for training, while OPA2Vec complements the corpus of Onto2Vec with the lexical information provided by, e.g., rdfs:comment. They have been evaluated with the Gene Ontology for predicting protein-protein interaction (i.e., a domain-specific relationship between classes), which is quite different from the class membership prediction and the class subsumption prediction in this study. Both methods treat each axiom as a sentence, which means that they cannot explore the correlation between axioms. This makes it hard to fully explore the graph structure and the logical relation between axioms, and may also lead to the problem of corpus shortage for small to medium scale ontologies. OWL2Vec* deals with the above issues of OPA2Vec and Onto2Vec by complementing their axiom corpus with a corpus generated by walking over RDF graphs that are transformed from the OWL ontology with its graph structure and logical constructors considered. In addition, to fully utilize the lexical information, OWL2Vec* creates embeddings not only for the ontology entities, as current KG/ontology embedding methods do, but also for the words in the lexical information.

Methodology
Figure 2 presents the overall framework of OWL2Vec*, which mainly consists of two core steps: (i) corpus extraction from the ontology, and (ii) language model training with the corpus and entity embedding. The corpus includes a structure document, a lexical document, and a document combining the structure and the lexical information. The former two aim at exploring the ontology's graph structure, logical constructors and lexical information, while the third aims at preserving the correlation between entities (URIs) and their lexical labels (words). Briefly, given an input ontology O and the target entities E of O for embedding, OWL2Vec* outputs a vector for each entity e in E, denoted as e ∈ R^d, where d is the (configurable) embedding dimension. Note that E can be all the entities in O or just a part needed for a specific application. For class membership prediction we set E to all the named classes and instances; for class subsumption prediction we set E to all the named classes.
Table 1: Projection rules, based on [35,19], used in the second strategy to generate an RDF graph. The restriction constructor is one of: ≥, ≤, =, ∃, ∀. A, B, B_i and C_i are atomic concepts, s_i, r and r′ are roles (object properties), r⁻ is the inverse of a relation r, a and b are individuals (instances), ⊤ is the top concept (defined by owl:Thing).

From OWL Ontology to RDF Graph
OWL2Vec* incorporates two strategies to turn the original OWL ontology O into a graph G in RDF form. The first strategy implements the transformation according to the OWL to RDF Graph Mapping defined by the W3C. For example, the existential restriction of the class obo:FOODON_00002809 in Figure 1b, namely ObjectSomeValuesFrom(obo:RO_0001000, obo:FOODON_03411347), is transformed into four triples, i.e., ⟨obo:FOODON_00002809, rdfs:subClassOf, _:x⟩, ⟨_:x, owl:someValuesFrom, obo:FOODON_03411347⟩, ⟨_:x, rdf:type, owl:Restriction⟩ and ⟨_:x, owl:onProperty, obo:RO_0001000⟩, where _:x denotes a blank node. The second strategy is based on the projection rules proposed in [35,19] (see Table 1). Every RDF triple ⟨X, r, Y⟩ (where r is an object property, and X and Y are atomic concepts or instances) in the projection is justified by one or more axioms in the ontology. For example, the above mentioned existential restriction of the class obo:FOODON_00002809 would be represented with ⟨obo:FOODON_00002809, obo:RO_0001000, obo:FOODON_03411347⟩. This strategy avoids the use of blank nodes in the RDF graph; but, unlike the first strategy, it only approximates the logical constructors of the OWL ontology.
Both strategies can incorporate an OWL reasoner to compute the TBox classification and ABox realization before O is transformed into an RDF graph G. Such reasoning grounds the axioms of logical constructors and leads to explicit representation of some hidden knowledge. In our experiments we use the HermiT reasoner [14], and we evaluate the impact of enabling or disabling reasoning.
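The contrast between the two transformation strategies can be sketched for the single axiom pattern discussed above (a named class subsumed by an existential restriction). This is an illustrative sketch only: the function names are hypothetical, triples are plain Python tuples rather than objects from an RDF library, and the fixed blank node label is an assumption.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def rdf_mapping_triples(sub_cls: str, prop: str, filler: str,
                        blank: str = "_:x") -> List[Triple]:
    """First strategy: four triples that route through a blank restriction node."""
    return [
        (sub_cls, "rdfs:subClassOf", blank),
        (blank, "rdf:type", "owl:Restriction"),
        (blank, "owl:onProperty", prop),
        (blank, "owl:someValuesFrom", filler),
    ]


def projection_triples(sub_cls: str, prop: str, filler: str) -> List[Triple]:
    """Second strategy: one direct triple, which approximates the restriction."""
    return [(sub_cls, prop, filler)]


if __name__ == "__main__":
    axiom = ("obo:FOODON_00002809", "obo:RO_0001000", "obo:FOODON_03411347")
    for triple in rdf_mapping_triples(*axiom):
        print(triple)
    print(projection_triples(*axiom))
```

The projected graph is smaller and free of blank nodes, at the cost of no longer distinguishing, for instance, an existential from a universal restriction.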

Structure Document
The structure document aims at capturing both the graph structure and the logical constructors of the ontology. With the RDF graph G, one option is computing random walks for each target entity in E. Each walk, which is a sequence of entity URIs, acts as a sentence of the structure document. An example of a random walk of depth four starting from the class vc:Beer in Figure 1a is (vc:Beer, rdf:type, vc:FOOD-4001, vc:hasNutrient, vc:VitaminC_100). Another option is to use the Weisfeiler-Lehman (WL) RDF sub-tree kernel, which encodes the structure of a sub-tree into a unique identity to enable the comparison of sub-tree structures. Briefly, the WL subtree kernel solution replaces the final entity of each random walk with the kernel (identity) of the sub-tree rooted in this entity.
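The random walk option can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: the toy graph with the `ex:` prefix is hypothetical, walks follow outgoing edges only, and `depth` is taken here to mean the number of edges traversed, which is an assumption about the exact definition.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def build_adjacency(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index outgoing (predicate, object) edges by subject."""
    adj: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
    return adj


def random_walk(adj: Dict[str, List[Tuple[str, str]]], start: str,
                depth: int, rng: random.Random) -> List[str]:
    """Return one walk: alternating node and predicate URIs, at most `depth` hops."""
    walk, node = [start], start
    for _ in range(depth):
        edges = adj.get(node)
        if not edges:          # dead end: stop early
            break
        pred, node = rng.choice(edges)
        walk.extend([pred, node])
    return walk


if __name__ == "__main__":
    # Hypothetical toy graph, used only for illustration.
    toy = [("ex:a", "ex:p", "ex:b"), ("ex:b", "ex:q", "ex:c"), ("ex:a", "ex:r", "ex:c")]
    adj = build_adjacency(toy)
    rng = random.Random(0)
    for _ in range(3):
        print(random_walk(adj, "ex:a", depth=2, rng=rng))
```

Each returned list plays the role of one URI sentence in the structure document.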
To capture the logical constructors, OWL2Vec* extracts all the axioms of the ontology as a complement of the sentences of the structure document, where each axiom is transformed into a sequence following the OWL Manchester Syntax. For example, the axiom of the existential restriction of the class obo:FOODON_00002809 in Figure 1b is transformed into the sequence (obo:FOODON_00002809, subClassOf, obo:RO_0001000, some, obo:FOODON_03411347).

Lexical Document
The lexical document includes word sentences transformed from the entity URI sentences in the structure document and the relevant lexical annotation axioms in the ontology. For the former, given an entity URI sentence, each of its entities is replaced by its English label defined by rdfs:label. Note that the label is parsed and transformed into lowercase tokens, and those tokens with non-letter characters are filtered out, before it replaces the entity URI. It is possible that some entities have no annotations or no English annotations, such as the class vc:MilkAndYogurt and the instance vc:VitaminC_1000 in Figure 1a. In this case, we use the name part of the URI, assuming that the name follows camel case. As an example of the transformation, the above mentioned random walk (vc:Beer, rdf:type, vc:FOOD-4001, vc:hasNutrient, vc:VitaminC_100) is transformed into ("beer", "type", "blonde", "beer", "has", "nutrient", "vitamin", "c").
For the latter, OWL2Vec* selects the annotation axioms that use bespoke annotation properties such as obo:IAO_0000115 (definition) and those that use built-in annotation properties such as rdfs:comment. Note that annotation axioms using rdfs:label are ignored as these labels are already considered in replacing the entity URIs mentioned above. More specifically, for each annotation axiom, OWL2Vec* replaces the subject entity by its English label or URI name as in transforming the URI sentence, and keeps the lowercase word tokens parsed from the annotation value. For example, the axiom (obo:FOODON_00002809, obo:IAO_0000115, "Edamame is a preparation of immature soybean ...") is turned into ("edamame", "edamame", "is", "a", "preparation", "of", "immature", "soybean", ...), which can help build a correlation between "soybean" and "edamame".
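The two kinds of lexical sentences described above can be sketched as below. This is a minimal illustration under several assumptions: labels are passed in as a plain dictionary, text is split on whitespace, the camel-case regular expression is one possible choice, and the label "Blonde Beer" for vc:FOOD-4001 is assumed for the example.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokens; tokens containing non-letter characters are dropped."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def uri_name_tokens(uri: str) -> List[str]:
    """Split the name part of a URI (after the last '#', '/' or ':') on camel case."""
    name = re.split(r"[#/:]", uri)[-1]
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts if p.isalpha()]


def entity_tokens(uri: str, labels: Dict[str, str]) -> List[str]:
    """Use the English label when available, otherwise fall back to the URI name."""
    return tokenize(labels[uri]) if uri in labels else uri_name_tokens(uri)


def lexical_sentence(uri_sentence: List[str], labels: Dict[str, str]) -> List[str]:
    """Replace every URI of a structure-document sentence by its word tokens."""
    return [tok for uri in uri_sentence for tok in entity_tokens(uri, labels)]


def annotation_sentence(subject: str, value: str, labels: Dict[str, str]) -> List[str]:
    """Word tokens of the subject followed by the tokens of the annotation value."""
    return entity_tokens(subject, labels) + tokenize(value)


if __name__ == "__main__":
    labels = {"vc:FOOD-4001": "Blonde Beer"}  # assumed label, for illustration only
    walk = ["vc:Beer", "rdf:type", "vc:FOOD-4001", "vc:hasNutrient", "vc:VitaminC_100"]
    print(lexical_sentence(walk, labels))
```

Numeric fragments such as the "100" in vc:VitaminC_100 are discarded by the letter-only filter, which matches the example sentence given in the text.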

Combined Document
OWL2Vec* further extracts a combined document from the structure document and the entity annotations, so as to preserve the correlation between entities (URIs) and words in the lexical information. To this end, we developed two strategies to deal with each URI sentence in the structure document. One strategy is to randomly select an entity in a URI sentence, keep the URI of this entity, and replace the other entities of this sentence by their lowercase word tokens extracted from their labels or URI names as in the creation of the lexical document. For example, for the URI sentence (vc:FOOD-4001, vc:hasNutrient, vc:VitaminC_100), if the first entity is selected, then the generated combined sentence is (vc:FOOD-4001, "has", "nutrient", "vitamin", "c"), which can help build correlations between vc:FOOD-4001 and words such as "nutrient" and "vitamin". The other strategy is traversing all the entities in a URI sentence. For each entity, it generates a combined sentence by keeping the URI of this entity, and replacing the others by their lowercase word tokens as in the random strategy. Thus for one URI sentence, it generates m combined sentences, where m is the number of entities of the URI sentence.
On the one hand, the combined document captures the correlation between URIs and words, which may benefit the embedding of URIs with word semantics. On the other hand, it may add noise to the correlation among words. The impact of the combined document and its two strategies is analyzed in our evaluation (cf. Section 4.3.1).
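The two strategies for building combined sentences can be sketched as follows. The helper names are hypothetical, and the word tokens of each entity are supplied through a caller-provided function (`tokens_of`), which is an assumption made to keep the example self-contained.

```python
import random
from typing import Callable, List


def combine(uri_sentence: List[str], keep: int,
            tokens_of: Callable[[str], List[str]]) -> List[str]:
    """Keep the URI at position `keep`; replace every other URI by its word tokens."""
    out: List[str] = []
    for i, uri in enumerate(uri_sentence):
        if i == keep:
            out.append(uri)
        else:
            out.extend(tokens_of(uri))
    return out


def combined_random(uri_sentence: List[str], tokens_of: Callable[[str], List[str]],
                    rng: random.Random) -> List[str]:
    """Random strategy: one combined sentence with a randomly chosen URI kept."""
    return combine(uri_sentence, rng.randrange(len(uri_sentence)), tokens_of)


def combined_all(uri_sentence: List[str],
                 tokens_of: Callable[[str], List[str]]) -> List[List[str]]:
    """Traversal strategy: m combined sentences, one per entity of the URI sentence."""
    return [combine(uri_sentence, i, tokens_of) for i in range(len(uri_sentence))]


if __name__ == "__main__":
    words = {  # assumed tokens, for illustration only
        "vc:FOOD-4001": ["blonde", "beer"],
        "vc:hasNutrient": ["has", "nutrient"],
        "vc:VitaminC_100": ["vitamin", "c"],
    }
    sentence = ["vc:FOOD-4001", "vc:hasNutrient", "vc:VitaminC_100"]
    for s in combined_all(sentence, words.__getitem__):
        print(s)
```

The traversal strategy produces a corpus m times larger than the random strategy for the same set of URI sentences, which is one reason the two are evaluated separately.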

Embeddings
OWL2Vec* first merges the structure document, the lexical document and the combined document into one document, and then uses this document to train a Word2Vec neural language model with the skip-gram architecture. Training stops when the loss tends to be stable. The minimum word count hyperparameter is set to 1 such that each word or entity (URI) is encoded as long as it appears in the documents at least once. Optionally, the Word2Vec model can be pre-trained on a large and general corpus such as a dump of Wikipedia articles. This brings some prior correlations between words, especially between a word's synonyms and between a word's variants, which enables the downstream machine learning tasks to identify their semantic equality or similarity in the word vector space. However, such prior correlations may also be noisy and play a negative role in a domain specific task (cf. the evaluation in Section 4.3.4). Note that OWL2Vec* is compatible with the CBOW architecture and other neural language models, but their selection and evaluation is out of the scope of this study.
With the trained neural language model, OWL2Vec* calculates the embedding of each target entity e in E. Its embedding e is the concatenation of V_uri(e) and V_word(e), where V_uri(e) is the vector of the URI of e, and V_word(e) is the average of the vectors of all the lowercase word tokens of e. As in the case of constructing lexical sentences from URI sentences, the word tokens of e are extracted from its English label if such a label exists, or from its URI name otherwise. Due to the concatenation, the embedding size of e, i.e., d, is twice the original embedding size of the neural language model. V_uri(e) and V_word(e) can also be used independently. A comparison of their performance can be found in Section 4.3.1.
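The concatenation described above can be sketched in a few lines. The sketch assumes the trained model is available as a plain dictionary from tokens (URIs and words) to vectors; the zero-vector fallback for entities whose words are all missing from the vocabulary is an assumption and is not taken from the paper.

```python
from typing import Dict, List, Sequence


def entity_embedding(uri: str, word_tokens: Sequence[str],
                     vectors: Dict[str, List[float]]) -> List[float]:
    """Concatenate V_uri(e) with V_word(e), the mean of the entity's word vectors."""
    v_uri = vectors[uri]
    known = [vectors[tok] for tok in word_tokens if tok in vectors]
    if known:
        v_word = [sum(column) / len(known) for column in zip(*known)]
    else:
        v_word = [0.0] * len(v_uri)   # assumed fallback when no word vector exists
    return list(v_uri) + v_word


if __name__ == "__main__":
    toy_vectors = {                   # hypothetical 2-dimensional model
        "vc:FOOD-4001": [0.1, 0.2],
        "blonde": [0.0, 2.0],
        "beer": [2.0, 4.0],
    }
    print(entity_embedding("vc:FOOD-4001", ["blonde", "beer"], toy_vectors))
```

With a model of dimension 100, the resulting entity embedding therefore has d = 200 components.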

Case Studies
We applied OWL2Vec* to ontology completion, which first trains a prediction model from known relations (axioms) and then predicts plausible relations (see footnote 8). It includes two tasks: class membership prediction and class subsumption prediction, where the embedding of an entity can be understood as the features automatically learned from its neighbourhood, relevant axioms and lexical information without any supervision.
Given a head entity e_1 and a tail entity e_2, where e_1 is an instance and e_2 is a class, the membership prediction task aims at training a model to predict the plausibility that e_1 is a member of e_2 (i.e., e_2(e_1)). The input is the concatenation of the embeddings of e_1 and e_2, i.e., x = [e_1, e_2], while the output is a score y in [0, 1], where a higher y indicates a more plausible membership relation. For the prediction model, a basic binary machine learning classifier such as Random Forest can be adopted.
In training, the positive training samples (membership axioms) are taken directly from the ontology, while the negative samples are constructed by corrupting each positive sample. Namely, for each positive sample (e_1, e_2), one negative sample (e_1, e_2′) is generated, where e_2′ is a random class of the ontology and e_1 is not a member of e_2′ even after entailment reasoning. In prediction, given a head entity (i.e., the target), a candidate set of classes is selected (e.g., all the classes except for the top class owl:Thing, or a subset after filtering via some heuristic rules), each candidate is predicted with a score, and the candidates are then ranked according to their scores, where the top is the most likely class of the instance. Class subsumption prediction is similar to class membership prediction, except that e_1 and e_2 are both classes, the goal is to predict whether e_1 is subsumed by e_2 (i.e., e_1 ⊑ e_2), and the head entity e_1 itself is excluded from the candidate classes.
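The construction of negative samples and classifier inputs can be sketched as below. The function names are hypothetical, and the set `entailed` (all membership pairs that hold after reasoning) is assumed to have been computed beforehand, for instance with an OWL reasoner.

```python
import random
from typing import Iterable, List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def corrupt_samples(positives: Sequence[Pair], classes: Sequence[str],
                    entailed: Iterable[Pair], rng: random.Random) -> List[Pair]:
    """For each positive (e1, e2), draw one class e2' such that (e1, e2') is not known."""
    known: Set[Pair] = set(positives) | set(entailed)
    negatives: List[Pair] = []
    for e1, _ in positives:
        candidates = [c for c in classes if (e1, c) not in known]
        if candidates:                      # skip if every class is a known type of e1
            negatives.append((e1, rng.choice(candidates)))
    return negatives


def pair_features(e1_vec: Sequence[float], e2_vec: Sequence[float]) -> List[float]:
    """Classifier input x = [e1, e2]: the concatenation of the two embeddings."""
    return list(e1_vec) + list(e2_vec)


if __name__ == "__main__":
    positives = [("i1", "A"), ("i2", "B")]          # toy membership axioms
    print(corrupt_samples(positives, ["A", "B", "C"], {("i1", "C")}, random.Random(0)))
    print(pair_features([0.1, 0.2], [0.3, 0.4]))
```

The resulting feature vectors, labeled 1 for positives and 0 for negatives, can then be passed to any binary classifier that outputs a score in [0, 1].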

Experimental Setting
We evaluated OWL2Vec* on class membership prediction with the HeLis ontology [12] (project page: https://horus-ai.fbk.eu/helis/), and on class subsumption prediction with the FoodOn ontology [11] and the Gene Ontology (GO). HeLis captures general knowledge about both food and healthy lifestyles, FoodOn captures more detailed knowledge about food, and GO is a major bioinformatics initiative to unify the representation of gene and gene product attributes. Some statistics of the three ontologies are shown in Table 2. Due to different knowledge representations, HeLis has a large number of membership axioms but a very small number of subsumption axioms, while FoodOn and GO have only subsumption axioms. This is the reason why we evaluated membership prediction on HeLis, and subsumption prediction on FoodOn and GO. Data and code are available at https://github.com/KRR-Oxford/OWL2Vec-Star.
Footnote 8: Our ontology completion task is different from ontology reasoning. Our goal is not to infer relations that logically follow from the given input, but to try to discover plausible relations that complement the original ontology. (Most) plausible relations may not be inferred, and our evaluation focuses exactly on those plausible relations that cannot be inferred.
Table 2: Statistics of the HeLis ontology, the FoodOn ontology and the GO ontology.
The experiments on membership and subsumption prediction use the following setting: all the explicitly declared class membership axioms (or class subsumption axioms) are randomly divided into three sets for training (70%), validation (10%) and testing (20%), respectively. For each axiom in the validation/testing set, the head entity (i.e., an instance for membership prediction and a class for subsumption prediction) is the target whose class is to be predicted from all the candidates and compared against the tail entity (as the ground truth class) in evaluation. All the candidates are ranked according to the predicted score, which indicates the likelihood of being the head entity's class. We calculate the following widely adopted metrics: Hits@1, Hits@5, Hits@10 and MRR (Mean Reciprocal Rank). The first three measure the recall of the ground truths within the top 1/5/10 ranking positions, while the fourth averages the reciprocals of the ranking positions of the ground truths. The higher the metrics, the better the performance.
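The ranking metrics can be computed as in the following sketch. The function names are hypothetical and the candidate scores are toy values; ties are broken by the order of the candidates in the input, which is an assumption.

```python
from typing import Dict, List


def rank_of(truth: str, scores: Dict[str, float]) -> int:
    """1-based position of the ground truth class when candidates are sorted by score."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking.index(truth) + 1


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of test axioms whose ground truth is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks: List[int]) -> float:
    """Average of 1/rank over all test axioms."""
    return sum(1.0 / r for r in ranks) / len(ranks)


if __name__ == "__main__":
    ranks = [
        rank_of("B", {"A": 0.9, "B": 0.5, "C": 0.1}),   # ground truth ranked second
        rank_of("A", {"A": 0.8, "B": 0.3, "C": 0.2}),   # ground truth ranked first
    ]
    print(hits_at_k(ranks, 1), hits_at_k(ranks, 5), mean_reciprocal_rank(ranks))
```

In the setting above, one rank is computed per axiom in the validation or testing set, with the tail entity as the ground truth.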
The performance of OWL2Vec* is reported with the following settings. If not specified, OWL2Vec* uses OWL to RDF Graph Mapping without entailment reasoning. For the Word2Vec model, the dimension is set to 100 if no pre-training is adopted, and otherwise set to be consistent with the pre-trained model (we used a model pre-trained on a 2019 Wikipedia dump with a dimension of 200); the window size is set to 5; the minimum count of words is set to 1; the iteration number of training is set to 10, which is based on the observation of the loss. Random Forest is adopted as the basic binary classifier. Other hyperparameters such as the walking strategy (WL subtree kernel or random walk) and the walking depth, as well as the hyperparameters of the baselines, are also adjusted through the validation set; the setting that leads to the highest MRR on the validation set is adopted.
The evaluation is organized as follows. We first compare OWL2Vec* with the baselines, then analyze the impact of different settings including the type of document, the use of reasoning, the selection of URI and word embeddings, and the adoption of pre-trained embeddings, and finally analyze the embeddings via visualization and comparing Euclidean distances. The selected baselines include (i) four well-known knowledge graph embedding methods, i.e., RDF2Vec, TransE, TransR and DistMult, (ii) four state-of-the-art ontology embedding methods, i.e., Onto2Vec, OPA2Vec, EL Embedding and Quantum Embedding, (iii) the original OWL2Vec, which is equivalent to OWL2Vec* using the URI embedding, structure document and ontology projection rules, and (iv) the pre-trained Word2Vec model. The embeddings of these baselines are applied to the two tasks in the same way as OWL2Vec*. Note that RDF2Vec, TransE, TransR and DistMult are trained with the RDF graph G using OWL to RDF Graph Mapping without entailment reasoning, while the pre-trained Word2Vec calculates the average word vector of an entity according to its label (or its URI name if the label does not exist) as in OWL2Vec*.

Comparison with Baselines
Table 3 reports the performance of OWL2Vec*, with the setting optimized via the validation set. It shows that OWL2Vec* outperforms all the baselines. Note that the performance of OWL2Vec* with different settings can be found in Section 4.3. Among the ontology embedding and KG embedding baselines that directly calculate the URI's vector without considering word vectors, OPA2Vec achieves the best performance on FoodOn and GO for subsumption prediction, while the KG embedding method RDF2Vec performs best on HeLis for class membership prediction. In contrast, the two logic embedding methods, Quantum Embedding and EL Embedding, as well as TransE, perform poorly on all three ontologies. Our preliminary work OWL2Vec achieves promising results on HeLis (close to RDF2Vec) and FoodOn (close to OPA2Vec), but performs poorly on GO. OWL2Vec* outperforms both KG embedding methods and ontology embedding methods; for example, its Hits@1 is 325.6% higher than that of RDF2Vec on HeLis, and 146.6% higher than that of OPA2Vec on FoodOn.
Meanwhile, OWL2Vec* also outperforms the pre-trained Word2Vec, with 6.0%, 56.6% and 38.2% higher MRR on HeLis, FoodOn and GO, respectively. It is interesting to see that the pre-trained Word2Vec using entity labels or URI names achieves good performance, outperforming ontology and KG embedding baselines such as RDF2Vec and OPA2Vec. This means that lexical information plays a very important role in embedding real world ontologies, especially for membership prediction and subsumption prediction, as the names of instances and classes with a membership or subsumption relationship often share common words, synonyms or word variants. This is confirmed by our subsequent analysis of different settings of OWL2Vec* (see $V_{uri}$, $V_{word}$ and $V_{uri,word}$ in Table 4). A key difference between OWL2Vec* and Word2Vec is that the word embedding of OWL2Vec* is trained on an ontology-tailored corpus underpinned by the ontology's graph structure and logical axioms.
Note that the performance of membership prediction with HeLis is much higher than that of subsumption prediction with FoodOn and GO. This is because the former has far fewer candidate classes (cf. Table 2) and is thus less challenging.
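The "X% higher" figures quoted in this comparison read as relative gains over a baseline score. Assuming that interpretation, which the text does not state explicitly, the arithmetic is as follows; the two scores in the example are hypothetical and are not taken from Table 3.

```python
def relative_gain(score: float, baseline: float) -> float:
    """Percentage by which `score` exceeds `baseline`."""
    return (score - baseline) / baseline * 100.0


# Hypothetical MRR values, for illustration only:
# a method at 0.60 against a baseline at 0.40 is about 50% higher.
gain = relative_gain(0.60, 0.40)
```

Under this reading, a gain above 100% means that the score is more than twice the baseline, which is worth keeping in mind when comparing gains across ontologies with very different baseline levels.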

Lexical Information
Table 4 shows that the lexical document $D_l$ leads to a significant improvement in performance when it is merged with the structure document $D_s$ (i.e., $D_{s,l}$). The MRR of $D_{s,l}$ exceeds that of $D_s$ by 26.3% on HeLis, by 60.4% on FoodOn and by 139.0% on GO when the URI embedding ($V_{uri}$) is used, and by 192.9%, 134.1% and 234.1% respectively when both the URI embedding and the word embedding ($V_{uri,word}$) are used.
Unlike the lexical document, the combined documents ($D_{s,l,rc}$ and $D_{s,l,tc}$), which also rely on the lexical information of the ontology, have only a limited positive impact. For class membership prediction, the best performance of $D_{s,l,rc}$ and the best performance of $D_{s,l,tc}$ are both very close to the best performance of $D_{s,l}$, while for class subsumption prediction, they are both worse than the best performance of $D_{s,l}$. We find that the combined document has a positive impact when the URI embedding alone is adopted, but often has a negative impact when the word embedding is concatenated or used alone. This is because the combined sentences build correlations between words and URIs, which benefits the URI embedding, but adds noise to the correlations between words. The traversal combination strategy, which corrupts more word correlations, has a similar impact to the random combination strategy on HeLis, but a more negative impact on FoodOn.
Besides the lexical document, the word embedding, which also benefits from the lexical information of the ontology, shows a very strong positive impact. On the one hand, as discussed in Section 4.2, the two methods that use the word embedding, i.e., OWL2Vec* and the pre-trained Word2Vec, both dramatically outperform the remaining methods. On the other hand, in Table 4, the best performance on HeLis comes from $V_{uri,word}$, while the best performance on FoodOn and GO comes from $V_{word}$. The advantage of $V_{uri,word}$ and $V_{word}$ over $V_{uri}$ is quite significant; for example, with the lexical and structure documents, the Hits@1 of $V_{uri,word}$ can be 0.934 while the Hits@1 of $V_{uri}$ is only 0.295. The combined document can improve the performance of $V_{uri}$ a little due to the correlation between URIs and words, but the improvement is very limited in comparison with directly using the word embedding.
Regarding the URI embedding, on the one hand it can alone outperform the baselines in Table 3, except for the pre-trained Word2Vec. On the other hand, the impact of the URI embedding when it is concatenated with the word embedding varies from task to task. It has a positive impact on class membership prediction with HeLis; for example, when trained on the structure document and the lexical document ($D_{s,l}$), the MRR of $V_{uri,word}$ is 1.8% higher than that of $V_{word}$. However, on class subsumption prediction with FoodOn and GO, the URI embedding has a negative impact.
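The three embedding settings compared in this analysis ($V_{uri}$, $V_{word}$ and $V_{uri,word}$) differ only in which vectors form the entity representation. A minimal sketch, assuming that $V_{uri,word}$ is the plain concatenation of the two vectors as the text describes; the function name and the `mode` labels are hypothetical.

```python
from typing import List, Sequence


def entity_embedding(uri_vec: Sequence[float], word_vec: Sequence[float],
                     mode: str = "uri,word") -> List[float]:
    """Build the entity representation for one of the three settings."""
    if mode == "uri":
        return list(uri_vec)
    if mode == "word":
        return list(word_vec)
    if mode == "uri,word":
        # Concatenation doubles the dimension when both vectors have equal size.
        return list(uri_vec) + list(word_vec)
    raise ValueError(f"unknown embedding mode: {mode!r}")


# Toy vectors, for illustration only.
combined = entity_embedding([0.1, 0.2], [0.3, 0.4])
```

Because concatenation keeps both vectors intact, a downstream classifier can in principle ignore the URI part, yet the results above suggest that on FoodOn and GO the extra dimensions hurt rather than help.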

Graph Structure
Figure 3 shows the performance of the URI embedding of OWL2Vec* when it is trained using structure documents extracted under different graph structure exploration settings. We first compare the two solutions that generate the RDF graph G: (i) the OWL 2 to RDF Graph Mapping defined by the W3C, which leads to redundant blank nodes and longer paths between relevant entities but keeps all the semantics, and (ii) the ontology projection rules, which lead to a more compact graph but only approximate most axioms with logical constructors, with considerable loss of semantics (cf. Section 3.1). On HeLis, the former has a higher MRR in 6 out of 8 cases, and its top MRR value (i.e., 0.353) is also higher than that of the projection rules (i.e., 0.335), while on FoodOn, the former has a higher MRR in only 3 out of 8 cases, but its top MRR value is still slightly higher than that of the latter. Therefore, the OWL 2 to RDF Graph Mapping is adopted for OWL2Vec*, in contrast to our preliminary work OWL2Vec.
Figure 3 also allows us to compare different settings used in extracting URI sentences from the RDF graph G. Two observations can be made. First, the walking depth is important for both the WL subtree kernel and the random walk. In general, to achieve the best performance, the former needs a smaller walking depth. Considering the OWL 2 to RDF Graph Mapping, the optimal walking depth is three on HeLis and two on FoodOn for the WL subtree kernel, but four for the random walk. Second, the top MRR of the WL subtree kernel is higher than that of the random walk on both HeLis and FoodOn. This is expected because the WL subtree kernel incorporates the structure information of the subtree of the final entity of a random walk.
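To make the role of the walking depth concrete, the following sketch extracts URI sentences with plain random walks over a toy graph. It is a simplified illustration and not the walker used by OWL2Vec*: the adjacency-list representation, the `ex:` names and the function signature are assumptions, and the WL subtree kernel variant is not covered.

```python
import random
from typing import Dict, List, Tuple

# Assumed representation: subject -> list of (predicate, object) edges.
Graph = Dict[str, List[Tuple[str, str]]]


def random_walks(graph: Graph, start: str, depth: int,
                 n_walks: int, seed: int = 0) -> List[List[str]]:
    """Return `n_walks` sentences, each following at most `depth` edges."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        node, walk = start, [start]
        for _ in range(depth):
            edges = graph.get(node, [])
            if not edges:  # dead end: stop this walk early
                break
            predicate, obj = rng.choice(edges)
            walk.extend([predicate, obj])
            node = obj
        walks.append(walk)
    return walks


# Toy graph with hypothetical URIs, for illustration only.
toy_graph: Graph = {
    "ex:England": [("ex:isPartOf", "ex:UK")],
    "ex:UK": [("ex:hasCapital", "ex:London")],
}
sentences = random_walks(toy_graph, "ex:England", depth=2, n_walks=1)
```

A larger depth yields longer sentences that relate more distant entities, at the cost of more walks being needed to cover the neighbourhood; this trade-off is consistent with the observation that the optimal depth differs between walking strategies.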

Logical Constructors
On the one hand, the performance of the baselines in Table 3 which adopt the logical structure alone, including EL Embedding, Quantum Embedding and Onto2Vec, is relatively poor in comparison with the other methods. On the other hand, the logical structure has a positive impact when it works together with the graph structure. Note that the difference between OWL2Vec* with the setting of the structure document and the URI embedding (i.e., $D_s$ + $V_{uri}$) and RDF2Vec is that the former additionally uses axiom sentences. From the results in Table 3 and Table 4, we can see that the former achieves 2.3% (resp. 16.7%) higher MRR than the latter on class membership prediction (resp. class subsumption prediction).
We also analyzed the impact of using reasoning (provided by the OWL 2 reasoner HermiT) before the ontology is transformed into an RDF graph, as shown in Table 5. Reasoning has a limited impact in the conducted experiments; the MRR results with and without reasoning are quite close for all four methods tested.

Pre-training
It is worth noting that using a pre-trained Word2Vec as an initial language model in OWL2Vec* does not help, but dramatically decreases the performance, although the pre-trained Word2Vec itself can achieve good performance. For example, the MRR of OWL2Vec* ($D_{s,l,rc}$, $V_{word}$) drops from 0.945 to 0.491 on class membership prediction (HeLis) when pre-training is used. In fact, the pre-trained Word2Vec lacks prior correlations involving entity URIs, and its usage also leads to a less compact embedding size. This also indicates that the word correlations in the generated documents, which are underpinned by the graph structure and the logical structure, are tailored to the specific characteristics of the given ontology.

Interpretation and Visualization
To show that the learned embeddings (i.e., input features of the classifier for membership and subsumption prediction) are discriminative and effective, we analyze the Euclidean distance between the embeddings of the two entities in a membership or subsumption axiom. We calculate the average distance for the positive training axioms and the negative training axioms, for the embeddings learned by OPA2Vec, the pre-trained Word2Vec, and OWL2Vec* with two settings, as shown in Fig. 4. Note that a difference in Euclidean distance between the entities in the positive axioms and the entities in the negative axioms is a sufficient, but not a necessary, indication that the features are discriminative. We find that Word2Vec and OWL2Vec* with $D_{s,l}$ + $V_{word}$ (i.e., using the structure document, the lexical document and the word embedding) have quite discriminative average distances for all three ontologies. Namely, the positive axioms lead to a much shorter average distance than the negative axioms. This is consistent with their good final performance shown above. In particular, for OPA2Vec and OWL2Vec* with $D_s$ + $V_{uri}$ (i.e., using the structure document and the URI embedding) on HeLis, the distance is also discriminative. In contrast, however, the positive axioms have a longer average distance than the negative axioms. This is because the instance usually lies at one end of a sequence in which it co-occurs with its class (i.e., a walk of the WL subtree kernel of depth 3 for OWL2Vec*, or a membership axiom for OPA2Vec), and thus its co-occurrence distance to its class becomes larger than to a random class.
We also visualize the embeddings of some example classes/instances via t-SNE [23] in order to obtain further insights into the quality of the computed embeddings. In Figure 5a (for HeLis) we can find two characteristics of the embeddings learned by OWL2Vec* with $D_{s,l}$ and $V_{word}$: (i) the instances of each class are clustered into a compact cluster, and (ii) these instances are very close to their corresponding class. Both characteristics are promising: they confirm that the embeddings are discriminative and explain why the embeddings enable very good performance in membership prediction (e.g., Hits@5 is as high as 0.978). The embeddings learned by OPA2Vec and OWL2Vec* with $D_s$ and $V_{uri}$ have the first characteristic as well, but the distance of an instance to its class is often longer than its distance to some other class, which is consistent with the average Euclidean distance analyzed above. Such embeddings can still benefit membership prediction under the standard supervised learning setting adopted in our evaluation, where some instances of one class are used for training while the other instances of this class, which are close to the training instances in the embedding space, are used for testing. However, generalization will be dramatically impacted, especially under a zero-shot learning setting where the instances of a new class, which have never appeared in the training samples, are used for testing.
In Figure 5b (for FoodOn) we can observe similar characteristics for the embeddings learned by OWL2Vec* with $D_{s,l}$ and $V_{word}$. Namely, for each class, its subclasses are mostly quite close to each other (i.e., clustered into one cluster), and their distances to this class are mostly shorter than their distances to any other class. However, the two characteristics are not as pronounced as in HeLis, especially for the class "Barley Malt Beverage" and its subclasses, indicating that embedding FoodOn, which has more axioms and entities (see Table 2), is more challenging. On the other hand, the two characteristics of OWL2Vec* with $D_{s,l}$ and $V_{word}$ are more pronounced than those of the other three methods (Word2Vec, OPA2Vec and OWL2Vec* with $D_s$ and $V_{uri}$), which is consistent with its better performance on subsumption prediction. For example, in comparison with Word2Vec, which has the second best performance, OWL2Vec* with $D_{s,l}$ and $V_{word}$ shortens the distance between "Fish" and its subclasses, and makes the subclasses of "Yogurt Food Product" closer to each other.
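The average-distance comparison behind Fig. 4 can be reproduced in a few lines. The sketch below assumes that each axiom is given as a pair of entity names and that `embeddings` maps a name to its vector; the names, the toy vectors and the direction of the ratio (positive over negative) are our assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_pair_distance(pairs: List[Tuple[str, str]],
                          embeddings: Dict[str, Sequence[float]]) -> float:
    """Mean distance between the two entities of each axiom."""
    return sum(euclidean(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)


# Toy embeddings and axioms, for illustration only.
embeddings = {"ClassA": [0.0, 0.0], "inst1": [3.0, 4.0], "ClassB": [6.0, 8.0]}
positive = [("inst1", "ClassA")]   # inst1 is an instance of ClassA
negative = [("ClassB", "ClassA")]  # a randomly paired negative sample

pos_avg = average_pair_distance(positive, embeddings)
neg_avg = average_pair_distance(negative, embeddings)
ratio = pos_avg / neg_avg  # below 1 when positive pairs are closer
```

A ratio well below or well above 1 signals that distance alone separates positive from negative axioms; a ratio near 1 does not rule out discriminative features, since a classifier can exploit other regularities in the vectors.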

Discussion and Outlook
In this paper we have presented OWL2Vec*, a robust semantic embedding framework for OWL ontologies. OWL2Vec* extracts documents from the ontology that capture its graph structure, axioms of logical constructors, and lexical information, and then learns a neural language model for both entity embedding and word embedding. We applied OWL2Vec* to class membership prediction and class subsumption prediction with three real world ontologies, namely HeLis, FoodOn and GO, and we empirically analysed different semantics and techniques such as entailment reasoning and ontology to RDF graph transformation. The evaluation demonstrates that on these tasks OWL2Vec* can significantly outperform state-of-the-art methods.
Ontology Text Understanding. Our experiments suggest that lexical information plays a very important role in both class membership prediction and class subsumption prediction. In real world ontologies such as HeLis, FoodOn and GO, entity names often reflect, in natural language, their relationships to surrounding entities; in HeLis, for example, vc:FOOD-700637 (Soy Milk) is an instance of the class vc:SoyProducts. In addition, ontologies often contain a large number of entity annotations, ranging from short phrases to long textual descriptions. In FoodOn, for example, 169,630 out of 241,581 axioms are annotations. However, patterns within the textual information in the ontologies, which is underpinned by the graph and logical structure, are quite different from normal natural language text (cf. Section 4.3.4). To further improve ontology embedding in the future, we need to develop new language model architectures and training methods that are tailored to the kinds of textual information typically present in state-of-the-art ontologies.
Ontology Completion via Prediction. In this study OWL2Vec* has been applied to ontology completion by discovering plausible axioms. We adopted a typical supervised learning setting to model a common scenario in ontology completion, where satisfactory results have been achieved; in class membership prediction, the classes of 93.2% of the test instances can be recalled. In some real world cases, however, there is often a bias between the axioms for training and the axioms for prediction. For example, consider the case of membership prediction for a new class defined on the fly without any known instances (i.e., the zero-shot learning scenario discussed in Section 4.4). This leads to a shortage of training samples and makes the task much more challenging: the above metric drops to 65.6% for OWL2Vec* and to less than 10% for the other KG embedding and ontology embedding methods in Table 3. In future work we plan to develop more robust ontology embeddings with better generalization for dealing with such cases, and to consider other more challenging tasks such as ontology alignment and ontology error detection.

Fig. 3: Comparison of structure documents by different graph structure exploration settings, where the results of MRR of OWL2Vec* ($D_s$ + $V_{uri}$) on HeLis and FoodOn are reported.

Fig. 4: The average Euclidean distance between the class and its instance (resp. subclass) for the positive and negative memberships (resp. subsumptions) used in classifier training. The number above every pair of positive and negative bars is their ratio.

Table 3: Overall results of OWL2Vec* and the baselines.

Table 5: Performance with and without reasoning. MRR on membership prediction with HeLis is reported.