Approximate Knowledge Graph Query Answering: From Ranking to Binary Classification

Large, heterogeneous datasets are characterized by missing or even erroneous information. This is more evident when they are the product of community effort or automatic fact extraction methods from external sources, such as text. A special case of the aforementioned phenomenon can be seen in knowledge graphs, where this mostly appears in the form of missing or incorrect edges and nodes. Structured querying on such incomplete graphs will result in incomplete sets of answers, even if the correct entities exist in the graph, since one or more edges needed to match the pattern are missing. To overcome this problem, several algorithms for approximate structured query answering have been proposed. Inspired by modern Information Retrieval metrics, these algorithms produce a ranking of all entities in the graph, and their performance is further evaluated based on how high in this ranking the correct answers appear. In this work we take a critical look at this way of evaluation. We argue that performing a ranking-based evaluation is not sufficient to assess methods for complex query answering. To solve this, we introduce Message Passing Query Boxes (MPQB), which takes binary classification metrics back into use and shows the effect this has on the recently proposed query embedding method MPQE.


Introduction
In many organizations, a vast amount of complex information is used in operations daily. This data is often stored in various databases or file systems, while information can be retrieved using query languages and information retrieval techniques. During the past decade, several companies have started adopting knowledge graphs (KG) [10] as a way to represent heterogeneous data and make it useful for a large variety of applications [14]. To make this data accessible, various query languages like SPARQL and Cypher have been developed. Such query languages allow for accessing nodes in the graph, traversing them via specific relations, or retrieving nodes that match a specific pattern. At the core of these languages lie graph patterns. These patterns can be thought of as graph-shaped structures where some nodes and edges correspond to nodes and edges existing in the graph, while others correspond to variables (with specific variable names). When a match for this pattern is found in the graph, the variables are bound and the appropriate values are returned as the result.
However, the performance of the previously described process is heavily dependent on the level of completeness in the graph.
In detail, completeness refers to whether the graph contains all the nodes and edges in the graph pattern, so that a binding exists for all variables. A single node or edge missing from the graph, which represents a comparatively small bit of information, results in missing answers. This could be desirable when the missing piece of information was erroneous, but it is harmful when correct information is missing from the graph.
In this paper, we focus on this issue, specifically the case of missing edges in the graph. Ideally, we would like a query system that can still give answers when the phenomenon described before applies. We would like to have approximate query answering.
One way to approach this is by performing link prediction. In link prediction, one tries to predict missing links in the graph by training a machine learning model on the known parts of it. While not trivial, it is possible to use a single-link prediction mechanism to answer queries with missing links. Another way to approach this problem is by using so-called query encoders. These encoders take a query as input and produce an embedding (a high-dimensional vector representation) for it. This query embedding is later compared to learned embeddings for the entities in the graph. This machine learning system is optimised in such a way that entities close to the query embedding in vector space are also its probable answers.
In this paper we focus on the analysis and evaluation of these systems. Typically, such systems return a list of candidate answers to the query, each accompanied by a likelihood or a distance from the query embedding in vector space. In the evaluation phase, this ranking is compared not to a ground truth ranking, but to the set of correct answers to the query. To do this, typical measures like hits@n (how many correct answers appear among the top n results) and mean reciprocal rank (MRR, the average of the reciprocal ranks of the correct answers) are used. While these measures are appropriate for information retrieval systems, they fall short when it comes to query systems. In the latter, the results are not ranked; an entity either is a correct answer or it is not. This is also reflected in how these measures are usually adapted into filtered versions. In this case, measures like hits@n and MRR are computed such that other true answers that appear higher in the returned ranking are ignored when computing the rank of a lower-ranked correct entity.
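As a rough illustration of the filtered measures described above, the following Python sketch computes filtered ranks, MRR and hits@n for a single query. The function names and the toy ranking are hypothetical, and averaging over the answers of one query is only one possible convention; the systems discussed in this paper may aggregate differently.

```python
def filtered_ranks(ranking, answers):
    """Filtered rank of each correct answer: only non-answers ranked
    above it are counted, other correct answers are ignored."""
    ranks = {}
    non_answers_seen = 0
    for entity in ranking:
        if entity in answers:
            ranks[entity] = non_answers_seen + 1
        else:
            non_answers_seen += 1
    return ranks


def filtered_mrr(ranking, answers):
    """Mean of the reciprocal filtered ranks of the correct answers."""
    ranks = filtered_ranks(ranking, answers)
    return sum(1.0 / r for r in ranks.values()) / len(ranks)


def filtered_hits_at_n(ranking, answers, n):
    """Fraction of correct answers whose filtered rank is at most n."""
    ranks = filtered_ranks(ranking, answers)
    return sum(r <= n for r in ranks.values()) / len(ranks)


# Toy example: a ranking over five entities, three of which are answers.
ranking = ["a", "x", "b", "y", "c"]
answers = {"a", "b", "c"}
ranks = filtered_ranks(ranking, answers)
mrr = filtered_mrr(ranking, answers)
hits_at_2 = filtered_hits_at_n(ranking, answers, 2)
```

Note that none of these quantities says which entities the system would actually return as the answer set.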
We argue that we need to look into metrics that are not based on specific ranking of the results, but rather on a crisp set of results retrieved from these systems. A main argument for why this is necessary is that many downstream tasks using the aforementioned results need to get a finite set of answers from the knowledge graph, not just a ranked list of all possible entities. That is, we need a query engine that does not just act as a ranking system, but as a binary classifier: it must provide a set of entities that are answers to the query while all other entities are not. In this scenario, the evaluation would be the same as what has traditionally been used for classification problems, with measures such as precision and recall.
This paper is structured as follows: in section 2, we give an overview of several algorithms used for approximate query answering. Then, in section 3 we discuss how metrics for binary classification can provide additional insight on top of the metrics used for ranking. We end that section with a general direction on how this could be achieved in the existing systems using volumetric query embeddings. Section 4 details a first approach for solving this problem using axis-aligned hyper-rectangles for these queries. We describe the MPQB model, a proof of concept, in the section after that. Finally, we provide a conclusion and future outlook.
This work is largely based on the Bachelor thesis works of Ruud van Bakel [3] and Teodor Aleksiev [1], who both worked under the supervision of Michael Cochez at the Vrije Universiteit Amsterdam.

Approximate Query Answering on Knowledge Graphs
We define a knowledge graph as a tuple G = (V, R, E), where V is a set of entities, R a set of relation types, and E a set of binary predicates of the form r(h, t) where r ∈ R and h, t ∈ V. Each binary predicate represents an edge of type r between the entities h and t, and thus we call E the set of edges in the knowledge graph.
A query on a KG looks for the set of entities that meet a particular condition, specified in terms of binary predicates whose arguments can be constants (i.e. entities in V) or variables. As an example, consider the following query (adapted from [4]): "Select all projects P, such that topic T is related to P, and both Alice and Bob work on T". In this query, the constant entities are Alice and Bob, and the variables are denoted as P and T. We can define such a query formally in terms of a conjunction of binary predicates, as follows:

$$q = P.\exists T, P : \mathit{related}(T, P) \land \mathit{works\_on}(\mathit{Alice}, T) \land \mathit{works\_on}(\mathit{Bob}, T). \tag{1}$$

More formally, we are interested in answering conjunctive queries that have the following general form:

$$q = V_t.\exists V_1, \ldots, V_m : r_1(a_1, b_1) \land \ldots \land r_m(a_m, b_m).$$

In this notation, $r_i \in R$, and $a_i$ and $b_i$ are constant entities in the KG, or variables from the set $\{V_t, V_1, \ldots, V_m\}$.
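To make the effect of a missing edge on exact query answering concrete, the following Python sketch evaluates the example query (1) by exact matching on a toy graph. The graph contents (topics ML and KG, projects ProjectX and ProjectY) and the function name are invented for illustration only.

```python
# Toy knowledge graph as a set of (relation, head, tail) edges.
# All entities and edges below are hypothetical.
edges = {
    ("works_on", "Alice", "ML"),
    ("works_on", "Bob", "ML"),
    ("related", "ML", "ProjectX"),
    ("works_on", "Alice", "KG"),
    ("related", "KG", "ProjectY"),
    # The edge ("works_on", "Bob", "KG") is assumed to be missing.
}


def answer_query(edges):
    """Exact matching for
    q = P.∃T,P : related(T,P) ∧ works_on(Alice,T) ∧ works_on(Bob,T)."""
    topics_alice = {t for (r, h, t) in edges if r == "works_on" and h == "Alice"}
    topics_bob = {t for (r, h, t) in edges if r == "works_on" and h == "Bob"}
    shared_topics = topics_alice & topics_bob
    return {p for (r, t, p) in edges if r == "related" and t in shared_topics}


# With the edge missing, ProjectY cannot be matched by the pattern.
incomplete_answers = answer_query(edges)
# Adding the single missing edge recovers the second answer.
complete_answers = answer_query(edges | {("works_on", "Bob", "KG")})
```

A single absent works_on edge is enough to remove ProjectY from the result, which is the situation approximate query answering tries to address.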
Recent works have proposed to use machine learning methods to answer such queries. These methods operate by learning a vector representation in a space $\mathbb{R}^d$ for each entity and relation type. These representations are also known as embeddings, and we denote them as $e_v$ for $v \in V$ and $e_r$ for $r \in R$. Similarly, these methods define a query embedding function $\phi$ (usually defined with some free parameters) that maps a query $q$ to an embedding $\phi(q) = \mathbf{q} \in \mathbb{R}^d$.
Given a query embedding $\mathbf{q}$, a score for every entity in the graph can be obtained via cosine similarity:

$$\mathrm{score}(\mathbf{q}, e_v) = \frac{\mathbf{q}^\top e_v}{\lVert \mathbf{q} \rVert \, \lVert e_v \rVert}.$$

The entity and relation type embeddings, as well as any free parameters in the embedding function $\phi$, are optimized via stochastic gradient descent on a specific loss function. Usually the loss is defined so that, for a given query embedding, the cosine similarity is maximized with embeddings of entities that answer the query, and minimized for embeddings of entities sampled at random. The dataset used for training consists of query-answer pairs mined from the graph. Once the procedure terminates, the function $\phi$ can be used to embed a query. The entities in the graph can then be ranked as potential answers by computing the cosine similarity between each entity embedding and the embedding of the query.
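The ranking step can be sketched in a few lines of plain Python. The two-dimensional embeddings below are made up for illustration; in the methods discussed here they would be learned, and the query embedding would come from the function φ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_entities(query_embedding, entity_embeddings):
    """Return all entity names, sorted by decreasing similarity to the query."""
    scores = {
        name: cosine(query_embedding, embedding)
        for name, embedding in entity_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings; real ones are learned from query-answer pairs.
entity_embeddings = {
    "ProjectX": [1.0, 0.1],
    "ProjectY": [0.8, 0.6],
    "Alice": [-1.0, 0.2],
}
query_embedding = [1.0, 0.0]
ranking = rank_entities(query_embedding, entity_embeddings)
```

The output is always a ranking of every entity in the graph; nothing in this procedure decides where the answers stop.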
Note that in contrast with classical approaches to query answering, such as the use of SPARQL in a graph database, this approach can return answers even if no entity in the graph matches exactly every condition in the query.
In the next sections we review the specifics of recently proposed methods, which consider particular geometries for embedding entities, relation types, and queries; as well as scoring functions.

GQE
Conjunctive queries can be represented as a directed acyclic graph, where the leaf nodes are constant entities, any intermediate nodes are variables, and the root node is the target variable of the query. In this graph, the edges have labels that correspond to the relation type involved in a predicate.
We illustrate this in Fig. 1 for the example query introduced previously. In Graph Query Embedding (GQE) [9], the authors note that this graph can be employed to define a computation graph that starts with the embeddings of the entities at the leaves, and follows the structure of the query graph until the target node is reached.
GQE was one of the first models that defined a query embedding function to answer queries over KGs. The function relies on two different mechanisms, one that handles paths and one that handles intersections. Training the model requires generating a large dataset of queries with diverse shapes that incorporate paths and intersections.

MPQE
Graph Convolutional Networks (GCNs) [5,11,8] are an extension of neural networks to graph-structured data, that allow defining flexible operators for a variety of machine learning tasks on graphs. Relational Graph Convolutional Networks (R-GCNs) [17] are a special case that introduces a mechanism to deal with different relation types as they occur in KGs, and have been shown to be effective for tasks like link prediction and entity classification.
In MPQE [4], the authors note that a more general query embedding function can be defined in comparison with GQE, if an R-GCN is employed to map the query graph to an embedding. The generality stems from the fact that the R-GCN uses a general message-passing mechanism to embed the query, instead of relying on specific operators for paths and intersections.

Query2Box
Both GQE and MPQE embed a query as a single vector (i.e., a point in space). Query2Box [15] deviates from this idea and uses a box shape to represent a query. The method further narrows the allowed embedding shape to axis-aligned hyper-rectangles. We will discuss more in section 4 why that is beneficial. This method has several benefits, especially for conjunctive queries; for these queries, the answer set can be seen as the intersection of the answers to the conjuncts. Such an operation can be imagined with an embedded volume, but not with a vector embedding.
While this method would have made it possible to create a binary classifier, the model is not specifically trained, nor evaluated for multiple answers.

Complex Query Decomposition
Complex Query Decomposition (CQD) [2] is a recently proposed method for query answering based on using simple methods for 1-hop link prediction to answer more complex queries. In CQD, the link predictors used are DistMult [21] and ComplEx [20]. Such link predictors are more data efficient than the previous methods, since they only need to be trained with the set of observed triples. In contrast, to be effective, the previous methods require mining millions of queries covering a wide range of structures.
In CQD, a complex query is decomposed in terms of its binary predicates. The link predictor is used to compute scores for each of them, and the scores are then aggregated with t-norms, which have been employed in the literature as continuous relaxations of the conjunction and disjunction operators [18,13,12].
CQD provides an answer to the query by providing a ranking of entities based on the maximization of the aggregated scores. Therefore, the evaluation procedure for CQD is the same as the previous methods.
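The aggregation idea can be illustrated with two standard t-norms, the product t-norm and the minimum (Gödel) t-norm. The sketch below is a strong simplification: it assumes that the link predictor scores have already been mapped to [0, 1] and that the best assignment of the intermediate variable has already been chosen for each candidate, whereas CQD itself optimizes over these assignments. All scores are invented.

```python
def product_t_norm(scores):
    """Product t-norm: continuous relaxation of a conjunction."""
    result = 1.0
    for s in scores:
        result *= s
    return result


def godel_t_norm(scores):
    """Minimum (Gödel) t-norm: the conjunction is as true as its weakest atom."""
    return min(scores)


# Hypothetical per-atom scores in [0, 1] for two candidate answers of the
# query related(T, P) ∧ works_on(Alice, T) ∧ works_on(Bob, T).
atom_scores = {
    "ProjectX": [0.9, 0.8, 0.9],
    "ProjectY": [0.9, 0.9, 0.3],
}

# Rank the candidates by their aggregated score.
ranking = sorted(
    atom_scores,
    key=lambda entity: product_t_norm(atom_scores[entity]),
    reverse=True,
)
```

As with the embedding-based methods, the result is a ranking of candidates rather than a set of answers.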

From Ranking Metrics to Actual answers
As discussed above, there are merits to returning a hard answer set as opposed to returning a ranking. One way to obtain such binary classifications is to define a threshold within a ranking. As we will further describe in section 4, one can create such a threshold by using shapes (e.g. axis-aligned hyper-rectangles) for query embeddings.

Closed-world assumption
Binary classification does introduce new challenges. One such challenge can be seen in the definition of a loss function that can act differently for entities within the set and entities not in the set. Since the knowledge graph may contain missing edges, the retrieved target set may be a subset of the ground truth. This in turn could result in entities being incorrectly used within the loss function (i.e. an incorrect closed-world assumption).
However, this is not necessarily problematic. We define $T$ to be the ground truth target set of a query and $\hat{T}$ to be the retrieved target set (i.e. the set obtained when directly querying the KG). Assuming the number of entities missing from $\hat{T}$ is considerably smaller than the size of $V - \hat{T}$, most entities that do not belong to $\hat{T}$ are also not answers to the query (i.e. not in $T$). This means that if we sample a relatively small subset of the complement of the retrieved target set ($V - \hat{T}$), it will likely not contain entities that are in $T$. In the case where we need to be certain that our sample from $V - \hat{T}$ does not contain entities in $T$, we could restrict our sampling process to entities which could never appear in $T$. This is possible, for example, by sampling entities which are incompatible with the domain and range of specific relations in a query (e.g. house entities will never appear in a has_sibling(a, b) relation). Potential downsides of such methods include a slowdown during learning or a limit on the model's overall performance, as having very different entities in the target set and in our sample from $V - \hat{T}$ could prevent our model from learning the finer differences between the two sets. On the other hand, if these two sets are very similar, the model is forced to uncover differences even when they are not very apparent. In fact, it is often good practice to use so-called "hard" negative samples, which are similar to entities in $T$. A better alternative for finding entities not in $T$ would be to use more advanced techniques, as proposed in [16].
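The two sampling strategies discussed above can be sketched as follows. The function name, the entity names and the `allowed` parameter are hypothetical; `allowed` stands in for any filter that keeps only entities that can never be answers, such as a domain and range check.

```python
import random


def sample_negatives(entities, retrieved_answers, k, rng, allowed=None):
    """Sample up to k entities from the complement of the retrieved target set.

    If `allowed` is given, only entities in that set are candidates, which
    models restricting the sample to entities that can never be answers.
    """
    candidates = [
        e for e in sorted(entities)
        if e not in retrieved_answers and (allowed is None or e in allowed)
    ]
    return rng.sample(candidates, min(k, len(candidates)))


rng = random.Random(0)
entities = {"p1", "p2", "p3", "p4", "p5", "house1", "house2"}
retrieved = {"p1", "p2"}  # retrieved target set of a has_sibling-style query

# Plain sampling: may, with small probability, pick a true but unobserved answer.
negatives = sample_negatives(entities, retrieved, 3, rng)

# Type-restricted sampling: only entities that cannot be answers at all.
typed_negatives = sample_negatives(
    entities, retrieved, 5, rng, allowed={"house1", "house2"}
)
```

The restricted variant is safe but yields negatives that are easy to separate from the answers, which is the trade-off described above.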

From ranking to classification
Another point where binary classification differs from ranking is the way performance is measured. For binary classification, a common performance measure is the F-score, which is the harmonic mean of precision and recall, while in a ranking setting we encounter measures such as the Mean Reciprocal Rank. While these metrics differ significantly, they are related, because a ranking can be turned into a binary classification using a threshold. In particular, we notice that ranking metrics typically reward having entities in $T$ higher in the ranking. As a result, having many high-ranking entities that are not in $T$ is penalised. Effectively, these measures then provide some notion of how well $T$ and $V - T$ can be separated. This means that when a ranking measure is low, the derived binary classification is likely to under-perform as well. Moreover, this could result in low precision, low recall, or both, depending on where the threshold is placed in the ranking.

Geometrically, there is also a correspondence between a ranking with a cutoff point and a system in which all entity embeddings within a given distance of the query embedding are included as answers. One could view a classifier with high precision and low recall as having a query embedding with relatively small volume, and a classifier with high recall and low precision as having a query embedding with relatively large volume. In this setting, a ranking measure indicates whether entities in $T$ are closer to our geometric query embedding than entities not in $T$. This closeness is defined via a distance metric (e.g. based on the L1 norm) and can be used in the loss function [15].
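The dependence of precision and recall on the position of the threshold can be seen in a small Python sketch. Cutting the ranking after the top k entries is just one assumed way of placing a threshold; the ranking and the target set are toy data.

```python
def classify_top_k(ranking, k):
    """Turn a ranking into a predicted answer set by keeping the top k entities."""
    return set(ranking[:k])


def precision_recall_f1(predicted, truth):
    """Precision, recall and F1 of a predicted answer set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


ranking = ["a", "x", "b", "y", "c"]
truth = {"a", "b", "c"}

# A tight threshold (small volume): high precision, low recall.
tight = precision_recall_f1(classify_top_k(ranking, 1), truth)
# A loose threshold (large volume): high recall, lower precision.
loose = precision_recall_f1(classify_top_k(ranking, 5), truth)
```

The same ranking therefore yields very different classifiers, which a ranking metric alone does not reveal.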

Using Axis-aligned Boxes for Query Embedding
As discussed in section 2, an entity is a valid answer to a specific structured query if it satisfies the query. The ultimate aim is to find the set of all entities in the knowledge graph that are valid answers to the given query, even when an edge required by one of the binary predicates is missing from the KG. As discussed, we could either attempt to use a cut-off point in the ranking to obtain a binary classifier, or we could train the embedding model such that it indicates a volume in the embedding space that contains the answers. In this section we present a first possible design of such a system to show its feasibility.

We alter the earlier query2box [15] method in two ways. First, we interpret the boundaries of the hyper-rectangle used for the embedding as a bounding box. All entities within the box are predicted to be answers to the query, while entities outside are predicted not to be answers. Second, we do not use the embedding procedure proposed in query2box, but rather perform the embedding using the technique devised in MPQE.

Now, we could choose to embed entities using points, as is done in other query embedding methods. Then, entities that get embedded inside the box would be seen as answers to the query, while points outside of it would be seen as non-answers. This is illustrated in figure 2. But, as we will discuss in more detail in the following subsection, we can also use hyper-rectangles for the entities. The choice we make in the experiments in this paper is to consider an entity, embedded as a box, to be a valid answer to the query if the entity box and the query box intersect. This is illustrated in figure 3 for the two-dimensional case. An alternative choice could be to consider an entity an answer only if the entity box lies completely inside the query box.
To formalize this, we operate on the embedding space $\mathbb{R}^d$. We want to describe an axis-aligned hyper-rectangle in this space. We do this by keeping two vectors, one indicating the center of the box and one indicating the offset of its sides. So, in the described model every entity $v \in V$ has an embedding $e_v \in \mathbb{R}^{2d}$. Likewise, every query is mapped to an embedding $\mathbf{q} \in \mathbb{R}^{2d}$.
The box in $\mathbb{R}^d$ corresponding to a $2d$-dimensional vector $p = (\mathrm{Cen}(p), \mathrm{Off}(p)) \in \mathbb{R}^{2d}$ is defined as

$$\mathrm{Box}_p = \{ x \in \mathbb{R}^d : \mathrm{Cen}(p) - \mathrm{Off}(p) \preceq x \preceq \mathrm{Cen}(p) + \mathrm{Off}(p) \},$$

where $\preceq$ denotes element-wise inequality.
Note that a completely analogous definition could be made by keeping two opposite corners of the box rather than a center and an offset.
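The membership criteria discussed in this section can be written down directly from the center and offset representation. The following Python sketch assumes non-negative offsets and represents a box as a pair of lists `(center, offset)`; the function names and example boxes are illustrative only.

```python
def corners(box):
    """Convert a (center, offset) box into its lower and upper corner."""
    center, offset = box
    lower = [c - o for c, o in zip(center, offset)]
    upper = [c + o for c, o in zip(center, offset)]
    return lower, upper


def contains_point(box, point):
    """Point entity: an answer if it lies inside the query box."""
    lower, upper = corners(box)
    return all(l <= x <= u for l, x, u in zip(lower, point, upper))


def boxes_intersect(a, b):
    """Box entity: an answer if its box overlaps the query box in every dimension."""
    (a_low, a_up), (b_low, b_up) = corners(a), corners(b)
    return all(
        al <= bu and bl <= au
        for al, au, bl, bu in zip(a_low, a_up, b_low, b_up)
    )


def box_inside(inner, outer):
    """Stricter alternative: the entity box lies completely inside the query box."""
    (i_low, i_up), (o_low, o_up) = corners(inner), corners(outer)
    return all(
        ol <= il and iu <= ou
        for il, iu, ol, ou in zip(i_low, i_up, o_low, o_up)
    )


query_box = ([0.0, 0.0], [2.0, 1.0])      # spans [-2, 2] x [-1, 1]
entity_a = ([1.5, 0.5], [1.0, 1.0])       # spans [0.5, 2.5] x [-0.5, 1.5]
entity_b = ([5.0, 5.0], [1.0, 1.0])       # spans [4, 6] x [4, 6]
```

Here `entity_a` overlaps the query box without being contained in it, so whether it counts as an answer depends on which of the two criteria is chosen.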

Boxes for Entities
It was already mentioned in the previous section that we represent our entity embeddings with boxes as well. This idea stems from the fact that entities can play different roles in different contexts. For example, we could have a person who works at a university, but is also a member of a political party. Having a single point to represent that person forces the embedding of a query asking for members of that political party and the embedding of a query asking for people working at that university to overlap. If we instead use a box for the entity, the query embeddings do not have that additional constraint. The issue is also illustrated in figures 4 and 5. The nodes representing Alice and Bob are close to each other in one context, but far away from each other in the other. The embedding of the entities in fig. 5 shows that with boxes it is possible to have the entities close to each other and far away from each other at the same time. With entities as boxes, an entity can be an answer to two disjoint queries, as illustrated in fig. 3.

Proof of Concept
In this section, we perform an evaluation of the system discussed above. Note that our goal is not to provide state-of-the-art results. Firstly, this is because what we propose is just a proof of concept for an approximate embedding system which can find a set of answers for a query. But the main reason we cannot really compare with other systems is that they are evaluated with ranking metrics, as discussed in section 3.

Figure 6 shows seven distinct query graph structures. We only consider these structures when training and testing our model for the query answering task. These structures were originally proposed in GQE [9]. Each of these structures starts with actual entities from a graph (i.e. anchor entities) and ends with a set of target entities. Some of these structures are chains without any intersections (e.g. B.∃A, B : knows(Alice, A) ∧ is_related_to(A, B)), whilst others only have intersections (e.g. B.∃B : knows(Alice, B) ∧ is_related_to(Bob, B)), or are combinations of both.

Our goal is to train a model that finds the answer set of a given query, using a query embedding. This is in contrast to other related work [15,9,4], as we want to be able to find multiple answers. As mentioned before, we could create such a set by embedding the query as a box, thus getting a hard boundary for separating entities in the target set from those not in it.

Experimental setup
Datasets. While previous work [4,9] incorporated multiple datasets, our implementation has so far only been tested on the AIFB dataset. This dataset is a knowledge graph of an academic institution in which persons, organizations, projects, publications, and topics are the entities. Table 1 gives some statistics of this dataset and of two more datasets often used for the evaluation of approximate query answering.

Query Generation. To train our model we have to sample query graphs from our dataset. This is done by first sampling anchor nodes and relations, which are then used to form graphs based on specific query patterns (fig. 6). After acquiring the anchor nodes and the relations connecting them, we can obtain the target set. Although this may appear straightforward, there are some caveats. The biggest one is that some queries have very large sets of potential target entities (over 100,000 answers). Because we sample edges first, these particular graphs actually appear often.
Luckily, for most query structures this was not the case, but specifically the 2-chain and 3-chain query structures occasionally suffer from it. This is likely explained by the fact that knowledge graphs contain "hub nodes", nodes with a very high degree, to which a plethora of other nodes connect via a certain relation. Table 2 shows the average size of the target sets of sampled queries for the aforementioned datasets. One interesting thing to note is that for the AM dataset the 3-chain-inter structure actually had the largest average target set. This could indicate that this problem is indeed very graph-dependent. Since this is a problem with the AIFB dataset, we limit the query target sets to a maximum of 100 answers. We also sample entities not in the target set to be used as negative samples during training. For the query structures that contain an intersection, we incorporate hard negative samples by finding entities that would have been in the target set if the conjunctions at the intersection were relaxed to disjunctions.

Evaluation. In order to test whether the model is actually able to find answers to queries that involve edges which are not in the graph, careful preparation of our data splits was necessary. We started from our original graph and marked 10% of the edges to be removed (they are still present at this stage). Then, we sample the graph for the query patterns. If a sample makes use of any edge marked as removed, it is added to either the validation set or the test set (10/90 split). If the sample contains no such marked edge, we put it in the training set. This way, we end up with validation and test queries that make use of at least one edge that is not in the graph seen during training. After sampling, we end up with around 2 million targets and the corresponding query graphs to be used in the training set. For the validation set we used about 30,000 targets worth of queries, and for the test set we had approximately 300,000 targets worth of query graphs. The validation set is also used to perform early stopping in case specific conditions were not met.
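The splitting procedure can be summarized in a short Python sketch. It assumes that each sampled query is available as the set of graph edges it uses; the function name, the toy graph and the way validation and test queries are separated (an independent random draw per query) are assumptions made for illustration.

```python
import random


def split_queries(edges, queries, removed_fraction, validation_fraction, rng):
    """Mark a fraction of edges as removed and split queries accordingly.

    Queries using at least one marked edge go to validation or test;
    all other queries go to the training set.
    """
    edge_list = sorted(edges)
    num_removed = round(removed_fraction * len(edge_list))
    removed = set(rng.sample(edge_list, num_removed))

    train, validation, test = [], [], []
    for query_edges in queries:
        if removed.isdisjoint(query_edges):
            train.append(query_edges)
        elif rng.random() < validation_fraction:
            validation.append(query_edges)
        else:
            test.append(query_edges)
    return removed, train, validation, test


# Toy chain graph with 10 edges; each query uses two consecutive edges.
edge_list = [("r", i, i + 1) for i in range(10)]
queries = [frozenset(edge_list[i:i + 2]) for i in range(9)]

removed, train, validation, test = split_queries(
    set(edge_list), queries,
    removed_fraction=0.1, validation_fraction=0.1,
    rng=random.Random(0),
)
```

By construction, no training query depends on a marked edge, while every validation and test query depends on at least one.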
Since our method uses boxes, which allow for binary classification, we report our model's performance in the form of a confusion matrix (see figure 7). Given that our entities are also boxes, we have more freedom to choose when an entity is considered an answer. This is because entities now occupy more space than a single point, which allows for partial overlap with query boxes. To allow flexibility, we have decided that an entity is considered an answer to a query if its box representation overlaps with the box representation of the query. Naturally, other, stricter conditions could be applied, such as requiring full overlap or defining a fraction-based threshold (e.g. requiring at least 50% overlap). We expect the appropriate condition to depend on the downstream task.

Model. Our model has the same basic functionality as the MPQE [4] model. MPQE is used as an embedding component, but the input and output are interpreted as boxes. MPQE first performs several steps of message passing using an R-GCN architecture, after which the node states are aggregated to form the query embedding. With this query embedding a loss function is evaluated, which is used as a signal (using SGD) to update the embeddings and weights in the network. For the aggregation operation at the end of our model we have several options (SUM, MAX, TM, MLP). We test our model with some of these aggregation functions.
Since we train an embedding matrix (as opposed to having a latent embedding to start with), we need to initialize it. We do this by sampling the 32-dimensional center vectors from a uniform distribution between 0 and 10, whilst sampling the 32-dimensional offset vectors from a Gaussian with mean 3 and unit variance.
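A minimal, framework-free sketch of this initialization is given below. It uses Python's standard `random` module rather than PyTorch, the function name is hypothetical, and it leaves the offsets exactly as sampled; how rare negative offsets are treated is not specified here and is left open in the sketch.

```python
import random


def init_box_embeddings(num_entities, dim=32, seed=0):
    """Sample initial box embeddings as (centers, offsets).

    Centers are uniform in [0, 10]; offsets are Gaussian with mean 3 and
    standard deviation 1. Offsets are not clipped (an open choice).
    """
    rng = random.Random(seed)
    centers = [
        [rng.uniform(0.0, 10.0) for _ in range(dim)]
        for _ in range(num_entities)
    ]
    offsets = [
        [rng.gauss(3.0, 1.0) for _ in range(dim)]
        for _ in range(num_entities)
    ]
    return centers, offsets


centers, offsets = init_box_embeddings(num_entities=5)
```

With these settings a typical initial box has a side length of about 6 in each dimension of a space that spans roughly 10 units, so initial boxes overlap heavily.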
For TM aggregation, the MPQE model uses 3 layers; the TM aggregation function requires a number of message passing steps equal to the query diameter, in our case 3. For the MLP aggregation function we applied a two-layer fully connected MLP. As for the non-linearities in our model, we used the ReLU function. To update the parameters of the model we used the Adam optimizer with a learning rate of 0.01.
Our code base is based on PyTorch. In particular, we made use of the library PyTorch Geometric [7], which is a PyTorch extension specialised for graph-based models. While there are potential baselines to consider [9,4], they are not suitable for our work, because we perform binary classification whereas these methods are ranking-based. To our knowledge, there has not been any related work that performed binary classification in the context of approximate graph querying. In the area of link prediction, we do find some work, like the early work on Neural Tensor Networks [19] and a more recent one which looks at triple classification [6]. The lack of baselines did not prove to be a major concern, as our main goal was not to achieve state-of-the-art results, but rather to explore whether this direction of research may prove worthwhile.

Results
After having trained the MPQB model for over 200,000 iterations, it appeared to still not have converged. After this number of iterations, the query boxes did not overlap with any target boxes (i.e. no entities in $T$ were returned). Apart from training the model for longer and for multiple epochs, there are some other settings that could still be experimented with. For example, how many samples are in each epoch (fewer samples allow for training for more epochs), whether we use $T$ fully during training or only a subset, and how many entities should be in our sample from $V - T$. The latter two settings also influence how many distinct queries we could train on within a given time span. It may be worth noting that previous works [9,4,15] train using single positive samples. While we want to focus on answering queries with multiple answers, we do not necessarily need to train on multiple answers. In theory, if a method can produce a good ranking, it should also be able to produce a good classification, provided that good thresholds for these rankings can be found.

Since we do not have direct results of the kind we would have liked, we instead analyse the trained models to see if there are relevant insights to be found. For this we looked at models using different aggregation functions, trained on the AIFB dataset. While we have no intersections between query boxes and target boxes, we can still check whether the target boxes (from $T$) appear relatively close to the query boxes, when compared to the box representations of entities in $V - T$. This effectively provides some measure of whether the produced rankings are good. Table 3 shows these results. While these scores may not indicate state-of-the-art results, they do seem to suggest that the model did at least produce decent non-trivial rankings using the SUM and TM aggregators. This could suggest that further research is indeed in order.

The fact that TM outperformed SUM is not surprising, considering that it is a more involved method that also takes query diameter into account. This result is also in line with the findings in [4]. A more surprising result is that the MLP method did not seem to perform well at all. This could be a result of a faulty implementation, or of an implementation that simply does not work for boxes as is. Overall, the results seem promising.

Table 3. Percentage (%) of answers embedded closer to the query box than a non-answer, per query structure, using different aggregation functions.
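A comparison of the kind reported in Table 3 could be computed along the following lines. This is only a sketch under two assumptions that are not taken from the experiments: the distance between two boxes is taken to be the L1 distance between their centers, and the percentage is computed over all (answer, non-answer) pairs of a query. Boxes are `(center, offset)` pairs and all values are toy data.

```python
def center_l1_distance(box_a, box_b):
    """Assumed box distance: L1 distance between the two box centers."""
    return sum(abs(a - b) for a, b in zip(box_a[0], box_b[0]))


def percent_answers_closer(query_box, answer_boxes, non_answer_boxes,
                           distance=center_l1_distance):
    """Percentage of (answer, non-answer) pairs in which the answer box
    is strictly closer to the query box than the non-answer box."""
    wins = 0
    total = 0
    for answer in answer_boxes:
        for non_answer in non_answer_boxes:
            total += 1
            if distance(query_box, answer) < distance(query_box, non_answer):
                wins += 1
    return 100.0 * wins / total


query_box = ([0.0, 0.0], [1.0, 1.0])
answer_boxes = [([1.0, 0.0], [1.0, 1.0]), ([3.0, 0.0], [1.0, 1.0])]
non_answer_boxes = [([2.0, 0.0], [1.0, 1.0]), ([4.0, 0.0], [1.0, 1.0])]

score = percent_answers_closer(query_box, answer_boxes, non_answer_boxes)
```

Under this reading, a value of 50% corresponds to a ranking that is no better than random, and 100% to answers and non-answers that are perfectly separable by distance.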

Conclusion and Outlook
In this work, we looked critically at the currently prevailing evaluation strategy for approximate complex structured query algorithms for knowledge graphs. Typically, these systems take a query as an input and produce a ranking of all entities in the KG as an output. The performance of these systems is then determined using metrics typically used in information retrieval. What we propose is to augment the current evaluations by also requiring these systems to produce a binary classification of the nodes into a class of answers and a class of non-answers. This is needed because many applications simply cannot work with a ranking and need a fixed set of answers to work with.
As a first proof of concept, we have adapted ideas from MPQE and query2box, and created an embedding algorithm that represents the queries and the entities as axis-aligned hyper-rectangles. We noticed that the performance of this system is still low, and expect that future work can improve considerably upon this first attempt.
As future research directions, we see a need to expand our experiments to include other query types (disjunctions, negations, filters, etc.), in order to show the generalizability of our approach. This will, however, require new representations for the volumes, as these operations are not possible if we stay with just boxes. For example, the negation of a box is no longer a box.
Moreover, it needs to be investigated how our method can be applied to different kinds of graphs. This will give us insights as to what changes need to be made in terms of training data (via query generation) as well as the effects on model performance. Also, it seems worth experimenting with different geometric representations for the parts of the query (anchors, variables and targets). Finally, since our experiments were relatively small-scale, further research could also start by simply experimenting with different settings for our current architecture.