for Keyword Factual Query Interpretation

Information retrieval is regarded as pivotal in empowering lay users to access the Web of Data. Over the past years, it has gained momentum, with a large number of approaches being developed for different scenarios such as entity retrieval, question answering, and entity linking. This work addresses the problem of entity retrieval over RDF knowledge graphs using keyword factual queries. It presents an approach that incorporates keyword graph structure dependencies through a conditional spread activation. Experimental evaluation on standard benchmarks demonstrates that the proposed method can improve the performance of current state-of-the-art entity retrieval approaches.


Introduction
Over the last years, a wealth of information has been published as structured data. The Resource Description Framework (RDF) has become a standard format for many publicly available knowledge graphs (KGs), such as DBpedia [18] and Wikidata [26]. An RDF KG organizes information in the form of subject-predicate-object statements expressing semantic relations between entities (e.g., persons, organizations, and places) and concepts (e.g., given names, addresses, and locations). Currently, approximately 10,000 RDF KGs are available via public data portals. Together, these graphs compose the so-called Linked Open Data (LOD) Cloud.
Consequently, approaches designed to retrieve or use KG information have been receiving substantial attention. Some of these approaches are Entity Retrieval (ER), Entity Linking (EL), Entity Disambiguation (ED), and Question Answering (QA). ER is a category of information retrieval (IR) in which the result of a natural language search query is an entity or an entity's property rather than a document. ER methods play a fundamental role in IR on KGs. They enable lay users to access KG information and support other approaches in performing EL [7], ED [24,36], and QA [10,33,35] tasks. Improving ER methods can therefore have a substantial impact on the whole IR chain.
ER on RDF KGs has peculiar characteristics that set it apart from standard document retrieval. The information in a KG is structured into entities, attributes, classes, and their relationships. Exploiting this structure makes ER a thriving research topic. Early approaches applied bag-of-words document retrieval techniques [4,8,38]. Research then shifted to representing the relations between KG entities and concepts as fields, giving rise to field retrieval models [5]. Later studies focus on evaluating the influence of word sequence and property type [2,21,41]. Recently, the use of EL has been considered for improving ER [14].
This work presents CACAO, a novel approach for ER on large and diverse RDF KGs. It relies on a novel spread activation (SA) method to improve information access. SA is a method that iteratively propagates weights in a graph from one node to another [6]. CACAO differs from previous approaches by evaluating the query's intent on entities and concepts rather than fields, and by avoiding the over- and underestimation of keyword relatedness by accounting only for the most highly activated keywords. The evaluation of the approach on two standard benchmarks shows an F-measure improvement of ≈10%.
The remainder of this work is organized as follows. Section 2 defines RDF KG and states the problem. Section 3 describes the conditional spread activation model entitled CACAO. Section 4 presents the evaluation and discusses the results. Section 5 provides a literature overview on related work. Finally, Sect. 6 concludes with an outlook on the approach's limitations and potential future work.

Preliminaries
An RDF KG can be regarded as a set of triples of the form <s, p, o> ∈ (I ∪ B) × P × (I ∪ L ∪ B), where: I is the set of all IRIs; B is the set of all blank nodes, B ∩ I = ∅; P is the set of all predicates, P ⊆ I; E is the set of all entities, E = (I ∪ B) \ P; L is the set of all literals; and R is the set of all resources, R = I ∪ B ∪ P ∪ L. In this graph, an entity type is specified by the property rdf:type, while the label is specified by the property rdfs:label. A field of an entity is a predicate-object pair f = <p, o> belonging to an entity triple <e, p, o>. The aim of entity retrieval is to recover the top-K ranked entities that best address the information need behind a given query, as follows.
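To make the notion of a field concrete, the following minimal Python sketch groups the triples of a toy graph into the fields of each subject. The function name fields_by_entity and the four example triples are illustrative assumptions, not part of the original work or of any RDF library.

```python
from collections import defaultdict

def fields_by_entity(triples):
    """Group <s, p, o> triples into the fields f = <p, o> of each subject."""
    fields = defaultdict(list)
    for s, p, o in triples:
        fields[s].append((p, o))
    return dict(fields)

# Toy graph; the triples are made up for illustration only.
triples = [
    ("dbpedia:Aristotle", "rdf:type", "dbo:Person"),
    ("dbpedia:Aristotle", "rdfs:label", "Aristotle"),
    ("dbpedia:Aristotle", "dbo:birthPlace", "dbpedia:Stagira"),
    ("dbpedia:Stagira", "rdfs:label", "Stagira"),
]
fields = fields_by_entity(triples)
```

Here fields["dbpedia:Aristotle"] holds three predicate-object pairs, one per triple in which the entity is the subject.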

Definition 1 (Problem Statement).
Formally, top-k entity retrieval takes a keyword query Q, an integer k > 0, and a set of entities E = {e_1, e_2, ..., e_|E|}, and returns the top-k entities based on a scoring function S(Q, e).
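The problem statement can be read as the following generic Python sketch. The scoring function is passed in as a parameter; overlap_score is only a hypothetical stand-in for S(Q, e) and is not the scoring function proposed in this work.

```python
import heapq

def top_k(query, entities, score, k):
    """Return the k entities with the highest score(query, e), best first."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return heapq.nlargest(k, entities, key=lambda e: score(query, e))

def overlap_score(query, entity_label):
    """Hypothetical S(Q, e): number of keywords shared by query and label."""
    return len(set(query.lower().split()) & set(entity_label.lower().split()))

ranked = top_k("carrot cake", ["Carrot", "Carrot Cake", "Apple Pie"], overlap_score, 2)
```

With this stand-in score, "Carrot Cake" is ranked before "Carrot", and "Apple Pie" is not returned.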

The Approach
CACAO is an ER approach that facilitates information access using keyword factual queries over RDF knowledge graphs. Factual queries are those whose intent can be formalized by simple Basic Graph Patterns (BGPs). Entity retrieval on KGs is a long-studied research topic. Early approaches rely on bag-of-words models [4,8,38], which suffer from unrelatedness [5] and verbosity [29]. They were built under the assumption that the distribution of keywords is proportional to their subject relatedness [19]. This idea conflicts with the fact that people can describe things differently: some authors are more descriptive or verbose than others. Particularly in the case of DBpedia, editors' experience or knowledge can unconsciously influence keyword frequency or even graph connectivity. To address the problem of verbosity, researchers proposed scoring keywords normalized by the information (entity) length [29]. Another generation of ER approaches focused on the problem of unrelatedness by employing field retrieval models [5]. Later studies focused on how to weight fields differently in order to improve ER accuracy [2,21,41]. Nevertheless, field retrieval models are unable to relate query keywords to a specific predicate or object because the two are treated as one, a bag-of-(field-words). Recent approaches introduced two-stage techniques employing ER followed by Entity Link Retrieval (ELR) [14].
CACAO addresses the ER problem in a different manner. It relies on an SA method that works in three steps. A query triggers an activation function that measures the relatedness of KG resources w.r.t. the query. The resource relatedness values are then spread to their connected entities using a conditional backward propagation and, in a later step, a conditional forward propagation. The individual resource relatedness measurement addresses the problem of finding the query's intent. The conditional propagation avoids the over- and underestimation of frequent and rare keywords. The next sections describe how the (1) Activation, (2) Conditional Backward Propagation, and (3) Conditional Forward Propagation work.

Activation
CACAO performs the activation on the resources. It uses the resource label coverage to evaluate a resource's relatedness to the query. Under this criterion, a query containing "birth date" should be more related to the property dbo:birthDate than to the property dbo:deathDate or dbpprop:date, while the query "date" should be more related to the property dbpprop:date than to dbo:birthDate. Equation 1 formalizes the evaluation of the query label's coverage. It receives as parameters the query $\vec{Q}$ and a resource label $\vec{L}$, both represented by bit vectors. In these vectors, keywords are dimensions whose occurrence is either zero or one.
Yet, the equation above cannot be used as an activation function, because it scores resources with the same query coverage rate equally. For the sake of illustration, let us take as an example the query "carrot cake". For this query, dbr:Carrot, dbr:Cake, and dbr:Carrot Cake all have the same coverage value of one, although dbr:Carrot Cake has two overlapping keywords. Thus, full label-query overlaps are evaluated as the number of query keywords to the power of the number of label keywords. Incomplete overlaps are still considered, but treated with less importance. For those, the intersection of query and label over their union suffices (Eq. 2). Equation 3 outlines the activation function. Notice two important properties. First, entities whose resources were activated by mere chance will always score lower than the query length ($\sum_i \vec{Q}_i$). Second, it becomes easier to differentiate between resources with full and partial query coverage.

Fig. 1b. The conditional forward activation performed on the query "carrot cake ingredients". The activation value of the entity dbpedia:Carrot Cake and the property dbo:ingredient is transferred to the property's entities (3-4).
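The activation described in this section can be sketched in Python as follows. Because Eqs. 1 to 3 are not reproduced here, the code is only one reading of the prose: coverage is assumed to be the share of label keywords found in the query, full overlaps are scored as the number of query keywords to the power of the number of label keywords, and partial overlaps use intersection over union. Keyword sets stand in for the bit vectors, and all function names are illustrative.

```python
def coverage(query, label):
    """Assumed reading of Eq. 1: share of label keywords found in the query."""
    if not label:
        return 0.0
    return len(query & label) / len(label)

def jaccard(query, label):
    """Assumed reading of Eq. 2: intersection over union of the keyword sets."""
    union = query | label
    return len(query & label) / len(union) if union else 0.0

def activation(query, label):
    """Assumed reading of Eq. 3: |Q|^|L| on full coverage, Jaccard otherwise."""
    if label and label <= query:
        return float(len(query) ** len(label))
    return jaccard(query, label)

query = {"carrot", "cake"}
```

Under these assumptions, the labels {"carrot"} and {"cake"} are each activated with 2, the label {"carrot", "cake"} with 4, and a partially matching label such as {"carrot", "soup"} with 1/3, so full matches on longer labels are clearly separated from shorter or partial ones.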

Conditional Backward Propagation
Backward propagation consists of distributing computed values backward through a network. It is used in neural networks to transfer errors throughout the network's layers [12]. In CACAO, backward propagation is used to spread the resources' activation values to their connected entities. By doing so, the approach implicitly computes the relatedness of the entity and its connected resources to the query elements. However, the transfer is conditioned on only the most activated keyword value. It spreads from the resource to the fields and, likewise, from the fields to the entity. This strategy prevents the frequency of the keywords from affecting the activation value while preserving their informativeness. For example, the entity dbpedia:Aristotle contains both dbo:birthDate and dbo:deathDate. Without the condition, the keyword "date" would have twice as much impact on dbpedia:Aristotle as on entities containing only one of the properties (e.g., dbo:birthDate, dbo:deathDate, or dbpprop:date). Previous works demonstrate that scoring fields differently can improve ER accuracy [2,21,41]. Hence, CACAO employs field weighting as described by Marx et al. [21]. Additionally, a query intent can be one entity or a set of entities. In the latter case, relevance ranking becomes an important feature. As an example, the query "give me all persons" can return more than one million persons if applied to the DBpedia KG, but not all of these entities may be relevant to the user. To deal with this problem, each activated entity receives a PageRank value normalized to be lower than a keyword weight. This work uses a modified version of PageRank [32] dubbed DBpedia PageRank, which has been shown to produce better estimations [22].
Algorithm 1 describes the computation of the conditional backward propagation formally (Fig. 1a).

It starts when the field activation function $A_f(\vec{Q}, f, \vec{L})$ receives a bit vector representing the query $\vec{Q}$, the field $f$, and a set of processed keywords $\vec{L}$. The field activation value ($a_f$) is initialized with 0, the sorted resource list with ∅, and $R_f$ receives the field's resource list. Next, the function iterates over $R_f$, computing each resource's activation value ($a_r$) using the vectorized resource label returned by the function $\vec{V}_L(r)$. In line 19, the function INSERT performs an insertion sort on the sorted resource list. The insertion is performed in ascending order of the resources' activation values to ensure that only the highly activated keywords have their value transferred to the entity. Subsequently, an iteration runs over the sorted resource list. The activation $a_r$ is now evaluated over the resource label after removing the keywords that were computed in previous iterations ($\vec{L}^{U}_{r}$). In the last instructions of the iteration, the resource activation value is transferred to the field activation $a_f$ (line 25), and the resource keywords are added to the computed keyword list $\vec{L}$ (line 26). The function finishes by adding the field's weight φ(f) to the final activation value $a_f$ (line 28). Notice that we did not discuss stop-word removal or tokenization in describing the algorithm because they are optional and do not influence the overall computation.
The entity activation is computed over the fields' activations as follows. The function $A_r(\vec{Q}, e)$ receives a vectorized query $\vec{Q}$ and an entity $e$. The entity activation value $a_e$ is initialized with 0. The computed keywords $\vec{L}$ and the field set $F_e$ receive ∅. The field set $R_f$ receives the list of entity fields. Similar to the field activation function, the entity activation consists of two iterations. The first (line 3) computes the field activation value $a_f$ over every keyword of the field and uses an insertion sort function (line 5) to add the fields to $F_e$ according to their inverse activation value. In this iteration, the computed-keywords parameter $\vec{L}$ of the field activation function $A_f(\vec{Q}, f, \vec{L})$ receives an empty set (line 5), allowing it to compute the activation over every keyword. The second iterates over the sorted fields $F_e$ (line 8), discarding the computed keywords and transferring each field's activation value to the entity activation $a_e$. Finally, a normalized PageRank value returned by the function ψ(e) is added to the activation value.
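The two functions described above can be approximated by the following Python sketch. It is an interpretation under stated assumptions rather than a transcription of Algorithm 1: keyword sets replace bit vectors, the activation function is the assumed one (power on full coverage, intersection over union otherwise), resources and fields are visited from the most to the least activated, the field weight is added only to activated fields, and the parameters weight and pagerank stand in for φ(f) and ψ(e). All names are illustrative.

```python
def activation(query, label):
    """Assumed activation: |Q|^|L| for a fully covered label, else Jaccard."""
    if not label:
        return 0.0
    if label <= query:
        return float(len(query) ** len(label))
    return len(query & label) / len(query | label)

def field_activation(query, resources, computed, weight=0.0):
    """resources: keyword sets of the labels of a field's predicate and object."""
    ranked = sorted(resources, key=lambda lab: activation(query, lab), reverse=True)
    a_f, seen = 0.0, set(computed)
    for label in ranked:
        # Keywords already accounted for no longer contribute.
        a_f += activation(query, label - seen)
        seen |= label
    if a_f > 0:
        a_f += weight  # assumption: only activated fields receive their weight
    return a_f, seen

def entity_activation(query, fields, pagerank=0.0):
    """fields: list of fields, each a list of keyword sets."""
    ranked = sorted(fields,
                    key=lambda f: field_activation(query, f, set())[0],
                    reverse=True)
    a_e, seen = 0.0, set()
    for resources in ranked:
        a_f, seen = field_activation(query, resources, seen)
        a_e += a_f
    return a_e + pagerank

query = {"date"}
birth = [{"birth", "date"}, {"384", "bc"}]   # toy field, values illustrative
death = [{"death", "date"}, {"322", "bc"}]   # toy field, values illustrative
both = entity_activation(query, [birth, death])
one = entity_activation(query, [birth])
```

In this toy example both and one are equal: once "date" has been consumed by the first field, the second field adds nothing, which mirrors the dbpedia:Aristotle example of the text.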

Conditional Forward Propagation
Forward propagation is applied only when a property contributes to the field's activation. It forwards the entity activation to its activated properties and, from them, to their objects. As a result, objects have a higher activation value than their associated entity. Let us suppose that a user is looking for "carrot cake ingredients". In the case of dbpedia:Carrot Cake, the label activation is backward propagated to the entity and then forwarded, together with the property activation, to the objects of the dbo:ingredient fields. Thus, the object of dbo:ingredient in the BGP <dbpedia:Carrot Cake dbo:ingredient ?object> has a higher activation value than dbpedia:Carrot Cake. Fig. 1b shows the conditional forward propagation for our running example query "carrot cake ingredients".
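A minimal Python sketch of this step is given below. It assumes that the entity and property activations have already been computed and that they are combined by addition, which the text does not state explicitly; the function name, the activation values, and the example fields are illustrative only.

```python
def forward_propagation(entity_activation, property_activations, fields):
    """Forward an entity's activation to the objects of its activated properties.

    fields: (property, object) pairs of one entity.
    property_activations: activation of each property w.r.t. the query.
    """
    object_activations = {}
    for prop, obj in fields:
        a_p = property_activations.get(prop, 0.0)
        if a_p <= 0:
            continue  # the condition: properties that were not activated are skipped
        a_o = entity_activation + a_p  # assumption: additive combination
        object_activations[obj] = max(a_o, object_activations.get(obj, 0.0))
    return object_activations

# Toy values for the query "carrot cake ingredients"; all numbers are made up.
fields = [
    ("dbo:ingredient", "dbpedia:Carrot"),
    ("dbo:ingredient", "dbpedia:Walnut"),
    ("dbo:country", "dbpedia:England"),
]
objects = forward_propagation(9.0, {"dbo:ingredient": 3.0}, fields)
```

Only the objects of dbo:ingredient are activated, and each ends up with a value above the entity's own activation of 9.0.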

Evaluation
The evaluation was designed to measure the accuracy of CACAO compared to other ER and Entity Linking methods. All output generated by the systems is publicly available at https://github.com/AKSW/irbench. Several benchmark data sets could be used for this task, including benchmarks from the Semantic Search initiatives [13] and QA Over Linked Data (QALD). Semantic Search is based on user queries extracted from the Yahoo! search log, containing an average of 2.2 words per query. QALD provides both QA and keyword search benchmarks for RDF data. The QALD data sets are the most suitable because of the wide variety of queries they contain and because they make use of DBpedia, a very large and diverse KG. In this work, we use the QALD version 2 (QALD-2) data set from the Test Collection for Entity Search (DBpedia-Entity) [1] and QALD version 4 (QALD-4) [34]. Table 1 shows the number of queries evaluated on each of them.

Experimental Setup
The evaluation contains two setups. The first setup evaluates CACAO against state-of-the-art Entity Retrieval (ER) approaches using QALD-2 from DBpedia-Entity. The second setup evaluates CACAO against state-of-the-art ER and Entity Linking Retrieval (ELR) approaches for RDF data using QALD-4. Both setups evaluate the approach with (CACAO+F) and without (CACAO) forward propagation.

Algorithm 1. A Conditional Backward Propagation.

First Setup. The first setup evaluates CACAO against thirteen different ER models distributed over three groups (Unstructured, Fielded, and Other models) using the QALD-2 DBpedia-Entity benchmark data set. Results are reported using the benchmark's standard evaluation metrics: Mean Average Precision (MAP) and Precision at rank 10 (P@10) [20]. The evaluated unstructured retrieval models use a flattened entity representation: LM (Language Modeling) [27], SDM (Sequential Dependence Model) [23], and BM25 [29]. Five retrieval models employ a fielded entity representation: MLM (Mixture of Language Models) [25]; FSDM (Fielded Sequential Dependence Model) [41]; BM25F [5]; MLM-all, with equal field weights; and PRMS (Probabilistic Model for Semistructured Data) [15]. The LTR (Learning-to-Rank) approach [3] employs 25 features from various retrieval models and is trained using the RankSVM algorithm. All EL (Entity Link) methods used TAGME [11] for annotating queries with entities, and a URI-only index (with a single catch-all field) for computing the EL component. CA suffixes refer to models that are trained using Coordinate Ascent.
The idea is that there is a need to address inflections only on properties, where verbs occur, rather than on objects, which usually contain proper names. The Levenshtein a and Jaccard a methods are used to measure local keyword frequency without global occurrence normalization. CACAO and Glimmer Y! [2] performed all queries in OR mode. The performance considers only the top-k entries returned by each approach, where k equals the number of entries in the target test query. The EL evaluation in Table 5 evaluates the mentioned baseline functions as well as the latest version of DBpedia Spotlight (version 1.0), AGDISTIS [36], and the state-of-the-art ED approach MAG [24] on simple BGP queries. This evaluation was designed to measure how accurate CACAO can be when compared with approaches that use EL on factual keyword queries.
We discard queries that can only be answered using classes and properties. We avoid these queries because annotators usually can only handle entities. QALD-4 has ten queries that meet this criterion: Queries 12, 13, 21, 26, 30, 32, 34, 41, 42, and 44. All queries evaluated over DBpedia Spotlight used a refinement operator approach starting from a confidence of 0.5 and decreasing in steps of 0.05 until reaching an annotation, when possible, or zero. AGDISTIS [36] and MAG [24] were evaluated on queries with manually marked entities.
Query and Resource Parsing. All implemented models (CACAO, CACAO+F, Jaccard, and Levenshtein) parse queries and resources by extracting individual keywords, removing punctuation and capitalization, and applying lemmatization.
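The decreasing-confidence procedure used for DBpedia Spotlight can be sketched in Python as below. The annotate argument is a hypothetical callable that takes a query and a confidence and returns a possibly empty list of annotations; it is not the DBpedia Spotlight API. Confidence is stepped in integer percentages to avoid floating-point drift.

```python
def annotate_with_backoff(query, annotate, start_pct=50, step_pct=5):
    """Lower the confidence from 0.50 in steps of 0.05 until an annotation is found.

    Returns (confidence, annotations), or (None, []) if even confidence 0 fails.
    """
    for pct in range(start_pct, -1, -step_pct):
        confidence = pct / 100
        annotations = annotate(query, confidence)
        if annotations:
            return confidence, annotations
    return None, []

def toy_annotator(query, confidence):
    """Hypothetical annotator that only answers below a confidence of 0.5."""
    return ["dbpedia:Example"] if confidence < 0.5 else []

result = annotate_with_backoff("some keyword query", toy_annotator)
```

With the toy annotator, the first call at confidence 0.5 returns nothing and the second call at 0.45 succeeds, so result is (0.45, ["dbpedia:Example"]).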

Results
The results show that CACAO outperforms the state of the art in both ER and EL tasks with keyword factual queries. It achieved ≈10% more accuracy than the ER and EL approaches. Further, as expected, annotators performed better than ER approaches on the EL task. Tables 2 and 3 show, respectively, the MAP and P@10 performance of CACAO compared to 13 methods. The tables report scores with a precision of four digits. Note that MAP@10 scores are considerably lower than P@10 scores.
That occurs because MAP is calculated on the average entry's precision per question, while P is computed only over matching entries. It means that although the entities are retrieved, their rank for the query can still be improved. Except for CACAO, some methods achieved different positions in P@10 and MAP. The outcomes reveal that CACAO could produce more (1) precise and (2) complete results. In general, except for SDM, the results confirm previous findings [14] showing that CA and EL approaches achieve better performance than their simple versions without them, while EL-based methods perform better than CA ones. CACAO could outperform previous methods because it acts as a resource linking approach. It evaluates resource dependencies rather than the bigram and trigram keyword dependencies used in fielded approaches. It also overcomes SDM's weakness in sorting entities by relevance [41] by using PageRank. Table 4 shows the Precision, Recall, and F-measure achieved by each baseline model on QALD-4. CACAO achieved a better F-measure than CACAO P 65 mainly because it could overcome the problem of vocabulary mismatch on Query 29, by annotating the keyword "Australian" with dbpedia:Australia, and on Query 49, by annotating the keyword "Swedish" with dbpedia:Sweden. As expected, methods empowered by disambiguation (Levenshtein a and Jaccard a) score better than bag-of-words ones (Levenshtein b and Jaccard b). Levenshtein a scores better than Jaccard a, confirming the conclusion of previous research [40]. However, Jaccard b and Levenshtein b have their major drawbacks at the path disambiguation level. When retrieval scoring functions weight keywords equally, they cannot disambiguate among resources containing the same keywords. For instance, if a user queries "places", both the property dbo:place and the entity type dbo:Place can be equally weighted, leading these models to retrieve places as well as the entities connected to the property dbo:place.
Not surprisingly, there was an issue related to the local term frequency in the BM25F [2] model. On Query 30, it retrieves the entity dbpedia:Halloween (Dave Matthews Band song) because the word "halloween" occurs more frequently there than in the desired entity (dbpedia:Halloween). Table 5 shows the EL evaluation over ten queries. There, CACAO P 65 achieved the highest F-measure of 1. CACAO achieved an F-measure of 0.90, obtaining ≈0.10% more accuracy than MAG, the third best-performing approach. CACAO wrongly annotates the keyword "bach" in Query 21 with dbpedia:Bachs. CACAO P 65 applied the 65 rule only to the properties, correctly assigning dbpedia:Bach. MAG could not correctly annotate Queries 34 and 44, and DBpedia Spotlight could not correctly annotate Queries 12, 41, and 42. The results expose a deficiency of EL systems in dealing with single-entity factual queries.
Entity Linking and Disambiguation approaches [7,24] exploit IR for finding the corresponding entity. For these systems, incomplete labels can lead to no annotation or to an inconsistent one. For example, in our evaluation DBpedia Spotlight links the keyword "baldwin" in Query 47 with the entity dbpedia:Baldwin Locomotive Works. Other queries do not generate any annotation. That is the case for Query 36, which DBpedia Spotlight does not annotate at a confidence score of 0.5 but annotates wrongly at a confidence of 0.45. The use of the 65 rule enhanced the results achieved by CACAO when applied to subjects, properties, and objects in comparison to when applied only to properties (CACAO P 65); see Table 4. This happens because it can help to annotate noun resources that are not handled by the lemmatization, e.g., Sweden and Swedish in Query 43. However, the use of this method decreases the precision of the approach in the Entity Linking task (see Table 5) because the 65 rule increases the number of possibly overlapping resources, leading to wrong annotations. That is the case for Query 21.
Complexity Analysis. In general, entity (document) retrieval algorithms can be implemented as entity-at-a-time or term-at-a-time. Entity-at-a-time retrieval algorithms aggregate scores over entities, whereas term-at-a-time algorithms aggregate them over terms. Term-at-a-time is the most common retrieval method and relies on posting lists, as implemented in popular IR frameworks such as Lucene. Intuitively, the complexity of term-at-a-time methods is bounded by the size M of the posting lists of the matching terms plus the insertion of E matching entities into a tree of size k (top-k), which leads to a complexity of O(M + E log k).
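The gap between P@10 and MAP discussed at the start of this passage can be illustrated with the standard textbook definitions of the two metrics, sketched below in Python. These definitions are an assumption and may differ in detail from the benchmark's own evaluation scripts; rankings are assumed to contain no duplicate entities.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the first k results that are relevant."""
    return sum(1 for e in ranked[:k] if e in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant result."""
    hits, total = 0, 0.0
    for rank, entity in enumerate(ranked, start=1):
        if entity in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

relevant = {"a", "b"}
well_ranked = ["a", "b", "x", "y"]
badly_ranked = ["x", "y", "a", "b"]
```

Both rankings retrieve the same relevant entities and therefore have the same precision at rank 4, but the average precision drops from 1.0 to about 0.42 when the relevant entities appear last, which is the effect described in the text.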
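The O(M + E log k) bound can be made tangible with a small term-at-a-time sketch in Python. It is a generic illustration, not the implementation used in this work or in Lucene: the posting lists are a plain dictionary with made-up weights, and a binary heap of size k is used in place of the tree mentioned in the text, which gives the same logarithmic insertion cost.

```python
import heapq
from collections import defaultdict

def term_at_a_time(query_terms, postings, k):
    """Accumulate scores term by term, then keep the k best entities.

    postings: term -> list of (entity, weight); traversing them costs O(M).
    The heap never exceeds k entries, so selecting the top-k costs O(E log k).
    """
    if k <= 0:
        return []
    scores = defaultdict(float)
    for term in query_terms:
        for entity, weight in postings.get(term, []):
            scores[entity] += weight
    heap = []  # min-heap holding the current top-k as (score, entity)
    for entity, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, entity))
        elif (score, entity) > heap[0]:
            heapq.heapreplace(heap, (score, entity))
    return sorted(heap, reverse=True)

# Toy posting lists; entities and weights are made up for illustration.
postings = {
    "carrot": [("e1", 1.0), ("e2", 0.5)],
    "cake": [("e1", 1.0), ("e3", 0.7)],
}
top = term_at_a_time(["carrot", "cake"], postings, k=2)
```

For the toy index, top contains e1 with a score of 2.0 followed by e3 with 0.7, while e2 is dropped from the heap.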

Related Work
IR. Existing IR approaches commonly aim to retrieve the top-K ranked documents for a given NL input query. Term Frequency-Inverse Document Frequency (TF-IDF) [30] evaluates query keywords based on their local and global frequency. BM25 [28] extends TF-IDF by introducing a document length normalization. Field-based extensions of bag-of-words models have been proposed for IR on structured data. BM25F [5] is an extension of BM25 for retrieving structured data using differently weighted fields. Mixture of Language Models (MLM) [25] extends the Language Model (LM) [27] using a linear combination of query keyword probabilities in a multi-field language model. Although individual field weights in BM25F and MLM can be tuned for a particular collection, they are fixed across different query keywords. The Probabilistic Retrieval Model for Semistructured Data (PRMS) [16] overcomes this limitation by using a probabilistic classification to map query keywords to fields. Other IR approaches extend field retrieval models by adding keyword dependencies. The Markov Random Field (MRF) retrieval model [23] proposes three variants of keyword query dependencies: (1) full independence (FIM); (2) sequential dependence (SDM); and (3) full dependence (FDM). Zhiltsov et al. [41] proposed a fielded ER model based on unigrams and bigrams applied to five different fields (names, categories, similar entity names, related entity names, and other attributes). The model uses different field weights for ordered (e.g., keywords that appear consecutively) and unordered bigrams. Koumenides et al. Hasibi et al. [14] show that entity linking can improve entity retrieval models. Asi et al. [17] give a comprehensive overview of ER approaches.
Semantic Web. Swoogle [9] introduces a modified version of PageRank that takes into consideration the types of the links between ontologies. Semplore [39], Falcons [4], and Sindice [8] explore traditional document retrieval for querying RDF data. YAHOO! BNC and Umass [13] were respectively the best and second-best ER systems in SemanticSearch'10. YAHOO! BNC uses BM25F, applying specific boosts to different fields (title, name, dbo:title, others). Blanco et al. [2] use BM25F, boosting important and unimportant fields differently. The proposed adaptation is implemented in the Glimmer Y! engine and is shown to outperform other state-of-the-art methods on the task of ER. Virgilio et al. [37] introduced a distributed technique for ER on RDF data using MapReduce. The retrieval is carried out using only the high-ranked (Linear) and all matched fields (Monotonic) strategies. Our work differs from the previous ones by (1) computing the similarity on the individual resources and (2) avoiding the over- and underestimation of frequent and rare keywords.

Conclusion, Limitations and Future Work.
Although recent ER systems have gained in precision, retrieving the desired information still poses a major challenge. This work presented a conditional activation approach for efficient ER over RDF KGs using factual query interpretation. The results show a significant improvement in accuracy in comparison to state-of-the-art ER and EL systems on standard benchmark data sets. In particular, CACAO shows an increase of ≈10% in P@10 and MAP on a standard ER benchmark data set. CACAO could outperform other ER and EL methods because it relies on a model that combines two properties: (1) it is a resource-based rather than a fielded retrieval approach, and (2) it performs a conditional activation that avoids the over- and underestimation of frequent and rare keywords.
Nevertheless, there are a few challenges not addressed in the current implementation such as the keyword and character position as well as approach memory and runtime optimizations. Queries such as "peace and war" and "war and peace" can be activated equally. However, one can refer to dbpedia:Peace and War whereas the other to dbpedia:War and Peace. Recent works [41] have shown promising results in addressing this problem. The evaluation shows that current benchmarks do not address this issue. In future work, we plan to overcome the mentioned challenges. We see this work as the first step of a broader research agenda for designing more accurate ER systems over Linked Data.