Diversified spatial keyword search on RDF data

The abundance and ubiquity of RDF data (such as DBpedia and YAGO2) necessitate their effective and efficient retrieval. For this purpose, keyword search paradigms liberate users from understanding the RDF schema and the SPARQL query language. Popular RDF knowledge bases (e.g., YAGO2) also include spatial semantics that enable location-based search. In an earlier location-based keyword search paradigm, the user inputs a set of keywords, a query location, and a number of RDF spatial entities to be retrieved. The output entities should be geographically close to the query location and relevant to the query keywords. However, the results can be similar to each other, compromising query effectiveness. In view of this limitation, we integrate textual and spatial diversification into RDF spatial keyword search, facilitating the retrieval of entities with diverse characteristics and directions with respect to the query location. Since finding the optimal set of query results is NP-hard, we propose two approximate algorithms with guaranteed quality. Extensive empirical studies on two real datasets show that the algorithms only add insignificant overhead compared to non-diversified search, while returning results of high quality in practice (which is verified by a user evaluation study we conducted).

able to the public. Such knowledge bases typically adopt the resource description framework (RDF) data model. A knowledge base in RDF is a table of subject, predicate, object triplets, where subjects correspond to entities and objects can be other entities or literals (i.e., constants) associated with the subjects via the predicates. For example, the triplet Beethoven, born_in, Bonn captures the fact that the entity Beethoven was born in the city of Bonn. The English version of DBpedia currently describes 4.5M entities, including about 1.4M persons, 883K places, 411K creative works, 241K organizations, 251K species, etc. YAGO contains more than 10M entities (e.g., persons, organizations, cities) and 120M facts about these entities. Data.gov [13] is the largest open-government, data-sharing website that has more than a thousand datasets in RDF format with a total of 6.4 billion triplets, covering information about business, finance, health, education, local government, etc.
Recently, RDF has been enriched with spatial semantics. For example, YAGO2 [34] is an extension of YAGO that includes spatial and temporal data. Such knowledge bases enable location-based retrieval. Indicatively, a key research direction of BBC News Lab is: How might we use geolocation and linked data to increase relevance and expose the coverage of BBC News? [6]. To fully utilize spatially enriched RDF data, the GeoSPARQL standard [5], defined by the Open Geospatial Consortium (OGC), extends RDF and SPARQL to represent geographic information. RDF stores such as Virtuoso [53], Parliament [43], and Strabon [39] have been developed to support GeoSPARQL features. However, retrieval on such systems requires that query issuers fully understand the query language (e.g., SPARQL or GeoSPARQL) and the data domain, which is restrictive and discouraging for common users.
In view of this limitation, keyword search paradigms facilitate retrieval using only keywords [16-23,26,40,46,50]. Given a query that consists of a set of keywords, an answer is a subgraph of the RDF graph. The vertices of the subgraph should collectively cover all the input keywords. The sum of the lengths of the paths connecting the keywords defines a looseness score for the subgraph [27,40,50]. Compact results, i.e., subgraphs of low looseness, are more relevant. This is analogous to finding the smallest (tuple) subgraphs in relational keyword search [35] and in general keyword search on graphs [32].
RDF keyword search has been enhanced to be location aware. Shi et al. [45] propose a model for searching spatial entities, i.e., entities associated with locations. For example, Bonn is a spatial entity since it has a fixed location, whereas Beethoven is not. A spatial keyword search query takes as input a location, a set of query keywords, and an integer k. The result is the set of top-k spatial entities according to a ranking function that considers both the spatial distance between each candidate entity and the query location, and the graph-based proximity of the keywords to the entity. More precisely, a qualified place p is a spatial entity for which there is a compact tree rooted at p that collectively covers all query keywords. To effectively capture the textual semantics of each entity, in a preprocessing phase, the original RDF graph is reduced to a graph for which the keywords on all emitting edges from an entity are absorbed by the entity. Hence, a document (i.e., a set of keywords) is generated for each entity and the edges carry no keyword information. In addition, each entity document absorbs all literals (i.e., constants) associated with it. Figure 1a-d shows the preprocessed graph representation of several triplets extracted from DBpedia. Each node is an entity associated with a document (denoted by the set of keywords in curly brackets), predicates, and literals [40]. Squares correspond to places, for which the locations have been extracted and are shown in Fig. 1e. Circles are non-spatial entities of the RDF graph. The edges model the relationships between entities. Assume a top-3 query issued by a tourist at location q in Fig. 1e with keywords {ancient, roman, catholic, history}. According to [45], the result would consist of places p 1 (rooted at subgraph { p 1 , v 1 , v 2 , v 3 }, Fig. 1a), p 2 ( Fig. 1b) and p 3 . 
This is a good result in terms of semantic relevance and spatial distance, as the places (1) are rooted at compact subgraphs covering all query keywords [32,40] and (2) are geographically close to the query location q. However, results based entirely on relevance may have similar content [15,38,47] and location. For instance, the top-3 places share nodes v 1 and v 3 , implying similar semantics (they all represent communes). In addition, they are all located in the same direction with respect to the query.
Indeed, several studies reveal that users strongly prefer spatially [49] and textually [56] diversified query results over un-diversified ones. Thus, in this paper, we introduce diversified spatial keyword search on RDF data. Our framework enables a trade-off between relevance and diversity. Namely, the output places, in addition to being relevant to the query, should minimize the number of common nodes in their subgraphs and should have diverse locations w.r.t. direction. For instance, a diversified query result for Fig. 1 could include p 1 , p 4 (a river confluence) and p 5 (a church). These places are close to q, their subgraphs are compact, and they contain all keywords. Moreover, they are diverse because they are located around q and their subgraphs have no common nodes. For this purpose, we propose a new spatial diversity metric (Ptolemy's diversity) which also considers the query location and has several attractive properties, e.g., it is naturally normalized to range [0, 1], satisfies triangle inequality, etc. These properties render Ptolemy's diversity superior to the existing metrics for spatial diversity that consider either only the distance [37] or the angle [51] between a pair of locations.
We show that diversified spatial keyword query evaluation on RDF data is NP-hard, by a reduction from the maximum clique problem. Thus, we propose two efficient branch-and-bound algorithms. The first, referred to as IAdU, generates the results by incrementally adding places and updating the scores of candidate entities. The second algorithm, ABP, incrementally builds results by adding the best pair of places at each iteration. IAdU is faster than ABP, but has an approximation bound of 4, whereas ABP returns a 2-approximation of the optimal solution. This trade-off renders the investigation of both algorithms interesting. Concretely, our contributions can be summarized as follows:

- We define the problem of top-k diversified spatial keyword search and show that it is NP-hard.
- We introduce Ptolemy's spatial diversity, a novel spatial diversity metric.
- We propose two efficient algorithms for the retrieval of diverse results.
- We provide a theoretical analysis with approximation bounds for our algorithms.
- We conduct a thorough experimental evaluation on real datasets, demonstrating the efficiency of our algorithms, the effectiveness of our methodology, and user preference for it (established by a user evaluation).

The rest of the paper is organized as follows: Sect. 2 presents related work. Section 3 contains the necessary background on spatial RDF keyword search. Section 4 formalizes the top-k diversified spatial keyword search problem and introduces the general framework. Sections 5 and 6 present the IAdU and ABP algorithms, respectively. Section 7 provides a theoretical analysis of their approximation bounds. Section 8 contains our experimental evaluation. Finally, Sect. 9 concludes the paper with directions for future work.

Related work
To the best of our knowledge, there is no previous work on diversified spatial keyword search over RDF graphs. Below, we briefly discuss work related to keyword search on RDF data and to (spatial) diversification, and how it relates to ours.
Keyword search on RDF data Keyword-based retrieval models over RDF graphs, such as [18,40,48,50], identify a set of maximal subgraphs whose vertices contain the query keywords. They follow the definition proposed in earlier work on keyword search over graphs [7,8,32,35,36] (which is also analogous to the definition we use in this work). Diversified keyword search on RDF graphs [9] diversifies results by considering only their content and structure.
Diversification Diversification of query results has recently attracted a lot of attention as a method for improving the quality of results by balancing similarity (relevance) to a query q and dissimilarity among the results [12,24,25,30,52]. Diversification has also been considered in keyword search over graphs and databases, where the result is usually a subgraph that contains the set of query keywords. In conventional (non-diversified) keyword search methods, the result set usually contains many duplicated answers comprising the same set of nodes (i.e., nodes containing a query keyword). Thus, users are overwhelmed with many similar answers with minor differences [38]. Two recent works, PerK [47] and DivQ [15], address this problem by using the Jaccard distance on the node sets of the results, namely by considering their common nodes. In [38], the problem of finding duplication-free answers is addressed. Liu et al. [42] developed a feature selection algorithm to highlight the differences among structural XML data.
Spatial diversification Several works consider spatial diversification, which finds results whose objects are well spread in the region of interest. In [29,37], diversity is defined as a function of the distances between pairs of objects in the result set R. However, considering only the distance between a pair of objects and disregarding their orientation can be inappropriate. In view of this, van Kreveld et al. [51] incorporate the notion of angular diversity, wherein a maximized objective function accounts for the angle formed by an object in R, the query location q, and an unselected object.
There is no previous work on spatial diversification over RDF data. Our work extends the only existing spatial RDF keyword search framework [45] to support both spatial and textual diversity. In the next section, we describe [45] in detail.
Each vertex of the RDF graph is associated with a document ψ containing the entity's URI, its emitting edges (i.e., predicates), and literals. An entity p is called a place vertex, or place, if it is associated with a spatial location. Each RDF triplet corresponds to a directed edge from an entity (subject) to another entity (object). A top-k semantic place (kSP) query q consists of three arguments: (i) the query location q·λ, (ii) the query keywords q·ψ, and (iii) the number of requested semantic places k.

Definition 1 (Qualifying Tree) Given a kSP query q and an RDF graph, a qualifying tree is a tree of the graph rooted at a place vertex whose vertex documents collectively cover all the query keywords q·ψ.

Simply speaking, the documents of the vertices in a qualifying tree collectively cover all the query keywords. Given a kSP query, there may exist multiple qualifying trees with the same root p, but different sets of vertices. Following the existing work on keyword search over graphs [32,40], the looseness of a qualifying tree T rooted at p is defined (Definition 2) as

L(T) = 1 + Σ_{w ∈ q·ψ} |path_T(p, w)|,

where |path_T(p, w)| is the length of the path in T from p to the vertex covering keyword w. Looseness aggregates the proximity of the query keywords to the root of the tree; 1 is added to the sum of the path lengths for normalization purposes. The lower the looseness, the more relevant the root of the tree is to the vertices that cover the query keywords. Given a place vertex p, the tightmost qualifying tree (TQT) Tp for the given query keywords is the qualifying tree rooted at p with the minimum looseness. For instance, all trees in Fig. 1 are TQTs. A kSP query q aims at finding the k places that minimize f(L(Tp), S(p)) = α·L(Tp) + (1−α)·S(p), where Tp is the TQT of p and S(p) is the Euclidean distance between the query location and p. Parameter α controls the relative importance of textual relevance and spatial proximity.
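As a quick illustration of the scoring above, consider the following minimal sketch (the helper names are ours, and the path lengths are assumed to be already computed):

```python
def looseness(path_lengths):
    """L(T_p) = 1 + sum of path lengths from the root p to the vertices
    covering the query keywords (1 is added for normalization)."""
    return 1 + sum(path_lengths)

def ksp_score(L, S, alpha=0.5):
    """Ranking function f(L(T_p), S(p)) = alpha*L + (1-alpha)*S; lower is better."""
    return alpha * L + (1 - alpha) * S

# Hypothetical place: keywords covered at path lengths 1, 1 and 2 from
# the root, Euclidean distance 0.8 to the query location.
L = looseness([1, 1, 2])
print(L)                   # 5
print(ksp_score(L, 0.8))   # 0.5 * 5 + 0.5 * 0.8 = 2.9
```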
Shi et al. [45] propose the basic semantic place (BSP) and semantic place retrieval with pruning (SPP) algorithms for kSP query processing. BSP retrieves the place vertices in the RDF graph in ascending order of their spatial distances to the query location using an R-tree [4,33]. For each retrieved place p, BSP computes the corresponding TQT Tp. TQT computation is performed by breadth-first search from p until the query keywords are covered. SPP is an extension of BSP that applies two pruning techniques. The first discards unqualified places, i.e., places for which no tree rooted at them covers all query keywords; this is achieved by a reachability index (TF-label [11]) and a pruning rule that disregards places whose TQT cannot be constructed. The second eliminates places by aborting their TQT computation, based on dynamically derived bounds on their looseness. The original algorithms compute and return the top-k places in a batch; in our implementation, we modify them to incrementally retrieve the next place at each iteration according to its relevance score.
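The BSP strategy can be sketched as follows. This is our own simplified encoding (a toy in-memory graph, a linear scan instead of an R-tree, and no early termination), intended only to show the distance-ordered visiting of places and the BFS-based TQT computation:

```python
from collections import deque
import math

def tqt_looseness(graph, docs, root, keywords):
    """BFS from root until all query keywords are covered; returns the
    looseness 1 + sum of path lengths, or None if some keyword is
    unreachable (the place is unqualified)."""
    remaining = set(keywords) - set(docs.get(root, ()))
    total, seen, frontier = 0, {root}, deque([(root, 0)])
    while frontier and remaining:
        node, depth = frontier.popleft()
        for nxt in graph.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            covered = remaining & set(docs.get(nxt, ()))
            if covered:
                total += (depth + 1) * len(covered)  # one path per keyword
                remaining -= covered
            frontier.append((nxt, depth + 1))
    return None if remaining else 1 + total

def bsp(places, graph, docs, q_loc, keywords, k, alpha=0.5):
    """BSP-style loop: visit places in ascending distance to the query
    location and rank qualified ones by alpha*L(T_p) + (1-alpha)*S(p)."""
    scored = []
    for p, loc in sorted(places.items(), key=lambda it: math.dist(q_loc, it[1])):
        L = tqt_looseness(graph, docs, p, keywords)
        if L is not None:
            scored.append((alpha * L + (1 - alpha) * math.dist(q_loc, loc), p))
    return [p for _, p in sorted(scored)[:k]]

# Hypothetical toy graph: p1 needs a 2-hop tree, p2 covers everything itself.
graph = {'p1': ['v1'], 'v1': ['v2'], 'p2': []}
docs = {'p1': ['a'], 'v1': ['b'], 'v2': ['c'], 'p2': ['a', 'b', 'c']}
places = {'p1': (0.0, 0.0), 'p2': (1.0, 0.0)}
print(tqt_looseness(graph, docs, 'p1', {'a', 'b', 'c'}))     # 4
print(bsp(places, graph, docs, (0.0, 0.0), {'a', 'b', 'c'}, 2))
```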

kDSP problem definition
A top-k diversified semantic place (kDSP) query generalizes a kSP query by combining a relevance function to the query and a diversity function on the set of query results that considers their relative location and content. In accordance with [45,55], we represent the RDF data in their native graph form (i.e., using adjacency lists) in memory. Disk-based graph representations for RDF data (e.g., [57]) can also be used for larger-scale data. At a preprocessing phase, we also perform the following. (1) We extract the document descriptions of all vertices and index them by an inverted file, which facilitates the fast search of vertices containing a given keyword. (2) For each vertex, we store in a table the document description and the spatial location (in the case of a place entity), which enables direct access to the keywords and location of a vertex during graph browsing. (3) We use an R-tree [28] to spatially index all place entities, which facilitates incremental nearest place retrieval. Section 4.1 presents the relevance function by building upon the kSP model of [45]. Section 4.2 introduces the diversity function, and Sect. 4.3 defines the kDSP problem. Table 1 contains the symbols used throughout the paper.
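The three preprocessing structures can be sketched as follows. This is a toy illustration with hypothetical data; a real deployment would use a persistent inverted file and an R-tree library rather than the linear scan shown here:

```python
import math
from collections import defaultdict

# (2) Vertex table: vertex -> (document, optional spatial location).
vertex_table = {
    'p1': ({'ancient', 'roman'}, (2.0, 1.0)),   # place entity
    'v1': ({'commune', 'history'}, None),       # non-spatial entity
    'p2': ({'catholic', 'church'}, (0.5, 3.0)),
}

# (1) Inverted file: keyword -> vertices whose document contains it.
inverted = defaultdict(set)
for v, (doc, _) in vertex_table.items():
    for kw in doc:
        inverted[kw].add(v)

# (3) Spatial index stand-in: incremental nearest-place retrieval.
def nearest_places(q_loc):
    """Places in ascending distance to q_loc (an R-tree would do this
    incrementally without sorting all places)."""
    places = [(v, loc) for v, (_, loc) in vertex_table.items() if loc]
    return sorted(places, key=lambda it: math.dist(q_loc, it[1]))

print(sorted(inverted['roman']))     # vertices containing 'roman'
print(nearest_places((0.0, 0.0)))    # places by distance to the query
```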

Relevance function
Consider a kDSP query with location q·λ and keywords q·ψ. Recall that for any place entity p, Tp denotes the tightmost qualifying tree (TQT) rooted at p that covers all query keywords q·ψ. In the context of kDSP queries, we define the looseness-based relevance of a place p as

fL(p) = 1 − L(Tp) / Lmax,   (1)

where L(Tp) is defined according to Definition 2 and Lmax is the maximum looseness that we can tolerate (the concept of Lmax has often been used in earlier work, e.g., [36]). For instance, considering the example of Fig. 1, for Tp1 we have L(Tp1) = 5; assuming Lmax = 15, then fL(p1) = 0.67. We also define the spatial distance score fS(p) of a place p as

fS(p) = 1 − S(p) / Smax,   (2)

where S(p) is the Euclidean distance between p and q and Smax is the maximum distance that can be tolerated (e.g., the largest distance among all pairs of places in the map of a city; the concept of Smax has also been used in earlier work, e.g., [2]). Considering the same example for p1, with S(p1) = 1.93 km and Smax = 5 km, we get fS(p1) = 0.61. Both relevance and distance scores range in [0, 1], which is helpful when combining them with the diversification scores (to be discussed shortly). The holistic relevance f(p) of a place p is

f(p) = β·fL(p) + (1−β)·fS(p),   (3)

where β controls the contribution of the two relevance components (β = 0 considers only fS(p) and β = 1 only fL(p)).
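The relevance components and their combination follow directly from the formulas above; the sketch below reuses the paper's running example values:

```python
def f_L(L, L_max):
    """Looseness-based relevance: 1 - L(T_p) / L_max."""
    return 1 - L / L_max

def f_S(S, S_max):
    """Spatial distance score: 1 - S(p) / S_max."""
    return 1 - S / S_max

def holistic_relevance(L, S, L_max, S_max, beta=0.5):
    """f(p) = beta * f_L(p) + (1 - beta) * f_S(p)."""
    return beta * f_L(L, L_max) + (1 - beta) * f_S(S, S_max)

# Running example: L(T_p1) = 5 with L_max = 15, S(p1) = 1.93 km with
# S_max = 5 km.
print(round(f_L(5, 15), 2))    # 0.67
print(round(f_S(1.93, 5), 2))  # 0.61
```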

Diversity function
Let Tp and Tp′ be the TQTs of places p and p′. The Jaccard distance between the vertex sets of Tp and Tp′ provides a simple and effective way to measure the diversity of keyword search results [15,47]. Specifically, if we overload Tp to denote the set of nodes in the TQT Tp, we can define:

dL(p, p′) = 1 − |Tp ∩ Tp′| / |Tp ∪ Tp′|.   (4)

The Jaccard distance ranges in [0, 1] and satisfies the triangle inequality [41] (as we discuss later, this property enables approximation bounds on the proposed algorithms). For instance, in our example of Fig. 1, the two trees Tp1 and Tp2 have two common nodes and thus a dL value well below the maximum of 1. To measure the geographic variety of two places p and p′ with respect to the query location q·λ, we introduce Ptolemy's spatial diversity dS(p, p′) as follows:

dS(p, p′) = ||p, p′|| / (||q, p|| + ||q, p′||),   (5)

where ||p, p′|| is the Euclidean distance between p and p′. Similar to the Jaccard distance, dS(p, p′) is naturally normalized to range [0, 1], since ||q, p|| + ||q, p′|| ≥ ||p, p′|| (triangle inequality). We illustrate other attractive properties of our spatial scattering function with the help of Fig. 2. Two places p and p′ receive the maximum diversity score dS(p, p′) = 1 if they are diametrically opposite to each other w.r.t. q·λ, e.g., points pA1 and pA2. The pair of places (pC1, pC2) has the same distance as the pair (pA1, pA2), but dS(pC1, pC2) < dS(pA1, pA2), because pC1 and pC2 are in the same direction w.r.t. q (i.e., north of q). The places in pair (pB1, pB2) are further from each other than those in pair (pC1, pC2) and consequently have a higher diversity score.
(This can be shown using the Pythagorean theorem.) In addition, when a place p is far from q, the diversity score of any place pair (p, p′) that includes p is heavily penalized, because ||p, q·λ|| and ||p, p′|| become similar and dominate over ||p′, q·λ||. Finally, as we show in Sect. 7, this measure also satisfies the triangle inequality and helps us derive tight approximation ratios for our greedy algorithms.
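Both diversity components can be sketched as follows; the checks at the end reproduce the behavior illustrated in Fig. 2 with hypothetical coordinates:

```python
import math

def d_L(tree_a, tree_b):
    """Jaccard-based contextual diversity over TQT node sets (Eq. 4)."""
    inter = len(tree_a & tree_b)
    union = len(tree_a | tree_b)
    return 1 - inter / union

def d_S(q, p, pp):
    """Ptolemy's spatial diversity: ||p,p'|| / (||q,p|| + ||q,p'||)."""
    return math.dist(p, pp) / (math.dist(q, p) + math.dist(q, pp))

q = (0.0, 0.0)
# Diametrically opposite places w.r.t. q -> maximum diversity 1.
print(d_S(q, (-1.0, 0.0), (1.0, 0.0)))          # 1.0
# Same direction (both north of q) -> penalized even when far apart.
print(d_S(q, (0.0, 1.0), (0.0, 3.0)))           # 0.5
# Trees sharing a node are less diverse.
print(d_L({'p1', 'v1', 'v2'}, {'p2', 'v1'}))    # 0.75
```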
Given dL(p, p′) and dS(p, p′), Df(p, p′) measures the total diversity between places p and p′:

Df(p, p′) = γ·dL(p, p′) + (1−γ)·dS(p, p′),   (6)

where γ controls the contribution of the two diversification components. The weighting parameters β, γ can be unified into a single parameter that captures the relative importance of content and location in the computation of relevance and diversity.
The diversity score Df(p) of p in the query result R, containing k places, is computed as

Df(p) = Σ_{p′ ∈ R, p′ ≠ p} Df(p, p′).   (7)

Equation 8 shows the holistic score HDf(p) of a place p, which combines relevance and diversity, where λ adjusts their trade-off:

HDf(p) = (1−λ)·(k−1)·f(p) + λ·Df(p).   (8)

A linear function with a trade-off parameter λ has been used extensively in earlier work on diversity, e.g., [52]. We multiply f(p) by k−1 in order to normalize both components to the same range (since Df(p) compares p against the other k−1 elements of the result set R). The relevance f(p) of p is computed by Eq. 3.
To simplify the presentation, we introduce the holistic diversity function of a pair of places, where we re-define our objective as

HDf(p, p′) = (1−λ)·(f(p) + f(p′)) + λ·2·Df(p, p′),   (9)

where Df(p, p′) is scaled up by a factor of 2 to balance the two values of f(p) and f(p′). Note that computing the holistic diversity function of all places of a set R, denoted as HDf(R), using either Eq. 8 or Eq. 9 gives the same result:

HDf(R) = Σ_{p ∈ R} HDf(p) = Σ_{{p, p′} ⊆ R} HDf(p, p′).   (10)

In addition, we introduce notations f(R) and Df(R) for the weighted and normalized summation of the f(p) and Df(p) scores, respectively, of all p ∈ R. Namely, f(R) = (1−λ)·(k−1)·Σ_{p ∈ R} f(p) and Df(R) = λ·Σ_{p ∈ R} Df(p), so that HDf(R) = f(R) + Df(R).
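The equivalence of the per-place (Eq. 8) and per-pair (Eq. 9) formulations can be checked numerically; the relevance and diversity scores below are hypothetical:

```python
from itertools import combinations

def hd_per_place(R, f, Df_pair, lam):
    """Sum Eq. 8 over all places: (1-lam)*(k-1)*f(p) + lam*Df(p)."""
    k = len(R)
    total = 0.0
    for p in R:
        Dp = sum(Df_pair[frozenset((p, pp))] for pp in R if pp != p)
        total += (1 - lam) * (k - 1) * f[p] + lam * Dp
    return total

def hd_per_pair(R, f, Df_pair, lam):
    """Sum Eq. 9 over all pairs: (1-lam)*(f(p)+f(p')) + 2*lam*Df(p,p')."""
    return sum((1 - lam) * (f[p] + f[pp])
               + lam * 2 * Df_pair[frozenset((p, pp))]
               for p, pp in combinations(R, 2))

R = ['p1', 'p2', 'p3']
f = {'p1': 0.9, 'p2': 0.7, 'p3': 0.5}
Df_pair = {frozenset(('p1', 'p2')): 0.4,
           frozenset(('p1', 'p3')): 0.8,
           frozenset(('p2', 'p3')): 0.6}
a = hd_per_place(R, f, Df_pair, lam=0.5)
b = hd_per_pair(R, f, Df_pair, lam=0.5)
print(abs(a - b) < 1e-9)   # True: both formulations give HDf(R)
```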

Problem definition
Finally, we can define the diverse kSP place (kDSP) retrieval problem as follows.
Definition 3 kDSP Problem Definition. Given a query q with location q · λ, set of keywords q · ψ, and an integer k, the kDSP query returns a set R of k place entities that have the highest H D f (R) score.
Since the objective function HDf(p) of a place p necessitates a comparison with the other k−1 places of a candidate set R, we would have to consider all O(n^k) candidate sets R. This problem, as proven by Theorem 1, is NP-hard. In view of this, in the next sections we propose efficient greedy algorithms with approximation guarantees. Note that the above definition is equivalent to the max-sum diversification problem [52].

Theorem 1 The kDSP problem is NP-hard.
Proof In order to prove the hardness of kDSP, we construct a reduction from the clique problem: given an undirected graph G(V, E) and a positive integer k (k ≤ |V|), the decision problem is to answer whether G contains a clique of size k. We start the reduction by creating the complementary graph G′(V, E′), where (vi, vj) ∈ E′ iff (vi, vj) ∉ E. Then, we generate an instance of kDSP as follows. Each vertex vi in V corresponds to a place pi that has a TQT Tpi, rooted at node pi of the RDF graph. For every edge (vi, vj) in E′, we add a node vi,j as a child of the roots pi and pj in the TQTs Tpi and Tpj, respectively. This reduction takes polynomial time, since the cost is O(1) per edge and the number of edges is O(|V|^2). After generating the TQTs, we set λ = 1 (i.e., we disregard relevance) and γ = 1 (i.e., we disregard Ptolemy's diversity) and construct a kDSP query such that, based on the query location and keywords, (i) the places retrieved are those corresponding to the vertices of G and (ii) the TQTs of the places are exactly those defined above. Then, the original graph G contains a clique of size k iff there is a kDSP result R with holistic diversity HDf(R) = k·(k−1). To explain this, assume that there is a clique of k vertices in G. No pair of these vertices is connected by an edge in G′, so their TQTs have zero overlap and their contextual diversity (Eq. 4) is dL(Tpi, Tpj) = 1. For γ = 1, the total diversity between two places equals their contextual diversity. Finally, according to Eq. 10, the holistic diversity of the result R is the total diversity over the k·(k−1)/2 distinct pairs of places in R, i.e., HDf(R) = k·(k−1). Conversely, if there is no clique of size k in G, any result R of k places in kDSP has holistic diversity HDf(R) < k·(k−1). This is because there is at least one pair of places (pi, pj) in R whose corresponding vertices are connected by an edge in G′. Subsequently, there is a common node vi,j in the trees Tpi and Tpj. Thus, based on Eq. 4, their contextual diversity is dL(Tpi, Tpj) < 1, and the holistic diversity of all the pairs in R cannot reach k·(k−1). This completes the proof.

Figure 3b shows (as sets of nodes) the resulting TQTs for the input graph of Fig. 3a. The gray dashed lines represent edges of G′. Each of these edges (e.g., (v2, v4)) adds the same node (e.g., v2,4) under the roots of the corresponding trees (e.g., Tp2 and Tp4). The holistic diversity of the 4DSP query containing all four places is therefore lower than 4·3 = 12, since, e.g., trees Tp2 and Tp4 share the node v2,4. In general, any result of k places with score k·(k−1) corresponds to a clique of size k.
As a final note, we can easily construct the TQTs, shown in the example, as follows. We consider as many query keywords as the maximum degree of a vertex in G (i.e., 2 keywords w 1 and w 2 in this example). Then, we assume that each node added to the trees contains one of the keywords (e.g., v 2,3 contains w 1 and v 2,4 contains w 2 ). For every vertex with the highest degree in G , the root node (e.g., p 2 ) of the corresponding TQT does not contain any of the query keywords. For each of the other vertices, the root node contains the keywords that are not covered by the non-root nodes (e.g., p 3 contains w 2 and p 1 contains both keywords). Again, the construction can be done in PTIME.
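The reduction can be checked on a small instance. The encoding below is our own: with λ = γ = 1, HDf(R) reduces to twice the sum of pairwise Jaccard diversities over the constructed tree node sets:

```python
from itertools import combinations

def reduce_and_score(V, E, R):
    """Build the complement graph G', attach a shared child v_ij to the
    trees of p_i and p_j for every complement edge, and compute HDf(R)
    with lambda = gamma = 1 (pure contextual diversity)."""
    comp = [(u, v) for u, v in combinations(V, 2)
            if (u, v) not in E and (v, u) not in E]
    trees = {v: {('root', v)} for v in V}
    for u, v in comp:
        trees[u].add(('shared', u, v))
        trees[v].add(('shared', u, v))
    # Eq. 9 with lambda = 1: HDf(p, p') = 2 * d_L(p, p').
    return sum(2 * (1 - len(trees[p] & trees[q]) / len(trees[p] | trees[q]))
               for p, q in combinations(R, 2))

V = [1, 2, 3, 4]
E = {(1, 2), (1, 3), (2, 3), (3, 4)}   # {1,2,3} is a clique, {1,2,4} is not
print(reduce_and_score(V, E, [1, 2, 3]))        # 6.0 = k*(k-1) for k = 3
print(reduce_and_score(V, E, [1, 2, 4]) < 6.0)  # True: shared tree nodes
```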

Incremental addition and update (IAdU) algorithm
We apply a greedy heuristic in combination with a branch-and-bound approach that can be injected into any kSP algorithm (e.g., BSP, SPP) [45]. The heuristic iteratively constructs the result set R by selecting a new place entity p that maximizes the contribution it can make toward the overall score HDf(R). The contribution cHDf(p) of a place p considers the f(p) score and also the diversity of p against the existing elements in R. In the first iteration, R is empty; thus, the contribution of a place can only be its f(p) score. In each subsequent iteration, the algorithm selects the place that maximizes cHDf(p), adds it to R, and updates the contributions of the unselected places to reflect the new entry in R. The population of all valid places can be prohibitively large and expensive to process. Thus, we employ a branch-and-bound paradigm that incrementally generates and processes places, in combination with a threshold. More precisely, we reuse the kSP algorithms to incrementally retrieve places in descending order of their f(·) scores. Note that the kSP algorithms of [45] do not produce results incrementally but return the top-k results as a batch; still, we can easily modify them to generate results incrementally. (More precisely, we can use a revised threshold that facilitates the output of the currently best result.)
In summary, we have to update the cHDf(p) scores in two cases: (1) when a place is added to R, where we need to update the scores of all seen elements, and (2) when a new place emerges from the kSP algorithm, where we need to calculate its diversity score against all elements in R. Finally, IAdU uses a threshold θ that facilitates the pruning of unseen places if they cannot qualify for R. Algorithm 1 illustrates the pseudo-code of IAdU. A max-heap H maintains the seen places according to their cHDf(·) values and is initially empty (line 1). In addition, a threshold θ on the cHDf(·) score of all unseen places is maintained and is initially set to ∞ (line 2). The algorithm starts by adding the place p with the largest f(p) score to R (obtained by the kSP algorithm; lines 12-14). During the next iterations, new places are added to the heap H according to their cHDf(·) scores (lines 15-19), which are calculated against the places already in R (lines 16-17). The threshold θ is updated accordingly (line 19). More precisely, θ combines the minimum f(·) value of a seen place with the maximum diversity (i.e., 1) against all elements in R. If the top place of the heap H, which has the largest score among the seen places, has a score greater than the current threshold, then this place is guaranteed to have the next largest score. Thus, we de-heap it and add it to R (lines 4-6). Due to the new addition to R, we need to update accordingly the cHDf(·) scores of all places still in H (lines 7-8) and the threshold for unseen places (line 9). The algorithm terminates when the size of R becomes k (line 20).
Example We demonstrate the IAdU algorithm with the example in Fig. 4 for k = 4. Figure 4 shows the current value of (i) curP (with its respective f(p)), (ii) the heap H, (iii) θ, and (iv) R. At first, the place with the largest f(·) score (i.e., p6) is added to R. Then, iteratively, we retrieve the next places returned by the kSP algorithm, compute their scores, and add them to the heap (according to their HDf(·,·) values against p6, which is currently the only element in R), also updating the threshold. After processing p4, max(H) (corresponding to p3) becomes larger than the threshold; thus, p3 is de-heaped and added to R. Then, all elements in H and θ are updated accordingly. We repeat this until we obtain the four results.
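The IAdU loop can be sketched as follows. This is a simplified in-memory version under stated assumptions: all candidate places and their pairwise diversities are given upfront, so the incremental kSP retrieval and the θ threshold are omitted, and only the greedy add-and-update core remains:

```python
def iadu(candidates, f, Df, k, lam=0.5):
    """Greedy IAdU sketch: seed R with the most relevant place, then
    repeatedly add the place with the largest contribution
    cHDf(p) = sum over p' in R of HDf(p, p') (Eq. 9)."""
    def hd_pair(p, pp):
        return (1 - lam) * (f[p] + f[pp]) + lam * 2 * Df[frozenset((p, pp))]

    R = [max(candidates, key=lambda p: f[p])]
    rest = [p for p in candidates if p not in R]
    while len(R) < k and rest:
        best = max(rest, key=lambda p: sum(hd_pair(p, pp) for pp in R))
        R.append(best)
        rest.remove(best)
    return R

f = {'p1': 0.9, 'p2': 0.85, 'p3': 0.8}
Df = {frozenset(('p1', 'p2')): 0.1,   # p1 and p2 are near-duplicates
      frozenset(('p1', 'p3')): 0.9,
      frozenset(('p2', 'p3')): 0.9}
print(iadu(['p1', 'p2', 'p3'], f, Df, k=2))  # ['p1', 'p3']
```

Note how the slightly less relevant but diverse p3 beats the near-duplicate p2 once p1 is in R.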
Complexity The running time of the algorithm is dominated by the retrieval of the necessary places, in order of their relevance, using the kSP algorithm, until the result is finalized. Let K be the number of these places; the cost of their retrieval is O(K·kSPT), where kSPT is the time required to generate one result incrementally using the given kSP algorithm. An additional cost is due to the heap updates caused by the emergence of new places and by additions to R; this cost is bounded by the K·k heap operations, i.e., it is O(K·k·log K). Finally, we have to add the cost of updating the contributions, which is O(K^2), because for all pairs (p, p′) of retrieved places we have to compute HDf(p, p′). Thus, the complexity of the algorithm is O(K·(kSPT + k·log K + K)).

Add best pairs (ABP) algorithm
The add best pairs (ABP) algorithm greedily constructs the result set R by iteratively selecting the pair of places (p, p′) with the best HDf(p, p′) score. As opposed to IAdU, which selects the next place by considering its diversity to the places already selected, ABP selects the next pair (p, p′) based only on its HDf(p, p′) value, independently of the diversity of p or p′ to the places already in R. Once a pair is selected, both its constituent elements and any pairs they participate in are removed from further consideration by the algorithm (in a lazy fashion). Since a single pair is selected in each iteration, k/2 iterations suffice when k is even. When k is odd, an arbitrary place is inserted into the result set R as its last entity. In a nutshell, ABP efficiently implements this heuristic by iteratively processing the places and by maintaining (1) a ranked list FList of places in descending order of their f(p) scores, (2) one max-heap PairsH(p) for each p in FList, containing pairs (p, p′) with previously seen places p′ (organized by their HDf(p, p′) scores), (3) a max-heap maxPairH, which organizes the top elements of all the PairsH(p)'s, and (4) a threshold θ, which helps us check whether the top pair of maxPairH is guaranteed to be the one with the highest HDf(·,·) among the pairs that have not yet been added to the result R.
Algorithm 2 illustrates the pseudo-code of the algorithm. We start with an empty result set R, an empty list FList of retrieved places, and a threshold θ = ∞. Initially, lines 14-20 retrieve the two most relevant places, say p1 and p2, w.r.t. q using the kSP algorithm and add them to FList. PairsH[p1] is an empty heap, because no place precedes p1 in FList. After obtaining the second place p2, we add to PairsH[p2] the first pair (p2, p1), which becomes the top pair among all PairsH's, so it is then de-heaped and added to maxPairH.
Then, the threshold θ takes as its value the best possible score of a pair consisting of a place not retrieved yet and one of the places in FList (line 20). This best case occurs when the place in FList with the maximum relevance (denoted by max(FList)) is combined with an unseen place of the maximum possible relevance; that relevance is at most the relevance of the place last accessed by the kSP algorithm (denoted by last(FList)). At the same time, these two places are assumed to have the maximum possible diversity score (i.e., 2).
As soon as there are at least two places in FList, the algorithm checks at line 5 whether the top pair of maxPairH has a score greater than θ. If so, this pair (pi, pj) is de-heaped from maxPairH and added to R. Then, the heap PairsH[pi], which keeps track of the pairs formed by pi with other places in FList, is deleted, since it is of no further use; the same happens with PairsH[pj]. pi and pj are also removed from FList, and the threshold θ is updated to reflect these changes. Finally, maxPairH is updated in a lazy fashion so that the pair currently at its top is a valid one (function Update()).
Specifically, in the Update() function, while the top pair of maxPairH includes a place in R (which has already been selected and cannot be selected again), we de-heap this pair (pi, pj). If pi is the place of this pair that is not yet in R, we must replace the pair with a new one from PairsH[pi], such that the top of that heap is not a pair (pi, pl) with pl ∈ R. Hence, while the top pair (pi, pl) of PairsH[pi] has pl ∈ R, we keep de-heaping it. If, in the end, the top pair (pi, pl) of PairsH[pi] has pl ∉ R, we add (pi, pl) to maxPairH and stop the updates; if PairsH[pi] becomes empty, no replacement from it is possible. We then iteratively repeat this process for the new top of maxPairH (line 1); namely, we check whether the top pair intersects with R and, if so, repeat the process until the current top pair does not.
In the main algorithm, if the top of maxPairH does not have a score greater than θ (line 5), ABP retrieves the next place p using the kSP algorithm and creates the corresponding PairsH[p] by comparing p with all places in FList (lines 14-18). maxPairH is also updated to include the top of PairsH[p], i.e., the best pair that includes p (line 19). The threshold θ then decreases (line 20), because the last relevance score last(FList) is updated to the relevance score of p (which is smaller than that of the places retrieved before p). Hence, at the next iteration, the condition of line 5 may become true due to (i) the decrease of θ and (ii) the new addition to maxPairH, which may increase the score of its top element.
ABP terminates as soon as ⌊k/2⌋ pairs have been added to R. If k is even, we are done; if k is odd, ABP adds one arbitrary place from FList to R, e.g., the one with the maximum relevance.
Note that each heap PairsH[p] includes all pairs of p whose other element is a previously seen place in FList. Thus, each pair appears only once across these heaps, i.e., in the heap of its constituent element with the smaller f(·) score. During the execution of the algorithm, when a pair is added to R, the two respective heaps are deleted from further consideration. However, other heaps may still contain pairs involving the two places newly added to R. We manage these pairs lazily: a pair that includes an already-added place is deleted only when it is de-heaped, i.e., when it reaches the top of its heap. Therefore, a heap may still include many pairs whose elements have already been added to R. Similarly, on maxPairH we apply a lazy approach by ensuring only that the top pair consists of elements not in R; the heap may still include non-top pairs containing elements already added to R. We de-heap and discard such a pair only when it reaches the top of the heap, in which case we replace it with the next pair from the respective PairsH[·] heap.
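To make the bookkeeping concrete, here is a small Python sketch of the heuristic. It is a simplification, not the paper's Algorithm 2: it keeps one global pair heap instead of per-place PairsH heaps, and the relevance stream, the HDf scoring function, and the θ bound are caller-supplied stand-ins (all names are hypothetical):

```python
import heapq

def abp(stream, hdf, pair_bound, k):
    """Simplified Add-Best-Pairs sketch.

    stream     -- iterable yielding places in descending f(p) relevance
                  (stands in for incremental kSP retrieval)
    hdf        -- hdf(p, q): combined relevance/diversity score of a pair
    pair_bound -- pair_bound(best, last): upper bound (theta) on hdf for
                  any pair involving a not-yet-retrieved place
    """
    flist = []              # retrieved, still-unused places (relevance order)
    pair_heap = []          # one global max-heap over all seen pairs
    result, used = [], set()
    theta = float("inf")

    def commit_ready():
        # pop pairs whose score provably beats any future pair;
        # lazily skip pairs with an already-used endpoint
        while pair_heap and len(result) < k - k % 2:
            score, a, b = pair_heap[0]
            if a in used or b in used:
                heapq.heappop(pair_heap)        # lazy invalidation
                continue
            if -score <= theta:
                break                           # a better pair may still come
            heapq.heappop(pair_heap)
            result.extend((a, b))
            used.update((a, b))

    for p in stream:
        for q in flist:                         # pair p with earlier places
            heapq.heappush(pair_heap, (-hdf(p, q), p, q))
        flist.append(p)
        if len(flist) >= 2:
            theta = pair_bound(flist[0], flist[-1])
        commit_ready()
        flist = [x for x in flist if x not in used]
        if len(result) >= k - k % 2:
            break

    theta = float("-inf")                       # stream exhausted: drain
    commit_ready()
    if k % 2 == 1 and flist:                    # odd k: most relevant leftover
        result.append(next(x for x in flist if x not in used))
    return result
```

The lazy invalidation mirrors the discussion above: a pair with an already-selected endpoint is dropped only when it surfaces at the top of the heap.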
Example The example of Fig. 5 illustrates ABP for k = 4 (i.e., two pairs). We iteratively retrieve places with their respective f(p) scores and add them to FList. In Fig. 5a, the pair (p3, p6) is the top of maxPairH and is added to R; other heaps, however, may still contain pairs involving p3 or p6, e.g., (p2, p6). ABP deletes such pairs in a lazy fashion when needed. Hence, the next top pair (p4, p6), which follows (p3, p6), is disregarded (indicated with strike-through format). Then, ABP de-heaps the next pair from the respective PairsH heap (PairsH[p4]), which is (p4, p3). We disregard this pair as well, since p3 is already in R. Then, we de-heap pair (p4, p2) and add it to maxPairH (Fig. 5b). Now, (p1, p3) becomes the top of the heap, which also needs to be replaced; (p1, p2) becomes the new top. The threshold θ in Fig. 5a is defined based on p6 and p4, as they are the first and last elements of FList; in Fig. 5b, θ is based on p2 and p5, the current first and last elements of the list.

[Fig. 6: Ptolemy's spatial diversity (Lemma 1)]

Complexity Each PairsH[p] heap has a maximum size of K. Since each of the K² pairs can appear only once across all heaps, the total time for managing these heaps is bounded by O(K² · log(K²)). The maxPairH heap also has a maximum size of K², since we may have to en-heap all pairs from the PairsH[p] heaps, so its management also costs O(K² · log(K²)). Hence, the worst-case complexity of the algorithm is O(K · (kSPT + K · log(K²))), which is higher than that of IAdU. In practice, both algorithms perform well, because K is relatively small; their costs are dominated by the kSPT factor, which is typically larger than the other terms.

Theoretical analysis
In this section, we analyze the approximation bounds of our algorithms (IAdU and ABP). We first deduce some preliminary results from our problem formulation, which enable us to achieve tight bounds for the algorithms. Our proof is based on two key observations: (i) for a result set R, the summation of the HDf(u, v) values of all pairs (u, v) ∈ R equals HDf(R) (Eq. 10) and (ii) HDf(u, v) satisfies the triangle inequality (proved below).

Preliminary results
Lemma 1 Given u, v, w ∈ V, Ptolemy's spatial diversity dS(u, v) satisfies the triangle inequality:

dS(u, v) + dS(v, w) ≥ dS(u, w).

Proof We use Fig. 6 to support our construction. By the definition dS(u, v) = ||u, v|| / (||q, u|| + ||q, v||), the inequality in Lemma 1 can be rewritten as

||u, v|| / (||q, u|| + ||q, v||) + ||v, w|| / (||q, v|| + ||q, w||) ≥ ||u, w|| / (||q, u|| + ||q, w||).

By multiplying both sides of the inequality by (||q, u|| + ||q, v||)(||q, u|| + ||q, w||)(||q, v|| + ||q, w||), we get

||u, v||(||q, u|| + ||q, w||)(||q, v|| + ||q, w||) + ||v, w||(||q, u|| + ||q, v||)(||q, u|| + ||q, w||) ≥ ||u, w||(||q, u|| + ||q, v||)(||q, v|| + ||q, w||). (12)

For the rest of the proof, we assume that ||u, w|| ≠ 0, since when ||u, w|| = 0 the inequality in question becomes obvious. To prove Eq. 12, we make use of Ptolemy's inequality [3], which relates the side lengths and the diagonals of the quadrilateral with vertices q, u, v, and w:

||q, v|| · ||u, w|| ≤ ||q, u|| · ||v, w|| + ||q, w|| · ||u, v||. (13)

We now consider Eq. 13 for two cases. Under condition (i), Eq. 12 reduces to Eq. 14; under condition (ii), Eq. 12 simplifies to Eq. 15. Both Eqs. 14 and 15 hold as a consequence of the triangle inequality on u, v, w, i.e., ||u, v|| + ||v, w|| − ||u, w|| ≥ 0. This completes the proof.

Lemma 2 Given u, v, w ∈ V, the diversity function Df(u, v) (a weighted combination of the content diversity dL(u, v) and the spatial diversity dS(u, v)) satisfies the triangle inequality:

Df(u, v) + Df(v, w) ≥ Df(u, w).

Proof By the definition of Df(u, v), the inequality can be rewritten as a weighted sum of the corresponding inequalities on dL and dS. From Lemma 1, we know that dS(u, v) satisfies the triangle inequality. Additionally, Levandowsky and Winter [41] have shown that the Jaccard distance is a metric, and hence it satisfies the triangle inequality; thus dL(v, w) (defined as the Jaccard distance between two node sets) satisfies it as well. The two satisfied inequalities are:

dS(u, v) + dS(v, w) ≥ dS(u, w) and dL(u, v) + dL(v, w) ≥ dL(u, w).

The (weighted) addition of these inequalities completes the proof.
In general, the diversity function Df(u, v) retains the triangle inequality as long as its constituent components satisfy it.

Theorem 2 Given u, v, w ∈ V, HDf(u, v) (Eq. 9) satisfies the triangle inequality, i.e.,

HDf(u, v) + HDf(v, w) ≥ HDf(u, w). (17)
Proof By expanding HDf(u, v) according to its definition, the inequality reduces to the triangle inequality on Df. From Lemma 2, we know that Df(u, v) + Df(v, w) ≥ Df(u, w), and hence Eq. 17 holds.
Note that the triangle inequality property of HDf(u, v) is independent of f(·); namely, f(·) does not need to satisfy the triangle inequality and can take arbitrary values. It is also easy to see that the property is independent of the values of the tuning parameters (i.e., β, γ, and λ).

Theorem 3 The IAdU algorithm achieves an approximation ratio of 4.

Proof Consider a complete undirected graph where each node u corresponds to a place entity and each edge e(u, v) is weighted by the HDf(u, v) value of the corresponding pair of places. IAdU selects, at every iteration, the place u with the maximum available contribution cHDf(u). This heuristic is similar to the one proposed by Ravi et al. [44]; the difference lies in the first step, i.e., the selection of the first edge. In particular, Ravi et al. (i) first select the pair of nodes with the maximum pairwise distance in the entire graph and (ii) then complete the top-k result set by successively selecting the next element that maximizes the distance to the set of already selected elements. They prove by mathematical induction that their greedy heuristic achieves an approximation ratio of 4.

To adapt their analysis, we give a different deduction for the base case of the induction. For the base case of k = 1, IAdU adds the node with the highest f(u) score in V, which is clearly the optimal result for k = 1. With the second addition, IAdU adds a new node v that forms an edge with u of maximum HDf(u, v) score. We prove below that this HDf(u, v) value is at least half the maximum HDf value in the optimal solution for k = 2.

Lemma 3 The first two nodes added by IAdU form an edge whose score is at least half the maximum score over all edges of the complete graph.

Proof Assume that the optimal solution for k = 2 (i.e., the best edge) is the edge e = (x, y), with weight wmax. Consider the edge (u, v) greedily selected by IAdU in its first two iterations, where node u is selected before node v. Three cases arise:

Case 1: |{u, v} ∩ e| = 2, i.e., the selected edge is the optimal edge e. Trivial; the optimal edge is selected.

Case 2: |{u, v} ∩ e| = 1, i.e., only one node of the selected edge belongs to the optimal edge; w.l.o.g., let this node be x. In subcase (a), x is selected first (u = x); then, the second greedy step selects the heaviest edge incident to x, whose weight is at least w(x, y) = wmax. In subcase (b), the node outside e is selected first; then, between the edges it forms with x and y, the one with the larger weight is selected, and this weight is at least wmax/2 by the triangle inequality on that node, x, and y.

Case 3: |{u, v} ∩ e| = 0, i.e., no node of the selected edge belongs to the optimal edge. We use Fig. 7 to illustrate this case. W.l.o.g., assume that node u was selected first by IAdU. By the triangle inequality, w(u, x) + w(u, y) ≥ w(x, y) = wmax, and due to the greedy selection of IAdU, w(u, v) ≥ w(u, x) and w(u, v) ≥ w(u, y). Hence, w(u, v) ≥ wmax/2. This completes the base case for k = 2; the inductive case for k > 2 follows directly from [44].
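The case analysis of Lemma 3 relies only on the triangle inequality, so it can be probed empirically. The sketch below uses synthetic Euclidean edge weights as a stand-in for HDf (an assumption) and checks that, whichever node a greedy first step picks, its heaviest incident edge weighs at least wmax/2:

```python
import itertools
import math
import random

def heaviest_incident_ok(pts):
    """True iff every node's heaviest incident edge is >= w_max / 2."""
    d = math.dist
    # globally heaviest edge (the optimal k = 2 solution)
    w_max = max(d(a, b) for a, b in itertools.combinations(pts, 2))
    # whichever node is picked first, its best incident edge is >= w_max / 2
    return all(max(d(u, v) for v in pts if v is not u) >= w_max / 2
               for u in pts)

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(40)]
assert heaviest_incident_ok(pts)
```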

Theorem 4 The ABP algorithm achieves an approximation ratio of 2.

Proof We use the same mapping as above: a complete undirected graph where each node corresponds to a place (with its TQT) and carries an f(u) score, and each edge (u, v) carries its HDf(u, v) score as edge weight. The greedy heuristic of [31] achieves an approximation ratio of 2 when the edge weights satisfy the triangle inequality; it chooses, in every iteration, a new pair of elements with the maximum pairwise distance. ABP efficiently implements the same heuristic under this mapping: at each iteration, it adds to R the edge (u, v) with the current maximum HDf(u, v) score in G and then deletes this edge and all its adjacent edges from G, until ⌊k/2⌋ edges have been selected. Since HDf(u, v) satisfies the triangle inequality (Theorem 2), the approximation ratio of 2 carries over to ABP.
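The matching-style greedy that ABP implements can likewise be checked against a brute-force optimum on small inputs. Again, Euclidean weights stand in for HDf (an assumption); for 8 points and k = 4, the greedy total is within a factor of 2 of the best sum over disjoint pairs:

```python
import itertools
import math
import random

def greedy_pairs(pts, num_pairs):
    """Heaviest-remaining-edge greedy (the heuristic ABP implements)."""
    d = math.dist
    edges = sorted(itertools.combinations(range(len(pts)), 2),
                   key=lambda e: -d(pts[e[0]], pts[e[1]]))
    avail, total = set(range(len(pts))), 0.0
    for i, j in edges:
        if num_pairs == 0:
            break
        if i in avail and j in avail:       # drop edges adjacent to chosen ones
            total += d(pts[i], pts[j])
            avail -= {i, j}
            num_pairs -= 1
    return total

def best_pairs(pts, num_pairs):
    """Brute-force optimum over all sets of disjoint pairs."""
    d = math.dist
    edges = list(itertools.combinations(range(len(pts)), 2))
    best = 0.0
    for combo in itertools.combinations(edges, num_pairs):
        nodes = [n for e in combo for n in e]
        if len(set(nodes)) == 2 * num_pairs:        # disjoint pairs only
            best = max(best, sum(d(pts[i], pts[j]) for i, j in combo))
    return best

random.seed(3)
pts = [(random.random(), random.random()) for _ in range(8)]
assert 2 * greedy_pairs(pts, 2) >= best_pairs(pts, 2)   # 2-approximation
```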

Experiments
We evaluate the efficiency of the proposed greedy algorithms and the effectiveness (and approximation quality) of the proposed kDSP framework against the kSP framework of [45]. Finally, we present a user evaluation comparing kSP and kDSP results.

Setup
Datasets We used the datasets and settings of [45], namely DBpedia and YAGO (version 2.5).

Queries We defined the query points to be in metropolitan areas (e.g., New York, London, Beijing, Tokyo), which contain a plethora of places. For each of these cities, we extracted queries with locations and keywords that can return many results. More precisely, we used queries which, according to the kSP framework, produce at least five times more results than our largest k. Specifically, as discussed below in our experimental settings, the largest k is 20; thus, we generated queries (80 in total) for which the non-diversified version (i.e., kSP) can retrieve at least 100 results. We set Lmax to 5 · |q · ψ| (the concept of Lmax has been used often in earlier work, e.g., [36]) and Smax to the largest distance in the map of the city (this concept has also been used in earlier work [2]).

Platform All methods were implemented in Java and evaluated on a 2.7 GHz dual-core, quad-thread machine with 16 GB of memory, running Windows 10.

Experimental settings
Indexes and preprocessing costs The time and space costs of constructing the indexes and data structures are moderate (see [45] for details). For instance, constructing the R-trees of the DBpedia and YAGO datasets requires about 3 min and 31 min and occupies 50 MB and 273 MB, respectively. The inverted index construction requires about 4 min and 1 min and occupies 1307 MB and 231 MB, respectively; the DBpedia data is richer in text and therefore needs more time for building its inverted index. The reachability index (TFlabel) [11] construction requires 22 min and 6 min, respectively. The R-tree, TFlabel index, and RDF graph are all memory-resident; the inverted index is disk-resident.

Efficiency evaluation
In the first set of experiments (Figs. 8, 9), we measured the average run-time costs of the tested algorithms on the 80 queries for various parameter values on DBpedia and YAGO. We show (the average of) the combined costs of our algorithms (IAdU and ABP) with the kSP algorithms (BSP and SPP). Recall that our methods (IAdU and ABP) use a kSP algorithm as a module to incrementally retrieve places in order of relevance to the query q. Hence, there are four combinations: BSP+IAdU, BSP+ABP, SPP+IAdU, and SPP+ABP. The results show that kSP retrieval dominates the combined cost. Naturally, SPP outperforms BSP because of the two pruning rules employed by SPP [45]. The diversification algorithms, IAdU and ABP, require insignificant time in comparison with the kSP algorithms; in all cases, this cost is 1 ms or less. Hence, although ABP costs up to twice the time required by IAdU, this extra time is negligible when combined with the place-retrieval costs of BSP and SPP. Overall, the additional overhead required to achieve diversification is negligible in comparison with the kSP cost; however, the number of places K that have to be retrieved in order to answer a kDSP query is roughly K = 5 · k. Still, their retrieval is necessary to ensure that the results satisfy the diversification requirements.
Varying k. In Figs. 8a and 9a, we depict the effect of k. As expected, K and all respective costs increase with k. Observe that the kSP retrieval costs continue to dominate the overall cost of kDSP queries.
Varying |q · ψ|. Figures 8b and 9b show how the number of keywords affects the running time. As |q · ψ| increases, both kSP and kDSP costs slowly increase. kSP algorithms need to explore more RDF vertices in order to discover the TQTs covering all keywords. This also results in an increase in the size of the TQTs; in turn, kDSP algorithms spend more time on computing Jaccard distances. On the other hand, |q · ψ| has almost negligible impact on K .
Note that for the DBpedia dataset, the cost drops when |q · ψ| = 5, for two reasons. First, in these experiments we use different sets of queries for each value of |q · ψ| (e.g., {ancient} for |q · ψ| = 1, {ancient, roman} for |q · ψ| = 2, etc.), in contrast to the experiments varying k and λ (e.g., Fig. 8a, c), where the same queries are used throughout and only the values of k and λ change. Second, for larger values of |q · ψ| we could run fewer queries, because there are not many places that include all keywords in their trees; indicatively, in Fig. 8b, we ran 80 queries for |q · ψ| = 3 but only 63 for |q · ψ| = 5. Naturally, the latter queries mostly include frequent keywords whose graph distance to the places containing them is small. Hence, retrieving these places may be cheaper (i.e., their looseness scores may be easier to compute) than for the places retrieved by queries with smaller |q · ψ| values.
Varying λ, β, γ. Figures 8c and 9c show the effect of λ on the running time on the two datasets. Figures 10 and 11 show the effect of β and γ; the values of these parameters have no significant impact on the total time.

Effectiveness
We assess the effectiveness of the two frameworks (kSP against kDSP) by comparing their respective HDf(R) and Df(R) scores. (Note that the use of either BSP or SPP makes no difference, since both these algorithms return the same sets of places in the same order.) Figures 12 and 13 report these scores on the two datasets.

Varying k. Figures 12a and 13a show the effect of k on the HDf(R) and Df(R) scores. As the value of k increases, the improvement of the diversification algorithms over kSP shrinks, because it becomes harder to find additional diverse results (especially in the two-dimensional space of locations).
Varying |q · ψ|. Figures 12b and 13b show that as the number of keywords |q · ψ| increases, the improvement in HDf(R) and Df(R) also increases. This is because the size of the TQTs increases, making it easier to find diverse ones: more nodes (and more new paths) are included in the TQTs, which results in a larger expected Jaccard distance between two TQTs.
Varying λ, β, γ. Figures 12c and 13c show the effect of λ on the HDf(R) and Df(R) scores. We observe that as the value of λ increases, so does the improvement in both HDf(R) and Df(R). This is expected, as larger λ values favor diversity and consequently magnify the difference between the approaches. Figures 14 and 15 show the effect of β and γ on the HDf(R) and Df(R) scores. An interesting observation is that as γ increases, the improvement gap drops (recall that γ trades off spatial and content diversity). Larger γ values give a higher weight to content diversity, and since it is easier to find results that are diverse in content than in the two-dimensional space of locations, the result sets of both kSP and IAdU have higher chances of being diverse, which narrows the gap.

Re-ranking
Diversification achieves an effective re-ranking of search results by combining both relevance and diversity. This facilitates a bird's-eye view of results, which users prefer [1,10] (also verified by our own user evaluation, Sect. 8.4). Figure 16a, b depicts the average correlation of the place ranking of kSP with that of each kDSP algorithm (i.e., IAdU and ABP) over all tested queries (for the default settings). As k increases, the correlation between the kSP and kDSP rankings drops; namely, the kSP ranking of places differs significantly from the corresponding kDSP ranking. Our user evaluation (Sect. 8.4) revealed that the top kSP results are very similar to each other, whereas the top kDSP results are very diverse, offering a bird's-eye view of the results that our evaluators favored. Hence, our approach is useful even for large values of k, for which the HDf(R) improvement is small, as discussed before (see Figs. 12, 13). The figure also shows that the rankings of IAdU and ABP are highly correlated (i.e., their results have high overlap).
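The paper does not state which correlation coefficient Fig. 16 uses; as one common choice, Spearman's ρ for two permutations of the same result set can be computed as below (the rankings shown are hypothetical):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation of two rankings (permutations of one item set)."""
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    # sum of squared rank displacements
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

ksp = ["p1", "p2", "p3", "p4", "p5"]        # hypothetical kSP ranking
kdsp = ["p1", "p4", "p2", "p5", "p3"]       # hypothetical kDSP re-ranking
print(round(spearman_rho(ksp, kdsp), 3))    # prints 0.5
```

A value of 1 means identical rankings and −1 a fully reversed one, so a drop toward 0 with growing k matches the behavior reported for Fig. 16.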

User evaluation
We also conducted a user evaluation of kDSP and kSP queries, which confirms the users' preference for diversified results and the effectiveness of the re-ranking. We recruited ten evaluators, professors and researchers from our universities (none of them involved in this paper). First, we familiarized them with the query concepts and the relevance metrics (distance and looseness). In addition, we explained to them the concepts of (1) diversity and (2) a ranking facilitating a bird's-eye view; to avoid any bias, we did not discuss their advantages or disadvantages. Then, we presented to the evaluators ten random queries with their top-k kSP and kDSP results and asked them to evaluate (1) the general content of the results and (2) their ranking. For each set of top-k results, we showed a map with the places and the TQT of each place. We presented the output of each method in a random order (to avoid any bias) and asked the evaluators to give a preference score on a scale of one to ten, considering how representative and informative the overall top-k results were.

Figure 17a, d averages the evaluators' preference scores of the two methodologies (i.e., kSP and kDSP) for the two criteria (i.e., general content and ranking) and for various values of k on the two datasets (using the default settings). For kDSP, we used ABP (since IAdU and ABP give similar results). For the first criterion (general content), we observe that the users prefer the diversified results (kDSP) for small values of k (i.e., k = 5 and k = 10). For larger values of k, the gap between the preference for kDSP and kSP results shrinks, because the results of the two methodologies overlap more. For the second criterion (ranking), users prefer kDSP over kSP for all values of k.

The study revealed that for small values of k (i.e., k = 5 and k = 10), the results of the kSP approach included many similar places (e.g., for k = 3 in the example of Fig. 1, all three places were communes located in the same area); in contrast, the results of the kDSP approach included almost completely diverse places (e.g., for k = 3, only one place was a commune). This finding also explains the preference for the kDSP ranking for large values of k: the top places are typically diverse from each other, whereas only some bottom results bear similarity to previous ones, and users prefer this bird's-eye view. For example, for k = 20, the top 10 places are all of different types (e.g., a commune appears only once), while some of these types reappear in the bottom 10 places (e.g., additional communes). In general, the user evaluation findings are in accordance with the effectiveness findings (discussed in Sect. 8.3). Figure 17b averages the evaluators' preference scores of the two methodologies for various values of |q · ψ| (for k = 10). We observe that users prefer diversified results for all values of |q · ψ|; more precisely, the value of |q · ψ| does not significantly affect the difference between the preferences for kSP and kDSP results, which remains approximately the same for both criteria. Figure 17c averages the evaluators' preference scores of the two methodologies for various values of λ (k = 10). Note that for λ = 0, kDSP gives the same results as kSP. We observe that users prefer diversified results; more precisely, they prefer the results produced with λ = 0.5, which facilitate more effective diversification.

Discussion
In conclusion, the combination of SPP and ABP appears to be the best choice for diversified spatial keyword search on RDF data. SPP is very fast (compared to BSP) for kSP incremental search, while ABP is negligibly more expensive than IAdU and achieves better approximation and effectiveness scores. For example, search on DBpedia using SPP+ABP never requires more than a second; hence, real-time results can be obtained. Although the increase in k reduces the effectiveness improvement, the achieved re-ranking based on relevance and diversification remains useful for all values of k. Finally, our user evaluation confirms the users' preference for kDSP over kSP results and ranking.

Conclusions
In this work, we enrich spatial keyword search on RDF data with the ability to diversify query results. Our framework combines relevance and diversification, w.r.t. both content and location. We propose two greedy algorithms (IAdU and ABP) and provide theoretical guarantees for their quality. Our experiments on real data verify the effectiveness, approximation quality, and efficiency of our algorithms (where ABP is shown to be superior to IAdU) and confirm that our framework is preferred by human evaluators. In our future work, we will study alternative scoring functions for the spatial and content-based search components (e.g., road network distance in place of Euclidean distance).