Automatic Facet Generation and Selection over Knowledge Graphs

With the continuous growth of the Linked Data Cloud, adequate methods to efficiently explore semantic data are increasingly required. Faceted browsing is an established technique for exploratory search. Users are given an overview of a collection’s attributes that can be used to progressively refine their filter criteria and delve into the data. However, manual facet predefinition is often inappropriate for at least three reasons: Firstly, heterogeneous and large scale knowledge graphs offer a huge number of possible facets. Choosing among them may be virtually impossible without algorithmic support. Secondly, knowledge graphs are often constantly changing, hence, predefinitions need to be redone or adapted. Finally, facets are generally applied to only a subset of resources (e.g., search query results). Thus, they have to match this subset and not the knowledge graph as a whole. Precomputing facets for each possible subset is impractical except for very small graphs. We present our approach for automatic facet generation and selection over knowledge graphs. We propose methods for (1) candidate facet generation and (2) facet ranking, based on metrics that both judge a facet in isolation as well as in relation to others. We integrate those methods in an overall system workflow that also explores indirect facets, before we present the results of an initial evaluation.


Introduction
A facet is, by definition, a particular aspect or feature of something. In the present work, this is applied to a set of resources that could be viewed under different aspects. Each aspect is called a facet and consists of several categories, the facet values, which can be used to filter the initial resource set. The number of resources that are associated with a certain facet value is called its value size.
Considering an example, a list of books can be viewed under the aspect of their genre. Choosing the facet value science fiction, books of this specific genre would be selected. The number of selected resources then corresponds to the value size of the facet value science fiction. The same list could be viewed under the aspect of their publication year, with each sublist containing only books published in one particular year. These two aspects, genre and publication year, are just two of the many possible facets for books.
To obtain different facets, we assume each resource to have properties assigned, linking it either to other resources (e.g., genre, which may itself carry a description) or to plain literal values (e.g., publication year). While our method works on any resource set possessing such properties, we use semantic models as a rigorous formulation. In particular, we consider knowledge graphs (KGs). They provide significant advantages for the creation of facets: First of all, assuming the resources are drawn from a rich KG, we automatically get a large amount of direct resource information from their properties. Furthermore, the values of those properties may be resources themselves and can be used to generate indirect facets over the initial resource set. For example, an indirect facet for books can be an author's place of birth, where place of birth is linked to the author, not to the book itself.
However, considering continuously changing and heterogeneous resources, manually predefining facets is often impractical. Using concepts from large KGs, e.g., the Linked Data Cloud, for semantic annotation induces a large number of possible facets. Hence, an automated method has to rank the large number of candidate facets to be able to pick the most suitable ones among them.
Nevertheless, determining the single best facet is not enough. Users generally expect a list of facets to choose from. Moreover, this list should not be extremely long, and its items should be "useful" both individually and as a collection. Were it not for the requirement of usefulness as a collection, simply choosing the top-k highest-ranked facets would be sufficient. However, avoiding facets that are semantically very close to each other is important as well. Once such closely related candidates are identified, criteria need to be defined to decide which of them to drop to arrive at the final list of facets.
We propose an approach for dynamic facet generation and facet ranking over KGs. Our ranking is based on intra- and inter-facet metrics to determine the usefulness of a facet, also in the presence of others. A key aspect is exploiting indirect properties to find better categorizations. Since inter-facet metrics have not been satisfactorily addressed so far, we present semantic similarity as a usefulness criterion.
Based on our previously proposed workflow [1], we integrated all methods into an initial prototypical implementation [2]. While this leverages data from a specific KG, i.e., Wikidata [3], the methods we describe and use are generally applicable without or with only minimal changes to a wide range of KGs. Possible applications include exploratory browsing of a data catalog of semantically annotated datasets, or the reduction of a search result set using facets as filters.
In Sect. 2 we first revisit some of the related works in this direction. We then discuss methods we used for candidate facet generation and ranking in Sect. 3 and propose our workflow in Sect. 4. We present evaluation results in Sect. 5. Finally, we conclude and discuss future work in Sect. 6.

Related Work
Faceted browsing over KGs has been the subject of various research efforts, e.g., [4]. Prominent approaches such as Ontogator [5] or mSpace [6] use statically predefined facets for data navigation and do not consider continuously changing data sources. Moreover, their evaluation scenarios suppose data homogeneity and domain-dependent collections like cultural artifacts [5] or classical music [6].
Facet ranking is of particular importance for dynamic facet generation in order to select from the considerable number of facet candidates. Frequency-based ranking was adopted by [10, 11, 12, 15]. In Faceted Wikipedia [10], facet values are ranked based on their value sizes. For facet ranking, the most frequent facets corresponding to the selected type are candidates. They are ranked based on their most frequent facet value. Note that a ranking is applied only in case of resource type selection; otherwise, generic facets are displayed. VisiNav [11] also adopts a frequency-based approach to rank facets and facet values, inspired by PageRank [20]. The respective scores are calculated based on the PageRank score of the data sources [21]. Rhizomer [12] defines relevant facets based on the usage frequency of properties in the instances of the resource type and the number of different facet values. In Grafa [15], facets are ranked according to the number of search result resources that have a value for the specific facet, and facet values are ordered by PageRank. BrowseRDF [7] proposes three metrics to measure the quality of facets: (1) predicate balance, considering faceted browsing as the operation of traversing a decision tree where the tree should be well balanced; (2) object cardinality, the number of facet values, as also considered in [12]; and (3) predicate frequency, similar to [10, 12, 15]. The metrics are combined into a final score that is used to rank facets. In MediaFaces [16], facets are ranked based on the analysis of image search query logs and user tags of public Flickr images. Hippalus [17] introduces a different ranking approach involving user interactions, where users rank facets and facet values according to their manually defined preferences.
We notice that all the previously described efforts concerning facet ranking only involve intra-facet metrics that rate facets individually, without taking into consideration the significance of facet co-occurrence, or in other words, inter-facet metrics. To the best of our knowledge, only Facetedpedia [22] includes a metric for measuring the collective usefulness of a collection of facets. However, it does not take advantage of KGs or semantically annotated collections, but generates facets over Wikipedia pages based on the Wikipedia category system. They consider the navigational cost, i.e., the number of edges traversed, as an intra-facet metric that is based on the number of steps required to reach target articles and the number of choices at each step. Furthermore, facets are penalized if they have a low coverage, i.e., not all the articles can be reached using the considered facet. Besides the navigational cost, the average pairwise similarity is proposed as an inter-facet metric. However, the used metric is specifically designed to be applied to the Wikipedia category system and is not generic enough to express semantic similarity in the sense of arbitrary KGs.

Methods
Before presenting our proposed workflow, this section provides details on the employed methods. This includes initial candidate facet generation, handling of literal facet values, and the metrics used to compare facets. The latter discussion is split into two parts: Intra-facet metrics evaluate a facet in isolation, whereas inter-facet metrics judge facets in relation to others.

Candidate Facet Generation
We aim to generate facets over a set of resources given by their respective Internationalized Resource Identifiers (IRIs) within the KG. In such a graph, we treat the relations of the given resources as their properties, and thus any applicable property path is equivalent to a candidate facet. To achieve a better categorization of resources, we consider not only direct properties (i.e., values that are connected to the resource by a single link), but also indirect properties (i.e., chained links are needed to connect a resource and a value). As an example, consider a set of resources referring to people. A direct property can be derived from a relation place of birth pointing to instances of a class city. An indirect property could then also exploit an existing link between city and country to arrange the connected cities into possibly fewer categories. Indirect properties are only possible if the range of the associated relation is not a literal, as literals cannot be the subject of further statements in the standard RDF model.
A candidate facet is now given by a property path within the KG. In case of direct properties, this path is of length one, whereas for indirect properties any path length greater than one is possible. However, longer paths loosen the connection between resources and facet values. At some point, this renders a facet useless for the given task or at least makes it unclear to users how that facet is supposed to support them. Furthermore, longer paths increase the number of candidates and thus require more computations in later phases. For these reasons, we limit the path length for candidates by a threshold τ.
We categorize candidate facets into two types: (1) categorical facets that result from property paths connecting exclusively to other resources and (2) quantitative facets whose values are given by literals. While we allow quantitative candidates for numeric or date literals, we exclude string literals. The rationale is that those oftentimes contain labels or descriptions specific to single resources and, hence, are barely shared between different ones. As facets rely on common values to categorize the given input set, these properties will only rarely provide a suitable candidate facet. If a string value is common to multiple resources, there is a high chance that it should have been modeled as a distinct resource instead of a literal. Of course, resources are often not modeled perfectly. Future work might need to include string literals to be able to cope with this type of data.
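The enumeration described in this subsection can be illustrated on a toy in-memory graph. The following is a minimal sketch and not the prototype's implementation: the function name candidate_paths, the Lit wrapper for literals, and the triple representation are assumptions made for illustration only. It follows outgoing links for at most τ hops, skips string literals, and records which input resources support each property path.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    """Hypothetical wrapper marking a literal; plain str objects are IRIs."""
    value: object


def candidate_paths(triples, resources, tau=2):
    """Map each property path of length <= tau to the input resources supporting it."""
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))

    support = defaultdict(set)
    for r in resources:
        frontier = [((), r)]  # pairs of (path so far, current node)
        for _ in range(tau):
            next_frontier = []
            for path, node in frontier:
                for p, o in out_edges.get(node, []):
                    # String literals are excluded as facet values.
                    if isinstance(o, Lit) and isinstance(o.value, str):
                        continue
                    new_path = path + (p,)
                    support[new_path].add(r)
                    # Literals have no outgoing edges, so paths end there.
                    next_frontier.append((new_path, o))
            frontier = next_frontier
    return dict(support)
```

In this sketch, a path of length one corresponds to a direct property and any longer path to an indirect one; the size of each support set is what the predicate probability is later computed from.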

Clustering of Quantitative Facets
As mentioned before, facets can be created from numeric or date literals. Unlike categorical facets, it is highly unlikely that the number of distinct values is sufficiently small to generate a useful facet. However, these values can be clustered by dividing their continuous range into discrete subranges.
The clustering step is only applied to quantitative facets. It replaces the associated values with value ranges. The number of these clusters is determined by the optimum value cardinality as defined by the respective intra-facet metric (see Subsect. 3.3). The clustering technique itself is a consequence of the rationale behind another intra-facet metric, the value dispersion. It assembles approximately the same number of values in each cluster.
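One way to obtain clusters that each hold approximately the same number of values is equal-frequency binning over the sorted values. The sketch below assumes this technique; the text does not name the concrete algorithm, and the function name equal_frequency_clusters is hypothetical. The parameter k would be set to the optimum value cardinality.

```python
def equal_frequency_clusters(values, k):
    """Split values into at most k contiguous ranges of near-equal size.

    Returns a list of (low, high, count) tuples, one per cluster.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    k = max(1, min(k, n))  # never create more clusters than values
    clusters = []
    start = 0
    for i in range(k):
        # Spread the remainder over the first clusters so that
        # cluster sizes differ by at most one.
        end = start + n // k + (1 if i < n % k else 0)
        chunk = ordered[start:end]
        clusters.append((chunk[0], chunk[-1], len(chunk)))
        start = end
    return clusters
```

Note that this simple variant may place equal values in two neighboring clusters; a production version would likely need to move such boundaries so that the ranges do not overlap.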

Intra-Facet Metrics
To select the most useful facets among the candidates, we define metrics to judge their usefulness. The first set of metrics presented here assigns scores to individual candidates independently of each other. Each metric is designed to reflect one intuition of what constitutes a useful facet.
The first requirement concerns the applicability of the facet. For each facet we also include an unknown value. This accumulates the resources that do not support the respective property path, i.e., at least one of the corresponding relations is missing for this resource. For heterogeneous resource sets, the unknown value size will be non-zero for most facets. However, for a facet to be useful, it should apply to as many resources as possible. So we strive for the value size of unknown to be small in comparison with the overall size of the resource set.
These thoughts lead to the definition of the predicate probability of a facet f, score_predicateProb, as given in Eq. 1. For a randomly chosen resource, it gives the probability that the resource supports the property path of the given facet.
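Going by the description alone, the predicate probability can be read as the fraction of input resources that do not fall under the unknown value. The following sketch encodes that reading; the function and parameter names are hypothetical.

```python
def predicate_probability(num_resources, unknown_size):
    """Fraction of resources that support the facet's property path."""
    if num_resources <= 0:
        raise ValueError("the resource set must not be empty")
    if not 0 <= unknown_size <= num_resources:
        raise ValueError("unknown_size must lie between 0 and num_resources")
    return (num_resources - unknown_size) / num_resources
```

For example, a facet that leaves 50 out of 200 resources under unknown would score 0.75 under this reading.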
Our next requirement deals with the number of facet values. We consider a facet with only a single value as not useful, as it can not be used to narrow down the given set of resources. But then again, facets with too many values provide little help as well. Here, users have to scan through a long list of possible options, which may even rival the number of input resources. We believe that there is a number of values that is optimal in the sense that it balances between a concise categorization and a sufficient number of options to choose from.
Following these considerations, we define the value cardinality, score_valueCard, of a facet f with a number of values c_f as given in Eq. 2. The minimum cardinality is denoted by minCard and the optimal one by optCard. Note that we chose an asymmetric function that favors facets with fewer values rather than more. This follows the intuition that better categorizations tend to have fewer categories. The parameter θ ≠ 0 allows adjusting the preference for numbers of values between minCard and optCard.
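To make the described shape concrete, the sketch below shows one possible function with these properties. It is an illustrative assumption and not the function of Eq. 2: it returns zero below minCard, peaks at optCard, and decays with a Gaussian whose width is θ below the optimum and θ/2 above it, so that fewer values are favored over more. The halved width and all names are hypothetical.

```python
import math


def value_cardinality_score(c_f, min_card=2, opt_card=10, theta=3.0):
    """Illustrative asymmetric score in [0, 1] that peaks at opt_card."""
    if theta == 0:
        raise ValueError("theta must be non-zero")
    if c_f < min_card:
        return 0.0
    # Assumed asymmetry: the score drops faster above the optimum.
    width = theta if c_f <= opt_card else theta / 2.0
    return math.exp(-((c_f - opt_card) ** 2) / (2.0 * width ** 2))
```

With the defaults above, a facet with 8 values scores higher than one with 12 values, although both are two values away from the optimum.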
Our final requirement follows the principle of self-balancing search trees: Each decision made while traversing the tree should eliminate roughly the same number of results from consideration. In other words, no leaf node (representing a specific result) is preferred over others in terms of steps needed to reach it from the root node. Similarly, we do not want to favor any specific category.
For a facet, this means that all value sizes within a single facet should be approximately equal. As a measure for the variance in value sizes, we employ the coefficient of variation c_v (see Eq. 3). We chose this coefficient over the plain standard deviation, as it allows for a better comparison across multiple facets with possibly different value sizes. Using this, we define the value dispersion, score_dispersion, as given in Eq. 4. Here, N is the number of facet values, x_i denotes the value size of the i-th facet value, and x̄ is the average of all value sizes. We exclude the value size of the special facet value unknown from this calculation, as this value is already exploited in score_predicateProb.
All presented metrics are designed to return only values in the range between zero and one. In order to combine them into a single metric used in the ranking process (see Sect. 4), we can use a weighted average as shown in Eq. 5, where score_i ranges over the individual intra-facet metrics. With the individual weights summing up to one as well, we ensure that the final score is also between zero and one.
$$\mathit{score}(f) = \sum_i w_i \cdot \mathit{score}_i(f) \quad \text{with} \quad \sum_i w_i = 1 \qquad (5)$$
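The dispersion and the weighted combination can be sketched as follows. The coefficient of variation is computed in its standard population form (standard deviation divided by the mean); the mapping 1 / (1 + c_v) to a score between zero and one is an assumption chosen for illustration and may differ from Eq. 4. All function names are hypothetical.

```python
import math


def coefficient_of_variation(value_sizes):
    """Population standard deviation of the value sizes divided by their mean."""
    n = len(value_sizes)
    if n == 0:
        raise ValueError("at least one value size is required")
    mean = sum(value_sizes) / n
    if mean == 0:
        raise ValueError("mean value size must be positive")
    variance = sum((x - mean) ** 2 for x in value_sizes) / n
    return math.sqrt(variance) / mean


def dispersion_score(value_sizes):
    """Assumed mapping: 1.0 for perfectly balanced values, lower otherwise."""
    return 1.0 / (1.0 + coefficient_of_variation(value_sizes))


def combined_score(scores, weights):
    """Weighted average of the individual metric scores; weights must sum to one."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(weights[name] * scores[name] for name in weights)
```

The value size of unknown would be left out of value_sizes before calling dispersion_score, in line with the exclusion stated above.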

Inter-Facet Metrics
In contrast to their intra-facet counterparts, inter-facet metrics assess the relationship between different candidate facets. We use semantic similarity of facets as an inter-facet metric. The motivation is to prevent facets that are too close to one another and thus would provide about the same partitioning of the resource set. Moreover, semantically distant facets increase the chances of meeting users' information need and/or mindset. Generally, no restrictions are imposed on the semantic similarity measure chosen to be included in the current facet generation workflow. However, we base our workflow on a structure-based measure that combines the shortest path length and the depth. In particular, we consider the one proposed by [23] as reference similarity metric between two concepts c_i and c_j, defined as follows:
$$\mathit{sim}(c_i, c_j) = e^{-\alpha \cdot \mathit{length}(c_i, c_j)} \cdot \frac{e^{\beta \cdot \mathit{depth}(c_{lcs})} - e^{-\beta \cdot \mathit{depth}(c_{lcs})}}{e^{\beta \cdot \mathit{depth}(c_{lcs})} + e^{-\beta \cdot \mathit{depth}(c_{lcs})}}$$
where length(c_i, c_j) is the shortest path length between c_i and c_j, and depth(c_lcs) is the shortest path length between the Least Common Subsumer (LCS) of the two concepts, c_lcs, and the root concept. The parameters α ≥ 0 and β > 0 are used to adjust the importance assigned to the shortest path length and the depth, respectively. Based on the correlation evaluation conducted by [23], the optimal parameters are α = 0.2 and β = 0.6.
The previously defined semantic similarity metric takes a pair of concepts as input. Therefore, a mapping between properties and concepts needs to be available. For this purpose, we exploit a particular characteristic of Wikidata's data model: Properties are annotated with a matching entity. For example, the property author (P50) is itself linked to the entity author (Q482980). This allows us to retrieve entities corresponding to the property path of a facet.
When comparing two facets, we first retrieve the respective entities for the first property in their property paths. We then calculate the semantic similarity between the entity pair. Two entities are considered similar if sim is larger than a defined threshold σ. Since we calculate the similarity over the Wikidata taxonomy, we only consider links using subclass of (P279) and instance of (P31) here.
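The similarity test between two facets can be sketched as below. The sketch assumes that the measure of [23] takes the form of an exponential decay in the path length multiplied by the hyperbolic tangent of β times the LCS depth, and that path length and depth have already been obtained from the taxonomy. The function names and the default σ = 0.70 (the value used in the evaluation) are illustrative choices.

```python
import math


def semantic_similarity(length, lcs_depth, alpha=0.2, beta=0.6):
    """exp(-alpha * length) * tanh(beta * depth), assumed form of the measure."""
    if alpha < 0 or beta <= 0:
        raise ValueError("alpha must be >= 0 and beta must be > 0")
    return math.exp(-alpha * length) * math.tanh(beta * lcs_depth)


def too_similar(length, lcs_depth, sigma=0.70):
    """True if two concepts exceed the similarity threshold sigma."""
    return semantic_similarity(length, lcs_depth) > sigma
```

Under this form, two concepts that are close to each other and share a deep common subsumer score near one, whereas concepts whose only common subsumer is the root score zero.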

Workflow
We consider the facet generation to be part of larger applications. In particular, we assume that the retrieval of an initial resource set is subject to other independent components. Hence, details of the resource retrieval process are out of scope at this point. For the sake of argument, we base our workflow on the results of a keyword-based full text search over the string properties of entities in the KG. Its result is represented as a set of IRIs, each of which identifies a single result item or resource; this set forms the input to our proposed facet generation workflow. We structured the overall process into four phases as shown in Fig. 1.

Fig. 1. Phases of the facet generation process: (1) candidate generation, (2) intra-facet scoring and ranking, (3) selection of better categorization, and (4) inter-facet scoring and filtering.

Phase 1: Candidate Generation
This first phase enumerates possible facets by querying for a list of property paths associated with the input list of resources. As the predicate probability score_predicateProb is a simple metric, we choose to include it as part of the query. Candidates that have a score_predicateProb below a predefined threshold, minPredProb, are already removed in this phase. This reduces the necessary data transfers and avoids calculating computationally expensive metrics for these candidates. The result is a list of candidates, each comprising a basic graph pattern (BGP) that describes the facet and a score reflecting the fraction of resources it applies to.
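A query of this kind could be assembled as sketched below. This is not the query issued by the prototype: the function name build_candidate_query, the variable naming scheme, and the overall query shape are assumptions. The sketch uses only standard SPARQL 1.1 features (VALUES, GROUP BY, HAVING) and applies the minPredProb threshold inside the query by comparing the share of supporting resources.

```python
def build_candidate_query(iris, path_length, min_pred_prob=0.1):
    """Build a SPARQL query listing property paths of the given length
    together with the number of input resources that support them."""
    if path_length < 1 or not iris:
        raise ValueError("need at least one IRI and a path length >= 1")
    values = " ".join(f"<{iri}>" for iri in iris)
    props = [f"?p{i}" for i in range(1, path_length + 1)]
    nodes = ["?r"] + [f"?v{i}" for i in range(1, path_length + 1)]
    # Chain the triple patterns: ?r ?p1 ?v1 . ?v1 ?p2 ?v2 . ...
    pattern = " ".join(
        f"{nodes[i]} {props[i]} {nodes[i + 1]} ." for i in range(path_length)
    )
    group = " ".join(props)
    return (
        f"SELECT {group} (COUNT(DISTINCT ?r) AS ?support) WHERE {{ "
        f"VALUES ?r {{ {values} }} {pattern} }} "
        f"GROUP BY {group} "
        f"HAVING (COUNT(DISTINCT ?r) / {len(iris)} >= {min_pred_prob})"
    )
```

One such query would be issued per path length from 1 to τ. The sketch does not filter out string literals; a FILTER on the datatype of the last value would be one way to add this.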

Phase 2: Intra-Facet Scoring and Ranking
As a prerequisite for the remaining intra-facet metrics, now the facet values along with the respective value size are retrieved from the SPARQL endpoint.
We distinguish between object and data properties at this point. The latter are subjected to the clustering described in Subsect. 3.2 to derive comparable characteristics with regard to intra-facet metrics. After augmenting the facets with their respective values, the remaining intra-facet metrics, score_dispersion and score_valueCard, are calculated for all candidates. This allows us to compute the final intra-facet score, score(f), and accordingly rank all facets in decreasing order.

Phase 3: Selection of Better Categorization
The number of necessary inter-facet metrics calculations grows quadratically with the remaining number of candidates. To reduce the list of candidates before the next step, we exploit a key characteristic of the semantic similarity metric. The similarity only depends on the first direct property of each facet. Consequently, out of all candidates sharing the direct property, only one will be chosen for the final result, as all others will be too similar to it. Leveraging this observation, we can group the candidates by their direct properties and only choose the best-ranked one within each group.
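The grouping step can be sketched in a few lines. The representation of a candidate as a (path, score) pair and the function name are assumptions made for illustration; the input is expected to be sorted by decreasing intra-facet score, as produced by Phase 2.

```python
def best_per_direct_property(ranked):
    """Keep only the best-ranked candidate for each first (direct) property.

    ranked: list of (path, score) pairs in decreasing score order,
    where path is a tuple of properties.
    """
    chosen = {}
    for path, score in ranked:
        # The first candidate seen for a direct property is the best-ranked one.
        chosen.setdefault(path[0], (path, score))
    return list(chosen.values())
```

Because dictionaries preserve insertion order, the result stays sorted by decreasing score and can be passed on to the next phase unchanged.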

Phase 4: Inter-Facet Scoring and Filtering
The final result is derived by consecutively applying inter-facet metrics to chosen pairs of candidates. Calculating semantic similarities is rather expensive. To minimize the comparisons required, facets are selected in a greedy fashion.
Let C be the list of candidates in decreasing order w.r.t. the intra-facet metric scoring of Phase 2 and S be the final collection of facets as returned by Phase 4. Finally, S will contain a subset of facets deemed most suitable for the given input set of resources. The suitability has been determined by employing both the intra- and inter-facet metrics, which can be extended or changed without affecting the corresponding workflow. S can now be presented to users. Note that selecting a specific value and subsequently reducing the result set will trigger a new facet generation process, as the basis for our calculations (the input resource set) might have changed substantially.
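A greedy selection over C could look like the following sketch. The exact procedure is not spelled out in the text, so this is one plausible reading: candidates are visited in ranking order and added to S only if they are not too similar to any facet already selected. The similar callback and the optional size limit k are hypothetical.

```python
def greedy_select(candidates, similar, k=None):
    """Greedily build S from the ranked candidate list C.

    candidates: facets in decreasing intra-facet score order.
    similar: function (a, b) -> bool, True if two facets are too similar.
    k: optional maximum number of facets to return (assumed parameter).
    """
    selected = []
    for cand in candidates:
        # Compare only against facets already selected, which keeps
        # the number of expensive similarity computations low.
        if all(not similar(cand, s) for s in selected):
            selected.append(cand)
            if k is not None and len(selected) >= k:
                break
    return selected
```

In this reading, a lower-ranked facet is never compared against candidates that were themselves rejected, which is where the savings over a full pairwise comparison come from.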

Evaluation
The methods described in Sect. 3 were implemented in a prototype that issues dynamic SPARQL queries to the public SPARQL endpoint of Wikidata (WDQS). The source code is available online [2], under an MIT license.

Benchmarking
To evaluate the performance of our prototype we used a collection of IRIs extracted from Wikidata (instances of novel (Q8261) or its subclasses).
First, we examined the change in the number of candidates depending on the path length τ and the number of input IRIs. Results are shown in Table 1. As expected, the number of candidates increases significantly (about 20-fold) for each additional hop in the paths. However, a growth in input IRIs yields only a small effect in comparison. These figures, together with the considerations of Subsect. 3.1, led to a path length of τ = 2 for the remainder of the evaluation.
Subsequently, we looked at the run-time of our prototype for varying sizes of input IRIs. We fixed the semantic similarity threshold (σ = 0.70), the parameters for value cardinality scoring (optCard = 10, minCard = 2, and θ = 3), and the predicate probability threshold (minPredProb = 0.1). Figure 2 shows a breakdown of the measured execution times, averaged over about 350 individual measurements over the course of a week. We observe a less than linear growth of run-time depending on the input IRI size. The most expensive operations are (1) candidate generation, (2) facet value retrieval, and (3) semantic similarity. Other operations such as intra-facet metric calculation and selection of better categorization do not contribute significantly. A detailed analysis revealed that the execution times are largely dominated by querying the SPARQL endpoint.
Overall, we acknowledge that the current performance prohibits any productive use. However, the overwhelming impact of query response times on the overall execution time indicates potential for improvement. Further parallelization and caching of reoccurring queries might prove fruitful.

User Evaluation
Setup. In a survey-based user evaluation, we examined whether facets generated by the proposed workflow match user expectations. Based on a fictitious scenario, we assumed an initial search with the keyword "film".
After introducing users to the general concepts of faceted search and the given scenario, we asked for user preferences in a series of questions categorized into two kinds of situations: one for facet selection and one for facet ranking. In facet selection (cf. Fig. 3), users were presented with a static user interface that resembles a common search engine and includes three different facets, e.g., director of photography, production designer, and number of seasons. They were then given two more facets, e.g., genre and camera operator, and were asked which would be a better addition to the existing three facets. In facet ranking, we presented three to four different facets per question and asked users to rate their usefulness in the given scenario using a five point Likert scale [24].
Unlike facet selection, where only facet headers are shown, facet ranking also includes facet values. Unless noted otherwise, all facets and their values are modeled according to the data present in Wikidata as of February 2019 using a path length of τ = 2. The facets are generated by an initial, prototypical implementation of the workflow, but were manually adapted to reflect the respective evaluation intent to emphasize specific intra-facet scores.
Using these situations, the following order of questions was used in the survey. Overall, we created a pool of 43 questions, out of which a random subset of 15 was chosen for each user. This approach is intended to reduce the bias that might arise from certain terms used throughout.
Fig. 4. Usage of facets: very frequent 19%, frequent 50%, occasionally 19%, rarely 4%, very rarely 8%. An option "never" was provided, but not chosen by any user.
In a first set of questions, we focus on inter-facet comparisons using facet selection. In particular, this revolves around the selection of better categorization (Phase 3 in Sect. 4) and semantic similarity (Subsect. 3.4).
A second set of questions uses facet ranking with facets modeled after Wikidata. This compares multiple indirect facets with their respective direct counterparts. Here, the indirect facets also vary in their intra-facet scores, allowing us to evaluate our strategy in the selection of better categorization.
Finally, we used facet ranking, this time with abstract facets, i.e., replacing facet headers with "Facet 1" etc. and values with "Value 1" etc. The reason is again to reduce bias stemming from the actual semantics of the proposed facets. In this last part of the evaluation, we issued questions where the proposed facets differed only with respect to one intra-facet metric. In a similar fashion, we also examined combinations of two and all three proposed intra-facet metrics.
For the survey, we recruited 26 volunteers differing in age (18-44) and educational background. In total, they performed 130 facet selections and 936 individual facet ratings. Most of the participants stated at least an occasional use of facets, if they are provided (cf. Fig. 4). Consequently, we assume that they are familiar with the general behavior of faceted browsing.
Results. For each question in facet selection, we derive the percentage of participant selections that match the system decision. Figure 5 shows the results of the first question set, with each dot representing the agreement for one particular question. For the selection of better categorization, we see an overall agreement between the survey users and our system of ∼83%.
The average result for semantic similarity is mixed (∼63%). However, when analyzing the agreement per question, we see a more polarized result. While users most often agree on a specific facet, our system is not always able to concur with this choice. This leads us to believe that the survey responses were driven more by the applicability of the individual facet and not its relation to the already given ones. Yet, this is dependent on the available information and hence, out of control of the proposed workflow.
In facet ranking, we are not interested in the specific numerical values each metric provides, but focus on the ranking induced by those metrics. To compare the ranking determined by our system with the ranking induced by the survey responses, we encoded the latter using numerical values and calculated an average rating for each facet. For each question, we ranked the presented facets according to these ratings, which results in a survey ranking. We then chose Kendall's Tau-B to compare our system ranking with this survey ranking. The survey responses for the second question set, concerned with the selection of better categorization, are shown in the topmost lane of Fig. 6. The overall result shows no clear support for our approach in this step. When there was no (obvious) relation between the indirect property and the initial resource set (e.g., a facet for country of origin/driving side), users rated the facet rather low. However, the system sometimes favors these facets, as they oftentimes provide a good categorization with respect to the defined metrics. On other occasions, like the facet country of origin/continent, both users and the system agree that this is a helpful facet. This leads us to believe that, although indirect facets are promising, they require additional refinement to ensure their relevancy.
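For reference, Kendall's Tau-B between a system ranking and a survey ranking can be computed by counting concordant, discordant, and tied pairs. The sketch below is a straightforward pairwise implementation of the standard statistic; the function name and the input encoding (two equally long lists of numeric scores or ranks, one entry per facet) are assumptions and do not reflect the evaluation scripts used for the survey.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's Tau-B for two equally long sequences, accounting for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: counted in neither factor
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    if denom == 0:
        raise ValueError("Tau-B is undefined when one sequence is constant")
    return (concordant - discordant) / denom
```

A value of 1 indicates identical rankings, a value of -1 indicates fully reversed rankings, and ties in either ranking reduce the magnitude of the coefficient.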
The final question set verified our metrics independent of semantic biases induced by real-world facets. Results are shown in the lower parts of Fig. 6. In general, survey participants agree almost completely with our approach. The only exceptions are due to a tie (Card, Disp) or a different opinion about the order of one particular pair of facets (Disp, Prob and Card, Disp, Prob).
The user evaluation suggests that the technical criteria seem well suited in isolation. However, resulting facets not only have to be evaluated against each other, but also against the semantic context of the input IRIs. While in search tasks user input can be used to assess this intent, it remains open how this can automatically be approximated for arbitrary resource sets.

Conclusion
We have proposed methods to enable automatic facet generation and ranking over KGs. In particular, we provided an approach for dynamic candidate facet generation for arbitrary input sets of resources. We defined intra- and inter-facet metrics to rank the candidates and reduce the possible facet space by selecting the most useful ones. We explored indirect properties to find better categorizations and consequently enhance facets' usefulness. We proposed semantic similarity as a criterion to select among multiple candidate facets. Finally, we developed a holistic workflow that integrates all proposed methods.
Initial survey results support the used metrics. While indirect facets show promise as a helpful addition, their relevancy for the initial resource set needs to be ensured. This latter issue is also the main focus of our future efforts: How can we estimate the relatedness to the initial input for indirect facets? Another prime direction is a performance improvement of our initial prototype, to make it applicable for real-world systems (e.g., caching and parallelization of queries).