1 Introduction

In many knowledge bases, entities are described with numerous properties. However, not all properties are equally important. Some properties are considered keys for performing instance matching tasks, while others are generally chosen to quickly provide a summary of the key facts attached to an entity. Our motivation is to provide a method for selecting the properties that should be used when depicting the summary of an entity, for example in a multimedia question answering system such as QakisMedia (Footnote 1) or in a second-screen application providing more information about a particular TV program (Footnote 2).

Our approach consists of: (i) reverse engineering the Google Knowledge Panel by extracting the properties that Google considers important enough to show (Sect. 2), and (ii) analyzing users’ preferences by conducting a user survey and comparing the results (Sect. 3). Finally, we show how this knowledge of the preferred properties to attach to an entity can be represented explicitly using the Fresnel vocabulary, before concluding (Sect. 4).

2 Reverse Engineering the Google KG Panel

Web scraping is a technique for extracting data from Web pages. We aim to capture the properties depicted in the Google Knowledge Panel (GKP) that is injected in search result pages [1]. We have developed a Node.js application that queries for all DBpedia concepts having at least one instance that is owl:sameAs with a Freebase resource, in order to increase the probability that the search engine result page (SERP) for such a resource will contain a GKP. We assume in our experiment that the properties displayed for an entity are “entity-type dependent” and that the context (country, query, time, etc.) can affect the results. Moreover, we filter out generic concepts by excluding those that are direct subclasses of owl:Thing, since they would trigger ambiguous queries. We obtained a list of 352 concepts (Footnote 3).
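The concept selection step can be sketched as a SPARQL query against the DBpedia endpoint. This is an illustrative reconstruction under our own assumptions (prefixes, the Freebase URI test, and the generic-concept filter are ours), not the exact query used by the tool:

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?class WHERE {
  # classes with at least one instance aligned to Freebase
  ?instance a ?class ;
            owl:sameAs ?same .
  FILTER (STRSTARTS(STR(?same), "http://rdf.freebase.com/"))
  # filter out generic concepts sitting directly under owl:Thing
  FILTER NOT EXISTS { ?class rdfs:subClassOf owl:Thing }
}
```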

For each of these concepts, we retrieve n instances (Footnote 4). For each instance, we issue a Google search query containing the instance label. Google does not serve the GKP to all user agents, so we had to mimic browser behavior by setting the User-Agent header to that of a particular browser. We use CSS selectors to extract data from a GKP. An example of a query selector is ._om (all elements with class name _om), which returns the property DOM element(s) for the concept described in the GKP. From our experiments, we found that we do not always get a GKP in a SERP. When this happens, we disambiguate the instance by issuing a new query with the concept type appended. If no GKP is found even then, we record the instance for later manual inspection. Listing 1 gives the high-level algorithm for extracting the GKP. The full implementation can be found at https://github.com/ahmadassaf/KBE.
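As a rough sketch of this loop (the helper names and the commented-out network layer are our assumptions; only the ._om selector and the disambiguation strategy come from the description above):

```javascript
// Hypothetical sketch of the GKP extraction loop; not the actual KBE code.

// Google only serves the GKP to browser-like clients, hence a browser UA.
const BROWSER_UA = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36';

// Build the search query for an instance. On a first miss we retry with
// the concept type appended to disambiguate the query.
function buildQuery(label, conceptType, disambiguate) {
  return disambiguate ? `${label} ${conceptType}` : label;
}

// Extract property names from a parsed SERP document using the CSS class
// observed in the GKP markup at the time of the experiment.
function extractProperties(doc) {
  return Array.from(doc.querySelectorAll('._om'))
              .map(el => el.textContent.trim());
}

// High-level flow (network and HTML parsing omitted):
// for each concept C:
//   for each of the n instances I of C:
//     serp  = fetch SERP for buildQuery(I.label, C, false), UA = BROWSER_UA
//     props = extractProperties(serp)
//     if props is empty: retry with buildQuery(I.label, C, true)
//     if still empty:    record I for later manual inspection
```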

[Listing 1: high-level algorithm for extracting a GKP from a SERP]

3 Evaluation

We conducted a user survey in order to compare which properties users think should be displayed for a particular entity with the properties the GKP actually shows.

User survey. We set up a survey (Footnote 5) that ran for three weeks from February 25th, 2014, in order to collect users’ preferences regarding the properties they would like to be shown for a particular entity. We selected one representative entity for each of nine classes: TennisPlayer, Museum, Politician, Company, Country, City, Film, SoccerClub and Book. 152 participants provided answers: 72 % from academia, 20 % from industry and 8 % who did not declare their affiliation. 94 % of the respondents had heard about the Semantic Web, while 35 % were not familiar with specific visualization tools. The detailed results (Footnote 6) show the ranking of the top properties for each entity. We only keep the properties that received at least 10 % of the votes when comparing with the properties depicted in a GKP. Hence, users do not seem to be interested in the INSEE code identifying a French city, while they expect to see its population or points of interest.

Comparison with the Knowledge Graphs. The results of the Google Knowledge Panel (GKP) extraction (Footnote 7) clearly show a long-tail distribution of the properties depicted by Google, with the top N properties (N being 4, 5 or 6 depending on the entity) accounting for 98 % of the properties shown for a given type. We compare those properties with the ones revealed by the user study. Table 1 shows the agreement between the users and the choices made by Google in the GKP for the 9 classes. The highest agreement concerns the type Museum (66.97 %), while the lowest one is for the TennisPlayer concept (20 %). We believe that the properties for museums or books are more stable (little variety), while for subcategories of Person/Agent they vary considerably with status, function, etc., and are therefore more subjective.
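The agreement figures in Table 1 can be obtained by overlapping the two property sets. The exact measure is not spelled out in the text, so the sketch below is one plausible reading, assuming agreement is the share of GKP properties that users also voted for:

```javascript
// Illustrative agreement measure between the properties users chose
// (those with at least 10% of the votes) and the properties shown in the
// GKP. The actual measure behind Table 1 may differ; this is one
// plausible definition: % of GKP properties also present in user choices.
function agreement(userProps, gkpProps) {
  const users = new Set(userProps);
  const matched = gkpProps.filter(p => users.has(p)).length;
  return gkpProps.length === 0 ? 0 : (100 * matched) / gkpProps.length;
}
```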

Table 1. Agreement on properties between the users and the Knowledge Graph Panel

With this set of 9 concepts, we cover 301,189 DBpedia entities that also exist in Freebase, and for each of them, we can now empirically define the most important properties whenever there is an agreement between one of the biggest knowledge bases (Google’s) and users’ preferences.

Modeling the preferred properties with Fresnel. Fresnel (Footnote 8) is a presentation vocabulary for displaying RDF data. It specifies which information contained in an RDF graph should be presented, with fresnel:Lens as the core concept [2]. We use the Fresnel and PROV-O ontologies (Footnote 9) to explicitly represent which properties should be depicted when displaying an entity.

[Listing 2: Fresnel lens declaring the preferred properties of an entity type]
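As an illustration of this modeling, a Fresnel lens declaring preferred properties for the City class could look as follows in Turtle. The lens URI and the chosen properties are hypothetical; only fresnel:Lens, fresnel:classLensDomain and fresnel:showProperties come from the Fresnel vocabulary:

```turtle
@prefix fresnel: <http://www.w3.org/2004/09/fresnel#> .
@prefix dbo:     <http://dbpedia.org/ontology/> .
@prefix :        <http://example.org/lenses#> .

# Hypothetical lens: which properties to show when displaying a dbo:City
:CityDefaultLens a fresnel:Lens ;
    fresnel:classLensDomain dbo:City ;
    fresnel:showProperties ( dbo:populationTotal
                             dbo:country
                             dbo:areaTotal
                             dbo:mayor ) .
```

An application rendering an entity of type dbo:City would read this lens and display only the listed properties, in order.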

4 Conclusion and Future Work

We have shown that it is possible to reveal the “important” properties of entities by reverse engineering the choices made by Google when creating Knowledge Graph panels and by comparing them with users’ preferences obtained from a survey. Our motivation is to represent this choice explicitly, using the Fresnel vocabulary, so that any application can read this configuration file to decide which properties of an entity are worth visualizing. This is fundamentally different from the work in [4], where the authors created a generalizable approach to open up closed knowledge bases like Google’s by crowdsourcing the knowledge extraction task. We are aware that this knowledge is highly dynamic: the Google Knowledge Graph panel varies across geolocations and over time. We have provided code that enables new computations at run time, and we aim to study the temporal evolution of the important properties over a longer period. The knowledge captured so far will shortly be made available in a SPARQL endpoint. We are also investigating the use of Mechanical Turk to conduct a larger survey covering the complete set of DBpedia classes.