Dataset Summary Visualization with LODSight
We present a web-based tool that summarizes an RDF dataset as a visualization of a graph formed from the classes, datatypes and predicates used in the dataset. The visualization should allow users to find out quickly and easily what kind of data the dataset contains and how it is structured. It also shows how vocabularies are used in the dataset.
Keywords: DataProperty, SPARQL Endpoint, Incremental Exploration, Real-time Summarization, User Ontology
1 Introduction
In contrast to the RDBMS world, (RDF) datasets on the semantic web are usually not provided with a schema. There are of course RDFS/OWL vocabularies,1 but those only define sets of concepts that can be used in a dataset rather than which combinations of concepts should be used to describe the data. The relationship between a vocabulary and a dataset is thus far less strict than the relationship between an SQL database and its schema. If users encounter an RDF dataset they are not familiar with, finding out what kind of data it contains is nontrivial, since up-to-date and complete documentation of the dataset is usually not available. Conversely, when users encounter a vocabulary they are not familiar with, they can try to learn its proper usage by reading the documentation and inspecting the axioms, labels and comments in the RDFS/OWL source code. However, even the vocabulary documentation may be insufficient or missing, and the axioms do not fully specify the usage, as noted above. Then the only remaining option is to look at datasets where the vocabulary is used (provided they exist).
2 Related Research
The visualization in LODSight is based on the same principle as maps of ontology usage [3]. However, maps of ontology usage focus the visualization on entities from a single namespace, while we display all classes in one summarization regardless of namespace. Maps of ontology usage rely on YARS2, while we support remote summarization of, in theory, any SPARQL endpoint. ExpLOD [4] offers a more complex approach based on bisimulation contraction. The result is a node-link visualization similar to ours but more accurate: it shows combinations of links that reportedly exist in the dataset, while we show combinations of links that possibly exist. On the other hand, our visualization might be more intuitive, as it shows the types of instances directly as node labels, while ExpLOD shows types as separate nodes, which might lead to clutter. Unlike both maps of ontology usage and ExpLOD, we also allow the user to show examples of existing combinations of instances represented by the generalized (sub)graph. An approach similar to ExpLOD is presented by Li. None of the above-mentioned tools seems to be publicly available. SPARQture [6] shows a summary of the dataset as a mere list of classes but allows incremental exploration, showing predicates and example instances related to a selected class. Incremental exploration is applicable to datasets of the size of DBpedia, which is simply too big for a detailed view in LODSight. Somewhat similar results are offered by Rhizomer, which includes a method for displaying the hierarchy of classes used in the dataset in a treemap. We use the notion of type-property paths as mentioned by Presutti et al. [9] (and also as used in [10]), which aims at the construction of prototypical SPARQL queries showing examples of typical data. Campinas et al. [2] discuss efficient graph summarization using MapReduce, but we aim at easier applicability by accessing a SPARQL endpoint directly, without the need to transform the data.
Our approach is also inspired by the notion of characteristic sets [7] as used by Pham [8] to summarize a dataset into a relational schema.
3 Summarization Method
Summarization in LODSight is based on type-property and datatype-property paths. A type-property path of length 1 is a sequence type1 - property - type2, where type1 and type2 are the types of instances from the dataset that are connected by the property. We use the term path frequency to denote the number of triples ?s ?property ?o in the dataset where ?s is an instance of type1 and ?o is an instance of type2. We use the term datatype-property path for a sequence type - data property - datatype. Such a path is present in the dataset if there is a triple ?s ?dataproperty ?l, where ?s is an instance of type and ?l is a literal of type datatype. To create a dataset summary in LODSight, we first find all type-property paths and merge them into one graph. Then we add those datatype-property paths whose subject type is present in the type-property paths.
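The definitions above can be illustrated with a minimal in-memory sketch. It is not LODSight's implementation, which works against a SPARQL endpoint; the `Literal` class, the `summarize` function, the sample triples and the property names are all hypothetical, and "subject type is present in the type-property paths" is read here as "the type occurs at either end of some type-property path".

```python
from collections import Counter, defaultdict
from typing import NamedTuple

RDF_TYPE = "rdf:type"


class Literal(NamedTuple):
    """A literal value with its datatype (simplified, hypothetical model)."""
    value: str
    datatype: str


def summarize(triples):
    """Return (type-property path frequencies, datatype-property path frequencies)."""
    # Map each instance to the set of its declared types.
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    type_paths = Counter()
    datatype_paths = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for t1 in types.get(s, ()):
            if isinstance(o, Literal):
                # type - data property - datatype
                datatype_paths[(t1, p, o.datatype)] += 1
            else:
                # type1 - property - type2, counted once per matching triple
                for t2 in types.get(o, ()):
                    type_paths[(t1, p, t2)] += 1

    # Keep only datatype-property paths whose subject type already
    # occurs in some type-property path (assumed reading, see lead-in).
    used_types = {t for (t1, _, t2) in type_paths for t in (t1, t2)}
    kept = Counter({k: v for k, v in datatype_paths.items() if k[0] in used_types})
    return type_paths, kept


# Made-up sample data; property names are illustrative only.
TRIPLES = [
    ("ex:t1", RDF_TYPE, "pc:Tender"),
    ("ex:b1", RDF_TYPE, "gr:BusinessEntity"),
    ("ex:c1", RDF_TYPE, "schema:ContactPoint"),
    ("ex:x1", RDF_TYPE, "ex:Orphan"),
    ("ex:t1", "pc:supplier", "ex:b1"),
    ("ex:b1", "schema:contactPoint", "ex:c1"),
    ("ex:c1", "schema:telephone", Literal("+420 123", "xsd:string")),
    ("ex:x1", "ex:note", Literal("unlinked", "xsd:string")),
]

type_paths, datatype_paths = summarize(TRIPLES)
```

In this sample the `ex:Orphan` datatype-property path is dropped, because `ex:Orphan` does not take part in any type-property path.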
Evaluation. We tested LODSight on the first 32 available datasets listed in the default order at datahub.io.3 In 17 cases the summarization worked; in 10 of these, only the type-property paths were retrieved, while the queries for datatypes ended in errors.4 Fully successful summarizations (including datatypes) of datasets with triple counts between 18,834 and 11,485,244 took between 4 and 44 s. A successful summarization of 117,000,000 triples took 5 min and 7 s.
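To make the two kinds of queries concrete, the following sketch shows how type-property and datatype-property paths might be counted with SPARQL 1.1 aggregates, and how such a query could be sent as a GET request following the SPARQL protocol's `query` parameter. These are not the queries LODSight actually issues; the query texts, `build_request_url` and the endpoint URL are assumptions for illustration. The second query relies on `BIND` and `datatype()`, the functions the footnote names as a likely source of endpoint errors.

```python
from urllib.parse import urlencode

# Assumed query shape: count triples per (type1, property, type2).
TYPE_PROPERTY_PATHS_QUERY = """
SELECT ?type1 ?property ?type2 (COUNT(*) AS ?frequency)
WHERE {
  ?s ?property ?o .
  ?s a ?type1 .
  ?o a ?type2 .
}
GROUP BY ?type1 ?property ?type2
"""

# Assumed query shape: count triples per (type, property, datatype).
DATATYPE_PROPERTY_PATHS_QUERY = """
SELECT ?type ?property ?datatype (COUNT(*) AS ?frequency)
WHERE {
  ?s a ?type .
  ?s ?property ?l .
  FILTER(isLiteral(?l))
  BIND(datatype(?l) AS ?datatype)
}
GROUP BY ?type ?property ?datatype
"""


def build_request_url(endpoint, query):
    """Build a SPARQL-protocol GET URL; assumes the endpoint URL has no query string yet."""
    return endpoint + "?" + urlencode({"query": query.strip()})


# Hypothetical endpoint, used only to show the resulting URL shape.
url = build_request_url("http://example.org/sparql", TYPE_PROPERTY_PATHS_QUERY)
```

No request is sent here; the URL could be fetched with any HTTP client, with result handling depending on the response format the endpoint offers.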
Three main features contribute to the goals stated in the introduction. First, each node is colored according to its namespace, so the user can easily see which parts of the dataset structure are realized with which vocabularies. Second, the user can change the minimum frequency threshold of the type-property paths. A slider sets the threshold between 1 and a value near the maximum frequency in the dataset; a threshold near the maximum displays only the most frequent paths, thus showing the “core” structure of the dataset. Third, the user can select any subset of the nodes (a subgraph) to retrieve example instantiations of it. The instantiations are retrieved online from the SPARQL endpoint and displayed above the corresponding nodes one combination at a time, and the user can switch between them: see the example in Fig. 1, which shows a combination of instances of pc:Tender, gr:BusinessEntity, schema:ContactPoint and a phone number literal that are actually present in the dataset and linked as the visualization suggests.
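The first two features can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the color palette, the example IRIs and the frequencies are made up, and the namespace is approximated as everything up to the last `#` or `/` of the IRI.

```python
from itertools import cycle

# Arbitrary palette; colors repeat if there are more namespaces than entries.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]


def namespace_of(iri):
    """Return the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1] if cut >= 0 else iri


def color_by_namespace(iris):
    """Assign one color per namespace and return a node -> color mapping."""
    palette = cycle(PALETTE)
    namespace_colors = {}
    node_colors = {}
    for iri in iris:
        ns = namespace_of(iri)
        if ns not in namespace_colors:
            namespace_colors[ns] = next(palette)
        node_colors[iri] = namespace_colors[ns]
    return node_colors


def filter_paths(path_frequencies, min_frequency):
    """Keep only the paths whose frequency reaches the threshold."""
    return {p: f for p, f in path_frequencies.items() if f >= min_frequency}


# Made-up paths (type1, property, type2) with made-up frequencies.
A = "http://example.org/vocabA#"
B = "http://example.org/vocabB/"
PATHS = {
    (A + "Tender", A + "supplier", B + "BusinessEntity"): 120,
    (B + "BusinessEntity", B + "contact", B + "ContactPoint"): 15,
    (A + "Tender", A + "item", A + "Item"): 3,
}

core = filter_paths(PATHS, min_frequency=10)
nodes = sorted({t for (t1, _, t2) in PATHS for t in (t1, t2)})
colors = color_by_namespace(nodes)
```

Raising `min_frequency` plays the role of moving the slider: with a threshold of 10 the least frequent path disappears, and with a threshold of 100 only the most frequent one remains.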
The demonstration will include showing visualizations of 17 datasets at various levels of detail and retrieving example instantiations for selected subgraphs.
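Retrieving example instantiations for a selected subgraph amounts to turning the selected type-property paths into a query over instances. The sketch below shows one way this might be done; `build_example_query`, the variable naming and the example IRIs are hypothetical, and the actual query used by LODSight is not given in the text. It assumes one node per type and does not handle a path that links a type to itself.

```python
def build_example_query(paths, limit=10):
    """Build a SELECT query for instance combinations matching all selected paths.

    `paths` is an iterable of (type1, property, type2) IRIs. Each distinct
    type gets one variable, so a self-loop path is not supported here.
    """
    variables = {}

    def var(type_iri):
        # Reuse the same variable for every occurrence of a type.
        if type_iri not in variables:
            variables[type_iri] = f"?n{len(variables)}"
        return variables[type_iri]

    link_patterns = []
    for type1, prop, type2 in paths:
        s, o = var(type1), var(type2)
        link_patterns.append(f"  {s} <{prop}> {o} .")

    type_patterns = [f"  {v} a <{t}> ." for t, v in variables.items()]
    body = "\n".join(type_patterns + link_patterns)
    select = " ".join(variables.values())
    return f"SELECT {select}\nWHERE {{\n{body}\n}}\nLIMIT {limit}"


# Made-up selection of two connected paths.
selected = [
    ("http://example.org/Tender", "http://example.org/supplier",
     "http://example.org/BusinessEntity"),
    ("http://example.org/BusinessEntity", "http://example.org/contact",
     "http://example.org/ContactPoint"),
]
query = build_example_query(selected, limit=5)
```

Each result row of such a query would correspond to one combination of instances that could be displayed above the selected nodes.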
5 Conclusions and Future Work
We presented LODSight, an RDF dataset summary visualization tool that displays typical combinations of types and predicates in a way similar to other existing approaches. Our approach has four main advantages. First, it relies solely on SPARQL and thus can summarize (in theory) any dataset accessible through a SPARQL endpoint. Second, it lets the user change the level of detail dynamically and shows summarizations of very large datasets. Third, it provides an option to load examples of the instances and literals represented by a selected subgraph. Last, it makes the visualizations publicly available through a web browser. Real-time summarization might be considered for smaller datasets: preliminary results suggest that processing several hundred triples takes less than 20 s. For large datasets, implementing incremental exploration similar to SPARQture might be an option. We also plan to improve the reliability of the summarization, since the preliminary evaluation shows it worked on only 53 % of the tested datasets. A possible workaround is to copy the dataset into a more reliable triple store and run the summarization there. Future evaluation will include tests with a group of users, students with basic knowledge of linked data, who will be asked to describe the displayed data or to learn the usage of a vocabulary.
This research is supported by VŠE IGA No. F4/90/2015 and long-term institutional support of research activities by Faculty of Informatics and Statistics, University of Economics, Prague.
1. By vocabulary we mean ontology or vocabulary.
2. Available at http://lod2-dev.vse.cz/lodsight.
3. The list was filtered to show only datasets provided with SPARQL endpoint.
4. The problem seems to be in support of demanding queries and bind() and datatype() functions in specific endpoints.
5. Available at http://lod2-dev.vse.cz/lodsight/?sumid=5024128&minfreq=1.
- 2. Campinas, S., et al.: Efficiency and precision trade-offs in graph summary algorithms. In: Proceedings of the 17th International Database Engineering & Applications Symposium, pp. 38–47. ACM (2013)
- 3. Kinsella, S., et al.: An interactive map of semantic web ontology usage. In: 12th International Conference Information Visualisation, IV 2008, pp. 179–184. IEEE (2008)
- 4. Khatchadourian, S., Consens, M.P.: ExpLOD: summary-based exploration of interlinking and RDF usage in the linked open data cloud. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010, Part II. LNCS, vol. 6089, pp. 272–287. Springer, Heidelberg (2010)
- 6. Maali, F.: SPARQture: A More Welcoming Entry to SPARQL Endpoints. http://ceur-ws.org/Vol-1279/iesd14_9.pdf
- 7. Neumann, T., Moerkotte, G.: Characteristic sets: accurate cardinality estimation for RDF queries with multiple joins. In: 2011 IEEE 27th International Conference on Data Engineering (ICDE), pp. 984–994. IEEE (2011)
- 8. Pham, M.: Self-organizing structured RDF in MonetDB. In: 2013 IEEE 29th International Conference on Data Engineering Workshops (ICDEW), pp. 310–313. IEEE (2013)
- 9. Presutti, V., et al.: Extracting core knowledge from linked data. In: Proceedings of the Second Workshop on Consuming Linked Data, COLD 2011 (2011)
- 10. Svátek, V., et al.: B-Annot: supplying background model annotations for ontology coherence testing. In: 3rd Workshop on Debugging Ontologies and Ontology Mappings at ESWC 2014, Heraklion, Crete (2014)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.