An analytical study of large SPARQL query logs

Bonifati, Angela; Martens, Wim; Timm, Thomas

doi:10.1007/s00778-019-00558-9

Special Issue Paper
Published: 02 August 2019

Volume 29, pages 655–679, (2020)
The VLDB Journal

Abstract

With the adoption of RDF as the data model for Linked Data and the Semantic Web, query specification from end users has become more and more common in SPARQL endpoints. In this paper, we conduct an in-depth analytical study of the queries formulated by end users and harvested from large and up-to-date structured query logs from a wide variety of RDF data sources. As opposed to previous studies, ours is the first assessment on a voluminous query corpus, spanning several years and covering many representative SPARQL endpoints. Apart from the syntactical structure of the queries, which already exhibits interesting results on this generalized corpus, we drill deeper into the structural characteristics related to the graph and hypergraph representation of queries. We outline the most common shapes of queries when visually displayed as undirected graphs and characterize their treewidth, the length of their cycles, the maximal degree of their nodes, and more. For queries that cannot be adequately represented as graphs, we investigate their hypergraphs and hypertreewidth. Moreover, we analyze the evolution of queries over time by introducing the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query. Our study offers several fresh insights into the already rich query features of real SPARQL queries formulated by real users, and leads us to draw a number of conclusions and pinpoint future directions for SPARQL query evaluation, query optimization, tuning, and benchmarking.

Notes

The exception is [21], where logs from the Linked SPARQL Queries (LSQ) dataset were studied, combining data from four sources (from 2010 and 2014) that we also consider.
We consider extensions with Filter, Opt, and Values, but only in a way for which we know that tree-likeness of the query graph ensures the existence of efficient evaluation algorithms.
For instance, as can be seen immediately in Fig. 1, the DBpedia endpoint receives many more large queries than the unique valid logs lead us to suspect.
http://www.openlinksw.com.
http://aksw.github.io/LSQ/.
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples.
We discovered that three log files had been sent to us by both USEWOD and OpenLink: the two copies of each file differed only in the hash values used for anonymization. These duplicate log files were deleted prior to all analysis and are not taken into account in Table 1.
The query was called “Public Art in Paris” and was malformed (closing braces were missing and it had a bad aggregate). It was still malformed on June 29th, 2017.
We also investigated the occurrence of other operators (Service, Bind, Assign, Data, Dataset, Sample, Group Concat), each of which appeared in less than 1% of the queries. We omit them from the table for succinctness.
The remaining solution modifier, Reduced, was only found in 6,126 (1,149) queries.
Conjunctions in SPARQL are actually denoted by “.” or “;” for brevity, but we group them under “And” in this paper for readability.
For instance, 95% (97%) of the Describe statements in our corpus do not have a body and therefore no triples.
There is one exception: For Wikidata, we removed SERVICE subqueries before the analysis (these appear in approximately 200 of its queries and are used to change the language of the output).
This difference can be understood as follows: If the query tests the presence of a k-clique, then without projection we are given a k-tuple of nodes and need to verify if they form a k-clique. With projection, we need to solve the NP-complete k-clique problem.
In particular, it consists of finding embeddings of the directed and edge-labeled variant of the graph, but we omit the edge directions and labels for simplicity. They do not influence the structure and cyclicity of graph patterns.
https://github.com/graphMark/gmark.
We recall that gMark can generate queries of four shapes: chain, star, chain–star, and cycle. We have thus cherry-picked chain queries as representatives of queries with hypertreewidth equal to 1.
Every CPU has 6 physical cores and, with hyperthreading, 12 logical cores.
This refers to the case in which we let PG run beyond the time-out and collect the new numbers.
Perez et al.’s definition also has a safety condition on the filter statements of the patterns, but the omission of this condition does not affect the results in this paper.
Available on https://maxbannach.github.io/Jdrasil/.
The free or distinguished variables of a query considered as a first-order propositional formula are the set of variables used as output in the formula.
Even though our set of Wikidata queries is very small, Malyshev et al. [39] recently found a similar percentage of property path usage in Wikidata logs consisting of \(\sim\)480M valid queries.
We normalized the measure by dividing the Levenshtein distance by the length of the longer string.
For 88,201 streaks, all queries had an empty body. Another 31 streaks had a non-empty body, containing no triples.
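
The note on k-cliques above contrasts two tasks: checking whether a given k-tuple of nodes forms a clique, and deciding whether any k-clique exists at all. The following is a minimal Python sketch of that contrast, not the authors' method. The function names and the edge-set representation (a set of two-element frozensets) are assumptions made for illustration, and the search is a plain brute-force enumeration.

```python
from itertools import combinations


def is_clique(edges, nodes):
    """Verification: check that every pair of the given nodes is adjacent.

    This corresponds to the case without projection, where a candidate
    k-tuple is given and only needs to be checked (quadratic in k).
    """
    return all(frozenset((u, v)) in edges for u, v in combinations(nodes, 2))


def has_k_clique(edges, k):
    """Search: decide whether some k-clique exists.

    This corresponds to the case with projection. The brute-force search
    below tries every k-subset of vertices, which is exponential in k.
    """
    vertices = set().union(*edges)
    return any(is_clique(edges, cand) for cand in combinations(sorted(vertices), k))


# Hypothetical example graph: a triangle a-b-c with a pendant edge c-d.
edges = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]}
print(is_clique(edges, ("a", "b", "c")))
print(has_k_clique(edges, 4))
```

Verification inspects only the pairs of one tuple, whereas the search may have to inspect every k-subset of the vertices, which is where the hardness comes from.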
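
The note above on omitting edge directions and labels can be illustrated with a small Python sketch. It builds an undirected graph from triple patterns by keeping only subject and object, and then tests for a cycle with union-find. This is a simplified illustration under stated assumptions, not the construction used in the paper: the helper names are hypothetical, and parallel triples between the same two terms collapse into a single edge here.

```python
def query_graph(triples):
    """Return undirected edges {subject, object}; predicates and directions are dropped."""
    return {frozenset((s, o)) for s, _predicate, o in triples}


def is_acyclic(edges):
    """Union-find cycle test on an undirected simple graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge in edges:
        if len(edge) == 1:
            return False  # a self-loop counts as a cycle in this sketch
        u, v = edge
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # u and v were already connected, so this edge closes a cycle
        parent[root_u] = root_v
    return True


# Hypothetical triple patterns: a chain and a triangle.
chain = [("?x", ":p", "?y"), ("?y", ":q", "?z")]
triangle = chain + [("?z", ":r", "?x")]
print(is_acyclic(query_graph(chain)))
print(is_acyclic(query_graph(triangle)))
```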
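
The normalization described in the note on Levenshtein distance (dividing the distance by the length of the longer string) can be sketched in Python as follows. The function names are illustrative, and returning 0.0 for two empty strings is an assumption of this sketch, since the note does not cover that case.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0  # assumption: two empty strings are treated as identical
    return levenshtein(a, b) / longer


print(normalized_levenshtein("kitten", "sitting"))
```

The result always lies between 0 (identical strings) and 1 (no character can be kept), which makes distances between queries of different lengths comparable.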

References

Aberger, C.R., Tu, S., Olukotun, K., Ré, C.: EmptyHeaded: a relational engine for graph processing. In: International Conference on Management of Data (SIGMOD), pp. 431–446 (2016)
Aberger, C.R., Tu, S., Olukotun, K., Ré, C.: Old techniques for new join algorithms: a case study in RDF processing. In: International Conference on Data Engineering (ICDE) Workshops, pp. 97–102 (2016)
Arias, M., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical study of real-world SPARQL queries. CoRR, arXiv:1103.5043 (2011)
Bagan, G., Bonifati, A., Ciucanu, R., Fletcher, G.H.L., Lemay, A., Advokaat, N.: gmark: Schema-driven generation of graphs and queries. IEEE Trans. Knowl. Data Eng. 29(4), 856–869 (2017)
Bagan, G., Bonifati, A., Groz, B.: A trichotomy for regular simple path queries on graphs. In: Principles of Database Systems (PODS), pp. 261–272 (2013)
Bagan, G., Durand, A., Grandjean, E.: On acyclic conjunctive queries and constant delay enumeration. In: Computer Science Logic (CSL), pp. 208–222 (2007)
Bannach, M., Berndt, S., Ehlers, T.: Jdrasil: a modular library for computing tree decompositions. In: 16th International Symposium on Experimental Algorithms (SEA), pp. 28:1–28:21 (2017)
Barceló, P., Pichler, R., Skritek, S.: Efficient evaluation and approximation of well-designed pattern trees. In: Principles of Database Systems (PODS), pp. 131–144 (2015)
Bielefeldt, A., Gonsior, J., Krötzsch, M.: Practical linked data access via SPARQL: the case of wikidata. In: Workshop on Linked Data (LDOW) (2018)
Bonifati, A., Fletcher, G.H.L., Voigts, H., Yakovets, N.: Querying Graphs. Synthesis Lectures on Data Management. Morgan & Claypool, San Rafael (2018)
Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query logs. PVLDB 11(2), 149–161 (2017)
Bonifati, A., Martens, W., Timm, T.: Navigating the maze of Wikidata query logs. In: The World Wide Web Conference (WWW), pp. 127–138 (2019)
Chandra, A., Merlin, P.: Optimal implementation of conjunctive queries in relational data bases. In: Symposium on the Theory of Computing (STOC), pp. 77–90 (1977)
Chekuri, C., Rajaraman, A.: Conjunctive query containment revisited. In: International Conference on Database Theory (ICDT), pp. 56–70 (1997)
Czerwiński, W., Martens, W., Niewerth, M., Parys, P.: Minimization of tree patterns. J. ACM 65(4), 26:1–26:46 (2018)
Czerwiński, W., Martens, W., Parys, P., Przybylko, M.: The (almost) complete guide to tree pattern containment. In: ACM Symposium on Principles of Database Systems (PODS), pp. 117–130 (2015)
detkdecomp. wwwinfo.deis.unical.it/~frank/Hypertrees. Visited on August 10th (2016)
Gottlob, G., Greco, G., Leone, N., Scarcello, F.: Hypertree decompositions: questions and answers. In: Principles of Database Systems (PODS), pp. 57–74 (2016)
Gottlob, G., Greco, G., Scarcello, F.: Treewidth and hypertree width. In: Tractability: Practical Approaches to Hard Problems, pp. 3–38. Cambridge University Press (2014)
Gottlob, G., Leone, N., Scarcello, F.: Hypertree decompositions and tractable queries. J. Comput. Syst. Sci. 64(3), 579–627 (2002)
Han, X., Feng, Z., Zhang, X., Wang, X., Rao, G., Jiang, S.: On the statistical analysis of practical SPARQL queries. In: WebDB, p. 2 (2016)
Harris, S., Seaborne, A.: SPARQL 1.1 query language. Technical report, World Wide Web Consortium (W3C), March (2013). https://www.w3.org/TR/2013/REC-sparql11-query-20130321
Huelss, J., Paulheim, H.: What SPARQL query logs tell and do not tell about semantic relatedness in LOD—or: The unsuccessful attempt to improve the browsing experience of DBpedia by exploiting query logs. In: ESWC Satellite Events, pp. 297–308 (2015)
http://ldbcouncil.org
https://databasetheory.org/node/47
http://wikidata.org
http://wiki.dbpedia.org/datasets
http://www.blazegraph.com
http://www.opencypher.org/ocig2
http://www.postgresql.org
Idris, M., Ugarte, M., Vansummeren, S.: The dynamic yannakakis algorithm: compact and efficient query processing under updates. In: International Conference on Management of Data (SIGMOD), pp. 1259–1274 (2017)
Jagadish, H. V., Chapman, A., Elkiss, A., Jayapandian, M., Li, Y., Nandi, A., Yu, C.: Making database systems usable. In: International Conference on Management of Data (SIGMOD), pp. 13–24 (2007)
Kalinsky, O., Etsion, Y., Kimelfeld, B.: Flexible caching in trie joins. In: International Conference on Extending Database Technology (EDBT), pp. 282–293 (2017)
Kaminski, M., Kostylev, E.V.: Beyond well-designed SPARQL. In: International Conference on Database Theory (ICDT), pp. 5:1–5:18 (2016)
Kimelfeld, B., Sagiv, Y.: Revisiting redundancy and minimization in an XPath fragment. In: International Conference on Extending Database Technology (EDBT), pp. 61–72 (2008)
Kröll, M., Pichler, R., Skritek, S.: On the complexity of enumerating the answers to well-designed pattern trees. In: International Conference on Database Theory (ICDT), pp. 22:1–22:18 (2016)
Letelier, A., Pérez, J., Pichler, R., Skritek, S.: Static analysis and optimization of semantic web queries. ACM Trans. Database Syst. 38(4), 25:1–25:45 (2013)
Libkin, L., Martens, W., Vrgoc, D.: Querying graphs with data. J. ACM 63(2), 14:1–14:53 (2016)
Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the most out of wikidata: Semantic technology usage in wikipedia’s knowledge graph. In: International Semantic Web Conference (ISWC), pp. 376–394 (2018)
Martens, W., Niehren, J.: On the minimization of XML schemas and tree automata for unranked trees. J. Comput. Syst. Sci. 73(4), 550–583 (2007)
Martens, W., Trautner, T.: Enumeration problems for regular path queries. CoRR, arXiv:1710.02317 (2017)
Miklau, G., Suciu, D.: Containment and equivalence for a fragment of xpath. J. ACM 51(1), 2–45 (2004)
Möller, K., Hausenblas, M., Cyganiak, R., Handschuh, S., Grimnes, G.: Learning from linked open data usage: Patterns & metrics. In: Web Science Conference (WSC) (2010)
Morsey, M., Lehmann, J., Auer, S., Ngomo, A.N.: DBpedia SPARQL benchmark—performance assessment with real queries on real data. In: International Semantic Web Conference (ISWC), pp. 454–469 (2011)
Nandi, A., Jagadish, H.V.: Guided interaction: rethinking the query-result paradigm. PVLDB 4(12), 1466–1469 (2011)
Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91–113 (2010)
Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 16:1–16:45 (2009)
Picalausa, F., Vansummeren, S.: What are real SPARQL queries like? In: International Workshop on Semantic Web Information Management (SWIM), pp. 1–7 (2011)
Saleem, M., Ali, I., Hogan, A., Mehmood, Q., Ngonga Ngomo, A.-C.: LSQ: The linked SPARQL queries dataset. In: International Semantic Web Conference (ISWC), pp. 261–269 (2015)
Vidal, M., Ruckhaus, E., Lampo, T., Martínez, A., Sierra, J., Polleres, A.: Efficiently joining group patterns in SPARQL queries. In: Extended Semantic Web Conference (ESWC), pp. 228–242 (2010)
Vrandecic, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)

Acknowledgements

We would like to acknowledge USEWOD and Patrick van Kleef, together with the team of OpenLink Software, for hosting the official DBpedia endpoint and granting us access to the large DBpedia query logs analyzed in this paper. We thank Stijn Vansummeren for his suggestion to investigate free-connex acyclicity of queries.

Author information

Authors and Affiliations

Lyon 1 University, Lyon, France
Angela Bonifati
University of Bayreuth, Bayreuth, Germany
Wim Martens & Thomas Timm

Corresponding author

Correspondence to Angela Bonifati.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bonifati, A., Martens, W. & Timm, T. An analytical study of large SPARQL query logs. The VLDB Journal 29, 655–679 (2020). https://doi.org/10.1007/s00778-019-00558-9

Received: 12 December 2018
Revised: 03 May 2019
Accepted: 16 July 2019
Published: 02 August 2019
Issue Date: May 2020
DOI: https://doi.org/10.1007/s00778-019-00558-9
