Abstract
With the adoption of RDF as the data model for Linked Data and the Semantic Web, query specification from end users has become more and more common in SPARQL endpoints. In this paper, we conduct an in-depth analytical study of the queries formulated by end users and harvested from large and up-to-date structured query logs from a wide variety of RDF data sources. As opposed to previous studies, ours is the first assessment on a voluminous query corpus, spanning several years and covering many representative SPARQL endpoints. Apart from the syntactical structure of the queries, which already exhibits interesting results on this generalized corpus, we drill deeper into the structural characteristics related to the graph and hypergraph representation of queries. We outline the most common shapes of queries when visually displayed as undirected graphs and characterize their treewidth, the length of their cycles, the maximal degree of their nodes, and more. For queries that cannot be adequately represented as graphs, we investigate their hypergraphs and hypertreewidth. Moreover, we analyze the evolution of queries over time by introducing the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query. Our study offers several fresh insights into the already rich query features of real SPARQL queries formulated by real users, and leads us to draw a number of conclusions and pinpoint future directions for SPARQL query evaluation, query optimization, tuning, and benchmarking.
Notes
The exception is [21], where logs from the Linked SPARQL Queries (LSQ) dataset were studied, combining data from four sources (from 2010 and 2014) that we also consider.
We consider extensions with Filter, Opt, and Values, but only in a way for which we know that tree-likeness of the query graph ensures the existence of efficient evaluation algorithms.
For instance, as can be seen immediately in Fig. 1, the DBpedia endpoint receives many more large queries than the unique valid logs lead us to suspect.
We discovered that three of the log files we received from USEWOD were also among those we received from OpenLink: the copies were identical except for the hash values used for anonymization. These duplicate log files were deleted prior to all analysis and are not taken into account in Table 1.
The query was called “Public Art in Paris” and was malformed (closing braces were missing and it had a bad aggregate). It was still malformed on June 29th, 2017.
We also investigated the occurrence of other operators (Service, Bind, Assign, Data, Dataset, Sample, Group Concat), each of which appeared in less than 1% of the queries. We omit them from the table for succinctness.
The remaining solution modifier, Reduced, was only found in 6,126 (1,149) queries.
Conjunctions in SPARQL are actually denoted by “.” or “;” for brevity, but we group them under “And” in this paper for readability.
For instance, 95% (97%) of the Describe statements in our corpus do not have a body and therefore no triples.
There is one exception: For Wikidata, we removed SERVICE subqueries before the analysis (these appear in approximately 200 of its queries and are used to change the language of the output).
This difference can be understood as follows: If the query tests the presence of a k-clique, then without projection we are given a k-tuple of nodes and need to verify if they form a k-clique. With projection, we need to solve the NP-complete k-clique problem.
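The contrast in the preceding note can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `is_clique` and `has_k_clique`, and the representation of a graph as a set of `frozenset` edges, are assumptions made for illustration only. Without projection, the candidate nodes are given and only need to be checked. With projection, a naive approach has to search over all k-subsets of nodes.

```python
from itertools import combinations


def is_clique(edges, nodes):
    """Verification (no projection): the k nodes are given as input.

    Checks every pair among the given nodes, i.e. O(k^2) edge lookups.
    `edges` is a set of frozensets {u, v} (hypothetical representation).
    """
    return all(frozenset((u, v)) in edges for u, v in combinations(nodes, 2))


def has_k_clique(edges, vertices, k):
    """Search (with projection): no candidate nodes are given.

    This brute-force version tries every k-subset of the vertices,
    which is exponential in k in the worst case.
    """
    return any(is_clique(edges, candidate) for candidate in combinations(vertices, k))


# A triangle 1-2-3 with an extra edge 3-4.
edges = {frozenset(pair) for pair in [(1, 2), (2, 3), (1, 3), (3, 4)]}
vertices = [1, 2, 3, 4]

print(is_clique(edges, (1, 2, 3)))        # checks a given tuple
print(has_k_clique(edges, vertices, 3))   # searches for any 3-clique
```

The brute-force search is only meant to make the difference in problem statement visible. It says nothing about how an actual query engine would evaluate such a pattern.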
In particular, it consists of finding embeddings of the directed and edge-labeled variant of the graph, but we omit the edge directions and labels for simplicity. They do not influence the structure and cyclicity of graph patterns.
We recall that gMark can generate queries of four shapes: chain, star, chain–star, and cycle. We have thus cherry-picked chain queries as representatives of queries with hypertreewidth equal to 1.
Every CPU has 6 physical cores and, with hyperthreading, 12 logical cores.
That is, in the case in which we let PG run beyond the time-out and collect the new numbers.
Perez et al.’s definition also has a safety condition on the filter statements of the patterns, but the omission of this condition does not affect the results in this paper.
Available on https://maxbannach.github.io/Jdrasil/.
The free or distinguished variables of a query considered as a first-order propositional formula are the set of variables used as output in the formula.
Even though our set of Wikidata queries is very small, Malyshev et al. [39] recently found a similar percentage of property path usage in Wikidata logs consisting of approximately 480M valid queries.
We normalized the measure by dividing the Levenshtein distance by the length of the longer string.
For 88,201 streaks, all queries had an empty body. Another 31 streaks had a non-empty body, containing no triples.
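The normalized Levenshtein measure described in the notes above (edit distance divided by the length of the longer string) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the handling of two empty strings is an assumption, since the notes do not cover that case.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance.

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0  # assumption: two empty strings count as identical
    return levenshtein(a, b) / longer


print(normalized_levenshtein("kitten", "sitting"))  # 3 edits over length 7
```

The result always lies between 0 (identical strings) and 1 (no character position can be kept), which makes distances comparable across query strings of different lengths.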
References
Aberger, C.R., Tu, S., Olukotun, K., Ré, C.: EmptyHeaded: a relational engine for graph processing. In: International Conference on Management of Data (SIGMOD), pp. 431–446 (2016)
Aberger, C.R., Tu, S., Olukotun, K., Ré, C.: Old techniques for new join algorithms: a case study in RDF processing. In: International Conference on Data Engineering (ICDE) Workshops, pp. 97–102 (2016)
Arias, M., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical study of real-world SPARQL queries. CoRR, arXiv:1103.5043 (2011)
Bagan, G., Bonifati, A., Ciucanu, R., Fletcher, G.H.L., Lemay, A., Advokaat, N.: gMark: Schema-driven generation of graphs and queries. IEEE Trans. Knowl. Data Eng. 29(4), 856–869 (2017)
Bagan, G., Bonifati, A., Groz, B.: A trichotomy for regular simple path queries on graphs. In: Principles of Database Systems (PODS), pp. 261–272 (2013)
Bagan, G., Durand, A., Grandjean, E.: On acyclic conjunctive queries and constant delay enumeration. In: Computer Science Logic (CSL), pp. 208–222 (2007)
Bannach, M., Berndt, S., Ehlers, T.: Jdrasil: a modular library for computing tree decompositions. In: 16th International Symposium on Experimental Algorithms (SEA), pp. 28:1–28:21 (2017)
Barceló, P., Pichler, R., Skritek, S.: Efficient evaluation and approximation of well-designed pattern trees. In: Principles of Database Systems (PODS), pp. 131–144 (2015)
Bielefeldt, A., Gonsior, J., Krötzsch, M.: Practical linked data access via SPARQL: the case of Wikidata. In: Workshop on Linked Data (LDOW) (2018)
Bonifati, A., Fletcher, G.H.L., Voigts, H., Yakovets, N.: Querying Graphs. Synthesis Lectures on Data Management. Morgan & Claypool, San Rafael (2018)
Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query logs. PVLDB 11(2), 149–161 (2017)
Bonifati, A., Martens, W., Timm, T.: Navigating the maze of Wikidata query logs. In: The World Wide Web Conference (WWW), pp. 127–138 (2019)
Chandra, A., Merlin, P.: Optimal implementation of conjunctive queries in relational data bases. In: Symposium on the Theory of Computing (STOC), pp. 77–90 (1977)
Chekuri, C., Rajaraman, A.: Conjunctive query containment revisited. In: International Conference on Database Theory (ICDT), pp. 56–70 (1997)
Czerwiński, W., Martens, W., Niewerth, M., Parys, P.: Minimization of tree patterns. J. ACM 65(4), 26:1–26:46 (2018)
Czerwiński, W., Martens, W., Parys, P., Przybylko, M.: The (almost) complete guide to tree pattern containment. In: ACM Symposium on Principles of Database Systems (PODS), pp. 117–130 (2015)
detkdecomp. wwwinfo.deis.unical.it/~frank/Hypertrees. Visited on August 10th (2016)
Gottlob, G., Greco, G., Leone, N., Scarcello, F.: Hypertree decompositions: questions and answers. In: Principles of Database Systems (PODS), pp. 57–74 (2016)
Gottlob, G., Greco, G., Scarcello, F.: Treewidth and hypertree width. In: Tractability: Practical Approaches to Hard Problems, pp. 3–38. Cambridge University Press (2014)
Gottlob, G., Leone, N., Scarcello, F.: Hypertree decompositions and tractable queries. J. Comput. Syst. Sci. 64(3), 579–627 (2002)
Han, X., Feng, Z., Zhang, X., Wang, X., Rao, G., Jiang, S.: On the statistical analysis of practical SPARQL queries. In: WebDB, p. 2 (2016)
Harris, S., Seaborne, A.: SPARQL 1.1 query language. Technical report, World Wide Web Consortium (W3C), March (2013). https://www.w3.org/TR/2013/REC-sparql11-query-20130321
Huelss, J., Paulheim, H.: What SPARQL query logs tell and do not tell about semantic relatedness in LOD—or: The unsuccessful attempt to improve the browsing experience of DBpedia by exploiting query logs. In: ESWC Satellite Events, pp. 297–308 (2015)
Idris, M., Ugarte, M., Vansummeren, S.: The dynamic Yannakakis algorithm: compact and efficient query processing under updates. In: International Conference on Management of Data (SIGMOD), pp. 1259–1274 (2017)
Jagadish, H. V., Chapman, A., Elkiss, A., Jayapandian, M., Li, Y., Nandi, A., Yu, C.: Making database systems usable. In: International Conference on Management of Data (SIGMOD), pp. 13–24 (2007)
Kalinsky, O., Etsion, Y., Kimelfeld, B.: Flexible caching in trie joins. In: International Conference on Extending Database Technology (EDBT), pp. 282–293 (2017)
Kaminski, M., Kostylev, E.V.: Beyond well-designed SPARQL. In: International Conference on Database Theory (ICDT), pp. 5:1–5:18 (2016)
Kimelfeld, B., Sagiv, Y.: Revisiting redundancy and minimization in an XPath fragment. In: International Conference on Extending Database Technology (EDBT), pp. 61–72 (2008)
Kröll, M., Pichler, R., Skritek, S.: On the complexity of enumerating the answers to well-designed pattern trees. In: International Conference on Database Theory (ICDT), pp. 22:1–22:18 (2016)
Letelier, A., Pérez, J., Pichler, R., Skritek, S.: Static analysis and optimization of semantic web queries. ACM Trans. Database Syst. 38(4), 25:1–25:45 (2013)
Libkin, L., Martens, W., Vrgoc, D.: Querying graphs with data. J. ACM 63(2), 14:1–14:53 (2016)
Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the most out of Wikidata: Semantic technology usage in Wikipedia’s knowledge graph. In: International Semantic Web Conference (ISWC), pp. 376–394 (2018)
Martens, W., Niehren, J.: On the minimization of XML schemas and tree automata for unranked trees. J. Comput. Syst. Sci. 73(4), 550–583 (2007)
Martens, W., Trautner, T.: Enumeration problems for regular path queries. CoRR, arXiv:1710.02317 (2017)
Miklau, G., Suciu, D.: Containment and equivalence for a fragment of XPath. J. ACM 51(1), 2–45 (2004)
Möller, K., Hausenblas, M., Cyganiak, R., Handschuh, S., Grimnes, G.: Learning from linked open data usage: Patterns & metrics. In: Web Science Conference (WSC) (2010)
Morsey, M., Lehmann, J., Auer, S., Ngomo, A.N.: DBpedia SPARQL benchmark—performance assessment with real queries on real data. In: International Semantic Web Conference (ISWC), pp. 454–469 (2011)
Nandi, A., Jagadish, H.V.: Guided interaction: rethinking the query-result paradigm. PVLDB 4(12), 1466–1469 (2011)
Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91–113 (2010)
Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 16:1–16:45 (2009)
Picalausa, F., Vansummeren, S.: What are real SPARQL queries like? In: International Workshop on Semantic Web Information Management (SWIM), pp. 1–7 (2011)
Saleem, M., Ali, I., Hogan, A., Mehmood, Q., Ngonga Ngomo, A.-C.: LSQ: The linked SPARQL queries dataset. In: International Semantic Web Conference (ISWC), pp. 261–269 (2015)
Vidal, M., Ruckhaus, E., Lampo, T., Martínez, A., Sierra, J., Polleres, A.: Efficiently joining group patterns in SPARQL queries. In: Extended Semantic Web Conference (ESWC), pp. 228–242 (2010)
Vrandecic, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)
Acknowledgements
We would like to acknowledge USEWOD and Patrick van Kleef together with the team of OpenLink Software for hosting the official DBpedia endpoint and granting us access to the large DBpedia query logs analyzed in this paper. We thank Stijn Vansummeren for his suggestion to investigate free-connex acyclicity of queries.
About this article
Cite this article
Bonifati, A., Martens, W. & Timm, T. An analytical study of large SPARQL query logs. The VLDB Journal 29, 655–679 (2020). https://doi.org/10.1007/s00778-019-00558-9
DOI: https://doi.org/10.1007/s00778-019-00558-9