An analytical study of large SPARQL query logs

  • Special Issue Paper
  • The VLDB Journal

Abstract

With the adoption of RDF as the data model for Linked Data and the Semantic Web, query specification from end users has become more and more common in SPARQL endpoints. In this paper, we conduct an in-depth analytical study of the queries formulated by end users and harvested from large and up-to-date structured query logs from a wide variety of RDF data sources. As opposed to previous studies, ours is the first assessment on a voluminous query corpus, spanning several years and covering many representative SPARQL endpoints. Apart from the syntactical structure of the queries, which already exhibits interesting results on this generalized corpus, we drill deeper into the structural characteristics related to the graph and hypergraph representation of queries. We outline the most common shapes of queries when visually displayed as undirected graphs and characterize their treewidth, the length of their cycles, the maximal degree of their nodes, and more. For queries that cannot be adequately represented as graphs, we investigate their hypergraphs and hypertreewidth. Moreover, we analyze the evolution of queries over time by introducing the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query. Our study offers several fresh insights into the already rich query features of real SPARQL queries formulated by real users and leads us to draw a number of conclusions and pinpoint future directions for SPARQL query evaluation, query optimization, tuning, and benchmarking.

Notes

  1. The exception is [21], where logs from the Linked SPARQL Queries (LSQ) dataset were studied, combining data from four sources (from 2010 and 2014) that we also consider.

  2. We consider extensions with Filter, Opt, and Values, but only in a way for which we know that tree-likeness of the query graph ensures the existence of efficient evaluation algorithms.

  3. For instance, as can be seen immediately in Fig. 1, the DBpedia endpoint receives many more large queries than the unique valid logs lead us to suspect.

  4. http://www.openlinksw.com.

  5. http://aksw.github.io/LSQ/.

  6. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples.

  7. We discovered that three log files had been provided to us by both USEWOD and OpenLink; the two copies differed only in the hash values used for anonymization. These duplicate log files were deleted prior to all analysis and are not taken into account in Table 1.

  8. The query was called “Public Art in Paris” and was malformed (closing braces were missing and it had a bad aggregate). It was still malformed on June 29th, 2017.

  9. We also investigated the occurrence of other operators (Service, Bind, Assign, Data, Dataset, Sample, Group Concat), each of which appeared in less than 1% of the queries. We omit them from the table for succinctness.

  10. The remaining solution modifier, Reduced, was only found in 6,126 (1,149) queries.

  11. Conjunctions in SPARQL are actually denoted by “.” or “;” for brevity, but we group them under “And” in this paper for readability.

  12. For instance, 95% (97%) of the Describe statements in our corpus do not have a body and therefore no triples.

  13. There is one exception: For Wikidata, we removed SERVICE subqueries before the analysis (these appear in approximately 200 of its queries and are used to change the language of the output).

  14. This difference can be understood as follows: If the query tests the presence of a k-clique, then without projection we are given a k-tuple of nodes and need to verify if they form a k-clique. With projection, we need to solve the NP-complete k-clique problem.

  15. In particular, it consists of finding embeddings of the directed and edge-labeled variant of the graph, but we omit the edge directions and labels for simplicity. They do not influence the structure and cyclicity of graph patterns.

  16. https://github.com/graphMark/gmark.

  17. We recall that gMark can generate queries of four shapes: chain, star, chain–star, and cycle. We have thus cherry-picked chain queries as representatives of queries with hypertreewidth equal to 1.

  18. Every CPU has 6 physical cores and, with hyperthreading, 12 logical cores.

  19. In the case in which we let PG run beyond the time-out and collect the new numbers.

  20. Pérez et al.’s definition also has a safety condition on the filter statements of the patterns, but the omission of this condition does not affect the results in this paper.

  21. Available on https://maxbannach.github.io/Jdrasil/.

  22. The free or distinguished variables of a query considered as a first-order propositional formula are the set of variables used as output in the formula.

  23. Even though our set of Wikidata queries is very small, Malyshev et al. [39] recently found a similar percentage of property path usage in Wikidata logs consisting of \(\sim\)480M valid queries.

  24. We normalized the measure by dividing the Levenshtein distance by the length of the longer string.

  25. For 88,201 streaks, all queries had an empty body. Another 31 streaks had a non-empty body, containing no triples.
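
The normalization described in note 24 can be illustrated with a minimal sketch. It assumes the standard Levenshtein distance with unit costs for insertion, deletion, and substitution; the source does not state which variant or implementation was used, and the function names `levenshtein` and `normalized_levenshtein` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b (assumed variant)."""
    # Keep the shorter string in the inner loop to limit row length.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (note 24)."""
    longer = max(len(a), len(b))
    if longer == 0:
        # Assumption: two empty strings are treated as identical.
        return 0.0
    return levenshtein(a, b) / longer
```

Under this normalization the measure lies between 0 (identical strings) and 1 (no characters can be reused), so query pairs of different lengths become comparable.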

References

  1. Aberger, C.R., Tu, S., Olukotun, K., Ré, C.: EmptyHeaded: a relational engine for graph processing. In: International Conference on Management of Data (SIGMOD), pp. 431–446 (2016)

  2. Aberger, C.R., Tu, S., Olukotun, K., Ré, C.: Old techniques for new join algorithms: a case study in RDF processing. In: International Conference on Data Engineering (ICDE) Workshops, pp. 97–102 (2016)

  3. Arias, M., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical study of real-world SPARQL queries. CoRR, arXiv:1103.5043 (2011)

  4. Bagan, G., Bonifati, A., Ciucanu, R., Fletcher, G.H.L., Lemay, A., Advokaat, N.: gMark: Schema-driven generation of graphs and queries. IEEE Trans. Knowl. Data Eng. 29(4), 856–869 (2017)

  5. Bagan, G., Bonifati, A., Groz, B.: A trichotomy for regular simple path queries on graphs. In: Principles of Database Systems (PODS), pp. 261–272 (2013)

  6. Bagan, G., Durand, A., Grandjean, E.: On acyclic conjunctive queries and constant delay enumeration. In: Computer Science Logic (CSL), pp. 208–222 (2007)

  7. Bannach, M., Berndt, S., Ehlers, T.: Jdrasil: a modular library for computing tree decompositions. In: 16th International Symposium on Experimental Algorithms (SEA), pp. 28:1–28:21 (2017)

  8. Barceló, P., Pichler, R., Skritek, S.: Efficient evaluation and approximation of well-designed pattern trees. In: Principles of Database Systems (PODS), pp. 131–144 (2015)

  9. Bielefeldt, A., Gonsior, J., Krötzsch, M.: Practical linked data access via SPARQL: the case of Wikidata. In: Workshop on Linked Data (LDOW) (2018)

  10. Bonifati, A., Fletcher, G.H.L., Voigts, H., Yakovets, N.: Querying Graphs. Synthesis Lectures on Data Management. Morgan & Claypool, San Rafael (2018)

  11. Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query logs. PVLDB 11(2), 149–161 (2017)

  12. Bonifati, A., Martens, W., Timm, T.: Navigating the maze of Wikidata query logs. In: The World Wide Web Conference (WWW), pp. 127–138 (2019)

  13. Chandra, A., Merlin, P.: Optimal implementation of conjunctive queries in relational data bases. In: Symposium on the Theory of Computing (STOC), pp. 77–90 (1977)

  14. Chekuri, C., Rajaraman, A.: Conjunctive query containment revisited. In: International Conference on Database Theory (ICDT), pp. 56–70 (1997)

  15. Czerwiński, W., Martens, W., Niewerth, M., Parys, P.: Minimization of tree patterns. J. ACM 65(4), 26:1–26:46 (2018)

  16. Czerwiński, W., Martens, W., Parys, P., Przybylko, M.: The (almost) complete guide to tree pattern containment. In: ACM Symposium on Principles of Database Systems (PODS), pp. 117–130 (2015)

  17. detkdecomp. wwwinfo.deis.unical.it/~frank/Hypertrees. Visited on August 10th (2016)

  18. Gottlob, G., Greco, G., Leone, N., Scarcello, F.: Hypertree decompositions: questions and answers. In: Principles of Database Systems (PODS), pp. 57–74 (2016)

  19. Gottlob, G., Greco, G., Scarcello, F.: Treewidth and hypertree width. In: Tractability: Practical Approaches to Hard Problems, pp. 3–38. Cambridge University Press (2014)

  20. Gottlob, G., Leone, N., Scarcello, F.: Hypertree decompositions and tractable queries. J. Comput. Syst. Sci. 64(3), 579–627 (2002)

  21. Han, X., Feng, Z., Zhang, X., Wang, X., Rao, G., Jiang, S.: On the statistical analysis of practical SPARQL queries. In: WebDB, p. 2 (2016)

  22. Harris, S., Seaborne, A.: SPARQL 1.1 query language. Technical report, World Wide Web Consortium (W3C), March (2013). https://www.w3.org/TR/2013/REC-sparql11-query-20130321

  23. Huelss, J., Paulheim, H.: What SPARQL query logs tell and do not tell about semantic relatedness in LOD—or: The unsuccessful attempt to improve the browsing experience of DBpedia by exploiting query logs. In: ESWC Satellite Events, pp. 297–308 (2015)

  24. http://ldbcouncil.org

  25. https://databasetheory.org/node/47

  26. http://wikidata.org

  27. http://wiki.dbpedia.org/datasets

  28. http://www.blazegraph.com

  29. http://www.opencypher.org/ocig2

  30. http://www.postgresql.org

  31. Idris, M., Ugarte, M., Vansummeren, S.: The dynamic Yannakakis algorithm: compact and efficient query processing under updates. In: International Conference on Management of Data (SIGMOD), pp. 1259–1274 (2017)

  32. Jagadish, H. V., Chapman, A., Elkiss, A., Jayapandian, M., Li, Y., Nandi, A., Yu, C.: Making database systems usable. In: International Conference on Management of Data (SIGMOD), pp. 13–24 (2007)

  33. Kalinsky, O., Etsion, Y., Kimelfeld, B.: Flexible caching in trie joins. In: International Conference on Extending Database Technology (EDBT), pp. 282–293 (2017)

  34. Kaminski, M., Kostylev, E.V.: Beyond well-designed SPARQL. In: International Conference on Database Theory (ICDT), pp. 5:1–5:18 (2016)

  35. Kimelfeld, B., Sagiv, Y.: Revisiting redundancy and minimization in an XPath fragment. In: International Conference on Extending Database Technology (EDBT), pp. 61–72 (2008)

  36. Kröll, M., Pichler, R., Skritek, S.: On the complexity of enumerating the answers to well-designed pattern trees. In: International Conference on Database Theory (ICDT), pp. 22:1–22:18 (2016)

  37. Letelier, A., Pérez, J., Pichler, R., Skritek, S.: Static analysis and optimization of semantic web queries. ACM Trans. Database Syst. 38(4), 25:1–25:45 (2013)

  38. Libkin, L., Martens, W., Vrgoc, D.: Querying graphs with data. J. ACM 63(2), 14:1–14:53 (2016)

  39. Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the most out of Wikidata: Semantic technology usage in Wikipedia’s knowledge graph. In: International Semantic Web Conference (ISWC), pp. 376–394 (2018)

  40. Martens, W., Niehren, J.: On the minimization of XML schemas and tree automata for unranked trees. J. Comput. Syst. Sci. 73(4), 550–583 (2007)

  41. Martens, W., Trautner, T.: Enumeration problems for regular path queries. CoRR, arXiv:1710.02317 (2017)

  42. Miklau, G., Suciu, D.: Containment and equivalence for a fragment of XPath. J. ACM 51(1), 2–45 (2004)

  43. Möller, K., Hausenblas, M., Cyganiak, R., Handschuh, S., Grimnes, G.: Learning from linked open data usage: Patterns & metrics. In: Web Science Conference (WSC) (2010)

  44. Morsey, M., Lehmann, J., Auer, S., Ngomo, A.N.: DBpedia SPARQL benchmark—performance assessment with real queries on real data. In: International Semantic Web Conference (ISWC), pp. 454–469 (2011)

  45. Nandi, A., Jagadish, H.V.: Guided interaction: rethinking the query-result paradigm. PVLDB 4(12), 1466–1469 (2011)

  46. Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91–113 (2010)

  47. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 16:1–16:45 (2009)

  48. Picalausa, F., Vansummeren, S.: What are real SPARQL queries like? In: International Workshop on Semantic Web Information Management (SWIM), pp. 1–7 (2011)

  49. Saleem, M., Ali, I., Hogan, A., Mehmood, Q., Ngonga Ngomo, A.-C.: LSQ: The linked SPARQL queries dataset. In: International Semantic Web Conference (ISWC), pp. 261–269 (2015)

  50. Vidal, M., Ruckhaus, E., Lampo, T., Martínez, A., Sierra, J., Polleres, A.: Efficiently joining group patterns in SPARQL queries. In: Extended Semantic Web Conference (ESWC), pp. 228–242 (2010)

  51. Vrandecic, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)

Acknowledgements

We would like to acknowledge USEWOD and Patrick van Kleef together with the team of OpenLink Software for hosting the official DBpedia endpoint and granting us access to the large DBpedia query logs analyzed in this paper. We thank Stijn Vansummeren for his suggestion to investigate free-connex acyclicity of queries.

Author information

Corresponding author

Correspondence to Angela Bonifati.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bonifati, A., Martens, W. & Timm, T. An analytical study of large SPARQL query logs. The VLDB Journal 29, 655–679 (2020). https://doi.org/10.1007/s00778-019-00558-9

  • DOI: https://doi.org/10.1007/s00778-019-00558-9
