World Wide Web, Volume 14, Issue 5–6, pp 495–544

Comparing data summaries for processing live queries over Linked Data

  • Jürgen Umbrich
  • Katja Hose
  • Marcel Karnstedt
  • Andreas Harth
  • Axel Polleres

Abstract

A growing amount of Linked Data (graph-structured data accessible at sources distributed across the Web) enables advanced data integration and decision-making applications. Typical systems operating on Linked Data collect (crawl) and pre-process (index) large amounts of data, and evaluate queries against a centralised repository. Because crawling and indexing are time-consuming operations, the data in the centralised index may be out of date at query execution time. An ideal system for querying Linked Data live should return current answers in a reasonable amount of time, even on corpora as large as the Web. In such a live query system, source selection (determining which sources contribute answers to a query) is a crucial step. In this article we propose to use lightweight data summaries for determining relevant sources during query evaluation. We compare several data structures and hash functions with respect to their suitability for building such summaries, stressing benefits for queries that contain joins and require ranking of results and sources. We elaborate on join variants, join ordering and ranking. We analyse the different approaches theoretically and provide results of an extensive experimental evaluation.
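
To make the idea of summary-based source selection concrete, here is a minimal sketch in Python. It is not the data structure or hash function evaluated in the article. The class `HashSummary`, the constant `NUM_BUCKETS`, the CRC32-based `bucket` function and the example source URIs are all assumptions chosen for illustration. The sketch records, for each triple position, which sources contain a term hashing to a given bucket, and then looks up the sources that may answer a triple pattern.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 64  # hypothetical summary resolution, not taken from the article


def bucket(term: str) -> int:
    """Map an RDF term to a bucket with a deterministic hash (CRC32 here)."""
    return zlib.crc32(term.encode("utf-8")) % NUM_BUCKETS


class HashSummary:
    """Toy summary: per triple position, which sources fall into which bucket."""

    def __init__(self):
        # index[position][bucket] -> set of source URIs
        self.index = [defaultdict(set) for _ in range(3)]

    def add(self, source, triple):
        """Record one (subject, predicate, object) triple found at `source`."""
        for pos, term in enumerate(triple):
            self.index[pos][bucket(term)].add(source)

    def select_sources(self, pattern):
        """Return sources that may match a triple pattern; None marks a variable."""
        candidates = None
        for pos, term in enumerate(pattern):
            if term is None:
                continue
            hits = self.index[pos].get(bucket(term), set())
            candidates = set(hits) if candidates is None else candidates & hits
        if candidates is None:
            # No constant in the pattern: every summarised source is a candidate.
            candidates = set().union(*(s for d in self.index for s in d.values()))
        return candidates


summary = HashSummary()
summary.add("http://a.example/doc", ("ex:alice", "foaf:knows", "ex:bob"))
summary.add("http://b.example/doc", ("ex:carol", "foaf:name", "Carol"))
print(summary.select_sources(("ex:alice", "foaf:knows", None)))
```

Because distinct terms can share a bucket, and because positions are indexed independently, the result may include sources that contribute no answer. It never omits a summarised source that does hold a matching triple, which is the property a live query system needs before fetching sources.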

Keywords

index structures, Linked Data, RDF querying



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Jürgen Umbrich (1)
  • Katja Hose (2)
  • Marcel Karnstedt (1)
  • Andreas Harth (3)
  • Axel Polleres (1)
  1. Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
  2. Max-Planck-Institut für Informatik, Saarbrücken, Germany
  3. Institute AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany
