The VLDB Journal, Volume 19, Issue 1, pp 91–113

The RDF-3X engine for scalable management of RDF data

  • Thomas Neumann
  • Gerhard Weikum
Special Issue Paper

Abstract

RDF is a data model for schema-free structured information that is gaining momentum in the context of Semantic-Web data, life sciences, and also Web 2.0 platforms. The “pay-as-you-go” nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with streamlined indexing and query processing. The physical design is identical for all RDF-3X databases regardless of their workloads, and completely eliminates the need for index tuning by exhaustive indexes for all permutations of subject-property-object triples and their binary and unary projections. These indexes are highly compressed, and the query processor can aggressively leverage fast merge joins with excellent performance of processor caches. The query optimizer is able to choose optimal join orders even for complex queries, with a cost model that includes statistical synopses for entire join paths. Although RDF-3X is optimized for queries, it also provides good support for efficient online updates by means of a staging architecture: direct updates to the main database indexes are deferred, and instead applied to compact differential indexes which are later merged into the main indexes in a batched manner. Experimental studies with several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching, many-way star-joins, and long path-joins demonstrate that RDF-3X can outperform the previously best alternatives by one or two orders of magnitude.
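
To make the indexing idea in the abstract more concrete, the following is a minimal Python sketch, not the RDF-3X implementation. It keeps one sorted list per permutation of (subject, property, object) and answers a two-pattern query with a merge join over two index range scans. The toy triples and the helper names `build_indexes`, `scan`, and `merge_join` are assumptions made for illustration; compression, the projection indexes, and the optimizer are omitted.

```python
from bisect import bisect_left
from itertools import permutations

# Toy (subject, property, object) triples; all names are invented.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "mpi"),
    ("bob", "knows", "carol"),
    ("bob", "worksAt", "mpi"),
    ("carol", "worksAt", "acme"),
]


def build_indexes(triples):
    """Build one sorted list per ordering of s, p, o (spo, sop, pso, ...)."""
    indexes = {}
    for order in permutations(range(3)):
        name = "".join("spo"[i] for i in order)
        indexes[name] = sorted(tuple(t[i] for i in order) for t in triples)
    return indexes


def scan(index, prefix):
    """Return the entries of a sorted index that start with `prefix`."""
    result = []
    pos = bisect_left(index, prefix)  # first entry >= prefix
    while pos < len(index) and index[pos][:len(prefix)] == prefix:
        result.append(index[pos])
        pos += 1
    return result


def merge_join(left_keys, right_pairs):
    """Join sorted, unique left keys with (key, value) pairs sorted by key."""
    out = []
    i = j = 0
    while i < len(left_keys) and j < len(right_pairs):
        lk, rk = left_keys[i], right_pairs[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append((lk, right_pairs[j][1]))
            j += 1  # left keys are unique, so only the right side advances
    return out


indexes = build_indexes(TRIPLES)

# Query: ?x worksAt mpi . ?x knows ?y
# The "pos" index yields matching subjects already sorted by subject.
employees = [s for (_, _, s) in scan(indexes["pos"], ("worksAt", "mpi"))]
# The "pso" index yields (subject, object) pairs sorted by subject.
knows = [(s, o) for (_, s, o) in scan(indexes["pso"], ("knows",))]

answers = merge_join(employees, knows)
print(answers)
```

Because every ordering is available, each triple pattern can be read from an index that is already sorted on the join variable, so the join needs no extra sort step in this sketch.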

Keywords

RDF, Batched updates, Indexing, Query processing, Query optimization, SPARQL, Database engine

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Max-Planck-Institut für Informatik, Saarbrücken, Germany
