Optimizing RPQs over a compact graph representation

  • Regular Paper
The VLDB Journal

Abstract

We propose techniques to evaluate regular path queries (RPQs) over labeled graphs (e.g., RDF). We apply a bit-parallel simulation of a Glushkov automaton representing the query over a ring: a compact wavelet-tree-based index of the graph. To the best of our knowledge, our approach is the first to evaluate RPQs over a compact representation of such graphs, where we show the key advantages of using Glushkov automata in this setting. Our scheme obtains optimal time, in terms of alternation complexity, for traversing the product graph. We further introduce various optimizations, such as the ability to process several automaton states and graph nodes/labels simultaneously, and to estimate relevant selectivities. Experiments show that our approach uses 3–5\(\times \) less space, and is over 5\(\times \) faster, on average, than the next best state-of-the-art system for evaluating RPQs.

Notes

  1. To the best of our knowledge, ArangoDB, Neo4j and OrientDB do not support RPQs declaratively with the standard semantics as we define them, though Neo4j does provide some RPQ-like features. While our queries could be run in TigerGraph, its licenses forbid benchmarking.

  2. We could support property paths under SPARQL semantics by evaluating recursive operators under set semantics using the techniques described here and rewriting non-recursive parts to (unions of) basic graph patterns evaluated on the Ring [7].

References

  1. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable semantic web data management using vertical partitioning. In: Proceedings of the VLDB, pp. 411–422 (2007)

  2. Abul-Basher, Z.: Multiple-query optimization of regular path queries. In: Proceedings of the ICDE, pp. 1426–1430 (2017)

  3. Alkhateeb, F., Euzenat, J.: Constrained regular expressions for answering RDF-path queries modulo RDFS. Int. J. Web Inf. Syst. 10(1), 24–50 (2014)

  4. Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J.L., Vrgoc, D.: Foundations of modern query languages for graph databases. ACM Comput. Surv. 50(5), 68:1-68:40 (2017)

  5. Angles, R., Arenas, M., Barceló, P., Boncz, P.A., Fletcher, G.H.L., Gutiérrez, C., Lindaaker, T., Paradies, M., Plantikow, S., Sequeda, J.F., van Rest, O., Voigt, H.: G-CORE: a core for future graph query languages. In: Proceedings of the SIGMOD, pp. 1421–1432 (2018)

  6. Arenas, M., Conca, S., Pérez, J.: Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. In: Proceedings of the WWW, pp. 629–638 (2012)

  7. Arroyuelo, D., Hogan, A., Navarro, G., Reutter, J., Rojas-Ledesma, J., Soto, A.: Worst-case optimal graph joins in almost no space. In: Proceedings of the SIGMOD, pp. 102–114 (2021)

  8. Arroyuelo, D., Hogan, A., Navarro, G., Rojas-Ledesma, J.: Time- and space-efficient regular path queries. In: Proceedings of the ICDE, pp. 3091–3105 (2022)

  9. Atserias, A., Grohe, M., Marx, D.: Size bounds and query plans for relational joins. SIAM J. Comput. 42(4), 1737–1767 (2013)

  10. Baier, J.A., Daroch, D., Reutter, J.L., Vrgoc, D.: Evaluating navigational RDF queries over the Web. In: Proceedings of the ACM HT, pp. 165–174 (2017)

  11. Barbay, J., Kenyon, C.: Alternation and redundancy analysis of the intersection problem. ACM Trans. Algorithm 4(1), 1–18 (2008)

  12. Berry, G., Sethi, R.: From regular expression to deterministic automata. Theor. Comput. Sci. 48(1), 117–126 (1986)

  13. Biega, J., Kuzey, E., Suchanek, F.M.: Inside YAGO2s: a transparent information extraction architecture. In: Proceedings of the WWW, pp. 325–328 (2013)

  14. Bonchi, F., Gionis, A., Gullo, F., Ukkonen, A.: Distance oracles in edge-labeled graphs. In: Proceedings of the EDBT, pp. 547–558 (2014)

  15. Bonifati, A., Martens, W., Timm, T.: Navigating the maze of Wikidata query logs. In: Proceedings of the WWW, pp. 127–138 (2019)

  16. Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query logs. VLDB J. 29(2–3), 655–679 (2020)

  17. Brüggemann-Klein, A.: Regular expressions into finite automata. Theor. Comput. Sci. 120(2), 197–213 (1993)

  18. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)

  19. Clark, D.R.: Compact PAT trees. PhD thesis, University of Waterloo, Canada (1996)

  20. Claude, F., Navarro, G., Ordóñez, A.: The wavelet matrix: an efficient wavelet tree for large alphabets. Inf. Syst. 47, 15–32 (2015)

  21. Colazzo, D., Mecca, V., Nolé, M., Sartiani, C.: PathGraph: querying and exploring big data graphs. In: Proceedings of the SSDBM, pp. 29:1–29:4 (2018)

  22. Cruz, I.F., Mendelzon, A.O., Wood, P.T.: A graphical query language supporting recursion. In: Proceedings of the SIGMOD, pp. 323–330 (1987)

  23. Deutsch, A., Xu, Y., Wu, M., Lee, V.E.: Aggregation support for modern graph analytics in TigerGraph. In: Proceedings of the SIGMOD, pp. 377–392 (2020)

  24. Deutsch, A., Francis, N., Green, A., Hare, K., Li, B., Libkin, L., Lindaaker, T., Marsault, V., Martens, W., Michels, J., Murlak, F., Plantikow, S., Selmer, P., van Rest, O., Voigt, H., Vrgoc, D., Wu, M., Zemke, F.: Graph pattern matching in GQL and SQL/PGQ. In: Proceedings of the SIGMOD, pp. 2246–2258 (2022)

  25. Dey, S.C., Cuevas-Vicentín, V., Köhler, S., Gribkoff, E., Wang, M., Ludäscher, B.: On implementing provenance-aware regular path queries with relational query engines. In: Proceedings of the EDBT/ICDT, pp. 214–223 (2013)

  26. Erling, O., Mikhailov, I.: RDF support in the virtuoso DBMS. In: Networked Knowledge—Networked Media, pp. 7–24. Springer (2009)

  27. Ferragina, P., Manzini, G.: Indexing compressed texts. J. ACM 52(4), 552–581 (2005)

  28. Fionda, V., Pirrò, G., Consens, M.P.: Querying knowledge graphs with extended property paths. Semant. Web 10(6), 1127–1168 (2019)

  29. Fletcher, G.H.L., Peters, J., Poulovassilis, A.: Efficient regular path query evaluation using path indexes. In: Proceedings of the EDBT, pp. 636–639 (2016)

  30. Francis, N., Green, A., Guagliardo, P., Libkin, L., Lindaaker, T., Marsault, V., Plantikow, S., Rydberg, M., Selmer, P., Taylor, A.: Cypher: An evolving query language for property graphs. In: Proceedings of the SIGMOD, pp. 1433–1445 (2018)

  31. Gagie, T., Navarro, G., Puglisi, S.: New algorithms on wavelet trees and applications to information retrieval. Theor. Comput. Sci. 426, 25–41 (2012)

  32. Gagie, T., Navarro, G., Puglisi, S.J.: New algorithms on wavelet trees and applications to information retrieval. Theor. Comput. Sci. 426–427, 25–41 (2012)

  33. Gagie, T., Kärkkäinen, J., Navarro, G., Puglisi, S.J.: Colored range queries and document retrieval. Theor. Comput. Sci. 483, 36–50 (2013)

  34. Glushkov, V.-M.: The abstract theory of automata. Russ. Math. Surv. 16, 1–53 (1961)

  35. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the SODA, pp. 841–850 (2003)

  36. Gubichev, A., Bedathur, S.J., Seufert, S.: Sparqling kleene: fast property paths in RDF-3X. In: Proceedings of the GRADES, pp. 14 (2013)

  37. Guo, X., Gao, H., Zou, Z.: Distributed processing of regular path queries in RDF graphs. Knowl. Inf. Syst. 63(4), 993–1027 (2021)

  38. Harris, S., Seaborne, A., Prud’hommeaux, E.: SPARQL 1.1 Query Language. W3C Recommendation (2013). http://www.w3.org/TR/sparql11-query/

  39. Hartig, O., Pirrò, G.: SPARQL with property paths on the Web. Semant. Web 8(6), 773–795 (2017)

  40. Jachiet, L., Genevès, P., Gesbert, N., Layaïda, N.: On the optimization of recursive relational queries: application to graph queries. In: Proceedings of the SIGMOD, pp. 681–697 (2020)

  41. Jin, R., Hong, H., Wang, H., Ruan, N., Xiang, Y.: Computing label-constraint reachability in graph databases. In: Proceedings of the SIGMOD, pp. 123–134 (2010)

  42. Koschmieder, A., Leser, U.: Regular path queries on large graphs. In: Proceedings of the SSDBM, pp. 177–194 (2012)

  43. Kostylev, E.V., Reutter, J.L., Romero, M., Vrgoc, D.: SPARQL with property paths. In: Proceedings of the ISWC, pp. 3–18 (2015)

  44. Kuijpers, J., Fletcher, G., Lindaaker, T., Yakovets, N.: Path indexing in the cypher query pipeline. In: Proceedings of the EDBT, pp. 582–587 (2021)

  45. Liu, B., Wang, X., Liu, P., Li, S., Wang, X.: PAIRPQ: an efficient path index for regular path queries on knowledge graphs. In: Proceedings of the APWeb-WAIM, pp. 106–120 (2021)

  46. Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the most out of Wikidata: semantic technology usage in Wikipedia’s knowledge graph. In: Proceedings of the ISWC, pp. 376–394 (2018)

  47. Martínez-Prieto, M.A., Brisaboa, N., Cánovas, R., Claude, F., Navarro, G.: Practical compressed string dictionaries. Inf. Syst. 56, 73–108 (2016)

  48. Mehmood, Q., Saleem, M., Sahay, R., Ngomo, A.N., d’Aquin, M.: QPPDs: querying property paths over distributed RDF datasets. IEEE Access 7, 101031–101045 (2019)

  49. Mendelzon, A.O., Wood, P.T.: Finding regular simple paths in graph databases. SIAM J. Comput. 24(6), 1235–1258 (1995)

  50. Miura, K., Amagasa, T., Kitagawa, H.: Accelerating regular path queries using FPGA. In: Bordawekar, R., Lahiri, T. (eds.) Proceedings of the ADMS@VLDB, pp. 47–54 (2019)

  51. Munro, J.I.: Tables. In: Chandru, V., Vinay, V. (eds.) Foundations of Software Technology and Theoretical Computer Science, pp. 37–42. Springer, Berlin, Heidelberg (1996). https://doi.org/10.1007/3-540-62034-6_35

  52. Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Succinct representations of permutations and functions. Theor. Comput. Sci. 438, 74–88 (2012). https://doi.org/10.1016/j.tcs.2012.03.005

  53. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of the SODA, pp. 657–666 (2002)

  54. Navarro, G.: Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput. Surv. 46(4), 52:1-52:47 (2013)

  55. Navarro, G.: Wavelet trees for all. J. Discrete Algorithm 25, 2–20 (2014)

  56. Navarro, G., Raffinot, M.: New techniques for regular expression searching. Algorithmica 41(2), 89–116 (2005)

  57. Nguyen, V., Kim, K.: Efficient regular path query evaluation by splitting with unit-subquery cost matrix. IEICE Trans. Inf. Syst. 100(10), 2648–2652 (2017)

  58. Nolé, M., Sartiani, C.: Regular path queries on massive graphs. In: Proceedings of the SSDBM, pp. 13:1–13:12 (2016)

  59. Pacaci, A., Bonifati, A., Özsu, M.T.: Regular path query evaluation on streaming graphs. In: Proceedings of the SIGMOD, pp. 1415–1430 (2020)

  60. Peng, Y., Zhang, Y., Lin, X., Qin, L., Zhang, W.: Answering billion-scale label-constrained reachability queries within microsecond. PVLDB 13(6), 812–825 (2020)

  61. Peng, Y., Lin, X., Zhang, Y., Zhang, W., Qin, L.: Answering reachability and k-reach queries on large graphs with label constraints. VLDB J. 31(1), 101–127 (2022)

  62. Pérez, J., Arenas, M., Gutiérrez, C.: nSPARQL: a navigational language for RDF. J. Web Semant. 8(4), 255–270 (2010)

  63. Seufert, S., Anand, A., Bedathur, S.J., Weikum, G.: FERRARI: flexible and efficient reachability range assignment for graph indexing. In: Proceedings of the ICDE, pp. 1009–1020 (2013)

  64. Tetzel, F., Lehner, W., Kasperovics, R.: Efficient compilation of regular path queries. Datenbank Spektrum 20(3), 243–259 (2020)

  65. Thompson, B.B., Personick, M., Cutcher, M.: The Bigdata®RDF graph database. In: Linked data management, pp. 193–237. Chapman and Hall/CRC (2014)

  66. Valstar, L.D.J., Fletcher, G.H.L., Yoshida, Y.: Landmark indexing for evaluation of label-constrained reachability queries. In: Proceedings of the SIGMOD, pp. 345–358 (2017)

  67. van Rest, O., Hong, S., Kim, J., Meng, X., Chafi, H.: PGQL: a property graph query language. In: Proceedings of the GRADES, p. 7 (2016)

  68. Veldhuizen, T.L.: Triejoin: a simple, worst-case optimal join algorithm. In: Proceedings of the ICDT, pp. 96–106 (2014)

  69. Vrandecic, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)

  70. Wadhwa, S., Prasad, A., Ranu, S., Bagchi, A., Bedathur, S.: Efficiently answering regular simple path queries on large labeled networks. In: Proceedings of the SIGMOD, pp. 1463–1480 (2019)

  71. Wang, X., Rao, G., Jiang, L., Lyu, X., Yang, Y., Feng, Z.: TraPath: fast regular path query evaluation on large-scale RDF graphs. In: Proceedings of the WAIM, pp. 372–383 (2014)

  72. Wang, X., Wang, J., Zhang, X.: Efficient distributed regular path queries on RDF graphs using partial evaluation. In: Proceedings of the CIKM, pp. 1933–1936 (2016)

  73. Yakovets, N., Godfrey, P., Gryz, J.: Evaluation of SPARQL property paths via recursive SQL. In: Proceedings of the AMW (2013)

  74. Yakovets, N., Godfrey, P., Gryz, J.: Query planning for evaluating SPARQL property paths. In: Proceedings of the SIGMOD, pp. 1875–1889 (2016)

  75. Zou, L., Xu, K., Yu, J.X., Chen, L., Xiao, Y., Zhao, D.: Efficient processing of label-constraint reachability queries in large graphs. Inf. Syst. 40, 47–66 (2014)

Author information

Corresponding author

Correspondence to Adrián Gómez-Brandón.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by ANID—Millennium Science Initiative Program—Code ICN17_002; Fondecyt Grant 1-230755; Fondecyt Grants 1221926 and 1-200038; Xunta de Galicia, FEDER Galicia 2014-2020 80%, SXU 20% Grant ED431G 2019/01; GAIN/Xunta de Galicia Grant ED431C 2021/53 (GRC); Xunta de Galicia/FEDER-UE Grant IN852D 2021/3; Ministerio de Ciencia e Innovación Grants PID2020-114635RB-I00 and PDC2021-120917-C21.

A Complete running example

Fig. 14 The whole process to match the RPQ of Fig. 8 in our graph of Figs. 1 and 5

Figure 14 illustrates the whole process of matching the (2)RPQ (y,l5 \(^+\) / \(^{\hat{\,}}\) bus,Baq), with the NFA of Fig. 8, in the graph of Fig. 1, as represented in Fig. 5. We use a top-down tree to show branches in the process; Fig. 9 shows the states of the product graph \(G_{\!\mathcal {M}}'\) we traverse (backwards in BFS). The top nodes of the tree illustrate what we have already done in previous examples (edge \(1 \rightarrow 2\) of \(G_{\!\mathcal {M}}'\)): starting from \(L_\textrm{p}[15\mathinner {.\,.}16]\) (Baq) and \(D=0001\), we identified the only edge label reaching that node, l5, that is relevant in our NFA. The label l1 also appears in \(L_\textrm{p}[15\mathinner {.\,.}16]\) because a transition labeled l1 reaches Baq, but \( D ~ \& ~ B[\texttt {l1} ] = 0000\) because our NFA does not match it; this pruned branch is shown with a dashed arrow leading to an X. We have also seen that the only subject of those edges labeled l5 is \(\texttt {BA} \), at \(L_\textrm{s}[10\mathinner {.\,.}10]\), where the NFA is active at states 0110. We move to BA because \(S[\texttt {BA} ]=S[4]=0000\), so \(D=0110\) contains some unseen NFA states at this node. We mark \(S[4]=0110\) to indicate that we have already reached \(\texttt {BA} \) with those active NFA states. In part 3 of our process, we find the interval of \(L_\textrm{p}\) corresponding to \(L_\textrm{s}[10]=4\) (BA), \(L_\textrm{p}[C_\textrm{o}[4]+1\mathinner {.\,.}C_\textrm{o}[5]] = L_\textrm{p}[11\mathinner {.\,.}14]\), completing one step.
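The label filter and the part-3 interval lookup in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the NFA of Fig. 8 is not reproduced here, so the tables `B` and `PRED` below describe a hypothetical 4-state automaton chosen to reproduce the bitmasks quoted in the text, and `C_O` is read off the \(L_\textrm{p}\) intervals mentioned in this appendix. The names `bit`, `step_back` and `object_interval` are ours, not the paper's.

```python
N_STATES = 4  # assumption: masks in the text are 4 bits wide


def bit(state: int) -> int:
    """Mask with only `state` set; state 0 is the leftmost bit, as in D=1000."""
    return 1 << (N_STATES - 1 - state)


# Assumed table: B[label] marks the states entered by transitions with that label.
B = {"l5": bit(2) | bit(3), "^bus": bit(1)}
# Assumed table: PRED[s] marks the states that have a transition into s.
PRED = {1: bit(0), 2: bit(1) | bit(2), 3: bit(1) | bit(2)}


def step_back(D: int, label: str) -> int:
    """Active states after reading `label` backwards from the active set D."""
    hit = D & B.get(label, 0)  # the D & B[label] filter; 0 means the branch is pruned
    prev = 0
    for s in range(N_STATES):
        if hit & bit(s):
            prev |= PRED.get(s, 0)
    return prev


# Boundaries inferred from the intervals quoted in the text (1-based node ids,
# index 0 unused): SA=1 -> [1..4], UCh=2 -> [5..8], BA=4 -> [11..14], Baq=5 -> [15..16].
C_O = [None, 0, 4, 8, 10, 14, 16]


def object_interval(node_id: int) -> tuple[int, int]:
    """Part 3: the range L_p[C_o[v]+1 .. C_o[v+1]] of edges reaching node v."""
    return C_O[node_id] + 1, C_O[node_id + 1]
```

With these tables, `step_back(0b0001, "l5")` yields `0b0110` and `step_back(0b0001, "l1")` yields `0`, matching the accepted and pruned branches above, while `object_interval(4)` returns `(11, 14)` for BA.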

Three symbols appear on \(L_\textrm{p}[11\mathinner {.\,.}14]\) (i.e., three edge labels reach BA), but only l5 (left child) and \(^{\hat{\,}}\) bus (right child) match our NFA. By l5 we reach \(L_\textrm{s}[8\mathinner {.\,.}9]\) using backward search. In this interval we find two sources that, by l5, reach BA: SA (left child) and Baq (right child), both with NFA state \(D=0110\) (the same as before). Conversely, by \(^{\hat{\,}}\) bus, we reach \(L_\textrm{s}[16\mathinner {.\,.}16]\) using backward search. There we find the only source, SA, that reaches BA, with NFA state \(D=1000\). We process the three sources in BFS order, left to right:

  1. By l5 we reach BA from SA (leftmost tree node in this level). We accept going to SA because \(S[\texttt {SA} ]=S[1]=0000\) and \(D=0110\) has new states, so we set \(S[1]=0110\). In part 3 we obtain the interval \(L_\textrm{p}[1\mathinner {.\,.}4]\) for SA. This is transition \(2 \rightarrow 3\) in the product graph.

  2. By l5 we reach BA from Baq (middle tree node in this level). Although we had already seen Baq, it was only with states \(S[\texttt {Baq} ]=S[5]=0001\), so the current state \(D=0110\) has some unvisited NFA states; we set \(S[5]=0111\) and part 3 leads us to \(L_\textrm{p}[15\mathinner {.\,.}16]\). This is transition \(2 \rightarrow 4\) in the product graph.

  3. By \(^{\hat{\,}}\) bus we reach BA from SA as well (rightmost tree node in this level). Since \(S[\texttt {SA} ]=S[1]=0110\) and \(D=1000\), we have new states and we accept going to SA, setting \(S[1]=1110\). The NFA state is still 1000 (transition \(2 \rightarrow 5\) in the product graph), which contains the initial state, so we report node SA as a solution to our 2RPQ. We then continue from it, reaching \(L_\textrm{p}[1\mathinner {.\,.}4]\) using part 3.
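The acceptance test applied to each source above (enter a node only if \(D\) brings NFA states not yet recorded in \(S\), then merge \(D\) into \(S\)) can be sketched as follows. The function name `visit`, the dictionary used for \(S\), and the constant `INITIAL = 0b1000` are illustrative assumptions; only the bitmask values come from the text.

```python
INITIAL = 0b1000  # assumption: the leftmost bit is the initial NFA state, as in D=1000


def visit(S: dict, node: str, D: int) -> tuple[bool, bool]:
    """Return (enter, report) for reaching `node` with active states D.

    enter  is True when D contains states not yet seen at this node.
    report is True when we enter and D contains the initial state.
    """
    seen = S.get(node, 0)
    if D & ~seen == 0:
        return False, False  # nothing new: prune, which also breaks cycles
    S[node] = seen | D  # remember the states reached at this node
    return True, bool(D & INITIAL)


# Replaying the three sources processed above, starting from the marks
# already set in the first step (Baq=0001, BA=0110).
S = {"Baq": 0b0001, "BA": 0b0110}
first = visit(S, "SA", 0b0110)   # via l5
second = visit(S, "Baq", 0b0110)  # via l5
third = visit(S, "SA", 0b1000)   # via ^bus
```

After these three calls, `S["SA"]` is `0b1110` and `S["Baq"]` is `0b0111`, and only the third call asks for a report, in line with SA being the first solution.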

Our BFS traversal now branches from each of the tree nodes identified above:

  1. From \(L_\textrm{p}[1\mathinner {.\,.}4]\) (SA) with \(D=0110\), we find edges labeled l2, l5, and \(^{\hat{\,}}\) bus leading to it:

    (a) Our NFA cannot process l2 (\( D ~ \& ~ B[\texttt {l2} ]=0000\)), so we abandon that edge.

    (b) By l5 we find the source BA, but since \(S[\texttt {BA} ]=S[4]=0110\) and \(D=0110\), we have already visited BA with those active states, so we abandon this branch too, which avoids falling into a cycle.

    (c) By \(^{\hat{\,}}\) bus we reach \(L_\textrm{s}[14\mathinner {.\,.}14]\) with state \(D=1000\). The only source here is \(L_\textrm{s}[14]=2=\) UCh. Since \(S[\texttt {UCh} ]=S[2]=0000\), we enter this state and set \(S[2]=1000\) (transition \(3 \rightarrow 6\) in the product graph). Furthermore, since D contains the initial state, we report UCh as the second solution to the 2RPQ.

  2. From \(L_\textrm{p}[15\mathinner {.\,.}16]\) (Baq) with \(D=0110\), we find edges labeled l1 and l5 leading to it:

    (a) Our NFA cannot process l1, so we abandon this branch.

    (b) By l5 we reach BA again, and once again we prune the branch to avoid falling into cycles, because \(S[\texttt {BA} ]=S[4]=0110\).

  3. Finally, from \(L_\textrm{p}[1\mathinner {.\,.}4]\) (SA) and \(D=1000\), which we had reported, the NFA has nowhere to go, so we reject the three possible edge labels, l2, l5, and \(^{\hat{\,}}\) bus. The same happens in the last tree level from \(L_\textrm{p}[5\mathinner {.\,.}8]\) (UCh) after reporting it, so we finish.
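The whole walkthrough can be replayed with a small backward BFS over plain Python dictionaries. This sketch does not use the ring or wavelet trees: `INCOMING` is a hypothetical adjacency list containing only the edges named in this appendix (sources of pruned labels are unknown and written as `"?"`, and the edges reaching UCh are omitted because the text only says they are all rejected), and `B` and `PRED` describe an assumed 4-state automaton that reproduces the bitmasks quoted above.

```python
from collections import deque

N_STATES = 4  # assumption: masks in the text are 4 bits wide


def bit(state: int) -> int:
    """Mask with only `state` set; state 0 is the leftmost bit."""
    return 1 << (N_STATES - 1 - state)


INITIAL, FINAL = bit(0), bit(3)
B = {"l5": bit(2) | bit(3), "^bus": bit(1)}                 # assumed label table
PRED = {1: bit(0), 2: bit(1) | bit(2), 3: bit(1) | bit(2)}  # assumed predecessors

# Hypothetical toy graph: target -> [(label, source)], in the left-to-right
# order used by the walkthrough.
INCOMING = {
    "Baq": [("l1", "?"), ("l5", "BA")],
    "BA": [("l5", "SA"), ("l5", "Baq"), ("^bus", "SA")],
    "SA": [("l2", "?"), ("l5", "BA"), ("^bus", "UCh")],
    "UCh": [],
}


def step_back(D: int, label: str) -> int:
    """Active states after reading `label` backwards from D (0 if pruned)."""
    hit = D & B.get(label, 0)
    prev = 0
    for s in range(N_STATES):
        if hit & bit(s):
            prev |= PRED.get(s, 0)
    return prev


def backward_bfs(target: str) -> tuple[list, dict]:
    """Traverse the product graph backwards from `target`, reporting solutions."""
    S = {target: FINAL}  # NFA states already seen at each node
    solutions = []
    queue = deque([(target, FINAL)])
    while queue:
        node, D = queue.popleft()
        for label, source in INCOMING.get(node, []):
            D2 = step_back(D, label)
            if D2 == 0:
                continue  # the NFA cannot process this label
            if D2 & ~S.get(source, 0) == 0:
                continue  # no unseen states at the source: avoid cycles
            S[source] = S.get(source, 0) | D2
            if D2 & INITIAL and source not in solutions:
                solutions.append(source)
            queue.append((source, D2))
    return solutions, S


solutions, S = backward_bfs("Baq")
```

On this toy input, `solutions` is `["SA", "UCh"]` and the final marks are `S["BA"] == 0b0110`, `S["SA"] == 0b1110`, `S["Baq"] == 0b0111` and `S["UCh"] == 0b1000`, the same values reached step by step in the text.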

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Arroyuelo, D., Gómez-Brandón, A., Hogan, A. et al. Optimizing RPQs over a compact graph representation. The VLDB Journal 33, 349–374 (2024). https://doi.org/10.1007/s00778-023-00811-2

  • DOI: https://doi.org/10.1007/s00778-023-00811-2
