
Efficient provenance tracking for datalog using top-k queries


Highly expressive declarative languages, such as datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such datalog programs, and the large volume of data that they process, call for result explanation. Results may be explained through the tracking and presentation of data provenance, defined here as the set of derivation trees of a given fact. While informative, such full provenance is typically too large and complex (even when compactly represented) to display to the user. To this end, we propose a novel top-k query language for querying datalog provenance, supporting selection criteria based on tree patterns and ranking based on the rules and database facts used in derivation. We propose an efficient novel algorithm that computes, in polynomial data complexity, a compact representation of the top-k trees, which may then be explicitly constructed in time linear in their size. We further study the algorithm's performance experimentally, showing that it scales even for complex datalog programs where full provenance tracking is infeasible.
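To make the notions of derivation tree and top-k ranking concrete, the following is a minimal brute-force sketch and not the algorithm proposed in the article. It assumes a hypothetical two-rule program (r1: path(X,Y) :- edge(X,Y) and r2: path(X,Y) :- edge(X,Z), path(Z,Y)) over an acyclic toy graph, and it assumes that a tree's score is the sum of the costs of the rules and facts it uses, with lower cost ranking higher. The names EDGE_COST, RULE_COST, derivations, and top_k are illustrative only.

```python
import heapq
from typing import Iterator

# Hypothetical toy database: weighted edge facts of an ACYCLIC graph
# (acyclicity guarantees that the naive recursion below terminates).
EDGE_COST = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 5.0, ("c", "d"): 1.0}

# Hypothetical per-rule costs (an assumption, not taken from the article).
RULE_COST = {"r1": 0.5, "r2": 1.0}


def derivations(x: str, y: str) -> Iterator[tuple]:
    """Yield (cost, tree) for every derivation tree of the fact path(x, y)."""
    # r1: path(X, Y) :- edge(X, Y).
    if (x, y) in EDGE_COST:
        cost = RULE_COST["r1"] + EDGE_COST[(x, y)]
        yield cost, ("r1", f"edge({x},{y})")
    # r2: path(X, Y) :- edge(X, Z), path(Z, Y).
    for (src, z), edge_cost in EDGE_COST.items():
        if src != x:
            continue
        for sub_cost, sub_tree in derivations(z, y):
            cost = RULE_COST["r2"] + edge_cost + sub_cost
            yield cost, ("r2", f"edge({x},{z})", sub_tree)


def top_k(x: str, y: str, k: int) -> list:
    """Return the k cheapest derivation trees of path(x, y), cheapest first."""
    return heapq.nsmallest(k, derivations(x, y), key=lambda pair: pair[0])


if __name__ == "__main__":
    for cost, tree in top_k("a", "c", 2):
        print(cost, tree)
```

This sketch enumerates every derivation tree before ranking, which is exactly the full-provenance blow-up that the abstract describes as infeasible for large inputs; it is meant only to illustrate what is being ranked.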



  1. This requires a slight change of the definition of patterns, which is easy to support, to allow * in relation names.




This research has been partially funded by the Israeli Science Foundation (978/17, 1636/13) and the Blavatnik Interdisciplinary Cyber Research Center (TAU ICRC). The contribution of Yuval Moskovitch is part of Ph.D. thesis research conducted at Tel Aviv University.

Author information


Corresponding author

Correspondence to Amir Gilad.

About this article

Cite this article

Deutch, D., Gilad, A. & Moskovitch, Y. Efficient provenance tracking for datalog using top-k queries. The VLDB Journal 27, 245–269 (2018).


  • Provenance
  • Datalog
  • Top-K