The VLDB Journal, Volume 27, Issue 2, pp 245–269

Efficient provenance tracking for datalog using top-k queries

  • Daniel Deutch
  • Amir Gilad
  • Yuval Moskovitch
Regular Paper


Highly expressive declarative languages, such as datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such datalog programs, and the large volume of data that they process, call for result explanation. Results may be explained through the tracking and presentation of data provenance, defined here as the set of derivation trees of a given fact. While informative, such full provenance information is typically too large and complex (even when compactly represented) to display to the user. To this end, we propose a novel top-k query language for querying datalog provenance, supporting selection criteria based on tree patterns and ranking based on the rules and database facts used in the derivation. We propose a novel, efficient algorithm that computes, in polynomial data complexity, a compact representation of the top-k trees, which may then be explicitly constructed in time linear in their size. We further study the algorithm's performance experimentally, showing its scalability even for complex datalog programs where full provenance tracking is infeasible.
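To make the notion of ranked derivation trees concrete, the following is a minimal sketch and not the algorithm of the paper: it enumerates derivation trees by brute force up to a depth bound, whereas the proposed algorithm computes a compact representation without such enumeration. The ground program, the rule names `r1` and `r2`, the numeric weights, and the choice to score a tree by multiplying the weights of the rules and facts it uses are all assumptions made for illustration.

```python
from itertools import product
import heapq

# Hypothetical database facts with illustrative weights.
FACTS = {"edge(a,b)": 0.9, "edge(b,c)": 0.8, "edge(a,c)": 0.5}

# Hypothetical ground rules: (rule name, head, body atoms, rule weight).
RULES = [
    ("r1", "path(a,b)", ["edge(a,b)"], 1.0),
    ("r1", "path(b,c)", ["edge(b,c)"], 1.0),
    ("r1", "path(a,c)", ["edge(a,c)"], 1.0),
    ("r2", "path(a,c)", ["path(a,b)", "edge(b,c)"], 0.9),
]

def trees(atom, depth):
    """Yield (weight, tree) for each derivation tree of `atom` within `depth` rule applications."""
    if atom in FACTS:
        # A database fact is a leaf; its tree is the atom itself.
        yield FACTS[atom], atom
    if depth == 0:
        return
    for name, head, body, rule_weight in RULES:
        if head != atom:
            continue
        # Combine every choice of subtree for each body atom.
        subtrees = [list(trees(b, depth - 1)) for b in body]
        for combo in product(*subtrees):
            weight = rule_weight
            for child_weight, _ in combo:
                weight *= child_weight  # assumed scoring: product of weights
            yield weight, (name, atom, [t for _, t in combo])

def top_k(atom, k, depth=4):
    """Return the k highest-weight derivation trees of `atom` (brute force)."""
    return heapq.nlargest(k, trees(atom, depth), key=lambda wt: wt[0])

if __name__ == "__main__":
    for weight, tree in top_k("path(a,c)", 2):
        print(round(weight, 3), tree)
```

Under these assumed weights, the two-step derivation of `path(a,c)` through `r2` outranks the direct one through `r1`. The depth bound is needed because recursive rules can give a fact infinitely many derivation trees, which is one reason full enumeration does not scale.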


Keywords: Provenance, Datalog, Top-K



This research has been partially funded by the Israeli Science Foundation (978/17, 1636/13) and the Blavatnik Interdisciplinary Cyber Research Center (TAU ICRC). The contribution of Yuval Moskovitch is part of Ph.D. thesis research conducted at Tel Aviv University.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Tel Aviv University, Tel Aviv, Israel
