The VLDB Journal, Volume 17, Issue 2, pp 243–264

Databases with uncertainty and lineage

  • Omar Benjelloun
  • Anish Das Sarma
  • Alon Halevy
  • Martin Theobald
  • Jennifer Widom
Special Issue Paper

Abstract

This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been studied extensively in isolation; however, many applications require the two features in tandem. Fundamentally, lineage enables simple and consistent representation of uncertain data and correlates uncertainty in query results with uncertainty in the input data; moreover, query processing with lineage and uncertainty together presents computational benefits over treating them separately. We show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We define two notions of ULDB minimality, data-minimal and lineage-minimal, and study minimization of ULDB representations under both notions. With lineage, derived relations are no longer self-contained: their uncertainty depends on uncertainty in the base data. We provide an algorithm for the new operation of extracting a database subset in the presence of interconnected uncertainty. We also show how ULDBs enable a new approach to query processing in probabilistic databases. Finally, we describe the current state of the Trio system, our implementation of ULDBs under development at Stanford.

Keywords

Uncertainty in databases · Lineage · Provenance · Probabilistic data management

Copyright information

© Springer-Verlag 2008

Authors and Affiliations

  • Omar Benjelloun (1)
  • Anish Das Sarma (2)
  • Alon Halevy (1)
  • Martin Theobald (2)
  • Jennifer Widom (2)

  1. Google Inc., Mountain View, USA
  2. Stanford University, Palo Alto, USA
