Data Cleaning and Query Answering with Matching Dependencies and Matching Functions

Theory of Computing Systems

Abstract

Matching dependencies were recently introduced as declarative rules for data cleaning and entity resolution. Enforcing a matching dependency on a database instance identifies the values of some attributes for two tuples, provided that the values of some other attributes are sufficiently similar. Assuming the existence of matching functions for making two attribute values equal, we formally introduce the process of cleaning an instance using matching dependencies, as a chase-like procedure. We show that matching functions naturally introduce a lattice structure on attribute domains, and a partial order of semantic domination between instances. Using the latter, we define the semantics of clean query answering in terms of certain/possible answers as the greatest lower bound/least upper bound of all possible answers obtained from the clean instances. We show that clean query answering is intractable in general. Then we study queries that behave monotonically w.r.t. semantic domination order, and show that we can provide an under/over approximation for clean answers to monotone queries. Moreover, non-monotone positive queries can be relaxed into monotone queries.
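
The enforcement step described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the similarity relation (`similar`), the matching function (`match`, here set union, chosen because it is idempotent, commutative and associative), and the relation layout are all assumptions made for illustration. The paper's chase may also produce several clean instances, whereas this loop is deterministic.

```python
from itertools import combinations

def similar(a: str, b: str) -> bool:
    """Hypothetical similarity relation: equal after dropping case and whitespace."""
    return "".join(a.lower().split()) == "".join(b.lower().split())

def match(a: frozenset, b: frozenset) -> frozenset:
    """Hypothetical matching function: set union of the two attribute values."""
    return a | b

def dominated(a: frozenset, b: frozenset) -> bool:
    """Order induced by the matching function: a is below b iff match(a, b) == b."""
    return match(a, b) == b

def chase(instance, lhs, rhs):
    """Enforce one MD 'lhs values similar -> rhs values matched' until no pair violates it."""
    tuples = [dict(t) for t in instance]  # work on copies, leave the input untouched
    changed = True
    while changed:
        changed = False
        for t1, t2 in combinations(tuples, 2):
            if similar(t1[lhs], t2[lhs]) and t1[rhs] != t2[rhs]:
                # Identify the two rhs values by replacing both with their match.
                t1[rhs] = t2[rhs] = match(t1[rhs], t2[rhs])
                changed = True
    return tuples

dirty = [
    {"name": "John Doe", "addr": frozenset({"Main St."})},
    {"name": "john  doe", "addr": frozenset({"25 Main St."})},
    {"name": "Jane Roe", "addr": frozenset({"Oak Ave."})},
]
clean = chase(dirty, "name", "addr")
```

In this toy run the two "John Doe" tuples end up with the same address value, which dominates both original values, while the third tuple is unchanged. The loop terminates because each update strictly enlarges a set that is bounded by the union of all address values.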

Notes

  1. All the variables in X_i, Y_j are implicitly universally quantified in front of the formula.

  2. Matching of boolean attributes requires the existence of the top element ⊤.

  3. Remember that comparable attributes share the similarity relation and matching function (and lattice structure).

  4. For the same purpose we could also use the classic four-value lattice: ⊥ ⪯ false, true ⪯ ⊤, like the one in Example 3(c).

  5. Finiteness is shown for the case when match and merge have the representativity property (equivalent to being similarity preserving) in addition to other properties. However, the proof in [9] can be modified so that representativity is not necessary.

  6. Notice that the single MD does not form an interaction-free set of MDs.

  7. We use the superscript s, for Swoosh, to distinguish them from the properties listed in Sect. 3.

  8. In our MD framework the sets of MDs provide a logical specification, and the semantics is model-theoretic, as captured by the clean instances. Admittedly, the latter have a procedural component.
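
Notes 2 and 4 can be made concrete with a small sketch of the four-value lattice ⊥ ⪯ false, true ⪯ ⊤. It assumes, in line with the lattice structure mentioned in the abstract, that the matching function returns the least upper bound of its two arguments. The names `BOT`, `TOP`, `leq` and `match` are hypothetical and chosen only for this illustration.

```python
# Elements of the four-value lattice, encoded as plain strings.
BOT, FALSE, TRUE, TOP = "bottom", "false", "true", "top"
ELEMENTS = (BOT, FALSE, TRUE, TOP)

def leq(a: str, b: str) -> bool:
    """Partial order: bottom is below everything, top is above everything."""
    return a == b or a == BOT or b == TOP

def match(a: str, b: str) -> str:
    """Assumed matching function: the least upper bound of a and b."""
    if leq(a, b):
        return b
    if leq(b, a):
        return a
    # false and true are incomparable, so their only upper bound is top.
    return TOP
```

Matching the two incomparable values false and true yields the top element here, which is why a top element is needed when boolean attributes are matched.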

References

  1. Abiteboul, S., Kanellakis, P.C., Grahne, G.: On the representation and querying of sets of possible worlds. Theor. Comput. Sci. 78(1), 158–187 (1991)

  2. Antoniou, G., van Harmelen, F.: A Semantic Web Primer, 2nd edn. The MIT Press, Cambridge (2008)

  3. Afrati, F., Kolaitis, Ph.: Answering aggregate queries in data exchange. In: Proc. ACM PODS, pp. 129–138 (2008)

  4. Arasu, A., Re, Ch., Suciu, D.: Large-scale deduplication with constraints using Dedupalog. In: Proc. ICDE, pp. 952–963 (2009)

  5. Arenas, M., Bertossi, L., Chomicki, J.: Consistent query answers in inconsistent databases. In: Proc. ACM PODS, pp. 68–79 (1999)

  6. Arenas, M., Bertossi, L., Chomicki, J.: Answer sets for consistent query answering in inconsistent databases. Theory Pract. Log. Program. 3(4–5), 393–424 (2003)

  7. Arenas, M., Bertossi, L., Chomicki, J., He, X., Raghavan, V., Spinrad, J.: Scalar aggregation in inconsistent databases. Theor. Comput. Sci. 296(3), 405–434 (2003)

  8. Bancilhon, F., Khoshafian, S.: A calculus for complex objects. J. Comput. Syst. Sci. 38(2), 326–340 (1989)

  9. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)

  10. Bertossi, L.: Consistent query answering in databases. SIGMOD Rec. 35(2), 68–76 (2006)

  11. Bertossi, L.: Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool, San Rafael (2011)

  12. Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1) (2008)

  13. Buneman, P., Jung, A., Ohori, A.: Using powerdomains to generalize relational databases. Theor. Comput. Sci. 91(1), 23–55 (1991)

  14. Caniupan, M., Bertossi, L.: The consistency extractor system: answer set programs for consistent query answering in databases. Data Knowl. Eng. 69(6), 545–572 (2010)

  15. Chomicki, J.: Consistent query answering: five easy pieces. In: Proc. ICDT. LNCS, vol. 4353, pp. 1–17. Springer, Berlin (2007)

  16. Eiter, T., Fink, M., Greco, G., Lembo, D.: Repair localization for query answering from inconsistent databases. ACM Trans. Database Syst. 33(2) (2008)

  17. Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

  18. Whang, S.E., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with negative rules. VLDB J. 18(6), 1261–1277 (2009)

  19. Fan, W.: Dependencies revisited for improving data quality. In: Proc. ACM PODS, pp. 159–170 (2008)

  20. Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. In: Proc. VLDB, vol. 2, pp. 407–418 (2009)

  21. Fan, W., Li, J., Ma, Sh., Tang, N., Yu, W.: Interaction between record matching and data repairing. In: Proc. ACM SIGMOD, pp. 469–480 (2011)

  22. Fuxman, A., Fazli, E., Miller, R.: ConQuer: efficient management of inconsistent databases. In: Proc. ACM SIGMOD, pp. 155–166 (2005)

  23. Gaasterland, T., Godfrey, P., Minker, J.: Relaxation as a platform for cooperative answering. J. Intell. Inf. Syst. 1(3/4), 293–321 (1992)

  24. Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.-A.: Declarative data cleaning: language, model, and algorithms. In: Proc. VLDB, pp. 371–380 (2001)

  25. Gardezi, J., Bertossi, L., Kiringa, I.: Matching dependencies with arbitrary attribute values: semantics, query answering and integrity constraints. In: Proc. of the International Workshop on Logic in Databases (LID’11). ACM Press, New York (2011)

  26. Greco, G., Greco, S., Zumpano, E.: A logical framework for querying and repairing inconsistent databases. IEEE Trans. Knowl. Data Eng. 15(6), 1389–1408 (2003)

  27. Gunter, C.A., Scott, D.S.: Semantic domains. In: Handbook of Theoretical Computer Science, vol. B, Chap. 12. Elsevier, Amsterdam (1990)

  28. Hernández, M., Stolfo, S.: The merge/purge problem for large databases. In: Proc. ACM SIGMOD, pp. 127–138 (1995)

  29. Imielinski, T., Lipski, W. Jr.: Incomplete information in relational databases. J. ACM 31(4), 761–791 (1984)

  30. Jagadish, H., Mendelzon, A., Milo, T.: Similarity-based queries. In: Proc. ACM PODS, pp. 36–45 (1995)

  31. Kifer, M., Lausen, G.: F-Logic: a higher-order language for reasoning about objects, inheritance, and scheme. In: Proc. ACM SIGMOD, pp. 134–146 (1989)

  32. Koudas, N., Li, Ch., Tung, A., Vernica, R.: Relaxing join and selection queries. In: Proc. VLDB, pp. 199–210 (2006)

  33. Levene, M., Loizou, G.: Database design of incomplete relations. ACM Trans. Database Syst. 24, 35–68 (1999)

  34. Libkin, L.: A semantics-based approach to design of query languages for partial information. In: Semantics in Databases. LNCS, vol. 1358, pp. 170–208. Springer, Berlin (1998)

  35. Libkin, L.: Data exchange and incomplete information. In: Proc. ACM PODS, pp. 60–69 (2006)

  36. Lipski, W. Jr.: On semantic issues connected with incomplete information databases. ACM Trans. Database Syst. 4(3), 262–296 (1979)

  37. Naumann, F., Herschel, M.: An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. Morgan & Claypool, San Rafael (2010)

  38. Ng, W., Levene, M., Fenner, T.: On the expressive power of the relational algebra with partially ordered domains. Int. J. Comput. Math. 71, 53–62 (2000)

  39. Saïs, F., Pernelle, N., Rousset, M.-C.: L2R: a logical method for reference reconciliation. In: Proc. AAAI, pp. 329–334 (2007)

Acknowledgements

This work was supported by the NSERC Strategic Network on Business Intelligence (BIN ADC01, Years 1 and 2; BIN ADC05, Year 3) and by NSERC/IBM CRDPJ/371084-2008; this support is gratefully acknowledged. L. Bertossi is a Faculty Fellow of the IBM Center for Advanced Studies.

Author information

Corresponding author

Correspondence to Leopoldo Bertossi.

About this article

Cite this article

Bertossi, L., Kolahi, S. & Lakshmanan, L.V.S. Data Cleaning and Query Answering with Matching Dependencies and Matching Functions. Theory Comput Syst 52, 441–482 (2013). https://doi.org/10.1007/s00224-012-9402-7

  • DOI: https://doi.org/10.1007/s00224-012-9402-7
