
Theory of Computing Systems, Volume 52, Issue 3, pp 441–482

Data Cleaning and Query Answering with Matching Dependencies and Matching Functions

  • Leopoldo Bertossi
  • Solmaz Kolahi
  • Laks V. S. Lakshmanan

Abstract

Matching dependencies were recently introduced as declarative rules for data cleaning and entity resolution. Enforcing a matching dependency on a database instance identifies the values of some attributes for two tuples, provided that the values of some other attributes are sufficiently similar. Assuming the existence of matching functions for making two attribute values equal, we formally introduce the process of cleaning an instance using matching dependencies, as a chase-like procedure. We show that matching functions naturally introduce a lattice structure on attribute domains, and a partial order of semantic domination between instances. Using the latter, we define the semantics of clean query answering in terms of certain/possible answers as the greatest lower bound/least upper bound of all possible answers obtained from the clean instances. We show that clean query answering is intractable in general. Then we study queries that behave monotonically with respect to the semantic domination order, and show that we can provide under- and over-approximations for clean answers to monotone queries. Moreover, non-monotone positive queries can be relaxed into monotone queries.
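The cleaning process described in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration, not the paper's own formalization: attribute values are modeled as sets of strings, the hypothetical matching function `match` is set union (which is idempotent, commutative and associative, so it induces an order on values), `similar` is a toy similarity test on names, and the single matching dependency reads "if two tuples have similar names, identify their addresses". The function `enforce` applies that rule repeatedly, in one fixed order, until nothing changes.

```python
from itertools import combinations


def match(a: frozenset, b: frozenset) -> frozenset:
    """Hypothetical matching function: merge two values by set union."""
    return a | b


def dominated(a: frozenset, b: frozenset) -> bool:
    """Order induced by the matching function: a is below b iff match(a, b) == b."""
    return match(a, b) == b


def similar(x: str, y: str) -> bool:
    """Toy similarity test (an assumption): equal up to case and periods."""
    def norm(s: str) -> str:
        return s.lower().replace(".", "").strip()
    return norm(x) == norm(y)


def enforce(instance):
    """Chase-like loop: while two tuples have similar names but different
    addresses, replace both addresses with their match."""
    rows = [dict(r) for r in instance]  # work on a copy of the instance
    changed = True
    while changed:
        changed = False
        for r, s in combinations(rows, 2):
            if similar(r["name"], s["name"]) and r["addr"] != s["addr"]:
                merged = match(r["addr"], s["addr"])
                r["addr"] = merged
                s["addr"] = merged
                changed = True
    return rows


dirty = [
    {"name": "John Doe", "addr": frozenset({"Main St."})},
    {"name": "john doe", "addr": frozenset({"Ottawa"})},
    {"name": "Jane Roe", "addr": frozenset({"25 Elm"})},
]
clean = enforce(dirty)
```

In this sketch the loop terminates because each step only moves address values upward in a finite order. With a similarity relation that is not transitive, different application orders could lead to different results, which is why the abstract speaks of several clean instances; the sketch follows only one order.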

Keywords

Databases, Data cleaning, Matching dependency, Entity resolution, Matching function, Semantic domination, Lattice, Certain answer, Possible answer, Query relaxation

Notes

Acknowledgements

This work was supported by NSERC Strategic Network on Business Intelligence (BIN ADC01, Years 1 and 2) and (BIN ADC05, Year 3); and NSERC/IBM CRDPJ/371084-2008, which is gratefully acknowledged. L. Bertossi is a Faculty Fellow of the IBM Center for Advanced Studies.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Leopoldo Bertossi (1)
  • Solmaz Kolahi (2)
  • Laks V. S. Lakshmanan (2)

  1. Carleton University, Ottawa, Canada
  2. University of British Columbia, Vancouver, Canada
