Abstract
Matching dependencies were recently introduced as declarative rules for data cleaning and entity resolution. Enforcing a matching dependency on a database instance identifies the values of some attributes for two tuples, provided that the values of some other attributes are sufficiently similar. Assuming the existence of matching functions for making two attribute values equal, we formally introduce the process of cleaning an instance using matching dependencies, as a chase-like procedure. We show that matching functions naturally introduce a lattice structure on attribute domains, and a partial order of semantic domination between instances. Using the latter, we define the semantics of clean query answering in terms of certain/possible answers as the greatest lower bound/least upper bound of all possible answers obtained from the clean instances. We show that clean query answering is intractable in general. Then we study queries that behave monotonically w.r.t. semantic domination order, and show that we can provide an under/over approximation for clean answers to monotone queries. Moreover, non-monotone positive queries can be relaxed into monotone queries.
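To make the chase-like procedure concrete, the following is a minimal sketch under stated assumptions, not the procedure defined in the article. It assumes that attribute values are finite sets, that the matching function is set union (the name `match`), and that two values are similar when they share an element (the name `similar`); these choices and all function names are hypothetical. The order induced by the matching function is taken to be a ⪯ b iff match(a, b) = b. The sketch enforces a single matching dependency in a fixed order, so it produces only one clean instance, whereas different enforcement orders may lead to different clean instances.

```python
from itertools import combinations


def match(a: frozenset, b: frozenset) -> frozenset:
    """Hypothetical matching function: merge two values by set union.

    Union is idempotent, commutative and associative.
    """
    return a | b


def dominated(a: frozenset, b: frozenset) -> bool:
    """Order induced by the matching function: a <= b iff match(a, b) == b."""
    return match(a, b) == b


def similar(a: frozenset, b: frozenset) -> bool:
    """Hypothetical similarity relation: the values share at least one element."""
    return bool(a & b)


def chase(instance, sim_attr, match_attr):
    """Enforce one MD until no pair of tuples violates it.

    MD: if t1[sim_attr] is similar to t2[sim_attr], then t1[match_attr]
    and t2[match_attr] are made equal via the matching function.
    """
    tuples = [dict(t) for t in instance]  # work on a copy
    changed = True
    while changed:
        changed = False
        for t1, t2 in combinations(tuples, 2):
            if similar(t1[sim_attr], t2[sim_attr]) and t1[match_attr] != t2[match_attr]:
                merged = match(t1[match_attr], t2[match_attr])
                t1[match_attr] = t2[match_attr] = merged
                changed = True
    return tuples


# Illustrative data (invented for this sketch).
instance = [
    {"name": frozenset({"J. Doe"}), "addr": frozenset({"Main St."})},
    {"name": frozenset({"J. Doe", "John Doe"}), "addr": frozenset({"25 Main St."})},
    {"name": frozenset({"Jane Roe"}), "addr": frozenset({"Elm St."})},
]
clean = chase(instance, "name", "addr")
```

In this sketch every enforcement step replaces a value by one that dominates it, so each tuple of the resulting instance dominates the corresponding original tuple attribute by attribute, and the loop terminates because values can only grow within a finite universe.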
Notes
All the variables in X_i, Y_j are implicitly universally quantified in front of the formula.
Matching of boolean attributes requires the existence of the top element ⊤.
Remember that comparable attributes share the similarity relation and matching function (and lattice structure).
For the same purpose we could also use the classic four-value lattice: ⊥⪯false,true⪯⊤, like the one in Example 3(c).
Finiteness is shown for the case when match and merge have the representativity property (equivalent to being similarity preserving) in addition to other properties. However, the proof in [9] can be modified so that representativity is not necessary.
Notice that the single MD does not form an interaction-free set of MDs.
We use the superscript s, for Swoosh, to distinguish them from the properties listed in Sect. 3.
In our MD framework the sets of MDs provide a logical specification, and the semantics is model-theoretic, as captured by the clean instances. Admittedly, the latter have a procedural component.
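The notes on boolean attributes and the four-valued lattice ⊥ ⪯ false, true ⪯ ⊤ can be illustrated with a short sketch. It assumes that the matching function on this domain is the least upper bound in the lattice; the identifiers `leq` and `join` and the string encoding of the elements are hypothetical and chosen only for this illustration.

```python
# Elements of the four-valued lattice, encoded as plain strings.
BOTTOM, FALSE, TRUE, TOP = "bot", "false", "true", "top"
ELEMENTS = [BOTTOM, FALSE, TRUE, TOP]

# Strict order pairs (a, b) meaning a < b; reflexivity is handled in leq.
STRICT = {
    (BOTTOM, FALSE), (BOTTOM, TRUE), (BOTTOM, TOP),
    (FALSE, TOP), (TRUE, TOP),
}


def leq(a: str, b: str) -> bool:
    """Partial order of the lattice: a <= b."""
    return a == b or (a, b) in STRICT


def join(a: str, b: str) -> str:
    """Least upper bound, used here as the assumed matching function."""
    uppers = [c for c in ELEMENTS if leq(a, c) and leq(b, c)]
    # The least upper bound is the upper bound that lies below all others.
    return next(c for c in uppers if all(leq(c, d) for d in uppers))
```

Under these assumptions, matching the two distinct boolean values gives `join(FALSE, TRUE) == TOP`, which is why a top element is needed, while `join(BOTTOM, TRUE) == TRUE` leaves a known value unchanged.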
References
Abiteboul, S., Kanellakis, P.C., Grahne, G.: On the representation and querying of sets of possible worlds. Theor. Comput. Sci. 78(1), 158–187 (1991)
Antoniou, G., van Harmelen, F.: A Semantic Web Primer, 2nd edn. The MIT Press, Cambridge (2008)
Afrati, F., Kolaitis, Ph.: Answering aggregate queries in data exchange. In: Proc. ACM PODS, pp. 129–138 (2008)
Arasu, A., Ré, Ch., Suciu, D.: Large-scale deduplication with constraints using Dedupalog. In: Proc. ICDE, pp. 952–963 (2009)
Arenas, M., Bertossi, L., Chomicki, J.: Consistent query answers in inconsistent databases. In: Proc. ACM PODS, pp. 68–79 (1999)
Arenas, M., Bertossi, L., Chomicki, J.: Answer sets for consistent query answering in inconsistent databases. Theory Pract. Log. Program. 3(4–5), 393–424 (2003)
Arenas, M., Bertossi, L., Chomicki, J., He, X., Raghavan, V., Spinrad, J.: Scalar aggregation in inconsistent databases. Theor. Comput. Sci. 296(3), 405–434 (2003)
Bancilhon, F., Khoshafian, S.: A calculus for complex objects. J. Comput. Syst. Sci. 38(2), 326–340 (1989)
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)
Bertossi, L.: Consistent query answering in databases. SIGMOD Rec. 35(2), 68–76 (2006)
Bertossi, L.: Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool, San Rafael (2011)
Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1) (2008)
Buneman, P., Jung, A., Ohori, A.: Using powerdomains to generalize relational databases. Theor. Comput. Sci. 91(1), 23–55 (1991)
Caniupan, M., Bertossi, L.: The consistency extractor system: answer set programs for consistent query answering in databases. Data Knowl. Eng. 69(6), 545–572 (2010)
Chomicki, J.: Consistent query answering: five easy pieces. In: Proc. ICDT. LNCS, vol. 4353, pp. 1–17. Springer, Berlin (2007)
Eiter, T., Fink, M., Greco, G., Lembo, D.: Repair localization for query answering from inconsistent databases. ACM Trans. Database Syst. 33(2) (2008)
Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Whang, S.E., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with negative rules. VLDB J. 18(6), 1261–1277 (2009)
Fan, W.: Dependencies revisited for improving data quality. In: Proc. ACM PODS, pp. 159–170 (2008)
Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. In: Proc. VLDB, vol. 2, pp. 407–418 (2009)
Fan, W., Li, J., Ma, Sh., Tang, N., Yu, W.: Interaction between record matching and data repairing. In: Proc. ACM SIGMOD, pp. 469–480 (2011)
Fuxman, A., Fazli, E., Miller, R.: ConQuer: efficient management of inconsistent databases. In: Proc. ACM SIGMOD, pp. 155–166 (2005)
Gaasterland, T., Godfrey, P., Minker, J.: Relaxation as a platform for cooperative answering. J. Intell. Inf. Syst. 1(3/4), 293–321 (1992)
Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.-A.: Declarative data cleaning: language, model, and algorithms. In: Proc. VLDB, pp. 371–380 (2001)
Gardezi, J., Bertossi, L., Kiringa, I.: Matching dependencies with arbitrary attribute values: semantics, query answering and integrity constraints. In: Proc. of the International Workshop on Logic in Databases (LID’11). ACM Press, New York (2011)
Greco, G., Greco, S., Zumpano, E.: A logical framework for querying and repairing inconsistent databases. IEEE Trans. Knowl. Data Eng. 15(6), 1389–1408 (2003)
Gunter, C.A., Scott, D.S.: Semantic domains. In: Handbook of Theoretical Computer Science, vol. B, Chap. 12. Elsevier, Amsterdam (1990)
Hernández, M., Stolfo, S.: The merge/purge problem for large databases. In: Proc. ACM SIGMOD, pp. 127–138 (1995)
Imielinski, T., Lipski, W. Jr.: Incomplete information in relational databases. J. ACM 31(4), 761–791 (1984)
Jagadish, H., Mendelzon, A., Milo, T.: Similarity-based queries. In: Proc. ACM PODS, pp. 36–45 (1995)
Kifer, M., Lausen, G.: F-Logic: a higher-order language for reasoning about objects, inheritance, and scheme. In: Proc. ACM SIGMOD, pp. 134–146 (1989)
Koudas, N., Li, Ch., Tung, A., Vernica, R.: Relaxing join and selection queries. In: Proc. VLDB, pp. 199–210 (2006)
Levene, M., Loizou, G.: Database design of incomplete relations. ACM Trans. Database Syst. 24, 35–68 (1999)
Libkin, L.: A semantics-based approach to design of query languages for partial information. In: Semantics in Databases. LNCS, vol. 1358, pp. 170–208. Springer, Berlin (1998)
Libkin, L.: Data exchange and incomplete information. In: Proc. ACM PODS, pp. 60–69 (2006)
Lipski, W. Jr.: On semantic issues connected with incomplete information databases. ACM Trans. Database Syst. 4(3), 262–296 (1979)
Naumann, F., Herschel, M.: An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. Morgan & Claypool, San Rafael (2010)
Ng, W., Levene, M., Fenner, T.: On the expressive power of the relational algebra with partially ordered domains. Int. J. Comput. Math. 71, 53–62 (2000)
Saïs, F., Pernelle, N., Rousset, M.-C.: L2R: a logical method for reference reconciliation. In: Proc. AAAI, pp. 329–334 (2007)
Acknowledgements
This work was supported by NSERC Strategic Network on Business Intelligence (BIN ADC01, Years 1 and 2) and (BIN ADC05, Year 3); and NSERC/IBM CRDPJ/371084-2008, which is gratefully acknowledged. L. Bertossi is a Faculty Fellow of the IBM Center for Advanced Studies.
Cite this article
Bertossi, L., Kolahi, S. & Lakshmanan, L.V.S. Data Cleaning and Query Answering with Matching Dependencies and Matching Functions. Theory Comput Syst 52, 441–482 (2013). https://doi.org/10.1007/s00224-012-9402-7
DOI: https://doi.org/10.1007/s00224-012-9402-7