Abstract
A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are guaranteed correct, and worse still, may even introduce new errors when attempting to repair the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We also develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. Furthermore, we present a framework and an algorithm to find certain fixes, by interacting with the users to ensure that one of the certain regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
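The interplay of certain regions, master data, and editing rules can be illustrated with a small sketch. It is a simplified reading of the abstract, not the algorithm of the paper: the rule shape (`EditingRule` with `match` and `target` fields), the function `apply_rules`, and the sample attributes `zip`, `city`, and `area_code` are all hypothetical names chosen for illustration. A rule fires only when every attribute it matches on is already in the certain region, and a fix is applied only when the master data yields exactly one candidate value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditingRule:
    # Hypothetical rule shape: if the input tuple agrees with a master tuple
    # on every (input_attr, master_attr) pair in `match`, copy the master
    # value of target[1] into input attribute target[0].
    match: tuple
    target: tuple


def apply_rules(record, certain, rules, master):
    """Return (fixed_record, certain_attrs) after applying rules to a fixpoint."""
    record, certain = dict(record), set(certain)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            attr, master_attr = rule.target
            # Only attributes already assured correct may drive a fix.
            if attr in certain or any(a not in certain for a, _ in rule.match):
                continue
            values = {
                m[master_attr]
                for m in master
                if all(record[a] == m[b] for a, b in rule.match)
            }
            # Apply the fix only when master data gives exactly one value.
            if len(values) == 1:
                record[attr] = values.pop()
                certain.add(attr)
                changed = True
    return record, certain


# Made-up master data and rules, for illustration only.
master = [
    {"zip": "EH8 9AB", "city": "Edinburgh", "area_code": "131"},
    {"zip": "NW1 6XE", "city": "London", "area_code": "020"},
]
rules = [
    EditingRule(match=(("zip", "zip"),), target=("city", "city")),
    EditingRule(match=(("city", "city"),), target=("area_code", "area_code")),
]
record = {"zip": "EH8 9AB", "city": "London", "area_code": "020"}

# The user assures that only `zip` is correct.
fixed, certain = apply_rules(record, {"zip"}, rules, master)
print(fixed)
print(sorted(certain))
```

In this sketch the certain region grows as fixes are applied: once `city` is corrected from the master data it becomes trusted and can in turn drive the fix of `area_code`. If the user-assured region were empty, or if two master tuples disagreed on the target value, no rule would fire and the tuple would be left unchanged.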
Cite this article
Fan, W., Li, J., Ma, S. et al. Towards certain fixes with editing rules and master data. The VLDB Journal 21, 213–238 (2012). https://doi.org/10.1007/s00778-011-0253-7
DOI: https://doi.org/10.1007/s00778-011-0253-7