Towards certain fixes with editing rules and master data

  • Special Issue Paper
The VLDB Journal

Abstract

A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are guaranteed correct, and worse still, may even introduce new errors when attempting to repair the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We also develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. Furthermore, we present a framework and an algorithm to find certain fixes, by interacting with the users to ensure that one of the certain regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
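The interplay of certain regions, master data, and editing rules described above can be illustrated with a minimal sketch. It is a simplified reading of the abstract, not the paper's algorithm: the names `EditingRule` and `certain_fix`, the sample attributes, and the sample master tuples are all hypothetical, and any further conditions the full method may place on rules are omitted. A rule fires only when every attribute it matches on is already validated, and only when the matching master tuples agree on a single value, so that the fix is unique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditingRule:
    # Pairs (input attribute, master attribute) that must agree.
    match: tuple
    # (input attribute to fix, master attribute to copy the value from).
    target: tuple


def certain_fix(record, validated, rules, master):
    """Apply rules until no more attributes can be fixed.

    `validated` plays the role of the certain region: attributes the
    user has assured to be correct. Hypothetical helper for illustration.
    """
    fixed = dict(record)
    validated = set(validated)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            attr, master_attr = rule.target
            if attr in validated:
                continue  # already known to be correct
            if not all(a in validated for a, _ in rule.match):
                continue  # rule relies on attributes not yet assured
            candidates = {
                m[master_attr]
                for m in master
                if all(fixed.get(a) == m[ma] for a, ma in rule.match)
            }
            if len(candidates) == 1:  # unique value, so the fix is certain
                fixed[attr] = candidates.pop()
                validated.add(attr)
                changed = True
    return fixed, validated


# Hypothetical master data and input tuple.
master = [
    {"zip": "EH7 4AH", "city": "Edi", "AC": "131"},
    {"zip": "NW1 6XE", "city": "Ldn", "AC": "020"},
]
rules = [
    EditingRule(match=(("zip", "zip"),), target=("city", "city")),
    EditingRule(match=(("city", "city"),), target=("AC", "AC")),
]
record = {"zip": "EH7 4AH", "city": "Ldn", "AC": "020"}

fixed, validated = certain_fix(record, {"zip"}, rules, master)
print(fixed)
print(sorted(validated))
```

In this sketch the user vouches only for `zip`. The first rule then corrects `city` from the master data, which in turn lets the second rule correct `AC`, so the whole tuple ends up validated. If two master tuples shared the same `zip` but disagreed on `city`, no rule would fire and the tuple would be left unchanged rather than risk introducing a new error.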

Author information

Corresponding authors

Correspondence to Jianzhong Li or Shuai Ma.

Electronic Supplementary Material

Below is the Electronic Supplementary Material.

ESM 1 (PDF 63 kb)

About this article

Cite this article

Fan, W., Li, J., Ma, S. et al. Towards certain fixes with editing rules and master data. The VLDB Journal 21, 213–238 (2012). https://doi.org/10.1007/s00778-011-0253-7

  • DOI: https://doi.org/10.1007/s00778-011-0253-7
