The VLDB Journal, Volume 21, Issue 2, pp 213–238

Towards certain fixes with editing rules and master data

Special Issue Paper

Abstract

A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are guaranteed correct, and worse still, may even introduce new errors when attempting to repair the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We also develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. Furthermore, we present a framework and an algorithm to find certain fixes, by interacting with the users to ensure that one of the certain regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
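The idea of fixing attributes from master data, starting from a certain region, can be illustrated with a minimal sketch. The abstract does not give the formal definition of an editing rule, so everything here is an assumption made for illustration: the `EditingRule` class, the `apply_rules` function, the simplified rule form (match some input attributes against master attributes, then copy one master value), and the sample address data are all hypothetical and are not the authors' algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditingRule:
    """Hypothetical, simplified editing rule (an assumption, not the paper's syntax).

    match:  pairs (input_attr, master_attr) that must agree
    target: pair (input_attr, master_attr); the input attribute is
            overwritten with the master value when the rule fires
    """
    match: tuple
    target: tuple


def apply_rules(t, validated, master, rules):
    """Apply rules until no more attributes can be fixed.

    A rule fires only if every attribute it matches on is already
    validated (initially: the certain region) and all matching master
    tuples agree on a single value for the target attribute.
    """
    t = dict(t)                 # work on a copy of the input tuple
    validated = set(validated)  # attributes currently assured correct
    changed = True
    while changed:
        changed = False
        for rule in rules:
            b, bm = rule.target
            if b in validated:
                continue        # already assured correct, never overwrite
            if not all(a in validated for a, _ in rule.match):
                continue        # rule relies on attributes not yet trusted
            candidates = {
                s[bm]
                for s in master
                if all(s[am] == t[a] for a, am in rule.match)
            }
            if len(candidates) == 1:  # unambiguous value from master data
                t[b] = candidates.pop()
                validated.add(b)
                changed = True
    return t, validated


# Illustrative data only (made up for this sketch).
master = [{"zip": "EH8 9AB", "city": "Edinburgh", "AC": "131"}]
dirty = {"zip": "EH8 9AB", "city": "London", "AC": "020"}
rules = [
    EditingRule(match=(("city", "city"),), target=("AC", "AC")),
    EditingRule(match=(("zip", "zip"),), target=("city", "city")),
]

# The user assures that `zip` is correct: this is the certain region.
fixed, trusted = apply_rules(dirty, {"zip"}, master, rules)
print(fixed)
print(sorted(trusted))
```

In this sketch the first rule cannot fire until the second has validated `city`, which shows why the choice of certain region determines whether all attributes of a tuple can be fixed. Requiring exactly one candidate value is one simple way to avoid ambiguous updates; the paper's own analysis of unique fixes is more general than this check.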

Keywords

Certain fix, Editing rule, Master data, Data cleaning, Data quality

Supplementary material

778_2011_253_MOESM1_ESM.pdf (62 kb)
ESM 1 (PDF 63 kb)

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  1. University of Edinburgh, Edinburgh, UK
  2. Harbin Institute of Technology, Harbin, China
  3. Beihang University, Beijing, China
