The VLDB Journal

, Volume 20, Issue 4, pp 495–520 | Cite as

Dynamic constraints for record matching

  • Wenfei Fan
  • Hong Gao
  • Xibei Jia
  • Jianzhong Li
  • Shuai Ma
Regular Paper

Abstract

This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (mds) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, mds are developed for record matching, and are defined in terms of similarity predicates and a dynamic semantics. (b) We identify a special case of mds, referred to as relative candidate keys (rcks), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring mds, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. Moreover, we develop a sound and complete system for inferring mds. (d) We provide a quadratic-time algorithm for inferring mds and an effective algorithm for deducing a set of high-quality rcks from mds. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing and in addition, that the md-based techniques effectively improve the quality and efficiency of various record matching methods.

Keywords

Deduction Inference Matching dependencies Record matching Data cleaning 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Abiteboul S., Hull R., Vianu V.: Foundations of Databases. Addison-Wesley, Boston (1995)MATHGoogle Scholar
  2. 2.
    Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: VLDB (2002)Google Scholar
  3. 3.
    Sarma, J.W.A.D., Ullman, J.: Schema design for uncertain databases. In: Proceedings of the 3rd Alberto Mendelzon Workshop on Foundations of Data Management (2009)Google Scholar
  4. 4.
    Arasu, A., Chaudhuri, S., Kaushik, R.: Transformation-based framework for record matching. In: ICDE (2008)Google Scholar
  5. 5.
    Arasu, A., Re, C., Suciu, D.: Large-scale deduplication with constraints using Dedupalog. In: ICDE (2009)Google Scholar
  6. 6.
    Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: SIGMOD (2005)Google Scholar
  7. 7.
    Batini C., Scannapieco M.: Data Quality: Concepts, Methodologies and Techniques. Springer, Berlin (2006)MATHGoogle Scholar
  8. 8.
    Beeri C., Bernstein P.A.: Computational problems related to the design of normal form relational schemas. ACM Trans. Database Syst. 4(1), 30–59 (1979)CrossRefGoogle Scholar
  9. 9.
    Belohlávek, R., Vychodil, V.: Data tables with similarity relations: functional dependencies, complete rules and non-redundant bases. In: DASFAA, (2006)Google Scholar
  10. 10.
    Cautis, B., Abiteboul, S., Milo, T.: Reasoning about XML update constraints. In: PODS, (2007)Google Scholar
  11. 11.
    Chaudhuri, S., Chen, B.-C., Ganti, V., Kaushik, R.: Example-driven design of efficient record matching queries. In: VLDB (2007)Google Scholar
  12. 12.
    Chaudhuri, S., Sarma, A.D., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: SIGMOD (2007)Google Scholar
  13. 13.
    Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: KDD (2002)Google Scholar
  14. 14.
    Dhamankar, R., Lee, Y., Doan, A., Halevy, A.Y., Domingos, P.: iMAP: discovering complex mappings between database schemas. In: SIGMOD (2004)Google Scholar
  15. 15.
    Elmagarmid A.K., Ipeirotis P.G., Verykios V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)CrossRefGoogle Scholar
  16. 16.
    Fan, W.: Dependencies revisited for improving data quality. In: PODS (2008)Google Scholar
  17. 17.
    Fan, W., Geerts, F., Li, J., Xiong, M.: Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng. (2010)Google Scholar
  18. 18.
    Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. In: VLDB (2009)Google Scholar
  19. 19.
    Fellegi Ivan, Holt D.: A systematic approach to automatic edit and imputation. J. Am. Stat. Assoc. 71(353), 17–35 (1976)CrossRefGoogle Scholar
  20. 20.
    Fellegi I., Sunter A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)CrossRefGoogle Scholar
  21. 21.
    Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.: Declarative data cleaning: language, model and algorithms. In: VLDB (2001)Google Scholar
  22. 22.
    Guha, S., Koudas, N., Marathe, A., Srivastava, D.: Merging the results of approximate match operations. In: VLDB (2004)Google Scholar
  23. 23.
    Haas, L., Hernández, M., Ho, H., Popa, L., Roth, Mary: Clio grows up: from research prototype to industrial tool. In: SIGMOD (2005)Google Scholar
  24. 24.
    Hernndez, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: SIGMOD (1995)Google Scholar
  25. 25.
    Hernndez M.A., Stolfo S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Min. Knowl. Discov. 2(1), 9–37 (1998)CrossRefGoogle Scholar
  26. 26.
    Huhtala Y., Kärkkäinen J., Porkka P., Toivonen H.: TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42(2), 100–111 (1999)MATHCrossRefGoogle Scholar
  27. 27.
  28. 28.
  29. 29.
    Jaro M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa Florida. J. Am. Stat. Assoc. 89, 414–420 (1989)CrossRefGoogle Scholar
  30. 30.
    Koudas, N., Saha, A., Srivastava, D., Venkatasubramanian, S.: Metric functional dependencies. In: ICDE (2009)Google Scholar
  31. 31.
    Lim E.-P., Srivastava J., Prabhakar S., Richardson J.: Entity identification in database integration. Inf. Sci. 89(1–2), 1–38 (1996)CrossRefGoogle Scholar
  32. 32.
    Loshin D.: Master Data Management. Knowledge Integrity, Inc., New York (2009)Google Scholar
  33. 33.
    Lucchesi C.L., Osborn S.L.: Candidate keys for relations. J. Comput. Syst. Sci. 17(2), 270–279 (1978)MathSciNetMATHCrossRefGoogle Scholar
  34. 34.
    Maier D.: The Theory of Relational Databases. Computer Science Press, Rockville (1983)MATHGoogle Scholar
  35. 35.
    Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. (2001)Google Scholar
  36. 36.
    Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: KDD (2002)Google Scholar
  37. 37.
    Sauter G., Mathews B., Ostic E.: Information Service Patterns, Part 3: Data Cleansing Pattern. IBM, USA (2007)Google Scholar
  38. 38.
    Shen, W., Li, X., Doan, A.: Constraint-based entity matching. In AAAI (2005)Google Scholar
  39. 39.
    Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: PKDD (2005)Google Scholar
  40. 40.
    Sismanis, Y., Brown, P., Haas, P.J., Reinwald, B.: GORDIAN: efficient and scalable discovery of composite keys. In: VLDB (2006)Google Scholar
  41. 41.
  42. 42.
    Verykios V.S., Elmagarmid A.K., Houstis E.: Automating the approximate record-matching process. Inf. Sci. 126(1–4), 83–89 (2002)Google Scholar
  43. 43.
    Vianu V.: Dynamic functional dependencies and database aging. J. ACM 34(1), 28–59 (1987)MathSciNetCrossRefGoogle Scholar
  44. 44.
    Weis, M., Naumann, F.: DogmatiX tracks down duplicates in XML. In: SIGMOD (2005)Google Scholar
  45. 45.
    Weis, M., Naumann, F., Jehle, U., Lufter, J., Schuster, H.: Industry-scale duplicate detection. In: VLDB (2008)Google Scholar
  46. 46.
    Winkler, W.E.: Methods for record linkage and bayesian networks. Technical report RRS2002/05, U.S. Census Bureau (2002)Google Scholar
  47. 47.
    Winkler W.E.: Methods for evaluating and creating data quality. Inf. Syst. 29(7), 531–550 (2004)CrossRefGoogle Scholar
  48. 48.
    Yancey, W.: BigMatch: A program for extracting probable matches from a large file. Technical report computing 2007/01, U.S. Census Bureau (2007)Google Scholar

Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Wenfei Fan
    • 1
    • 3
  • Hong Gao
    • 3
  • Xibei Jia
    • 2
  • Jianzhong Li
    • 3
  • Shuai Ma
    • 2
  1. 1.University of EdinburghEdinburghUK
  2. 2.School of InformaticsUniversity of EdinburghEdinburghUK
  3. 3.Harbin Institute of TechnologyHarbinChina

Personalised recommendations