
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

  • Zeinab Bahmani
  • Leopoldo Bertossi
  • Nikolaos Vasiloglou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9310)

Abstract

Entity resolution (ER), an important and common data cleaning problem, is about detecting duplicate representations of the same external entities in data and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) classifiers for duplicate/non-duplicate record pairs, built using machine learning (ML) techniques; (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) the declarative language LogiQL, an extended form of Datalog supported by the LogicBlox platform, for data processing and for the specification and enforcement of MDs.
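
To make the three components concrete, the following is a minimal Python sketch of a block, classify, and merge pipeline. It is an illustration under stated assumptions, not the ERBlox implementation: the paper uses LogiQL and ML-trained classifiers, whereas here the classifier is replaced by a plain string-similarity test, and the names `similar`, `block_key`, `is_duplicate`, `merge_values`, and `resolve`, the 0.8 threshold, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    # Stand-in similarity test; the 0.8 threshold is an arbitrary assumption.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def block_key(record: dict) -> str:
    # Hypothetical blocking key: first character of the name, lowercased.
    return record["name"][:1].lower()


def is_duplicate(r1: dict, r2: dict) -> bool:
    # Placeholder for a trained classifier (e.g. an SVM over similarity
    # features); here it is just a name-similarity check.
    return similar(r1["name"], r2["name"])


def merge_values(a: str, b: str) -> str:
    # Hypothetical matching function: keep the longer value.
    return a if len(a) >= len(b) else b


def resolve(records: list[dict]) -> list[dict]:
    # 1. Blocking: group copies of the records by blocking key.
    blocks: dict[str, list[dict]] = {}
    for record in records:
        blocks.setdefault(block_key(record), []).append(dict(record))

    resolved: list[dict] = []
    for block in blocks.values():
        # 2. Classification: only pairs inside the same block are compared.
        for r1, r2 in combinations(block, 2):
            if is_duplicate(r1, r2):
                # 3. MD-style merge: both records receive the same merged
                #    value for every attribute (single pass, no fixpoint).
                for attr in r1:
                    r1[attr] = r2[attr] = merge_values(r1[attr], r2[attr])
        # Collapse records that became identical after merging.
        for record in block:
            if record not in resolved:
                resolved.append(record)
    return resolved


# Made-up sample data for illustration only.
records = [
    {"name": "Jon Smith", "affiliation": "Carleton"},
    {"name": "John Smith", "affiliation": "Carleton University"},
    {"name": "Mary Jones", "affiliation": "LogicBlox"},
]
result = resolve(records)
```

In this sketch the two Smith records fall into the same block, are judged duplicates, and are merged attribute by attribute, while the Jones record is never compared with them. A single pass over the pairs is a simplification; enforcing MDs in general may require repeating the merge until no further changes occur.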

Keywords

Entity resolution, Matching dependencies, Support-vector machines, Classification, Datalog

Notes

Acknowledgments

Part of this research was funded by an NSERC Discovery grant and the NSERC Strategic Network on Business Intelligence (BIN). Z. Bahmani and L. Bertossi are very grateful for the support from LogicBlox during their internship and sabbatical visit.


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Zeinab Bahmani (1)
  • Leopoldo Bertossi (1), corresponding author
  • Nikolaos Vasiloglou (2)

  1. Carleton University, School of Computer Science, Ottawa, Canada
  2. LogicBlox Inc., Atlanta, USA
