Entity Matching with String Transformation and Similarity-Based Features

Part of the Communications in Computer and Information Science book series (CCIS, volume 1457)

Abstract

Entity matching, the task of determining whether two records refer to the same real-world entity, is important in common data cleaning and data integration problems. Many studies use string similarity as a feature for entity matching, but its discriminative power can be weakened by hard-to-classify pairs: different entities with high similarity, or the same entity with low similarity. String transformation is a good way to reconcile differing representations between two dataset domains, such as abbreviations, misspellings, and other variant expressions.

In this paper, we propose two powerful features, similarity gain and dissimilarity gain, that enable us to discriminate whether two records refer to the same entity after string transformation. The similarity gain is defined as the maximum increase in similarity among the variations in similarity before and after applying string transformations. The dissimilarity gain is defined as the maximum decrease in similarity. Moreover, the similarity gain and dissimilarity gain can also be used to select valuable samples under a limited labeling budget. Extensive experiments are conducted, and our method with the proposed features improves on the best accuracy in most cases.
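The two definitions above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the similarity measure (`difflib.SequenceMatcher`), the transformation functions, the choice to apply each transformation to both strings, and all function names are assumptions made for illustration only. The identity transformation is included, so both gains are at least zero.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple

Transform = Callable[[str], str]


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (an assumption, not the paper's measure)."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical transformation set; the paper's actual transformations are not given here.
def identity(s: str) -> str:
    return s


def lowercase(s: str) -> str:
    return s.lower()


def strip_punctuation(s: str) -> str:
    return re.sub(r"[^\w\s]", "", s)


def remove_legal_suffix(s: str) -> str:
    # Drop tokens such as "Inc" or "Corp." (illustrative list only).
    suffixes = {"inc", "corp", "ltd"}
    kept = [t for t in s.split() if t.lower().strip(".") not in suffixes]
    return " ".join(kept)


TRANSFORMS: Sequence[Transform] = (
    identity,
    lowercase,
    strip_punctuation,
    remove_legal_suffix,
)


def gains(a: str, b: str, transforms: Sequence[Transform] = TRANSFORMS) -> Tuple[float, float]:
    """Return (similarity gain, dissimilarity gain) for one record pair.

    Each transformation is applied to both strings (an assumption), and the
    change in similarity relative to the untransformed pair is recorded.
    """
    base = similarity(a, b)
    deltas = [similarity(f(a), f(b)) - base for f in transforms]
    similarity_gain = max(0.0, max(deltas, default=0.0))      # largest increase
    dissimilarity_gain = max(0.0, -min(deltas, default=0.0))  # largest decrease
    return similarity_gain, dissimilarity_gain


if __name__ == "__main__":
    for pair in [("IBM Corp.", "ibm corp"), ("Apple Inc", "Maple Inc")]:
        print(pair, gains(*pair))
```

In this sketch, a pair such as "IBM Corp." and "ibm corp" gets a large similarity gain because lowercasing aligns the two strings, while "Apple Inc" and "Maple Inc" gets a positive dissimilarity gain because removing the shared suffix exposes the differing names. Both values could then be appended to the usual similarity features of a supervised matcher.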

Keywords

  • Entity matching
  • Entity resolution
  • Supervised learning
  • String similarity
  • String transformation
  • Feature engineering

Chapter
  • DOI: 10.1007/978-3-030-93849-9_5
  • Chapter length: 12 pages

Notes

  1. F can also include a function \(f_0\) that does not transform anything.

Author information

Corresponding author

Correspondence to Yuyang Dong.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Sakai, K., Dong, Y., Oyamada, M., Takeoka, K., Okadome, T. (2022). Entity Matching with String Transformation and Similarity-Based Features. In: Fletcher, G., Nakano, K., Sasaki, Y. (eds) Software Foundations for Data Interoperability. SFDI 2021. Communications in Computer and Information Science, vol 1457. Springer, Cham. https://doi.org/10.1007/978-3-030-93849-9_5

  • DOI: https://doi.org/10.1007/978-3-030-93849-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93848-2

  • Online ISBN: 978-3-030-93849-9

  • eBook Packages: Computer Science, Computer Science (R0)