Journal of Intelligent Information Systems, Volume 43, Issue 1, pp 101–127

Discriminative and deterministic approaches towards entity resolution

  • Byung-Won On
  • Ingyu Lee
  • Gyu Sang Choi
  • Ho-Sik Park
Article

Abstract

To address the entity resolution problem, existing studies usually proceed in two steps. Given two lists of records, the first step selects a small set of likely duplicate records (a candidate set) using index structures and algorithms designed for efficient entity resolution. In the second step, a given similarity function is applied to quantify the similarity of the records in the candidate set. For real applications, however, selecting appropriate indexing techniques and similarity functions is a non-trivial task. In this paper, we tackle the problem of identifying indexing techniques and similarity functions using both discriminative and deterministic approaches that select the best indexing techniques and similarity measures. According to our experimental results, our proposed solution, which considers both discriminative and deterministic approaches, achieves an average accuracy of more than 90% within hundreds of seconds.
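
The two-step process described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the blocking key (the first three characters of a record), the similarity function (Jaccard overlap of character bigrams), the threshold of 0.5, and every function name below are assumptions chosen only to make the candidate-selection and scoring steps concrete.

```python
from collections import defaultdict


def blocking_key(record):
    """Hypothetical blocking key: first three characters, lowercased."""
    return record.strip().lower()[:3]


def bigrams(text):
    """Set of character 2-grams of the lowercased text."""
    s = text.strip().lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard similarity of the two records' bigram sets (assumed measure)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def candidate_pairs(left, right):
    """Step 1: index the right list by blocking key and pair only
    records that fall into the same block."""
    index = defaultdict(list)
    for r in right:
        index[blocking_key(r)].append(r)
    for l in left:
        for r in index.get(blocking_key(l), []):
            yield l, r


def resolve(left, right, threshold=0.5):
    """Step 2: score each candidate pair and keep those at or above
    the (assumed) similarity threshold."""
    matches = []
    for l, r in candidate_pairs(left, right):
        score = jaccard(l, r)
        if score >= threshold:
            matches.append((l, r, score))
    return matches


# Illustrative data only.
left = ["Byung-Won On", "Ingyu Lee"]
right = ["Byung Won On", "Ingyu Lee", "Ho-Sik Park"]
matches = resolve(left, right)
```

In this sketch only two of the six possible record pairs are ever scored, which is the point of the first step. Which key and which similarity function to use is exactly the selection problem the paper addresses.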

Keywords

  • Entity resolution
  • Approximate string matching
  • Similarities
  • Support vector machines
  • Blocking techniques

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Byung-Won On (1)
  • Ingyu Lee (1)
  • Gyu Sang Choi (2)
  • Ho-Sik Park (3)
  1. Advanced Institutes of Convergence Technology, Seoul National University, Gyeonggi-do, Korea
  2. Department of Information and Communication Engineering, Yeungnam University, Gyeongsangbuk, Korea
  3. Division of Information and Computer Engineering, Ajou University, Gyeonggi-do, Korea
