EIF: A Framework of Effective Entity Identification

  • Lingli Li
  • Hongzhi Wang
  • Hong Gao
  • Jianzhong Li
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6184)

Abstract

Entity identification, the task of building correspondences between objects and entities in dirty data, plays an important role in data cleaning. Dirty data often results from confusion between entities and their names: different entities may share an identical name, and different names may refer to the same entity. The major task of entity identification is therefore to distinguish entities that share the same name and to recognize different names that refer to the same entity. However, current research focuses on only one of these two aspects and cannot solve the problem completely. To address this problem, this paper proposes EIF, a framework for entity identification that takes both kinds of confusion into account. By combining effective clustering techniques, approximate string matching algorithms, and a flexible mechanism for knowledge integration, EIF can be applied to many different kinds of entity identification problems. As an application of EIF, we solve the author identification problem. The effectiveness of the framework is verified by extensive experiments.
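
The two kinds of confusion described above can be illustrated with a small Python sketch. It is not the EIF algorithm itself, which this page does not specify. It is a toy under stated assumptions: records are hypothetical (name, coauthor set) pairs, name variants are detected with the standard library's `difflib.SequenceMatcher`, and two records are linked only when their names are similar and they share at least one coauthor. The names `similar`, `connected_components`, `identify`, and the 0.85 threshold are all illustrative choices.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold=0.85):
    """Approximate string match on names (illustrative threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def connected_components(n, edges):
    """Group node indices 0..n-1 into components with a small union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def identify(records, name_threshold=0.85):
    """Cluster records into entities.

    Assumption for this sketch: two records refer to the same entity only if
    their names approximately match AND their coauthor sets overlap.
    """
    edges = [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if similar(records[i][0], records[j][0], name_threshold)
        and records[i][1] & records[j][1]
    ]
    return connected_components(len(records), edges)


# Hypothetical toy data: (author name, set of coauthor labels).
records = [
    ("Wei Wang", {"A"}),
    ("Wei Wang", {"B"}),            # same name, no shared coauthor
    ("Wei Wang", {"A", "C"}),       # same name, shares coauthor A with record 0
    ("Alice Johnson", {"D"}),
    ("Alice Jonson", {"D", "E"}),   # name variant, shares coauthor D
]

print(identify(records))  # [[0, 2], [1], [3, 4]]
```

In this toy run, records 0 and 2 are merged while record 1 stays separate (same name, different entity), and records 3 and 4 are merged despite the spelling difference (different names, same entity).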

Keywords

entity identification, data cleaning, graph partition

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Lingli Li (1)
  • Hongzhi Wang (1)
  • Hong Gao (1)
  • Jianzhong Li (1)
  1. Department of Computer Science and Engineering, Harbin Institute of Technology, China
