
EIF: A Framework of Effective Entity Identification

  • Conference paper
Web-Age Information Management (WAIM 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6184)

Abstract

Entity identification, the task of establishing correspondences between objects and entities in dirty data, plays an important role in data cleaning. Dirty data often arises from confusion between entities and their names: different entities may share the same name, and different names may refer to the same entity. The major task of entity identification is therefore to distinguish entities that share a name and to recognize different names that refer to the same entity. However, current research focuses on only one of these two aspects and cannot solve the problem completely. To address this, we propose EIF, an entity identification framework that considers both kinds of confusion. With effective clustering techniques, approximate string matching algorithms, and a flexible mechanism for knowledge integration, EIF can be applied to many different kinds of entity identification problems. As an application of EIF, we solve the author identification problem. Extensive experiments verify the effectiveness of the framework.
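
The two kinds of confusion described above can be illustrated with a minimal Python sketch. This is not the EIF algorithm, whose clustering method, similarity measure, and knowledge-integration mechanism are not specified in the abstract. The sketch assumes hypothetical publication records of the form (record id, author name, set of coauthor names), uses the `difflib` similarity ratio as a stand-in for approximate string matching, and treats a shared coauthor as evidence that two records refer to the same person. The threshold 0.85 and all names in the sample data are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a: str, b: str, threshold: float) -> bool:
    """Approximate string match on lowercased names (difflib ratio, a stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def connected_components(items, linked):
    """Group items into connected components; linked(x, y) decides whether x and y join."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if linked(items[i], items[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def identify_entities(records, name_threshold=0.85):
    """Return clusters of record ids, one cluster per inferred entity."""
    # Step 1: different names, same entity. Merge names that match approximately.
    names = sorted({name for _, name, _ in records})
    name_groups = connected_components(
        names, lambda a, b: similar(a, b, name_threshold)
    )

    clusters = []
    for group in name_groups:
        members = [r for r in records if r[1] in group]
        # Step 2: same name, different entities. Split each name group into
        # records connected by at least one shared coauthor (assumed evidence).
        for comp in connected_components(members, lambda r, s: bool(r[2] & s[2])):
            clusters.append(sorted(r[0] for r in comp))
    return sorted(clusters)


# Hypothetical records: (record id, author name, coauthors).
records = [
    (1, "John Smith", {"A. Lee"}),
    (2, "Jon Smith", {"A. Lee", "B. Kim"}),  # name variant of record 1
    (3, "John Smith", {"C. Diaz"}),          # same name, unrelated coauthors
    (4, "Mary Jones", {"C. Diaz"}),
]
print(identify_entities(records))  # prints [[1, 2], [3], [4]]
```

Records 1 and 2 are merged despite the spelling difference, while record 3 is kept apart from them even though it carries the same name as record 1. A real system would need richer evidence than coauthor overlap, which is where a knowledge-integration mechanism such as the one the abstract mentions would come in.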

Supported by the National Science Foundation of China (No. 60703012, 60773063), the NSFC-RGC of China (No. 60831160525), National Grant of Fundamental Research 973 Program of China (No. 2006CB303000), National Grant of High Technology 863 Program of China (No. 2009AA01Z149), Key Program of the National Natural Science Foundation of China (No. 60933001), National Postdoctor Foundation of China (No. 20090450126), Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.052).

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, L., Wang, H., Gao, H., Li, J. (2010). EIF: A Framework of Effective Entity Identification. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_68

  • DOI: https://doi.org/10.1007/978-3-642-14246-8_68

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14245-1

  • Online ISBN: 978-3-642-14246-8

  • eBook Packages: Computer Science (R0)
