Entity Linking: Finding Extracted Entities in a Knowledge Base

Multi-source, Multilingual Information Extraction and Summarization

Abstract

In the menagerie of tasks for information extraction, entity linking is a new beast that has recently drawn a lot of attention from NLP practitioners and researchers. Entity linking, also referred to as record linkage or entity resolution, involves aligning a textual mention of a named entity to the appropriate entry in a knowledge base, which may or may not contain the entity. It has manifold applications, ranging from linking patient health records and maintaining personal credit files to preventing identity crimes and supporting law enforcement. We discuss the key challenges of this task and present a high-performing system that links entities using max-margin ranking. We also summarize recent work in this area and describe several open research problems.
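
As a rough illustration of what linking by max-margin ranking can look like, the sketch below trains a linear scorer with pairwise hinge-loss updates so that the correct knowledge-base candidate outscores every competitor by a margin. It is not the chapter's system: the three features, the toy queries, the explicit NIL candidate (one possible way to model "the knowledge base does not contain the entity"), and the function names train_ranker and link are all assumptions made for this example, and the regularization term that a full ranking SVM would include is omitted for brevity.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# One training query: candidate feature vectors plus the index of the correct one.
Query = Tuple[List[Vector], int]


def dot(w: Vector, x: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_ranker(queries: List[Query], dim: int,
                 epochs: int = 500, lr: float = 0.1) -> List[float]:
    """Pairwise hinge-loss updates: push the correct candidate's score at
    least 1.0 above the score of every competing candidate."""
    w = [0.0] * dim
    for _ in range(epochs):
        updated = False
        for candidates, gold in queries:
            for j, feats in enumerate(candidates):
                if j == gold:
                    continue
                margin = dot(w, candidates[gold]) - dot(w, feats)
                if margin < 1.0:  # margin violated: take a subgradient step
                    for k in range(dim):
                        w[k] += lr * (candidates[gold][k] - feats[k])
                    updated = True
        if not updated:  # every pairwise margin is satisfied
            break
    return w


def link(w: Vector, candidates: List[Vector]) -> int:
    """Return the index of the highest-scoring candidate."""
    return max(range(len(candidates)), key=lambda j: dot(w, candidates[j]))


# Hypothetical features: [name similarity, context similarity, is-NIL flag].
NIL = [0.0, 0.0, 1.0]
training_queries: List[Query] = [
    ([[0.9, 0.8, 0.0], [0.9, 0.1, 0.0], NIL], 0),   # first KB entry is correct
    ([[0.3, 0.1, 0.0], NIL], 1),                    # entity is absent from the KB
    ([[0.95, 0.2, 0.0], [0.9, 0.9, 0.0], NIL], 1),  # context beats name match
]
weights = train_ranker(training_queries, dim=3)
predictions = [link(weights, cands) for cands, _ in training_queries]
```

Treating NIL as one more ranked candidate keeps the "not in the knowledge base" decision inside the same scoring function rather than requiring a separate threshold; this is a design choice of the sketch, not a claim about the chapter.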

Notes

  1. http://www.clsp.jhu.edu/~markus/fstrain

  2. www.cs.cornell.edu/people/tj/svm_light/svm_rank.html

  3. Bunescu and Pasca [13] report learning tens of thousands of support vectors with their “taxonomy” kernel, whereas a linear kernel represents all support vectors with a single weight vector, enabling faster training and prediction.

  4. We used multiple lists, including class-specific lists (i.e., for PER, ORG, and GPE) extracted from Freebase [41] and Wikipedia redirects. PER, ORG, and GPE are the commonly used entity-type labels for people, organizations, and geo-political regions, respectively.

  5. Without such a limit, the objective function may diverge for certain parameters of the model; we detect such cases and learn to avoid them during training.

  6. Data available from www.dredze.com

  7. http://en.wikipedia.org/wiki/Help:Infobox

  8. http://research.microsoft.com/en-us/um/people/silviu/WebAssistant/TestData/

  9. One of the MSNBC news articles is no longer available, so we used 759 entities in total.

  10. We removed the Google, FST, and conjunction features, which reduced system accuracy but improved runtime performance.

  11. vs. the 2006 version used in [14]. We could not obtain the 2006 version from the author or the Internet.

  12. Since our KB was derived from infoboxes, entities not having an infobox were left out.

  13. As this article went to press, we became aware of the efforts by Mayfield et al. [49] to construct a cross-language entity linking test collection where the language of the knowledge base is English, but query names are in many languages.
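
The remark in note 3 about linear kernels can be made concrete with a small sketch. With a linear kernel, the kernel expansion over support vectors equals a single dot product with a precomputed weight vector, so prediction cost no longer depends on the number of support vectors. The support vectors, coefficients, and function names below are invented for illustration and are not taken from the chapter.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Linear kernel: the plain inner product."""
    return sum(x * y for x, y in zip(a, b))


def kernel_score(support_vectors: List[Vector], coeffs: List[float],
                 x: Vector) -> float:
    """Kernel expansion: one kernel evaluation per support vector."""
    return sum(c * dot(sv, x) for sv, c in zip(support_vectors, coeffs))


def collapse(support_vectors: List[Vector], coeffs: List[float]) -> List[float]:
    """Fold all support vectors into one weight vector w = sum_i c_i * sv_i.
    This is only valid because the kernel is linear."""
    w = [0.0] * len(support_vectors[0])
    for sv, c in zip(support_vectors, coeffs):
        for k, value in enumerate(sv):
            w[k] += c * value
    return w


# Made-up support vectors and signed coefficients (alpha_i * y_i).
svs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
coeffs = [0.5, -1.0, 0.25]
x = [2.0, 1.0, 1.0]

slow = kernel_score(svs, coeffs, x)   # cost grows with the number of support vectors
fast = dot(collapse(svs, coeffs), x)  # one dot product after collapsing
```

A non-linear kernel such as the “taxonomy” kernel mentioned in the note does not admit this shortcut, which is why the number of support vectors matters there.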

References

  1. Sang, E.T.K., Meulder, F.D.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Conference on Natural Language Learning (CONLL), Edmonton. The Association for Computational Linguistics, Stroudsburg (2003)

  2. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - vol. 1, NAACL ’03, Stroudsburg, pp. 8–15. Association for Computational Linguistics, Stroudsburg (2003)

  3. McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - vol. 4, CONLL ’03, Stroudsburg, pp. 188–191. Association for Computational Linguistics, Stroudsburg (2003)

  4. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, pp. 100–110. Association for Computational Linguistics, New Brunswick (1999)

  5. Cucerzan, S., Yarowsky, D.: Language independent NER using a unified model of internal and contextual evidence. In: Proceedings of the 6th Conference on Natural Language Learning - vol. 20, COLING-02, Stroudsburg, pp. 1–4. Association for Computational Linguistics, Stroudsburg (2002)

  6. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Linguisticae Investigationes, vol. 30, pp. 3–26. John Benjamins, Amsterdam (2007)

  7. Bagga, A., Baldwin, B.: Entity-based cross-document coreferencing using the vector space model. In: Conference on Computational Linguistics (COLING), Montreal. Association for Computational Linguistics, Stroudsburg (1998)

  8. van Deemter, K., Kibble, R.: On coreferring: coreference in MUC and related annotation schemes. Comput. Linguist. 26, 629–637 (2000)

  9. Yang, X., Zhou, G., Su, J., Tan, C.L.: Coreference resolution using competition learning approach. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, pp. 176–183 (2003)

  10. Ng, V.: Supervised noun phrase coreference research: The first fifteen years. In: Proceedings of the ACL, Uppsala, pp. 1396–1411. Association for Computational Linguistics, Stroudsburg (2010)

  11. Banko, M., Etzioni, O.: The tradeoffs between open and traditional relation extraction. In: Association for Computational Linguistics, Columbus (2008)

  12. Sutton, C., McCallum, A.: Introduction to conditional random fields for relational learning. In: Getoor, L., Taskar, B. (eds.) Introduction to Statistical Relational Learning. MIT, Cambridge (2006)

  13. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: European Chapter of the Association for Computational Linguistics (EACL), Trento (2006)

  14. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Empirical Methods in Natural Language Processing (EMNLP), Prague. Association for Computational Linguistics, Stroudsburg (2007)

  15. Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Conference on Natural Language Learning (CONLL), Edmonton. Johns Hopkins University, Baltimore (2003)

  16. Artiles, J., Sekine, S., Gonzalo, J.: Web people search: results of the first evaluation and the plan for the second. In: WWW, Beijing. ACM, New York (2008)

  17. Poesio, M., Day, D., Artstein, R., Duncan, J., Eidelman, V., Giuliano, C., Hall, R., Hitzeman, J., Jern, A., Kabadjov, M., Yong, S., Keong, W., Mann, G., Moschitti, A., Ponzetto, S., Smith, J., Steinberger, J., Strube, M., Su, J., Versley, Y., Yang, X., Wick, M.: Exploiting lexical and encyclopedic resources for entity disambiguation: final report. Technical Report, JHU CLSP 2007 Summer Workshop, Johns Hopkins University, Baltimore (2008)

  18. Popescu, O.: Dynamic parameters for cross document coreference. In: Conference on Computational Linguistics (COLING), Beijing (2010)

  19. Huang, J., Treeratpituk, P., Taylor, S., Giles, C.L.: Enhancing cross document coreference of web documents with context similarity and very large scale text categorization. In: Conference on Computational Linguistics (COLING), Beijing. Association for Computational Linguistics, Stroudsburg (2010)

  20. Singh, S., Subramanya, A., Pereira, F., McCallum, A.: Large-scale cross-document coreference using distributed inference and hierarchical models. In: Association for Computational Linguistics, Portland. Association for Computational Linguistics, Stroudsburg (2011)

  21. Rao, D., McNamee, P., Dredze, M.: Streaming cross document entity coreference resolution. In: Conference on Computational Linguistics (COLING), Beijing. Association for Computational Linguistics, Stroudsburg (2010)

  22. Ng, V.: Supervised noun phrase coreference research: The first fifteen years. In: Association for Computational Linguistics, Uppsala. Association for Computational Linguistics, Stroudsburg (2010)

  23. Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., Manning, C.: A multi-pass sieve for coreference resolution. In: Empirical Methods in Natural Language Processing (EMNLP), Massachusetts. Association for Computational Linguistics, Stroudsburg (2010)

  24. Elsner, M., Charniak, E.: The same-head heuristic for coreference. In: Association for Computational Linguistics, Uppsala. Association for Computational Linguistics, Stroudsburg (2010)

  25. Stoyanov, V., Cardie, C., Gilbert, N., Riloff, E., Buttler, D., Hysom, D.: Reconcile: A coreference resolution research platform. In: Association for Computational Linguistics, Uppsala (2010)

  26. Gabbard, R., Freedman, M., Weischedel, R.: Coreference for learning to extract relations: Yes virginia, coreference matters. In: Association for Computational Linguistics, Portland. Association for Computational Linguistics, Stroudsburg (2011)

  27. McNamee, P., Dang, H.T.: Overview of the TAC 2009 knowledge base population track. In: Text Analysis Conference (TAC), Gaithersburg (2009)

  28. Ji, H., Grishman, R.: Knowledge base population: successful approaches and challenges. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-HLT), Portland. Association for Computational Linguistics, Stroudsburg (2011)

  29. Zhang, W., Su, J., Tan, C.L.: Entity linking leveraging automatically generated annotation. In: Conference on Computational Linguistics (COLING), Beijing (2010)

  30. Gottipati, S., Jiang, J.: Linking entities to a knowledge base with query expansion. In: Empirical Methods in Natural Language Processing (EMNLP), Edinburgh. Association for Computational Linguistics, Stroudsburg (2011)

  31. Han, X., Sun, L.: A generative entity-mention model for linking entities with knowledge base. In: Association for Computational Linguistics, Portland. Association for Computational Linguistics, Stroudsburg (2011)

  32. Lehmann, J., Monahan, S., Nezda, L., Jung, A., Shi, Y.: LCC approaches to knowledge base population at TAC 2010. In: Proceedings of the TAC 2010 Workshop, National Institute of Standards and Technology, Gaithersburg (2010)

  33. Zhang, W., Sim, Y., Su, J., Tan, C.: NUS-I2R: Learning a combined system for entity linking. In: Proceedings of the TAC 2010 Workshop, National Institute of Standards and Technology, Gaithersburg (2010)

  34. Zhang, W., Sim, Y.C., Su, J., Tan, C.L.: Entity linking with effective acronym expansion instance selection and topic modeling. In: International Joint Conference on Artificial Intelligence, Barcelona (2011)

  35. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19, 1–16 (2007)

  36. McCallum, A., Nigam, K., Ungar, L.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Knowledge Discovery and Data Mining (KDD), Boston. ACM, New York (2000)

  37. Joachims, T.: Optimizing search engines using clickthrough data. In: Knowledge Discovery and Data Mining (KDD), Edmonton. ACM, New York (2002)

  38. McNamee, P., Dredze, M., Gerber, A., Garera, N., Finin, T., Mayfield, J., Piatko, C., Rao, D., Yarowsky, D., Dreyer, M.: HLTCOE approaches to knowledge base population at TAC 2009. In: Text Analysis Conference (TAC), Gaithersburg (2009)

  39. Christen, P.: A comparison of personal name matching: techniques and practical issues. Technical Report TR-CS-06-02, Australian National University, Australia (2006)

  40. Dreyer, M., Smith, J., Eisner, J.: Latent-variable modeling of string transductions with finite-state methods. In: Empirical Methods in Natural Language Processing (EMNLP), Honolulu. Association for Computational Linguistics, Stroudsburg (2008)

  41. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: SIGMOD Management of Data, Vancouver. ACM, New York (2008)

  42. Syed, Z., Finin, T., Joshi, A.: Wikipedia as an ontology for describing documents. In: Proceedings of the Second International Conference on Weblogs and Social Media, Chicago. AAAI, Menlo Park (2008)

  43. Fader, A., Soderland, S., Etzioni, O.: Scaling Wikipedia-based named entity disambiguation to arbitrary web text. In: WikiAI09 Workshop at IJCAI 2009, Pasadena (2009)

  44. Boschee, E., Weischedel, R., Zamanian, A.: Automatic information extraction. In: Conference on Intelligence Analysis, Washington (2005)

  45. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)

  46. Klein, M., Nelson, M.L.: A comparison of techniques for estimating IDF values to generate lexical signatures for the web. In: Workshop on Web Information and Data Management (WIDM), Napa Valley. ACM, New York (2008)

  47. Simpson, H., Parker, R., Strassel, S., Dang, H.T., McNamee, P.: Wikipedia and the web of confusable entities: experience from entity profile creation for TAC knowledge base population. In: Proceedings of the Seventh International Language Resources and Evaluation Conference (LREC), Valletta. European Language Resources Association, Valletta (2010)

  48. Li, F., Zhang, Z., Bu, F., Tang, Y., Zhu, X., Huang, M.: THU QUANTA at TAC 2009 KBP and RTE track. In: Text Analysis Conference (TAC), National Institute of Standards and Technology, Gaithersburg (2009)

  49. Mayfield, J., Lawrie, D., McNamee, P., Oard, D.W.: Building a cross-language entity linking collection in twenty-one languages. In: Proceedings of the Cross-Language Evaluation Forum (CLEF), Amsterdam (2011)

Author information

Corresponding author

Correspondence to Delip Rao.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Rao, D., McNamee, P., Dredze, M. (2013). Entity Linking: Finding Extracted Entities in a Knowledge Base. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_5

  • DOI: https://doi.org/10.1007/978-3-642-28569-1_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28568-4

  • Online ISBN: 978-3-642-28569-1

  • eBook Packages: Computer Science (R0)
