Entity Linking: Finding Extracted Entities in a Knowledge Base

Rao, Delip; McNamee, Paul; Dredze, Mark

doi:10.1007/978-3-642-28569-1_5

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

In the menagerie of tasks for information extraction, entity linking is a new beast that has recently drawn a lot of attention from NLP practitioners and researchers. Entity linking, also referred to as record linkage or entity resolution, involves aligning a textual mention of a named entity to an appropriate entry in a knowledge base, which may or may not contain the entity. This has manifold applications, ranging from linking patient health records to maintaining personal credit files, preventing identity crimes, and supporting law enforcement. We discuss the key challenges present in this task and present a high-performing system that links entities using max-margin ranking. We also summarize recent work in this area and describe several open research problems.
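
The abstract's idea of ranking knowledge-base candidates for a mention, with the option that the entity is absent, can be illustrated with a small sketch. This is not the chapter's system: it is a minimal pairwise hinge-loss ranker trained with perceptron-style updates, and it omits the regularization term that a full max-margin solver would include. The three features (name similarity, context similarity, a NIL indicator), the candidate ids, the toy data, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
# One training query: candidate id -> feature vector, plus the gold candidate id.
Query = Tuple[Dict[str, Vector], str]

NIL = "NIL"  # hypothetical id meaning "entity is not in the knowledge base"


def dot(w: Vector, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_ranker(queries: List[Query], dim: int,
                 epochs: int = 200, lr: float = 0.1) -> Vector:
    """Learn weights so the gold candidate outscores every other by a margin of 1."""
    w = [0.0] * dim
    for _ in range(epochs):
        updated = False
        for candidates, gold in queries:
            good = candidates[gold]
            for cand_id, bad in candidates.items():
                if cand_id == gold:
                    continue
                diff = [g - b for g, b in zip(good, bad)]
                # Hinge loss is active: move the gold candidate above this one.
                if dot(w, diff) < 1.0:
                    w = [wi + lr * di for wi, di in zip(w, diff)]
                    updated = True
        if not updated:  # every pairwise margin is satisfied
            break
    return w


def link(w: Vector, candidates: Dict[str, Vector]) -> str:
    """Return the highest-scoring candidate, which may be NIL."""
    return max(candidates, key=lambda c: dot(w, candidates[c]))


# Toy data (invented). Features: [name similarity, context similarity, is NIL].
TOY_QUERIES: List[Query] = [
    ({"Paris_(city)": [1.0, 0.9, 0.0],
      "Paris_Hilton": [0.6, 0.1, 0.0],
      NIL: [0.0, 0.0, 1.0]}, "Paris_(city)"),
    ({"E_weak_1": [0.3, 0.1, 0.0],
      "E_weak_2": [0.2, 0.0, 0.0],
      NIL: [0.0, 0.0, 1.0]}, NIL),
]
```

Calling `w = train_ranker(TOY_QUERIES, dim=3)` and then `link(w, candidates)` picks one entry per mention. Treating NIL as just another candidate with its own feature lets a single ranker decide both which entry to link and whether to link at all; other designs, such as a separate NIL classifier, are equally possible.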

Notes

1. http://www.clsp.jhu.edu/~markus/fstrain
2. www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
3. Bunescu and Pasca [13] report learning tens of thousands of support vectors with their “taxonomy” kernel, while a linear kernel represents all support vectors with a single weight vector, enabling faster training and prediction.
4. We used multiple lists, including class-specific lists (i.e., for PER, ORG, and GPE) extracted from Freebase [41] and Wikipedia redirects. PER, ORG, and GPE are the commonly used terms for the entity types people, organizations, and geo-political regions, respectively.
5. Without such a limit, the objective function may diverge for certain parameters of the model; we detect such cases and learn to avoid them during training.
6. Data available from www.dredze.com
7. http://en.wikipedia.org/wiki/Help:Infobox
8. http://research.microsoft.com/en-us/um/people/silviu/WebAssistant/TestData/
9. One of the MSNBC news articles is no longer available, so we used 759 total entities.
10. We removed Google, FST, and conjunction features, which reduced system accuracy but improved run-time performance.
11. vs. the 2006 version used in [14]. We could not obtain the 2006 version from the author or the Internet.
12. Since our KB was derived from infoboxes, entities not having an infobox were left out.
13. As this article went to press, we became aware of the efforts by Mayfield et al. [49] to construct a cross-language entity linking test collection where the language of the knowledge base is English, but query names are in many languages.
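
The remark in Note 3 about linear kernels can be made concrete with the standard kernel-machine decision function. The symbols below (support vectors x_i, labels y_i, dual coefficients α_i, input dimension d) are generic textbook notation, not taken from the chapter.

```latex
% General kernel scoring function over n support vectors:
f(x) = \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x)

% With a linear kernel K(x_i, x) = x_i^{\top} x the sum collapses:
f(x) = \Big( \sum_{i=1}^{n} \alpha_i \, y_i \, x_i \Big)^{\top} x = w^{\top} x,
\qquad w = \sum_{i=1}^{n} \alpha_i \, y_i \, x_i
```

Scoring a candidate therefore costs one dot product, O(d), instead of n kernel evaluations, which is why the number of support vectors stops mattering at prediction time.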

References

1. Sang, E.T.K., Meulder, F.D.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Conference on Natural Language Learning (CONLL), Edmonton. The Association for Computational Linguistics, Stroudsburg (2003)
2. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - vol. 1, NAACL ’03, Stroudsburg, pp. 8–15. Association for Computational Linguistics, Stroudsburg (2003)
3. McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - vol. 4, CONLL ’03, Stroudsburg, pp. 188–191. Association for Computational Linguistics, Stroudsburg (2003)
4. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, pp. 100–110. Association for Computational Linguistics, New Brunswick (1999)
5. Cucerzan, S., Yarowsky, D.: Language independent NER using a unified model of internal and contextual evidence. In: Proceedings of the 6th Conference on Natural Language Learning - vol. 20, COLING-02, Stroudsburg, pp. 1–4. Association for Computational Linguistics, Stroudsburg (2002)
6. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Linguisticae Investigationes, vol. 30, pp. 3–26. John Benjamins, Amsterdam (2007)
7. Bagga, A., Baldwin, B.: Entity-based cross-document coreferencing using the vector space model. In: Conference on Computational Linguistics (COLING), Montreal. Association for Computational Linguistics, Stroudsburg (1998)
8. van Deemter, K., Kibble, R.: On coreferring: coreference in MUC and related annotation schemes. Comput. Linguist. 26, 629–637 (2000)
9. Yang, X., Zhou, G., Su, J., Tan, C.L.: Coreference resolution using competition learning approach. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, pp. 176–183 (2003)
10. Ng, V.: Supervised noun phrase coreference research: The first fifteen years. In: Proceedings of the ACL, Uppsala, pp. 1396–1411. Association for Computational Linguistics, Stroudsburg (2010)
11. Banko, M., Etzioni, O.: The tradeoffs between open and traditional relation extraction. In: Association for Computational Linguistics, Columbus (2008)
12. Sutton, C., McCallum, A.: Introduction to conditional random fields for relational learning. In: Getoor, L., Taskar, B. (eds.) Introduction to Statistical Relational Learning. MIT, Cambridge (2006)
13. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: European Chapter of the Association for Computational Linguistics (EACL), Trento (2006)
14. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Empirical Methods in Natural Language Processing (EMNLP), Prague. Association for Computational Linguistics, Stroudsburg (2007)
15. Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Conference on Natural Language Learning (CONLL), Edmonton. Johns Hopkins University, Baltimore (2003)
16. Artiles, J., Sekine, S., Gonzalo, J.: Web people search: results of the first evaluation and the plan for the second. In: WWW, Beijing. ACM, New York (2008)
17. Poesio, M., Day, D., Artstein, R., Duncan, J., Eidelman, V., Giuliano, C., Hall, R., Hitzeman, J., Jern, A., Kabadjov, M., Yong, S., Keong, W., Mann, G., Moschitti, A., Ponzetto, S., Smith, J., Steinberger, J., Strube, M., Su, J., Versley, Y., Yang, X., Wick, M.: Exploiting lexical and encyclopedic resources for entity disambiguation: final report. Technical Report, JHU CLSP 2007 Summer Workshop, Johns Hopkins University, Baltimore (2008)
18. Popescu, O.: Dynamic parameters for cross document coreference. In: Conference on Computational Linguistics (COLING), Beijing (2010)
19. Huang, J., Treeratpituk, P., Taylor, S., Giles, C.L.: Enhancing cross document coreference of web documents with context similarity and very large scale text categorization. In: Conference on Computational Linguistics (COLING), Beijing. Association for Computational Linguistics, Stroudsburg (2010)
20. Singh, S., Subramanya, A., Pereira, F., McCallum, A.: Large-scale cross-document coreference using distributed inference and hierarchical models. In: Association for Computational Linguistics, Portland. Association for Computational Linguistics, Stroudsburg (2011)
21. Rao, D., McNamee, P., Dredze, M.: Streaming cross document entity coreference resolution. In: Conference on Computational Linguistics (COLING), Beijing. Association for Computational Linguistics, Stroudsburg (2010)
22. Ng, V.: Supervised noun phrase coreference research: The first fifteen years. In: Association for Computational Linguistics, Uppsala. Association for Computational Linguistics, Stroudsburg (2010)
23. Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., Manning, C.: A multi-pass sieve for coreference resolution. In: Empirical Methods in Natural Language Processing (EMNLP), Massachusetts. Association for Computational Linguistics, Stroudsburg (2010)
24. Elsner, M., Charniak, E.: The same-head heuristic for coreference. In: Association for Computational Linguistics, Uppsala. Association for Computational Linguistics, Stroudsburg (2010)
25. Stoyanov, V., Cardie, C., Gilbert, N., Riloff, E., Buttler, D., Hysom, D.: Reconcile: A coreference resolution research platform. In: Association for Computational Linguistics, Uppsala (2010)
26. Gabbard, R., Freedman, M., Weischedel, R.: Coreference for learning to extract relations: Yes Virginia, coreference matters. In: Association for Computational Linguistics, Portland. Association for Computational Linguistics, Stroudsburg (2011)
27. McNamee, P., Dang, H.T.: Overview of the TAC 2009 knowledge base population track. In: Text Analysis Conference (TAC), Gaithersburg (2009)
28. Ji, H., Grishman, R.: Knowledge base population: successful approaches and challenges. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-HLT), Portland. Association for Computational Linguistics, Stroudsburg (2011)
29. Zhang, W., Su, J., Tan, C.L.: Entity linking leveraging automatically generated annotation. In: Conference on Computational Linguistics (COLING), Beijing (2010)
30. Gottipati, S., Jiang, J.: Linking entities to a knowledge base with query expansion. In: Empirical Methods in Natural Language Processing (EMNLP), Edinburgh. Association for Computational Linguistics, Stroudsburg (2011)
31. Han, X., Sun, L.: A generative entity-mention model for linking entities with knowledge base. In: Association for Computational Linguistics, Portland. Association for Computational Linguistics, Stroudsburg (2011)
32. Lehmann, J., Monahan, S., Nezda, L., Jung, A., Shi, Y.: LCC approaches to knowledge base population at TAC 2010. In: Proceedings of the TAC 2010 Workshop, National Institute of Standards and Technology, Gaithersburg (2010)
33. Zhang, W., Sim, Y., Su, J., Tan, C.: NUS-I2R: Learning a combined system for entity linking. In: Proceedings of the TAC 2010 Workshop, National Institute of Standards and Technology, Gaithersburg (2010)
34. Zhang, W., Sim, Y.C., Su, J., Tan, C.L.: Entity linking with effective acronym expansion instance selection and topic modeling. In: International Joint Conference on Artificial Intelligence, Barcelona (2011)
35. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19, 1–16 (2007)
36. McCallum, A., Nigam, K., Ungar, L.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Knowledge Discovery and Data Mining (KDD), Boston. ACM, New York (2000)
37. Joachims, T.: Optimizing search engines using clickthrough data. In: Knowledge Discovery and Data Mining (KDD), Edmonton. ACM, New York (2002)
38. McNamee, P., Dredze, M., Gerber, A., Garera, N., Finin, T., Mayfield, J., Piatko, C., Rao, D., Yarowsky, D., Dreyer, M.: HLTCOE approaches to knowledge base population at TAC 2009. In: Text Analysis Conference (TAC), Gaithersburg (2009)
39. Christen, P.: A comparison of personal name matching: techniques and practical issues. Technical Report TR-CS-06-02, Australian National University, Australia (2006)
40. Dreyer, M., Smith, J., Eisner, J.: Latent-variable modeling of string transductions with finite-state methods. In: Empirical Methods in Natural Language Processing (EMNLP), Honolulu. Association for Computational Linguistics, Stroudsburg (2008)
41. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: SIGMOD Management of Data, Vancouver. ACM, New York (2008)
42. Syed, Z., Finin, T., Joshi, A.: Wikipedia as an ontology for describing documents. In: Proceedings of the Second International Conference on Weblogs and Social Media, Chicago. AAAI, Menlo Park (2008)
43. Fader, A., Soderland, S., Etzioni, O.: Scaling Wikipedia-based named entity disambiguation to arbitrary web text. In: WikiAI09 Workshop at IJCAI 2009, Pasadena (2009)
44. Boschee, E., Weischedel, R., Zamanian, A.: Automatic information extraction. In: Conference on Intelligence Analysis, Washington (2005)
45. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
46. Klein, M., Nelson, M.L.: A comparison of techniques for estimating IDF values to generate lexical signatures for the web. In: Workshop on Web Information and Data Management (WIDM), Napa Valley. ACM, New York (2008)
47. Simpson, H., Parker, R., Strassel, S., Dang, H.T., McNamee, P.: Wikipedia and the web of confusable entities: experience from entity profile creation for TAC knowledge base population. In: Proceedings of the Seventh International Language Resources and Evaluation Conference (LREC), Valletta. European Language Resources Association, Valletta (2010)
48. Li, F., Zhang, Z., Bu, F., Tang, Y., Zhu, X., Huang, M.: THU QUANTA at TAC 2009 KBP and RTE track. In: Text Analysis Conference (TAC), National Institute of Standards and Technology, Gaithersburg (2009)
49. Mayfield, J., Lawrie, D., McNamee, P., Oard, D.W.: Building a cross-language entity linking collection in twenty-one languages. In: Proceedings of the Cross-Language Evaluation Forum (CLEF), Amsterdam (2011)

Author information

Authors and Affiliations

Department of Computer Science, Johns Hopkins University, 3400 N. Charles St., Baltimore, MD, 21218, USA
Delip Rao & Mark Dredze
Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA
Paul McNamee & Mark Dredze

Corresponding author

Correspondence to Delip Rao.

Editor information

Editors and Affiliations

Universite Sorbonne Nouvelle, LATTICE-CNRS, Ecole Normale Superieure, rue d'Ulm 45, Paris, 75005, France
Thierry Poibeau
Information & Communication Technologies, Universitat Pompeu Fabra, C/ Tanger 122-140, Barcelona, 08018, Spain
Horacio Saggion
Institute for Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Jakub Piskorski
Department of Computer Science, University of Helsinki, Gustaf Hällströmin katu 2, Helsinki, 00014, Finland
Roman Yangarber

About this chapter

Cite this chapter

Rao, D., McNamee, P., Dredze, M. (2013). Entity Linking: Finding Extracted Entities in a Knowledge Base. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_5

DOI: https://doi.org/10.1007/978-3-642-28569-1_5
Published: 12 July 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28568-4
Online ISBN: 978-3-642-28569-1
eBook Packages: Computer Science, Computer Science (R0)
