Building a Cross-Language Entity Linking Collection in Twenty-One Languages

  • James Mayfield
  • Dawn Lawrie
  • Paul McNamee
  • Douglas W. Oard
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6941)


We describe an efficient way to create a test collection for evaluating the accuracy of cross-language entity linking. Queries are created by semi-automatically identifying person names on the English side of a parallel corpus, using judgments obtained through crowdsourcing to identify the entity corresponding to the name, and projecting the English name onto the non-English document using word alignments. We applied the technique to produce the first publicly available multilingual cross-language entity linking collection. The collection includes approximately 55,000 queries, comprising between 875 and 4,329 queries for each of twenty-one non-English languages.
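The projection step can be illustrated with a minimal sketch. It assumes that word alignments are available as pairs of (English token index, target token index) and that the name is marked as an inclusive token span on the English side. The function names `project_span` and `project_name`, the data layout, and the example sentence are hypothetical and are not taken from the paper; the authors' actual procedure may differ.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: each pair is (english_token_index, target_token_index).
Alignment = List[Tuple[int, int]]


def project_span(en_span: Tuple[int, int],
                 alignment: Alignment) -> Optional[Tuple[int, int]]:
    """Project an inclusive English token span onto the target side.

    Returns the smallest inclusive target span that covers every target
    token aligned to a token inside the English span, or None when no
    token in the span is aligned.
    """
    start, end = en_span
    targets = [t for e, t in alignment if start <= e <= end]
    if not targets:
        return None
    return min(targets), max(targets)


def project_name(tgt_tokens: List[str],
                 en_span: Tuple[int, int],
                 alignment: Alignment) -> Optional[str]:
    """Return the target-language string for a projected English name."""
    span = project_span(en_span, alignment)
    if span is None:
        return None
    lo, hi = span
    return " ".join(tgt_tokens[lo:hi + 1])


# Illustrative (invented) sentence pair and alignment.
en_tokens = ["President", "Barack", "Obama", "spoke"]
es_tokens = ["Habló", "el", "presidente", "Barack", "Obama"]
alignment = [(0, 2), (1, 3), (2, 4), (3, 0)]

# The name "Barack Obama" occupies English tokens 1..2.
projected = project_name(es_tokens, (1, 2), alignment)
```

Taking the minimum and maximum aligned positions keeps the projected name contiguous even when the alignment skips a token inside the span. The trade-off is that one stray alignment link can widen the span, so a real pipeline would likely need a filtering step.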


Keywords: Entity Linking, Cross-Language Entity Linking, Multilingual Corpora, Crowdsourcing





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • James Mayfield (1)
  • Dawn Lawrie (1, 2)
  • Paul McNamee (1)
  • Douglas W. Oard (1, 3)

  1. Johns Hopkins University Human Language Technology Center of Excellence, USA
  2. Loyola University, Maryland, USA
  3. University of Maryland, College Park, USA
