An Approach to Web-Scale Named-Entity Disambiguation

  • Luís Sarmento
  • Alexander Kehlenbeck
  • Eugénio Oliveira
  • Lyle Ungar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5632)

Abstract

We present a multi-pass clustering approach to large-scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster, and hence disambiguate, entities, and is designed to handle NED on the entire web. We show that on web collections NED becomes increasingly difficult as the corpus grows, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data. This effect limits how much data-driven approaches can gain from processing larger data sets, and suggests that efficient clustering-based disambiguation methods for the web will require extracting more specialized information from documents.
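
As a rough illustration of what clustering by name co-occurrence can look like, the sketch below groups mentions of one ambiguous name by the other names that appear alongside it. It is not the authors' multi-pass algorithm: the greedy single-pass strategy, the cosine measure, the `threshold` value, the function names, and the toy "Amazon" documents are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[name] for name, count in a.items() if name in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_mentions(mentions, threshold=0.3):
    """Greedy single-pass clustering of mentions of one ambiguous name.

    mentions: list of lists; each inner list holds the other names that
    co-occur with the ambiguous name in one document (hypothetical input).
    Returns a list of clusters, each a list of mention indices.
    """
    clusters = []  # each entry: (summed co-occurrence Counter, member indices)
    for idx, names in enumerate(mentions):
        vec = Counter(names)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No existing cluster is similar enough: start a new one.
            clusters.append((vec, [idx]))
        else:
            # Merge the mention into the most similar cluster.
            best[0].update(vec)
            best[1].append(idx)
    return [members for _, members in clusters]


# Toy documents mentioning "Amazon" (made-up data: company vs. river).
docs = [
    ["Jeff Bezos", "Seattle", "Kindle"],
    ["Brazil", "Peru", "Manaus"],
    ["Seattle", "Kindle", "Microsoft"],
    ["Manaus", "Brazil", "Andes"],
]
print(cluster_mentions(docs))  # → [[0, 2], [1, 3]]
```

Each resulting cluster stands for one candidate entity. On a web-scale corpus a single greedy pass like this would not be enough, which is part of what motivates a multi-pass design.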

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Luís Sarmento (1)
  • Alexander Kehlenbeck (2)
  • Eugénio Oliveira (1)
  • Lyle Ungar (3)

  1. Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Porto, Portugal
  2. Google Inc., USA
  3. University of Pennsylvania - CS, Philadelphia, USA
