Information Retrieval, Volume 13, Issue 5, pp 534–567

Why finding entities in Wikipedia is difficult, sometimes

  • Gianluca Demartini
  • Claudiu S. Firan
  • Tereza Iofciu
  • Ralf Krestel
  • Wolfgang Nejdl
Focused Retrieval and Result Aggr.

Abstract

Entity Retrieval (ER), in contrast to classical search, aims at finding individual entities instead of relevant documents. Finding a list of entities therefore requires techniques different from those used in classical search engines. In this paper, we present a model that describes entities more formally and show how an ER system can be built on top of it. We compare different approaches designed for finding entities in Wikipedia and report results on standard test collections. An analysis of entity-centric queries reveals different aspects and problems related to ER and shows the limitations of current systems that perform ER with Wikipedia. It also indicates which approaches are suitable for which kinds of queries.

Keywords

Entity search, Evaluation, Model, Algorithms, Experimentation

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Gianluca Demartini (1)
  • Claudiu S. Firan (1)
  • Tereza Iofciu (1)
  • Ralf Krestel (1)
  • Wolfgang Nejdl (1)
  1. L3S Research Center, Leibniz Universität Hannover, Hannover, Germany
