Information Retrieval, Volume 13, Issue 5, pp 568–600

Entity ranking in Wikipedia: utilising categories, links and topic difficulty prediction

  • Jovan Pehcevski
  • James A. Thom
  • Anne-Marie Vercoustre
  • Vladimir Naumovski
S.I.: Focused Retrieval and Result Aggr.

Abstract

Entity ranking has recently emerged as a research field that aims at retrieving entities as answers to a query. Unlike entity extraction where the goal is to tag names of entities in documents, entity ranking is primarily focused on returning a ranked list of relevant entity names for the query. Many approaches to entity ranking have been proposed, and most of them were evaluated on the INEX Wikipedia test collection. In this paper, we describe a system we developed for ranking Wikipedia entities in answer to a query. The entity ranking approach implemented in our system utilises the known categories, the link structure of Wikipedia, as well as the link co-occurrences with the entity examples (when provided) to retrieve relevant entities as answers to the query. We also extend our entity ranking approach by utilising the knowledge of predicted classes of topic difficulty. To predict the topic difficulty, we generate a classifier that uses features extracted from an INEX topic definition to classify the topic into an experimentally pre-determined class. This knowledge is then utilised to dynamically set the optimal values for the retrieval parameters of our entity ranking system. Our experiments demonstrate that the use of categories and the link structure of Wikipedia can significantly improve entity ranking effectiveness, and that topic difficulty prediction is a promising approach that could also be exploited to further improve the entity ranking performance.
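The abstract's last step, using a predicted topic difficulty class to set the retrieval parameters, can be illustrated with a minimal sketch. It assumes the entity score is a weighted linear combination of a link score, a category score and a full-text score; the abstract does not state the actual formula, so this combination is an assumption. The class names, the weight values and the function names (`combined_score`, `rank_entities`) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Hypothetical retrieval parameters: weights for link and category evidence."""
    link: float
    category: float

    @property
    def text(self) -> float:
        # Whatever weight is left over goes to the full-text score.
        return 1.0 - self.link - self.category


# Assumed mapping from predicted difficulty class to parameter values.
# These numbers are illustrative only, not values reported by the authors.
WEIGHTS_BY_CLASS = {
    "easy": Weights(link=0.2, category=0.6),
    "hard": Weights(link=0.1, category=0.8),
}
DEFAULT_WEIGHTS = Weights(link=0.2, category=0.6)


def combined_score(link_score: float, category_score: float,
                   text_score: float, w: Weights) -> float:
    """Weighted linear combination of the three evidence scores (an assumption)."""
    return (w.link * link_score
            + w.category * category_score
            + w.text * text_score)


def rank_entities(candidates: dict, predicted_class: str) -> list:
    """Rank candidates given as name -> (link, category, text) score triples.

    The predicted difficulty class selects the weights; unknown classes
    fall back to DEFAULT_WEIGHTS.
    """
    w = WEIGHTS_BY_CLASS.get(predicted_class, DEFAULT_WEIGHTS)
    scored = [(name, combined_score(l, c, t, w))
              for name, (l, c, t) in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy candidates with made-up scores.
    toy = {"Entity A": (1.0, 0.0, 0.0), "Entity B": (0.0, 1.0, 0.0)}
    for name, score in rank_entities(toy, "hard"):
        print(f"{name}: {score:.2f}")
```

Keeping the per-class parameters in a lookup table separates the classifier from the ranker: any classifier that outputs a class label can drive the parameter choice without changes to the scoring code.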

Keywords

Entity ranking · INEX · XML retrieval · Wikipedia

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Jovan Pehcevski (1)
  • James A. Thom (2)
  • Anne-Marie Vercoustre (3)
  • Vladimir Naumovski (4)

  1. Faculty of Informatics, European University, Skopje, Macedonia
  2. School of Computer Science, RMIT University, Melbourne, Australia
  3. INRIA, Rocquencourt, France
  4. T-Mobile, Skopje, Macedonia
