Knowledge and Information Systems, Volume 44, Issue 3, pp 689–718

Using knowledge-based relatedness for information retrieval

  • Arantxa Otegi
  • Xabier Arregi
  • Olatz Ansa
  • Eneko Agirre
Regular Paper

Abstract

Traditional information retrieval (IR) systems use keywords to index and retrieve documents. The limitations of keywords have been recognized since the early days, especially when different but closely related words are used in the query and the relevant document. Query expansion techniques like pseudo-relevance feedback (PRF) and document clustering techniques rely on the target document set in order to bridge the gap between those words. This paper explores the use of knowledge-based semantic relatedness techniques to overcome the vocabulary mismatch between the query and documents, both in IR and in passage retrieval for question answering. We performed query expansion and document expansion using WordNet, with positive effects over a language modeling baseline on three datasets, and over PRF on two of those datasets. Our analysis shows that our models and PRF are complementary: PRF is better for easy queries, whereas our models are stronger for difficult queries. It also shows that our models generalize better to other collections and are more robust to parameter adjustments. In addition, we show that our method has a positive impact on an end-to-end question answering system for Basque and that it can be readily applied to other knowledge bases, as our good results using Wikipedia show, paving the way for the use of other knowledge structures such as medical ontologies and linked data repositories.

Keywords

Knowledge-based systems, Semantic similarity, Semantic relatedness, Information retrieval, Query and document expansion

Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Arantxa Otegi (1)
  • Xabier Arregi (1)
  • Olatz Ansa (1)
  • Eneko Agirre (1)
  1. IXA Group, University of the Basque Country UPV/EHU, Donostia, Basque Country
