Abstract
Traditional information retrieval (IR) systems use keywords to index and retrieve documents. The limitations of keywords have been recognized since the early days, especially when different but closely related words are used in the query and the relevant document. Query expansion techniques such as pseudo-relevance feedback (PRF) and document clustering techniques rely on the target document set to bridge the gap between those words. This paper explores the use of knowledge-based semantic relatedness techniques to overcome the vocabulary mismatch between the query and documents, both in IR and in passage retrieval for question answering. We performed query expansion and document expansion using WordNet, with positive effects over a language modeling baseline on three datasets, and over PRF on two of those datasets. Our analysis shows that our models and PRF are complementary: PRF is better for easy queries, whereas our models are stronger for difficult queries. Our models also generalize better to other collections and are more robust to parameter adjustments. In addition, we show that our method has a positive impact on an end-to-end question answering system for Basque and that it can be readily applied to other knowledge bases, as our good results using Wikipedia show, paving the way for the use of other knowledge structures such as medical ontologies and linked data repositories.
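As an illustration of the general idea only, the following minimal sketch expands a query with related words and scores documents with a smoothed query-likelihood language model. It is not the system evaluated in the paper: the `RELATED` dictionary is a hypothetical toy stand-in for a knowledge base such as WordNet, the relatedness scores are invented, and the choice of Dirichlet smoothing, the value of `mu`, and the function names `expand_query` and `score` are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy relatedness graph standing in for a knowledge base
# such as WordNet; the words and scores are invented for illustration.
RELATED = {
    "car": {"automobile": 0.9, "vehicle": 0.6},
    "doctor": {"physician": 0.9},
}


def expand_query(terms, related=RELATED, top_k=2):
    """Return {term: weight}. Original terms get weight 1.0 and each
    added term gets its (assumed) relatedness score as weight."""
    weights = {t: 1.0 for t in terms}
    for t in terms:
        # Keep only the top_k most related words for each query term.
        neighbours = sorted(related.get(t, {}).items(),
                            key=lambda kv: kv[1], reverse=True)[:top_k]
        for word, rel in neighbours:
            if word not in terms:
                weights[word] = max(weights.get(word, 0.0), rel)
    return weights


def score(weights, doc_tokens, coll_counts, coll_len, mu=10.0):
    """Weighted query log-likelihood with Dirichlet smoothing.
    Terms that never occur in the collection are skipped."""
    tf = Counter(doc_tokens)
    total = 0.0
    for term, w in weights.items():
        p_coll = coll_counts.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # out-of-vocabulary term: no evidence either way
        p = (tf[term] + mu * p_coll) / (len(doc_tokens) + mu)
        total += w * math.log(p)
    return total


# Two toy documents; neither contains the query word "car".
doc_a = "the automobile was repaired".split()
doc_b = "the weather was nice".split()
coll_counts = Counter(doc_a + doc_b)
coll_len = len(doc_a) + len(doc_b)

plain = {"car": 1.0}
expanded = expand_query(["car"])
```

With the unexpanded query, neither toy document can be distinguished because "car" never occurs in the collection. After expansion, the document containing "automobile" receives the higher score, which is the kind of vocabulary mismatch the abstract refers to.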
Notes
As opposed to open question answering, which typically searches the web.
Yahoo! Webscope dataset: L4—Yahoo! Answers Manner Questions, version 1.0 http://webscope.sandbox.yahoo.com/catalog.php?datatype=l.
Note that Table 3 shows the number of paragraphs, which are the units we indexed.
In which century was the Philosophiae Naturalis Principia Mathematica published?
Who developed the method to apply nuclear magnetic resonance (NMR) to large biological molecules?
Acknowledgments
This work was partially funded by MINECO in Projects READERS and SKATER (PCIN-2013-002-C02-01, TIN2012-38584-C06-02) and by the European Commission in Project NEWSREADER (ICT FP7-ICT-2011-8-316404).
Cite this article
Otegi, A., Arregi, X., Ansa, O. et al. Using knowledge-based relatedness for information retrieval. Knowl Inf Syst 44, 689–718 (2015). https://doi.org/10.1007/s10115-014-0785-4
DOI: https://doi.org/10.1007/s10115-014-0785-4