Using knowledge-based relatedness for information retrieval

Abstract

Traditional information retrieval (IR) systems use keywords to index and retrieve documents. The limitations of keywords have been recognized since the early days, especially when the query and the relevant document use different but closely related words. Query expansion techniques such as pseudo-relevance feedback (PRF), as well as document clustering techniques, rely on the target document set to bridge the gap between those words. This paper explores the use of knowledge-based semantic relatedness techniques to overcome the vocabulary mismatch between the query and the documents, both in IR and in passage retrieval for question answering. We performed query expansion and document expansion using WordNet, with positive effects over a language modeling baseline on three datasets, and over PRF on two of those datasets. Our analysis shows that our models and PRF are complementary: PRF is better for easy queries, whereas our models are stronger for difficult queries. It also shows that our models generalize better to other collections and are more robust to parameter adjustments. In addition, we show that our method has a positive impact on an end-to-end question answering system for Basque and that it can be readily applied to other knowledge bases, as our good results using Wikipedia show. This paves the way for the use of other knowledge structures such as medical ontologies and linked data repositories.
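As a rough illustration of how knowledge-based expansion can bridge a vocabulary mismatch, the sketch below runs a personalized random walk over a tiny hand-made graph of related words and appends the highest-ranked words to the query. A random walk is one common way to compute knowledge-based relatedness; this excerpt does not state how the paper configures its own method, so everything here is an assumption made for the example: the toy graph (a stand-in for WordNet), the function names `build_graph`, `personalized_pagerank` and `expand_query`, the damping value, and the number of added terms.

```python
from collections import defaultdict

# Toy undirected graph of related words (hypothetical; stands in for WordNet).
TOY_EDGES = [
    ("car", "automobile"),
    ("car", "vehicle"),
    ("automobile", "vehicle"),
    ("vehicle", "truck"),
    ("car", "engine"),
    ("bank", "money"),
    ("money", "loan"),
]


def build_graph(edges):
    """Return an adjacency map for an undirected graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration in which teleport mass goes only to the seed nodes."""
    seeds = [s for s in seeds if s in graph]
    if not seeds:
        return {}
    teleport = {node: 0.0 for node in graph}
    for s in seeds:
        teleport[s] += 1.0 / len(seeds)
    rank = dict(teleport)
    for _ in range(iterations):
        # Restart mass first, then spread each node's rank to its neighbours.
        new_rank = {node: (1.0 - damping) * teleport[node] for node in graph}
        for node, neighbours in graph.items():
            share = damping * rank[node] / len(neighbours)
            for n in neighbours:
                new_rank[n] += share
        rank = new_rank
    return rank


def expand_query(query_terms, graph, k=2):
    """Append the k highest-ranked non-query words to the query."""
    rank = personalized_pagerank(graph, query_terms)
    candidates = [
        (score, word)
        for word, score in rank.items()
        if word not in query_terms and score > 0
    ]
    # Highest score first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return list(query_terms) + [word for _, word in candidates[:k]]


graph = build_graph(TOY_EDGES)
expanded = expand_query(["car"], graph, k=2)
print(expanded)
```

Because the walk restarts only at the query words, words in the unrelated `bank` component receive no score and are never added. A document expansion variant would seed the walk with the words of a document instead of the words of a query.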



Notes

  1. http://ixa2.si.ehu.es/clirwsd/.

  2. As opposed to open question answering, which typically searches the web.

  3. http://ixa2.si.ehu.es/ukb/.

  4. http://ixa2.si.ehu.es/ukb/.

  5. Yahoo! Webscope dataset: L4 - Yahoo! Answers Manner Questions, version 1.0 http://webscope.sandbox.yahoo.com/catalog.php?datatype=l.

  6. http://answers.yahoo.com/.

  7. Note that Table 3 shows the number of paragraphs, which are the units we indexed.

  8. http://incubator.apache.org/opennlp/.

  9. http://www.lemurproject.org.

  10. http://www.freebase.com.

  11. http://dbpedia.org.

  12. http://wordnetweb.princeton.edu/perl/webwn.

  13. http://www.ztcorpusa.net.

  14. http://zientzia.net/.

  15. In which century was the Philosophiae Naturalis Principia Mathematica published?

  16. Who developed the method to apply nuclear magnetic resonance (NMR) to large biological molecules?

  17. http://ixa2.si.ehu.es/ukb.

  18. http://trec.nist.gov/pubs/call2012.html.


Acknowledgments

This work was partially funded by MINECO in Projects READERS and SKATER (PCIN-2013-002-C02-01, TIN2012-38584-C06-02) and by the European Commission in Project NEWSREADER (ICT FP7-ICT-2011-8-316404).

Author information

Corresponding author

Correspondence to Arantxa Otegi.

About this article

Cite this article

Otegi, A., Arregi, X., Ansa, O. et al. Using knowledge-based relatedness for information retrieval. Knowl Inf Syst 44, 689–718 (2015). https://doi.org/10.1007/s10115-014-0785-4

Keywords

  • Knowledge-based systems
  • Semantic similarity
  • Semantic relatedness
  • Information retrieval
  • Query and document expansion