A non-parametric topical relevance model

Article

Abstract

An information retrieval (IR) system can often fail to retrieve relevant documents due to the incomplete specification of information need in the user’s query. Pseudo-relevance feedback (PRF) aims to improve IR effectiveness by exploiting potentially relevant aspects of the information need present in the documents retrieved by an initial search. Standard PRF approaches utilize the information contained in these top-ranked documents under the assumption that each document as a whole is relevant to the information need. In practice, however, documents are often multi-topical, and only a portion of a document may be relevant to the query. In this situation, the topical composition of the top-ranked documents, estimated with statistical topic modeling approaches, can be a useful cue for improving PRF effectiveness. The key idea behind our PRF method is to use the term-topic and document-topic distributions obtained from topic modeling over the set of top-ranked documents to re-rank the initially retrieved documents. The objective is to improve the ranks of documents that are primarily composed of the relevant topics expressed in the information need of the query. Our relevance feedback (RF) model can be improved further by using non-parametric topic modeling, in which the number of topics can grow with the document contents, giving the RF model the capability to adjust the number of topics to the content of the top-ranked documents. We empirically validate our topic-model-based RF approach on two document collections of diverse length and topical composition characteristics: (1) ad-hoc retrieval using the TREC 6–8 and TREC Robust ’04 datasets, and (2) tweet retrieval using the TREC Microblog ’11 dataset.
Results indicate that our proposed approach improves MAP by up to 9% over an LDA-based language model (for initial retrieval) coupled with the relevance model (for feedback). Moreover, the non-parametric version of our approach proves more effective than its parametric counterpart, owing to its ability to adapt the number of topics, improving MAP by up to 5.6% over the parametric version.
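The re-ranking idea described above can be sketched in a few lines: score each initially retrieved document by the query likelihood under the topic model, combining the term-topic distribution P(w|k) with the document-topic distribution P(k|d), then sort by that score. The function names and the toy matrices below are illustrative assumptions, not the authors' implementation; in practice the distributions would come from fitting LDA (or an HDP, in the non-parametric case) over the top-ranked feedback documents.

```python
# Hedged sketch of topic-model-based re-ranking: the matrices `phi` and
# `theta` stand in for distributions a fitted LDA/HDP model would provide.

def topical_score(query_terms, theta_d, phi, vocab):
    """Query likelihood under the topic model: prod_w sum_k P(w|k) * P(k|d)."""
    score = 1.0
    for w in query_terms:
        w_id = vocab[w]
        score *= sum(p_k_d * phi_k[w_id] for p_k_d, phi_k in zip(theta_d, phi))
    return score

def rerank(query_terms, doc_ids, theta, phi, vocab):
    """Re-order the initially retrieved doc_ids by their topical score."""
    return sorted(doc_ids,
                  key=lambda d: topical_score(query_terms, theta[d], phi, vocab),
                  reverse=True)

# Toy example: 2 topics (K=2) over a 2-term vocabulary (V=2).
vocab = {"topic_a": 0, "topic_b": 1}
phi = [[0.9, 0.1],   # P(w | k=0): topic 0 favors "topic_a"
       [0.1, 0.9]]   # P(w | k=1): topic 1 favors "topic_b"
theta = {"d1": [0.2, 0.8],   # P(k | d): d1 is mostly about topic 1
         "d2": [0.8, 0.2]}   # d2 is mostly about topic 0

print(rerank(["topic_a"], ["d1", "d2"], theta, phi, vocab))  # → ['d2', 'd1']
```

A query dominated by topic 0 terms promotes d2, whose topic mixture concentrates on topic 0, which is exactly the behavior the abstract describes: documents primarily composed of the query's relevant topics rise in the ranking.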

Keywords

Latent Dirichlet allocation · Pseudo-relevance feedback · Query-likelihood model · Relevance model · Non-parametric topic modeling

Acknowledgements

This research was initiated by the support from Science Foundation Ireland (SFI) as a part of the CNGL Centre for Global Intelligent Content (Grant No: 12/CE/I2267) and continued as a part of the SFI funded ADAPT centre (Grant No. 13/RC/2106) in DCU.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. IBM Research Lab, Dublin, Ireland
  2. ADAPT Centre, School of Computing, Dublin City University, Dublin 9, Ireland
