Information Retrieval, Volume 17, Issue 4, pp 380–406

Distance matters! Cumulative proximity expansions for ranking documents

  • Jeroen B. P. Vuurens
  • Arjen P. de Vries

Abstract

In information retrieval, functions that rank documents by their estimated relevance to a query typically treat query terms as independent. However, users are often interested in the joint presence of query terms, which is overlooked when terms are matched independently. One feature that can express the relatedness of co-occurring terms is their proximity in text. In past research, models that are trained on the proximity information in a collection have performed better than models that are not estimated on data. We analyzed how co-occurring query terms can be used to estimate the relevance of documents based on their distance in text, and used this analysis to extend a unigram ranking function with a proximity model that accumulates the scores of all occurring term combinations. This proximity model is more practical than existing models: it does not require any co-occurrence statistics, it obviates the need to tune additional parameters, and it has a retrieval speed close to that of competing models. We show that this approach is more robust than existing models on both Web and newswire corpora, and that on average it performs as well as or better than existing proximity models across collections.
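
As a rough illustration of the idea of extending a unigram score with accumulated proximity scores, the following is a minimal sketch and not the article's actual model. The function names, the count-based unigram stand-in, the restriction to term pairs, and the 1/distance weighting are all assumptions made for the example; the abstract does not specify any of them.

```python
from itertools import combinations


def unigram_score(query_terms, doc_tokens):
    """Hypothetical stand-in for a unigram ranking function:
    counts how often any query term occurs in the document."""
    wanted = set(query_terms)
    return float(sum(1 for tok in doc_tokens if tok in wanted))


def proximity_score(query_terms, doc_tokens):
    """Accumulate a score over every co-occurrence of two distinct
    query terms. The 1/distance weight is an assumption chosen for
    illustration, so that closer occurrences contribute more."""
    wanted = set(query_terms)
    positions = {}
    for index, tok in enumerate(doc_tokens):
        if tok in wanted:
            positions.setdefault(tok, []).append(index)

    total = 0.0
    # Only pairs are considered here; larger term combinations are omitted.
    for term_a, term_b in combinations(sorted(positions), 2):
        for pos_a in positions[term_a]:
            for pos_b in positions[term_b]:
                total += 1.0 / abs(pos_a - pos_b)
    return total


def score(query_terms, doc_tokens):
    """Unigram score extended with the accumulated proximity score."""
    return unigram_score(query_terms, doc_tokens) + proximity_score(
        query_terms, doc_tokens
    )


query = ["term", "proximity"]
near = "term proximity matters".split()
far = "term a b c d proximity".split()
print(score(query, near) > score(query, far))
```

In this sketch both documents contain each query term once, so they receive the same unigram score; the document in which the terms are adjacent ranks higher only because of the proximity component. No co-occurrence statistics or tuned parameters are involved, which mirrors the practical properties claimed in the abstract without reproducing the authors' scoring function.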

Keywords

Term dependency, Term proximity, Query expansion

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. The Hague University of Applied Sciences, The Hague, The Netherlands
  2. CWI, Amsterdam, The Netherlands
  3. Delft University of Technology, Delft, The Netherlands
