A selective approach to index term weighting for robust information retrieval based on the frequency distributions of query terms

Abstract

A typical information retrieval (IR) system applies a single retrieval strategy to every information need of its users. However, the results of past IR experiments show that a particular retrieval strategy is in general good at fulfilling some types of information needs while failing to fulfil others, i.e., there is high variation in retrieval effectiveness across information needs. On the other hand, the same results also show that an information need that a particular retrieval strategy fails to fulfil can often be fulfilled by one of the other existing retrieval strategies. The challenge is therefore to determine in advance which retrieval strategy should be applied to which information need. This challenge is related to the robustness of IR systems in retrieval effectiveness. For an IR system, robustness can be defined as fulfilling every information need of its users with an acceptable level of satisfaction. Maintaining robustness in retrieval effectiveness is a long-standing challenge, and in this article we propose a simple but powerful method as a remedy. The method is a selective approach to index term weighting: for any given query (i.e., information need), it predicts the “best” term weighting model among a set of alternatives, on the basis of the frequency distributions of the query terms over the target document collection. To predict the best term weighting model, the method uses the statistic of the Chi-square goodness-of-fit test. The results of experiments performed using the official query sets of the TREC Web track and the Million Query track reveal, in general, that the frequency distributions of query terms provide relevant information about the retrieval effectiveness of term weighting models. In particular, the results show that the selective approach proposed in this article is, on average, both more effective and more robust than the most effective single term weighting model.
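
As a rough illustration of the statistic involved, the minimal Java sketch below computes a Pearson Chi-square goodness-of-fit value, \(\chi^2 = \sum_i (O_i - E_i)^2 / E_i\), from binned observed counts and expected counts under a reference distribution. The binning and the reference counts are illustrative assumptions only; they are not the article's exact procedure for modelling query-term frequency distributions.

```java
/**
 * Minimal sketch of a Pearson Chi-square goodness-of-fit statistic,
 * chi2 = sum_i (O_i - E_i)^2 / E_i, over binned observed vs. expected counts.
 * The bins and the reference distribution below are hypothetical.
 */
public class ChiSquareGoodnessOfFit {

    /** Returns the Chi-square statistic for observed vs. expected bin counts. */
    static double chiSquare(double[] observed, double[] expected) {
        if (observed.length != expected.length) {
            throw new IllegalArgumentException("bin counts must have equal length");
        }
        double chi2 = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double diff = observed[i] - expected[i];
            chi2 += diff * diff / expected[i];
        }
        return chi2;
    }

    public static void main(String[] args) {
        // Hypothetical within-document frequency histogram of a query term:
        // bins for tf = 1, 2, 3, 4+ over the documents containing the term.
        double[] observed = {620, 240, 90, 50};
        // Expected counts under a hypothetical reference model, scaled to the
        // same total of 1000 documents.
        double[] expected = {700, 200, 70, 30};
        System.out.printf("chi2 = %.2f%n", chiSquare(observed, expected));
    }
}
```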

Notes

  1. https://ir.nist.gov/ria.

  2. Since the number of documents in which a term does not occur is usually far greater than the number in which it does, including the relative frequency value of 0 in such a plot would in general make the plot unreadable, especially for the semantically selective words.

  3. The value of the parameter \(\alpha\) is usually taken as 5 in practice.

  4. The numbers of queries listed for the 8 models do not sum to 194 because of ties in the highest scores for some queries. In the case of a tie, the query is counted once for each tied model.

  5. In order to calculate this feature, the test query must be executed, though without fetching the result list. Thus, this feature is not a purely pre-retrieval feature.

  6. http://jsoup.org.

  7. http://github.com/trec-web/trec-web-2014.

  8. http://ir.cis.udel.edu/million/statAP_MQ_eval_v4.pl.

  9. http://terrier.org.

  10. http://lucene.apache.org.

References

  1. Agresti, A. (2002). Categorical data analysis. New York: Wiley-Interscience.

  2. Amati, G. (2006). Frequentist and Bayesian approach to information retrieval. In Advances in information retrieval, lecture notes in computer science (Vol. 3936, pp. 13–24). Berlin: Springer.

  3. Amati, G. (2009). Divergence from randomness models (pp. 929–932). Boston, MA: Springer.

  4. Amati, G., Carpineto, C., & Romano, G. (2004). Query difficulty, robustness, and selective application of query expansion. In Advances in information retrieval, lecture notes in computer science (Vol. 2997, pp. 127–137). Berlin: Springer.

  5. Amati, G., & Van Rijsbergen, C. J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), 357–389.

  6. Arguello, J., Crane, M., Diaz, F., Lin, J., & Trotman, A. (2016). Report on the SIGIR 2015 workshop on reproducibility, inexplicability, and generalizability of results (RIGOR). SIGIR Forum, 49(2), 107–116.

  7. Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.

  8. Azzopardi, L., Crane, M., Fang, H., Ingersoll, G., Lin, J., Moshfeghi, Y., Scells, H., Yang, P., & Zuccon, G. (2017). The Lucene for information access and retrieval research (LIARR) workshop at SIGIR 2017. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, Shinjuku, Tokyo, Japan, SIGIR ’17 (pp. 1429–1430). ACM.

  9. Balasubramanian, N., & Allan, J. (2010). Learning to select rankers. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, Geneva, Switzerland, SIGIR ’10 (pp. 855–856) ACM.

  10. Białecki, A., Muir, R., & Ingersoll, G. (2012). Apache Lucene 4. In Proceedings of the SIGIR 2012 workshop on open source information retrieval, Portland, Oregon, USA (pp. 17–24).

  11. Buckley, C. (2009). Why current IR engines fail. Information Retrieval, 12(6), 652–665.

  12. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd international conference on machine learning, Bonn, Germany, ICML ’05 (pp. 89–96).

  13. Callan, J., Hoy, M., Yoo, C., & Zhao, L. (2009). The ClueWeb09 dataset. http://boston.lti.cs.cmu.edu/classes/11-742/S10-TREC/TREC-Nov19-09.pdf. Accessed 15 October 2017.

  14. Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009). Million query track 2009 overview. Technical report. National Institute of Standards and Technology.

  15. Clinchant, S., & Gaussier, É. (2010). Information-based models for ad hoc IR. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, Geneva, Switzerland, SIGIR ’10 (pp. 234–241). ACM.

  16. Clinchant, S., & Gaussier, É. (2011). Retrieval constraints and word frequency distributions a log-logistic model for IR. Information Retrieval, 14(1), 5–25.

  17. Collins-Thompson, K. (2009). Reducing the risk of query expansion via robust constrained optimization. In Proceedings of the 18th ACM conference on information and knowledge management, New York, NY, USA, CIKM ’09 (pp. 837–846). ACM.

  18. Cormack, G. V., Smucker, M. D., & Clarke, C. L. A. (2011). Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5), 441–465.

  19. Dinçer, B. T., Macdonald, C., & Ounis, I. (2014). Hypothesis testing for the risk-sensitive evaluation of retrieval systems. In Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval, Gold Coast, Queensland, Australia, SIGIR ’14 (pp. 23–32). ACM.

  20. Dinçer, B. T., Macdonald, C., & Ounis, I. (2016). Risk-sensitive evaluation and learning to rank using multiple baselines. In Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval, Pisa, Italy, SIGIR ’16 (pp. 483–492). ACM.

  21. Geng, X., Liu, T. Y., Qin, T., Arnold, A., Li, H., & Shum, H. Y. (2008). Query dependent ranking using \(k\)-nearest neighbor. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, Singapore, Singapore, SIGIR ’08 (pp. 115–122). ACM.

  22. Harman, D., & Buckley, C. (2004). The NRRC reliable information access (RIA) workshop. In Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, Sheffield, United Kingdom, SIGIR ’04 (pp. 528–529). ACM.

  23. Harman, D., & Buckley, C. (2009). Overview of the reliable information access workshop. Information Retrieval, 12(6), 615–641.

  24. Harter, S. (1975a). A probabilistic approach to automatic keyword indexing. Part I: On the distribution of specialty words in a technical literature. Journal of the American Society for Information Science (JASIS), 26, 197–216.

  25. Harter, S. (1975b). A probabilistic approach to automatic keyword indexing. Part II: An algorithm for probabilistic indexing. Journal of the American Society for Information Science (JASIS), 26, 280–289.

  26. He, B., & Ounis, I. (2003a). A study of parameter tuning for term frequency normalization. In Proceedings of the twelfth international conference on information and knowledge management, New Orleans, LA, USA, CIKM ’03 (pp. 10–16). ACM.

  27. He, B., & Ounis, I. (2003b). University of Glasgow at the robust track—A query-based model selection approach for the poorly-performing queries. Technical report. National Institute of Standards and Technology.

  28. He, B., & Ounis, I. (2004). A query-based pre-retrieval model selection approach to information retrieval. In Proceedings of the RIAO 2004—Coupling approaches, coupling media and coupling languages for information retrieval, Vaucluse, France, RIAO ’04 (pp. 706–719).

  29. He, B., & Ounis, I. (2005). Term frequency normalisation tuning for BM25 and DFR models. In D. E. Losada & J. M. Fernández-Luna (Eds.), Advances in information retrieval (pp. 200–214). Berlin: Springer.

  30. Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.

  31. Kocabaş, I., Dinçer, B. T., & Karaoğlan, B. (2014). A nonparametric term weighting method for information retrieval based on measuring the divergence from independence. Information Retrieval, 17(2), 153–176.

  32. Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval, Pittsburgh, Pennsylvania, USA, SIGIR ’93 (pp. 191–202). ACM.

  33. Lin, J., Crane, M., Trotman, A., Callan, J., Chattopadhyaya, I., Foley, J., Ingersoll, G., Macdonald, C., & Vigna, S. (2016). Toward reproducible baselines: The open-source IR reproducibility challenge. In Advances in information retrieval: 38th European conference on IR research, ECIR 2016, Padua, Italy, March 20-23, 2016. Proceedings (pp. 408–420). Cham: Springer.

  34. Mackenzie, J., Culpepper, J. S., Blanco, R., Crane, M., Clarke, C. L. A, & Lin, J. (2018). Query driven algorithm selection in early stage retrieval. In Proceedings of the eleventh ACM international conference on web search and data mining, Marina Del Rey, CA, USA, WSDM ’18 (pp. 396–404). ACM.

  35. Peng, J., He, B., & Ounis, I. (2009a). Predicting the usefulness of collection enrichment for enterprise search. In Proceedings of the 2nd international conference on theory of information retrieval: Advances in information retrieval theory, ICTIR ’09 (pp. 366–370). Berlin: Springer.

  36. Peng, J., Macdonald, C., He, B., & Ounis, I. (2009b). A study of selective collection enrichment for enterprise search. In Proceedings of the 18th ACM conference on information and knowledge management, Hong Kong, China, CIKM ’09 (pp. 1999–2002). ACM.

  37. Peng, J., Macdonald, C., & Ounis, I. (2010). Learning to select a ranking function. In Proceedings of the 32nd European conference on advances in information retrieval, Milton Keynes, UK, ECIR’2010 (pp. 114–126). Springer.

  38. Peng, J., & Ounis, I. (2009). Selective application of query-independent features in web information retrieval. In Advances in information retrieval, lecture notes in computer science (Vol. 5478, pp. 375–387). Berlin: Springer.

  39. Petersen, C., Simonsen, J. G., Järvelin, K., & Lioma, C. (2016). Adaptive distributional extensions to DFR ranking. In Proceedings of the 25th ACM international on conference on information and knowledge management, Indianapolis, Indiana, USA, CIKM ’16 (pp. 2005–2008). ACM.

  40. Plachouras, V., Cacheda, F., & Ounis, I. (2006). A decision mechanism for the selective combination of evidence in topic distillation. Information Retrieval, 9(2), 139–163.

  41. Plachouras, V., Ounis, I., & Cacheda, F. (2004). Selective combination of evidence for topic distillation using document and aggregate-level information. In Proceedings of the RIAO 2004—coupling approaches, coupling media and coupling languages for information retrieval, Vaucluse, France, RIAO ’04 (pp. 610–622).

  42. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2007). Numerical recipes: The art of scientific computing (3rd ed.). New York, NY: Cambridge University Press.

  43. Robertson, S., & Walker, S. (1994). Some simple approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR’94) (pp. 232–241).

  44. Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389.

  45. Robertson, S. E., van Rijsbergen, C. J., & Porter, M. (1981). Probabilistic models of indexing and searching, chap. 4. In S. E. Robertson, C. J. van Rijsbergen, & P. Williams (Eds.), Information retrieval research (pp. 35–56). Oxford: Butterworths.

  46. Santos, R. L., Macdonald, C., & Ounis, I. (2010). Selectively diversifying web search results. In Proceedings of the 19th ACM international conference on information and knowledge management, Toronto, ON, Canada, CIKM ’10 (pp. 1179–1188). ACM.

  47. Teevan, J., Dumais, S. T., & Liebling, D. J. (2008). To personalize or not to personalize: Modeling queries with variation in user intent. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, Singapore, Singapore, SIGIR ’08 (pp. 163–170). ACM.

  48. Tonellotto, N., Macdonald, C., & Ounis, I. (2013). Efficient and effective retrieval using selective pruning. In Proceedings of the Sixth ACM international conference on web search and data mining, Rome, Italy, WSDM ’13 (pp. 63–72). ACM.

  49. Voorhees, E. M. (2004). Measuring ineffectiveness. In Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, Sheffield, United Kingdom, SIGIR ’04 (pp. 562–563). ACM.

  50. Voorhees, E. M., Rajput, S., & Soboroff, I. (2016). Promoting repeatability through open runs. In Proceedings of the seventh international workshop on evaluating information access, Tokyo, Japan, EVIA 2016 (pp. 17–20).

  51. Wang, L., Bennett, P. N., & Collins-Thompson, K. (2012). Robust ranking models via risk-sensitive optimization. In Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, Portland, Oregon, USA, SIGIR ’12 (pp. 761–770). ACM.

  52. White, R. W., Richardson, M., Bilenko, M., & Heath, A. P. (2008). Enhancing web search by promoting multiple search engine use. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, Singapore, Singapore, SIGIR ’08 (pp. 43–50). ACM.

  53. Yom-Tov, E., Fine, S., Carmel, D., & Darlow, A. (2005). Learning to estimate query difficulty: Including applications to missing content detection and distributed information retrieval. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, Salvador, Brazil, SIGIR ’05 (pp. 512–519). ACM.

  54. Zhai, C., & Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2), 179–214.

Acknowledgements

This work is supported by TÜBİTAK, scientific and technological research projects funding program, under Grant 114E558. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsor.

Author information

Corresponding author

Correspondence to Ahmet Arslan.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Experimental setup and details

The IR community has recently encouraged the reproducibility of published experimental results (Arguello et al. 2016; Lin et al. 2016; Voorhees et al. 2016). In this section, therefore, we provide every detail of our experimental setup necessary to reproduce the results of the experiments presented in this article. To further promote open-source sharing, repeatability, and reproducibility, the source code is publicly available on GitHub at https://github.com/iorixxx/lucene-clueweb-retrieval, so that IR researchers can benefit from the presented work as much as possible.

The ClueWeb09-English documents are indexed after the Hyper Text Markup Language (HTML) tags are stripped from every document using the jsoup library (version 1.10.2; see footnote 6). The HTML tag stripping procedure yields empty text blocks for some documents, which are skipped during indexing. We employ no structural document representation, i.e., the title and body sections of each document are combined into a single text block. However, anchor texts from incoming links are appended to the document contents.
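
The sketch below illustrates this HTML-stripping step with jsoup. The class and method names are hypothetical, not the article's actual code: the title and body of a page are concatenated into one text block, and documents whose stripped text is empty are signalled so that they can be skipped.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/** Hypothetical helper illustrating the HTML-stripping step described above. */
public class HtmlStripper {

    /** Returns the plain text of an HTML page, or null if nothing remains. */
    static String strip(String rawHtml) {
        Document doc = Jsoup.parse(rawHtml);
        // No structural representation: title and body are concatenated.
        String text = (doc.title() + " " + doc.body().text()).trim();
        return text.isEmpty() ? null : text;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example</title></head>"
                + "<body><p>Hello, ClueWeb09!</p></body></html>";
        String text = strip(html);
        if (text != null) {
            // Anchor texts from incoming links would be appended here before indexing.
            System.out.println(text);
        }
    }
}
```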

In our experiments, we discard the topics that have no relevant documents in the judgment set, which leaves 759 effective topics. The descriptive statistics for these 759 queries are given in Table 8.

Table 8 Salient statistics for the query sets used in the experiments

We use the gdeval.pl (version 1.3) TREC evaluation tool (downloaded from the trec-web-2014 GitHub repository; see footnote 7) to calculate nDCG@k values. The tool computes nDCG using the standard rank-plus-one discount function and exponential gain (Burges et al. 2005). The relevance judgments for the Million Query 2009 topics (Carterette et al. 2009) are distributed as a five-column prels file instead of the standard four-column qrels file. Therefore, the statAP_MQ_eval_v4.pl evaluation script (see footnote 8) is used to calculate nDCG@k values for the Million Query 2009.
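
For concreteness, the following sketch computes nDCG@k with exponential gain and the rank-plus-one logarithmic discount, i.e. \(\mathrm{DCG@}k = \sum_{i=1}^{k} (2^{rel_i} - 1) / \log_2(i + 1)\), normalised by the DCG of an ideal descending ordering of the judged gains. It mirrors the formula described above, but it is an independent illustration, not the gdeval.pl script itself.

```java
import java.util.Arrays;

/** Independent illustration of nDCG@k with exponential gain and log2(rank+1) discount. */
public class Ndcg {

    static double dcg(int[] rels, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, rels.length); i++) {
            // rank is i + 1, so the discount is log2(i + 2)
            dcg += (Math.pow(2, rels[i]) - 1) / (Math.log(i + 2) / Math.log(2));
        }
        return dcg;
    }

    /** rels: grades of the ranked list; judged: all judged grades for the topic. */
    static double ndcg(int[] rels, int[] judged, int k) {
        int[] ideal = judged.clone();
        Arrays.sort(ideal);
        // reverse to descending order for the ideal ranking
        for (int i = 0; i < ideal.length / 2; i++) {
            int tmp = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = tmp;
        }
        double idcg = dcg(ideal, k);
        return idcg == 0 ? 0 : dcg(rels, k) / idcg;
    }

    public static void main(String[] args) {
        int[] ranked = {2, 0, 1, 0, 3};          // grades of the retrieved ranking
        int[] judged = {3, 2, 1, 1, 0, 0, 0, 0}; // all judged grades for the topic
        System.out.printf("nDCG@5 = %.4f%n", ndcg(ranked, judged, 5));
    }
}
```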

The models LGD, PL2, and Language Modeling with Dirichlet smoothing (DLM) each contain one free parameter, while BM25 contains two. It is important to fine-tune the parameters of these models because they can affect retrieval effectiveness to a statistically significant degree. To obtain strong baselines, we use the optimum parameter values, i.e., those that attain the highest average retrieval effectiveness scores. Table 9 shows the ranges of free-parameter values explored during tuning. The optimum parameter values are as follows: for the Web Tracks (2009–2012) of ClueWeb09A, BM25 (\(k_1=1.0\), \(b=0.3\)), LGD (\(c=8.0\)), DLM (\(\mu =800\)), and PL2 (\(c=8.0\)); for the Million Query 2009 of ClueWeb09B, BM25 (\(k_1=1.6\), \(b=0.5\)), LGD (\(c=1.0\)), DLM (\(\mu =200\)), and PL2 (\(c=12.0\)).
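
The tuning procedure amounts to a one-dimensional sweep over the candidate values in Table 9, keeping the value with the highest mean nDCG@100. The sketch below shows that selection loop; the evaluate function (parameter value to mean nDCG@100 over the query set) is a hypothetical stand-in for an actual retrieval run, and the candidate grid shown is illustrative.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative one-dimensional parameter sweep; the scoring function is hypothetical. */
public class ParameterSweep {

    static double bestValue(double[] candidates, DoubleUnaryOperator evaluate) {
        double best = candidates[0];
        double bestScore = Double.NEGATIVE_INFINITY;
        for (double c : candidates) {
            double score = evaluate.applyAsDouble(c); // e.g. mean nDCG@100 with this value
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] muValues = {200, 400, 800, 1600, 3200}; // hypothetical DLM grid
        // Toy scoring function standing in for a full retrieval-plus-evaluation run.
        double tuned = bestValue(muValues, mu -> 0.3 - Math.abs(mu - 800) / 10000.0);
        System.out.println("best mu = " + tuned);
    }
}
```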

Table 9 Free-parameter values

Cormack et al. (2011) carried out the first systematic spam study of the ClueWeb09-English dataset and presented quantitative results on the impact of spam filtering on IR effectiveness. They reported that a substantial fraction of the ClueWeb09-English dataset consists of “spam” documents, spam in the sense of carrying no relevant information for any information need. They also reported that spam filtering significantly improves retrieval effectiveness for most of the systems that participated in the TREC 2009 Web Track. We use Cormack et al.'s fusion spam scores to exclude the spammiest \(t\%\) of documents from the result lists, where \(t \in [0,90]\) (in increments of 5). The spam threshold \(t\) that maximizes the mean nDCG@100 scores of the eight term-weighting models is 45% for the Web Tracks (2009–2012) and 10% for the Million Query 2009.
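
A minimal sketch of this filtering step follows. It assumes, consistently with how the fusion spam scores are commonly distributed but not spelled out above, that each document carries a percentile score in [0, 99] where lower percentiles indicate spammier documents; documents whose percentile falls below the threshold t are dropped from the ranked list. The document identifiers are placeholders.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative spam filtering of a ranked list using percentile spam scores. */
public class SpamFilter {

    static List<String> filter(List<String> rankedDocs,
                               Map<String, Integer> spamPercentile, int t) {
        return rankedDocs.stream()
                .filter(docId -> spamPercentile.getOrDefault(docId, 0) >= t)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("clueweb09-en0000-00-00001",
                                      "clueweb09-en0000-00-00002",
                                      "clueweb09-en0000-00-00003");
        Map<String, Integer> scores = Map.of(
                "clueweb09-en0000-00-00001", 80,
                "clueweb09-en0000-00-00002", 10,   // spammy document
                "clueweb09-en0000-00-00003", 55);
        System.out.println(filter(ranked, scores, 45)); // keeps documents 1 and 3
    }
}
```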

We use Apache Lucene (Białecki et al. 2012), an open-source search library written in Java, for indexing and searching. We ported several term-weighting model implementations from the Terrier retrieval platform (version 4.0; see footnote 9) to Lucene (version 7.4.0; see footnote 10). Over time, Lucene has become an industry standard, and its use in academic work has been gaining remarkable momentum (Azzopardi et al. 2017).
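
In Lucene 7.x, a ported term-weighting model is typically wired in by extending SimilarityBase, roughly as sketched below. The weighting formula used here, log2(1 + tf) · log2(1 + N/df), is a toy example for illustration only; it is not one of the eight models used in the article.

```java
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

/** Toy term-weighting model plugged into Lucene 7.x via SimilarityBase. */
public class ToyWeightingSimilarity extends SimilarityBase {

    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
        double n = stats.getNumberOfDocuments(); // N: documents in the collection
        double df = stats.getDocFreq();          // df: documents containing the term
        double tf = freq;                        // within-document term frequency
        return (float) (log2(1 + tf) * log2(1 + n / df));
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    @Override
    public String toString() {
        return "ToyWeighting";
    }
}
```

Such a similarity would typically be set on both the IndexWriterConfig and the IndexSearcher via setSimilarity(...), so that index-time length normalization and query-time scoring agree.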

We keep the preprocessing of documents and queries to a minimum: after case-folding, we apply KStemming (Krovetz 1993) and do not perform stop word removal, because stop words are essential for certain queries, such as “to be or not to be,” “the current,” “the wall,” “the who,” and “the sun.” The resulting preprocessing pipeline consists of Lucene's StandardTokenizer followed by LowerCaseFilter and KStemFilter.
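
Expressed as a Lucene (7.x) Analyzer, this pipeline looks like the following minimal sketch (a custom analyzer class written for illustration, not the article's actual class): StandardTokenizer, then LowerCaseFilter, then KStemFilter, with no stop word removal.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/** Minimal analyzer: StandardTokenizer + LowerCaseFilter + KStemFilter, no stop words. */
public class KStemAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source); // case-folding
        result = new KStemFilter(result);                 // Krovetz stemming
        return new TokenStreamComponents(source, result);
    }
}
```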

To split the available query set into training and test subsets, we employ the leave-one-out method, which is widely used for exhaustive cross-validation (Arlot and Celisse 2010). In this method, each query is in turn “left out” of the query set and used for testing, while the remaining queries are used for training. Given that only a limited number of queries is available, omitting each query in turn and using the remaining subset for training makes maximal use of the query set at hand, because only one query is omitted at each step. Furthermore, the procedure is deterministic, since no sampling is involved.
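
The splitting itself is straightforward, as the sketch below shows: each query index is held out once as the single test query and the remaining indices form the training set. The training step is deliberately left abstract here (the selective model would be fitted on the training indices and evaluated on the held-out query); the query count of 5 is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration of leave-one-out splitting over a query set. */
public class LeaveOneOut {

    public static void main(String[] args) {
        int numQueries = 5; // placeholder; 759 effective topics in the experiments
        for (int test = 0; test < numQueries; test++) {
            List<Integer> train = new ArrayList<>();
            for (int q = 0; q < numQueries; q++) {
                if (q != test) {
                    train.add(q); // every query except the held-out one
                }
            }
            System.out.println("test query " + test + ", training queries " + train);
        }
    }
}
```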

About this article

Cite this article

Arslan, A., Dinçer, B.T. A selective approach to index term weighting for robust information retrieval based on the frequency distributions of query terms. Inf Retrieval J 22, 543–569 (2019). https://doi.org/10.1007/s10791-018-9347-9

Keywords

  • Chi-square goodness-of-fit
  • Index term weighting
  • Robustness in retrieval effectiveness
  • Selective information retrieval