Abstract
Per-field normalisation has been shown to be effective for Web search tasks, e.g. named-page finding. However, per-field normalisation also suffers from having hyper-parameters to tune on a per-field basis. In this paper, we argue that the purpose of per-field normalisation is to adjust the linear relationship between field length and term frequency. We experiment with standard Web test collections, using three document fields, namely the body of the document, its title, and the anchor text of its incoming links. From our experiments, we find that across different collections, the linear correlation values, given by the optimised hyper-parameter settings, are proportional to the maximum negative linear correlation. Based on this observation, we devise an automatic method for setting the per-field normalisation hyper-parameter values without the use of relevance assessment for tuning. According to the evaluation results, this method is shown to be effective for the body and title fields. In addition, the difficulty in setting the per-field normalisation hyper-parameter for the anchor text field is explained.
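The relationship between field length and normalised term frequency described above can be illustrated with a small sketch. The abstract does not give the normalisation formula, so the code below assumes a DFR-style Normalisation 2 applied to one field, tfn = tf · log2(1 + c · avg_l / l), with hyper-parameter c. The function names and the synthetic data are hypothetical and are not taken from the paper. The sketch only shows how the linear (Pearson) correlation between field length and normalised term frequency can be measured for a given c; it is not the automatic setting method itself.

```python
import math
import random


def normalise_tf(tf, field_len, avg_field_len, c):
    """Assumed per-field Normalisation 2: tf * log2(1 + c * avg_l / l)."""
    return tf * math.log2(1.0 + c * avg_field_len / field_len)


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def length_tfn_correlation(tfs, lengths, c):
    """Correlation between field length and normalised tf for a given c."""
    avg_len = sum(lengths) / len(lengths)
    tfns = [normalise_tf(tf, l, avg_len, c) for tf, l in zip(tfs, lengths)]
    return pearson(lengths, tfns)


# Synthetic field statistics (illustrative only): raw tf grows with length.
rng = random.Random(0)
lengths = [rng.randint(50, 2000) for _ in range(500)]
tfs = [max(1, round(l * 0.01 * rng.uniform(0.5, 1.5))) for l in lengths]

# Sweep c and report the resulting length/tfn correlation for this field.
for c in (0.1, 1.0, 10.0, 100.0):
    print(c, round(length_tfn_correlation(tfs, lengths, c), 3))
```

Under these assumptions, sweeping c for each field (body, title, anchor text) yields a correlation curve per field, which is the quantity the abstract relates to the optimised hyper-parameter settings.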
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
He, B., Ounis, I. (2007). Setting Per-field Normalisation Hyper-parameters for the Named-Page Finding Search Task. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_42
DOI: https://doi.org/10.1007/978-3-540-71496-5_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71494-1
Online ISBN: 978-3-540-71496-5
eBook Packages: Computer Science, Computer Science (R0)